Logical Models and Basic Numeracy in Social Sciences

http://www.psych.ut.ee/stk/Beginners_Logical_Models.pdf

Rein Taagepera © 2015

REIN TAAGEPERA, professor emeritus at the University of California, Irvine, and at the University of Tartu, is the recipient of the Johan Skytte Prize in Political Science, 2008. He has 3 research articles in physics and over 120 in social sciences.

Table of Contents

Preface 8

A. Simple Models and Graphing

1. A Game with Serious Consequences 13 A guessing game. Skytte Prize 2008. An ignorance-based logical model. Exponents.

1a. Professionals Correct Their Own Mistakes 18 Means and median. Do not trust – try to verify by simple means. Professionals correct their own mistakes. But can this be so?

2. Pictures of Connections: How to Draw Graphs on Regular Scales 22 Example 1: Number of parties and cabinet duration. Constructing the framework. Placing data and theoretical curves on graphs. Making sense of the graph. Example 2: Linear patterns. How to measure the number of parties.

3. Science Walks on Two Legs: Observation and Thinking 29 Quantitatively predictive logical models. The invisible gorilla. Gravitation undetected. The gorilla moment for the number of seat-winning parties. What is “Basic Numeracy”?

4. The Largest Component: Between Mean and Total 36 Between mean and total. How to express absolute and relative differences. Directional and quantitative models. Connecting the connections. How long has the geometric mean been invoked in political science?

5. Forbidden and Allowed Regions: Logarithmic Scales 42 Regular scale and its snags. Logarithmic scale. When numbers multiply, their logarithms add. Graphing logarithms. Logarithms of numbers between 1 and 10 – and beyond.

6. Duration of Cabinets: The Number of Communication Channels 49 The number of communication channels among n actors. Average duration of governmental cabinets. Leap of faith: A major ingredient in model building. The basic rule of algebra: balancing. Laws and models.


7. How to Use Logarithmic Graph Paper 57 Even I can find log2! Placing simple integer values on logarithmic scale. Fully logarithmic or log-log graph paper. Slopes of straight lines on log-log paper. Semilog graphs. Why spend so much time on the cabinet duration data? Regular, semilog and log-log graphs – when to use which?

B. Some Basic Formats

8. Think Inside the Box – The Right Box 66 Always graph the data – and more than the data! Graph the equality line, if possible. Graph the conceptually allowed area – this is the right box to think in. The simplest curve joining the anchor points. Support for Democrats in US states: Problem. Support for Democrats in US states: Solution

9. Capitalism and Democracy in a Box 78 Support for democracy and capitalism: How can we get more out of this graph? Expanding on the democracy-capitalism box. Fitting with fixed exponent function Y=X^k. Why Y=X^k is simpler than y=a+bx? Fitting with fixed exponent function 1-Y=(1-X)^k. What is more basic: support or opposition? Logical formats and logical models. How can we know that data fits Y=X^k? A box with three anchor points: Seats and votes.

10. Science Means Connections Among Connections: Interlocking Relationships 89 Interlocking equations. Connections between constant values in relationships of similar form. Why linear fits lack connecting power. Many variables are interdependent, not “independent” or “dependent”.

11. Volatility: A Partly Open Box 96 The logical part of a coarse model for volatility. Make your models as simple as possible – but no simpler. Introducing an empirical note into the coarse model. Testing the model with data. Testing the model for logical consistency.

12. How to Test Models: Logical testing and Testing with Data 104 Logical testing. Testing with data. Many models become linear when logarithms are taken. What can we see in this graph? The tennis match between data and models. Why would the simplest forms prevail? What can we see in this graph? – A sample list.

13. Getting a Feel for Exponentials and Logarithms 113 Exponents. Fractional exponents of 10. Decimal logarithms. What are logarithms good for? Logarithms on other bases than 10.


14. When to Fit with What 118 Unbounded field – try linear fit. One quadrant allowed – try fixed exponent fit. Two quadrants allowed – try exponential fit. How to turn curves into straight lines. Calculating the parameters of fixed exponent equation in a single quadrant. Calculating the parameters of exponential equation in two quadrants. Constraints within quadrants: Two kinds of “drawn-out S” curves.

C. Interaction of Logical Models and Statistical Approaches

15. The Basics of Linear Regression and Correlation Coefficient R² 129 Regression of y on x. Reverse regression of x on y. Directionality of the two OLS lines: A tall twin’s twin tends to be shorter than her twin. Nontransitivity of OLS regression. Correlation coefficient R². The R² measures lack of scatter: But scatter along which line?

16. Symmetric Regression and its Relationship to R² 142 From minimizing the sum of squares to minimizing the sum of rectangles. How R-squared connects with the slopes of regression lines. EXTRA: The mathematics of R² and the slopes of regression lines.

17. When is Linear Fit Justified? 149 Many data clouds do not resemble ellipses. Grossly different patterns can lead to the same regression lines and R2. Sensitivity to outliers. Empirical configuration and logical constraints.

18. Federalism in a Box 157 Constitutional rigidity and judicial review. Degree of federalism and central bank independence. Bicameralism and degree of federalism. Conversion to scale 0 to 1.

19. The Importance of Slopes in Model Building 167 Notation for slopes. Equation for the slope of a parabola – and for y=x^k. Cube root law of assembly sizes: Minimizing communication load. Exponential growth as an ignorance-based model: Slope proportional to size. Simple logistic model: Stunted exponential. How slopes combine – evidence for the slope of y=x^k. Equations and constraints for some basic models.

D. Further Examples and Tools

20. Interest Pluralism and the Number of Parties: Exponential Fit 181 Interest group pluralism. Fitting an exponential curve to interest group pluralism data. The slope of the exponential equation. Why not fit with fixed exponent format? EXTRA 1: Why not a fit with fixed exponent? EXTRA 2: Electoral disproportionality.


21. Moderate Districts, Extreme Representatives: Competing Models 192 Graph more than the data. A model based on smooth fit of data to anchor points. A kinky model based on political polarization. Comparing the smooth and kinky models. The envelopes of the data cloud. Why are representatives more extreme than their districts?

22. Centrist Voters, Leftist Elites: Bias Within the Box 202

23. Medians and Geometric Means 207 A two-humped camel’s median hump is a valley. Arithmetic mean and normal distribution. Geometric mean and lognormal distribution. The median is harder to handle than the means. The sticky case of almost lognormal distributions. How conceptual ranges, means, and forms of relationships are connected.

24. Fermi’s Piano Tuners: “Exact” Science and Approximations 215 As exact as possible – and as needed. How many piano tuners? The range of possible error. Dimensional consistency.

25. Examples of Models Across Social Sciences 220 Sociology: How many journals for speakers of a language? Political history: Growth of empires. Demography: World population growth over one million years. Economics: Trade/GDP ratio.

26. Comparing Models 229 An attempt at classifying quantitatively predictive logical models and formats. Some distinctions that may create more quandaries than they help to solve. Ignorance, except for constraints. Normal distribution – full ignorance. Lognormal distribution – a single constraint. The mean of the limits – two constraints. The dual source of exponential format: Constant rate of change, and constraint on range. Simple logistic change: Zone in two quadrants. Limits on ranges of two interrelated factors: Fixed exponent format. Zone in one quadrant: Anchor point plus floor or ceiling lead to exponential change. Box in one quadrant, plus two anchor points. Box in one quadrant, plus three anchor points. Communication channels and their consequences. Population and trade models. Substantive models: Look for connections among connections.

Appendix A: What to Look for and Report in Multivariable Linear Regression 241 Making use of published multi-variable regression tables: A simple example. Guarding against collinearity. Running a multi-variable linear regression. Processing data prior to exploratory regression. Running exploratory regression. Lumping less significant variables: The need to report all medians and means. Re-running exploratory regression with fewer variables. Report the domains of all variables! Graphing the predictions of the regression equation against the actual outputs. Model-testing regression. Substantive vs. statistical significance.


Appendix B: Some Exam Questions 255

Appendix C: An Alternative Introduction to Logical Models: Basic “Graphacy” in Social Science Model Building 267

References 296

Preface

This is a hands-on book that aims to add meaning to statistical approaches, which often are the only quantitative methodology social science students receive. The scientific method includes stages where observation and statistical data fitting do not suffice – broader logical thinking is required. This book offers practical examples of logical model building, along with basic numeracy needed for this purpose. It also connects these skills to statistical approaches.

I have used drafts of this book for many years with undergraduate, masters’, and doctoral students at the University of California, Irvine and the University of Tartu, Estonia. The topics presented can be largely addressed in one semester (30 hours of instruction). Ideally, this book should be taught to beginning college students who still remember their secondary school mathematics. My dream is a yearlong course that begins with half a semester using this book (up to Chapter 16), continues with two half-semesters on statistics, and concludes with the other half of this book. The idea is to start with logical thinking, leading to transformation of data before applying statistics, and to conclude again with logical thinking, which adds meaning to computer print-outs. Those students who prefer a more systematic presentation of models might read Chapter 26 first.

A previous book, Making Social Sciences More Scientific: The Need for Predictive Models (Taagepera 2008), showed why excessive dependence on push-button statistical analysis and satisfaction with merely directional hypotheses hamper social sciences. The intended readers were those social scientists who have already been exposed to statistics. They need not only more skills in logical model construction but also some deprogramming: reducing overdependence on computer programs. Hence, Making… had its polemical parts, pinning down the limitations of narrowly descriptive “empirical models”.

The undergraduates who have not been exposed to statistics have nothing to unlearn. Their basic algebra is relatively fresh, even while quite a few need help with basic numeracy. Here one can introduce the beauty of logical models without having to fight the excesses in the use of statistics. The present book is also more hands-on. Indeed, one may see some merit in logical model building, yet find it hard to proceed from philosophical acceptance to actual application. How to begin? This is often the hardest part. One needs examples on which to practice.

Section A, “Simple Models and Graphing”, introduces simple but powerful devices for constructing quantitatively predictive models. Basic mathematical skills such as graphing, geometric means, and logarithms have to be refreshed and upgraded from memorization to an operational level. This is what basic numeracy means. My introduction to logarithms may look simplistic. But I meet too many people who say they have forgotten all they once knew about logarithms. Forgetting logarithms is as impossible as forgetting how to ride a bike – if one ever has truly understood them, rather than merely cramming formulas. So I try to make the logarithms truly understood. The students are more motivated to make the effort to understand when they know that this time it is not math for math’s sake but something that cannot be avoided, if we want to do meaningful social science.
Section B, “Basic Formats”, develops the crucial ability of knowing “when to fit with what” – knowing when linear, fixed exponent or exponential data fit would be the most promising, as a first step. The fixed exponent format is stressed, with sufficient repetition so that students may advance beyond the point of using it merely upon command, like trained dogs. The goal is to build up sufficient understanding and confidence so that students would recognize opportunities to use their skills in further studies of social phenomena. I introduce examples from published social science literature, by scholars whom I highly respect. Even in the best that social sciences offer, modeling and data analysis can at times be expanded on, for maximal payoff. I thank the authors of these examples for their kind permission to make use of their work.

Section C, “Interaction of Logical Models and Statistical Approaches”, makes the connection to statistical approaches, including symmetric regression, so as to develop the ability to use them in proper conjunction with model building. These topics demand some further mathematics, including the use of “slopes”, my less threatening way to slip in elements of differential equations. Some aspects of Sections B and C have been published separately as “Adding Meaning to Regression” (Taagepera 2010).

Section D presents “Further Examples and Tools”. In particular, it introduces exponential functions somewhat more systematically. However, my experience is that the ability truly to use exponentials requires more than a few weeks of instruction. Appendix A offers advice about improved multivariable regression. This tends to be lost on students who have not reached this topic in statistics. Appendix B presents some exam questions.

As one reads the chapter titles, no logical sequence may seem to emerge. Why isn’t there one section called Basic Numeracy, followed by sections on different types of formats and models? The trouble is that a sequence that makes sense logically can be inefficient pedagogically. Cramming all basic numeracy into the beginning is a recipe for overwhelming students with methodology, the usefulness of which as yet escapes them. To the contrary, I introduce bits of math only when one can no longer proceed without them. This approach increases student motivation and allows the new skills to sink in before something else begs for attention.

I put much emphasis on graphical representation, both of data and also more than the data (meaning logical constraints). Social science papers all too often present data (or merely statistical analysis without data!) without any attempt at graphing – or presenting only bar graphs, those half-brothers of data tables. Yet, in all too many cases, the moment y is graphed against x, configurations jump to the eye that tell us that standard statistical approaches would be pointless unless the data are thoughtfully transformed. Ill-applied statistical analysis does not advance our knowledge of the social world. Using graphs, I gradually reinforce students’ ability to use formats such as exponential and fixed exponent – and the ability to recognize when to use such formats! The more I teach, the more I become aware of small but essential tricks of the trade that I have taken for granted but are not self-evident to students and so have to be taught explicitly.

Logical model construction makes use of usual mathematical tools, but it requires true understanding of those tools, beyond automatic application of formulas or push-button computation. This is why I stress the “primitive” stages in logarithms etc., which involve thinking, and which students often overlook or forget in the rush towards ever more complex equations. The emphasis is on basic understanding.

Exercises are spiked throughout the text. Asterisks indicate those I have often assigned as homework to be turned in and corrected, in 9 batches: Exercises 2.1, 2.3 / 4.2, 5.2 / 6.3, 7.1, 7.2 / 7.4, 8.1 / 9.3, 9.4 / 10.1, 10.2 / 14.1, 16.2 / 17.2, 18.1, 18.2 / 19.2, 22.1. I do have a set of brief answers for a part of the exercises, but I am reluctant to distribute them electronically.

I have been asked about possible follow-up courses. Almost any of the usual courses on quantitative methods would profit from the basic skills developed here. A special logical models seminar could be based on Making Social Sciences More Scientific (Taagepera 2008), encouraging students to apply model construction to topics of their own choice. This has been successful with doctoral students at the University of Tartu.

Many students at the University of Tartu and at the University of California, Irvine have contributed to the present draft, by their questions and also by their mistakes when solving the problems presented in exercises. These mistakes showed me where my wordings were imprecise or more intermediary steps were needed. Many colleagues also have helped, wittingly and unwittingly. Rein Murakas in particular has contributed to formatting. The long list of people acknowledged in the Preface of Making… could well be repeated and expanded on. Special thanks go to colleagues who graciously agreed to have their graphs subjected to further analysis: , Russell J. Dalton along with Doh Chull Shin, and Richard Johnston along with Michael G. Hagen and Kathleen H. Jamieson. I still have to ask and for permission to include one of their graphs in an exercise.

Rein Taagepera [email protected]


A. Simple Models and Graphing

1. A Game with Serious Consequences

• In the absence of any further information, the mean of the conceptual extreme values is often our best guess.
• The geometric mean is often preferable to the arithmetic mean. For the geometric mean of two numbers, multiply them together and take the square root.
• “In the absence of any further information” is a term we will encounter often in this book. This is the key for building parsimonious logical models. We may call it the “ignorance-based approach”.

What are “logical models”? What is “numeracy”? Let us start with a game – a game of guessing. Suppose a representative assembly has one hundred seats, and they are allocated nationwide. Suppose some proportional representation rule is used. This means that even a party with 1% of the votes is assured a seat. The question is: How many parties would you expect to win seats, on the average?

A guessing game

Think about it for a moment. I repeat:

How many parties would you expect to win at least one seat, out of the one hundred seats in the assembly? Would you guess at 2 parties, 5, 10, 20, or 50 parties?

You may protest that you cannot possibly know how many parties get seats, because I have not given you any information on how the votes are distributed among parties. If this is the way you think, you are in good company. For decades, I was stuck at this point. Now suppose I told you two hundred parties would get seats. Would you buy that? Most likely, you wouldn’t. You’d protest that this could not be so when only one hundred seats are available. Fair enough, so what is the upper limit that is still logically possible? It’s one hundred. This is not likely to happen, but in principle, 100 parties could win one seat each. This is the upper limit. Let’s now consider the opposite extreme. What is the lower limit? It’s 1. This is not likely to happen either, but in principle, one party could win all 100 seats.

So you did have some information, after all – you knew the lower and upper limits, beyond which the answer cannot lie, on logical grounds. At this point, make a guess in the range 1 to 100 and write it down. Call this number n:

n = the number of parties.

Now, let us proceed systematically. When such conceptual limits are given, our best guess would be halfway between the limits. In the absence of any further information, nothing else but the mean of the limits could be justified. However, there are many kinds of means. The good old arithmetic mean of 1 and 100 would be around 50: (1+100)/2 = 50.5. But having 50 parties getting seats would mean that on the average they would win only two seats each. Only two seats per party? This might look rather low. If so, then let’s ask which number would not look too low. We now have a new question:

How many seats would we expect the average party to win when 100 seats are available? How would we proceed now? Yes, we should again think about conceptual limits. The average number of seats per party must be at least 1 (when every party wins only one seat) and at most 100 (when one party wins all the seats). At this point, make a guess in the range 1 to 100 and write it down. Call this number s:

s = mean number of seats per party.

Now, if we really think that n parties win an average of s seats each, then the total number of seats (T) must be the product n times s:

T=ns.

Here ns must be 100. Calculate this product for the two guesses we have written down. If this product isn’t 100, our two guesses do not fit and must be adjusted. In particular, arithmetic means do not fit. Indeed, 50 parties winning 50 seats each would require 2500 seats – way above the 100 we started with! This approach clearly does not work. If the product of your two guesses came out as 100 seats, congratulations – your guesses are mutually consistent. But these guesses might still be less than optimal. Suppose someone guessed at 5 parties winning an average of 20 seats each, while someone else guessed at 20 parties winning an average of 5 seats each. What would be the justification for assuming that there are more parties than seats per party – or vice versa?


In the absence of any further information on which way the tilt goes, the neutral assumption is that the two are equal. This means 10 parties winning an average of 10 seats each. This is what we call the geometric mean. The geometric mean of two numbers is the square root of their product. This means that the square of the geometric mean equals the product of these two numbers. In the present case, 10 times 10 is the same as 1 times 100. We’ll see later on (Chapter 23) why we should use the geometric mean (rather than the arithmetic) whenever we deal with quantities that logically cannot go negative. This is certainly the case for numbers of parties or seats.

Do we have data to test the guess that 10 parties might win seats? Yes, from 1918 to 1952 The Netherlands did have a first chamber of 100 seats, allocated on the basis of nationwide vote shares. Over these 9 elections the number of seat-winning parties ranged widely, from 8 up to as many as 17. But the geometric mean was 10.3 parties, with an average of 9.7 seats per party. This is pretty close to 10 parties with an average of 10 seats. As you see, we could make a prediction with much less information than you may have thought necessary. And this approach actually worked! This is an example of an ignorance-based logical model. It is based on what may look like nearly complete ignorance. All we knew were the conceptual limits 1 and 100.

Mind you, we were lucky that The Netherlands fitted so well. But suppose the actual average had been 8 parties. By guessing 10, we would still have been within 20%. This is much better than lamely saying: “I don’t know” and giving up.

Skytte Prize 2008

Why have I dwelled so long on this simple guessing game? In 2008, I received the Skytte Prize, one of the highest in political science. And I basically received it for this guessing game! Oh well, lots of further work and useful results followed, but the breakthrough moment did come around 1990, when I was puzzled about the number of seat-winning parties and suddenly told myself: Consider the mean of the extremes. Using this approach twice enabled me to calculate the number of parties in the entire representative assembly when a country allocates assembly seats in many districts. All I needed was assembly size and the number of seats allocated in the average electoral district. In turn, the number of parties could be used to determine the average duration of governmental cabinets. (This model will be tested with data in the next chapter.) Here the logical model is quite different from the previous – we’ll come to that (Chapter 6). The overall effect is that we can design for a desired cabinet duration by manipulating the assembly size and the number of seats allocated in the average district. This is of practical use, even while the range of error is as yet quite large.

An ignorance-based logical model

We can generalize beyond the limits 1 and 100. This is what science is about: making broad connections. If something can conceptually range only from 1 to T, then our best guess is the geometric mean of 1 and T, which is the square root of T. This square root can also be written as T^1/2 or T^0.5. Thus, the number of seat-winning parties can be expressed as an equation: n = T^1/2.

Exponents

Can you see why the square root of a number T can be written as T^1/2? The square root of T is a number such that its square is T. Now try to square T^1/2. We get (T^1/2)^2, which is T^1 = T. If your grasp of exponents (like ½ in T^1/2) is shaky, look up Chapter 13. We will have to use exponents over and over.

Even more broadly, when the conceptual limits are minimum m and maximum M, both positive, then our best guess is the geometric mean of m and M:

Best guess between limits m and M: g = (mM)^1/2.

This is an ignorance-based logical model. It is based on what may look like nearly complete ignorance. It asks what we can infer “in the absence of any further information.” This is a term we will encounter often in this book. This is the key for building logical models that are “parsimonious”. This means they include as few inputs as possible. What if you don’t know how to calculate this mysterious (mM)^1/2? Calculate the product mM, and then use the three bears’ approach in Exercise 1.1 below. These exercises are interspersed within the text. They may at times look irrelevant to the topic at hand, yet the basic approach is similar. This way, they illustrate the wide scope of the particular method.
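To make the arithmetic concrete, here is a minimal sketch (in Python; not part of the original text) of the ignorance-based best guess between two conceptual limits. The bracketing loop mimics the three bears’ muddling-through for the case where no square-root button is at hand.

```python
def best_guess(m, M):
    """Ignorance-based best guess between conceptual limits m and M:
    the geometric mean g = (m*M)**(1/2)."""
    return (m * M) ** 0.5

def bracket_square_root(x):
    """Three bears' style bracketing: find the integer n such that
    n*n <= x < (n+1)*(n+1), without any square-root button."""
    n = 1
    while (n + 1) * (n + 1) <= x:
        n += 1
    return n

# The 100-seat assembly: limits 1 and 100 give the best guess of 10 parties.
print(best_guess(1, 100))        # 10.0
print(bracket_square_root(100))  # 10
```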


Bypass these exercises at your own risk – the risk of thinking that you have understood the text while actually you have not. Do not just read them; do them! Without such understanding the following chapters may become increasingly obscure. The exercises marked with * are especially important.

Exercise 1.1 The three bears tried to figure out the square root of 2013. Papa Bear offered 100. But 100 times 100 is 10,000 – that’s too much. Mama Bear offered 10. But 10×10=100 – that’s much too little. Little Bear then proposed 50. Well, 50×50=2,500 is getting closer. a) Without using the square root button of a pocket calculator, help the bears to determine this square root within ±1. This means finding a number n such that n² is less than 2013, while (n+1)² is already larger than 2013. b) What do you think is the broad purpose of this exercise? Write down your guess, and then compare with suggestions toward the end of this chapter.

Exercise 1.2 As I was completing secondary school in Marrakech, in 1953, the first Moroccan uprisings against the French “protectorate” took place in Casablanca. The French-controlled newspapers wrote that 40 people were killed. Our Moroccan servant, however, reported rumors that several thousand were. Take this to mean around 4,000. My friend Jacques, with family ties in high military circles, asked me to guess how many people actually were killed, according to classified army reports. a) Which estimate did I offer, in the absence of any further information? Write down your reasoned guess. b) What do you think is the broad purpose of this exercise?

Exercise 1.3 There is an animal they call lmysh in Marrakech. What is your best guess at roughly how much a lmysh weighs? The only information is that lmysh is a mammal. The smallest mammal is a shrew (about 3 grams), and the largest is a blue whale (30 metric tons). a) Convert the weights of shrews and blue whales from grams and tons into kilograms. 1 ton = 1000 kg. 1 gram = 1/1000 kg. b) Estimate the weight of lmysh. c) What do you think is the broad purpose of this exercise?


Each of these exercises also asks you about the broad purpose of this exercise. The broadest purpose is to give you a feeling of empowerment. Yes, you can figure out many things approximately, and “approximate” is often sufficient. Dare to muddle through! If your secondary school math included the exact procedure for calculating the square root, then you might have received a bad grade if you used the three bears’ approach, even if you got pretty close to the correct answer. What is the long-term outcome of the school approach? Probably: “I have forgotten how to calculate the square root of a number.” But with the three bears’ approach, can you ever forget it? The purpose of Exercise 1.1 is not merely to calculate the square root of a number. It is to encourage relaxed muddling through, more broadly. And to encourage you to ask, whatever your further employment: Why is the boss asking me to do this? It is advisable to ask this question about every exercise in this book, even when its text does not pose it explicitly: What is the broad purpose of this exercise?

1a. Professionals Correct Their Own Mistakes

• Never believe your answers without asking: “Does it make sense?” Do not offer an answer you don’t believe in, without hedging.
• The arithmetic mean adds; the geometric mean multiplies. Use the geometric mean when all components are positive and some are much larger than some others.

In school, we offer answers to instructors who presumably already know the correct answer. In later life, this rarely is so. Our boss asks us to find out about something precisely because she does not know. If our answer is wrong, there is no one to correct us. But decisions will be made, partly based on our work, and there will be consequences. Too many wrong answers will catch up with us.

Means and median

First, some basics about the means need to be reviewed. In our guessing game we had to use various means for two numbers. We’ll need them later on, too – and for more than two numbers. So let us clarify some terms. As an example, take the numbers 2, 3, 5, 10 and 19. The median is the point where half the values are smaller and half are larger. For 2, 3, 5, 10 and 19, it is 5. (If we had 2, 3, 5, 6, 10 and 19, the median would be (5+6)/2=5.5.) This is what we are after, most often. But the median is awkward to handle (see Chapter 23), so instead, we try to use various means.

The arithmetic mean of n values adds the values, then divides by n. For the 5 numbers above, it is (2+3+5+10+19)/5 = 39/5 = 7.8. In other words, 7.8+7.8+7.8+7.8+7.8 and 2+3+5+10+19 are equal.

The geometric mean, in contrast, multiplies the n values and then takes the nth root – which is the same as exponent or “power” 1/n. For the 5 numbers above, (2×3×5×10×19)^1/5 = 5,700^1/5 = 5.64. How did I get that? On a typical pocket calculator, calculate the product 2×3×5×10×19, then push the key “y^x”, enter “5”, then push “1/x” and “=”. In other words, if 5.64 is the geometric mean of 2, 3, 5, 10 and 19, then 5.64×5.64×5.64×5.64×5.64 and 2×3×5×10×19 are equal.

When do we use which mean? Broadly speaking, use the geometric mean when all components are positive and some are much larger than some others (like 1, 10 and 100). Why? Because in such cases the median is closer to the geometric mean. Otherwise, use the arithmetic mean. In the example above, the arithmetic mean (7.8) is higher than the geometric mean (5.64). Indeed, the arithmetic mean is never smaller than the geometric and can be much larger. It was previously said that, over 9 elections in The Netherlands, the number of seat-winning parties ranged from 8 to 17, but their geometric mean was 10.3 parties. Now it becomes clearer what this meant. The arithmetic mean would be somewhat higher than the geometric.
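As a quick check on these numbers, here is a short sketch (Python, assuming version 3.8 or later for statistics.geometric_mean) computing the three quantities for the example above.

```python
import statistics

values = [2, 3, 5, 10, 19]

# Median: half the values are smaller, half are larger.
print(statistics.median(values))          # 5

# Arithmetic mean: add the values, then divide by their number.
print(statistics.mean(values))            # 7.8

# Geometric mean: multiply the values, then take the n-th root.
print(statistics.geometric_mean(values))  # about 5.64
```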

Do not trust – try to verify by simple means

This business of “5,700^1/5” in the calculations above may well be new to you. The instructions on how to calculate it on a pocket calculator enable you to carry it out, but they do not help you in understanding what it means. Try to find ways to check whether the outcome makes sense. Multiply 5.64 five times by itself and see if you get back 5,700. Grab your pocket calculator and do it right now! You actually get 5,707 – this is close enough. What if your pocket calculator does not have a key “y^x”? You can always use the three bears’ approach (Exercise 1.1). If 5×5×5×5×5 = 3,125 is a bit too little and 6×6×6×6×6 = 7,776 is much too much, compared to 5,700, then try something between 5 and 6. For most purposes, you don’t have to be “exact”. Even 5.64 might be more than you need – 5.6 might well suffice.

Professionals Correct Their Own Mistakes

“We correct our own mistakes. This is what tells us apart from the lab assistants,” Jack Ballou, my first boss at the DuPont Company Pioneering Lab, told me. “See our helper Bill. He’s real good, one of the best lab assistants we have. But when he makes mistakes in his calculations, he accepts the results. Professionals also make mistakes, but they sense when something is wrong. So it’s part of our job to catch our mistakes as well as those of people like Bill.”

How do they do it? They never take a numerical result for granted. They always ask: Does this answer make sense? If it does not, a red warning light should start blinking in our minds: Perhaps we have made a mistake. Never believe your answers without checking! After completing a more complex calculation, repeat it with approximate numbers. Suppose we find that the division 8.4÷2.1 yields 6.3. Does it make sense? Redo it approximately: 8÷2 is 4. In comparison, 6.3 looks much too large, so something is fishy. Sure enough: Instead of the division button on the calculator we have pushed the subtraction button. This happens.

Sometimes closing-the-loop is possible. Take again the division 8.4÷2.1. If our answer comes out as 6.3, do mentally the approximate inverse operation – the multiplication: 6.3×2.1 ≈ 6×2 = 12. (This “≈” means “approximately equal”.) This 12 is far away from 8.4, so again 6.3 looks suspect. At other times, the magnitude just does not make sense. Suppose we calculate the “effective number of parties” (whatever that means) in a country and get 53. How often do we see a country with 50 parties? Maybe we have misplaced a decimal, and it should be 5.3 instead?

But can this be so?

Every professional develops her/his own small tricks to check on calculations – and also on qualitative errors of reasoning! The common thread is that they draw in some comparison with something else and then ask: “But can this be so?” This is what my mother asked me when I was 4. We played at multiplying by 3: twice 3, 3 times 3, and so on. Suddenly, she jumped way ahead and asked: “How much is 11 times 3?” This was way out of my range. I blurted out 28. She did not correct me. She did not tell me the right answer – and this is the crux of the story. Instead, she asked: “How much is 10 times 3?” This was easy. I already knew I only had to add a zero: 30. And then she asked the all-important question: “But how can 11 times 3 be smaller than 10 times 3?” This is when I became a scientist. I discovered the power of comparing things and asking “But can this be so?”

Never believe your answers without asking: “Does it make sense?” Never offer your professor or your boss an answer you don’t believe in, without hedging. If you cannot find a mistake, but the answer looks suspect for whatever reason, tell her so. Imagine your boss making a decision based on your erroneous number. She may lose money, and you may lose your job. This does not mean that common sense is always the supreme arbiter. Many physics laws look counterintuitive, at first. Newton’s First Law says that a body not subject to a force continues to move forever at constant velocity. OK, push a pencil across the table, and then stop pushing. The pencil soon stops – no movement at constant speed! One has to invoke a “frictional force” stopping the movement, and at first it may look like a pretty lame way out, to save Newton’s skin. Only gradually, as you consider and test various situations, does it become evident that it really makes sense to consider friction a force – a passive one. In social science, too, some valid results may look counterintuitive. Don’t reject everything that does not seem to make sense, but ask the question and double-check for possible mistakes.

2. Pictures of Connections: How to Draw Graphs on Regular Scales

• Most of what follows in this book involves some graphing. The deeper our understanding of graphing is, the more we will be able to understand model construction and testing.
• The “effective number” of components such as seat shares of parties is N = 1/Σ(pi²), where pi is the fractional share of the i-th component and the symbol Σ (sigma) stands for SUM.

Science is largely about making connections between things that can vary, so that, when knowing the value of one variable, one can deduce the value of the other. While logical models most often take the shape of equations (such as p=M^1/2), it is often easier to visualize them as graphs. Hence, most of what follows in this book involves some graphing. This is why graphing is introduced here early on, so that possible mistakes can be corrected as soon as possible. The deeper your understanding of graphing is, the more you will be able to understand model construction and testing. We’ll go slowly. Indeed, we’ll go so slow that you may feel like bypassing the first half of this chapter, thinking that you already know all that. Don’t bypass it. While reading, try to tell apart things you know, tidbits that are new and may come handy, and – things you have been doing while not quite knowing why. (I myself found some of those while writing this chapter!)

Construct the graphs by hand – NOT by computer. This way you learn more about the problem at hand, and you do not have to fight the peculiarities of computer programs. Computer-drawn graphs have their proper place, but you have to do enough graphing by hand before you can become the master of computer programs rather than a slave to their quirks and rigidities.

Example 1: Number of parties and cabinet duration

Suppose you are given some data for the effective number of parties in the assembly (N) and the duration of government cabinets (C) – Table 2.1. (The “effective number” of parties is defined at the end of the chapter.) What can we tell about the relationship between C and N when looking at this table? Pretty little. We can see much more when showing them as a picture – a graph, that is. This is the task in Exercise 2.1.

Make sure you heed the advice in the sections below (Constructing the framework, Placing data and theoretical curves, and Making sense of the graph).

* Exercise 2.1 a) Graph data in Table 2.1, C vs. N. b) The number of parties cannot be less than 1. So draw a vertical line at N=1 and mark its left side by shading or slanted lines – this is a “conceptually forbidden area”. c) Draw the trend curve, meaning a smooth curve passing in between the data points. d) Superimposed on these data points and trend curve (not separately!), also graph the logical model C=42 years/N² (which will be presented soon). e) Compare data, trend curve and model. Does this model express the general trend of the data points? f) Also graph the curves C=21 years/N² and C=84 years/N² (again superimposed, not separately!). Describe their locations compared to that of curve C=42 years/N² and data points, with special attention to Greece and Botswana. g) Try to draw some conclusions: What is the purpose of this exercise, besides practicing basic graphing?

Table 2.1. Number of parties and cabinet duration. This is a representative sample of 35 countries tested.

Country    N     C (yrs.)
Botswana   1.35  40
Bahamas    1.7   14.9
Greece     2.2   4.9
Colombia   3.3   4.7
Finland    5.0   1.3

Exercise 2.2 a) Continuing the previous exercise, calculate the arithmetic mean for the N values of the 5 countries, and similarly for their C values. Enter this point (mean N, mean C) on the graph. Where is it located relative to the curve C=42 years/N²? How come that the individual countries fit the model better than their mean? (You may not be able to answer this question, but give it a try.)


b) Now calculate the geometric means for N and C of these countries and enter this mean point on the graph. Where is it located relative to the curve C=42/N²? Compare the locations of arithmetic and geometric means.

Constructing the framework

Try to be precise – otherwise important conclusions may be missed. Precision includes the following.

A! Get some square paper (or special graph paper with small squares within larger ones). Make the graphs at least the size of a regular half-page. Why? Tiny freehand sketches on blank paper are not precise enough for drawing useful conclusions. (The graphs you see in books are mostly reduced from larger originals.) Until they receive a bad grade, one-third of students tend to overlook this requirement.

B! Graph one variable on the x-axis and another on the y-axis. Do NOT use the so-called bar graph format. Why not? Bar graphs hardly convey more information than a table. They do not show how one variable relates to another.

C! By convention, “C vs. N” means that C is on the vertical “y-axis” and N is on the horizontal “x-axis”. Do not reverse them. Why not? This is like the convention to drive on the right side of the road – it avoids confusion. About one-tenth of students tend to overlook this rule, at first.

D! Inspect the data table to see how much space you need. In Table 2.1, the y scale must accommodate values from 0 to close to 50 years, to fit Botswana and Finland. The x scale must accommodate values from 0 to 5, but better include the range 0 to 6, to be on the safe side.

E! On both scales, mark locations at equal distances. For C in Exercise 2.1, one could indicate 0, 10, … 40, 50. For N, 0, 1, … 5, 6 impose themselves. DO NOT laboriously write in 0, 1, 2, … 50 on the y scale. Why not? Such crowding blurs the picture.

F! Make sure that intervals between these main divisions include 2, 5 or 10 squares of the squared paper – then you can easily place the data points precisely. Do NOT use intervals of 3 or 7 squares. Why not? Just try to place 1.6 precisely, on a scale with 7 squares between “1” and “2”!


G! DO include the point (0,0) in the graph, if at all possible, and DO label it on both axes with “0” just as you do for “1” or “2”. Zeros are numbers too, and you hurt their feelings when you omit them. (Indeed, arithmetic really got going only when the Hindus invented a symbol for “nothing”, something that eluded the “practical” Romans.)

H! NEVER use unequal interval lengths for equal distances, just because some intervals have data points and others don’t. (One does encounter published population graphs where the dates 1950, 1980, 2000, 2005, 2006 and 2007 are shown at equal intervals. Population growth seems to slow down, even when it actually does not. This is lying with graphs, intentional or not.)

I! Label the axes. In the present case show “Number of parties (N)” along the x axis, and “Cabinet Duration, years (C)” along the y axis. Do not forget “years” – all durations would be 12 times larger if months were used!

J! When drawing straight lines (such as y and x axes), use a ruler – do not do it freehand. You don’t have a ruler handy? Of course you do. Use the edge of a book or writing pad.

Placing data points and theoretical curves on graphs

K! Indicate data points by small dots at the precise location, complemented by a larger mark around it – an “O” or an “X”, etc. Do NOT use just a small dot, which looks like an accidental speck, or a blob the center of which is fuzzy.

L! Do NOT connect the data points with a zigzag line. Instead, show the general trend, a smooth curve or straight line passing in the middle of the cloud of data points. Why? There is “noise” – random divergence of data points from the general trend. What interests us in the first place is this general trend. One-third of the students tend to play such a game of “connect-the-dots”. I understand – I remember how I did so in my first high school physics lab.

M! For a theoretical curve such as C=42 years/N², calculate C for simple values of N such as 1, 2, 4, 6 (a short calculation sketch follows this list). Mark them on the graph with tiny symbols, quite different from those for data points. Draw a smooth curve through them. Do NOT make life hard for yourself by calculating C at N=1.35 or 1.7 just because Botswana and Bahamas happen to have these values. But you might need to calculate C at N=1.5 because the curve drops steeply between N=1 and N=2.

N! Before drawing the curve, turn your graph on the side so that the y scale is on top. Your natural wrist movement readily allows you to join the theoretical points smoothly.
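For point M above: if you want to double-check the arithmetic before marking the theoretical points by hand, a sketch along these lines will do (Python; the graph itself is still drawn by hand, as instructed earlier).

```python
# Theoretical curve from the model: C = 42 years / N**2.
def cabinet_duration(n_parties):
    return 42.0 / n_parties ** 2

# Simple values of N, plus N = 1.5 where the curve drops steeply.
for n in [1, 1.5, 2, 4, 6]:
    print(f"N = {n}: C = {cabinet_duration(n):.1f} years")
# N = 1 gives 42.0, N = 1.5 about 18.7, N = 2 gives 10.5,
# N = 4 about 2.6, N = 6 about 1.2 years.
```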

Making sense of the graph

We have constructed the graph. Now comes the main part: Making sense of it. In the present case, the main question is: Does the cloud of data points agree with the theoretical curve (the model)? When can we say that this is the case? First, there should be a roughly equal number of points above and below the curve. If most data points lie above the theoretical curve, the model cannot be said to fit. But even more is needed. There should be a roughly equal number of points above and below the curve at both extremes of the curve. If all data points were too high at one end and too low at the other, then some other curve would fit better than the curve proposed. This is coarse advice, but fairly foolproof. In the actual case in Exercise 2.1, some points are above and some below the theoretical curve – so the model may well hold, with some random scatter. But a mere 5 data points usually do not suffice to test a model.

Example 2: Linear patterns

Now consider the data in Table 2.2, which applies to the previous 5 countries. What x and y stand for will be explained later on. What can we tell about the relationship between x and y?

Table 2.2. Some data on 5 countries.

Country    x      y
Botswana   0.130  1.602
Bahamas    0.230  1.173
Greece     0.342  0.690
Colombia   0.519  0.672
Finland    0.699  0.114


* Exercise 2.3 a) Graph y vs. x for countries in Table 2.2. Use the same scale on both axes (i.e., same length for distance from 0 to 1). Make sure your axes cross at (0,0). b) Comment on the shape of the resulting pattern. c) Use a ruler, preferably a transparent one, so you can see all the data points. By eye, draw the best-fit line through these points. This means that there should be about the same number of points above and below the line. Moreover, such balance should hold both at low and at high ends of the data cloud. d) Any straight line has the form y=a+bx. Determine the constants a and b for this particular line. How is this done? See Figure 2.1. Write out the equation specific to this particular straight line; this means plugging in the numerical values of a and b into y=…+…x. e) According to a logical model, the line y=1.62-2x is expected. Graph it superimposed on the previous (not separately!). To do so, plug 3 simple values of x into y=1.62-2x – like x=0, 0.5, and 1 – and calculate the resulting y. If the 3 points do not fall perfectly on a line, you’ll know a mistake has been made. f) How close is your visually drawn straight line fit to the model? In particular, how close is your best-fit slope to -2, the slope predicted by the model?

Figure 2.1. For the line y=a+bx, “intercept” a is the left side of the triangle formed by this line and the point (0,0). The slope b is the ratio of lengths of the left and bottom sides. Since the line in this figure slopes downward, the slope must carry a negative sign.
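In the same spirit as Figure 2.1, here is a small sketch (Python, with made-up coordinates for two points read off a hand-drawn line, not the book’s data) of how the slope b and intercept a follow from any two points on a straight line, and how to generate points for the model line y = 1.62 - 2x of Exercise 2.3e.

```python
def line_through(x1, y1, x2, y2):
    """Slope and intercept of the straight line y = a + b*x through two points."""
    b = (y2 - y1) / (x2 - x1)   # slope: rise over run (negative if the line goes down)
    a = y1 - b * x1             # intercept: the value of y at x = 0
    return a, b

# Hypothetical points read off a hand-drawn best-fit line.
print(line_through(0.1, 1.4, 0.7, 0.2))   # a = 1.6, b = -2.0

# Points for graphing the model line y = 1.62 - 2x.
for x in [0, 0.5, 1]:
    print(x, 1.62 - 2 * x)   # (0, 1.62), (0.5, 0.62), (1, -0.38)
```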


How to measure the number of parties

We usually face a mix of larger and smaller parties, whether we consider their vote shares or seat shares. How do we measure the “real” number of parties when some are large and some are negligible? Most often the following “effective number of components” is used: N = 1/Σ(pi²).

Here pi is the fractional share of the i-th component and the symbol Σ (sigma) stands for SUM. What does this mean? Suppose the seat shares of 4 parties are 40, 30, 20 and 10, for a total of 100. The “fractional shares” are those numbers divided by the total. Then N = 1/(0.40²+0.30²+0.20²+0.10²) = 1/0.30 = 3.3. There are less than 4 but more than 3 serious parties, sort of. This “Laakso-Taagepera effective number” is never larger than the total number of parties (which here is 4). The values of N shown in this chapter are effective numbers of assembly parties, i.e., numbers based on seats. For other purposes, one might use vote shares.
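The same calculation in code form – a minimal Python sketch using the worked example above:

```python
def effective_number(shares):
    """Laakso-Taagepera effective number of components:
    N = 1 / (sum of squared fractional shares)."""
    total = sum(shares)
    fractions = [s / total for s in shares]   # divide by the total first
    return 1 / sum(p * p for p in fractions)

print(effective_number([40, 30, 20, 10]))   # 3.33..., i.e., about 3.3 "serious" parties
```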

Exercise 2.4 a) Guess at the effective number when the seat shares of parties are 45, 35, 10, 9 and 1. Write it down. Then calculate it. Compare the result to the guess. b) Do the same when the numbers of seats for three parties are 100, 80 and 20, respectively. (CAUTION: To get the fractional shares, one first has to divide by the total number of seats.)

3. Science Walks on Two Legs: Observation and Thinking

• In addition to asking how things are, we must also ask how they should be, on logical grounds. Science walks on these two legs.
• We largely see only what we look for.
• Models based on (almost complete) ignorance are only one of many different types of logical models.
• Make the models as simple as possible.
• Quantitative models have more predictive power than directional models.
• Knowing how to do something is of little use when one does not know when the time has come to make use of one’s skills.

This book is about social science – the approach used in science, applied to things social. Science is more than just learning facts. A person who has memorized the encyclopedia and quotes from it is not a scientist. Science deals with making connections among separate pieces of knowledge. Making connections among known facts can lead to new questions and new, previously unexpected vistas. Connections can be expressed in words, but they are more precise when they can be expressed in equations.

Four blind men tried to figure out what an elephant was, by touching it. The one who happened to grope around a leg concluded the elephant was a tree trunk. Oh well, you know the story… except my way to end it. Only three of them began to argue with each other, insisting on their own individual truths. The fourth person thought: These are reasonable people who know what they have encountered. Let me try to construct a picture that connects all of their observations. What he got was nothing like an elephant. It was a tree with a hose hanging from it, and so on. This description fitted the known facts, but it did not make much sense. So he kept on thinking. Finally, some detail in the description of the elephant’s trunk made him hit on a broad idea, a logical model far beyond the details described: This must be a live animal! Then everything began to fall in place. Well, almost everything. This person was a scientist: He tried to connect facts to each other – and then to connect the various connections.

Science walks on two legs – see Figure 3.1. One leg refers to the question: How things are? It leads to careful observation, description, measurement, and statistical analysis. The other leg refers to the question: How things should be, on logical grounds? That question guides the first one. The question “How things are?” assumes that we know which aspects of things are worth paying attention to. But we largely see only what we look for. And it’s the question “How things should be?” that tells us what to look for. This is the question we asked about the number of seat-winning parties – even while it may not look so. That science walks on two legs is a notion as old as social science. Auguste Comte, one of the initiators of social studies, put it as follows, two centuries ago, in his Plan of Scientific Studies Necessary for Reorganization of Society:

If it is true that every theory must be based upon observed facts, it is equally true that facts cannot be observed without the guidance of some theory. Without such guidance, our facts would be desultory and fruitless; we could not retain them: for the most part we could not even perceive them. (As quoted in Stein 2008: 30–31)

We largely see only what we look for. In this quote the logical model ("theory") may seem to come first, but actually a continuous interaction is meant: “some theory” as guidance, some observation, some further model refinement… The chicken and the egg evolve conjointly.

Figure 3.1. Science walks on two legs: Observation and Thinking

[Diagram: one leg, “How things ARE”, runs from Observation through Measurement, Data, and Data analysis (statistical etc.) to Empirical relationships; the other leg, “How things SHOULD BE on logical grounds”, runs from Thinking through Directional prediction to Quantitatively predictive logical models; the two legs are joined at the top by statistical testing of quantitatively predictive logical models.]


Quantitatively predictive logical models

What are “logical models”? Now that we have an example in developing p=M^1/2, the following answer may make sense to you. A logical model is a model that one can construct without any data input – or almost so. Just consider how things should be or, as importantly, how they cannot possibly be. In the previous case, we had no data, just the knowledge that the answer could not be less than 1 or more than 100. Furthermore, we should aim at “quantitatively predictive logical models”. This means that such logical models enable us to predict in a quantitative way. In the previous model the prediction was not just a vague “between 1 and 100” but a more specific “around 10”. “Quantitative” does not have to mean “exact” – just much more specific than “between 1 and 100”.

Logical models should not be believed in. They must be tested. In the preceding chapter, we compared the model with some data on elections in The Netherlands. This is not enough. This was merely an illustration, showing that the model might work. Much more testing was needed to make the model credible.

How does one go on to build logical models? One learns it best by doing, because it’s an art. Each situation requires a different model. The previous chapter used a model of (almost complete) ignorance, but this is only one of many approaches. If there is one piece of general advice, it is: Make it as simple as possible. (Albert Einstein reputedly added “and no simpler”, but let’s leave that for later.) For Stephen Hawking (2010), a good model has the following features:

• It is elegant. (This means simplicity, symmetry…)
• It has few arbitrary or adjustable components – it is parsimonious.
• It generalizes, explaining all available observations.
• It enables one to make detailed predictions; the actual data may possibly contradict these predictions and thus lead to a revision of the model.

Some of the greatest truths in life and science are very simple. Indeed, they are so simple that we may overlook them. And even when pointed out to us, we may refuse to accept them, because we say: “It cannot be that simple.” (Maybe you had this reaction to p=M^1/2.) Sometimes it can be simple. This does not mean that it is simple to find simple truths. Moreover, combining simple building blocks can lead to quite complex constructions. So, when we think we have a model, simple or complex, we should verify whether it really holds. This is what testing a model means, in science.


The invisible gorilla

We largely see only what we look for. A gorilla taught me so. At a science education conference in York we were shown a film clip. People were playing basketball, and we were instructed to count the number of passes. The action was too fast for me. I soon gave up counting and just watched idly, waiting for the end. Thereafter we were asked if we had noticed anything special. A few laughed knowingly and shouted: “The gorilla!” Oh my! I vaguely recalled reading something about a gorilla experiment. Was I among the ones taken in? I surely was. The clip was run again. While the game went on in the background, a person in a gorilla suit slowly walked across in the foreground. Center stage, it stopped, turned, and looked at us. Then it continued at a slow pace and exited. It was as plain as anything could be, once we were given a hint – but without such a hint most of us had not seen it! In science, the word “should” – as in “How things should be?” – is often the gorilla word. We see only what we think we should look for. If science were reduced to supposedly hard-boiled observation and analysis of facts but nothing else, we might improve our ability to count the passes while still missing the gorilla. The following test illustrates it.

Gravitation undetected

I sent some 35 social scientists data where the output was calculated exactly from the formula for the universal law of gravitation – but I didn’t tell them. The law is F = GMm/r^2 – the force of attraction F between two bodies is proportional to their masses (M and m) and inversely proportional to the square of their distance (r). G is a universal constant. I simply sent my colleagues a table of values of y, x1, x2, and x3, and told them that y might depend on the other variables. Where did I get the x-values? I used a telephone book to pick essentially random values. Then y came from y = 980·x1·x3/x2^2. What was the purpose of this experiment? If data analysis sufficed to detect how things are connected, some of my colleagues should have found the actual form of the relationship. All of them were highly competent in data analysis. Social data usually comes with large random variation, which makes detection of regularities so much more difficult. My pseudo-data had no such scatter. Yet no one found the form of the relationship. They tried out standard formulas used by statisticians, usually of the type y = a + b·x1 - c·x2 + d·x3. This linear expression uses addition or subtraction, while the actual equation involves multiplication and division.
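To make the setup concrete, here is a minimal sketch in Python of this kind of test – not the author’s original mailing, and with made-up random values standing in for the telephone-book numbers. It generates pseudo-data exactly from the disguised gravitation formula, fits the usual linear expression, and then fits on logarithmic scales, where the multiplicative form can be recovered:

```python
# A sketch, not the original experiment: pseudo-data from y = 980*x1*x3/x2**2,
# fitted first with a linear formula, then on log-log scales.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(1, 100, n)   # made-up stand-ins for "telephone book" numbers
x2 = rng.uniform(1, 100, n)
x3 = rng.uniform(1, 100, n)
y = 980 * x1 * x3 / x2**2     # the law of gravitation in disguise, no scatter

# Ordinary least-squares linear fit: y ≈ a + b*x1 + c*x2 + d*x3
X = np.column_stack([np.ones(n), x1, x2, x3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
r2_linear = 1 - resid.var() / y.var()

# Log-log fit: log y = log 980 + 1*log x1 - 2*log x2 + 1*log x3
Xlog = np.column_stack([np.ones(n), np.log10(x1), np.log10(x2), np.log10(x3)])
coef_log, *_ = np.linalg.lstsq(Xlog, np.log10(y), rcond=None)

print("linear fit R^2:", round(r2_linear, 3))
print("log-log exponents (exactly 1, -2, 1 here):", np.round(coef_log[1:], 3))
```

The comparison makes the point of the anecdote visible: a linear formula produces some numbers, but only the multiplicative (logarithmic) form reproduces the exponents of the actual law.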


If only they had found no connection! But it was worse than that. All those who responded found quite satisfactory results by the usual criteria of data analysis. Why was it worse than finding nothing? If we do not get a satisfactory result, we may keep on working. But if we get a result that looks satisfactory, yet is off the mark, then we stop struggling to find anything better. I cannot blame my colleagues for not detecting the law of gravity – I only gave them plenty of “what is” but no clue about “what should be”. Small wonder they missed it. However, the “should” part is underdeveloped and underestimated in today’s social sciences. While science in general walks on two legs, today’s social sciences show a tendency to hop on one leg (Figure 3.2). “How things should be on logical grounds” tends to be reduced to directional models, such as “If the number of seats available increases, then so does the number of parties represented”. In shorter notation: “M up → p up.” How much up? The directional model does not predict it. But with a little thinking, we could propose p = M^(1/2), which not only includes “M up → p up” but also offers a specific quantitative prediction. Quantitative models have more predictive power than directional models. The validity of a logical model must be tested – this is where statistics comes in. But without a logical model there is nothing to test. We would just be measuring.

Figure 3.2. Today’s social science tends to hop on one leg, Observation.



The gorilla moment for the number of seat-winning parties

Of course, the distinction between “is” and “should” isn’t always so clear-cut. Most often they enter intermixed. Consider how the number of seat-winning parties eluded me (and everyone else) for decades. The obvious part was that this number depends on how people vote – but pinning down typical vote patterns was even harder than counting the number of passes in a basketball game. In hindsight, this difficulty applied only to districts with many seats. In one-seat districts obviously one and only one party would win the single seat available, regardless of how many parties competed and how people voted. This observation looked trivial and not even worth stating – and this was the problem. It was “obviously” irrelevant to the puzzle in multi-seat districts. But suppose someone had spelled out the following: “In a one-seat district, the number of seat-winning parties is one, regardless of how voters vote.” It would have been like someone shouting: “The gorilla!” Indeed, if the way voters vote is completely overridden by something else in one-seat districts, could it be the same in multi-seat districts, at least partially? We are directed to see that the total number of seats available should matter, at least on the average. The rest was easy. When did the shift take place, from fact-oriented “is” toward “should”? It’s hard to say. The observation that a district of 100 seats offers room for more parties than a district of 10 seats or a district of 1 seat is sort of factual. Yet it brings in a fact that previously was thought irrelevant. It supplied the springboard for the first sentence above where the word “should” explicitly enters: “The total number of seats available should matter.” Moreover, don’t overlook the expression “on the average”. Dwarfs and giants may occur, but first try to pin down the usual.


What Is “Basic Numeracy”?

Literacy means the ability to read and write – and to understand what one reads. The term “numeracy” has been coined to mean the ability to handle numbers. Does one have some sense of the distance between New York and Los Angeles, or Paris and Berlin? Can one say whether a per capita Gross Domestic Product of 10,000 dollars or euros feels large or small? Can one carry out simple mathematical operations when ordered to do so? And if so, does one have a feeling for what the calculated result means or implies for some substantive purpose? For this purpose, does one know when to carry out simple mathematical operations, without someone else giving an order to do so? The latter is critical. Knowing how to do something is of little use when one does not know when the time has come to make use of one’s skills. We have taken steps in this direction.

“Does this answer make sense?” is a basic criterion for numeracy. “But this is what my computer program prints out” is a poor substitute. Figuring out approximate answers to problems is another aspect of basic numeracy. Desperately trying to figure out which ready-made formula to pick and plug in is a poor substitute. Almost everyone profits from basic numeracy. The ability to make one’s personal financial decisions calls for arithmetic operations, operations with percentages, and a general sense of sizes and distances. What is part of further basic numeracy varies, when it comes to specific occupations.

This book will not try to define what kind of numeracy should be needed in social sciences as such. Its prime purpose is to develop skills in constructing logical models. So it introduces only whatever aspects of numeracy are needed for this purpose. It does so at the point when they are needed and only to the degree they are needed. Introducing the basics of means and the median in the first chapter is an example – it was needed right then. Only one aspect of numeracy is so important for logical model building that it was introduced before the need became obvious: constructing graphs y vs. x. While needed for model building, most of these skills are also needed for social sciences more broadly.

4. The Largest Component: Between Mean and Total

• The share of the largest component is often close to the total size divided by the square root of the number of components. This is a quantitative model.
• In contrast, “The largest share tends to go down when the number of components increases” is a directional model.
• We should try to go beyond directional models, because quantitative models have vastly more predictive power.
• To express closeness of estimates, use relative differences rather than absolute differences. When relative differences are large, it is more meaningful to express relative error by multiplicative factors, such as ×÷2 (“multiply or divide by 2”), rather than percent differences, such as ±50%.

The United States has about 300 million people (312 million in 2011) divided among 50 states. (We make it simple and ignore DC and Puerto Rico.) What could be the population of the largest state? Write down your gut-level guess. Call it P1. The US also has an area of 10 million square kilometers (9.88, if one wishes to be overly precise). What could be the area of the largest state? Write down your gut-level guess. Call it A1. In Chapter 1 we found that in a district with 100 seats we can expect about 10 parties to be represented. How many seats might go to the largest of these parties? Write down your gut-level guess. Call it S1.

Between mean and total

Once again, our first reaction might be that one cannot know, short of reaching for an almanac. But think about the conceptual limits. Obviously, the largest state cannot exceed the total. But what is the least size it could have? Stop and think.

Here the number of states enters. If all states had the same population and area, each would have 300/50 = 6 million people and 10/50 = 0.2 million square kilometers. If they are unequal, the largest state must have more than that. So we know that 6 million < P1 < 300 million, and 0.2 million km^2 < A1 < 10 million km^2. (Note that the largest state by area need not be the largest by population!)

In the absence of any other information, our best estimate is the mean of the extremes. This is another ignorance-based logical model. Since population and area cannot go negative and vary hugely, the geometric mean applies. The result is P1 = 42 million and A1 = 1.4 million km^2. Actually, the most populous state (California, 38 million in 2011) was smaller than we guessed, while the largest by area (Alaska, 1.7 million km^2) was larger. Yet both estimates are within 25% of the actual figures. We may also say that they are within a factor of 1.25 – meaning “multiply or divide the actual value by 1.25”. Now compare your gut-level guesses with the figures above. Did you get as close as the geometric mean of mean size and the total?

Exercise 4.1 Note that we made use of both means. First, 6 million was the arithmetic mean size of the 50 states. Then we took the geometric mean of mean size and the total. Why? OK, try to do it the other way round. To obtain 50 equal-sized states, we have to divide the total by 50, meaning the arithmetic mean. There is no other way. But we can take the arithmetic mean of 6 and 300. Do it and see what population it would predict for California. Do the same for area. Roughly how large would the largest component be, compared to the total? How much would it leave to all the other states combined? Would it agree with your gut feeling?

We can repeat the estimates for other federations, such as Australia, and the same error limits hold. This approach fails, however, when the federal units are purposefully made fairly equal (such as in Austria) or a large state has later added small satellites (Prussia within Imperial Germany, Russia within the Soviet Union). On the other hand, the approach applies much more widely than to sizes of federal units. In particular, it applies to parties. If 100 seats are distributed among 10 parties, their mean share is 10 seats. The largest party can have as few as 10 and as many as (almost) 100 seats. Hence our best guess is (100×10)^(1/2) = 31.6 seats. In the aforementioned case of The Netherlands 1918–1952, the actual largest share over 9 elections ranged from 28 to 32, with an average of 30.6. Compare this to your gut-level guess at the start of the chapter. The fit isn’t always that good, but the estimate works as the average over many countries. Indeed, this relationship was a major link in my series of models leading to prediction of cabinet duration. It may be useful for some other sociopolitical problems.

Up to now, we have worked out individual cases. It’s time to establish the general formula. When a total size T is divided among n components, the largest component (S1) can range from T/n to T. The best estimate for S1 is the geometric mean of these extremes: S1 = [(T/n)·T]^(1/2) = [T^2/n]^(1/2) = T/n^(1/2). In short:

S1 = T/n^(1/2).

The largest component is often close to the total size divided by the square root of the number of components. The relative share of the largest component, out of the total, is s1=S1/T. Then we get an even simpler relationship, and we can also express it in percent:

s1 = 1/n^(1/2) = 100%/n^(1/2).
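For readers who like to check such arithmetic by computer, here is a minimal sketch of the largest-component estimate, applied to the illustrative figures used above:

```python
# A sketch of the largest-component estimate S1 = T / n**0.5 (illustrative values only).
def largest_component(total, n):
    """Geometric mean of the mean share (total/n) and the total; equals total / n**0.5."""
    return ((total / n) * total) ** 0.5

print(largest_component(300, 50))   # population, millions: about 42
print(largest_component(10, 50))    # area, million km^2: about 1.4
print(largest_component(100, 10))   # seats to the largest of 10 parties: about 31.6
```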

* Exercise 4.2 Australia had a population of 22.5 million in 2011. How large would you expect the population of its largest federal component to be? Your guess would depend on the number of components, but there is a snag. Australia has 6 states plus a large and empty Northern Territory with limited self-rule and also a separate Capital Territory (Canberra). Should we base our estimate on 6 or 8 components? a) Estimate the largest component both for n=6 and n=8. b) The actual figure for the most populous component (New South Wales) was 7.2 million in 2011. By how much are your two estimates off? (See next section, before answering.) c) Draw some general conclusions. (Students often too quickly conclude that what works better in one particular case is preferable in general. Keep random variation in mind. The US case does not imply that S1 = T/n^(1/2) always overestimates populations but underestimates areas of largest subunits!)

How to express absolute and relative differences

Suppose we overestimate the weight of an animal by 2 kilograms (about 5 lbs.). The absolute difference is 2 kilograms. Is this little or much? It depends. For an elephant, it would be a remarkably good estimate. For a mouse, it would be unbelievably off. So we are better off using the relative difference. When relative differences are small, it is convenient to express them in percent. When our estimate is 500 while the actual number is 525, then the percent difference is [(525-500)/525]×100% ≈ 5%. But when the differences are large, then we run into trouble.


Suppose your boss tells you that business is bad and your salary must be cut by 50%. “But don’t worry. When business gets better, I’ll increase your salary by 50%.” If you buy that, you are in for a permanent salary reduction. Indeed, suppose your initial salary was 1000 euros. A 50% cut will take it to 500 euros. Increase 500 by 50%, and you end up with 750. Percent differences lack symmetry: They cannot be lower than -100%, but they can be very much higher than +100%. Hence a percent error of ±5% is fairly clear, but an error of ±50% becomes ambiguous. Use percent difference or error only when it is much less than 50%. When relative differences are large, it is more meaningful to express them by multiplicative factors, such as ×÷2 (“multiply or divide by 2”). Being low by a factor of 2 is the same as being low by 50%. But being high by a factor of 2 is the same as being high by 100%. If your boss cuts your salary by a factor of two and later increases it by a factor of two, then you do break even. This is a marked advantage. By analogy with the widely used notation “±”, I’ll use “×÷” (“multiply or divide”) for differences “by a factor of”. But be aware that this is not a widely practiced notation – most people will not recognize it, unless you explain it.
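A minimal sketch of the same idea, expressing how far off an estimate is as a multiplicative factor rather than a percentage (values taken from the examples above):

```python
# A sketch of the "×÷" idea: error as a multiplicative factor.
def factor_off(estimate, actual):
    """Return the factor by which the estimate is off (always >= 1)."""
    ratio = estimate / actual
    return max(ratio, 1 / ratio)

print(factor_off(42, 38))      # California population estimate: off by about 1.1
print(factor_off(1.4, 1.7))    # Alaska area estimate: off by about 1.2
print(factor_off(500, 1000))   # the halved salary: off by a factor of 2
```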

Directional and quantitative models

This difference was pointed out in the previous chapter. Let us apply it here. A logical model can be merely directional, such as “If the number of components goes up, the share of the largest component goes down.” But how fast does it go down? We get much more out of a quantitative model such as S1 = T/n^(1/2). It tells us everything the directional model does, but also much more. For instance, if n increases 4-fold, then S1 is reduced by one-half. Compare this to the directional statement: “If n increases 4-fold, then S1 is reduced”. By how much? It does not tell us. We should always aim at going beyond a directional model, because a quantitative model has vastly more predictive power. Some people think that every research project must start with a “hypothesis”. For the directional model, this would be the same as the model: “If the number of components goes up, the share of the largest component goes down.” But what could be the starting “hypothesis” when we aim at a quantitative model? We do not know its shape ahead of time. So we can only offer the stilted hypothesis: “There is some quantitative relationship between the number of components and the share of the largest component – and if I try, I can find it.” It is less stilted to replace the “research hypothesis” by the research question:


“What is the quantitative relationship between the number of components and the share of the largest component?”

Connecting the connections

Science deals with making connections among facts, but it aims at more than that: It tries to make connections among those connections. What does this mean? The model S1 = T/n^(1/2) predicts that with 100 seats and 10 parties, the largest party share would be around 100/10^(1/2) = 31.6. In Chapter 1 we established the model n = T^(1/2) for parties. These two models interconnect. Indeed, we can “plug” n = T^(1/2) into S1 = T/n^(1/2). This means replacing n by its equivalent, T^(1/2):

S1 = T/n^(1/2) = T/(T^(1/2))^(1/2) = T/T^(1/4) = T^(3/4).

(If you feel uneasy about this T/T^(1/4) = T^(3/4), take a look at Exponents in Chapter 13.) In short, when T seats are allocated by proportional representation, the largest party can be expected to win T^(3/4) seats:

S1 = T^(3/4).

Let us apply it to T = 100: 100^(3/4) = 31.6 – the same result we obtained when going by two separate stages. This is what “making connections among connections” means. Symbolically, we have more than two isolated connections, T→n and (T and n)→S1 – in the case of parties, they connect into T→S1. We have interlocking models. This is what makes science powerful.
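A minimal sketch of this chaining of models, using the T = 100 example above:

```python
# A sketch of "connections among connections": chaining n = T**0.5 and
# S1 = T / n**0.5 gives S1 = T**0.75 directly.
T = 100                      # seats in the district
n = T ** 0.5                 # expected number of seat-winning parties (Chapter 1)
S1_two_steps = T / n ** 0.5  # largest share via the two separate models
S1_direct = T ** 0.75        # the combined model

print(n, S1_two_steps, S1_direct)   # 10.0, then 31.6 twice
```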


How long has the geometric mean been invoked in political science?

At least for 250 years. Jean-Jacques Rousseau (1762) considered the government as an intermediary power between the people as the source of law (“the sovereign”) and the people as individuals, subject to law (“the people”):

This last relation can be depicted as one between the first and last terms of a geometric progression, of which the geometric mean is the government. The government receives from the sovereign the orders which it gives to the people; and if the state is to be well balanced, it is necessary, all things being weighed, that the product of the power of the government multiplied by itself should equal the product of the power of the citizens who are sovereign in one sense and subjects in another. (Rousseau, The Social Contract, Book III, Chapter I “On Government in General” 1762/1968)

Thereafter, Rousseau plays with the idea of the power of the people as a whole corresponding to their number, say 10,000 citizens, while an individual is one. But then he veers off into generalities, shying away from stating that the geometric mean is 100. He probably sensed that this “100” would raise questions he could not answer. Would it mean having a governmental payroll of 100 people, for such a small republic?

Exercise 4.3 Let us continue to play with the idea Rousseau fleetingly suggests. How large a “government” would the square root of population suggest for New South Wales, Australia, California, or the United States? How do those numbers compare with the actual number of employees in the public sector, at all levels? The broader question raised here is the following: Do larger units have a larger percentage of the work force in the public sector (directional model), and if so, what is the relationship to the population (quantitative model)?

5. Forbidden and Allowed Regions: Logarithmic Scales

• Some values for a quantity may be forbidden on conceptual grounds, or on grounds of factual knowledge. What remains is the allowed region.
• The center of the allowed region is often easier to visualize when showing the regions on a logarithmic scale.
• The logarithmic scale shows 0.1, 1, 10 and 100 at equal intervals.
• The (decimal) logarithm of a number such as 10,000 – “1” followed by zeros – is simply the number of zeros.
• When numbers multiply, their logarithms add: log(ab) = log a + log b.
• It follows that log(1/b) = -log b, and log(a^n) = n log a.

In all the previous examples, we could represent our options on an axis ranging from minus to plus infinity (-∞ to +∞). How did we narrow down the options, say, for the weight of lmysh in Exercise 1.3? Let us proceed by systematic elimination, as shown in Figure 5.1.

Regular scale and its snags

Negative weights are inconceivable, even in fairy tales. Therefore, we should first mark off the negative part of the scale as an utterly forbidden region for any weights whatsoever. Thereafter, we could mark the weights of shrews and blue whales and exclude what’s outside this range as forbidden regions for mammals. This would leave the allowed region, between Shrew and Whale. In Figure 5.1, the forbidden regions are dashed off, so that the allowed region stands out. The dashing is heavier for utterly forbidden regions, to visualize a stronger degree of exclusion.

Figure 5.1. Weights of mammals on an almost even scale: Allowed region.


All that’s left is to make the best estimate within this region. This is still quite a job, but without pinning down the allowed region, we could not even start. Within the allowed region we could still distinguish a central region, where we are most likely to expect to find most mammals, from cats to horses, and marginal regions of surprise, where we would be surprised to find lmysh, if we know that few mammals are as small as shrews or as large as whales. The scheme above is not drawn to an even scale. If we try to redraw it truly to scale (Figure 5.2), we run into double trouble. First, compared to the weight of the whale, the weight of the shrew is so tiny that it fuses with 0, and the forbidden zone between them vanishes. Second, the “central region” of little surprise would not be in the visual center between Zero/Shrew and Whale. This visual center represents the arithmetic mean of the shrew and the largest whale, and it would correspond to a median-sized whale! Most mammals range around the geometric mean of Shrew and Whale, but in this picture this range would be so close to Shrew that it, too, would be indistinguishable from zero. (In Figure 5.1, I had to expand this part of the graph to make it visible.) To get a more informative picture we must shift to a different scale.

Figure 5.2. Weights of mammals on a strictly even scale.

Logarithmic scale

What is the lo-gar-ithm of a number? It’s less scary than it looks. For numbers like 0.01, 0.1, 1, 1000 and 1,000,000, their logarithm (base 10) is simply the number of zeros. Count the zeros on the right of “1” as positive. Thus log1000 = 3 and log1,000,000 = 6. Count the zeros on the left of “1” as negative. Thus log0.01 = -2 and log0.1 = -1. What about 1? It has no zeros, so log1 = 0. Now graph numbers according to the distances between their logarithms. This is done in Figure 5.3.


Figure 5.3. Integer exponents of 10 and their logarithms.

In other words, a logarithmic scale is a scale where you add equal distances as you multiply by equal numbers. On this scale, the distances from 1 to 10, from 10 to 100, and from 100 to 1000 are equal, because each time you multiply by 10. In the reverse direction, the same distance takes you from 1 to 0.1 and from 0.1 to 0.01, because each time you divide by 10. This is how we get the minus values for logarithms of 0.001 etc. The logarithmic scale goes by the “order of magnitude” of numbers, which roughly means the number of zeros they have. Note in Figure 5.3 that 10,000 = 10^4 and log10,000 = 4. Also 0.01 = 10^(-2) and log0.01 = -2. Do you notice the correspondence between exponents and logarithms? We have 10,000 = 10^(log10,000), and 0.01 = 10^(log0.01)! This is so for all numbers:

n = 10^(log n).

Let us show the weights of mammals on such a logarithmic scale (Figure 5.4). When using kilograms, the shrew is at 0.003; it logically must be between 0.001 and 0.01. The blue whale is at 30,000; it logically must be between 10,000 and 100,000. Now the visual center of the allowed region is around the geometric mean of shrew and blue whale, which is 10 kg – an average dog. The surprise regions evenly flank the geometric mean.

Figure 5.4. Weights of mammals on a logarithmic scale: Allowed region

But what happens to the utterly forbidden region? It vanishes because, on the logarithmic scale, the zero point shifts to the left, infinitely far. Indeed, we may add as many zeros as we wish after the decimal point, 0.000000…0001, and we are still higher than zero. The logarithmic scale has no zero point – it is infinitely far away. Objects cannot have zero or negative weights, and the logarithmic scale cannot show zero or negative numbers. The two are well suited for each other.
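A minimal sketch checking the shrew-to-whale arithmetic above (weights in kilograms, as quoted in the text):

```python
# A sketch: the geometric mean of the quoted limits lands near 10 kg,
# and the mean of the logarithms lands near log10(10) = 1.
import math

shrew, whale = 0.003, 30_000
print((shrew * whale) ** 0.5)                # about 9.5 kg -- an average dog
print(math.log10(shrew), math.log10(whale))  # about -2.5 and 4.5; their mean is about 1
```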

Exercise 5.1 a) Make a list of coins and banknotes used in your country. Mark their locations on a logarithmic scale. (Place “2” at one-third of the distance from 1 to 10, “3” at one-half, and “5” at two-thirds.) b) Comment on the distances between these locations. c) Now suppose I asked you to mark their locations on a regular scale. In your response, please stay polite. “Go fly a kite” is acceptable. d) Comment on the statement “We do not encounter the logarithmic scale in everyday life".

* Exercise 5.2 a) Show the forbidden and allowed areas for the number of seat-winning parties in a 100-seat district, on logarithmic scale. Also show the geometric mean. On the log scale, 2 is about one-third of the way between 1 and 10; 3 is halfway, and 5 is two-thirds toward 10. Do not bother about distinguishing between different types of forbidden ranges, nor between surprise and central ranges. b) Do the same for Casablanca deaths in Exercise 1.2. c) Do the same for the most populated US state. [NOTE: We do not have to start from 1 person! Start with a million.] d) Do the same for the largest US state. [We do not have to start from 1 km2!] e) Try to do the same on regular scale, for parts b to d, if you feel it’s simpler. Draw some conclusions.

When numbers multiply, their logarithms add

We said that a logarithmic scale is a scale where you add equal distances as you multiply by equal numbers. From this, the following results (for proof see Chapter 13):

log(ab) = log a + log b.


For instance, log(100×1000)=log100+log1000. Yes, 5 zeros equal (2+3) zeros. As another example, take 2×5. This is the same as 1×10. Hence the distance log2+log5 in Figure 5.3 must be the same as the distance log1+log10, which is 0+1=1. So, whatever the values of log2 and log5, they must add up to 1. How much is log(1/a)? Look at 10 and 1/10 in Figure 5.3. Their logarithms are 1 and -1. This is general:

log(1/a) = -loga.

It follows that

log(a/b) = log a - log b.

What about the logarithm of 5^3? log(5^3) = log(5×5×5) = log5 + log5 + log5 = 3·log5. We can generalize: When a number is raised to the power n, its logarithm is multiplied by n. Thus,

log(a^n) = n log a.

If you merely memorize these formulas, you risk getting confused. It’s safer to make an effort to understand why they must be true. We are now in a position to calculate logarithms in C = 42/N^2, in Exercise 2.1: logC = log(42/N^2) = log42 + log(1/N^2) = log42 - log(N^2) = log42 - 2logN. Note that this means a linear relationship between logC and logN. It’s like y = a + bx, where a is log42, and the slope b is exactly -2.
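For readers who want to verify these rules numerically, here is a minimal sketch; the particular numbers 7 and 3 are arbitrary choices, not from the text:

```python
# A sketch checking the logarithm rules, and the linearization of C = 42/N**2
# into log C = log 42 - 2*log N.
import math

a, b = 7.0, 3.0
assert abs(math.log10(a * b) - (math.log10(a) + math.log10(b))) < 1e-12
assert abs(math.log10(1 / a) - (-math.log10(a))) < 1e-12
assert abs(math.log10(a ** 5) - 5 * math.log10(a)) < 1e-12

N = 2.2
C = 42 / N**2
print(math.log10(C), math.log10(42) - 2 * math.log10(N))   # the same number twice
```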

Graphing logarithms

In the graph you constructed for Exercise 2.1, the point for Botswana is twice as high as the next one, forcing us to extend the C scale. Also the visual trend curve and the curve for the model C = 42 years/N^2 are bent, which makes it hard to compare data with respect to these curves. In Exercise 2.3, on the other hand, the data points fall roughly on a straight line, and the logical model also is a straight line. At the time, I did not tell you what the variables x and y represented. Now it can be told: x = logN and y = logC. In y = 1.62 - 2x, the 1.62 is log42, and the slope -2 comes from the exponent of N. The two tables in Chapter 2 combine into Table 5.1.


Table 5.1. Number of parties and cabinet duration, and their logarithms.

Country    N     C (yrs.)   x=logN   y=logC
Botswana   1.35  40         0.130    1.602
Bahamas    1.7   14.9       0.230    1.173
Greece     2.2   4.9        0.342    0.690
Colombia   3.3   4.7        0.519    0.672
Finland    5.0   1.3        0.699    0.114

Exercise 5.3. Check the values of logarithms in Table 5.1, roughly, comparing them to distances in Figure 5.3. a) For values of N the logarithms must be more than log1=0 but appreciably less than log10=1. Is this so? b) For values of C, which logarithms must be between 0 and 1, and which must be between 1 and 2? Are they so?

The remarkable thing is that taking logarithms changes the curve C = 42 years/N^2 into a straight line, y = 1.62 - 2x, and a straight line is much more manageable than a curve. When can logarithms turn a curve into a straight line? We’ll come to that.

Logarithms of numbers between 1 and 10 – and beyond

Table 5.2 shows the logarithms for numbers from 1 to 10. Why such values? We’ll come to that. For the moment, accept them, for information. Note that, pretty closely, log2 = 0.3, log3 = 0.5, log4 = 0.6, and log5 = 0.7. Often, this is close enough.

Table 5.2. Logarithms of numbers from 1 to 10 … and from 10 to 100.

x      1   1.5    2      3      4      5      6      7      8      9      10
log x  0   .176   .301   .477   .602   .699   .778   .845   .903   .954   1

x      10  15     20     30     40     50     60     70     80     90     100
log x  1   1.176  1.301  1.477  1.602  1.699  1.778  1.845  1.903  1.954  2

What about log300? log300 = log(100×3) = log100 + log3 = 2 + 0.477 = 2.477. Look at Table 5.2: log30 is exactly 1 added to log3. So when we know the logarithm of 3, we also know it for 30, 300, and 3000 – just add the number of zeros! Recall that 10,000 = 10^4 and log10,000 = 4. So 10,000 = 10^(log10,000). This is so for all numbers: n = 10^(log n).

Try it out on your pocket calculator. Take 10 to the power 1.301, and see if you get 20. (On a usual calculator, enter 10, push “x^y”, enter 1.301, push “=”. Or simply enter 1.301 and push “2nd” “log”.)

6. The Number of Communication Channels and the Duration of Cabinets

• The number of communication channels increases roughly as the square of the number of actors.
• The inverse square law of cabinet duration is one of the consequences.
• A law, in the scientific sense, combines empirical regularity and explanation through a logical model.

There is more to logical model building than the ignorance-based approach, which was used for the number of parties and for the size of the largest component. Now we take on a problem where a very different approach is needed. Start with an issue close to home. One of the tasks of parents is to adjudicate squabbles among their children. The more children, the more squabbles. This would be a directional model: when x up, then y up. But how fast does the frequency of squabbles increase as the number of children increases? Let us establish a quantitative model.

The number of communication channels among n actors

With no children or with just one child (A), no conflicts among children can arise. With two children (A and B), they can. When a third child (C) is added, conflict frequency triples, because in addition to conflict channel AB there are also AC and BC. My wife and I felt this tripling the moment our third child began to walk. What would have happened to the number of conflict channels if we had had a fourth child? Write down your answer. It’s the same with parties in a political system. The number of potential conflict channels increases faster than the number of parties. It should affect the duration of governmental cabinets: The more parties, the more conflicts; the more conflicts, the shorter the cabinet durations. But the broad issue is much wider. People form a society only if they communicate. Interaction among individuals is the very definition of what society is. This involves any sort of communication channels among social actors. The number of communication channels increases enormously as more individuals are added, and much of social organization addresses the issue of how to cut down on this number. The shift from direct democracy to representative democracy is one such attempt. How many communication channels (c) are there among n actors? Draw pictures and count the channels, for a small number of actors. You’ll find the following.

Number of actors, n:            0  1  2  3  4  5   6
Number of conflict channels, c: 0  0  1  3  6  10  …

What is the general logical model for n actors? We can approach this issue step by step, looking at the table above. Two actors certainly have 1 communication channel. A third one extends a channel to each of them, thus adding 2 channels. That makes 1+2=3. A 4th actor extends a channel to each of the existing 3. That makes 3+3=6. Thus, at each step in the table above, n and c add, to produce the new c. At the next step, 4+6=10, and then 5+10=15. We could continue, but it would take a long while to reach, say, 100 actors. A more powerful approach is needed. Each actor extends a channel toward each of the (n-1) other actors. So the total, for n actors, is n(n-1). But we have double counted, approaching each channel from both ends. So we have to divide by 2. The result is c = n(n-1)/2.

Don’t just believe it. Check that this model indeed yields the numbers in the table above. The formula fits even for n=0 and n=1. These cases may look trivial, but actually they are very important conceptual extreme cases. If our model did not fit the extreme cases, we would have to ask why it doesn’t. It may be tempting to say: “Oh well, extreme cases are special ones – you can’t expect the general rule to apply to them.” This is not so. To the contrary, we often construct the general model by thinking about simple extreme cases. Recall the question about the number of seat-winning parties. The “gorilla question” was: “What happens when the district has only one seat?” This pulled in the extreme case of one-seat districts. In the present case we have to make sure that the general formula applies to n=0 and n=1 too. When n is large, the difference between n-1 and n hardly matters. Then the model simplifies into

c ≈ n^2/2. [n large]


For n=100, the exact formula yields 9900/2 = 4,950. The approximate one yields 5,000 – close enough. The number of communication channels increases roughly as the square of the number of actors. Society is by definition not just a bunch of individuals but one of individuals who communicate with each other. If so, then laws about the number of communication channels among actors should be among the most basic in social sciences, leading to many consequences. Among them, we’ll discuss here the average duration of governmental cabinets. The number of communication channels also determines the number of seats in a representative assembly; this will be discussed in Chapter 19.
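A minimal sketch of the channel count and its large-n approximation:

```python
# A sketch of c = n*(n-1)/2 and the approximation n**2/2.
def channels(n):
    return n * (n - 1) // 2

for n in [0, 1, 2, 3, 4, 5, 6, 100]:
    print(n, channels(n), n**2 / 2)
# For n = 100: exact 4950, approximate 5000 -- close enough.
```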

Exercise 6.1 Suppose a university has 60 departments, divided evenly among 10 schools headed by deans. Assume for simplicity that the university president communicates only with the deans below her and the minister of education above her. Assume that each dean communicates only with the president, other deans and the department heads in her own school. a) Determine the president’s number of communication channels (P) and that for a single dean (D). Comment on the disparity. (WARNING: Draw the actual scheme described and count the channels – do not fixate blindly on some formula.) b) The president finds that her communication load is too high and proposes to divide the university into just 4 large schools. How large would the communication loads P and D become? Comment on the disparity. c) The president matters of course more than a dean. On the other hand, the deans are many. So let us take the arithmetic mean of P and D as the measure of overall communication load. How large would it be for a setup with 10 schools? And for 4 schools? Which of these numbers would be preferable, for keeping communication load down? d) Would some number of departments reduce this load even further? If so, then to what level? e) Would reduction in communication load seem to justify reorganization, assuming that our way to measure it is valid?

Average duration of governmental cabinets

How can c ≈ n^2/2 be used to determine the average duration of governmental cabinets? Cabinet breakdowns are caused by conflicts. Imagine we reduce the number of potential conflict channels (c) among parties by one-half; then the breakdown frequency of cabinets should also decrease by one-half. Hence cabinet duration (capital C) should double. But the number of conflict channels itself grows as the square of the number of parties (N). Thus, if we reduce the number of parties by one-half, cabinet duration should become 4 times longer. Such reasoning leads to an “inverse square” relationship:

C = k/N^2.

Here k is a “constant”, the value of which we do not know. Note that C is measured in units of time. Since N is a pure number (without units), k must also be in units of time, so as to preserve “dimensional consistency” between the two sides of the equation.

Exercise 6.2 “If the number of parties is reduced by one-half, cabinet duration should become 4 times longer.” a) Verify that the equation C = k/N^2 does lead to this outcome. You do so by plugging in N/2 instead of N, and 4C instead of C. Are the two sides of the equation still equal? b) To carry out such verification, do we have to know the value of the constant k?

Does this equation enable us to predict duration of cabinets, for a given number of parties? Not yet, because the value of the constant k is not known. How do we find it? Graph C vs. N for many countries and see if any curve of the form C = k/N^2 fits the data cloud. The actual study (Taagepera and Sikk 2010) included 35 countries. The 5 countries in Table 5.1 (and the earlier Table 2.1) were chosen here so as to represent this broader set fairly well. In Exercise 2.1 you were asked to graph C vs. N and draw in the trend curve for these data points. Then you were asked to draw in three curves of the form C = k/N^2: C = 21 years/N^2, C = 42 years/N^2, and C = 84 years/N^2. You probably concluded that all the data points were in between the extreme curves, and that the central curve was fairly close to your trend curve. This means that setting k at 42 years more or less agrees with the data. But it is pretty hard to draw a best-fitting curved line and to decide how close it is to a model-based curve. Shifting to logarithms changes the curves of the form C = k/N^2 into straight lines, logC = logk - 2logN. Since k can have many values, this is an entire family of lines, but all of them have slope -2 and hence are parallel. In Exercise 2.3 you probably observed, too, that the data points fell roughly on a straight line. Now it was much simpler to move a transparent ruler and choose which line looked like the best-fit line. (The best-fit line can be determined more accurately by statistical methods, but the main point remains: one first has to convert to logarithms.) You most likely found that this line had a somewhat steeper slope than the model line, y = 1.62 - 2x. This was so for these 5 data points. With 35 countries, the agreement with the expected slope -2 is close. In the actual study, we determined the statistical best-fit line with slope -2 (not just any slope) and found its intercept with the y-axis (at logN = 0). This was found to be 1.62, which is the logarithm of 42. (How do we find k when we know logk? On a usual pocket calculator, enter 1.62, and push “2nd/LOG”.) So k = 42 years, and the relationship is specified as

C = 42 years/N^2.
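A minimal sketch of the slope-fixed fit described above, applied only to the five illustrative countries of Table 5.1 (not the 35-country data set or the exact method of the actual study):

```python
# A sketch: fix the slope at -2 and estimate log k as the mean intercept.
import math

N = [1.35, 1.7, 2.2, 3.3, 5.0]
C = [40, 14.9, 4.9, 4.7, 1.3]

# From log C = log k - 2*log N it follows that log k = log C + 2*log N.
logk_values = [math.log10(c) + 2 * math.log10(n) for n, c in zip(N, C)]
logk = sum(logk_values) / len(logk_values)
print(logk, 10 ** logk)   # about 1.62, i.e. k of roughly 42 years, for these 5 points
```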

This equation is logically based regarding N^2, while the constant “42 years” is empirically determined. It predicts cabinet duration for a given number of parties, within a considerable margin of error of ×÷2 (multiply or divide by two) – remember Botswana and Greece on your graphs! Even so, this is the mean duration over a long period. One particular cabinet can last much longer or fall within days, if there should be a major scandal.

Let us review what has been done, in somewhat more abstract terms. First assume that we have no logical model – just data. It is hard to describe the general trend in any detail when looking at the graph C vs. N, apart from the coarse directional description: “N up, C down”. On the graph logC vs. logN, however, something new jumps to the eye: the pattern visibly is very close to linear. This means that logC = a + b(logN). This corresponds to C = k·N^b. Thus b is the same in both equations, while logk = a. We can calculate the constants a and b in logC = a + b(logN), as you did in Exercise 2.3. Hence we get the b in C = k·N^b directly and k indirectly. We find, however, that b is suspiciously close to -2. This may have a theoretical reason. So start to look for this reason. Then c ≈ n^2/2 supplies the answer.

This is part of a broader chain of relationships, connections among connections. Cabinet duration depends on the number of parties. The number of parties, in turn, can be deduced from the assembly size and the number of seats allocated in the average district. (Calculating the number of seat-winning parties in Chapter 1 and the share of the largest party in Chapter 4 were important steps in this chain.) Hence we can deduce cabinet duration from the number of seats in the assembly and in the district. This means that we can design for a desired cabinet duration by manipulating the assembly size and the number of seats allocated in the average district. Thus, this model building has reached a stage that could be of practical use, even while the range of error is as yet quite large.

Leap of faith: A major ingredient in model building

A crucial link in the model was actually pretty weak: “Imagine we reduce the number of potential conflict channels (c) among parties by one-half; then the breakdown frequency of cabinets should also decrease by one-half.” Is that really so? One may accept that more conflict channels make coalition breakdown more likely. But what is the evidence that breakdown frequency is proportional to c? Should I give up on building a model for cabinet duration until I find such evidence? No. The answer is two-fold. First, breakdown frequency being proportional to the number of conflict channels is the simplest approach to try, unless we have contrary evidence. So let us try it, even while it is a leap of faith! Second, if my assumption is wrong, it will catch up with me later on: the resulting model will not fit the data. That the final model actually correctly connects C to N is indirect evidence that the proportionality assumption is valid. The world is full of people who point out why things cannot be done. And then there are those who try anyway. They sometimes succeed. Those who don’t dare to try never do. This applies to model building too.

* Exercise 6.3 Consider a pretty extreme situation where each day sees a new cabinet formed. How many parties would this correspond to? a) Convert the equation C = 42 years/N^2 into the form N = ..., with N alone on the left side and all the rest on the right. Go slow and apply the basic balancing rule of algebra, step by step (see box). b) Use this equation to calculate N when each day sees a new cabinet formed. c) Now suppose, instead, that we actually feel no democratic regime could withstand more cabinet changes than once a month. How many parties would this correspond to? d) Now suppose we draw a line at a situation where the effective number of parties is 30. To what cabinet duration might it lead?


The basic rule of algebra: Balancing

The equation C = k/N^2 is nice for calculating C. But sometimes we want to calculate N for a given C. Then we have to convert this equation into something that looks like N = ..., with N alone on the left side and all the rest on the right. Sometimes we even wish to find out what k would have to be for a given C and N; then we have to convert it into the form k = ... This is the topic of Exercise 6.3. The hard fact is that quite a few social science students find such conversions hard – they just haphazardly try to multiply or divide, trying to recall some rules they memorized in Algebra 1, without ever understanding why these rules make sense. Now it’s time to understand; otherwise you’ll be lost in the following chapters.

The basic rule is balancing the two sides of the equation. Suppose we want to calculate k. So we want to have k on the left side, alone. Instead of C = k/N^2, reverse it to k/N^2 = C. To get rid of 1/N^2, multiply by N^2 – but you have to do it on both sides of the equation, so as to maintain balance: N^2(k/N^2) = N^2(C). But N^2/N^2 = 1, so we can cancel out the N^2 upstairs and downstairs, and k = N^2·C results. (Do not just read this. Put the text aside, and practice it!)

Now suppose that we want to calculate N. This means we want to have N on the left side, alone. Start with C = k/N^2. Multiply by N^2 – again on both sides: N^2(C) = N^2(k/N^2). Again, cancel out the N^2s upstairs and downstairs, so that N^2·C = k results. Divide on both sides by C: (N^2·C)/C = k/C. Canceling out C/C leaves us with N^2 = k/C. Take the square root on both sides: (N^2)^(1/2) = (k/C)^(1/2). As 2·(1/2) = 1, we are left with N = (k/C)^(1/2). Put the text aside, and practice it!

If you merely copy this procedure in answering the exercise above, you are badly short-changing yourself. You would condemn yourself to be a rule-applying underling rather than a thinking person. If we merely apply rules of moving symbols around, we may inadvertently multiply instead of dividing. Say, we want to get rid of C, but instead of dividing on both sides by C, we multiply. Such an error cannot slip by when we follow the cancelling-out procedure: instead of C/C, which cancels out, we get C·C, which cannot cancel out. Have the patience to go slow and really apply this balancing rule, step by step, until it becomes so automatic that you can start cutting corners, cautiously.
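For readers who want a computer check of this balancing, here is a minimal sketch using the sympy library – one possible tool choice, not one used in the book:

```python
# A sketch: let symbolic algebra confirm the balancing results derived above,
# by solving C = k/N**2 for k and for N.
import sympy as sp

C, k, N = sp.symbols('C k N', positive=True)
equation = sp.Eq(C, k / N**2)

print(sp.solve(equation, k))   # k = C*N**2
print(sp.solve(equation, N))   # the single positive solution, N = (k/C)**(1/2)
```

The point of doing it by hand first still stands: the computer only confirms steps you should already understand.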

Laws and models

What is a law, in the scientific sense? It’s a regularity that exists, and we know why it holds. A law can come about in two ways. Some regularity may be first observed empirically and later receives an explanation through a logical model. Conversely, a model may be deduced logically, and later is confirmed by testing with data. Most often, it is messier than that: Some limited data → an urge to graph them → seeing a puzzling relationship → an urge to look for a logical explanation → a tentative logical model → more data → possibly adjusting the model → possibly refining the way data are measured → more data collection and model adjustment … → a relationship that qualifies as a law.

For the purpose of learning model building, the main point here is that the logical model for cabinet duration is quite different from the previous one. This model is based on the idea of communication channels. Before you start thinking that the ignorance-based approach is all there is to model construction, the example of cabinet duration serves as a warning that logical models come in an infinite variety of ways, depending on the issue at hand. One has to think.

Now you are in a position to follow the history of a research issue. This may be useful in perceiving how quantitative research proceeds more generally. Someone else had calculated and tabulated the data and observed that duration decreased when the number of parties increased (Lijphart 1984: 83, 122, 124–126). Thus, a directional model was established. I graphed Lijphart’s data, because I wanted to have a picture of the relationship. The pattern was clearly curved (like your graph in Exercise 2.1). So I graphed the logarithms (like your graph in Exercise 2.3), hoping the data cloud would straighten out. It did, and the slope in this graph was suspiciously close to -2.0, meaning an inverse square relationship. Such simple integer values do not happen just like that – “Maybe they want to tell us something,” I said to myself. For a long time, however, I could not figure out what the data graph was trying to tell me – it isn’t such a straightforward process. After a while, the connection to communication channels occurred to me. But the process started with graphing the data. This is why I have explained the process here in some detail, so that you can not only admire it but also use it.

7. How to Use Logarithmic Graph Paper

• Many logical models and empirical data patterns in social science are best presented on logarithmic scales.
• It is easy to establish that log2 = 0.30 and log5 = 0.70.
• Graphing numbers on log scale and graphing their logarithms on regular scale leads to exactly the same image.
• Some curved data clouds y vs. x straighten out when graphed on fully logarithmic (log-log) graph paper. Some others do so on semilog graph paper. For some, neither works.

When we want to graph logarithms, it’s a nuisance to have to calculate the logarithms for each data point, the way I had to do for Table 5.1. There is a shorter way: use logarithmic graph paper. The instructor may distribute to you sample copies, which you can use as master copies for producing what you need. Logarithmic graph paper can also be found on the Internet, but be cautious – the format offered may or may not suit your needs. To make an intelligent choice, we have to understand what is involved. First, let us extend our grasp of logarithms.

Even I can find log2!

We said that a logarithmic scale is a scale where you add equal distances as you multiply by equal numbers. On this scale, the distances from 1 to 10, from 10 to 100, and from 100 to 1000 are equal. But we must be consistent. If multiplying by 10 adds equal distances, then this must be so for all numbers. For instance, distances must be equal when multiplying by 2: 1, 2, 4, 8, 16, 32, 64, 128... And the two scales must agree with each other. This means that 8 must be placed a little lower than 10, and 128 a bit higher than 100. This can be done, indeed, as shown in Figure 7.1.

Figure 7.1. Integer exponents of 10 and 2, and their logarithms.


Now comes the creative step. Note that 2^10 = 1024 is awfully close to 10^3. Hence we can write 2^10 ≈ 10^3. This means that 10 times log2 roughly equals 3 times log10: 10·log2 ≈ 3·log10 = 3. It follows that log2 ≈ 3/10 = 0.30. The log2 cannot logically be anything else! Actually, log2 = 0.30103, because 1024 is slightly larger than 1000. The point is that it is within your ability to determine logarithms of simple numbers. We can do so only approximately, but often this suffices. Of course, we most often use pocket calculators to calculate logarithms, but it makes the “logs” less scary to know we could find them even without a calculator. Since 4 = 2^2, it immediately follows that log4 = 2·log2 ≈ 0.60. And since 8 = 2^3, log8 ≈ ... (fill in the blank). What about log5? Note that 5×2 = 10. So log5 + log2 = 1, and log5 ≈ ... (fill in the blank).
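A minimal sketch comparing these back-of-the-envelope values with the exact ones:

```python
# A sketch: rough logarithms from 2**10 ≈ 10**3, compared with exact values.
import math

log2_rough = 3 / 10            # from 2**10 = 1024 ≈ 1000 = 10**3
log4_rough = 2 * log2_rough    # since 4 = 2**2

print(round(log2_rough, 3), round(math.log10(2), 3))   # 0.3 vs 0.301
print(round(log4_rough, 3), round(math.log10(4), 3))   # 0.6 vs 0.602
```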

* Exercise 7.1 a) Determine log3, approximately. Hint: Note that 3^4 = 81 ≈ 80 = 2^3×10. Take logarithms on both sides. [Assume log2 = 0.30 and log5 = ... from above.] b) Determine log7, approximately. Hint: Note that 7^2 = 49 ≈ 50 = 5×10. Take logarithms on both sides. c) To complete the list of logarithms of integers 1 to 10, calculate log6 and log9. Compare our approximate values to those in Table 5.2.

And now I ask you something really scary. Estimate log538.6. How in the world could you know that? Relax. This is approximately log500. Which, in turn, is log(5×100). We found that log5 ≈ 0.70. So log538.6 ≈ 0.70 + 2 = 2.70. How close are we? Enter 538.6 on your pocket calculator and push LOG: 2.731. Close enough! (If we are really ambitious, we might note that 538.6 ≈ 540 = 2×27×10 = 2×3^3×10. So log538.6 ≈ 0.30 + 3×0.475 + 1 = 2.725. But if we need that much precision, better get a calculator.)

Placing simple integer values on logarithmic scale

In the previous figure, we could now place the logarithms of 2, 3, 4, etc. on the log x scale (which is a regular scale); but instead, we could place the numbers themselves on the x scale (which is logarithmic), as shown in Figure 7.2. We have not bothered writing out the zeros in 10, 20, 200, etc., because it is sort of superfluous. Indeed, if we decide that one of the “1”s on this scale really stands for 1, then the next “1” and “2” on its right must stand for 10 and 20, and so on. And the “1, 2, ... 5” on its left must stand for 0.1, 0.2, ... 0.5. Each “period” (from one “1” to the next “1”) looks exactly like another one – it’s just a matter of dividing or multiplying the numbers shown by 10. Why bother with all that? The happy outcome is that, instead of laboriously calculating logarithms for each number we want to graph on the regular scale at the bottom of Figure 7.1, we can just place those numbers themselves on this logarithmic scale. This is precisely what logarithmic graph paper does for us – and it tremendously speeds up graphing.

Figure 7.2. The numbers themselves placed on the “logarithmic scale” correspond to their logarithms placed on the regular scale.

CAUTION! Some students sort of logically, but mistakenly, think that numbers should be graphed on regular scale and their logarithms on the logarithmic scale. No! The reverse is the case. Graphing numbers on log scale and graphing their logarithms on regular scale leads exactly to the same image. The log graph just spares us from calculating each logarithm separately.

Fully logarithmic or “log-log” graph paper

Maybe the paper supplied to you looks like the one in Figure 7.3. It has 1 period in the horizontal direction and 2 periods in the vertical. Other graph papers may have only one period in both directions, or as many as 5 in one direction. Note that all 1-period squares are identical. Thus, if we should need more periods than the paper has, we can just paste together as many as we need. Further advice is given below, using Exercise 7.2 as a starting point.


Figure 7.3. Blank grid of fully logarithmic (“log-log”) graph paper, with 1 period in the horizontal direction and 2 periods in the vertical. It has “1 … 1 … 1” on both axes.

* Exercise 7.2 Use the N and C data in Table 5.1. Use graph paper supplied or available on the Internet.

Do NOT use the example above or try to construct your own graph paper, if you can avoid it. See the advice that follows this Exercise. a) Place the data points (N,C) on the log-log grid (NOT their logarithms!). b) Comment on the shape of the resulting pattern. c) Draw the best-fit line through the data points. d) Superimposed on the previous (not separately!), graph the logical model equation C = 42 years/N^2, using at least 3 values of N. Comment on the resulting shape. e) Draw in the line going from point (1, 100) to (10, 1). It has the slope -2, given that it drops by 2 periods on the y-scale while advancing by one period on the x-scale. Compare the slopes of your best-fit line and of C = 42 years/N^2 to this comparison slope. f) Compare to your graph in Exercise 2.3. Do they differ, apart from different magnification?


First look at how many periods from 1 to 10 you need. In Table 5.1, the number of parties ranges from 1 to 5, so one period will suffice. Duration ranges from 1 to 40 years, so we need two periods – 1 to 10 and 10 to 100. At the top period, expand the labels “1, 2, …1” to 10, 20, … 100, as shown in Figure 7.4. Now you are ready to graph the data points (N,C).
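If you want the computer to count the periods for you, a minimal Python sketch (my own illustration) does the bookkeeping; for these data it reports one period for the number of parties and two for cabinet duration.

    import math

    def periods_needed(smallest, largest):
        """How many 1-to-10 periods (powers of ten) the data range spans."""
        low = math.floor(math.log10(smallest))    # decade containing the smallest value
        high = math.ceil(math.log10(largest))     # decade containing the largest value
        return high - low

    print(periods_needed(1, 5))    # 1 period for N, the number of parties
    print(periods_needed(1, 40))   # 2 periods for C, cabinet duration in years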

Figure 7.4. Grid for log-log graph, with “1 … 1 …1” filled in as “1 … 10 …100” for the purposes of Exercise 7.2.


Slopes of straight lines on log-log paper

Measuring slopes of lines on log-log graphs (such as required in Exercise 7.2, part e) may be confusing at first. Forget about the scales shown! Take a ruler and measure the physical distances: by how much does the vertical position change when the horizontal position changes by a given amount? For the slope, divide these distances.
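The ruler method has a numerical counterpart: the slope of the line joining two points on log-log paper is the change in the logarithm of y divided by the change in the logarithm of x. A minimal Python sketch (my own illustration):

    import math

    def loglog_slope(x1, y1, x2, y2):
        """Slope of the straight line joining two points on a log-log graph."""
        return (math.log10(y2) - math.log10(y1)) / (math.log10(x2) - math.log10(x1))

    # The comparison line of Exercise 7.2e, from (1, 100) to (10, 1):
    print(loglog_slope(1, 100, 10, 1))                    # -2.0

    # Two points on the model C = 42 years/N^2, say N = 2 and N = 4:
    print(loglog_slope(2, 42 / 2 ** 2, 4, 42 / 4 ** 2))   # -2.0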

CAUTION: Be careful with log-log papers from the Internet. Sometimes the periods in the horizontal and vertical directions differ in length, so as to fill an entire page. This is illustrated in Figure 7.5. Here the x-scale has been pulled out to twice the length, compared to the y-scale. If we draw a line from bottom left to top right, it seems to have slope 1, but it really has slope 2. Why? When logx grows from 0 to 1, log y grows from 0 to 2. To get the real slope for any other line on this grid, we would also have to correct the apparent slope by the same factor 2. The periods on the two scales do not have to be equal, in principle. However, unequal scales distort the slopes of lines, which play an important role in Exercise 7.2. This may confuse you, if your experience with log-log graphs is still limited.

Figure 7.5. Risky grid for log-log graph: x-scale pulled out to twice the length.


Exercise 7.3
To graph population vs. area of countries in the world, one must reach from 21 square kilometers (Nauru) to 17 million square kilometers (Russia) on the area scale, and from 13,000 for Nauru to 1.3 billion for China on the population scale.
a) How many 1-to-10 periods do we need on the area axis? sba
b) How many 1-to-10 periods do we need on the population axis? hat
NOTE: We do NOT have to start with 1 square kilometer or 1 person.


Semilog graphs

On log-log graphs both axes are on logarithmic scales. But one can also use a log scale on only one axis, the other being on a regular scale. This is often called a “semilog” graph. Its grid is shown (imperfectly) in Figure 7.6. The numbers on the regular x-scale are of course arbitrary.

Figure 7.6. Grid for semilog graph y vs. x, 2 cycles on the y-axis.


* Exercise 7.4 Estonia’s national currency prior to adoption of the euro (2011) had 12 different coins and bank notes. They ranked as follows, by increasing values:

Rank    1     2     3     4     5   6   7   8    9    10   11    12
Value  0.05  0.10  0.20  0.50   1   2   5   10   25   50   100   500

a) Graph value on log scale and rank on regular scale. (We need graph paper that covers five 1-to-10 periods. If your paper has less, then cut and paste extra ones.)
b) Draw in the trend curve or the line that respects nearly all points. (Do NOT join the individual points with a wiggly curve!)


c) Notice and describe one blatantly irregular point.
d) Draw conclusions regarding regularities and irregularities in this pattern.
e) Make a list of coins and bank notes used in your country and repeat the exercise.
f) Compare the slopes of the two curves.

Why spend so much time on the cabinet duration data?

The model connecting cabinet duration to the number of parties well deserves one chapter, as an example of what the basic communication channels model c = n(n-1)/2 can lead to. But why has it been invoked repeatedly from Chapter 2 on? The reason is that it also supplies a convenient example to demonstrate three different ways to graph the same data. C vs. N on regular scales produced a curve, and curves are harder to handle, compared to straight lines. Then we showed that using logC vs. logN turns this curve into a straight line. This could be done in two ways, which lead to the same result: calculate logC and logN for each case separately and graph again on regular scales, or avoid such calculations and graph C and N themselves, but on logarithmic scales. This is not a universal recipe. Try graphing the currency data in Exercise 7.4 on log-log paper, and see what you get. Here the semilog paper straightens out the pattern – provided we place the ranks on the regular scale, not vice versa. Try graphing the cabinet duration data on semilog paper, and see what you get. Here the log-log paper straightens out the pattern. Sometimes neither works.

Regular, semilog and log-log graphs – when to use which?

When data include both very small and very large values, a logarithmic scale is the only way to tell apart the medium-sized countries from the tiny ones. But the number of parties does not vary that much. We still graphed it on a log scale. We did so because the logical model suggested that the expected curve would then turn into a straight line. How do we know then which way to graph, in general? Some guidelines will be given later on, but at times I have no idea. Then I graph the data in several ways, and sometimes a linear pattern appears for some way of graphing. Then it’s time to ask: What is the reason behind this regularity? This may be the starting point for trying to construct a logical model.
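When graphing software is available, it costs little to draw all three versions at once and see which, if any, straightens the pattern. A minimal Python/matplotlib sketch (my own illustration; the data points here are hypothetical pairs roughly following C = 42 years/N², not the book's Table 5.1):

    import matplotlib.pyplot as plt

    # Hypothetical (N, C) pairs, roughly following C = 42/N^2 -- for illustration only
    N = [1, 2, 3, 4, 5]
    C = [40, 11, 5, 2.5, 1.7]

    fig, axes = plt.subplots(1, 3, figsize=(10, 3))
    for ax, title in zip(axes, ["regular", "semilog", "log-log"]):
        ax.plot(N, C, "o")
        ax.set_title(title)
        ax.set_xlabel("N")
        ax.set_ylabel("C (years)")

    axes[1].set_yscale("log")    # semilog: log scale on y only
    axes[2].set_xscale("log")    # log-log: log scale on both axes
    axes[2].set_yscale("log")

    plt.tight_layout()
    plt.show()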


B. Some Basic Formats

8. Think Inside the Box – The Right Box

• In any data analysis we should look for ability to predict and for connections to a broader comparative context.
• Our equations must not predict absurdities, even under extreme circumstances, if we want to be taken seriously as scientists. Poorly done linear regression analysis often does lead to absurd predictions.
• Always graph the data – and more than the data! Show the entire conceptually allowed area plus logical anchor points.
• Before applying a statistical best fit to two variables, graph them.
• To be acceptable, a data fit as well as a logical model must result in a curve that joins the logical anchor points, if any exist.
• If anchor points and data points do not approximate a straight line, transform data prior to linear fit so as to get a straight line that does not pierce conceptual ceilings or floors.
• After any quantitative processing, look at the numerical values of parameters and ask what they tell us in a comparative context.

Up to now, we have mostly started by constructing a logical model and then looking for data to test it. Now we start with some data. How do we proceed so as to infer a logical framework? Often (though by no means always), we are advised to think inside the box – but it has to be the right one. Consider the following hypothetical data set of just 6 points (Table 8.1).

Table 8.1. Hypothetical relationship between satisfaction with the head of state (x) and satisfaction with the national assembly (y).
x   2.0   2.5   3.0   3.5   4.0   4.5    Mean: 3.25
y   0.2   0.2   0.9   1.1   2.4   3.6    Mean: 1.40

Always graph the data – and more than the data!

If you have been exposed to a statistics course, you might have the knee-jerk reaction to push the OLS (Ordinary Least Squares) regression button on your computer program. It would lead to a line roughly corresponding to y = -4.06+1.68x (R² = 0.9). This result would look highly satisfactory, if one merely goes by a high R-squared. (We’ll discuss later what this R² means – and does not mean.) But if the pattern is actually curved – as in our Exercise 2.1 – this would be a misleading fit. Some data patterns are horseshoe shaped, and there a linear fit would be utterly senseless. How do we know that a linear fit makes sense? Always graph the data! For the data above, however, graphing leads to a pretty straight pattern (Figure 8.1). Only a negligible curvature might be seen, so linear regression looks justified. (I eyeballed the equation above from this graph. A precise OLS procedure would add to formalism but little to information content.) If you let the computer draw the graph, it may even automatically add the best-fit line, show its equation, and draw a nice frame to box in the data points, as shown in Figure 8.1. This is the wrong box, however. We should think outside this box.

Figure 8.1. Graphing only the data, regressing linearly, and boxing in the area where data points occur. Dotted line: y=x.

Indeed, what does this graph plus regression equation tell us in a substantive way? We can tell that support for the assembly (y) increases as support for the president (x) increases, given that the slope b=1.68 is positive. Moreover, y increases faster than x, given that the slope b=1.68 is larger than 1. This is where analysis of such data often stops. What else should we look for?

In any data analysis we should look for ability to predict and for connections to a broader comparative context.

What does this “broader context” mean? Play around with the regression line, y=-4.06+1.68x. Ask, for instance, what would it predict for x=2? It would predict y=-0.70. A negative approval rating? Approval ratings usually start from zero. If this is the case here, then such prediction makes no sense. It goes against the following norm:

Our equations must not predict absurdities, if we want to be taken seriously as scientists.

Our equations should not do so even outside the empirical range of input variables. We should also ask: Why is the intercept around a=-4.06? What does this number tell us? Why is it negative? Is it large or small compared to the intercepts in some other data sets? Unless one asks such contextual questions, it is pretty pointless to calculate and report a precise regression equation that, moreover, predicts absurdities for some values of x. It is high time to graph more than just the data. Graph more than the data! This advice may sound nonsensical to those who reduce science to data processing. “What else but data is there to graph?” they may ask. But we have already graphed more. In Chapter 5, we graphed forbidden and allowed regions. We will build on that idea. But even before doing so, we might draw in the equality line, if it makes sense.

Graph the equality line, if possible

We might start by graphing the equality line (y=x), if equality can be defined. It certainly can, if x and y are percentages. Here it can, too, provided that both ratings are on the same scale. In contrast, if x is a country’s area and y its population, then no equality can be defined. By thinking of equality, we add a conceptual comparison line: equal support for both institutions. In Figure 8.1, it’s the dotted line, top left. Now we can see that support for assembly always falls short of support for president, even while it seems to be catching up at high values of x.


One may say that this was obvious even without graphing the equality line – but I have seen too many published graphs where this line was not drawn in and the “obvious” implication was missed. Would y ever catch up with x? To answer this question, it is time to graph even more, besides the data.

Graph the conceptually allowed area – this is the right box to think in

I should have told you right at the start about the scale on which people were asked to rank the president and the assembly. It matters whether the given data refer to variables that can range from 0 to 5 or from 0 to 10 – or worse, from 1 to 5. The range can even differ on the two axes. Published articles all too often hide or omit this all-important information. Suppose the conceptually allowed range is 0 to 5 on both axes. This is the box (Figure 8.2) that has a logical meaning. Now the same data take on a different appearance: The curvature somehow looks more pronounced. We are also motivated to ask: What are the extreme possibilities?

Figure 8.2. The limits of the conceptually allowed area and the logical anchor points. The dashed line is the regression line.
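A graph like Figure 8.2 can also be drawn in software. The following minimal Python/matplotlib sketch (my own illustration) plots the data of Table 8.1 together with the allowed box, the equality line, the anchor points, and the eyeballed straight-line fit from the text:

    import matplotlib.pyplot as plt

    # Data of Table 8.1
    x = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
    y = [0.2, 0.2, 0.9, 1.1, 2.4, 3.6]

    fig, ax = plt.subplots()
    ax.set_xlim(0, 5)    # conceptually allowed range, 0 to 5, on both axes
    ax.set_ylim(0, 5)

    ax.plot(x, y, "o", label="data")
    ax.plot([0, 5], [0, 5], ":", label="equality line y = x")
    ax.plot([0, 5], [0, 5], "^", linestyle="none", label="anchor points (0,0) and (5,5)")

    # Eyeballed straight-line fit from the text; it dips below the floor y = 0
    # for x below about 2.4, i.e. it leaves the conceptually allowed box.
    ax.plot([0, 5], [-4.06, -4.06 + 1.68 * 5], "--", label="linear fit y = -4.06 + 1.68x")

    ax.set_xlabel("satisfaction with head of state (x)")
    ax.set_ylabel("satisfaction with national assembly (y)")
    ax.legend()
    plt.show()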


Our data suggest that the assembly’s ratings do not surpass the president’s. (This is most often the case, all over the world.) These ratings also cannot fall below 0. In the absence of any other information, what could the assembly’s rating be when even the president’s rating drops as low as 0? The only answer that agrees with the considerations above is that the assembly also must have a zero rating. Similarly, when even the assembly is rated a full 5, then the information above does not allow the president to have any less than 5. These points, (0,0) and (5,5), are logical anchor points for such data. They are indicated in Figure 8.2 by triangular symbols reminiscent of the letter A for “Anchor”. Now we come to a corollary of the previous anti-absurdity norm:

A data fit is meaningful only when it remains within the conceptually allowed area and includes the logical anchor points.

The linear data fit (dashed line in Figure 8.2) violates logic. For presidential ratings below 2.4, it would predict negative ratings for the assembly. For assembly ratings above 4.4, it would suggest presidential ratings above 5. We must use a different format.

To be acceptable, a logical model or data fit must result in a smooth curve that joins the anchor points and passes close to most data points.

We should keep the equation of this curve as simple as possible. Figure 8.3 shows such a curve. How was it obtained?

The simplest curve joining the anchor points

The simplest algebraic format that takes us smoothly from (0,0) to (1,1) is the “fixed exponent” equation

Y = X^k.

You can check that, whenever X=0, we are bound to have Y=0, regardless of the value of the exponent k. Similarly, when X=1 then Y=1 – and vice versa. This is the most manageable format for two quantities with clear lower and upper limits at 0 and 1. Physicists often call such an equation a power equation, saying that here “X is to the power k”. But this term confuses and angers some political scientists who deal with political power. So I use “fixed exponent equation”.


Figure 8.3. Normalizing to range 0 to 1, and fitting with Y = X^3.6.

In the present case the variables’ range is not from 0 to 1 but from 0 to 5. This would make even the simplest possible form more complex: y = 5(x/5)^k. It is less confusing to shift to different variables which do range from 0 to 1: Y = y/5 and X = x/5. Figure 8.3 shows these “normalized” scales, and Table 8.2 tabulates the normalized data. It can be seen that Y = X^k yields a fair fit to the data (plus an exact fit to the anchor points) when k is set at 3.6. Indeed, the deviations of Y = X^3.6 from the actual data fluctuate up or down fairly randomly. But how was the value k=3.6 found? Logarithms again enter.

Table 8.2. Original and normalized data, and fit with Y = X^k.
x          0    2.0    2.5    3.0    3.5    4.0    4.5    5
y          0    0.2    0.2    0.9    1.1    2.4    3.6    5
X=x/5      0    0.40   0.50   0.60   0.70   0.80   0.90   1
Y=y/5      0    0.04   0.04   0.18   0.22   0.48   0.72   1
Y=X^3.6    0    0.037  0.08   0.16   0.28   0.45   0.68   1
Deviation  0    −      +      −      +      −      −      0


Recall that log(X^k) = k·logX – when a number is raised to the power k, its logarithm is multiplied by k. So the equation Y = X^k implies logY = k·logX, and hence k = logY/logX when Y = X^k.

Pick a location on the graph that approximates the central trend, such as (0.60, 0.16). (It need not be an actual data point.) Then k = log0.16/log0.60 = (-0.796)/(-0.222) = 3.587 ≈ 3.6. On my pocket calculator the sequence is: 0.16 “LOG” “÷” 0.60 “LOG” “=”. On some others, it is “LOG” 0.16 “÷” “LOG” 0.60 “=”. If you still feel uneasy with logarithms, see Chapter 13. CAUTION: Some students think they can replace log0.16/log0.60 with log16/log60. This is not so. Log0.16/log0.60 = 3.587, while log16/log60 = 0.677 – quite a difference! Once we have k=3.6, we can calculate Y = X^3.6 for any value of X. E.g., for X=0.40, on my pocket calculator 0.40 “y^x” 3.6 “=” yields 0.0369 ≈ 0.037, as entered into Table 8.2. If we want a more precise fit to the data points, we can graph logY against logX and determine the best-fit line that passes through the point (0,0). Given that logY = k·logX, the slope of this line is k. For even more precision, we can run a linear regression of logY on logX on the computer. But how much precision do we really need? Even a coarse fit that respects the conceptual anchor points is vastly preferable to a 3-decimal linear fit that predicts absurdities. We need enough precision to compare these data to some other data of a similar type. What does this mean? We could compare the assembly and presidential ratings at a different time, or in a different country. Would the points still fall on the curve for k=3.6, or would a different value of k give a better fit? Only such comparisons lend substantive meaning to the numerical value k=3.6, by placing it in a wider context. The implications will become apparent in the next few chapters.
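The same arithmetic can be handed to a computer. A minimal Python sketch (my own illustration) recovers k from the central-trend point and then tabulates Y = X^k, reproducing the last row of Table 8.2:

    import math

    # Exponent from a point near the central trend, here (X, Y) = (0.60, 0.16)
    k = math.log10(0.16) / math.log10(0.60)
    print(round(k, 3))    # about 3.587, rounded to 3.6 in the text

    # With k = 3.6, tabulate Y = X^k for the normalized x-values of Table 8.2
    for X in [0, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1]:
        print(X, round(X ** 3.6, 3))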

* Exercise 8.1
Suppose people have been asked to express their degree of trust in their country’s legal system (L, rated on a scale from 0 to 1) and in their country’s representative assembly (A, also rated on a scale from 0 to 1). We wonder how L and A might be interrelated.
a) Graph A vs. L. Label the locations 0, 0.5 and 1 on both scales. Mark off the areas where data points could not possibly be – the forbidden regions.
b) When L=0, what can one expect of A? When A=0, what can one expect of L? Keep it as simple as possible. Mark the resulting point(s) on the graph.
c) When L=1, what can one expect of A? When A=1, what can one expect of L? Keep it as simple as possible. Mark the resulting point(s) on the graph.
d) What is the simplest way to join these points? Draw it on the graph, and give its equation. Let’s call it the simplest model allowed by the anchor points.
e) Suppose we are given two data points: (L=0.45, A=0.55) and (0.55, 0.50). Show them on the graph. Is it likely that our simplest model holds? Why, or why not?
f) Forget about those points. Instead, we are given three data points: (L=0.45, A=0.35), (0.60, 0.45) and (0.70, 0.60). Show them on the same graph, using symbols different from the previous. (E.g., if you previously used small circles, then now use crosses.) Is it likely that our simplest model holds? Why, or why not?
g) Pass a smooth curve through the points in part (f) – a smooth curve that predicts A at any allowed value of L, and vice versa. What is the general form of the corresponding equation?
h) What is the specific equation that best fits the data points in part (g), i.e., where constants have been given numerical values?
i) Assuming that this shape holds, how would you express in words trust in the legal system as compared to trust in the assembly?

Support for Democrats in US states: Problem

Having worked through the previous example, you should now be in a position to tackle the following. Figure 8.4 shows the Democratic percentages of votes in various US states in the presidential election of 2000 compared to what they were in 1996, as reproduced from Johnston, Hagen and Jamieson (2004: 50). The illegibly dark label at the line shown reads “Vote2000 = -11.4 + 1.1*Vote1996”. It represents the OLS (Ordinary Least Squares) regression line, which is one way to fit a line to these data points. Several features can be added that might add to our understanding of the changes from 1996 to 2000.

Before you read any further, do the following, and write it down.
a) List the features of interest that are missing.
b) Add these missing parts to the graph, as precisely as possible.
c) Try to find what the graph then tells you.

Try to do it on your own, before looking at the solution. Think in terms of allowed areas, anchor points, continuity, equality lines, baselines, and simplest curves joining anchor points. This is not to say that all of them enter here in a useful way.

Figure 8.4. The starting point: Data and regression line, as shown in Johnston, Hagen and Jamieson (2004: 50).

Support for Democrats in US states: Solution

Let x stand for “1996 Democratic percentage” and y for “2000 Democratic percentage”. For both, the conceptually allowed range is from 0 to 100. Draw it in, as precisely as you can. How can we do it? Take a piece of paper with a straight edge. Mark on it the equal distances 30, 40, 50 … on the x-axis. Move it first so as to reach 0 on the left and then 100 on the right. Mark those spots. The graph given makes us do so at the level y=20. Repeat it at the top level of the graph. Use these two points to draw the vertical line x=0. Do the same at x=100. Then do the same for the y-axis. (Watch out: the scale is not quite the same for y as it is for x.)


You may find that you do not have enough space around the original graph to fit in the entire allowed region. If so, copy the graph, reducing it. Or tape extra paper on all sides. Just don’t say: “It can’t be done,” for technical reasons. You are in charge, not technology. Don’t be afraid to use less than modern means, if they serve the scientific purpose better. Now draw in the equality line, joining (0,0) and (100,100). Check that it does pass through the (30,30) and (50,50) points visible in the original graph. This is a natural comparison line, the line of no change from one election to another. It becomes visible that support for Democrats

1) decreased, from 1996 to 2000, and 2) not a single state went against this trend.

What about logical anchor points? If a state existed where support for Democrats already was 0 % in 1996, it would not be expected to buck the trend and go up in 2000. Hence (0,0) is an anchor point. Also, if a state existed where support for Democrats still was 100 % in 2000, it would not be expected to buck the trend and be less than that in 1996. Hence (100,100) is also a logical anchor point. The simplest curve that joins these anchor points again is Y = X^k, with X=x/100 and Y=y/100. At X=0.50 we have approximately Y=0.44, in agreement with the regression line shown in Figure 8.4. Hence k = log0.44/log0.50 = 1.18, so that the resulting curve is Y = X^1.18. Some points on this curve are shown below:

X   0   .1   .3   .5   .7   .9   1
Y   0   .07  .24  .44  .66  .88  1

They enable us to draw in the approximate curve. Figure 8.5 shows these additions. The graph might look a bit nicer when done on a computer, but it might be less precise, even if we could scan in the original graph. Mules and computer programs can be stubborn. It’s better to have it your way and correct than their way – and incorrect. The curve Y = X^1.18 that respects the anchor points almost coincides with the linear regression line for y above 50 %. It differs appreciably for low values of y. What have we learned that wasn’t evident from the original graph? It may seem from Figure 8.4 that states varied fairly widely in their support for Democrats. In contrast, Figure 8.5 pins down that this variation was rather modest when compared to the conceptually possible range. We can see that no state bucked the trend away from the Democrats. Compared to the regression line, the average trend is expressed in a way that does not predict absurdities at extreme values of x or y. Johnston, Hagen and Jamieson (2004) explicitly excluded Washington, DC from their graph, because its lopsidedly Democratic vote in both years was outside the usual range. Yet, it would agree with Y = X^1.18. A major payoff is that we express information in a more compact form. Indeed, the regression equation involves two numerical values (intercept -11.4 and slope 1.1), while Y = X^1.18 makes do with a single one (exponent 1.18). This would be extremely important, if we continued to study change in support during other periods. We can easily compare changes in the single exponent k over many elections. In contrast, the equations of regression lines would be hard to compare systematically because we’d have to keep track of both intercept and slope – and a single outlier could alter the intercept, even while it is just noise from the viewpoint of systematic comparison.

Figure 8.5. Allowed region, anchor points, and model based on continuity between conceptual anchor points, fitted to data.


Exercise 8.2
In Tajikistan in 1926, male literacy was 6% and female literacy 1% (Mickiewicz 1973: 139). By 1939, male literacy was reported as 87% and female literacy 77%. In 1959, male literacy was 98%. Calculate the best estimate for female literacy in 1959. Think inside the box. zuz.nul, tsõudhrõ

9. Capitalism and Democracy in a Box

• The world isn’t flat, and all relationships are not linear.
• If the equality line makes sense on a graph y vs. x, enter it as a reference line.
• If both x and y can extend from 0 to a conceptual limit (such as 100), normalize them to go from 0 to 1.
• If X=0 and Y=0 go together conceptually, and so do X=1 and Y=1, then Y = X^k is the simplest format to be tested.
• If Y = X^k does not fit, try (1-Y) = (1-X)^k next.
• Boxes with three conceptual anchor points sometimes occur, leading to the format Y/(1-Y) = [X/(1-X)]^k.

This chapter uses the previous methodology, so as to develop practical ability to handle it. We start with a published graph on democracy and capitalism, as an example, and develop it further. A slightly more complex model is needed than the earlier one. Finally, we address a case with 3 anchor points.

Support for democracy and capitalism: How can we get more out of this graph?

Figure 9.1 is reproduced from Citizens, Democracy, and Markets Around the Pacific Rim (Dalton and Shin 2006: 250). It graphs the level of popular support for capitalism (y, in percent) against the level of support for democracy (x). What can we see in this graph? We see that y tends to increase with increasing x, but with some scatter. One could comment on the individual countries, but let us focus on the general pattern. What else can we learn from it? A knee-jerk reaction might again be to pass the best-fit line through the cloud. This line may roughly pass through Korea and slightly below Vietnam. One could use this line for prediction of democracy/capitalism relations in countries not shown in this graph. But we would run into the same problem as in the previous chapter. If support for democracy were less than 13 %, the linear fit would predict support for capitalism lower than 0 percent! Oh well, one might say, be realistic – support for democracy never drops that low. Science, however, often proceeds beyond what is being considered realistic. It also asks what would happen under extreme circumstances. Recall the basic norm: a conceptual model must not predict absurdities even under extreme circumstances. For every conceivable level of support for democracy a proper model must predict a non-absurd level of support for capitalism – and vice versa. Could we say that the best-fit line applies only down to about 13 %, and below that level support for capitalism is expected to be zero? This would introduce a kink in the predicted average pattern, and most physical and social relationships are smooth. In sum, this is about all we can do with the best-fit line approach, apart from adding a measure of scatter, like R-squared, which may be around 0.5. For the given level of support for democracy, China looks low on support for capitalism, while the Philippines looks high. What else can we do? Try to do it on your own, following the approach introduced in the previous chapter. Only then continue reading.

Figure 9.1. The starting point: Data alone, as shown in Dalton and Shin (2006: 250).

Note: SOME SCANNED-IN FIGURES MAY NOT COPY IN NORTH AMERICA.


Expanding on the democracy-capitalism box

This graph does present the data within a box – but it’s the wrong box. Its borders have no conceptual meaning. They are like a frame around a picture, not part of the picture itself. This box could be appreciably wider or slightly smaller, as long as it does not infringe on the picture itself. We need a box that is part of the picture. This means we demarcate the allowed region where data points can conceivably occur. Take a straightedge and draw in the horizontal lines at support for capitalism 0 % and 100 %, and the vertical lines at 0 and 100 % support for democracy. The new lines are parallel to the ones already shown. They produce a square inside the earlier square – and this square has conceptual meaning. In contrast, the square shown in Figure 9.1 is mere decoration, a more austere version of the mermaids embellishing the margins of ancient maps. This square is the more dangerous because it is so close to the box separating the forbidden and allowed regions, so that we may mistake it for the allowed region. Many computer programs draw this wrong box in automatically – one more reason to do our graphs by hand, until we become sufficiently proficient to be able to control the canned programs.

Figure 9.2. Data plus allowed region and equality line.


Next, draw in the equality line, y=x. This doesn’t always make sense, but here it does, because both axes have the same units (or quasi-units) – percent. These two additions are shown in Figure 9.2. At once, we see more clearly that all the data points are below the equality line. It would seem that people in all countries tend to voice more support for democracy than for capitalism. The US, Singapore and Japan now look closer to the right border than they did on the original graph. We realize they are pretty close to the maximum possible. These are useful insights. Next, consider conceptual anchor points. If support for capitalism is 0 % in a country, what level of support for democracy might we expect? It would be hard to offer anything but 0 %. We cannot have less than 0 %, and for proposing something more than 0 % we would have to bend away heavily from the observed pattern. At the other extreme, what level of support for democracy might we expect when the support for capitalism is 100 %? Again it would be hard to settle on anything but 100 %. Our simplest assumption is that the curve y vs. x starts at the point (0,0) and ends at (100,100). This means that for any value of x that can conceptually occur we have a value of y that can conceptually occur – and vice versa. Our predictions may turn out to be wrong, but at least we do have a prediction for any value of x or y, a prediction we can subject to verification. This is better than saying “We cannot know”. Could we offer other patterns? Maybe at 90 % support for democracy, support for capitalism would suddenly shoot up to 100 %. But why would it be at 90 %, rather than at 95 %? Or could it be that people never become 100 % supportive of capitalism, even at 100 % support for democracy? But would they stop at 80 % or at 90 %? A basic rule in science is: Keep things as simple as you can. Do not assume complexities unless evidence makes you do so. There are enough real complexities to deal with, without the need to introduce imaginary ones.

Fitting with fixed exponent function Y = X^k

Now determine the simplest curve that respects the anchor points and the data. The simplest curve to join (0,0) and (100,100) would be the equality line y=x. But it clearly does not fit the data. What is the next simplest option? It’s a curve that initially keeps below the line y=x and bends up to join it at (100,100). What is a simple curve? It’s a curve with the simplest possible equation, involving the least number of constants. For curves starting at (0,0) and ending at (1,1), it’s the fixed exponent function or power function, Y = X^k. Here the exponent k expresses the deviation from the equality line Y=X. For k=1, Y = X^k becomes the equality line Y=X. The more the value of k increases above 1, the more the curve bends downwards. For k<1, it bends upwards.

Exercise 9.1
So as to get some feel for what various values of k in Y = X^k imply, graph the curves Y = X^1, Y = X^2, and Y = X^0.5, superimposed, ranging from X=0 to X=1. To sketch these curves, it will suffice to calculate Y at X = 0, 0.25, 0.50, 0.75 and 1. Do not forget to think within the box – show this box!

In the graph on hand, however, the scales go from 0 to 100 rather than 0 to 1. We have to change scales so that the previous 100 becomes 1. How do we do it? We have to switch from x and y to X=x/100 and Y=y/100. In other words, we shift from percent shares (out of 100) to fractional shares (out of 1). Why do we do it? If we kept the percent, Y = X^k would correspond to y = 100(x/100)^k. Such a long expression would be more confusing than it’s worth.

Why is Y = X^k simpler than y = a+bx?

The fixed exponent format may look complex. Would it not be simpler to avoid curves and use two straight-line segments? The first one could run from (0,0) roughly to the data point for Canada, at about x=79%, y=58%, and the second one from Canada to (100,100). Let us consider how to describe it. We’d have to write it so that the ranges are shown:

y = a + bx for 0 ≤ x ≤ 79;   y = c + dx for 79 ≤ x ≤ 100.

Count the letter spaces. The two straight-line expressions take some 26 letter spaces. That’s messy, compared to 4 letter spaces in Y = X^k. We better get used to fixed exponent functions, X^k. Two straight-line segments would be considered simpler than a single curve only if one takes as an article of faith that all relationships in the world are or should be linear. But they aren’t, and for good conceptual reasons, to which we’ll come. Belief in straight-line relationships is even more simplistic than the belief of Ptolemaic astronomers that all heavenly bodies must follow circular paths. The world isn’t flat, and all relationships are not linear.


Which value of k would best fit the data in Figure 9.2? Recall the previous chapter. Let us try fitting to Korea (X=.78, Y=.58), as a point fairly central to the data cloud. This leads to k = log.58/log.78 = 2.19. Some points on the curve Y = X^2.19 have been calculated, as explained in the previous chapter:

X   0   .2   .4   .6   .8   1
Y   0   .03  .13  .33  .61  1

Figure 9.3 complements the previous one by adding the anchor points (shown as triangle symbols) and the curve Y = X^2.19 – and… this curve visibly isn’t a good fit at low values of X! Indeed, only China is close to the curve, while Vietnam, Philippines and Indonesia are all much higher. What could we do?

Figure 9.3. Model based on continuity between conceptual anchor points and fitted to data. (NB! The label “1.25” should read “0.57” – I have not yet changed this scanned-in graph. Also the scales must be read to go from 0 to 1.)

Fitting with fixed exponent function (1-Y) = (1-X)^k

Note that we based our variables on support for democracy and capitalism. We could as well reverse the axes and consider the lack of support. This “opposition to democracy/capitalism” would correspond to the variables 1-X and 1-Y, respectively. Instead of Y = X^k, the simplest format now would be

(1-Y) = (1-X)^k and hence Y = 1-(1-X)^k.

Fitting again for Korea leads to k = log(1-.58)/log(1-.78) = 0.57. Some points on the curve Y = 1-(1-X)^0.57 have been calculated:

X   0   .2   .4   .6   .8   1
Y   0   .12  .25  .41  .60  1

This curve, too, is shown in Figure 9.3. It visibly offers a better balance between China on the one hand and the Philippines and Indonesia on the other. So this is the curve we might take as a guide for predicting (i.e., offering our best guesses) for further countries. We are now close to the best quantitative description one could offer while respecting the anchor points. We could improve on it slightly by taking into account not just Korea but all data points and running a linear regression. But we first have to transform the data so that a linear form is expected (recall the previous chapter). We have to take the logarithms on both sides of (1-Y) = (1-X)^k:

log(1-Y) = k·log(1-X).

This equation means that log(1-Y) is a linear function of log(1-X). Hence, the regression must not be “Y on X” but “log(1-Y) on log(1-X)”. Moreover, this line must go through the point X=0, Y=0. I will not go into more details.
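For readers who prefer to check such point-based fits by machine, a minimal Python sketch (my own illustration) reproduces the two Korea-based fits and the tabulated curve values above:

    import math

    X_korea, Y_korea = 0.78, 0.58

    # Fit Y = X^k through the Korea point
    k_support = math.log10(Y_korea) / math.log10(X_korea)
    # Fit (1-Y) = (1-X)^k through the same point
    k_opposition = math.log10(1 - Y_korea) / math.log10(1 - X_korea)
    print(round(k_support, 2), round(k_opposition, 2))    # about 2.19 and 0.57

    # Tabulate both candidate curves
    for X in [0, 0.2, 0.4, 0.6, 0.8, 1]:
        y_support = X ** k_support                  # Y = X^2.19
        y_opposition = 1 - (1 - X) ** k_opposition  # Y = 1 - (1-X)^0.57
        print(X, round(y_support, 2), round(y_opposition, 2))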

Exercise 9.2 Literacy percentages for males and females in three developing countries are (M=45, F=35), (60, 45) and (70, 60). Build a simple model that avoids absurdities. Show it graphically and as an explicit equation. Draw conclusions.


* Exercise 9.3
The data in Figure 9.1 are roughly as follows.

Country        Support for   Support for   Opposition to     Opposition to
               Democracy     Capitalism    Democracy  X’     Capitalism  Y’
Canada             81            57
China              60            29
Indonesia          54            48
Japan              82            52
Korea, S.          79            57
Philippines        42            36
Singapore          83            58
US                 80            63
Vietnam            60            44

a) Calculate Opposition to Democracy / Capitalism on a scale from 0 to 1 (NOT %!), and enter these numbers in the table.
b) Call Opposition to Capitalism Y’ and Opposition to Democracy X’. (This reverses the previous labels: what was called 1-Y now becomes Y’.) Graph the data points from (a).
c) Which form of equation should we use, for best fit?
d) Calculate the constant in this equation when X’=0.20 yields Y’=0.40.
e) Insert this constant value into the general equation, and calculate the values of Y’ when X’ = 0.25, 0.5 and 0.75.
f) Mark these points on the graph and pass a smooth curve through them.
g) To what extent does this curve agree with the data points? Could some other smooth curve fit much better?

What is more basic: support or opposition?

We observed that fitting Y = X^k didn’t seem to work out. To fit the data, we had to think in terms of opposition, instead of support. What does this imply? Asking this question makes us discover that we implicitly started out with the assumption that the supports of democracy and capitalism are features that support each other. If so, then lack of support is a leftover. Data suggest that, to the contrary, opposition to democracy and capitalism may support each other. If so, then support for democracy and capitalism would become a residual, a lack of opposition. The question may have to be set up in terms of avoidance rather than support. However, it may well be that this lack of fit with Y = X^k was just a fluke due to having too few cases. Further data points from around the world might crowd more near China than around the Philippines, making us shift back to Y = X^k and the primacy of support. But whatever pattern might emerge when more cases are added, the question would remain: Why does k have the value it has? Maybe we just have to accept it as an empirical fact, just like we accepted k = 42 years in C = 42 years/N². But maybe there are further logical reasons why k cannot be much smaller or larger than roughly 2 or 1/2. We should keep this “why” in mind.

Logical formats and logical models

Do we truly have here a logical model? We made use of allowed regions, conceptual anchor points and requirements of simplicity and avoidance of absurdity, so we were advancing toward logical model building. But we did so only from a formal viewpoint. We have an intelligent description of how views on democracy and capitalism seem to be related – a description that fits not only the data but also some logical constraints. What we still need to ask is: Why? Why is it that the relationship is what it seems to be? Why should it be so? We are not scientists if we shrug off this question, being satisfied with describing what is. We may not find an answer, at least not right now. But it is better to pose questions without answers than to ask no questions. Questions posed may eventually receive answers. Questions which are not posed will never be answered. At the moment, we certainly have a logical format for data fitting, in contrast to the illogical linear format, which predicts absurdities for some values of the variables. This is an indispensable step toward a model that says more about the “why?” question. Whether to call Y = X^k a logical format or a model depends on how much we read into the term “model”.

* Exercise 9.4
The graph below is reproduced from Norris and Inglehart (2003: 27). The Gender Equality scale runs from 0 to 100. The Approval of Homosexuality scale runs from 0 to 10. The line shown is visibly the OLS regression line, Gender Equality regressed on Approval of Homosexuality.


Make an exact copy of the graph. If you can’t use a copy machine, tape this graph on a well-lighted window, tape blank paper on top of it, and trace carefully all the points. Ignore country names, the distinction between the Islamic and Western countries, and the few markedly deviant countries. In other words, focus on the main data cloud. Up to now, I have presented lots of hoops for you to jump through. Now it’s time for you to set the hoops for yourself. Do with it all you can, on the basis of what you have learned in this book. Work on the graph and show it. Then comment on data, on data analysis previously shown, and on your additions. (CAUTION: students tend merely to comment, without showing their reworked graph. This is like presenting the text of this chapter but omitting Figures 9.2 and 9.3.)

How can we know that data fit Y = X^k?

If we suspect that some data might follow the format Y = X^k, how can we verify this guess? If they do, then logY = k·logX. Graph logY against logX, and see if we get a straight line. If we do, its slope is k. Caution is needed when some values are close to 0. Similarly, if we suspect that some data might follow the format (1-Y) = (1-X)^k, graph log(1-Y) against log(1-X), and see if we get a straight line. If we do, its slope is k.
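As a computational sketch of this check (Python, my own illustration): take logarithms, force the fitted line through the origin, and read off the slope as k. The data below are the normalized values of Table 8.2.

    import math

    X = [0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
    Y = [0.04, 0.04, 0.18, 0.22, 0.48, 0.72]

    logX = [math.log10(v) for v in X]
    logY = [math.log10(v) for v in Y]

    # Least-squares slope of logY on logX for a line forced through the origin:
    # k = sum(logX * logY) / sum(logX^2)
    k = sum(a * b for a, b in zip(logX, logY)) / sum(a * a for a in logX)
    print(round(k, 2))    # roughly 3.8 for these data, close to the eyeballed 3.6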


A box with three anchor points: Seats and votes

The form Y = X^k would be our first guess when we have only two anchor points: (0,0) and (1,1). But there are situations where another logical anchor point enters, at (0.5, 0.5). This is the case when graphing seat shares against vote shares.

Figure 9.4. The simplest curve through three anchor points (0,0), (0.5,0.5) and (1,1): Y/(1-Y) = [X/(1-X)]^k. The curve shown has k≈2, meaning that the slope is 2 at X=0.5.


Certainly 0 votes should earn 0 seats, and all votes should earn all the seats. In between, some electoral systems give a heavy bonus to the largest party. If only two parties are running, vote shares 60-40 might lead to seat shares 78-22. But if both receive one-half of the votes, then we’d expect both of them to earn one-half of the seats. The result is a “drawn-out S” shaped pattern (Figure 9.4). When three conceptual anchor points impose themselves – (0,0), (0.5,0.5) and (1,1) – the simplest family of curves passing through these anchor points is

Y/(1-Y) = [X/(1-X)]^k.

This is the form symmetric in Y and X. For calculating Y from X, it can be transformed: Y = X^k/[X^k + (1-X)^k].

Here the parameter k can take any positive value. Data that fit such an equation will be presented later (Chapter 21). Can we straighten out such a drawn-out S into a straight line? We can. Simply take the logarithms on both sides:

log[Y/(1-Y)] = k·log[X/(1-X)].

If we suspect that some data might follow this format, graph log[Y/(1-Y)] against log[X/(1-X)], and see if we get a straight line. If we do, its slope is k.
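A small computational sketch (Python, my own illustration) of the three-anchor-point format: compute Y from X for an illustrative k, and check that the transformation above is indeed linear with slope k. The value k = 2 is roughly the one shown in Figure 9.4.

    import math

    def seat_share(X, k):
        """Y = X^k / [X^k + (1-X)^k]: the curve through (0,0), (0.5,0.5) and (1,1)."""
        return X ** k / (X ** k + (1 - X) ** k)

    k = 2    # illustrative value
    for X in [0.1, 0.3, 0.6, 0.7, 0.9]:   # X = 0.5 is skipped: its log-odds are 0
        Y = seat_share(X, k)
        ratio = math.log10(Y / (1 - Y)) / math.log10(X / (1 - X))
        print(round(X, 1), round(Y, 2), round(ratio, 2))    # ratio comes out as k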

10. Science Means Connections Among Connections: Interlocking Relationships

• Science aims not only at connections among facts but also connections among connections.
• Interlocking equations are more powerful than isolated ones.
• Parsimonious models with a single parameter, such as Y = X^k, pack information more compactly than two-parameter models such as the linear Y = a+bX. Consequently, they have more connective power.
• Algebraically formulated models do not ask which came first, the chicken or the egg – they just go together.
• The distinction between “independent” or “dependent” variables rarely matters – the chicken and the egg are interdependent. Either could be an input or an output, depending on the issue on hand.

Science is more than just learning facts. Science deals with making connections among facts. Empirical connections are a start, but logically supported connections are much more powerful. Thus, C = 42 years/N², with a logically supported exponent value 2, is stronger than an empirical best fit. But science aims at even more than connections among facts. Science aims at making connections among connections (as was briefly pointed out in Chapter 4). These connections can be of different types.

Interlocking equations

One example is the sequence T → n → s1 → N → C. From the number of seats, T, we can infer the number of seat-winning parties: n = T^(1/2) (Chapter 1). From there, we can infer the relative seat share of the largest party: s1 = 1/n^(1/2) (Chapter 4). Further logical steps, not presented here, lead from s1 to the effective number of parties, N. From there C = 42 years/N² leads to cabinet duration. Many more than two quantities are connected through a string of equations. In such a string, any of the quantities can be treated as the input. For instance, for a given largest seat share s1, one can calculate the resulting N and C, as well as the number of assembly parties, n, from which such a largest share is likely to result. In principle, any quantity can be used as input, and everything else results. In practice, large random errors may accumulate. Such strings are rare in physics. Rather, the basic building block is a three-factor relationship. In electricity, V=IR relates voltage V, current I, and resistance R. This equation is equivalent to I=V/R and to R=V/I: any factor can be calculated from the two others. It’s like a triangle of factors. Change the value of any one factor, and the two others must adjust. But we also have another triangle, P=IV, introducing electric power P. The two triangles together lead to a third: P = I²R. These triangular relationships have one side in common, meaning two common factors. Further such triangles connect with them only by one corner, meaning one common factor: charge q=It, where t is time, and electric field E=V/r, where r is distance. And on we go to the force between two charges q and q’: F = kqq’/r², where k is a universal constant. This k is somewhat analogous to k in C = 42 years/N² in that both are empirically determined. The equation F = kqq’/r² is a 4-cornered pyramid rather than a triangle, as it connects 4 variables. Never mind the details of this example. The point is that there are connections among factors, and connections among such connections. In principle, similar interlocking relationships are conceivable in social sciences. We just have to be on the lookout. Remember the invisible gorilla (Chapter 3): We do not see what we are not trained to look for, even when it is glaring at us.

Connections between constant values in relationships of similar form

A very different example of connectedness is offered next. Suppose two quantities, A and C, are both connected to a third one, P, through equations of form Y = X^k. This means that we have A = P^k and C = P^h, where k and h are different constants. What can we say about the relationship between A and C? Make it more concrete by returning to the example in Chapter 8, which connected the supports for the president (P) and the assembly (A). Suppose that support for the cabinet (C) also was investigated. In Chapter 8, the relationship between the supports for president and assembly roughly fits the curve A = P^3.6. If we should find that the relationship is C = P^2.1 between supports for president and cabinet, then what can we say about the relationship between supports for cabinet and assembly (C and A)?


Take A = P^3.6, which expresses A in terms of P. How can we transform it so as to express P in terms of A? Use the basic rule of algebra: We can do anything, as long as we do it on both sides of the equation. To get rid of the exponent 3.6 for P, raise both sides of the equation to the exponent 1/3.6: A^(1/3.6) = (P^3.6)^(1/3.6). The exponents for P multiply out to 1. Thus P = A^(1/3.6). Now plug this value of P into C = P^2.1. We get C = (A^(1/3.6))^2.1, hence C = A^(2.1/3.6) = A^0.58. Two conclusions: 1) this relationship, too, is of form Y = X^k; and 2) the constant is obtained by dividing one of the previous constants by the other. We can repeat this example in more abstract terms: If Y = X^k and Z = X^h, then X = Y^(1/k) and hence Z = Y^(h/k).

Y = X^k and Z = X^h  ⟹  Z = Y^(h/k) and Y = Z^(k/h).

CAUTION: Do not blindly plug numbers into the latter equations. Try to understand where they come from, and consider the situation on hand. You may have Y = X^k and Z = Y^h, and then Z = X^(kh).
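A numerical check of this connection between connections, as a minimal Python sketch (my own illustration; 3.6 and 2.1 are the hypothetical exponents used in the text):

    # Suppose A = P^3.6 (assembly vs. president) and C = P^2.1 (cabinet vs. president).
    # Then C = A^(2.1/3.6), without ever graphing C against A.
    k_AP = 3.6
    k_CP = 2.1
    k_CA = k_CP / k_AP
    print(round(k_CA, 2))    # about 0.58

    # Check with one concrete presidential support level P (any value in 0..1 will do)
    P = 0.7
    A = P ** k_AP
    C_direct = P ** k_CP
    C_via_A = A ** k_CA
    print(round(C_direct, 4), round(C_via_A, 4))    # the two routes agree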

An important result of this connection between connections emerges: Once we know how Y and Z are connected to X, we may not need to graph Z against Y so as to determine the connection between them – we can calculate it. Another conclusion is that all relationships of form Y = X^k are comparable in terms of the value of k. The relationship between male and female literacy, as both increase, tends to be around F = M^0.5 (Exercise 8.2), or inversely, M = F^2.0. For support for democracy (D) and for capitalism (C), the best fit of form Y = X^k was C = D^2.19 (Chapter 9). (For the moment we overlook the limited degree of fit!) Of course, F and M are utterly different beasts, compared to D and C. Yet their relationships have some similarity: not only are both of format Y = X^k, but k≈2 in both cases. What does it tell us? In both cases separately we should ask: What is the underlying mechanism or process that leads to k around 2 rather than around 1.5 or 3? And jointly for the two cases we might ask: Are there some very broad similarities between these 2 processes? I have no answer. Maybe some present student will find it.


* Exercise 10.1 Lori Thorlakson (2007) defines three measures of centralization of power in federations, all running from 0 to 1: Revenue centralization (R); Expenditure centralization (E); Index of federal jurisdiction (J). Their mean values in 6 federations, from the 1970s to 1990s, are as follows. The averages of the three measures are also shown, as Index of federal power (IFP).

              R     E     J     IFP
Austria      .73   .70   .95   .79
Australia    .73   .59   .87   .73
Germany      .65   .58   .96   .73
Switzerland  .58   .48   .92   .66
US           .60   .56   .80   .65
Canada       .48   .42   .85   .58

a) In what order are the countries listed? What are the advantages and disadvantages of this order, compared to alphabetical listing?
b) Graph E vs. R, along with what else is needed – think inside the box!
c) On the same graph, enter the points J vs. R.
d) Establish the likely shape of the relationships, and calculate approximately the relevant parameters. egy.zuzarb .ketzuz
e) Graph the corresponding trend curves. Comment on the degree of fit to the data points.
f) From the previous parameters (NOT directly from the data!), calculate the approximate relationship between E and J. hr.hat
g) Graph, superimposed, the resulting curve E vs. J and the data points. Comment on the degree of fit.
h) What do you think are the purposes of this exercise?

Instead of calculating the various logarithms involved in this exercise, we can use log-log paper and graph Y against X. Note, however, that regression of logY against logX becomes difficult when some values of X or Y are very low. Indeed, a single zero value blows up the entire log-log attempt. In such cases extra help may be needed. Recall that sometimes (1-Y) = (1-X)^k yields a better fit than Y = X^k.


* Exercise 10.2
Use the data in the previous exercise.
a) Graph E vs. R and J vs. R on log-log scale, along with the equality line. Given that all values are between 0.4 and 1, you need only one period on log-log graph paper. If your paper has more periods, try to magnify one period, so as to get more detail. Note that the point (1,1) is at the top right corner – not at bottom left, as we may be used to having it.
b) Draw in the best-fit straight lines that pass through (1,1). These correspond to curves when graphing Y against X. CAUTION: One of the anchor points vanishes to infinity. While the curves Y = X^k converge at (0,0), the lines logy = k·logx cannot converge, because log0 tends toward minus infinity. To put it more bluntly: the point (0,0) cannot be graphed on log-log paper; do not confuse it with the point (0.1, 0.1), which has no special meaning.
c) Determine the slopes of these two lines, by measuring vertical and horizontal distances, then dividing them. These are the values of k. How do they compare with the values of k observed in the previous exercise? CAUTION: Calculating the slope may be confusing. Ignore the logarithmic grid. Use a regular centimeter or inch ruler.

Why linear fits lack connective power

When both variables have a floor and a ceiling, they can be normalized to X and Y ranging from 0 to 1. If they are anchored at (0,0) and (1,1), then Y = X^k or (1-Y) = (1-X)^k often fits. Compared to the linear fit Y = a+bX, the format Y = X^k not only respects conceptual constraints but also makes do with a single parameter (k) instead of two (a and b). This parsimony makes comparisons of phenomena much easier and hence helps to answer “why?” It is also easy to compare various data sets, ranking them by values of k, which express the degrees of deviation from a straight line. Could such comparisons also be made using linear data fits? It is easy to compare data sets on the basis of a single parameter k in Y = X^k, but it is much messier to do so for two (a and b) in y = a+bx. Most seriously, if the real pattern is curved, then the values of a and b can vary wildly for different parts of the same data set. For example, if we calculate k in Y = X^k separately for the lower and upper halves of the data in Figure 8.3, little will change. In contrast, a and b in the linear fit will change beyond recognition. Comparisons with other data sets cannot be made on such a fleeting basis. In this sense, the “simple” linear fit is not simple at all. It is well worthwhile to master the rather simple mathematics of the fixed exponent equation rather than grind out linear parameter values that are meaningless for comparison purposes. Even more broadly: It is easier to interconnect multiplicative connections than linear ones.
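The stability claim can be tested numerically. A minimal Python sketch (my own illustration) fits the lower and upper halves of the normalized Table 8.2 data separately, once with Y = X^k and once with a straight line y = a + bx; the two k values stay close to each other, while a and b shift drastically.

    import math

    # Normalized data from Table 8.2 (anchor points excluded)
    X = [0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
    Y = [0.04, 0.04, 0.18, 0.22, 0.48, 0.72]

    def fit_k(xs, ys):
        """Exponent k in Y = X^k: log-log least squares forced through the origin."""
        lx = [math.log10(v) for v in xs]
        ly = [math.log10(v) for v in ys]
        return sum(a * b for a, b in zip(lx, ly)) / sum(a * a for a in lx)

    def fit_linear(xs, ys):
        """Ordinary least-squares intercept a and slope b in y = a + bx."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
        return my - b * mx, b

    for name, part in [("lower half", slice(0, 3)), ("upper half", slice(3, 6))]:
        k = fit_k(X[part], Y[part])
        a, b = fit_linear(X[part], Y[part])
        print(name, "k =", round(k, 2), " a =", round(a, 2), " b =", round(b, 2))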


Many variables are interdependent, not “independent” or “dependent”

We often think of processes in terms of some factors entering as “inputs” and producing some “outputs”. When graphing an output variable and an input variable, the convention is to place the input on the horizontal “x-axis” and the output on the vertical “y-axis” (cf. Chapter 2). When writing equations, the convention is to put the output first: C = k/N², rather than k/N² = C. But what was an input in one context may become an output in some other. Eggs lead to chickens, and chickens lead to eggs. It is difficult to propose a unique direction when considering support for democracy and capitalism (Chapter 9). Dalton and Shin (2006) decided to graph Capitalism vs. Democracy (cf. our Figures 9.1 to 9.3). Did they imply that support for democracy comes first and somehow causes support for capitalism? Not necessarily. It is more likely that both are affected by a large number of other factors (X, Y, Z) and also boost each other:

             → C
(X, Y, Z)        ↕
             → D

The relationship of C and D may well be symmetric, but when graphing, we are stuck with the choice between two asymmetric ways: either we graph C vs. D or D vs. C. We have to do it one way or the other, but don’t mistake it for a causal direction. The same goes for equations. We have used Y for support for capitalism and X for support for democracy, leading to Y = X^2.19. But we could as well transform it into X = Y^0.457, where 0.457 = 1/2.19. And we first have to find the value of k, from some given values of X and Y, transforming Y = X^k into k = logY/logX. But later, each of the quantities k, Y and X in Y = X^k may be the “output”, depending on the purpose of the moment. Such transformations of the same equation – Y = X^k, X = Y^(1/k) and k = logY/logX – are inherent whenever we use algebraic equations: The equality sign “=” is valid in all directions. Algebraically formulated models do not ask which came first, the chicken or the egg – they just go together. It will be seen in Chapter 16 that the OLS regression equations, in contrast, are one-directional and should be written as

y a+b1x1+b2x2+…, rather than with “=” .


Some social scientists pay much attention to whether variables are "independent" or "dependent", but physical scientists are accustomed to thinking in terms of interdependent variables. Causal direction may vary. Sometimes the existing number of teachers affects literacy, and sometimes existing literacy affects the number of teachers. Thus it makes more sense to talk of input and output variables under the given circumstances rather than of inherently "independent" and "dependent" ones. The terms "input" and "output variables" indicate their respective roles in the context of the moment, without passing judgment on some inherent causal direction.

Exercise 10.3 Exercise 8.1 dealt with people's degree of trust in their country's legal system (L) and in their country's representative assembly (A). Which of L and A might be the independent and which the dependent variable? Can we clearly decide which way it goes? Would we lose anything if we assumed that they are just mutually interdependent?

11. Volatility: A Partly Open Box

• The foremost mental roadblocks in logical model building have little to do with mathematical skills. They involve refusal to simplify and reluctance to play with extreme cases and use their means.
• "Ignorance-based" models extract the most out of near-complete ignorance. They ask: What do we already know about the situation, even before collecting any data? Focus on conceptual constraints.
• Sherlock Holmes principle: Eliminate the impossible, and a single possibility may remain – or very few.
• Eliminate the "conceptually forbidden regions" where data points could not possibly occur.
• Locate the conceptual "anchor points" and conceptual ceilings, where the value of x imposes a unique value of y.
• This done, few options may remain for how y can depend on x – unless we tell ourselves "It can't be that simple".
• Dare to make outrageous simplifications for an initial coarse model, including as few variables as possible. Leave refinements for later second approximations.

When building a logical model, we can start by applying logic and then look for data to test it. Conversely, we can start with intelligent data analysis and see what type of model it is hinting at. In the last few chapters, we began with data. Previously, in the ignorance-based approach, we began with logic. Now we are mixing these approaches. There will be some repetition. This is necessary for some steps to become almost automatic. When noticing repetition, some people conclude: "I have already seen that, so I can by-pass it." Some others may observe: "I saw it before, but it sounded either strange or pointless. Let me see if I can now get more out of it." Having seen and having grasped are quite different stages.


The logical part of a coarse model for volatility

Volatility (V) stands for the percentage of voters who switch parties from one election to the next. When more parties run, voters have more choices for switching. So the number of parties (N) might increase volatility. The directional model is "N up → V up". Can we make it more specific? Once again, you should try to answer this on your own, until you get stuck, and then compare your approach to what follows. How should we start? Graph V against N, so as to visualize the issue. At first we can show only the two axes. Next, let us ask: What do we already know about volatility and parties, even before collecting any formal data? The first step in constructing models often echoes the advice of the fictional Sherlock Holmes: Eliminate the impossible, and a single possibility may remain – or at least the field is narrowed down appreciably. Apply it here. Mark off those regions where data points could not possibly occur. These are the forbidden regions or areas. Volatility cannot be less than 0 or more than 100 per cent. The number of parties cannot be less than 1. These conceptually forbidden regions are shown in Figure 11.1. The remaining allowed region has the form of a box open to the right.

Figure 11.1. Individual-level volatility of votes vs. effective number of electoral parties – conceptually forbidden regions (areas), anchor point, and expected zone. (Note: The legend “Anchor point” at top left is confusing; the actual anchor point is the triangle at lower left.)

[Figure: Individual-level volatility (%), 0 to 100, vs. effective number of electoral parties, 1 to 10. Conceptually forbidden areas lie left of N=1, below V=0 and above V=100. The anchor point sits at (1,0); the lines V=20(N-1) and V=10(N-1) are drawn, with the surprise zone above V=20(N-1) and the expected zone between that line and the N-axis.]


Make your models as simple as possible – but no simpler

You may feel that we have simplified to an unreasonable degree. Reality is much more complex. Indeed, the foremost mental roadblocks in logical model building have little to do with mathematical skills. They have to do with refusal to simplify and reluctance to play with extreme cases and use their means. Recall the advice attributed to Albert Einstein: "Make your models as simple as possible – but no simpler." Omit everything that isn't essential. Do not omit anything that is essential. And know the difference. How do we know the difference? If we oversimplify, we'll soon know – our model would not fit the data. At this point, we make at least three simplifying assumptions. First, we assume that at least one party obtains some votes in both elections. This restriction excludes the unlikely situation where a single party has all the votes in one election but loses them all to a brand new party in the next election. Second, we assume that the same voters vote at both elections. This simplification is serious, because real elections always have some voters dropping out and some others entering, from one election to the next. We should not forget about it, but model-building best proceeds by stages. "As a first approximation", let us assume that negligibly few voters drop out or join. Let us work out this simple situation first. If successful, we can go to a second approximation, where we take into account more factors – not only the shift in voters but also some other factors that might influence volatility, besides the number of parties. Third, we assume that the effective number of parties is an adequate measure. Here it is calculated on the basis of vote shares, in contrast to previous use based on seat shares. This N may change from one election to the next. In that case, use the mean of the two values. Arithmetic or geometric mean? It hardly matters when the two values of N are quite similar, as they usually are.

Next, note that there is a conceptual extreme case. Suppose that only one party runs in both elections, so that N=1. Here switching to another party is impossible. Hence volatility must be zero. This point (N=1, V=0) is marked in Figure 11.1 with a triangular symbol. This is a conceptual anchor point. At N=1, even a slight deviation of V away from zero would violate logic. Of course, democratic countries practically always have more than one party running. Logical models, however, must not predict absurdities even under extreme conditions. If V increases with N, our simplest tentative assumption could be linear increase: V=a+bN, where slope b must be positive. This could be any upward sloping straight line. But the anchor point adds a constraint. All acceptable lines must pass through the anchor point (1,0). How do we build this constraint into V=a+bN?


For N=1, we must have V=0. Plug these values into V=a+bN, and we get 0=a+b. This means that a=-b, so that

V = -b+bN = b(N-1).

Now, among the infinite number of upward sloping straight lines, only those will do where the initial constant equals the negative of the slope. Without any input of data, the conceptual anchor point approach has already narrowed down the range of possibilities. Instead of having to look for two unknown constants (a and b), we have only one. This is a tremendous simplification.

Introducing an empirical note into the coarse model

Now we proceed to shakier ground. The effective number of parties rarely reaches N=6. You may share a gut feeling that even with 6 parties to choose from, not all voters will switch. If so, then V=100 per cent at N=6 would be a highly surprising outcome, although it is not conceptually impossible. Here we introduce a touch of empirical knowledge into the previously purely logical framework. The particular line of form V=b(N-1) which passes through the point (6,100) is shown in Figure 11.1. How did we find the equation of this line? Plug the values (6,100) into V=b(N-1). The result is 100=b(6-1); hence b=100/(6-1)=20. Thus, the equation of this line is V=20(N-1). Any data point located above this line would be highly surprising, although we cannot completely exclude the possibility, in contrast to the conceptually forbidden areas. Hence this zone is marked as a surprise zone in Figure 11.1. So V=20(N-1) is roughly the highest value of V that would not seriously surprise us. Do we also have a lowest value? No – even with a very high number of parties, it is still conceivable that the party loyalty of voters could be complete. Thus no limit higher than V=0 can be proposed, meaning a horizontal line at V=0, which is the x-axis. Without any real data input, we have now narrowed down the reasonably expected zone where data points could occur. It's the cone between the lines V=20(N-1) and V=0. In the absence of any other knowledge, we have no reason to expect the actual line to be closer to either of these two extremes. Therefore, our best "minimax bet" would be the average of the likely extremes. The arithmetic mean of V=20(N-1) and V=0 is V=10(N-1). We really should write it with a wavy equality sign "≈", because it is quite approximate:


V10(N-1).

Still, without resorting to any data, we have gone beyond a directional model to a quantitative one. It is based on near-complete ignorance and is shown in Figure 11.1. This model makes two distinct predictions, one of them firm, the other quite hesitant. 1) If any straight line fits at all, it absolutely must have the form V=b(N-1), so as to respect the anchor point. 2) The slope b would be around 10, very approximately. In other words: “If you force me to guess at a specific number, I would say 10.”
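For those who prefer to trace such arithmetic by computer, here is a minimal sketch in Python of the coarse model just built; all numbers (the surprise point at N=6, V=100, and the resulting slopes 20, 0 and 10) come from the reasoning above.

# Coarse volatility model, encoding the reasoning above
N_surprise, V_max = 6, 100          # V = 100% already at N = 6 would surprise us
b_high = V_max / (N_surprise - 1)   # 100/(6-1) = 20: upper edge of the expected zone
b_low = 0                           # complete party loyalty is still conceivable
b_guess = (b_high + b_low) / 2      # arithmetic mean of the extremes: 10

def volatility(N, b=b_guess):
    # V = b*(N-1) respects the anchor point (N=1, V=0)
    return b * (N - 1)

for N in (1, 2, 4, 6):
    print(f"N = {N}: expected V of about {volatility(N):.0f} per cent")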

Exercise 11.1 Here we took the arithmetic mean of slopes 20 and 0, to get V ≈ 10(N-1), despite my arguing in favor of the geometric mean in previous chapters. What's the justification?

Testing the model with data

Once constructed, such a model needs testing in two different ways: logical testing to guard against any absurd consequences, and testing with actual data. Let us start with the latter. Data from different countries are quite scattered, because many other factors enter, besides N. But a uniform data set is available from Oliver Heath (2005), for state-level elections in India, 1998–1999. Broad conditions were the same for all the states, but many parties competed in some states, while few did in some others. The mean values were N=3.65 and V=31.6. These values lead to b=31.6/(3.65-1)=11.9. Thus our very coarse expectation of 10 was off by only 20 percent. For a prediction not based on data, this is pretty good. So, at least for India,

V=11.9(N-1)=-11.9+11.9N.

Heath (2005) reports the best statistical fit as

V = -9.07+11.14N   [R^2=0.50].
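If you prefer to redo this comparison by computer rather than by pocket calculator, a rough Python sketch using only the figures quoted above is:

# Slope forced through the anchor point (1, 0), from the Indian mean values
N_mean, V_mean = 3.65, 31.6
b = V_mean / (N_mean - 1)            # about 11.9, vs. the blind guess of 10
print(f"b = {b:.1f}")

# Compare the anchored line with Heath's unconstrained best fit at a few N
for N in (1, 2, 3, 4, 5):
    print(f"N = {N}: anchored {b*(N-1):5.1f}   OLS {-9.07 + 11.14*N:5.1f}")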


Figure 11.2 shows both lines and the data points. Neglect for the moment the downward bending curve. The "correlation coefficient" R^2 indicates the goodness of fit to the best possible straight line. (We'll come to that in Chapter 15.) Perfect fit of data points to the line would yield R^2=1.00, while utter scatter would yield R^2=0.00. So R^2=0.50 reflects appreciable scatter but still a clear trend.

Figure 11.2. Individual-level volatility of votes vs. effective number of electoral parties: data and best linear fit from Heath (2005), plus coarse and refined predictive models.

[Figure: same axes and forbidden areas as in Figure 11.1, with the anchor point at (1,0), the data points, the anchored line V=11.9(N-1), the unconstrained best fit V=-9.07+11.14N, and the curve V=100(1-e^(-0.143(N-1))) bending below the 100% ceiling.]

Which equation should we prefer? They are very close to each other, compared to the wide scatter of data points. They fit these data practically equally well. Both represent statistical best-fit lines, but for two different ways of fitting data. Equation V=-9.07+11.14N results from the assumption that any straight line is acceptable, meaning that any values of a and b in V=a+bN are just fine. For N=1, it yields V=2.07 rather than the conceptually required V=0. On a scale of 0 to 100, the difference is small – but it is absurd nonetheless to claim that with a single party available, 2% of voters still manage to change their vote. In contrast, equation V=-11.9+11.9N results from the assumption that only the lines passing through the conceptual anchor point are acceptable. This line is the best statistical fit subject to this logical condition.


If we want to predict the results of future elections in Indian states, both equations are equally good (or equally bad, given the scatter in previous data), but conceptually we are better off with the line that respects the anchor point. This is even more so if we want to guess at volatility in elections elsewhere, because we are certain that the anchor point holds universally. However, we should be prepared to find that the slope might differ appreciably from 11.9 when it comes to countries with political cultures different from India's. So we might be cautious and offer a universal quantitative prediction with a wide range of error – maybe V=(12±3)(N-1).

Testing the model for logical consistency

So this takes care of testing the model with actual data. But we still need logical testing, so as to guard against any absurd consequences. Look again at Figure 11.2. What level of volatility does our model, V=-11.9+11.9N, predict for N=10? It predicts more than 100 per cent! This is absurd. We cannot plead that such large numbers of parties practically never materialize. A logical model must not predict absurdities even under extreme conditions. If it does, it must be modified. Indeed, in addition to the anchor point (1,0), we must satisfy another extreme condition. There is a conceptual ceiling: When N becomes very large, V may approach 100 but not surpass it. Mathematically: When N→∞ then V→100 per cent. The curve that bends off below the ceiling in Figure 11.2 corresponds to an "exponential" equation, V=100[1-e^(-0.143(N-1))]. This equation may look pretty complex. Yet it represents the simplest curve that satisfies both extreme conditions – anchor point and ceiling – and best fits the data. It is parsimonious in that it includes only one adjustable parameter, which here is 0.143. What does this equation stand for, and how was it obtained? This will be discussed later. We can say that V=-11.9+11.9N is a coarse model, a first approximation, and that V=100[1-e^(-0.143(N-1))] is a more refined model, a second approximation. Even when we have a conceptually more refined model, we might prefer to use the simpler one because it's easier to work with. In the usual range of N the simple model works as well as the more refined one – this is visible in Figure 11.2. There is nothing wrong with such simplification, as long as we do not forget its limitations. If ever we should get many values of N larger than 6, we should consider using the refined model.
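To see where the coarse and the refined models part ways, one can tabulate both. The following Python sketch simply assumes the parameter values 11.9 and 0.143 quoted above:

import math

def v_coarse(N, b=11.9):
    # First approximation: straight line through the anchor point
    return b * (N - 1)

def v_refined(N, c=0.143):
    # Second approximation: exponential approach to the 100% ceiling
    return 100 * (1 - math.exp(-c * (N - 1)))

for N in (1, 2, 4, 6, 10, 20):
    print(f"N = {N:>2}: coarse {v_coarse(N):6.1f}%   refined {v_refined(N):5.1f}%")
# In the usual range of N the two roughly agree; beyond N of about 9 the
# coarse line exceeds 100%, which is conceptually impossible.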


Exercise 11.2 In Figure 11.2, the single data point above N=6 actually agrees with the coarse linear model better than with the exponential one. Then how can I say that, for N>6, we should predict on the basis of the exponential curve, when the straight line fits better?

Exercise 11.3 Second and further approximations can take many directions. Here we have specified the relationship between V and N. We could consider other factors that might affect volatility. We could also consider possibly better ways to measure both volatility and the number of parties. What about people who vote only in the first or only in the second of the two elections on which volatility figures are based? How might these occasional voters affect our model? I have no answers to offer. If you find any, it would be impressive – but don't spend too much time on this exercise.

Exercise 11.4 Our model involves a box open to the right – as if N could go all the way to infinity. Student Valmar Valdna pointed out to me (2.9.09) that the adult population (P) of the country would impose an upper limit. The maximum N would correspond to each person forming a separate party and voting for this party. Each person most likely would stick to her/his very own party the next time around, and so volatility would drop again to zero! This would be another constraint. Can you work it into the model? Hint: Find a simple expression in N and P that would equal 1 when N=1 but would drop to 0 when N=P; then multiply the existing V by this expression. [In more mathematical language: find a function f(N) such that f(1)=1 and f(P)=0. This is how we often translate constraints into models.]

Exercise 11.5 Instead of the exponential approach to the ceiling, we could use a fixed exponent approach: V/100=[(N-1)/N]^k. Indeed, V/100=[(N-1)/N]^4 yields the following sample values:

N                    1     1.5    2     3      4      6      10
V=100[(N-1)/N]^4     0     1.2    6.2   19.8   31.6   48.2   65.6

This curve may fit the data cloud slightly better than does the exponential. Graph it on top of Figure 11.2 and see how the two curves differ. Why do we tend to prefer the exponential format? See Chapter 20.

12. How to Test Models: Logical Testing and Testing with Data

• Logical models must not predict absurdities even under extreme circumstances. Logical testing means checking the model for extreme or special situations.
• Most models apply only under special conditions, which should be stated.
• In testing models with data, eyeballing must precede statistical tests; otherwise, the wrong type of test might be used.
• Data often must be transformed prior to statistical testing, guided by logical constraints. Logarithms often enter.
• Study the graphs of data and model-predicted curves very carefully. They may supply much more information than meets the eye at first glance.
• We live in a world where multiplicative relationships overshadow the additive.

We have already touched on testing a logical model. It involves two aspects:
• Logical testing, to guard against any absurd consequences.
• Testing with actual data.
Testing with data is often thought to be the only test a model needs, but it alone does not suffice. Even though it fit the data, the coarse linear model of volatility (Chapter 11) ran into trouble at a large number of parties, predicting volatilities of more than 100%. It had to be modified into an exponential model that avoids this absurdity. Let us review the models previously offered, in this respect.

Logical testing

In Chapter 1, the model for the number of seat-winning parties (n) when T seats are available is n=T^(1/2). Look for extreme cases. The lowest possible value of T is 1. Does the model yield a reasonable result? Yes, it predicts n=1, which is indeed the only logically acceptable possibility. There is no conceptual upper limit on T, and no contradictions can be seen even for very high values. Of course, both T and n come in integer numbers. Yet for most integer values of T, the formula yields a fractional n. This is no problem, because we deal with average expectations. For T=14, we calculate n=3.7. This means that 4 parties winning seats is somewhat more likely than 3. It is also quite possible that 2 or 5 parties win seats – it is just much less likely. Does the model for the largest component (Chapter 4) fit even under extreme conditions? The model is S1=T/n^(1/2). Suppose there is only one component: n=1. The formula correctly yields S1=T. Now suppose we have a federal assembly of 100 seats and the number of federal units is also 100. If the term "federal unit" has any meaning here, each unit would have one and only one seat, and this applies to the largest unit as well. Yet the formula yields S1=100/100^(1/2)=10, leaving only 90 seats for the other 99 federal units! How should we refine the model so as to avoid such an outcome? When establishing the model, we started with T/n > nm. Thus we should specify the simplified model as follows:

S1 = T/n^(1/2)   [T >> nm].

If this condition is not fulfilled, switch to a more refined model. Most models apply only under special conditions, which should be stated. The relationship is analogous for the exponential and linear models for volatility. One is more refined, yet the other is so much easier to use and most often works with sufficient precision. We just have to know when we can approximate to what degree. Note that Exercise 11.4 introduces a further limit. Consider next the model for cabinet duration (Chapter 6). We already noted that the basic model for the number of communication channels, c=n(n-1)/2, works also for the extreme cases of n=0 and n=1. But what about the model for cabinet duration, C=k/N^2? The lowest limit on the number of parties is N=1. Then C=k. The best fitting value of k has been found to be 42 years. If a one-party democracy came to exist, the model predicts a cabinet duration of about 42 years. For a pure two-party system (N=2), the model predicts about 10 years. Does this feel right?

Here it is not a matter of clear conceptual limits but of having surprise zones (as in Figure 11.1 for volatility). Consider the following limits. 1) By the loosest definition, a cabinet is considered to continue as long as it consists of the same party or parties. The ministers or even the prime minister could change. Even so, we might be surprised if the same cabinet continued beyond human life spans, say 80 years. 2) At the low side, consider a two-party system. One party is bound to have a slight majority and form the cabinet, which is likely to last until the next elections, short of a rather unusual rift within the ruling party. Elections typically take place every 4 years. Hence average durations of less than 4 years for two-party constellations would surprise us. Plug C=4 years and N=2 into C=k/N^2, and out pops k=CN^2=16 years. In sum, values of k outside the range from 16 to 80 years would surprise us.
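The extreme-case checks described above are easy to repeat by computer; the following Python sketch simply restates them (all numbers come from the text):

import math

def largest_component(T, n):
    # Coarse model S1 = T/n^(1/2); valid only when T is much larger than what is
    # needed to give each of the n components its minimal size
    return T / math.sqrt(n)

print(largest_component(100, 1))     # 100.0 -- one unit gets everything: sensible
print(largest_component(100, 100))   # 10.0  -- absurd: only 90 seats left for 99 units

def cabinet_duration(N, k=42):
    # C = k/N^2, with k in years
    return k / N**2

# Surprise-zone bounds for k: C = 4 years at N = 2 gives k = C*N^2 = 16 years;
# a cabinet lasting 80 years at N = 1 gives k = 80 years.
print(4 * 2**2, 80 * 1**2)                        # 16 80
print(cabinet_duration(1), cabinet_duration(2))   # 42.0 and 10.5 years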


Testing with data

Models may be logically consistent, yet fail to agree with reality. We may deduce logically that the pattern C=k/N^2 should prevail, but we may discover that the actual pattern is closer to C=k/N^3 or to C=k/N^1.5 – or to no equation of the form C=k/N^n. How do we discover what the actual pattern is? One way is to graph all the raw data we have and compare them to what the model predicts. This is what we did in Exercise 2.1, for a few countries. We may graph the curves C=k/N^2 for selected values of k such as k=30, 40 and 50 years and see if the cloud of data points fits along one of them. If it does, the model fits. By trial and error, we might find the best fitting value of k, but this is a slow process. If, on the contrary, the data cloud crosses the curves C=k/N^2, then the model does not fit, and we have to find a family of curves to which the cloud does fit. Curves are messy. It is so much easier to work with straight lines. With the latter, the eyeball test often works. At a glance, we can see whether the data points follow a straight line and what its slope is. Fortunately, whenever a model involves only multiplication, division and fixed exponents, taking logarithms turns it linear, as seen below.

Many models become linear when logarithms are taken

Most of the models discussed up to now follow the generic form y=cx^k (Table 12.1). This means that the input variable is raised to some simple exponent (1, ½, 2, -1, -½, or -2) and multiplied by some constant (which could be 1). No additions or subtractions! Then the logarithms of input and output have a linear relationship: logy=logc+klogx. Why do relationships tend to have such a form? We live in a world where multiplicative relationships overshadow the additive. This is so because we can multiply and divide quantities of a different nature: Divide distance by time, and we get velocity. In contrast, we can add or subtract only quantities of the same nature: distance plus distance, or time minus time. Multiplicative relationships lead to curves rather than straight lines. Conversion to logarithms turns curved patterns into linear ones. Thus it is most useful to graph them on log-log paper. The coarse model for volatility is different; testing it does not need logarithms. However, logarithms enter the refined volatility model in a different way, to which we'll come later.


Table 12.1. Many logical models have the form y=cx^k.

Generic form:                  y=cx^k       →  logy = logc + k·logx      constant c   exponent k
No. of seat-winning parties:   n=T^(1/2)    →  logn = 0.5·logT                1           ½
The largest share:             S1=T/n^(1/2) →  logS1 = logT - 0.5·logn        T          -½
Cabinet duration:              C=k/N^2      →  logC = logk - 2·logN           k          -2
Fixed exponent:                Y=X^k        →  logY = k·logX                  1           k
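The linearizing effect of logarithms can be verified numerically; the following Python sketch evaluates C=k/N^2 with k=42 years and shows that logC is a straight-line function of logN with slope -2:

import math

k = 42  # years, from C = k/N^2

for N in (1, 2, 3, 5, 8):
    C = k / N**2
    # On log-log axes the points (logN, logC) fall on a straight line:
    # logC = logk - 2*logN, i.e. slope -2 and intercept logk
    print(f"N = {N}: logN = {math.log10(N):.3f}   logC = {math.log10(C):.3f}   "
          f"logk - 2*logN = {math.log10(k) - 2*math.log10(N):.3f}")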

What can we see in this graph? Figure 12.1 shows the result for cabinet duration in 35 democracies (data from Lijphart 1999) graphed against the effective number of parties, both on logarithmic scales. (Compare it to the curved pattern you drew in Exercise 2.1 and the straightened patterns in Exercises 2.3 and 7.2.) The meaning of lines in this graph:

Thin solid line: best fit between logarithms.
Bold solid line: theoretically based prediction [C=42 years/N^2].
Dashed lines: one-half and double the expected value.

What can we see in this graph? Stop reading and look at the graph. List all the details you can see. What can you conclude directly or by comparing the details you see?

Students trained to deal only with directional models often see only that "when N increases, C decreases." This is akin to seeing only grass and sky in a painting that also has humans and bushes and buildings and much more. Being able to read graphs to a full extent is a major part of basic numeracy in social sciences. So this section is one of the most important in this book. My own list is given at the end of the chapter. Do not peek before making your own list! Only this way will you learn to see more than you have up to now.


Figure 12.1. Mean cabinet duration vs. effective number of legislative parties – Predictive model and regression line. Source: Taagepera and Sikk (2007).

The tennis match between data and models

Testing the model with data might look like the time-honored simple recipe: "hypothesis (model) → data collection → testing → acceptance/rejection". However, this would oversimplify the process. It would be akin to reducing a tennis match to "serve → respond → hit inside or outside → win/lose". Yes, this is a basic component of tennis, but much more is involved before the "win/lose" outcome is reached. The basic "serve → respond" component itself plays out many times over. So it is with model testing, too. Superficial data inspire the first coarse logical model. The model may suggest looking for different data that better correspond to what the model is about. But the same discrepancies between model and data may also motivate a search for a more refined model. Some hidden assumptions may have entered the first round of model building; they must be explicitly stipulated. For instance, the coarse model for volatility implicitly assumed that any positive values of volatility are acceptable, even those surpassing 100 percent. In sum, the process of scientific research most often looks like an ascending spiral:


Initial hunch (qualitative hypothesis) → limited data collection → quick testing → quantitatively predictive model (quantitative hypothesis) → further data collection → testing → refined model → testing → further refining of model or data → testing ...

The simple recipe above (hypothesis → data collection → testing) represents a single cycle within this spiral. The essential part of a predictive model is the predicted functional form of relationship among the variables – "product of n and n-1 divided by a constant" for c=n(n-1)/2; "constant divided by N squared" for C=42/N^2. The model may include a constant or parameter (like k for cabinet duration), which must be determined empirically. Due to conceptual constraints, predictive models rarely are linear. Linear approximations are useful in preliminary work, along with graphical representations, to get a feel for the empirical pattern. They are also useful at the very end, as practical simplifications. In order to know when a simplification can be used, one must be aware of the refined model.

Why would the simplest forms prevail?

This book has highlighted the models based on ignorance – or rather near-ignorance, teasing the most out of what we know about constraints. Conceptually forbidden zones, anchor points, and continuity in between those points are important parts of our knowledge. Asking what would happen under extreme conditions can lead to insights, even when we agree that such extremes will never materialize. When the impossible is eliminated, the possible emerges with more clarity. But why should we expect the simplest mathematical formats to apply, among the many formats that also satisfy some obvious constraints? Addressing a similar issue, physicist Eugene Wigner (1960) observed that the physicist is a somewhat irresponsible character. If the relationship between two variables is close to some well-known mathematical function, the physicist jumps to the conclusion that this is it – simply because he does not know any better options. Yet it is eerie how often this irresponsible approach works out. It is as if mathematics were indeed the language in which nature speaks to us.


What can we see in this graph? A sample list

At least the following can be seen in graph 12.1. Let us start from the broad frame and then work toward the center – first the most prominent features and then the local details.
• Mean cabinet duration, C, is graphed against the number of parties, N.
• Both variables are graphed on logarithmic scales. (This is harder to notice for N – the unmarked notches should be labeled 2, 3, 5. If this were a regular scale, one would start from 0, not 1.)
• The data cloud visibly follows a roughly linear pattern, on this log-log scale.
• The graph shows in bold the line that corresponds to k=42 years for C=k/N^2 and, in dotted lines, the lines that correspond to k=84 and 21 years, respectively – i.e., double and one-half of 42. It can be seen that all data points but one fall into the zone between the dotted lines, and most points crowd along the central line.
• The exponent 2 in the denominator of C=k/N^2 corresponds to slope -2 in logC=logk-2logN. Regardless of the value of k, all lines that fit the model have this slope. The value of logk just raises or lowers the line.
In sum,
• The data cloud does follow a straight line on the log-log graph.
• This line does have a slope close to -2.
• This line corresponds to a value of k around 42 years.
• Nearly all data points are located in a zone along the line that corresponds to an error within a factor of 2 for C.

The “parameter” k is not part of logical prediction – it is determined precisely from this graph. When we put N=1 into C=k/N2, it becomes C=k. This means that we can read off the values of k for the various lines on the vertical scale at N=1, at the left side of the graph. Recall our estimate that values of k outside the range 16 years

cleverly played the parties against each other, including and dropping some all the while, which technically produced different cabinets.)

Still more is to be seen in the graph.
• The line labeled C=31.3/N^1.757 is the best-fit line determined by statistical means. Never mind for the moment how it's determined. Note the following: It is visually very close to the line with slope -2, but its own slope is -1.757. It would reach the left axis (N=1, where logN=0) at the height 31.3 years, appreciably below 42 years. We can see that just a small drop in slope (from 2 to about 1.75) can change the intercept (the C value at N=1) quite a lot. But N=1 is an area with no data points, which makes the intercept less informative than the slope.
• Finally, consider the values of R^2. As stated previously, R^2=1 expresses a perfect fit and R^2=0 perfect scatter. Never mind for the moment how it's measured. Just observe that the fit for the best fitting line with any slope is quite good (R^2=0.79). However, the fit for the best fitting line with the predicted slope (-2) is almost as good (R^2=0.77). In view of the existence of a logical model and its empirical confirmation, we have here a law, in the scientific sense of the term – the inverse square law of cabinet duration, relative to the number of parties.
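If one wants to imitate the two fits by computer, the following Python sketch does so on synthetic data scattered around C=42/N^2. The data, the noise level and the resulting numbers are invented for illustration – they are not Lijphart's data.

import numpy as np

rng = np.random.default_rng(1)

# Synthetic cabinet durations scattered around C = 42/N^2 (illustrative only)
N = np.linspace(1.5, 6.5, 35)
C = 42 / N**2 * np.exp(rng.normal(0, 0.35, N.size))
x, y = np.log10(N), np.log10(C)

# Best-fit line with free slope: logC = intercept + slope*logN
slope, intercept = np.polyfit(x, y, 1)
print(f"free slope:  slope = {slope:.2f}   k = {10**intercept:.1f} years")

# Best-fit line with the slope fixed at -2: only the intercept (logk) is fitted
intercept_fixed = np.mean(y + 2 * x)
print(f"slope -2:    k = {10**intercept_fixed:.1f} years")
# A small change in slope shifts the intercept (the C value at N=1) a lot,
# because N=1 lies well outside the range of the data points.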

Did you see most of this in the graph? Did you draw most of these conclusions? If you did not, don't blame yourself. It's a learning process. The main message is: Study the graphs of data and model-predicted curves very carefully. They may supply much more information than meets the eye at first glance. Testing with data follows much the same pattern in the case of the number of seat-winning parties and the largest share. Each presents different snags (like the issue of 100 seats for 100 subunits, for the largest share). The example of cabinet duration suffices for the moment. The main message is: In testing models with data, eyeballing on graphs must precede statistical tests; otherwise the wrong type of statistical approach might be adopted.

13. Getting a Feel for Exponentials and Logarithms

• We have 10^a × 10^b = 10^(a+b), because the exponent simply adds the number of zeros. Also 10^a/10^b = 10^(a-b), which subtracts the number of zeros, and (10^a)^b = 10^(ab), which multiplies the number of zeros. It results that 10^(-a) = 1/10^a and 10^0 = 1.
• Decimal logarithms are fractional exponents of 10 that lead to the given number: logA is a number such that 10^(logA) = A.
• It follows that when numbers are multiplied, their logarithms add: AB=C ⇔ logA+logB=logC. Also y=A^m ⇔ logy=m·logA.
• Keep in mind the following markers: log10=1, log1=0, and log0 → -∞.
• The formulas established for exponents of 10 apply to exponents of any other number n too: n^a × n^b = n^(a+b), n^a/n^b = n^(a-b), (n^a)^b = n^(ab), n^(-a) = 1/n^a, n^0 = 1, and the b-th root of n is n^(1/b).
• Natural logarithm (lnx) is just a multiple of logx: lnx = 2.3026·logx. Conversely, logx = 0.434·lnx.

By now you should be persuaded that there is no escape: One cannot do basic logical models without logarithms and their counterparts, the exponentials. If you know how to deal with them, you can bypass this chapter. But make sure you really understand them, rather than just applying rules you have memorized. We have introduced logarithms very gradually – only to the extent they were indispensable for the problem at hand. This approach also gave time for some basic notions to sink in, before being flooded by more. The somewhat different approach in this chapter still tries to keep it as simple as possible. The previous gradual introduction may have reduced anxiety. I do not want you to memorize "When numbers are multiplied, their logarithms add" (although it's true and useful) without internalizing where such a claim comes from. Only then are you prepared to use logarithms without hesitation, knowing what they mean. Then, if you have made a mistake in calculations (as I often do), you'd be able to smell it out and correct it. This is the hallmark of a professional.


Exponents

We use 10^3 as shorthand for 10×10×10. Thus, 10^3=1,000. More generally, the "exponent" a in 10^a is the number of zeros that come after "1". It follows that 10^1=10. Also

10^0=1, given that here "1" is followed by no zeros. If we multiply 100 by 1,000, we get 100,000. Using exponent notation, we have 10^2 × 10^3 = 10^5, which is 10^(2+3). The numbers of zeros add. This is how multiplication turns into addition. When powers of 10 are multiplied, their exponents add:

10^a × 10^b = 10^(a+b).

When multiplying 100 by itself 3 times, we get 100×100×100=1,000,000. In exponent notation, 10^2 × 10^2 × 10^2 = (10^2)^3 = 10^6. Thus

(10^a)^b = 10^(ab).

If we divide 10,000 by 10, we get 1,000. Using exponents: 10^4/10^1 = 10^3, which is 10^(4-1). Hence division of numbers leads to subtraction of exponents: 10^a/10^b = 10^(a-b).

Now consider the reverse division: 10/10,000 yields 1/1,000=0.001. The previous rule makes it correspond to 10^1/10^4 = 10^(1-4) = 10^(-3). Note that the "-3" corresponds to the number of zeros that precede the "1". We must also conclude that 10^(-a) = 1/10^a, because multiplying both sides by 10^a we get 10^a × 10^(-a) = 10^(a-a) = 10^0 = 1 on the left, and also 10^a × (1/10^a) = 1 on the right. For future use, all the equations in bold may be worth memorizing – but only if you understand what's behind them. This means you can prove these relationships, if challenged. Otherwise, memorization does you little good. Conversely, if you understand, all this may look so natural that no memorization is needed.


Exercise 13.1 OK, now quickly, without looking at the above: Why must 10^0 be equal to 1? Why must 10^(-a) be equal to 1/10^a? Why must 10^3 times 10^4 be equal to 10^7? If you cannot respond in terms of numbers of zeros, return to the beginning of the section.

The formulas established for exponents of 10 apply to any other number n too: n^a × n^b = n^(a+b), n^a/n^b = n^(a-b), (n^a)^b = n^(ab), n^(-a) = 1/n^a, n^0 = 1.

It follows that the b-th root of n is n^(1/b). Also, the b-th root of n^a is n^(a/b). These relationships frequently enter model building and testing.

Fractional exponents of 10

The next question may sound crazy: What could 10^(1/2) or 10^0.5 stand for? If you take the previous rule seriously, it would mean "1" followed by one-half of a zero! It seems to make no sense. But hold it! Also consider the previous rule 10^a × 10^b = 10^(a+b). When multiplying 10^0.5 by itself, we would get 10^0.5 × 10^0.5 = 10^(0.5+0.5) = 10^1 = 10. But this is the very definition of the square root of 10, which is approximately 3.16, given that 3.16×3.16=10. Thus 10^(1/2) = 10^0.5 stands for the square root of 10 – it cannot logically stand for anything else! Yes, it's as if 3.16 stood for "1" followed by one-half of a zero... Now consider the cube root of 10, which is 2.154, because, as you can check, 2.154^3=10. We could then say that 2.154 is somehow like "1" followed by one third of a zero, because 10^(1/3) × 10^(1/3) × 10^(1/3) = 10^(1/3+1/3+1/3) = 10^1, which is "1" followed by a full zero. What about exactly 3? It is somewhat less than 3.16 but much more than 2.154. So it should be "1" followed by somewhat less than one-half of a zero but much more than one third of a zero. By now you may get the message: We can assign an exponent of 10, a sort of "fractional number of zeros", to any number between 1 and 10. For instance, 2 is 10 with exponent 0.30. How can we prove it? Note that 2^10=1,024. This is quite close to 1,000=10^3. Thus 2^10 ≈ 10^3. Take the 10th root on both sides: (2^10)^(1/10) ≈ (10^3)^(1/10). Work out the exponents, and we get 2 ≈ 10^0.30.


Decimal logarithms

This "fractional number of zeros to follow 1" – this is what the decimal logarithm is. Thus, log3.16=0.500, log2.154=0.333, and log2=0.30. Hence, by definition, 10^(log2)=2. More generally, for any number A, logA is a number such that

10^(logA) = A.

When numbers are multiplied, their logarithms add. Indeed, consider AB=C. We can write it as 10^(logA) × 10^(logB) = 10^(logC). It follows from 10^a × 10^b = 10^(a+b) that logA+logB=logC.

AB=C  ⇔  logA + logB = logC.

Also log(A^2) = log(A×A) = logA+logA = 2logA. More generally,

log(A^m) = m·logA.

Note that m enters here, not logm. In other words,

y=A^m  ⇔  logy = m·logA.

What could be the logarithm of 0? Recall that log0.001=log(1/1000)=-3. Each time we divide a number by 10 we subtract "1" from its logarithm. How many times do we have to divide 1 by 10 so as to obtain 0? We'd have to do it an infinite number of times. Thus log0 tends toward minus infinity:

log0 → -∞.

Hence 0 cannot be placed on a logarithmic scale. What about logarithms of negative numbers? Let us say that they do not have any.

What are logarithms good for?

They turn multiplications into additions – but who needs to go through logarithms when one can just multiply? However, they also turn exponents into multiplications, and this is where one cannot do without logarithms. And all too many logical models involve exponents. Example: Take the expression Y=X^3.6 in the previous Table 8.2. How did I get Y=0.037 when X=0.40? For X=0.40, Y=0.40^3.6. Here we cannot go ahead without applying logarithms: logY=3.6·log0.40. A pocket calculator with a LOG key (sometimes marked log, with 10^x as its inverse) comes in handy.

On a usual pocket calculator, enter 0.40, push LOG and get -0.398. (On some calculators, you must first push LOG, then 0.40 and then "=".) Multiplying by 3.6 yields -1.4326. Now take the "antilog" of -1.4326, which means taking 10^(-1.4326). On most pocket calculators, once you have -1.4326 entered, push "2nd function" and "LOG", and you get Y=10^(-1.4326)=0.0369≈0.037. Shortcut: Many pocket calculators offer a shortcut for Y=0.40^3.6 – the "y^x" key. Enter 0.4, push "y^x", enter 3.6, push "=", and we get 0.037 directly. Does such a calculator bypass taking logarithms? No. It just automatically takes log0.40, multiplies by 3.6, and takes the antilog.
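The same calculation can be traced in Python, which makes each step of the pocket-calculator route visible:

import math

# The pocket-calculator route for Y = 0.40^3.6, step by step
logX = math.log10(0.40)     # -0.398
logY = 3.6 * logX           # -1.433
Y = 10**logY                # taking the antilog
print(logX, logY, Y)        # ends with 0.0369, i.e. about 0.037

# The y^x key does the same thing in one step
print(0.40**3.6)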

Logarithms on bases other than 10

Logarithms can be established on bases other than 10. This may get confusing. The only other type of logarithm needed for most models is the "natural" one, designated as "ln". It is based on the number e=2.718… instead of 10. By definition, lne=1, just as log10=1. What's so "natural" about 2.718…? We'll come to that. When logx means the logarithm to the base 10, then lnx is simply logx multiplied by 2.3. More precisely: lnx = 2.3026·logx.

Conversely, logx = 0.434 lnx.

Note that ln10=2.3026. The previously established relationships still apply. In particular,

AB=C  ⇔  lnA + lnB = lnC, and

ln(A^m) = m·lnA.

Many pocket calculators have separate keys for LOG (and 10^x for the antilog) and LN (and e^x for the antilog). We'll address the natural logs when the need arises.

14. When to Fit with What

• The conceptually forbidden areas inevitably constrain relationships between two variables. Conceptual anchor points and ceilings add further constraints.
• When the entire field y vs. x is open, try a linear fit. Graph y vs. x and see if the pattern is straight.
• When only one quadrant is allowed, try a fixed exponent fit. Graph logy vs. logx and see if the pattern is straight.
• When two quadrants are allowed, try an exponential fit. Graph logy vs. x and see if the pattern is straight.
• When the resulting pattern is not linear, look for further constraints – like those for simple logistic growth and boxes with 3 anchor points.

We like to have linear fits. Sometimes we get a linear fit between x and y. Sometimes, instead, we get a linear fit between logx and logy. We have seen cases like that. But sometimes we get a linear fit between logy and x itself (not its logarithm), as we’ll soon see. And sometimes it’s even more complex. How does one know which way to go? Our basic observation is that

the conceptually forbidden areas inevitably constrain relationships between two variables. Conceptual anchor points and ceilings add further constraints.

These constraints make some forms of relationship impossible, while imposing some other forms. We should start with the simplest mathematical format that satisfies such constraints. We should add more complexity only when data do not fit the simplest format. Such lack of fit usually means that further logical constraints have been overlooked. It is amazing, though, how often nature (including social nature) conforms to the simplest forms. (Recall "Why would the simplest forms prevail?" in Chapter 12.) Suppose that we have two data points, (x1,y1) and (x2,y2). What is the simplest format to connect them, without leading to logical inconsistencies? Such an inconsistency would result if the format predicts an impossible value of y for a possible value of x, or vice versa.


Unbounded field – try linear fit

About the only situation where a linear model is justified is when both x and y can conceivably take any values – from minus infinity to plus infinity. The allowed area is an "unbounded field" (Figure 14.1). Apart from time and space, such quantities are rather rare. Their zero point tends to be arbitrary, and hence there are no logical anchor points.

Figure 14.1. When to use linear regression on unmodified data.

[Figure: an unbounded field (any x, any y, zero points often arbitrary); try fitting with y=a+bx (linear pattern). Two example lines are shown, one with slope b=+1 and one with slope b=-1/2, each marked with its y-intercept a and x-intercept c.]

If the field is unbounded, no transformation of data is needed. Just graph y vs. x. If the data cloud is linear, then y=a+bx applies. Then we can draw in the visual best-fit line, y vs. x, or use a more formal statistical method. How can we find the coefficients a and b in y=a+bx?

• Intercept a is the value of y where the line crosses the y-axis (because there x=0).
• Slope b is the ratio -a/c, c being the value of x where the line crosses the x-axis (because there y=0).

However, we can also find the coefficient values in y=a+bx from any two suitable points:

• Take two points, far away from each other: (x1,y1) and (x2,y2). These should be "typical" points in the sense of being located along the axis of the data cloud, not high or low compared to most neighboring points.


• For y=a+bx we have b=(y1-y2)/(x1-x2). Then a=y1-bx1.
• When a=0 is imposed on logical grounds, so that the line is forced to go through (0,0), the equation is reduced to y=bx. Then b=y1/x1.
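The two-point recipe above is easy to automate; here is a minimal Python sketch (the sample points are invented for illustration):

def line_from_two_points(x1, y1, x2, y2):
    # Coefficients a, b of y = a + b*x from two "typical" points
    b = (y1 - y2) / (x1 - x2)
    a = y1 - b * x1
    return a, b

# Invented example: points (2, 5) and (8, 17) give b = 2 and a = 1
print(line_from_two_points(2, 5, 8, 17))   # (1.0, 2.0)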

One quadrant allowed – try fixed exponent fit

Most often, however, we deal with quantities that cannot go negative: populations, votes, and parties. Then only one quadrant of the open field is allowed in a graph of y vs. x (Figure 14.2). Moreover, there is a natural zero that cannot be shifted: zero persons, zero parties, etc. E.g., for a country with zero square kilometers, it is reasonable to expect zero population and zero parties. Here we should consider the fixed exponent pattern y=Ax^k. When k is positive, it has an anchor point at (0,0). This format does not lead to absurdities. In contrast, many straight lines do, predicting a negative value of y for some positive values of x. For the straight line, we calculated the parameters a and b that describe a particular line. How do we calculate the parameters A and k for a fixed exponent curve? This will be shown toward the end of the chapter.

Figure 14.2. When to try the fixed exponent format.


Two quadrants allowed – try exponential fit

An intermediary situation arises when one of the quantities can range from minus infinity to plus infinity (such as time), while the other cannot go negative (such as population). In a y vs. x graph, two quadrants of the open field are now allowed (Figure 14.3). There is a natural floor at zero persons, but no natural zero for time, and hence no anchor point. Here we should consider the exponential pattern, y=A(B^x). This can also be expressed as y=A·e^(kx), where e=2.71… is the basis of natural logarithms. We have not yet discussed the exponential pattern, but it is very important in life and in sciences. Bank deposits at a fixed interest rate grow exponentially. All young biological beings initially grow exponentially; so do some social or political organizations. The fixed exponent equation y=A(x^k) and the exponential equation y=A(B^x) may look confusingly similar. The first has x raised to a fixed exponent (power), while the second has a constant raised to the exponent x. The difference in outcomes is huge.

Figure 14.3. When to try the exponential format.

How to turn curves into straight lines

Humans are pretty good at telling whether a line is straight, while various curves may look all the same to us. Suppose the conceptually allowed area suggests a fixed exponent or exponential relationship. It helps if we can transform the data in such a way that they would form a straight line – if the relationship is truly a fixed exponent or exponential one.


Then we could graph the transformed data and see at a glance whether the transformed data cloud looks straight. When this is the case, linear regression of the transformed data is justified. If the transformed data cloud still looks bent or otherwise odd, we would have to ponder why this is so and what we have to add to the model so as to straighten out the data cloud. When the transformed graph does show a linear relationship, y=a+bx, we should calculate or estimate the values of a and b. From these we can determine the parameters of the original model – A and k in y=Ax^k, and similarly for exponentials. It can be done without push-button regression, and in fact, this hands-on approach is sometimes preferable. My experience is that students cannot understand and interpret computer-generated regression outputs unless they have acquired the ability to do rough graphs by hand and calculate the parameters using nothing more than a pocket calculator. The description of how to proceed is presented next.

Calculating the parameters of the fixed exponent equation in a single quadrant

If we expect the fixed exponent pattern y=Ax^k, because only one quadrant is allowed, taking logarithms leads to a linear relationship between logy and logx: logy = logA + k·logx. Designating logA as a takes us to the familiar linear form (logy) = a + k(logx). Hence we should graph logy vs. logx. If the transformed data cloud is linear, then y=Ax^k applies. Then we can regress logy vs. logx. How can we find the coefficients A and k in y=Ax^k? We can do it in two ways.

Finding the coefficient values in y=Ax^k from special points on the log-log graph:
• Coefficient A is the value of y where the line crosses the logy axis (because there logx=0 and x=1).
• Exponent k is the ratio -a/c (with a=logA), c being the value of logx where the line crosses the logx axis (because there logy=0).

Finding the coefficient values in y=Ax^k from any two points on the original curved graph y vs. x:
• Take two "typical" points in the data cloud, far away from each other: (x1,y1) and (x2,y2).
• For y=Ax^k we have k = log(y1/y2)/log(x1/x2). Then A = y1/(x1^k).
• Special case: If A=1 is logically imposed, the equation is reduced to y=x^k. Then k = logy1/logx1.
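Again, the two-point recipe can be automated; a minimal Python sketch with invented check points is:

import math

def fixed_exponent_from_two_points(x1, y1, x2, y2):
    # Coefficients A, k of y = A*x^k from two "typical" points
    k = math.log10(y1 / y2) / math.log10(x1 / x2)
    A = y1 / x1**k
    return A, k

# Invented check: two points taken from y = 3*x^0.5 are recovered exactly
print(fixed_exponent_from_two_points(4, 6.0, 25, 15.0))   # (3.0, 0.5)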


Calculating the parameters of the exponential equation in two quadrants

If we expect the exponential pattern y=A(B^x), because only two quadrants are allowed, taking logarithms leads to a linear relationship between logy and non-logged x: logy = logA + x·logB. Designating logA as a and logB as b takes us to the familiar linear form (logy) = a + bx. Hence we should graph logy vs. x itself. If the data cloud is linear, then y=A(B^x) applies. Then we can regress logy vs. x (unlogged). There are often good reasons to use the alternative exponential expression y=A·e^(kx) and natural logarithms (ln). By definition, lne=1. Hence the logarithms are related as lny = lnA + kx = a + kx. We again graph lny vs. x itself. If the resulting data cloud is linear, then y=A·e^(kx) applies. Then we can regress lny vs. x itself. Recall that natural (lnx) and decimal (logx) logarithms relate as lnx=2.30·logx and, conversely, logx=0.434·lnx. Often we can use either logarithm. How can we find the coefficients A and B in y=A(B^x), or A and k in y=A·e^(kx)? In principle, we can again do it in two ways – using special points on the semilog graph or using two points on the original curved graph y vs. x. However, semilog graph papers use decimal logarithms, and we may get confused when shifting from log to ln. So it is safer to use the two-point formula:
• Take two "typical" points of the data cloud, far away from each other: (x1,y1) and (x2,y2).
• For y=A(B^x) we have logB = [log(y1/y2)]/(x1-x2). Then B=10^(logB) and A=y1/(B^x1).
• For y=A·e^(kx) we have k = [ln(y1/y2)]/(x1-x2). Then A=y1·e^(-k·x1). This is often the more useful form.
In general, when the exponential format applies, use y=A·e^(kx) rather than y=A(B^x). Previous chapters have given a fair number of examples of using y=Ax^k. Examples of using y=A·e^(kx) will be given in Chapter 20. Instead of e^(kx) we sometimes write exp(kx). Why? Suppose we divide two exponentials and get exp[k(x1-x2)]. If we try to fit all of k(x1-x2) "upstairs", into the exponent location, this can become confusing.
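The corresponding two-point recipe for the exponential format, as a minimal Python sketch with invented check points:

import math

def exponential_from_two_points(x1, y1, x2, y2):
    # Coefficients A, k of y = A*e^(kx) from two "typical" points
    k = math.log(y1 / y2) / (x1 - x2)
    A = y1 * math.exp(-k * x1)
    return A, k

# Invented check: two points taken from y = 5*e^(0.2x) are recovered exactly
print(exponential_from_two_points(0, 5.0, 10, 5 * math.exp(2.0)))   # (5.0, 0.2)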

Constraints within quadrants: Two kinds of "drawn-out S" curves

Further constraints can enter. It may be that x and y are logically restricted to only part of the positive quadrant. The box 0≤x≤1 and 0≤y≤1 is a marked example. With logical anchor points at (0,0) and (1,1), the model y=Ax^k still applies – and it even simplifies to y=x^k. But if a third anchor point imposes itself at (0.5,0.5), a "drawn-out S" pattern results, as shown and briefly discussed at the end of Chapter 9. Its equation can be written as


y/(1-y) = [x/(1-x)]^k.

This means that log[y/(1-y)]=klog[x/(1-x)]. To test fit to this model, graph log[y/(1-y)] vs. log[x/(1-x)] – it should produce a straight line. If it does so, the model fits, and k is the slope of this line on the log-log graph.

Figure 14.4. Exponential curve and simple logistic curve which starts out the same way.

[Figure: y vs. time t. The exponential curve keeps rising without limit, while the simple logistic curve, which starts out the same way, passes through M/2 at t=0 and levels off toward the ceiling M.]

A quite different “drawn-out S” pattern is shown in Figure 14.4. This comes about when an exponential growth pattern faces a ceiling, like all biological beings do. This means that y is squeezed into the zone between a conceptual floor and ceiling – two quadrants allowed, but with a ceiling on y. When the ceiling on y is M and y reaches M/2 at time t=0, then the simplest model is the simple logistic equation

y = M/(1+e^(-kt)).

A larger k means steeper growth from near-zero to near-ceiling. This equation can be expressed (see EXTRA below) as

y/(M-y) = e^(kt)

and hence ln[y/(M-y)] = kt.

To test fit to the simple logistic model, graph ln[y/(M-y)] vs. t – it should produce a straight line. If it does so, the model fits, and k is the slope of this line. This is the basis for statistical packages for LOGIT regression. The simple logistic curve starts out as an exponential curve. It also approaches the ceiling in an exponential way. We already encountered an exponential approach to a ceiling when dealing with volatility. The simple logistic curve looks like a "drawn-out S", like the three-anchor-point curve in a box. But the ends of the logistic "S" range from minus to plus infinity, while the three-anchor-point curve ranges only from 0 to 1.

EXTRA: How do we get from y=M/(1+e^(-kt)) to y/(M-y)=e^(kt)?
M-y = M - M/(1+e^(-kt)) = [M(1+e^(-kt))-M]/(1+e^(-kt)) = [M·e^(-kt)]/(1+e^(-kt)).
Hence y/(M-y) = {M/(1+e^(-kt))} / {M·e^(-kt)/(1+e^(-kt))} = e^(kt).
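A minimal Python sketch can confirm that the logistic curve becomes a straight line under this transformation (the ceiling M=100 and the rate k=0.8 are invented for illustration):

import math

M, k = 100, 0.8   # illustrative ceiling and growth rate

def logistic(t):
    # y = M/(1 + e^(-kt)), passing through M/2 at t = 0
    return M / (1 + math.exp(-k * t))

for t in (-3, -1, 0, 1, 3):
    y = logistic(t)
    # ln[y/(M-y)] equals k*t exactly -- the straight line behind LOGIT regression
    print(f"t = {t:+d}: y = {y:6.2f}   ln[y/(M-y)] = {math.log(y / (M - y)):+.2f}   k*t = {k*t:+.2f}")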

* Exercise 14.1
This exercise is really basic for understanding when to fit with what. For each of the cases described below, do the following. 1) Sketch a graph showing the constraints (forbidden areas and anchor points). 2) Sketch the broad shape of the resulting curves. 3) Give the general form of the corresponding equation; no need to calculate any of the parameters. 4) Indicate how we should transform the inputs and outputs so as to obtain a straight line on regular graph paper. (NOTE: Use the simplest assumptions possible. Go along with problem statements as given. If you think they are unrealistic, point it out in a post scriptum, but address the exercise first.)
a) A sum of 1000 euros is placed in a savings account earning 3% interest per year. How does this sum (S) change over time (t)? [Show graph as S vs. t, not y vs. x. Show t over several decades – this helps to bring out the trend.]
b) As television expands in a country, rural populations tend to have fewer TV sets than urban populations, because of expense and access. What could be the relationship between the percentages of households with TV in rural (R) and urban (U) areas? [Show graph as R vs. U, not y vs. x.]
c) As radio sets expand in a country, rural populations at first tend to have fewer radio sets than urban populations, because of expense and access. Later on, however, radio sets become more widespread in rural than urban areas, because city people have more of other means for information and leisure. Assume that the tipping point comes when radio ownership reaches 50% both in urban and rural areas. What could be the relationship between the percentages of households with radio sets in rural (R) and urban (U) areas?
d) A few families reach an uninhabited island and find it suitable for agriculture. Over decades, their numbers expand, until they begin to run out of arable land. How does their population (P) change over time (t)?

Exercise 14.2
At the ECPR (European Consortium for Political Research) Workshops in Rennes 2008, Staticia presented a paper that connected the subventions (S) paid by the European Union to various Euroregions to the per capita GDP of those regions (G). She graphed S against logG, found a pretty nice linear fit, and concluded: support decreases as wealth increases. However, the model inherent in this fit violates at both extremes the maxim "Models must not predict absurdities". Explain the nature of the absurdities that result, and offer a better model. The intervening steps are as follows.
a) Show logG on the x-axis, G being GDP per capita in euros. This means marking 1, 10, 100, 1000 and 10,000 on the axis, at equal intervals.
b) Show subventions S on the y-axis, S being in units of million euros. This means marking 0, 100, 200, 300… on the axis, at equal intervals.
c) Enter a straight line at a reasonable negative slope. This is how Staticia's graph looked, with data points crowded around it.
d) Now statistics must yield to thinking. What is the subvention predicted by this straight line at extremely large GDP/cap? Is it possible?
e) And how large would the subvention be when a region's GDP/cap is zero? Could the European Union afford to pay it? (Hint: Where on the x-axis do you place 1, 0.1, 0.01, 0.001, …?)
f) What shape does Staticia's straight line assume when we graph S vs. G, rather than S vs. logG? Hint: Graph S vs. G, starting both from 0, and enter the predictions from parts (d) and (e).
g) Which model would you recommend in view of the following: (1) neither S nor G can be negative; (2) when G=0, then S is some pretty high figure; and (3) when G increases, then S decreases. Sketch the corresponding curve on the graph S vs. G. (NOT S vs. logG!)

h) Write the simplest equation that satisfies those conditions.
i) If Staticia wishes to run a linear test for this logically supported model, which quantities should she graph? Compare to the quantities she did graph (S and logG).
j) But Staticia did get a satisfactory statistical fit, by social science norms. How is this possible, if she graphed in an illogical way? That's an unfair question, unless I also give you her data so you can visualize it. The answer is that, over short ranges, many equations can fit a fairly dispersed data cloud – but most of them lead to absurdities under extreme conditions.
NOTE: Actually, what we have expressed here should stand for per capita subvention. For total subvention, the population of the region should also be taken into account, not only their wealth. Staticia could ignore this only because the regions have roughly similar populations.


C. Interaction of Logical Models and Statistical Approaches

15. The Basics of Linear Regression and Correlation Coefficient R2

• The Ordinary Least Squares (OLS) procedure of linear regression minimizes the sum of the squares of distances between data points and the line.
• Regressing y on x minimizes the squares of vertical distances. Regressing x on y minimizes the squares of horizontal distances. Two different OLS lines result.
• These two OLS lines are directional, not algebraic. This means that if we first estimate y from x, using y ← a+bx, and then estimate x from y, using x ← a'+b'y, we don't get back the original x.
• By OLS, a tall twin's twin tends to be shorter than her twin.
• If we first estimate y from x and then z from y, we don't get the same value as when estimating z directly from x – OLS regression is not transitive.
• Correlation coefficient R2 expresses the degree of lack of scatter of data points. Utter scatter means R2=0. Points perfectly on a line mean R2=1.
• In contrast to OLS slopes, R2 is symmetric in x and y.

Testing logical models with data can involve the use of statistics. Logical models often enable us to transform data so that a linear relationship is expected. Therefore, the basics of linear regression should be clarified here, even though developing skills in statistics is outside the scope of this book. You should acquire those skills, but somewhere else. What is very much within our scope is when and how to use these statistics skills – and as important, when NOT to use them blindly. Statistics books tend to go light on the latter aspect, and even when they point out inappropriate uses, students tend to overlook these sections. The damage can be serious.


Regression of y on x

So let us proceed to linear regression. Suppose that we have 5 moderately scattered data points in a field with no constraints such as forbidden areas or anchor points, as shown in Table 15.1 and Figure 15.1. The mean values of x (x̄) and y (ȳ) are also shown. The point (x̄, ȳ) is the "center of gravity" of the data set.

Table 15.1. Hypothetical data.

x    -3.0    0.5    2.5    3.5    4.5     Mean: x̄ = 1.6
y    -0.5    2.0    0.5    3.0    5.0     Mean: ȳ = 2.0

We want to pass the “best fitting” line through these points. What do we mean by “best fitting”? That’s the catch. There are many ways to define it. But they all agree on one thing: The best-fit line must pass through the center of gravity of the data cloud. The “Ordinary Least Squares” (OLS) method to regress y against x proceeds as follows: We try to minimize the sum of the squares of the vertical deviations from data points to the line. What does this mean? Draw a haphazard line through the data cloud. No, not quite – it must pass through the center of gravity. Draw vertical lines joining the data points to the line (thick lines in Figure 15.1). The lengths of these line segments show how far the data points are from the line, in the vertical direction. Then draw in squares having these lines as their vertical sides. Some of the squares are large while some others are so tiny they do not show up in the graph. Measure the areas of squares and add them. Now try to tilt the line around the center of gravity so that the sum of the squares is reduced.


Figure 15.1. Vertical distances of 5 points to a random line drawn through their center of gravity (mean x, mean y). [CAUTION: THE LINE SHOWN IS PLACED A BIT TOO HIGH. x=1.6 is between the 2nd and 3rd points from the left; and y=2.0 is at the level of the 2nd point. All figures using these data must be redone, and they must show the center of gravity.]

The middle point along the x-axis has a large square – more than the other squares combined, but tilting the line closer to this point does not help much because this point is so close to the center of gravity. (We cannot shift the line away from the center of gravity!) But it pays to tilt the line to a steeper slope, so that the square on the far left is reduced to nothing. The line shown in Figure 15.2 is close to optimal. It passes near the point on the far left, and it balances off the two large squares to the right. Any further tilting of this line would increase the sum of the squares. So this line is close to the best fit by the OLS procedure.


Figure 15.2. Vertical distances of 5 points to the line which roughly minimizes the sum of squares, meaning the best fit y on x.


This is the basic idea behind "regression" to the least squares line. We do not have to carry it out through such graphical trial and error. Equations have been worked out which do the regression exactly. The calculations are rather simple, but they become quite tedious when there are many data points. So these calculations are best left to computer programs. Just feed the coordinates (xi, yi) of all data points into the computer program, and out pops the OLS line y ← a+bx, which best predicts y from x. The danger is that by leaving everything to the computer we do not develop a "fingertip feel" for what OLS does – and what it cannot do. If we apply OLS to improper data (what is improper will be explained later), the computer program for OLS does not protest – it still calculates a line, even if it makes no sense. Junk in → junk out. If you draw mistaken conclusions from this line, don't blame the method – blame the one who misapplied a perfectly good method.
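What such a program actually does takes only a few lines. A minimal sketch in Python, using the data of Table 15.1 (the resulting values reappear in Exercise 15.1):

    import numpy as np

    # Data of Table 15.1.
    x = np.array([-3.0, 0.5, 2.5, 3.5, 4.5])
    y = np.array([-0.5, 2.0, 0.5, 3.0, 5.0])

    # OLS y-on-x: slope b is the sum of (x-X)(y-Y) over the sum of (x-X)^2,
    # and the line passes through the center of gravity (X, Y).
    X, Y = x.mean(), y.mean()
    b = np.sum((x - X) * (y - Y)) / np.sum((x - X)**2)
    a = Y - b * X
    print(round(a, 2), round(b, 2))   # about 1.06 and 0.59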

Reverse regression of x on y

Now do the reverse – try to predict x from y – and the picture changes. Previously we measured the vertical distances from points to the line; now we must measure the horizontal distances.

Figure 15.3 shows the same line as in Figure 15.2, plus horizontal distances to the data points and their squares. Visibly, this line no longer is the best-fit line – some squares are so large they hardly can be shown within the printed page. Tilting this line to a steeper slope could strongly reduce the sum of these horizontally based squares.

Figure 15.3. Previous best fit line, y on x, and horizontal distances to the 5 points: The sum of squares clearly is not minimized.


This tilting is done in Figure 15.4. It shows both the previous best-fit line (dashed), labeled "OLS y-on-x", and a new one, along with its squares, labeled "OLS x-on-y". This line passes close to the point on the upper right, and it balances off the large squares for the lowest two points. The sum of squares is visibly smaller than in the previous graph. This is the best line for predicting x from y: x ← a'+b'y. It is far from the best line for predicting y from x: y=a+bx. The crossing point of the two best-fit lines is at mean x and mean y. The important conclusion is that the OLS lines depend on the direction in which we proceed:


Regressing y on x (minimizing the squares of vertical distances) leads to one OLS line; regressing x on y (minimizing the squares of horizontal distances) leads to a different OLS line.

The reverse OLS line x-on-y is always steeper than the OLS line y-on-x, when x is on the horizontal axis.
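Both slopes can be extracted from the same five data points. A minimal sketch, again with the Table 15.1 data (the numbers reappear in Exercise 15.1):

    import numpy as np

    x = np.array([-3.0, 0.5, 2.5, 3.5, 4.5])   # Table 15.1 again
    y = np.array([-0.5, 2.0, 0.5, 3.0, 5.0])
    X, Y = x.mean(), y.mean()

    b_yx = np.sum((x - X) * (y - Y)) / np.sum((x - X)**2)   # OLS y-on-x, slope dy/dx
    b_xy = np.sum((x - X) * (y - Y)) / np.sum((y - Y)**2)   # OLS x-on-y, slope dx/dy
    print(round(b_yx, 2))      # about 0.59
    print(round(1 / b_xy, 2))  # about 0.89 – the x-on-y line, read as dy/dx, is steeper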

Figure 15.4. Horizontal distances of 5 points to the line that roughly minimizes the sum of squares based on them – the best fit x-on-y. The best-fit y-on-x is the dashed line.


Directionality of the two OLS lines: A tall twin's twin tends to be shorter than her twin

Directionality of OLS lines is nice for some purposes and disastrous for some others. Suppose that we take female twins and randomly assign one of them to Group X and the other to Group Y. Then we graph their heights, y vs. x. The logical model which applies to this graph obviously is y=x, which also means x=y. It is non-directional. Boringly obvious? But wait a bit!


Suppose I tell you that the twin X is 190 cm (6 ft. 3 in.) and ask you to guess at the height of her twin. You are paid or penalized depending on how close your guess is. If you offer 190 cm, you have more than a 50-50 chance of overestimating the twin's height. Why? You know that 190 cm is way over the mean height. The chances that the twin of such a tall woman matches such an unusual height are less than 50-50. So I would guess at 188 cm (6'2") at most – and maybe even lower. How much lower? This is precisely what OLS y-on-x is about. A tall twin's twin tends to be shorter than her twin, and OLS tells you by how much, on the average. Compared to the slope 1 of y=x, the OLS slope is somewhat less than 1. Suppose the scatter in heights of twins is such that, for 50-50 chances of under- and overestimating, OLS tells us to guess at y=187 cm when x=190 cm. If we next were told that a person in the Y group is 187 cm, what would be our guess for her twin? Certainly not 190! This would mean using the y-on-x OLS line, y=a+bx, going in the wrong direction. Indeed, 187 cm is still unusually tall, so we would guess at about 185 for her sister. This means we now are using the opposite x-on-y line. In the usual y vs. x way of graphing, this OLS slope is somewhat more than 1. Where would such a game end? Our successive estimates would approach the median height of women, but with ever-smaller increments. Thus we never reach the median height itself, technically speaking. In the reverse direction, if we were told that the first twin is 140 cm (4'7"), we would guess at her twin being somewhat taller – so we again approach the mean height. Are you getting dizzy? I would. To make sure I really understand, I would need a specific example, and I would need to work it out on my own, rather than observing someone else doing it. This is what Exercise 15.1 is about.

Exercise 15.1
The best-fit line y on x in Figure 15.4 is y=1.06+0.59x. This is what we would use to estimate y from x. The best-fit line x on y is y=0.57+0.89x. But this line is used to estimate x from y, so we have to transpose it to x=-0.64+1.12y. To keep track of which equation goes in which direction, we better write them as y ← 1.06+0.59x and x ← -0.64+1.12y.
a) Graph the presumed data (from Table 15.1) and the two lines. They cross at the center of gravity G, where x=1.6, y=2.0. These are the mean values X and Y of x and y, respectively.
b) If another country has x=4.0, what would be the corresponding value of y? Call this point A.


c) Use this value of y to calculate the corresponding value of x. Call this point B.
d) Use this value of x and calculate the corresponding y. Call this point C.
e) Show the points A, B, and C on the graph, as well as arrows A to B and B to C.
f) What would happen, if we continue this game? Where would we eventually end up?
g) If we write and apply a computer program for the process above, how much time might the process take, with today's computer capabilities, to reach perfectly this end point?
h) What would happen, if we start with a very low value of x? Where would we now end up?

In sum, OLS is the best way to predict an unknown quantity from a known one – a practical purpose, which is directional. But the scientific "law", on which such "engineering" is based, is still y=x=y. It is not "a tall twin's twin tends to be shorter than her twin, who herself tends to be shorter than her twin", which would mean y<x<y.

[xy] = [yx] algebraic equations

The OLS equations, however, are directional:

[xy] ≠ [yx] OLS regression

At low scatter, the difference is negligible. At high scatter, it becomes enormous. For logical models expressed as algebraic equations, a single line must work in both directions. It's a fair guess that its slope should be intermediary between the lower slope of y-on-x and the higher slope of x-on-y of the two OLS lines. This symmetric linear regression will be presented in the next chapter. For the moment, just keep in mind that it matters in which direction one carries out standard OLS. The custom of showing OLS equations as y=a+bx is misleading – it really is y ← a+bx.
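To see the directional asymmetry in plain numbers, here is a tiny sketch that applies the two OLS equations quoted in Exercise 15.1, one after the other:

    # The two directional OLS equations from Exercise 15.1:
    #   y <- 1.06 + 0.59x   (estimate y from x)
    #   x <- -0.64 + 1.12y  (estimate x from y)
    x0 = 4.0
    y1 = 1.06 + 0.59 * x0     # estimate y from x
    x1 = -0.64 + 1.12 * y1    # now estimate x back from that y
    print(y1, x1)             # x1 comes out near 3.2, not the original 4.0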


Non-transitivity of OLS regression

Suppose logical considerations suggest that x has an impact on y, which in turn has an impact on z. Symbolically: x→y→z. Average cabinet duration (C) is one such example. For a given assembly size, the number of seats in the electoral district (district magnitude M) largely determines the effective number of parties (N), which in turn largely determines cabinet duration: M→N→C. Sometimes we may already have calculated N from M, and we want to use it to calculate C. This means M→N, followed by N→C. Some other times we may wish to estimate C directly from M: M→C. We expect that it should not matter which way we go. The outcome should be the same, regardless of whether it is M→N→C or M→C. This is what transitivity means, and it applies to the algebraic equations. Symbolically:

[xyz] = [xz] algebraic equations

The trouble with OLS regression equations is that, in contrast to algebraic equations, they are not transitive. When we regress the number of parties on district magnitude and then regress cabinet duration on the number of parties, we get one relationship between district magnitude and cabinet duration. We get a different one when regressing cabinet duration directly on district magnitude: M→N→C is not the same as M→C. At low scatter, the difference is negligible. At high scatter, it can become enormous. Symbolically:

[xyz] ≠ [xz] OLS regression, high scatter

Why is this so? This follows from lack of directionality. If we cannot go back to the same value of x after passing through y (Exercise 15.1), then we cannot reach the same value of z when going there through y rather than directly. Most logical models are transitive, like M→N→C. Thus OLS regression works in model testing only when scatter is low – which often works out in physics but rarely does in social sciences.
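A synthetic illustration of this non-transitivity: suppose x, y and z are three noisy measurements of the same underlying quantity, an assumption made up purely for this sketch. Chaining two OLS regressions then yields a visibly flatter slope than regressing z on x directly:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    trait = rng.normal(0.0, 1.0, n)          # underlying quantity (hypothetical)
    x = trait + rng.normal(0.0, 1.0, n)      # each variable = trait + its own noise
    y = trait + rng.normal(0.0, 1.0, n)
    z = trait + rng.normal(0.0, 1.0, n)

    def ols_slope(u, v):
        """Slope of the OLS regression of v on u."""
        return np.sum((u - u.mean()) * (v - v.mean())) / np.sum((u - u.mean())**2)

    chained = ols_slope(x, y) * ols_slope(y, z)   # x -> y, then y -> z
    direct  = ols_slope(x, z)                     # x -> z in one step
    print(round(chained, 2), round(direct, 2))    # the chained slope is clearly smaller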

Correlation coefficient R2

Along with the coefficients a and b in the OLS linear regression equation y ← a+bx, one usually reports R2. This number expresses the degree of lack of scatter of data points. If the points line up perfectly, R2 is 1. If the points form a blob without any direction, then R2 is 0.


Figure 15.5. Linear regression presumes a roughly elliptic data cloud, without curvature.

One can use just R. In addition to scatter, R also indicates whether the slope is up or down. Figure 15.5 offers some schematic examples. Imagine that the ellipses shown are rather uniformly filled with data points, and there are none outside the ellipse. The slimmer the ellipse, the higher R2 is. Why is R2 used more frequently than R, when the latter gives more information? It can be said that R2 expresses the share of variation in y that is accounted for by the variation in x. The rest of the variation in y is random noise, insofar as dependence on x is concerned. For the data in Figures 15.1 to 15.4, R2=0.66. So as to give more substance to these schematic examples, Figures 15.6 and 15.7 reproduce previous Figures 11.2 and 12.1, respectively. Figure 15.6 (R2=0.51, hence R=0.51^(1/2)=+0.71) is akin to the first ellipse in Figure 15.5, tilting up (R positive) and moderately slim. Figure 15.7 (R2=0.787, hence R=-0.787^(1/2)=-0.89) is akin to the last ellipse in Figure 15.5, tilting down (R negative) and quite slim.
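The same sums that produced the OLS slopes also produce R2 and R. A minimal check with the Table 15.1 data:

    import numpy as np

    x = np.array([-3.0, 0.5, 2.5, 3.5, 4.5])
    y = np.array([-0.5, 2.0, 0.5, 3.0, 5.0])
    dx, dy = x - x.mean(), y - y.mean()

    R2 = np.sum(dx * dy)**2 / (np.sum(dx**2) * np.sum(dy**2))
    R = np.sign(np.sum(dx * dy)) * np.sqrt(R2)    # R carries the sign of the slope
    print(round(R2, 2), round(R, 2))              # about 0.66 and +0.81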


Figure 15.6. Individual-level volatility of votes vs. effective number of electoral parties: data and best linear fit from Heath (2005), plus coarse and refined predictive models. For best-fit line, R2=0.51.

[Figure: individual-level volatility (%) on the vertical axis vs. effective number of electoral parties on the horizontal axis, with the conceptually forbidden areas, the anchor point at N=1, the best-fit line, and the model curves.]

Figure 15.7. Mean cabinet duration vs. effective number of legislative parties – Predictive model and regression line. Source: Taagepera and Sikk (2007).

Notes: Thin solid line: best fit between logarithms. Bold solid line: theoretically based prediction [C = 42 years/N^2]. Dashed lines: one-half and double the expected value.


Exercise 15.2
In Figures 15.4, 15.6, and 15.7, take a pencil and lightly sketch ellipses that encompass the data points. Comparing the slimness of these ellipses to those in Figure 15.5, estimate the corresponding R-squares. Compare your results to the actual R-squares (0.66, 0.51, and 0.79, respectively).

Both notation and names in the literature can be confusing. Kvålseth (1985) presents no less than 8 different expressions for R2 that appear throughout the literature. They most often yield approximately the same result, but for some odd data constellations they can differ. Some are called coefficients of dispersion or of determination. Some sources distinguish between R2 and r2. At the risk of omitting significant differences, this book uses just “correlation coefficient R2”. It is always positive, when applied to the scatter of the data cloud as such.

Figure 15.8. When scatter is extensive, the two OLS lines diverge from the main axis of an elliptic data cloud tilted at 45 degrees.

[Labels in figure: "Symmetric line – main axis of data cloud"; "R2 – measure of slimness of data cloud"; "OLS y-on-x – scatter-reduced slope"; "OLS x-on-y – scatter-enhanced slope". C marks the center of gravity.]


The R2 measures lack of scatter: But scatter along which line?

In previous Figure 15.5, the main axis of the ellipse (dashed line) is visibly the best-fit line for all three data clouds. Is this line the OLS regression line? But if so, which one would it be – y-on-x or x-on-y? Actually, the main axis is intermediary between the two OLS lines; we will see that it's the symmetric regression line (as long as the main axis is roughly at 45 degrees). All three lines pass through the center of gravity C of the data cloud. For the slim data cloud on the right in Figure 15.5, the three regression lines are practically the same. But for the almost round data cloud in the center they diverge, as shown in Figure 15.8. Here the y-on-x line has a much shallower slope than the central line, while the line x-on-y has a much steeper slope. This is a general feature of OLS regression: As scatter increases, the slope of y on x is reduced, while the slope of x on y is enhanced. As these "scissors" between the two OLS lines widen, R-square decreases. Although R-square often is reported along with a single OLS line (y on x), it has exactly the same value for the other OLS line (x on y). It goes with the combination of both lines.

16. Symmetric Regression and its Relationship to R2

• Symmetric regression line minimizes the sum of rectangles formed by vertical and horizontal distances from data points to line.
• The slope B of the symmetric regression line is a pure measure of slope, independent of scatter. Similarly, R2 is a pure measure of lack of scatter, independent of slope.
• Together, B and R2 tell us how tilted and how slim the data cloud is – B expresses the slope of the main axis, and R2 the slimness of the ellipse.
• The slopes of the two OLS lines result from a combination of these pure measures. They are mixtures of slope B and scatter R2.
• The symmetric regression equation is multi-directional and transitive. It is an algebraic equation, suitable for interlocking relationships.
• Being one-directional, OLS equations cannot represent interlocking relationships.

We have seen that there are two OLS lines, and their equations are directional and non-transitive. They are so because they treat x and y asymmetrically, minimizing squares of deviations either in the vertical or in the horizontal direction. Testing of logical models might be on safer grounds with a regression method that treats x and y in a symmetric way.


From minimizing the sum of squares to minimizing the sum of rectangles

How could we regress so that x and y enter in a symmetric way? Ask first the reverse question: What caused the asymmetry in the OLS procedure? This came about because we measured either the vertical distances between data points and the line, or the horizontal. Well, take them now both into account, on an equal basis. Start with the OLS line y-on-x in previous Figure 15.2, but show both the vertical and horizontal distances of points to line (Figure 16.1). These lines form the two sides of rectangles. The two remaining sides are shown as dashed lines. Now look at the areas of these rectangles. Could we reduce their sum? Visibly, a steeper line could reduce the areas of the two largest rectangles. Compared to the two OLS lines in Figure 15.4, an intermediary line minimizes the sum of rectangles, as seen in Figure 16.2. To minimize clutter, only two sides of the rectangles are shown. (In fact, the entire argument could be made on the basis of the areas of the triangles delineated by the vertical and horizontal distances and the line.)

Symmetric regression line minimizes the sum of rectangles formed by vertical and horizontal distances from data points to line.
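This minimization can also be done by brute force. A minimal sketch (with a small made-up data set, not the book's figures) searches over slopes through the center of gravity; the winning slope equals the symmetric-regression slope B introduced later in this chapter:

    import numpy as np

    # Small made-up data set (illustration only).
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([0.5, 2.5, 1.5, 4.0, 3.5])
    X, Y = x.mean(), y.mean()

    def sum_of_rectangles(slope):
        """Sum of |vertical| * |horizontal| distances to the line of given slope through (X, Y)."""
        a = Y - slope * X
        vertical = np.abs(y - (a + slope * x))
        horizontal = np.abs(x - (y - a) / slope)
        return np.sum(vertical * horizontal)

    slopes = np.linspace(0.1, 3.0, 2901)                  # brute-force search grid
    best = slopes[np.argmin([sum_of_rectangles(s) for s in slopes])]
    B = np.sqrt(np.sum((y - Y)**2) / np.sum((x - X)**2))  # symmetric-regression slope
    print(round(best, 3), round(B, 3))                    # the two agree to grid precision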

Figure 16.1. Vertical and horizontal distances of 5 points to the best-fit line y on x.



Figure 16.2. Vertical and horizontal distances of 5 points to the line that roughly minimizes the sum of rectangles (or triangles) based on them – the best fit symmetric in x and y.

[Labels in figure: OLS y-on-x, OLS x-on-y, and the symmetric regression line.]

Symmetric regression line equations are multi-directional and transitive. In this sense, they are algebraic equations. In terms of the example in the preceding chapter, symmetric regression M→N, followed by regression N→C, yields the same result as direct regression M→C. Symbolically:

[xyz] = [xz] symmetric regression equation

This is crucial for establishing interlocking relationships (Chapter 10). Interlocking relationships need two-directional equations. Hence they cannot be based on OLS regressions. They can be based on symmetric regression.

How R-squared connects with the slopes of regression lines

Slope B of the symmetric regression line is the geometric mean of the slopes (b and b″) of the two standard OLS lines in relation to the same axis:

B = ±(b·b″)^(1/2),

the sign being the same as for b and b″. It can also be shown that R-squared is the ratio of the two OLS slopes:

R2=b/b”.

In sum, the following picture emerges. The slope B of the symmetric regression line is a pure measure of slope, independent of scatter. Similarly, R2 is a pure measure of lack of scatter, independent of slope. Together, B and R2 tell us how tilted and how slim the data cloud is – B expresses the slope of the main axis, and R2 the slimness of the ellipse. The slopes of the two OLS lines result from a combination of these pure measures. They are mixtures of slope B and scatter R2 – slopes reduced or enhanced by scatter. For a full description of the data cloud, we also need the coordinates of its center of gravity (mean x and y), where the three regression lines cross. Proof of the relationships above is left to the EXTRA section. It involves more mathematics than the rest of this book.
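Short of working through the proof, both relationships can be verified numerically. A minimal sketch with a synthetic scattered cloud (made-up data, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, 300)
    y = 0.7 * x + rng.normal(0.0, 1.0, 300)        # synthetic scattered cloud
    dx, dy = x - x.mean(), y - y.mean()

    b  = np.sum(dx * dy) / np.sum(dx**2)           # OLS y-on-x slope (dy/dx)
    b2 = np.sum(dy**2) / np.sum(dx * dy)           # OLS x-on-y slope, re-expressed as dy/dx
    B  = np.sign(b) * np.sqrt(b * b2)              # geometric mean of the two OLS slopes
    R2 = np.sum(dx * dy)**2 / (np.sum(dx**2) * np.sum(dy**2))

    print(round(B, 3), round(np.sqrt(np.sum(dy**2) / np.sum(dx**2)), 3))  # both routes give B
    print(round(R2, 3), round(b / b2, 3))                                 # R2 equals b/b''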

Exercise 16.1
On top of your graph in Exercise 15.1 also graph the symmetric regression line and determine its equation. Proceed as follows.
a) The two OLS lines are y=1.06+0.59x and y=0.57+0.89x. Calculate the slope (B) of the symmetric line as the geometric mean of the slopes of the two OLS lines.
b) Pass the line with that slope through the center of gravity G(1.6, 2.0).
c) The symmetric line has the form y=a+Bx. To find a, plug x=1.6, y=2.0 into this equation and solve for a.


Figure 16.3. OLS and symmetric regression lines for three points.

[Figure: three data points A, B, and D and their center of gravity C; the x-axis runs from 0 to 2 and the y-axis to 1.]

* Exercise 16.2
Figure 16.3 shows a most simple case: just 3 points (A, B, and D).
a) Copy the data points on square paper so as to be able to place the regression lines accurately. However, use the same scale on both axes (in contrast to Figure 16.3) – one unit must have the same length on both axes.
b) Specify two points on the OLS regression line y on x. How? Ask which value of y minimizes the squares of vertical distances to the line when x=0. Do the same at x=2.
c) Draw this OLS line on the graph. Calculate its equation.
d) Now specify two points on the OLS regression line x on y. Draw it, and calculate its equation.
e) Calculate the coordinates of the center of gravity C. How? At the point where two lines cross, the same value of y satisfies the equations of both lines.
f) Calculate the slope of the symmetric regression line. Compare it to the slope of the line joining the points B and D. (CAUTION: Here we have a point labeled B, but the same symbol is used in text for the slope of the symmetric line. Keep the two "B"s separate in your mind.)
g) Calculate the intercept a for the symmetric regression line. How? This line, too, passes through the center of gravity. So plug its slope and the coordinates of C into y=a+Bx.
h) Draw in the symmetric regression line.
i) Draw conclusions.


EXTRA: The mathematics of R2 and the slopes of regression lines

This section, more mathematical than the rest of this book, is not obligatory reading. We can get the formulas for the OLS lines in any basic statistics book, but few of these books even mention symmetric regression. The present section enables us to make the connection, if one is interested. All three regression lines pass through the center of gravity G of the data cloud. The coordinates of G are mean x (X=Σxi/n) and mean y (Y=Σyi/n) of all the data points. Indices i refer to the coordinates of individual data points. The slope B of the symmetric regression line can be calculated directly from data (Taagepera 2008: 173–174):

B = ±[Σ(yi-Y)^2 / Σ(xi-X)^2]^(1/2).

With indices dropped and the axes shifted so that the means become X=Y=0, B = ±[Σy^2/Σx^2]^(1/2).

The sign of B (+ or -) is the same as for correlation coefficient R. This formula may not look symmetric in x and y, given that y is on top and x at the bottom. But keep in mind that any slope stands for dy/dx. When we introduce B=dy/dx, the result is symmetric in x and y: (dy)^2/Σy^2 = (dx)^2/Σx^2. This slope B is a pure measure of slope in the sense that, if we increase random scatter, this slope does not systematically shift up or down. The formula for R2 is also symmetric in x and y:

R^2 = [Σ(yi-Y)(xi-X)]^2 / [Σ(xi-X)^2 · Σ(yi-Y)^2].

With indices dropped and X=Y=0, it becomes

R2=[xy]2/[x2y2].

It is a pure measure of lack of scatter in the sense that it does not depend on slope. In previous pictures of roughly elliptic data clouds (Figures 15.5 and 15.8), B and R2 tell us how tilted and how slim or roundish the data cloud is – B reflects the slope of the main axis (as long as the main axis is roughly at 45 degrees), and R2 the relative width of the ellipse. This means that the slope of the symmetric regression line is the scatter-independent complement to R2.

(For fuller description, we also need the coordinates of the center and the range of values of x or y.) The slopes of the two OLS lines result from a combination of these pure measures. They are mixtures of slope B and scatter R2. We usually measure the slope of OLS y on x relative to the x axis (b=dy/dx). But the slope of OLS x on y might be measured either relative to the y axis (b'=dx/dy) or relative to the x axis (b″=dy/dx), so that b'·b″=1. So we have three relationships, as pictured in previous Figure 15.8:

b = |R|·B      OLS y on x – scatter-reduced slope.
b' = |R|/B     OLS x on y, slope relative to the y axis, scatter-reduced.
b″ = B/|R|     OLS x on y, slope relative to the x axis, scatter-enhanced.

It follows that b/b″=R2, when the slopes of both OLS lines are measured with respect to the x axis. If further random fluctuation is imposed on a given data set, R2 is reduced. This means that the ratio of the slopes of the OLS lines is reduced. How does this reduction come about? The slope of the OLS line y-on-x (b) is reduced, while the slope of the OLS line x-on-y (b″) is enhanced, so that both contribute to the reduction in R2. But this means that the slope of each OLS line is affected by the degree of scatter. Hence the OLS slope is a mixed indicator of steepness of linear trend and of scatter around this trend. Note that R2 can be visualized as degree of lack of scatter, either along the symmetric line or along the combination of the two OLS lines. It would be misleading to visualize R2 as degree of lack of scatter along a single OLS line. This can be seen in Figure 15.8: The scatter expressed by R2 is distributed lopsidedly around the OLS line, while it is distributed evenly around the symmetric regression line. When R2 is low, reporting a single OLS slope along with R2 effectively means counting the real slope at half-weight (B in |R|B) and the degree of scatter at one-and-a-half weights (R2 plus |R| in |R|B). It also follows from the equations above that the symmetric slope B is the geometric mean of the slopes of the two standard OLS lines in relation to the same axis (b and b″):

B = ±(b·b″)^(1/2), the sign being the same as for b and b″. When the slope of the x-on-y line is measured with respect to the y axis (b'=1/b″), the relationship is B = ±(b/b')^(1/2).

17. When is Linear Fit Justified?

• A linear fit of data is acceptable only if the data cloud is uniformly dispersed within a roughly elliptic area, with no visible curvature or structure. Even then the linear fit must not violate logical constraints.
• Always graph all data y vs. x before pushing the regression button. Then carry out linear regression only when it makes sense.
• The use of R2 is subject to the same limitations.
• The least squares and symmetric regression methods are both quite sensitive to outliers, and so is R2.

When we are given just the equation of the appropriate regression line plus R2, but no graph, we tend to imagine what was shown in Figure 15.8: a roughly ellipse-shaped data cloud, tilted according to the slope b in y=a+bx, with ellipse slimness corresponding to the value of R2. This is an adequate mental picture under the following conditions:

• the center of gravity C (mean x, mean y) does express the center of the data cloud in a meaningful way;
• the regression line does express the main axis of the data cloud in a meaningful way; and
• R2 does express the dispersion of data around this line in a meaningful way.

Recall that the main axis of the data cloud is the line of symmetric fit (when its slope is close to 1). If instead y=a+bx stands for OLS regression y-on-x and R2 is low, it really means y ← a+bx. We would then tend to underestimate the actual slope of the ellipse main axis. But this may be a relatively minor distortion. Real trouble is that all too often data clouds do not look at all like ellipses.


Many data clouds do not resemble ellipses

Data clouds may look like bent sausages or even like croissants, as in Figure 17.1. (The data points and regression line were copied from a published graph; center of gravity C and dashed parabola have been added.) Here linear regression is misleading because

• the center of gravity lies in a zone where few data points occur;
• the y-on-x regression line passes through a zone with few data points and does not express the configuration, conjuring in our minds the false image of a tilted ellipse; and
• the value of R2, low as it is bound to be in Figure 17.1, wrongly conjures the image of an almost circular data cloud (as in Figure 15.5, center) rather than the actual complex configuration.

Symmetric regression would be no better. No simple curve fits here, but a roughly parabolic fit (dashed curve) would be more expressive of the pattern than any straight line. How do we know whether our data cloud is roughly elliptic or not? The blunt advice is:

Always graph all data y vs. x before pushing the regression button. Then carry out linear regression only when it makes sense.

Whenever the data cloud has even a slight curvature (bent sausage rather than a straight one), consider some data transformation so as to straighten the pattern, before regressing. How do we carry out such a transformation? Recall Chapter 14. Applying linear regression to data configurations not suited for it can lead to monumental mischaracterization of the data. This is so important that further cautionary examples are due.


Figure 17.1. A case where any linear regression would be misleading. The OLS line y-on-x is shown. The complementary OLS line x-on-y, also passing through center of gravity C, would be almost vertical.


Grossly different patterns can lead to the same regression lines and R2

Consider the examples in Figure 17.2. Assume we have no information on any conceptual limitations. Those four data configurations have been designed so that they all lead to exactly the same linear regression lines: The center of gravity is the same: x=9.00, y=7.50. OLS y-on-x yields y=3.00+0.50x, OLS x-on-y yields y=0.75+0.75x, and the symmetric regression line is y=1.9+0.61x. The four configurations also have the same correlation coefficient, a pretty high one (R2 = 0.67), if we deemed it appropriate to apply a linear fit. But in which of these cases does a linear fit make sense?


Figure 17.2. Data that correspond to exactly the same linear regression lines and R-square (Anscombe 1973 for y1 to y3, Taagepera 2008: 201 for y4). But does a linear fit make sense?

[Four panels: y1, y2, y3, and y4, each plotted against x; x runs from 0 to 20 and y from 0 to 10 in every panel.]


• Constellation y1: A linear fit looks acceptable because the data cloud is uniformly dispersed, with hardly any visible curvature. One could draw an ellipse around the data points, and the crowdedness of points would be roughly the same throughout the ellipse. (True, one might detect an empty region in the lower center, meaning that a slightly bent curve would fit better. But in the absence of conceptual constraints we might gloss over it.)
• Constellation y2: The points fit neatly a parabolic-looking curve, and a corresponding transformation should be applied before linear fitting. Linear fit to raw data would be ludicrous. Random deviation from a regular pattern is much less than intimated by R2=0.67. The parabolic transformation could be based on statistical considerations, but this would also be prime time for asking: why is it that y first rises and then falls with increasing x?

• Constellation y3: It has 10 points perfectly aligned, while one point is a blatant outlier. This point clearly does not belong and should be omitted, before carrying out regression. (The statistical justification for deletion is that it deviates by more than 3 standard deviations.) When this outlier is omitted, the slope is slightly lower than previously calculated, and R2 approaches 1.00. One should try to figure out how the outlier came to be included in the first place. Maybe there was a typo in the data table – this happens.

• Constellation y4: The pattern is far from a rising straight line. We observe two distinct populations where y actually decreases with increasing x, plus an isolate. This pattern should make us wonder about the underlying structure: Why is there such an odd pattern? Reporting only the rising regression line would misrepresent the data.

Note that none of the peculiarities of the three latter cases would be noticed, if one just used tabulated data and passively went on to linear regression. One must graph the data! The use of R2 is subject to the same limitations. If only the linear regression results are reported, we would imagine a data cloud like y1, and never imagine that it could be like the three others.


Exercise 17.1 Make an exact copy of Figure 17.2. If you can’t use a copy machine, paste this graph on a well-lighted window, paste blank paper on top of it, and trace carefully all the points. Add the symmetric regression line y=1.9+0.61x to all 4 graphs. [Note that the center of gravity (9.00, 7.50) is on this regression line.] The discrepancies characterized above will become more apparent. (CAUTION: The scales on the four graphs are somewhat different, so one must scale off the distances on each of them separately. Sorry for this inconvenience.) * Exercise 17.2 Make an exact copy of Figure 17.2. If you can’t use a copy machine, paste this graph on a well-lighted window, paste blank paper on top of it, and trace carefully all the points. All four configurations in Figure 17.2 have the same arithmetic means for x and y: x=9.00, y=7.50. Place this point on all four graphs. (CAUTION: The scales on the four graphs are somewhat different, so one must scale off the distances on each of them separately. Sorry for this inconvenience.) a) In which cases, if any, would the use of these particular values of arithmetic means be justified, because they adequately characterize something about the data cloud? b) In which cases, if any, would calculation of arithmetic means be justified once something has been done with the data? c) In which cases, if any, should one use means different from arithmetic? d) In which cases, if any, should one give up on trying to calculate any kind of means, because they would leave a mistaken impression of the actual configuration? Do not assume that one and only one case fits each question!

Sensitivity to outliers

Linear regression is a trickier business than some social scientists realize. In particular, the least squares regression method is quite sensitive to extreme values (outliers), and so is R2 (Kvålseth 1985). Symmetric linear regression shares the same problem. This is illustrated by the third example in Figure 17.2. While the single outlier affects the center of gravity only slightly, it makes the slope of the regression lines much steeper and lowers R2 from nearly 1.00 to 0.67. Any point far out of a generally ellipse-shaped data cloud can mess up the results.
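The effect is easy to reproduce. A minimal sketch with made-up values (ten points on a perfect line plus one far-off point; the numbers differ from Anscombe's, but the pattern is the same):

    import numpy as np

    x = np.arange(1.0, 11.0)            # 1, 2, ..., 10
    y = 1.0 + 0.5 * x                   # ten points exactly on a line of slope 0.5
    y[9] = 12.0                         # push the last point far off the line
    dx, dy = x - x.mean(), y - y.mean()

    b  = np.sum(dx * dy) / np.sum(dx**2)
    R2 = np.sum(dx * dy)**2 / (np.sum(dx**2) * np.sum(dy**2))
    print(round(b, 2), round(R2, 2))    # slope pulled well above 0.5, R2 well below 1

    x9, y9 = x[:9], y[:9]               # drop the outlier
    dx9, dy9 = x9 - x9.mean(), y9 - y9.mean()
    b9  = np.sum(dx9 * dy9) / np.sum(dx9**2)
    R29 = np.sum(dx9 * dy9)**2 / (np.sum(dx9**2) * np.sum(dy9**2))
    print(round(b9, 2), round(R29, 2))  # back to slope 0.5 and R2 = 1.0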


Outliers can legitimately be excluded from a set by some statistical considerations, such as being off by three (or even just two) standard deviations, in an otherwise normal distribution. There are also more refined regression methods that minimize their impact, such as Tukey’s outlier-resistant method (Kvålseth 1985). The main thing is to know when trouble looms, so that one can consult. How does one smell trouble? By graphing and eyeballing.

Empirical configuration and logical constraints

Linear regression, be it standard OLS or symmetric, may be used only when linear fit is justified in the first place. When is it, and when isn't it? One must check for acceptable configuration of the empirical data, as has been done above. In addition, conceptual agreement with constraints is also needed. Conceptual agreement includes passing through anchor points and avoidance of forbidden areas. Thus for volatility in Figure 15.6 the best linear fit V-on-N narrowly misses the anchor point at N=1. (The reverse fit of N-on-V would err in the opposite direction. The symmetric regression line would be in between, and hence very close to the anchor point.) The linear model does respect the anchor point but goes through the ceiling at V=100 – so we have to treat the linear model with great caution, as a merely local model. Linear fit is acceptable only when volatility is much below 50 (which most often is the case). Compatibility of linear fits with logical constraints will surface again and again later on. The published graph that is copied in Figure 17.1 actually includes more information: each data point carries the name of a country in Europe. Upon inspection, all the top points pertain to Northern and Central Europe, and the bottom points to Southern Europe. Hence there is no parabola. There is one downward trend line in Northern Europe, and an upward one in Southern Europe! The authors of the published article missed it. They did the right thing by graphing and labeling the data points. But then they forgot to ask: "What can we see in this graph?" (Cf. Chapter 12.) They went blindly statistical and missed the main substantive feature. This happens often, when data graphs are published. All too often only statistical analysis is published, without graphs. Then one can only wonder what could be hidden behind it. It could be garbage like fitting the data in Figure 17.1 with a single OLS line, or nuggets like discovering different trends in Northern and Southern Europe.


Exercise 17.3 In Figure 17.1, assume that all points above the OLS line pertain to Northern and Central Europe, and those below the line to Southern Europe. Sketch in the separate best-fit lines for the two groups. Exercise 17.4 In the top graph (y1) of Figure 17.2, join the top 6 points with a curve, then do the same for the bottom 5 points. Does the pattern still look straight? Compare to the next pattern (y2).

18. Federalism in a Box

• Even the best graphs published in social sciences can be improved on.
• Placing a data graph in a frame that does not correspond to the logically allowed area can distort our impression of the data.
• Showing a regression line that extends into logically forbidden area can distort our impression of the data.
• Recall that there is no single OLS regression line. When data are scattered, the slopes of the two OLS lines (y on x, and x on y) diverge wildly. A symmetric regression line with an intermediary slope is available.
• When the conceptually possible range of x goes from a to a larger number b, it can be converted to range 0 to 1, by X = (x-a)/(b-a).

This chapter presents further examples of thinking inside a box, comparing linear regression to fixed exponent curves. No full-fledged quantitatively predictive logical model could be developed in these cases – it is not that easy! The point is to show that more could be done in the direction of model building than at first met the eye. Arend Lijphart’s Patterns of Democracy (1999) is a landmark in the study of democratic institutions, including various aspects of federalism.

Constitutional rigidity and judicial review

One might expect that federalism needs an appreciable degree of "constitutional rigidity". This means a written constitution that is hard to change, because this is the only way to specify and protect the rights of the federal subunits. Rather than measured quantities, Lijphart uses here informed estimates. He does so, on a 1 to 4 point scale, where 1 stands for no written constitution and 4 for one that is extremely hard to modify. A related concern is "judicial review of laws". This means that courts can pass judgment on whether a law conforms to the constitution. Lijphart again uses a 1 to 4 scale, where 1 stands for no judicial review and 4 for a very powerful one. One cannot declare a law unconstitutional when there is no constitution! Thus a rating "1" on constitutional rigidity should exclude anything above "1" on judicial review. This looks like a logical anchor point. But how are these two features related otherwise, if at all?

Figure 18.1 shows first Lijphart’s (1999: 229) graph of judicial review vs. constitutional rigidity. Then it shows the same graph with some additions. Cover up the bottom part, for the moment, and focus on the top. What do we see? There is an almost spherical data cloud in the center of the field, while the extremes are uninhabited. And there is a regression line, which is only slightly tilted. This suggests that the extent of judicial review increases with constitutional rigidity, but only mildly. The text in Lijphart (1999: 229) states that the correlation coefficient is 0.39. This corresponds to a very low R2=0.15, which reflects the extreme scatter of data points.

Figure 18.1. Judicial review vs. constitutional rigidity: Original graph (Lijphart 1999: 229), and with addition of conceptually allowed region and several regression lines.

Note: SOME SCANNED-IN FIGURES MAY NOT COPY IN NORTH AMERICA.


However, the picture changes when the borders of the conceptually allowed region are drawn in (from 1 to 4 on both axes), as is done in the bottom graph. What looked like areas devoid of data points are actually areas where no data points can possibly occur! As for the conceptually allowed region, almost all of it is inhabited fairly uniformly. Only the top left corner is empty: Stringent judicial review based on a non-written constitution does not occur – as one would logically expect. But otherwise, anything seems to go. Stringent constitutional rigidity can go with absolutely no judicial review (Switzerland, SWI). Most surprising, countries with no unified constitutional document can still have appreciable judicial review (Colombia, COL). These features do not stand out in the original graph. Why is the impression from the two graphs so different? It's due to the meaningless frame in the original graph. It misleadingly intimates that data points could run from 0 to 5, rather than the actual 1 to 4. Moreover, the regression line is extended into the forbidden region. This elongated line increases the impression of fair correlation when there is almost none. How did this happen? Lijphart tells me that he drew his graphs by hand, using a 1 to 4 square. The person who redrew the graphs on computer thought it looked nicer with an expanded frame. It may look nicer – but it leaves a wrong impression. In science, truth must come before beauty. Placing a data graph in a frame that exceeds the logically allowed area can distort our impression – and so can showing a regression line which extends into logically forbidden area. There are further problems with the regression line. Any data y vs. x lead to two OLS regression lines. The one used in the top figure regresses Review on Rigidity. The points furthest from the overall trend affect the OLS line in a major way. On the right side of Figure 18.1, SWI pulls the line down, US pulls the line up, and the intermediary Australia (AUL) hardly matters. Now turn the graph by 90 degrees so that the judicial review axis becomes horizontal and constitutional rigidity increases upwards. Try the same balancing act. Compared to the previous OLS line, the data points IND, GER and US at the far left all pull downwards. The balanced position is close to GER. At the far right, all the points, ranging from SWI to ISR, NZ and UK, pull upwards, compared to the previous OLS line. The balanced position might be in between NET and UK. The overall balance among all the points is approximately as shown in the lower graph of Figure 18.1. This is the reverse OLS regression line, Rigidity on Review.


The two OLS lines diverge wildly, because scatter is huge, as reflected in R2=0.15. The bottom graph also adds the symmetric regression line, which treats the two variables on an even basis (Chapter 16). This is the real "trend line", if any straight line applies at all. Is any linear fit acceptable on conceptual grounds? First, does the figure have any conceptual anchor points through which an acceptable fit must pass? An anchor point at (1,1) can be asserted: One cannot declare a law unconstitutional when there is no constitution. At the opposite extreme, it's less clear. The lack of a single written document still doesn't mean total absence of basic norms. This is why Iceland (ICE) can have a mild degree of judicial review. On the other hand, even an utmost degree of constitutional rigidity doesn't call for judicial review (witness Switzerland, SWI) – it just makes such a review possible and highly likely. In sum, (1,1) at bottom left might be our best guess in the absence of any other information. The symmetric regression line passes close to it, while the OLS lines deviate widely, in opposite directions. A similar vague claim of being an anchor point might be made for the top right corner (4,4). But most countries have judicial reviews at level 2, regardless of constitutional rigidity. Hence one might sketch in a "toppled S" shaped curve, joining (1,1) and (4,4), but almost horizontal as it passes through the center of gravity (the crossing point of linear fits). Its equation would be more complex, akin to those we'll discuss in connection with Figure 21.7. What have we achieved, in terms of logical model building? We merely have cleared away some possible misconceptions, by delineating the allowed area. This by itself has value. As for positive model building, face the real world. Many relationships observed, in nature and society, remain fuzzy, unless we get some new ideas. We may ask whether a third factor might affect the picture, one that differs for countries toward the top left (COL, ITA, IND) and for countries toward the bottom right (SWI, JPN, FIN, NET, LUX). Inspection of the graph reveals no social or political commonalities among the members of these two groups. Maybe such a factor will be discovered, but it may also be that the scatter remains random noise, based on past history.

Degree of federalism and central bank independence

Figure 18.2 shows another case in Lijphart (1999: 241). Here the degree of federalism and decentralization is estimated on a scale from 1 to 5. The central bank independence is a compound of many estimates, which makes it a quasi-continuous measure. In principle, it could range from 0 to 1.0.


Figure 18.2. Central bank independence vs. degree of federalism: Original graph (Lijphart 1999: 241), and addition of conceptually allowed region and the reverse regression line. CORRECTION: The labels “y on x” and “x on y” must be reversed.


In the original graph the data cloud seems to occupy most of the space available, except top left and bottom right. Introducing the conceptually allowed region alters the picture. It shrinks the field left and right, while expanding it – little at the bottom but markedly at the top. Now we see that both extremes on the x-scale occur widely – utter centralization as well as complete federalism. In contrast, the extremes of complete or no central bank independence do not occur at all. Compared to the conceptually allowed range, from 0 to 1.0, the actual range of central bank independence is rather limited, from 0.2 to 0.7. Correlation is somewhat stronger than in the previous case, with R2=0.32, hence the two standard OLS lines are closer to each other. (The symmetric regression line, in between, is not shown.) What about conceptual anchor points? Imagine a central bank with zero independence. What degree of federalism would we expect? Certainly no more than the minimal "1". Now imagine a central bank with full independence. We might expect full federalism ("5"). So the opposite corners of the allowed area just might be conceptual anchor points. To join them, while respecting the data, would take again a "toppled S" shaped curve, which roughly follows the y-on-x (mistakenly labeled "x-on-y") line at median degrees of federalism. Once again, we have made only a little headway toward a quantitatively predictive logical model, by establishing at least the conceptually allowed region and anchor points.

Bicameralism and degree of federalism

Figure 18.3 (from Lijphart 1999: 214) might allow us to go slightly further. Bicameralism is estimated on a scale from 1 to 4. Delimiting the allowed region shows two empty corners. Accordingly, correlation is higher (R2=0.41) than in previous graphs, and the two OLS lines are relatively close to each other. We might claim that full-fledged federalism does call for full bicameralism – two equally powerful chambers – so that both population and federal subunits can be represented on an equal basis. Indeed, for x=5, 4 out of 5 data points have y=4. So (5,4) could be a logical anchor point. It is trickier for x=1. Even some purely unitary countries have two chambers on historical grounds. (The second chamber used to represent aristocracy.) We may still tentatively accept an anchor at x=1, y=1. This location is heavily populated with empirical data points, even though many countries with x=1 have y>>1. If we accept an anchor at (1,1), then a fit with Y=X^k could be tried.


Figure 18.3. Bicameralism vs. degree of federalism: Original graph (Lijphart 1999: 214), and addition of conceptually allowed region, reverse regression line, and a fit with Y=X^k.


To do so, we must first convert the ranges 1 to 5 for x and 1 to 4 on y to ranges from 0 to 1. How do we do that? First, we must pull down the lower limit, from 1 to 0, by subtracting 1. The total span of possible values on x is 5-1=4. On y, it is 4-1=3. So we have to divide by these spans. The result is X=(x-1)/4 and Y=(y-1)/3. Check that now the lower anchor point corresponds to X=0, Y=0, and the upper anchor point corresponds to X=1, Y=1. Any curve Y=X^k would join these points, but most of them leave most data points on one side of the curve. By trial and error, we find that Y=X^0.5 has about an equal number of data points on either side. This curve is shown in Figure 18.3. Do we now have a logically grounded model? This is hardly so. Conceptually, the anchor points aren't strongly imposed, and some data points deviate markedly from the central curve. For X=0, we often have Y>>0. Conversely, for Y=0, several cases with X>0 also occur. Still, (0,0) and (1,1) are the most heavily populated points in their neighborhoods. The curve Y=X^0.5 acknowledges this, while the linear regression lines do not – not even the symmetric one (not shown in Figure). So Y=X^0.5 has advantages in representing the broad trend. Compared to linear regression equations, it has vastly more chances of eventually finding a theoretical justification or explanation. In all these graphs, we were dealing with subjective estimates rather than objective measurements. It's better than nothing, although we feel on firmer grounds with measurements. All these examples have conceptual constraints on all 4 sides. This makes a fit with the format Y=X^k conceivable, but it clearly does not work in Figures 18.1 and 18.2. Not all relationships are linear, nor do they all follow Y=X^k even when the allowed area is a box.
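The "trial and error" can be automated. One simple way (a sketch with hypothetical pairs already converted to the 0-to-1 scale, not Lijphart's data) is to take the median of the pointwise exponents, which by construction leaves about half the points on each side of Y=X^k:

    import numpy as np

    # Hypothetical (X, Y) pairs already rescaled to 0-1 (illustration only).
    # Points with X or Y exactly equal to 0 or 1 would have to be set aside.
    X = np.array([0.10, 0.25, 0.40, 0.60, 0.75, 0.90])
    Y = np.array([0.40, 0.45, 0.70, 0.72, 0.90, 0.93])

    # Y = X^k means logY = k*logX, so each point suggests its own k = logY/logX.
    k_each = np.log(Y) / np.log(X)
    k = np.median(k_each)      # the median exponent splits the points evenly
    print(round(k, 2))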


Conversion to scale 0 to 1

This is something we always have to do when the logical model or format presumes a scale from 0 to 1, but the data uses some other scale. In the last example, we had to convert the ranges 1 to 5 for x and 1 to 4 for y to ranges from 0 to 1, so as to be able to apply the simple format Y=Xk. We reasoned it through, using these particular ranges. The need to convert to a scale from 0 to 1 occurs so frequently that a general conversion formula might be useful. Suppose the original conceptually possible range of x goes from m to a larger number M. Those numbers can be negative or positive. First, we pull the lower limit down, from m to 0, by subtracting m from all values of x. We obtain x-m. The total span of possible values is M-m. So we must divide the values x-m by this span. The result is

X=(x-m)/(M-m).

Check that the lowest possible value, x=m, leads to X=0, and the highest possible value, x=M, leads to X=1. When m and M are positive and M is vastly larger than m (M>>m), then it might make sense to use logx rather than x in the conversion:

X=(logx-logm)/(logM-logm)=log(x/m)/log(M/m).

An example will be given in Figure 25.6.
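
For readers who work on a computer rather than on graph paper, here is a minimal Python sketch of the same conversion (my own; the function names are not from the book):

from math import log

def to_unit(x, m, M):
    # Linear conversion from the conceptual range [m, M] to the range [0, 1].
    return (x - m) / (M - m)

def to_unit_log(x, m, M):
    # Logarithmic version, for positive m and M with M >> m.
    return log(x / m) / log(M / m)

# Check the anchors: x=m must give 0 and x=M must give 1.
print(to_unit(1, 1, 5), to_unit(5, 1, 5))                      # 0.0 1.0
print(to_unit_log(10, 10, 1000), to_unit_log(1000, 10, 1000))  # 0.0 1.0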

* Exercise 18.1 A frequent scale goes from -10 to +10 (e.g. for left-right placements). Convert it to scale 0 to 1. This means simply plugging these extreme values into the equation above. Check the result by feeding into the equation the values x=-10, 0 and +10. No need to calculate X for other values of x!

* Exercise 18.2 Using a scale 1 to 10 can be awfully confusing, because it is so close to 0 to 10. Respondents may place themselves at 5 when they wish to place themselves at the center. a) What is the actual center on a scale 1 to 10? b) Convert the scale 1-to-10 to scale 0-to-1. This means simply plugging these extreme values into the equation above. No need to calculate X for specific values of x! c) Use the resulting equation to convert the value 5 on the scale 1-to-10 to scale 0-to-1. d) Have I cured you of ever using scales 1-to-10 or 1-to-5, in preference to 0-to-10 or 0-to-5?


Exercise 18.3 a) In Figure 18.1, sketch in a “toppled S” shaped curve, joining (1,1) and (4,4), but almost horizontal as it passes through the center of gravity. b) Comment on its degree of fit to the data, compared to the linear regression lines.

Exercise 18.4 a) In Figure 18.2, sketch in a “toppled S” shaped curve, trying to fit the data points as well as possible. This means having a roughly equal number of data points on both sides of the curve, both at low and at high levels of federalism. It’s not easy. b) Comment on its degree of fit to the data, compared to the linear regression lines.

19. The Importance of Slopes in Model Building

• The direction and steepness of the slope of a curve y=f(x) is expressed as dy/dx. The slope dy/dx is often designated briefly as y’ when it does not lead to confusion.
• For exponential growth S=S0e^kt, the slope is proportional to size: dS/dt=kS.
• For simple logistic growth S=M/(1+e^-kt), the slope is proportional to size and to closeness to the ceiling: dS/dt=kS(1-S/M).
• For the fixed exponent function y=Axk, the slope is dy/dx=kAxk-1=k(y/x).
• The cube root law of assembly sizes is an example of the use of slopes in constructing a logical model.
• When functions are added, their slopes add: y=y1+y2 → dy/dx=dy1/dx+dy2/dx, or more briefly: y’=y1’+y2’.
• When functions are multiplied together, the resulting slope cross-multiplies slopes and functions: P=xy → dP/dt=y(dx/dt)+x(dy/dt), or P’=yx’+xy’.
• When one function is divided by another, it becomes more complex: Q=x/y → dQ/dt=[y(dx/dt)-x(dy/dt)]/y2, or Q’=[yx’-xy’]/y2.

Curves have slopes. For straight lines the slope is constant – the “b” in y=a+bx. For curves, the slope steadily changes. For the curve Y=X0.5 in Figure 18.3, the slope is steep at low X, but much shallower at high X. Why bother about slopes when building and testing logical models? There are two major ways to make use of slopes:
• Finding minimum or maximum values for models that do not involve slopes, and
• Building models on the basis of slopes themselves.
Examples of both uses will be given. Until then, the sentences above may have little meaning for you, but believe me: slopes offer unbelievably powerful means for model building. First, we have to introduce some properties of slopes.


Notation for slopes

The steepness of a straight line matters, but at least this slope is constant – the “b” in y=a+bx. We determined it in Figure 2.1 by dividing the change in y by the change in x:

slope = (change in y)/(change in x) = Δy/Δx.

For curves, slope steadily changes, but the expression above still applies approximately, when the increments Δx (delta-x) and Δy are small. The slope of a curve at a given point is the slope of the line that barely touches the curve at this point but does not cross it – the tangent to the curve. Suppose a function y=f(x) is graphed, y vs. x, as in Figure 19.1. The slope at first gets steeper, but then begins to decrease until, at the peak, the slope becomes zero: while x changes, y does not change. Thereafter, the slope becomes negative, as the curve goes downhill: while x increases, y decreases. How do we measure the slope at a point (x,y) on a curve? Advance x by an increment Δx, but make it a vanishingly tiny amount; it is then designated as dx. Then y will change by a vanishingly tiny amount, positive or negative, designated as dy. The slope is now expressed more precisely:

slope = (tiny change in y)/(tiny change in x) = dy/dx.

When do we use Δx and when dx? When the difference in x is appreciable, Δx is used. When this difference is made ever smaller (Δx→0) and becomes “infinitesimally small”, then we use dx. To repeat:

The slope of a curve, at the given value of x, is dy/dx, the ratio of a tiny change (dy) in y which takes place over a tiny change (dx) in x.
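
This passage from Δy/Δx to dy/dx can also be watched numerically. The small Python sketch below (my own illustration) computes Δy/Δx for the parabola y=x2 at x=2 with ever smaller increments; the ratio closes in on the tangent slope 2x=4.

def f(x):
    return x**2   # the parabola y = x^2

x = 2.0
for dx in (1.0, 0.1, 0.01, 0.001):
    slope = (f(x + dx) - f(x)) / dx   # Δy/Δx for a finite increment Δx
    print(f"Δx = {dx}: Δy/Δx = {slope}")
# The ratios (about 5.0, 4.1, 4.01, 4.001) approach dy/dx = 2x = 4.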

Figure 19.1. Slope dy/dx is at first positive, then 0 at the peak, then negative.


Examples, for Figure 19.1:
• When dy/dx=0, y does not change with x – the curve is horizontal, at the peak.
• When dy/dx=+10, the curve goes steeply up, as it does for a while, prior to the peak.
• When dy/dx=-0.5, the curve goes moderately down, as it does at the right end.

Figure 19.2. Parabola y=x2 and its slope dy/dx=2x.


Equation for the slope of a parabola – and for Y=Xk

Just like the curve itself has an equation, so does its slope. For the line y=a+bx, it is just dy/dx=b. For the parabola y=x2, the slope is dy/dx=2x. How did I obtain this expression? We’ll see that later on. For the moment, simply check whether this equation makes sense. Does it yield credible results? Look at Figure 19.2. (The 0-to-1 square, which we have been using when thinking inside the box, would correspond to the small rectangle outlined in bold, center bottom.)

• At x=0, dy/dx=2x yields dy/dx=0, which certainly is the case: The curve is horizontal at this point – its slope is zero, indeed.


• At x=2, dy/dx=2x yields dy/dx=4. On the graph, as x increases by ½ unit, y increases by 2 units, so 2/(1/2)=4, indeed.
• At x=3, dy/dx=2x yields dy/dx=6. On the graph, as x increases by 1/3 unit, y increases by 2 units, so 2/(1/3)=6.
• At x=-2, dy/dx=2x yields dy/dx=-4. On the graph, as x increases by ½ unit, y decreases by 2 units, so -2/(1/2)=-4.

Why bother about the parabola? It is an especially simple case of the model Y=Xk, which we have been using so often. It will be seen that its slope is dY/dX=kXk-1:

Y=Xk → dY/dX = kXk-1.

Exercise 19.1 Using Figure 19.2, we could verify that dy/dx=2x really produces reasonable-looking slopes for the curve y=x2. Let us check if dY/dX=kXk-1 yields reasonable-looking outcomes for simple values of the constant k in Y=Xk. a) If k=1, what are Y=Xk and dY/dX=kXk-1 reduced to? Does this make sense? b) If k=2, what are Y=Xk and dY/dX=kXk-1 reduced to? Does this make sense?

Two ways to use slopes in model building were mentioned:

• Finding minimum or maximum values for models that do not involve slopes; and
• Building models on the basis of slopes themselves.
An example of each use will now be given.

Cube root law of assembly sizes: Minimizing communication load

The sizes (S, the number of representatives) of legislative assemblies empirically tend to follow a cube root pattern: S=(2P)1/3, where P is the adult literate population of the country. Why is this so? It is a matter of keeping the number of communication channels as low as possible. As S increases, the burden of communication channels on a single representative goes down in her district but up in the assembly. Indeed, the number of constituents a single representative must keep happy is P/S, and this goes down as S increases. But the number of communication channels she must monitor in the assembly is S2/2 (recall Chapter 6), and

this goes up. It can be shown (Taagepera 2007: 199) that the total number of channels (c) is close to c=2P/S+S2/2. When we graph c against S (Figure 19.3), we find that, as S increases, c first goes down, but then begins to increase again. The optimal size is the one where c is lowest – this is the size that minimizes the communication load.

Figure 19.3. The burden of communication channels on one representative. Assembly size Soptimal=(2P)1/3 minimizes the number of channels.


How do we determine this optimal assembly size? We could draw the curves c vs. S for each population size and see at which S we get the lowest c. We can make it easier for ourselves by observing that minimal c corresponds to the location where the curve is horizontal, meaning dc/dS=0. Given the equation for the curve, c=2P/S+S2/2, we can calculate the equation for its slope, using rules to be given soon. The result is dc/dS=-2P/S2+S. This equation gives us the slope at any assembly size – anywhere on the curve. But now comes the clincher: We also require that dc/dS=0, because this is the slope at the bottom of the curve. The result is -2P/S2+S=0. Rearranging leads to S=(2P)1/3. Having a model to support the empirical observation, this relationship now qualifies as a law in the scientific sense – the cube root law of assembly sizes:

S=(2P)1/3.
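
We can double-check this outcome numerically. The Python sketch below (my own check, not part of the book) takes an illustrative population of one million, scans assembly sizes, and confirms that the S with the lowest c is close to (2P)1/3.

P = 1_000_000   # adult literate population (illustrative value)

def channels(S):
    # Total communication load on one representative: c = 2P/S + S^2/2
    return 2 * P / S + S**2 / 2

best_S = min(range(1, 1001), key=channels)
print("numerical optimum: ", best_S)              # 126
print("cube root formula: ", (2 * P) ** (1 / 3))  # about 126.0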


In this example the model as such did not include slopes. Calculating the slope and then requiring it to be zero was just a device to locate the optimal value. Here we tried to minimize the output. Sometimes, to the contrary, we try to maximize some output. Both minima and maxima share the same feature: the slope is flat – dy/dx=0. Quite frequently, however, the logical model itself is built in terms of slopes. Exponential growth is an important example.

Exponential growth as an ignorance-based model: Slope proportional to size

Suppose we build a wall, using a steady number of masons who produce k meters of wall per day. Starting from the day construction began, how does the size (length S) of the wall increase over time (t)? It increases linearly: S=kt. The rate of increase is the slope of this straight line:

dS/dt=k. [wall]

The slope is a constant, as long as the number of masons is not altered from the outside. Now consider the way a microbe colony grows when food and space are plentiful. At regular intervals, each microbe splits into two. Starting with a single one, we have 1, 2, 4, 8, 16... Hence the growth rate is proportional to the number of microbes. When their number doubles, the rate of growth also doubles. For the size of the colony, this means that here we have the slope dS/dt proportional to S itself:

dS/dt=kS. [microbe colony]

Compare this equation to the one above. When something is built from the outside, at a steady rate, dS/dt=k applies. In contrast, when something is building itself from the inside, at a steady rate, dS/dt=kS applies. Such equations, which include a slope dy/dx, are called differential equations. They often express the deep meaning of a process. These differential equations, however, do not enable us to predict how much stuff there is at a given time. For this purpose we must use their “integrated” form. For dS/dt=k, the integrated form is simply where we started from: S=kt. For dS/dt=kS, it can be shown that the integrated form is S=S0e^kt – an exponential equation. Here S0 is the number of microbes at the time chosen as the starting point, t=0. This S=S0e^kt is also written as S=S0exp(kt). More generally, when the size is S0 at time t0 (different from t=0), then


S= S0exp[k(t-t0)].

(The format “exp” helps us avoid having subscripts within exponents!) Now consider relative growth rate. This means growth rate divided by the existing size: (dS/dt)/S. When expressed as percent of existing size, it is also called the percent growth rate. For exponential growth, dS/dt=kS, this relative growth rate is constant:

(dS/dt)/S=k. [exponential growth]

This is what makes exponential growth or decrease so widespread and basic. It’s the pattern of growth that takes place as long as the relative growth rate does not change. A larger k means steeper growth. Let us now ask the following question. We are told that the size of something is changing over time. We are asked to make our best guess on whether its rate of change is increasing or decreasing. What would be your best guess? Write it down. In the absence of any other information, we have no grounds to prefer increase to decrease, or vice versa. So our best guess is that the growth rate remains the same. If we have the further information that this something is being constructed from the outside, our best guess is a constant absolute rate of change: dS/dt=k. But if we have the further information that this something constructs itself, then our best guess is a rate of change proportional to size: dS/dt=kS, so that the relative rate of change is constant. Hence the exponential model is another example of an ignorance-based model. It is one of the most important and universal models of that type. And it uses the notion of slopes.
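
To see how the differential and the integrated forms hang together, here is a small numerical sketch (my own, with arbitrary values of S0 and k): it grows S in tiny steps according to dS/dt=kS and compares the outcome with S0exp(kt).

from math import exp

S0, k, t_end, dt = 1.0, 0.5, 4.0, 0.001
S, t = S0, 0.0
while t < t_end:
    S += k * S * dt     # dS = kS·dt: each step adds growth proportional to current size
    t += dt
print("stepwise result: ", round(S, 3))
print("integrated form: ", round(S0 * exp(k * t_end), 3))   # S0·e^(kt)
# The two numbers nearly agree; the small gap shrinks further as dt is made smaller.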

Simple logistic model: Stunted exponential

Nothing continues to grow forever. Growth that starts exponentially sooner or later levels off, reaching a maximum size (M). The simplest way to impose a limit on something which begins as dS/dt=kS is to multiply by (1-S/M):

dS/dt=kS(1-S/M).

When S is much smaller than M, (1-S/M)≈1, so we have exponential growth dS/dt=kS. But when S reaches one-half of M, dS/dt=kS/2. This means we are down to one-half of the exponential slope. When S approaches M, (1-S/M) approaches 0, so the slope also approaches 0 – growth stops. The resulting curve is the simple logistic curve – see


Figure 19.4, which is the same as Figure 14.4. A larger k means steeper growth from near-zero to near-ceiling.

Figure 19.4. Exponential curve and simple logistic curve which starts out the same way.


Integration of dS/dt=kS(1-S/M) yields

S=M/(1+e^-kt) when t is counted from the time when S=M/2. The proof is not shown here, but check whether this equation makes sense at some simple values of time, like zero and minus and plus infinity. When t=0, e^-kt=e^0=1, so S=M/2, indeed. When t→+∞, e^-kt=1/e^kt→0, so S→M, as it should. When t→-∞, e^-kt→∞, so S→M/(∞)=0, as it should.


EXTRA

Recall from Chapter 14 that rearranging S=M/(1+e^-kt) leads to S/(M-S)=e^kt – the ratio of the distances of S from the floor and from the ceiling grows exponentially, as (M-S) shrinks ever closer to nothing. Up to now we counted time from the moment when S=M/2. More generally, when the size is y0 at time t0, then

y = M/{1 + [(M-y0)/y0]exp[-k(t-t0)]}.

It follows that

y/(M-y) = exp[k(t-t0)] and hence

ln[y/(M-y)] = kt - kt0.

To test whether some data fits the simple logistic model, graph ln[y/(M-y)] vs. t – a straight line should result, with slope k. This is the basis of so-called logit data fits in statistics.
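
Here is a minimal Python sketch of that test (my own illustration, using artificial data generated from a known logistic curve): the transformed values increase by exactly k per unit of t, i.e., they lie on a straight line with slope k.

from math import exp, log

M, k = 100.0, 0.8

def logistic(t):
    return M / (1 + exp(-k * t))   # simple logistic, with t counted from S = M/2

for t in range(-3, 4):
    y = logistic(t)
    print(t, round(log(y / (M - y)), 3))   # prints k*t: -2.4, -1.6, ..., 2.4
# With real data the points would scatter around a straight line; its slope estimates k.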

How slopes combine – evidence for the slope of y=xk

It can be seen that slopes are important building blocks and tools for logical models. The following introduces some basic relationships. I’ll try to show why and how they make sense. You’ll begin understanding more as you make use of these formulas. The slope dy/dx is often designated briefly as y’ when it does not lead to confusion. When we add two functions, their slopes add:

y=y1+y2 → dy/dx = dy1/dx + dy2/dx, or more briefly: y’=y1’+y2’.

Thus, when we add a straight line y1=a+bx and a parabola y2=x2, the slope of y=y1+y2 is dy/dx=b+2x. This also implies that multiplying a function by a constant multiplies its slope by the same constant:

y=ky1 → dy/dx = k(dy1/dx).

The area of a rectangle is the product of its two sides: A=xy. What is the rate of change of the area when the lengths of the sides change over time? This means, what is the value of dA/dt when the rates of change of x and

y are dx/dt and dy/dt, respectively? I claim that we have to cross-multiply the functions and their slopes:

dA/dt =y(dx/dt )+x(dy/dt).

Why? Look at the rectangle in Figure 19.5, with sides x and y. Its area is A=xy. Extend both sides by a bit, Δx and Δy, respectively. The area increases by ΔA=xΔy+ yΔx+ΔxΔy, but we can neglect the tiny corner Δx Δy. This corner becomes very small indeed, compared to xΔy or yΔx, when we make Δx and Δy extremely small (dx and dy). Divide dA=xdy+ydx by dt, and we get the rate of change over time. In sum, when functions are multiplied together, the resulting slope cross-multiplies functions and their slopes:

P=xy → dP/dt = y(dx/dt)+x(dy/dt), or more briefly: P’=yx’+xy’.

Figure 19.5. Area increase results from cross-multiplication of lengths and length increases. The corner Δx Δy becomes negligible when Δx and Δy become infinitesimally small.


Now we can prove that the slope of the parabola y=x2 is dy/dx=2x. In Figure 19.5, make y equal to x, so that we have a square with area A=x2. Extend both sides by Δx. The area increases by ΔA≈xΔx+xΔx=2xΔx when we neglect ΔxΔx. Hence ΔA/Δx≈2x. Using y instead of A, and going from Δx to tiny dx, the result is dy/dx=2x.


Try the same approach for a cube of volume V=x3. If you do, you’ll find that it leads to dV/dx=3x2. We have multiplied by 3 and reduced x3 to x3-1. We can generalize:

y=xk → dy/dx = kxk-1.

Maybe surprisingly, this rule applies even to fractional values of k. It applies to negative k too. For y=1/x=x-1, dy/dx=(-1)x-2=-1/x2. In short,

y=1/x → dy/dx = -1/x2.

The negative sign makes sense: as x increases, y decreases – the slope is negative. It is harder to visualize that the slope is sort of “stronger” than the curve itself. Pick simple examples and check that this is indeed so. Slopes are more complex when division is involved. For instance, per capita GDP (Q) is obtained by dividing the total GDP (G) by population (P): Q=G/P. Suppose both G and P change over time. What is the rate of change of Q? It is dQ/dt=[P(dG/dt)-G(dP/dt)]/P2.

How is this result obtained? I do not expect you to go through the proof. But just in case, here it is. The proof combines the outcomes for y=1/x and A=xy. Express Q=G/P as a product: Q=(1/P)G. By the rule of products, dQ/dt=(1/P)(dG/dt)+G[d(1/P)/dt]. Since d(1/P)/dP=-1/P2, we have d(1/P)/dt=(-1/P2)(dP/dt), so that dQ/dt=(dG/dt)/P+G[(-1/P2)dP/dt]. Rearranging:

dQ/dt = [P(dG/dt) - G(dP/dt)]/P2.

In a more general notation, Q=x/y → dQ/dt = [y(dx/dt) - x(dy/dt)]/y2, or more briefly: Q’=[yx’-xy’]/y2. We may not need all these formulas in the examples that follow, but these are the basic combinations. Some of them are likely to puzzle you at first. They will become clearer with use. It helps if you give yourself simple examples where you can figure out the result by other means – and discover that the formula yields the same result.
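
One way of giving yourself such examples is to let the computer compare each formula with a brute-force finite-difference slope. The Python sketch below (my own check, with two arbitrary functions) does this for the product and quotient rules at one point.

h = 1e-6   # a tiny increment, used to approximate slopes numerically

def x(t): return 3 + 2 * t     # an arbitrary function of t
def y(t): return 1 + t**2      # another arbitrary function of t

def num_slope(f, t):
    return (f(t + h) - f(t)) / h    # finite-difference approximation of df/dt

t = 1.5
# Product rule: P = x·y  ->  P' = y·x' + x·y'
print(y(t) * num_slope(x, t) + x(t) * num_slope(y, t))
print(num_slope(lambda s: x(s) * y(s), t))          # nearly the same number

# Quotient rule: Q = x/y  ->  Q' = (y·x' - x·y')/y^2
print((y(t) * num_slope(x, t) - x(t) * num_slope(y, t)) / y(t)**2)
print(num_slope(lambda s: x(s) / y(s), t))          # nearly the same number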

Equations and constraints for some basic models

Table 19.1 shows the differential and integrated equations for some basic models, along with the constraints on variables that impose such equations. The differential and integrated equations are fully equivalent expressions for the same model, but they may look quite different, and they serve different purposes. The differential equation often expresses

and explains better the mechanics of the process, while the integrated equation enables us to predict the output from the input.

Table 19.1. Differential and integrated equations for some basic models, and related constraints.

Type of model                                 Differential equation  →  Integrated equation        Corresponding constraints

Processes in time
  External construction:                      dS/dt=k          →  S=kt                             Only two quadrants allowed
  Self-propelled growth: exponential          dS/dt=kS         →  S=S0e^kt                         Only two quadrants allowed
  Exponential with ceiling: simple logistic   dS/dt=kS(1-S/M)  →  S=M/(1+e^-kt)                    Only a zone in two quadrants allowed

No process in time
                                              dy/dx=k(y/x)     →  y=Axk                            Only one quadrant allowed
                                              dy/dx=k(y/x)     →  y=xk                             Box in one quadrant
                                              dy/dx messy      →  y=xk/[xk+(1-x)k]                 Box in one quadrant, with central anchor point

Starting from a very different angle, we are back to some equations that emerged in Chapter 14. In the case of processes in time, the differential equations are remarkably simple and express well the fundamental underlying assumptions. In contrast, the integrated equations use more complex functions. Yet only the integrated equations are “explicit” in the sense of enabling us to calculate y from x. For the fixed exponent equation, the reverse applies. Here the integrated form is simpler:

y=Axk → dy/dx=kAxk-1.

The differential form enables us to calculate the slope at any value of x.

Also note that Axk-1=Axk/x, and Axk=y. Substituting y for Axk yields Axk-1 = y/x. Therefore, dy/dx=kAxk-1=ky/x. It turns out that here the slope is proportional to the ratio of y and x:

y=Axk → dy/dx = k(y/x).

This form has a beautiful symmetry, but is not a very practical form. Could this pattern correspond to another growth pattern in time? No, it cannot deal with time, because time extends from minus to plus infinity, while here x is limited to positive values.


When the box in one quadrant also has a central anchor point, the differential form is even more complex than the integrated one, and it adds little to our ability to visualize the pattern.

* Exercise 19.2 The previous model for cabinet duration in years, C=42/N2, has the form Y=AXk. Keep in mind that 1/N2=N-2. a) What would be the expression for dC/dN? [Use dy/dx=kAxk-1, NOT dy/dx = k(y/x).] b) Do such slopes agree with the graph in Exercise 2.1? To answer this, insert into the equation for dC/dN simple values like N=1, then N=5. Add lines with these slopes to your graph in Exercise 2.1, at N=1 and N=5. Do they look tangent to the curve C vs. N? c) Why doesn’t this result apply to Figure 12.1? (HINT: Distinguish between x and logx.)


D. Further Examples and Tools

20. Interest Pluralism and the Number of Parties: Exponential Fit

• The exponential function y=Aekx is a most prevalent function in natural and social phenomena. This is a basic pattern where the curve approaches a floor or a ceiling without ever reaching it.
• The coordinates of two typical points on the curve, plugged into y=Aekx, enable us to determine the numerical values of k and A.
• If we want to have y=A when x=B, use the format y=Aek(x-B). To determine k, plug the coordinates of another typical point on the curve into k=[ln(y/A)]/(x-B).
• If the exponential format applies, the data cloud is linear when graphed on semilog paper.

We are back to Arend Lijphart’s Patterns of Democracy (1999). A central measure for the majoritarian-consensus continuum of democratic institutions is the effective number of parties in the legislative assembly. This number has a definite lower limit at N=1, but no clear-cut upper limit. True, an assembly of S seats could fit at most S parties, but S itself varies, and actual values of N are so much smaller (N<<S) that this conceptual upper limit hardly matters in practice.

Interest group pluralism

Figure 20.1 graphs interest group pluralism (I) versus N. At the top, we see the original graph (Lijphart 1999: 183). It shows an OLS line (I on N) in the midst of such high scatter (R2=0.30) that much of the entire field is filled with points. At low N, most of the data points are above the OLS line, while at high N, most points are below this line. So this does not look like a good fit. How is this possible? Two outliers, ITA and PNG, top right, pull this side of the OLS line up, while having little impact on the center of gravity. So the left side of the line swings lower – the more so because the outlier AUT pulls it down. (NOR and SWE are almost at mean N, and hence have little impact on the slope.) We must look for a better-fitting format.

Figure 20.1. Interest group pluralism vs. effective number of parties: Original graph (Lijphart 1999: 183) and addition of an open box and an exponential fit.



Delineating the allowed region (bottom graph) changes little. The original limits shown on the left, top and bottom are indeed the conceptual limits. It’s just that the box is wide open to the right. Drawing in the other OLS line (N on I) does not help – it errs in the opposite direction, compared to the first OLS line. Moreover, it predicts forbidden values of I at low N and at high N. So does the first OLS line, for high N: for N>8, I will be negative. The symmetric regression line (not shown) will have the same problem. We must look for a better format than the linear one. We have no data points at N=1. A democracy with only one party is rare, to put it mildly, although Botswana (BOT) comes close, with N=1.35 (one huge and one tiny tribal party). However, the general trend seen in Figure 20.1 suggests that if N=1 ever materialized, then I might be close to its upper limit of 4. No other value is more likely, so let us take (1,4) as a tentative anchor point, which a reasonable data fit should respect. In addition to the anchor point at (1,4), we also have a conceptual bottom at I=0. Data points may approach it, at high N, but cannot drop beneath it. So we must have a curve starting at (1,4) and gradually approaching I=0 at large N, without ever reaching it. As seen in Chapter 15, the simplest format in such cases is an exponential function: y=Aekx. The constant k indicates how rapidly y changes with increasing x. A positive value of k means an increasing y. A negative value of k means a decreasing y – this is the case here. We replace y with I, and x with N, so the general format is I=AekN. It will soon be shown that the specific form is

I = 4e-0.31(N-1), as shown in the bottom part of Figure 20.1. This data fit also implies that the rate of change of I with increasing N is

dI/dN=-0.31I, and the relative rate of change is constant:

(dI/dN)/I=-0.31.

But let us first check whether this equation fits the data. At low N, this curve passes below most data points, so the fit is not much better there than it is for the OLS line I-on-N. Indeed, a simple straight line passed through (1,4) might seem to fit the data cloud better, and the symmetric regression line (not shown) would come close. Trouble is, such a line

would predict a negative value of I for N larger than 6. Our data fits should not predict absurdities, even though such high values of N do not occur in that particular data set. So the exponential fit is still preferable. It fits most points at high N, apart from those blatant deviants, ITA and PNG. Have we encountered a similar pattern earlier? Think. No, do not ask whether we have dealt with interest groups earlier. Think in terms of similarities in allowed areas and anchor points. What do we have here? We have conceptual limits on 3 sides only. We have an anchor point at one end, plus a gradual approach to a conceptual limit at the other end. Where have we met this before? Volatility. Its refined model (Figures 11.2 and 15.6) is the mirror image of the present one: approach to a ceiling instead of a floor. Here we have the broad form y=Aekx. For volatility, it is y=1-Aekx. It is extremely important to recognize such broad similarities.

Fitting an exponential curve to interest pluralism data

How can we determine a fit like I = 4e-0.31(N-1)? We can start in two ways: the basic exponential format y=Aekx, or a format which looks slightly more complex: y=Aek(x-B). We’ll do it both ways – and what looks more complex turns out simpler. The starting point is the one suggested in Chapter 14: pick two “typical” points and fit the equation to them. One of those points should be the anchor point (1,4). Then pick a data point at a moderately large N that seems to be close to the central trend. In Figure 20.1 something close to VEN (3.4,1.9) might do.

APPROACH I. The format y=Aekx means here I=AekN. Plugging in the coordinates of one typical point, then the other, yields two separate equations in k: 4=Ae1k and 1.9=Ae3.4k.

Dividing member by member cancels out A:

1.9/4=e3.4k/e1k.

Hence 0.475= e(3.4-1)k=e2.4k. Take logarithms – and here it pays to push the lnx button rather than logx, because ln(ex)=x by definition. So here ln0.475=2.4k. Hence

k=ln0.475/2.4=-0.744/2.4=-0.3102,

which we round off to -0.31. The value is negative, because the curve goes down. To find A, we plug k=-0.31 into one of the initial equations. Actually, I often do the calculations with both initial equations, to guard against mistakes. Here

4=Ae1×(-0.31) and 1.9=Ae3.4×(-0.31)= Ae-1.054, so that A=4e0.31 and A=1.9e1.054.

(Note that when the exponential factor moves to the other side of the equation, the sign of its exponent changes!) By using the ex or 2ndF/lnx button,

A=4e+0.31=4×1.363=5.454 and A=1.9e+1.054=1.9×2.869=5.451.

The difference is due to rounding off k, and A=5.45 is sufficiently precise. The final result is

I=5.45e-0.31N.

This equation can be used to calculate points on the curve shown in the bottom part of Figure 20.1. Trouble is, the anchor point (1,4) does not stand out in this equation. We can easily see what the value of I would be at N=0 – it would be 5.45 – but this is pointless information when N≥1 is imposed. This is where the second format comes in handy: y=Aek(x-B).

APPROACH II. The format y=Aek(x-B) means here I=Aek(N-B). The constants A and B go together in the following way: when N=B, then I=A, given that e0=1. It is convenient to plug in the anchor point (1,4), i.e., B=1 and A=4:

I=4ek(N-1).

Thus, we already have the value of A, and it remains to find k. We again plug in (3.4,1.9): 1.9=4e2.4k, and again k=-0.31 results. The final result now is

I=4e-0.31(N-1).

It yields exactly the same values as I=5.45e-0.31N, but it also keeps the anchor point in full view. And it was easier to calculate – the value of A was there without any calculations.


More generally, if we want to have y=A when x=B, use format y=Aek(x-B). To determine k, plug the coordinates of another typical point on the curve into k=[ln(y/A)]/(x-B).
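
The same arithmetic is easy to hand over to the computer. The Python sketch below (my own; the helper function is not from the book) implements Approach II: the anchor point (B, A) fixes A, and one other typical point (x1, y1) determines k.

from math import log, exp

def fit_k(B, A, x1, y1):
    # y = A·exp[k(x-B)]; solving for k from the second point: k = ln(y1/A)/(x1-B)
    return log(y1 / A) / (x1 - B)

k = fit_k(1, 4, 3.4, 1.9)        # anchor (N=1, I=4) and typical point (3.4, 1.9)
print(round(k, 4))               # about -0.3102, rounded to -0.31 in the text

def I(N, A=4, B=1):
    return A * exp(k * (N - B))  # the fitted curve I = 4e^(-0.31(N-1))
print(round(I(6), 2))            # predicted interest group pluralism at N=6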

GRAPHIC APPROACH. We could graph the data on semilog paper, with I on the logarithmic scale (or graph logI vs. N on regular paper). The corresponding data are available in Lijphart (1999: 312-313). See if the data cloud is linear or bent. If it looks sufficiently straight, the exponential model applies. Starting at the anchor point (1,4), draw in the visual best-fit line. The slope of this line corresponds to the value of k. But semilog paper is based on the decimal “log”, while the exponential formula connects to the natural log, “ln”. The difference may well confuse you at first, so check your results with one of the other approaches. Note that, apart from ITA and PNG on the high side and AUT, SWE and NOR on the low side, all data points are below the curve I=2×4e-0.31(N-1) but above the curve I=(1/2)×4e-0.31(N-1). (See Exercise 20.1.) This means that if we graph logI vs. N, the data cloud would become a straight zone, like the one in Figure 15.7, with zone limits at one-half and double the expected value. The data fit I=4e-0.31(N-1) is approximate. Given the degree of scatter, a more precise statistical fit would add little to our understanding of the relationship. But do we have a logical model? It is hard even to argue that interest group pluralism should decrease with increasing N (which is a good indicator of consensus-minded politics). It is even harder to explain why it should decrease exponentially, and at the rate observed (as reflected in k=-0.31). These are questions we should ask, even if we do not reach answers.

Exercise 20.1 The curves I=8e-0.31(N-1) and I=2e-0.31(N-1) are not shown in Figure 20.1. Sketch them in. How? Select a couple of convenient points on the main curve, and place by eye the points twice that high and one-half that high. Then join them.


The slope of the exponential function

The simplest exponential function is y=ex. This is what we get when we plug A=1, k=1 and B=0 into y=Aek(x-B). It has the unique property that, at any value of x, its slope equals the size itself:

d(ex)/dx= ex.

This is the only mathematical function to have this property. Indeed, if we require that the slope must equal the function, then the function f(x)=ex results – and this is what defines the numerical value of e=2.71828… in the first place. This is when k=1. With any other value of k, the slope is proportional to the size itself:

d(Aek(x-B))/dx=kAek(x-B). This is why the curve approaches zero without ever reaching it, when k is negative: As y becomes small, so does its further decrease. Whenever something grows or decreases proportional to its existing size, it follows the exponential pattern. Growth of bacterial colonies or human populations and growth of capital at fixed interest follow this pattern, with a positive k. The amount of radioactive material decreases according to this pattern, with a negative k. Indeed, the exponential function pops up whenever some variable x can take any values, from minus to plus infinity, while the connected variable y can take only positive values, without any further restrictions. This makes the exponential a most prevalent function in natural and social phenomena, on a par with the linear y=a+bx and the fixed exponent y=Axk that can be reduced to Y=Xk.

Exercise 20.2 At the ECPR (European Consortium for Political Research) Workshops in Rennes 2008, Rense from Nijmegen told me the following. They help the unemployed to write better CVs and hence be more successful in seeking jobs. They think they do something socially useful, but they are told that this isn’t so. They do not affect the number of jobs available. If they help one person to get a job, then simply someone else fails to get this job. They supposedly just help one person at the expense of another, and the social benefit is nil. Rense and his colleagues disagree but find it hard to respond. So who is right, or is there a third way out? Does it make sense to help the unemployed to write better CVs? Close your eyes for the moment and ponder: How would I tackle this issue? (If you omit this step you might lose more than you think!) Scribble down a few ideas. And now proceed to the following.

a) Which variables are involved? OK, the number of positions filled (F) is one variable. What else? No idea? Well, over which variable does F change?
b) Label this other variable T. Sketch a graph, F vs. T.
c) Introduce a simplifying assumption. Assume that initially all positions the economy needs (FM) are filled (F=FM), but then a recession sets in and some workers lose their jobs, so that the number of filled positions drops to FR. Later, at T=0, the recession suddenly goes away and FM positions again need to be filled, but only FR positions are filled. Enter these levels (FM and FR) on your graph, at T=0. Prior to T=0, F=FR. (Actually, recessions recede only slowly. But taking this into account would make the issue pretty complex right at the start. Better assume instantaneous recovery, for the moment, and leave the more complex situation for later.)
d) Also enter the ceiling, i.e., the level FM toward which the employers try to raise F during the period T>0. What is the simplest model to express the notion that “the number of positions filled tends toward the number of positions available the faster, the larger the gap between the two”? Offer the format of the corresponding equation (with no specific numbers), and sketch the approximate graph.
e) Why aren’t the vacant positions filled instantaneously? It’s because all employers and jobseekers suitable for each other do not meet. The mutual fitting together takes time. How would this time change for those who present themselves with improved CVs? Enter (always on the same graph) the curve for those with improved CVs.
f) Which curve would tend to be followed immediately after T=0? To which curve would it shift later on?
g) Then how do those who help to write better CVs alter the situation? Are they of any help to job seekers as a group? Are they of any help to the economy as such?
h) Would these conclusions change if we made the description more realistic by assuming that the job market expands from FR to FM only gradually?
i) And now to the most important step. Stop for a while and think back: How would you have tackled this issue on your own, compared to what I have put you through? Would you agree with my approach? If you don’t, fine – as long as you take me as seriously as you take yourself. At the end of the task, just enter: “Yes, I have pondered it.”


EXTRA I: Why not a fit with fixed exponent format?

The anchor point at (1,4) and approach to a floor at I=0 can also be fitted with fixed exponent format I=4Nk. Indeed, I=4N-0.60 =4/N0.60 yields a curve fairly close to that of I=4e-0.31(N-1). It is slightly lower at N=2 and remains higher at N=6, so the fit to data is visually only slightly worse. An analogous fit for volatility was pointed out in Chapter 11, EXTRA. Why should we prefer an exponential fit? Suppose we have a positive value at the start, and it decreases toward a floor. If the anchor point is at x=0, the format y=Axk cannot work, because at x=0, y is either 0 (when k is positive) or infinity (when k is negative), but not some intermediary value. We are left with the exponential option by default. If the anchor point is at some positive value of x, as is the case in I vs. N, then both options are possible. The exponential approach still has the following advantages.

1) Uniformity. To be able to compare, it is better to keep the same format, regardless of whether the anchor point is at x=0 or x>0. 2) Rate of change is proportional to the existing size, hence relative (percent) rate of change is constant. This is a powerful general model of ignorance. The format y=Axk does not offer anything that simple.

For these reasons, exponential fit is our preferred option. If this fit does not work, we could consider y=Axk and even other options. In Figure 20.1, for instance, I=10/(N+1.5) yields a curve intermediary between those of I=4e-0.31(N-1) and I=4/N0.60. Even if it should fit better in a statistical sense (it doesn’t), it should be avoided, unless its format has some logical justification.


Figure 20.2. Disproportionality vs. number of parties: Original graph (Lijphart 1999: 169), and addition of exponential fit with a vague anchor point.

EXTRA II: Electoral disproportionality

This example, again from Lijphart (1999), is left as an EXTRA, because the format x=Aeky (rather than the usual y=Aekx) may be a bit confusing for beginners.


Figure 20.2 deals with electoral disproportionality (D). This is the difference between vote and seat shares of parties. It ranges in principle from 0 to 100%. I will not describe how it is determined. While the previous graph had something else graphed against N, here Lijphart (1999: 169) prefers the opposite direction, N against D. If one insists on a linear fit, a low R2=0.25 results. But it’s quite visible that no linear fit does justice to the data cloud, because the cloud shows a clear curvature. The allowed region again has finite limits on 3 sides, even while one of them isn’t visible in the graph. While N can range upward from 1, with no clear upper limit, D can range upward from 0, and no higher than 100 percent. But the situation is more fluid than that. Consider the situation at N=1, meaning that a single party has won all the seats. What level of disproportionality could be expected? If it also had all the votes, D would be zero. At the other extreme, D could be 100% only if that party obtained no votes at all. It is argued in Taagepera and Shugart (1989: 109-110) that the average outcome can be expected to be around D=25%. So we take (N=1, D=25) as an anchor point. If D were graphed against N (rather than vice versa), the picture would be quite similar to the previous I vs. N. (You may find it easier to turn the page by 90 degrees, so that the D axis goes up.) The exponential format D=Aek(N-1) looks possible. Taking N=1, D=25 as an anchor point and D=0 as the conceptual bottom (which is vertical in the actual graph!) leads to D = 25ek(N-1).

Picking NOR as a typical central point leads to k=-0.66 so that

D = 25e-0.66(N-1).

Here this fit is clearly more satisfactory than any linear fit could be. Note furthermore that, apart from PNG, the highest points (FRA, IND) are near the curve D=4×25e-0.66(N-1), while the lowest points (MAL, AUT, GER, SWE) are near the curve D=(25/4)e-0.66(N-1). This means that if we graph logD vs. N, the data cloud would become a straight zone, like in Figure 15.7, but the zone limits would be at one-quarter and 4 times the expected value. How close are we to a logical model? The anchor point is somewhat fluid. It makes sense that higher N leads to lower D, given that higher N tends to reflect more proportional electoral rules. Also, as N increases, its further impact on D should gradually become milder. In this respect the exponential pattern makes conceptual sense.

21. Moderate Districts, Extreme Representatives: Competing Models

• When both x and y are conceptually limited to the range from 0 to 1, with anchor points (0,0) and (1,1), the simplest fit is with Y=Xk. When a further anchor point is imposed at (0.50,0.50), the simplest fit is with Y=Xk/[Xk+(1-X)k].
• When this third anchor point is shifted away from (0.50,0.50), even the simplest fit becomes more complex.
• Different logical approaches sometimes can be applied to the same problem and may fit data about equally well. We have to ponder which one has a wider scope.
• All other things being equal, a model with no adjustable parameters is preferable to one that has such parameters. A smooth model is preferable to a kinky one. A model that connects with other models is preferable to an isolated model.

We now reach situations with more than two constraints, and the models become more complex. I’ll explain the resulting equations in less detail, compared to the exponential and fixed exponent equations. Why? First, we are less likely to encounter these specific forms in further research. Second, by now we should be able to recognize some general patterns that repeat themselves, in a more complex form. Our starting point is a graph in Russell Dalton’s Citizen Politics (2006: 231), reproduced here as Figure 21.1. It shows the degree of conservatism of representatives (on the y-axis) as compared to the conservatism of their districts (on the x-axis) in US House elections. “There is a very strong congruence between district and representative opinions (r=.78) [thus R2=0.61], as one would expect if the democratic process is functioning” (Dalton 2006: 231). Reporting the value of R implies a linear fit. This line is not shown, but we can mentally draw it in. It would predict negative y for x less than 0.10, and y above 1 for x above 0.80. What else can we see in this graph, beyond “When x goes up, y goes up”? What else should we add to the graph, so as to see more?

Graph more than the data

By now we should know it. Delineate the entire conceptually allowed area. It both reduces and expands on the field in the original graph, as x and y can both range from 0.00 to 1.00. Next, consider conceptual anchor points. In a 100 percent conservative district, the representative has a strong incentive to vote 100 percent conservative, while in a 100 percent radical district, the representative has a strong incentive to vote 100 percent radical. No such data points can be seen, but the most extreme data points do approach the corners (0,0) and (1,1). Any data fit should connect these anchor points. Also draw the equality line y=x. Figure 21.2 shows these additions. A puzzle emerges as soon as we graph the equality line. District and representative opinions do not match. In conservative districts, representatives are more conservative than their average constituents, and in radically non-conservative districts, representatives are more radical than their average constituents. The representatives tend to be more extreme than their districts. Why do they do so, and why precisely to such a degree? Note that we seem to have a third anchor point, halfway up, as briefly introduced in Chapters 9 and 14.

Figure 21.1. Attitudes of representatives and their districts: Original graph (Dalton 2006:231) – data alone.



How would a representative tend to vote in a 50-50 district? The situation is analogous to seat distribution when votes split 50-50 (Figure 9.4). In the absence of any further information, we have no reason to guess at more or less than 0.50. So we would expect the curve to pass through the point (0.5,0.5). This agrees roughly with the data cloud. Note that the data cloud itself appears somewhat different once the equality line, forbidden areas and anchor points are introduced. The free-floating bent sausage of Figure 21.1 appears in Figure 21.2 as visibly squeezed in between the equality and bottom lines, at the left corner. More diffusely, the same applies at the top right corner. The data cloud rises sharply around the central anchor point.

Figure 21.2. Original graph (Dalton 2006:231) plus conceptually allowed region, equality line and 3 anchor points.

A model based on smooth fit of data to anchor points

When X and Y can range from 0 to 1, and three conceptual anchor points impose themselves – (0,0), (0.5,0.5) and (1,1) – the simplest family of curves passing through these anchor points is (recall Chapter 9)


Y = Xk/[Xk +(1-X)k]. [3 anchor points, no bias]

Here parameter k can take any positive values. This equation can also be expressed more symmetrically as

Y/(1-Y) = [X/(1-X)]k.

When k=1, we obtain Y=X. Then the representatives would be as extreme as their districts. Values of k exceeding 1 lead to curves in the shape of a “drawn-out S”. Then the representatives are more extreme than their districts. This is what we observe. (Values of k less than 1 would lead to “drawn-out S” curves on the opposite sides of the equality line. The representatives would be more moderate than their districts.)

Figure 21.3. Further addition of a smooth model, based on continuity between three anchor points, fitted to the data.

The parameter k expresses the steepness of the slope at X=0.5. For the given data, k=3.0 is close to the best fit. Figure 21.3 shows the corresponding curve. To determine the best fit precisely, we should graph log[Y/(1-Y)] against log[X/(1-X)] and fit a straight line. But we do not need more precision. What we need is an explanation for the broad pattern we observe.
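
For those who prefer computing to graph paper, the value of k can be read off as the slope of a least-squares line through the origin in the transformed coordinates, since log[Y/(1-Y)] = k·log[X/(1-X)]. A minimal Python sketch (my own; the (X, Y) pairs are made-up stand-ins, not Dalton’s data):

from math import log

# Illustrative (X, Y) pairs; real ones would be read off Figure 21.1.
points = [(0.2, 0.05), (0.35, 0.15), (0.45, 0.35), (0.55, 0.65), (0.65, 0.85), (0.8, 0.95)]

u = [log(X / (1 - X)) for X, Y in points]   # logit of X
v = [log(Y / (1 - Y)) for X, Y in points]   # logit of Y

# Least-squares slope of a line through the origin, v = k·u
k = sum(ui * vi for ui, vi in zip(u, v)) / sum(ui * ui for ui in u)
print(round(k, 2))

def smooth(X):
    return X**k / (X**k + (1 - X)**k)       # the three-anchor-point curve with the fitted k
print(round(smooth(0.6), 3))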


We have fitted the data to the simplest mathematical format imposed by conceptual anchor points. But what is it that imposes k=3 rather than k=1 (equality line Y=X) or k=2 (a shallower central slope) or k=4 (a still steeper slope)? In the absence of an answer, let us try a different approach that may offer some insight.

Figure 21.4. Graph in Figure 21.2, plus a kinky model based on extremes of representative behavior and their means.

A kinky model based on political polarization

Simplify the issue as much as we can, as a first approximation. Consider a two-party constellation with a purely Conservative Party and a purely Radical Party. The US Republican and Democratic Parties are imperfect approximations to such ideologically pure parties. Under such conditions we may assume that all districts with X>0.5 elect a Conservative and all those with X<0.5 elect a Radical. Consider the extreme possibilities. If a society is extremely polarized, all Radical representatives will take 0% conservative stands (Y=0) while all Conservative representatives will take 100% conservative stands (Y=1). A “step function” results (Figure 21.4), meaning a sudden jump at 0.5: Y=0 for X<0.5 and


Y=1 for X>0.5. On the other hand, if the society isn’t polarized at all, all representatives may wish to reflect their constituent mix, leading to the Y=X line in Figure 21.4.

Figure 21.5. Graph in Figure 21.2, plus combination of two kinky models, based on different ways to take means.

At the center of the graph, we have then two extremes: the vertical line, and the line at slope 1. In the absence of any other information, take the mean of these conceptual extremes at the center of the graph. The mean, however, can be taken in at least two ways. Using angles, the mean slope is 2.41, while using cotangents leads to 2.0. Shown as dotted and dashed lines in Figure 21.4, both look rather shallow, compared to the data cloud. Also, they would predict Y to drop to zero for X less than 0.25. We would have a kinky model. For central slope 2: Y=0 for X<0.25; Y=-0.50+2X for 0.25<X<0.75; Y=1 for X>0.75. (Calculation of slopes: The mean angle between the lines is (90°+45°)/2=67.5°, and tan67.5°=2.41. The mean of cotangents, (0+1)/2=1/2, leads to slope 2.)

constituents equally, her stand would be Y=0.25, at the Not-Polarized line. The mean of these two extremes is (0+0.25)/2=0.125. It’s easy to generalize that the Radical representatives would follow Y=X/2, on the average. This is roughly what we observe, indeed, for Radicals in Figure 21.5, at least up to X=0.4. The corresponding line for Conservative representatives would be Y=0.5+X/2. This is roughly what we observe for Conservatives in Figure 21.5, for X>0.60. The resulting curve would consist of two lines at slope 0.5, interrupted at X=0.5 by a vertical jump – a second kinky model. The first kinky model fits the data better in the center and the second one at the margins of the graph. The two lines intersect around X=0.37 and X=0.63, as shown in Figure 21.5. [NB! In my Figure the central section is drawn a bit steeper than it should be.] The combined model has less sharp kinks, but it still has them.

Comparing the smooth and kinky models

In sum, we have two partly conceptually based models, both of which respect the 3 conceptual anchor points. First, we have Y=Xk/[Xk+(1-X)k]. This is the simplest continuous curve. It involves no assumptions of a political nature. Its slope k at X=0.5 is a freely adjustable parameter. We do not know why the best fit with data occurs around k=3. Second, we have a kinky model that starts from politically grounded upper and lower values. Its predictions do not depend on any free parameter. Its central slope is slightly shallower than the observed tendency. Figure 21.6 compares the two models. The smooth model has the advantage of continuity of slope. Nature rarely offers sharp kinks like those in the second model. However, at very low and high X, the kinky model agrees better with data. In the center it does less well than the continuous model, but it is unfair to compare a model with no adjustable parameters to one that has one. Of course a model with an adjustable parameter can do better – but that still leaves the need to explain why the parameter has the value it has. For a model with no adjustable parameter, the kinky model does remarkably well – and it leaves no unexplained features. [NB! In my Figure the central section of the kinky model is drawn a bit steeper than it should be. This means that the fit to data is a bit less good than it looks in the graph!] Thus both models have advantages. Testing with further data (for other US Congress periods and for other two-party parliaments) might show how stable the central section of the pattern is. Variations in the parameter value of the continuous model might also cast some light on what it depends on. The US House has lately been relatively highly

polarized. How would the pattern differ in earlier, less polarized times? It should be kept in mind that the kinky model presumes clearly radical and clearly conservative parties. This presumption breaks down for the times when the US Democrats had a large conservative southern wing.

Figure 21.6. Comparison of smooth and kinky models.

The envelopes of the data cloud

The envelopes of a data cloud are the curves that express this cloud’s upper and lower limits. Such curves are proposed in Figure 21.7. The area they enclose includes most data points, though not all. Like the median curve, the upper and lower envelopes tend to follow a drawn-out S shape, starting at (0,0) and ending at (1,1). However, they do so with a bias: They reach Y=0.50 around X=0.35 and X=0.65, respectively, meaning a bias of B=±0.15 compared to the unbiased X=0.50. The simplest family of such biased curves is (Taagepera 2008: 109-110)

Y = Xbk/[Xbk+(1-Xb)k]. [3 anchor points, bias exponent b]


This can also be expressed more symmetrically as

Y/(1-Y) = [Xb/(1-Xb)]k.

Here the exponent b is connected to the bias B as b=-log2/log(0.5+B). For an unbiased system, b=1. For B=+0.15 and B=-0.15, b=1.61 and b=0.66, respectively. For the envelope curves, we should keep the same value k=3 that we used for the mean curve, unless the data strongly hints at something else.
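
Here is a small Python sketch (my own) of this conversion and of the resulting biased curves, using the values just given:

from math import log

def bias_exponent(B):
    # b = -log2 / log(0.5 + B); B shifts the crossing point Y=0.5 away from X=0.5.
    return -log(2) / log(0.5 + B)

print(round(bias_exponent(0.0), 2))     # 1.0  (no bias)
print(round(bias_exponent(0.15), 2))    # about 1.61
print(round(bias_exponent(-0.15), 2))   # about 0.66

def biased_curve(X, b, k=3):
    # Y = X^(bk) / [X^(bk) + (1 - X^b)^k]
    return X**(b * k) / (X**(b * k) + (1 - X**b)**k)

# The b=1.61 envelope indeed crosses Y=0.5 at X=0.65.
print(round(biased_curve(0.65, bias_exponent(0.15)), 2))   # 0.5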

Figure 21.7. Envelope curves.

Figure 21.7 shows the resulting envelope curves, along with the average continuous curve:

Y = X3b/[X3b+(1-Xb)3], where b=1 for the average curve and b=0.66 and b=1.61, respectively, for the envelope curves. Both the average trend and the limits of the data cloud are expressed reasonably well. Fairly few data points remain outside the area they delimit, and fairly few blank areas remain inside. [Extending the bias range to B=±0.19 would make the area include almost all outliers. Maybe it should be done during a revision.]


Why are representatives more extreme than their districts?

This example has both methodological and substantive purposes. On the methodological side, it has shown that, compared to the published analysis, much more could be extracted from the data. The data cloud could be described much more specifically than just “When x goes up, y goes up” in a linear way. This more specific description offered a better starting point for asking why the data followed this particular pattern. In addition to offering a specific example of a symmetric drawn-out S curve, it has also introduced biased drawn-out S curves. This brief introduction will not enable one to calculate the parameters for such curves, but one should be able to recognize the shapes of such data clouds and ask for help when the need arises. The notion of envelope curves has also been introduced. Different logical approaches sometimes can be applied to the same problem, and they may fit data about equally well. One still has to ponder which one makes more sense. Each of the models presented here is preferable on one of the following grounds: 1) All other things being equal, a model with no adjustable parameters is preferable to one with such parameters. 2) All other things being equal, a smooth model is preferable to a kinky one. What do we mean by “making more sense”? It’s a question of which model is more useful in the long run. Models with a wider scope and more ability to connect with other models are more useful. For this purpose, the smooth format offers more hope. On the substantive side, our more detailed data analysis leads to a logically based answer to the question “Why are representatives more extreme than their districts?” They try to balance off serving their hard-core partisans and serving all their constituents. This assertion leads to the more specific follow-up question “Why are they more extreme to the degree that we observe, and not more or less so?” Here our kinky model offers a quantitative answer that does not quite fit. The continuous model can be made to fit better, thanks to its adjustable parameter, but the specific value of this parameter remains to be explained. Maybe the two approaches could be combined, somehow.

22. Centrist Voters, Leftist Elites: Bias Within the Box

• "Always" start scales from zero, not from "1".
• Among competing models, the one that covers a wider range of phenomena carries it.

The next graph in Dalton (2006: 233), reproduced in Figure 22.1, seems to represent an issue close to that in Figure 21.1. Instead of representatives and districts, we have party elites and voters. Instead of conservatism in the US, we have opinions on a Left/Right scale in Europe. At first glance, we may see a quite similar upward curve in both figures. This time, the equality line has already been inserted in the original figure, and as in Figure 21.2, the data cloud crosses this line. But there are differences.

Figure 22.1. Party elites and voters: Original graph (Dalton 2006:233) – data plus equality line.


First, take a quick look at Figure 22.1, turn away, and answer the two following questions. If you wish to place yourself at the center of the Left/Right scale, which score would you offer? Also: How many scale intervals are there on the x scale?

If you said 10 scale intervals, look again and count them. There are 9 intervals, because the lowest possible rating is 1, not 0. If you placed yourself at 5, you didn't declare yourself a pure centrist but ever-so-slightly Left-leaning, because the arithmetic mean of 1 and 10 is 5.5, not 5.0. One can suspect that many respondents may have placed themselves at 5 when they really meant 5.5. This would introduce some distortion – and can we be sure that voters and elites would make this mistake in equal proportions? The relationship between their opinions may be a bit distorted.

We already met this confusing aspect of using scales 1 to 10 in Chapter 18. We now have cause to repeat the advice: "Always" start scales from zero, not from "1". Still, "always" is in quotation marks, because exceptions may occur. In the present case, if we continued with the scale 1 to 10, we would continuously be confused. We better convert immediately to scales that run from 0 to 1: X=(x-1)/9 and Y=(y-1)/9 (cf. Exercise 18.2). These scales are introduced in Figure 22.2. We further demarcate, of course, the allowed region and the two anchor points (0,0 and 1,1), and extend the equality line to the top right corner.

Now we may notice a difference, compared to the previous example. It is made even more evident in Figure 22.2 by drawing in the vertical and horizontal lines passing through the center point (0.5, 0.5). This center point is outside the data cloud. The data are biased toward the right. It means that, with 3 exceptions, party elites are left of their voters – and this applies even to right-wing parties! Instead of a symmetric curve, we have a biased one, like the lower envelope curve in Figure 21.7. We can fit the data to the previous equation that includes bias:

Y = X^(bk)/[X^(bk) + (1-X^b)^k].     [3 anchor points, bias exponent b]

The equation has two adjustable parameters, so we have to fit with two points in between 0 and 1. Let us fit a central location in the data cloud on the right, such as a spot below the point labeled CSU, and another on the left, such as in between the points labeled PS and LAB. By trial and error, I reached approximately b=1.73 and k=1.04. This k barely differs from 1.00, in which case the equation would boil down simply to Y=X^b. The corresponding curve is shown in Figure 22.2.
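The trial-and-error fitting can be mimicked by a coarse grid search. The sketch below (Python, my own illustration) uses two made-up target points that merely stand in for the spots described above, so the resulting b and k only demonstrate the procedure:

```python
def curve(X, b, k):
    # Y = X^(bk) / [X^(bk) + (1 - X^b)^k]
    num = X ** (b * k)
    return num / (num + (1 - X ** b) ** k)

# Two hypothetical target points read off a data cloud (stand-ins only):
targets = [(0.75, 0.61), (0.35, 0.16)]

best = None
for i in range(100, 220):          # b from 1.00 to 2.19
    for j in range(50, 350):       # k from 0.50 to 3.49
        b, k = i / 100, j / 100
        err = sum((curve(X, b, k) - Y) ** 2 for X, Y in targets)
        if best is None or err < best[0]:
            best = (err, b, k)

print("b = %.2f, k = %.2f" % (best[1], best[2]))  # near b = 1.7, k = 1.0 for these points
```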


This is not the statistical best fit of Y = X^(bk)/[X^(bk) + (1-X^b)^k] to the data. (In retrospect, I would rather fit to the points CDU and LAB, so as to reduce bias and make the curve steeper, reaching higher above the equality line.) But it is sufficiently close to get the picture. Given the degree of scatter, we could be almost as well off with a fit to Y=X^k – and it would be much simpler. Then why don't we do it? Because we would lose comparability with the previous case – and comparisons are important for theory building.

Figure 22.2. Original graph (Dalton 2006: 233) plus corrected scales and a model including bias.

If we switched to a different model, we’d imply that the European and the US situations are not comparable – and this would be the end of comparative theorizing. If we stick to the same general format, however, we imply that similar processes might be involved. Then we can decompose the question into several. Why is it that the US picture in Chapter 21 is unbiased while the European elites pull toward the Left (b=1.00 vs. b=1.73)? Why is it that the swing at the crossing point is high for US elites and almost nil for the European (k=3 vs. k=1.04)? Or

are these two features related? Is it a difference of countries or a difference in the question asked (voting on issues vs. expressing opinion)? As further data of a roughly similar type are collected, we may be able to answer such questions. In contrast, it would be a dead end if we limited ourselves to the original graphs (Figures 21.1 and 22.1). We'd simply observe that "y increases with increasing x" and that's it. Not only would we not have answers – we wouldn't even have questions begging for answers.

What about our kinky model in the previous chapter? It plainly fails here, on the right side of the field. This is illustrative of competing models. The one that applies over a wider range of phenomena carries it. Here neither model from the previous chapter works in a direct way. But the continuous model can be adjusted, by introducing a bias parameter. I see no way to adjust the kinky model.

* Exercise 22.1

These graphs have the same format as Figure 22.1: elite opinion vs. voter opinion. They come from an earlier book by Russ Dalton (1988) and include 14 parties in France, Germany and the UK. Unlike Figure 22.1, they present opinions on specific issues. The first graph deals with abortion (Dalton 1988: 215). In the second I have superimposed graphs on three issues where the patterns look rather similar: nuclear energy; further nationalization of industry; and aid to third world nations (Dalton 1988: 214, 216, 217). Due to this superimposition, the labels are blurred, and the same party label occurs three times.


Make an exact copy of the figure above. If you can't use a copy machine, paste this graph on a well-lighted window, paste blank paper on top of it, and trace all the points carefully. a) What type of models could be used here? b) Add the corresponding data fits to the graphs. c) Compare to the envelope curves in Figure 21.7. d) Do everything you can, approximately – no detailed data fitting is expected. e) Give some thought to why the elite-voter relationship is different for abortion. How does this issue stand apart from the other three?

23. Medians and Geometric Means

• Unlimited ranges go with normal distributions, arithmetic means, and linear relationships. Positive ranges go with lognormal distributions, geometric means, and fixed-exponent relationships. The logarithms of the latter are equivalent to the former.
• For two-peaked distributions, the overall median and means often make no sense.
• For single-peaked distributions, choose a mean close to the median. Geometric means are often more meaningful than arithmetic means when only positive values can occur.
• To calculate the arithmetic mean of n numbers, add them, then divide by n: (100+10+4)/3 = 38. For the geometric mean, multiply them: 100×10×4 = 4000, then take the n-th root. On a pocket calculator, push key 'y^x', enter n, push key '1/x', push key '=' and get 4000^(1/3) = 15.87 ≈ 16.
• The geometric mean of numbers corresponds to the arithmetic mean of their logarithms.
• Whenever fitting with a normal distribution yields a standard deviation larger than one-quarter of the mean, we should dump the normal fit and try a lognormal fit instead.
• When x cannot take negative values but some values drop to 0, then calculate the geometric mean, replacing these 0 values by values close to the smallest non-zero value.

Geometric means entered right in the first chapter – they could not be avoided. Now it is time to compare the median and various means more systematically.

When talking about means or averages, we tend to visualize a value such that half the items are smaller and half the items are larger. This is the definition of the median. Yet, instead of medians, we actually often deal with arithmetic or geometric means. Depending on the nature of the data, either one or the other mean approximates the median. Why don't we just use the median, if this is what we really are after? The median itself is often awkward to handle, as we will soon see. We should know which mean to use, instead. We should also recognize situations where neither mean is satisfactory, and sometimes even the median makes little sense.

What difference does it make which mean is used? Suppose we are told that the monthly mean income in a country is 10,000 euros. We may now think that half the people earn more than 10,000, but actually only about one-third do. How come? Because a few millionaires can outweigh many poor people, when it comes to the arithmetic mean. Indeed, take three incomes, 1,000, 3,000 and 26,000 euros. Their arithmetic mean (10,000) depends too much on the largest income. The geometric mean (4,300) is closer to the median (3,000).

It matters very much whether a variable can take only positive values or any values from minus to plus infinity (Chapter 14). This is so for the means, too. When variables can range from minus to plus infinity, only the arithmetic mean is possible. When variables are restricted to positive values, then the geometric mean is the likeliest to reflect the median.
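As a minimal numerical check of the income example (Python; my own illustration, not part of the original text):

```python
import statistics

incomes = [1_000, 3_000, 26_000]   # euros

print(statistics.mean(incomes))                    # 10000, pulled up by the largest income
print(round(statistics.geometric_mean(incomes)))   # about 4273, close to the median
print(statistics.median(incomes))                  # 3000
```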

A two-humped camel's median hump is a valley

All this presumes a single-peaked distribution of data. When the frequency curve has two peaks, the overall median (and means) may mislead us badly. The "median hump" of a two-humped camel lies in the valley between the two humps. The worldwide distribution of electoral district magnitudes consists of a sharp spike at M=1 for countries that use single-member districts (SMD), followed by a shallow peak around M=5 to 15 for countries that use multi-seat districts. This distribution looks like a minaret and a shallow dome next to it. Depending on the exact proportion of SMD countries, the median could happen to be M=1, hiding the existence of multi-seat districts, or around M=2 to 5, where few actual countries are located. What should we do? If the distribution has two peaks, try to handle each peak separately. In the following, we assume one-peaked distributions. Then the broad advice is to pick a mean reasonably close to the median.

Arithmetic mean and normal distribution

Suppose 35 items are distributed as in Table 23.1. (These are the frequencies of measurements in Chapter 29, Extra 1.) Most cases are in the center. Most important, the two wings are symmetric, as one can see when looking at the number of cases for each size. The median is 80, and so is the arithmetic mean (A). Recall Chapter 1: To calculate the arithmetic mean of n numbers, we add them, then divide by n:

A = Σxi/n.

Here, A = (75 + 2×76 + 3×77 + … + 85)/35 = 2800/35 = 80.


Table 23.1. Hypothetical sizes of 35 items.

Size:             75  76  77  78  79  80  81  82  83  84  85
Number of cases:   1   2   3   4   5   5   5   4   3   2   1
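A quick check of these figures (Python, my own sketch):

```python
sizes  = [75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85]
counts = [ 1,  2,  3,  4,  5,  5,  5,  4,  3,  2,  1]

n = sum(counts)                                    # 35 items
total = sum(s * c for s, c in zip(sizes, counts))  # 2800
print(n, total, total / n)                         # 35 2800 80.0
```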

When the distribution is symmetric, the arithmetic mean yields the median. One particular symmetric distribution that occurs frequently is the bell-shaped normal distribution, roughly shown in Figure 23.1. What's so "normal" about it? This is a prime example of an ignorance-based model. When the items involved can in principle have any positive or negative values, the very absence of any further knowledge leads us to expect a distribution that fits a fairly complex equation – the one that produces the normal curve. (This equation is not shown here.) Normal distribution has long symmetric tails in both directions. Its peak (the "mode") is also the median and the arithmetic mean. Normal distribution is fully characterized by two constants, the arithmetic mean and a typical width, standard deviation (σ, sigma). Each tail that goes beyond one standard deviation includes only 1/(2e) = 1/(2×2.718) = 18.4% of all the cases. The form of the curve is prescribed by our ignorance, but the location of the mean and the width σ of the peak are not.

Figure 23.1. Normal and lognormal distributions. For normal distribution, standard deviation σ characterizes the width of the peak.


Beyond two standard deviations, the normal distribution falls to extremely low values. However – and this is important – it falls to utter zero only at plus and minus infinity. Thus, in principle, normal distribution does not apply to quantities that cannot go negative. When applied to heights of people, normal distribution would suggest that, in extremely rare cases, people with negative heights would occur. True, if the mean is far away from zero and standard deviation is much less than the mean, then normal distribution still works out as a pretty good approximation.

Suppose that women's mean height is 160 cm and the standard deviation is 10 cm. Then the zero point is so far away (16 standard deviations) that it might as well be at minus infinity.

However, suppose it is reported that the mean number of telephones per 1000 people in various countries is 60, with standard deviation 70. (I have seen such reports in print!) This is what Figure 23.1 shows. It would mean that more than 18% of the countries have negative numbers of telephones! Obviously, the actual distribution is not normal, and the normal model must not be applied. Our models must not predict absurdities.

In sum, for normal distribution the arithmetic mean equals the median. Strictly taken, normal distribution applies only to quantities that can range from plus to minus infinity. We can apply it to quantities that cannot go negative, if the standard deviation turns out to be much less than the mean. When, to the contrary, the standard deviation exceeds one quarter of the mean, normal distribution must not be used. What do we do instead? Try lognormal distribution.

Geometric mean and lognormal distribution

To calculate the geometric mean of n numbers, we multiply them, then take the n-th root:

G = (Пxi)^(1/n).

Like Σ (capital sigma) stands for the summation of terms xi, here П (capital pi) stands for their product. For 100, 10 and 4, the product is 100×10×4 = 4000. On a pocket calculator, push key 'y^x', enter n, push key '1/x', push key '=', and get 4000^(1/3) = 15.87 ≈ 16.

Table 23.2. Hypothetical sizes of 10 items.

x:       1    2    5    7   10   11   13   20   50  100
log x: .00  .30  .70  .85 1.00 1.04 1.11 1.30 1.70 2.00

Suppose we have 10 items of sizes shown in Table 23.2. These might be the weights of dogs in US pounds. The median is (10+11)/2=10.5. The arithmetic mean is much larger – 21.9 – because it is much too much influenced by the few largest components. But the geometric mean is quite close to the median:

G = (Пxi)^(1/n) = (1×2×5×…×100)^(1/10) = 10,010,000,000^0.1 ≈ 10.


Here the geometric mean reflects the median. This is so because we have many small and few large entries. If we group the data by equal size brackets 0 to 19.9, 20 to 39.9, etc., we get 7 – 1 – 1 – 0 – 0 – 1 items per group – the distribution is far from symmetric, with the peak at the low end.

The geometric mean corresponds to the median when the distribution is lognormal. This is a distribution where all entries are positive, and it has a long drawn-out tail toward the large values, as shown in Figure 23.1. Why is this distribution called lognormal? Because the distribution of the logarithms of such data is normal. Take the logarithms of the 10 items above, as also shown in Table 23.2. Group these logarithms by equal size brackets 0 to 0.49, 0.5 to 0.99, etc., and we get 2 – 2 – 4 – 1 – 1. This is more balanced, with a peak at the center. (By juggling the data a bit, I could get a perfectly symmetric distribution, but this would be somewhat misleading. With so few data points, we obtain only approximations to smooth distributions.) What is the meaning of going by brackets 0 to 0.49, 0.5 to 0.99, etc.? We effectively divide the data into multiplicative slots: going from 1 to 3.16, then from 3.16 to 10, then from 10 to 31.6, and so on.

The geometric mean has a similar relationship to the arithmetic mean: The geometric mean of numbers corresponds to the arithmetic mean of their logarithms. Indeed, G = (Пxi)^(1/n) leads to

logG = (1/n)log(Пxi) = (1/n)log(x1·x2·x3·…) = (1/n)(logx1 + logx2 + logx3 + …) = Σ(logxi)/n.

When our pocket calculator does not have a 'y^x' key but has 'logx' and its reverse, '10^x', then we can take the logarithms of all numbers and calculate their arithmetic mean, which is logG. Then put 10 to the power logG, to get G. Note in Figure 23.1 that the geometric mean for the lognormal distribution is NOT at the mode (the peak) – it is further right. It has to be, because the areas under the curve, left and right of the median, are equal by definition. The arithmetic mean is even further to the right.
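Both routes to the geometric mean can be checked with a few lines of Python (my own sketch, not part of the text):

```python
import math

def geometric_mean(values):
    # Multiply all values, then take the n-th root
    return math.prod(values) ** (1 / len(values))

def geometric_mean_via_logs(values):
    # Equivalent route: arithmetic mean of the logs, then 10 to that power
    mean_log = sum(math.log10(x) for x in values) / len(values)
    return 10 ** mean_log

print(round(geometric_mean([100, 10, 4]), 2))        # 15.87, i.e. about 16

dogs = [1, 2, 5, 7, 10, 11, 13, 20, 50, 100]         # Table 23.2
print(round(geometric_mean(dogs), 2))                # about 10.0
print(round(geometric_mean_via_logs(dogs), 2))       # same value via logarithms
```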


The median is harder to handle than the means

When talking about means or averages, we most often are really interested in the median. When the distribution is normal, the median equals the arithmetic mean – this was the case for the numbers in Table 23.1. When the distribution is lognormal, the median equals the geometric mean – this was approximately the case for the numbers in Table 23.2. But if we are really interested in the median, then why don't we just calculate the median? The median is often awkward to handle, for the following reasons.

For arithmetic and geometric means, we just add or multiply the numbers in any random order. This is easily done on a pocket calculator. For the median, we would have to write them out, arranged by size. This is a minor hassle, and of course, computers can handle it easily. But suppose someone else had a data set like the one in Table 23.2. She reports that for these 10 items, A=21.9, G=10.0, median=10.5. Much later, we locate two other items that fit in, with values 6 and 8. They clearly reduce the median and the means, but by how much? What are the new values?

For the arithmetic mean, it's simple. We restore the previous total (10×21.9 = 219) and add the new items. The new arithmetic mean is A' = (219+6+8)/(10+2) = 233/12 = 19.4. Similarly, the new geometric mean is G' = (10.0^10 × 6 × 8)^(1/(10+2)) = (4.8×10^11)^(1/12) = 9.4. For the median, in contrast, we are stuck. Having its previous value does not help us at all. We'd have to start all over, arranging the items by size – but we don't have the original data! Even if we can locate the author, she may have already discarded the data. This is why medians are harder to handle than means. We can build on previous means, but with medians we may have to start from scratch.
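The updating argument is easy to verify (Python, my own sketch; the reported summary values are taken from the text):

```python
import math

n, A, G = 10, 21.9, 10.0        # reported for the Table 23.2 items
new_items = [6, 8]
m = n + len(new_items)

# Arithmetic mean: restore the old total, add the new items, divide again.
A_new = (n * A + sum(new_items)) / m
# Geometric mean: restore the old product (G^n), multiply, take the new root.
G_new = (G ** n * math.prod(new_items)) ** (1 / m)

print(round(A_new, 1), round(G_new, 1))   # 19.4 9.4
# The median cannot be updated from A, G and the old median alone;
# it needs the full, sorted raw data.
```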

The sticky case of almost lognormal distributions

Recall the situation where the mean number of telephones per 1000 people in various countries is 60, with standard deviation 70. The distribution clearly cannot be normal. We should try lognormal distribution and calculate the geometric mean rather than the arithmetic. But suppose there is one country without a single telephone. Single-handedly, it sinks the geometric mean to zero! Take the data in Table 23.3. The arithmetic mean clearly exceeds the median (0.9), but G=0 would under-represent it outrageously. What should we do? I propose the following.


When the numbers to be averaged are a1 = 0 < a2 ≤ a3 ≤ …, replace the zero item by the value that continues downward the ratio between the two smallest non-zero items:

a1' = a2^2/a3.

In the present case a1' = 0.2^2/0.7 ≈ 0.06, and G = 0.75, reasonably close to the median.

Table 23.3. Hypothetical sizes of 6 items, one of which is 0.

ai:  0   0.2   0.7   1.1   2.0   10.0                        Median = 0.9    A = 2.33    G = 0
Calculating G for the non-0 items only counts 0 as if it were 1.252:                     G = 1.252
Replacing 0 with 1 counts 0 at more than the median value:                               G = 1.206
Replacing 0 with 0.2^2/0.7 = 0.06 yields G close to the median:                          G = 0.755

Why do I propose this? Using equal ratios means we assume that the values at the low end of the lognormal distribution increase roughly exponentially. What other choices do we have? Some computer programs sneakily omit the 0, without telling you so, calculate the geometric mean of the 5 non-zero items, and report G=1.252. By so doing they effectively replace the “0” by 1.252, a number that in this case is larger than the median! (Test it: Replace 0 by 1.252, take the geometric mean of the 6 numbers, and see what you get.) This patently makes no sense. Some other programs multiply together the 5 non-zero items but then take the 6th root rather than the 5th, and report G=1.206. By so doing they effectively replace the “0” by 1, which is again a number larger than the median. (Test it!) This makes no sense either. We should not replace 0 by something that exceeds the smallest non-zero item.
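The three strategies in Table 23.3 are easy to compare numerically (Python, my own sketch):

```python
import math

def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))

data = [0, 0.2, 0.7, 1.1, 2.0, 10.0]        # Table 23.3, sorted
nonzero = data[1:]

drop_zero  = geometric_mean(nonzero)                  # omit the 0: about 1.25
sixth_root = math.prod(nonzero) ** (1 / len(data))    # 6th root of 5 items: about 1.21
proposed   = geometric_mean([nonzero[0] ** 2 / nonzero[1]] + nonzero)   # about 0.75

print(round(drop_zero, 3), round(sixth_root, 3), round(proposed, 3))
# Only the last strategy stays below the median of 0.9.
```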


How conceptual ranges, means, and forms of relationships are connected

Unlimited ranges go with normal distributions, arithmetic means, and linear relationships. Positive ranges go with lognormal distributions, geometric means, and fixed-exponent relationships. The logarithms of the latter are equivalent to the former.

This is quite a mouthful. What does it mean? Chapters 14 and 23 are interrelated. The conceptually allowed range matters in the following way. When any values are possible, from minus to plus infinity, and things are randomly distributed, they are likely to follow a "normal" pattern around the arithmetic mean. Its equation (not shown here) is established on sheer probabilistic grounds, based on ignorance of anything else. Also, when two variables can range from minus to plus infinity, their relationship may well be linear.

When only positive values are conceptually possible, this constraint "squeezes" the negative side of the normal distribution to zero, and lognormal distribution results, with the median at the geometric mean. Also, when two variables are constrained to only positive values, it "squeezes" their simplest relationship from linear to fixed exponent.

This reasoning works even more clearly in the opposite direction. Take the logarithms of 1 and of 0. Log1 is pulled down to 0, while log0 is pulled way down, to minus infinity. We are now back to the unbounded range, where the logarithms of values from 0 to 1 fill the entire negative side. This is how the lopsided lognormal distributions of x, such as people's incomes, are pulled out into nicely symmetric normal distributions of logx. The geometric mean of x then corresponds to the arithmetic mean of logx. And the curve y=Ax^k, limited to x>0 and y>0, is unbent into the straight line logy = logA + k·logx, which can take negative values.

When we notice these broad correspondences, many seemingly arbitrary and unconnected features start making sense in a unified way. Physical and social relationships respect the resulting patterns, not because of some conscious decision to do so but precisely because these patterns describe what happens when no conscious decision is taken. Randomness has its own rules.

24. Fermi's Piano Tuners: "Exact" Science and Approximations

• Exact sciences mean sciences that strive to be as exact as possible but no more exact than needed. This avoids getting stuck in details.
• Estimates impossible at first glance can be decomposed into a sequence of simpler estimates that can be answered approximately.
• One should develop a sense for typical sizes of things, which could then be fed into such approximations. This is part of basic numeracy.
• Dimensional consistency is needed – and it is also helpful in calculations.

We have repeatedly used approximations, and you may feel uneasy about it. Aren’t natural sciences “exact sciences” and shouldn’t social sciences try to become more of an exact science? Well, exact science does not mean that every result is given with four decimals. It means being as exact as possible at the given stage of research. This makes it possible to be more exact in the future.

As exactly as possible – and as needed

Nothing would stifle the advance of social science more than advice to give up on quantitative approaches, just because our first measurements involve a wide range of fluctuation or our conceptual models do not quite agree with the measurements. A three-decimal precision will never be reached if one refuses to work out problems approximately at first.

But even "as exact as possible" must be qualified. In later applications, there is little point in using more precision than is needed for the given purpose. And while building a logical model, one might initially apply "no more exact than needed". What does it mean? Take volatility. We restricted ourselves to a single input variable – the number of parties. Even here we at first built a linear model that ignored a glaring constraint: volatility cannot surpass 100 per cent. We simplified and aimed at an approximation that might work at the usual levels of volatility, much below 100 per cent. We got a fair fit to this coarse model (Figure 11.2). The refined exponential model would be needed only if we met a huge effective number of parties. If we had tackled the refined model first, we might have been lost in needless complexities.

How many piano tuners?

Approximations are not something social scientists are reduced to. The ability to approximate is useful in physics too, as the great physicist Enrico Fermi kept stressing. (How great was Fermi? Great enough to have the 100th element called Fermium.) His favorite question to new students was: "How many piano tuners are there in the city of New York?" This takes longer than "How many parties might win seats?". Fermi's students' first reaction most likely was that they could not possibly know. It wasn't even a question about physics.

Fermi's intention was two-fold. First, show that questions can be decomposed into a sequence of simpler questions that can be answered approximately. Second, develop in students a sense for typical sizes of things, which could then be fed into such approximations.

For piano tuners, our first question would be: How many people are there in the city of New York? Forget about how one defines this city, within a larger megapolis. Forget about decimals. An estimate of 10±2 million is close enough. However, we must have some general sense of how large cities are. It helps to have one city's population memorized. If you have memorized that Chicago has about 5 million or that Stockholm has about 1 million, you can peg other cities to that knowledge. Would Budapest be larger or smaller than Chicago or Stockholm? Knowing that Hungary has 10 million people would help.

Many cities in the developing world are presently growing with such speed that my estimates might fall behind the times. Also, since Fermi's times (say, 1960), pianos have given way to electronic noisemakers. So let's specify: How many piano tuners might there have been in the city of New York around 1960? The sequence of questions might be the following.

• How many people in New York?
• How many households in New York? We would have to ask first: How many people per average household?
• What share of households might have a piano? And compared to the number of household pianos, how many pianos elsewhere (concert halls etc.)?
• How often should a piano be tuned? And how often is the average piano tuned?


• How long does it take to tune a piano?
• So what is the total workload for piano tuners, in hours, during one year?
• At 40 work hours per week, how many piano tuners would New York keep busy?

Exercise 24.1
How many piano tuners might there have been in the city of New York around 1960? a) Offer your estimate for each step. b) Calculate the resulting number of piano tuners. c) A student of mine looked up the US census around 1975. Believe it or not – it did have the number of piano tuners in New York: about 400. By what percentage or by what factor was your estimate off?

The range of possible error

How good is our guess for the number of piano tuners? How much off are we likely to be? Let us look realistically at the possible errors at each step.

• How many people in New York? We might be off by a factor of 2: relative error ×÷2, meaning "multiply or divide by 2".
• How many households in New York? Once we ask this, we discover that we first have to ask: How many people per average household? It could again be ×÷2.
• What share of households might have a piano? This is where students in the 1960s and 1970s were liable to be far off, depending on whether their own parents did or didn't have a piano – and we have to think back to surroundings without computers. Also, compared to the number of household pianos, how many pianos elsewhere (concert halls etc.)? This adds to the uncertainty. We might be off by ×÷4.
• How often does a piano get tuned? We may have no idea. Phoning a piano tuner would give us only a lower limit – how often a piano should be tuned. But how often is the average piano tuned in reality? I know one that hasn't been tuned for 20 years. We might be off by ×÷3.
• How long does it take to tune a piano? We might be off by ×÷3, if we have not seen one at work.
• So what is the total workload for piano tuners, in hours, during one year? We have a long string of multiplications but no new error, unless we make a computation error.


• Assuming a 40-hour workweek, how many piano tuners would it keep busy? No new error here either, unless we make a computation error.

By what factor are we likely to be off on the number of piano tuners? Suppose we overestimate every factor we multiply and underestimate every factor we divide by. By the error estimates above, we could then overestimate by a factor of

2×2×4×3×3=144≈150.

Instead of 400 tuners, we could have found 150×400 = 60,000 tuners. If we underestimated to the same degree, we could propose 400/150 = 2.7 ≈ 3 tuners. An estimate that could range from 3 to 60,000 – this was not the point Fermi was trying to make.

When you did the exercise above, you probably got much closer, most likely within ×÷10, meaning 40 to 4000 piano tuners. Why? You were most likely to err in random directions. At some steps your error boosted the number of tuners, and at some other steps it reduced it. If you were very lucky, you might have horrendous errors at each step, yet end up with the actual value, if your errors cancelled out perfectly.

What is the most likely error range on multiplicative sequences of estimates? My educated guess is: Take the square root of the maximal combined error. For piano tuners, 144^(1/2) = 12. Indeed, being off by ×÷12 fits with my average experience with untutored students. However, the combined error cannot be lower than its largest single component. If we combine ×÷2 and ×÷4 to ×÷8, then the actual likely error is ×÷4, even though 8^(1/2) = 2.8.
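For concreteness, here is the whole chain in Python (my own sketch; every input below is a hypothetical guess of mine, not a figure from the text), followed by the error-factor rule of thumb:

```python
import math

# Hypothetical estimates for New York around 1960 (all of them are guesses):
people               = 8_000_000
people_per_household = 3
pianos_per_household = 0.10      # one household in ten
household_share      = 0.80      # household pianos / all pianos
tunings_per_year     = 0.5       # the average piano is tuned every two years
hours_per_tuning     = 2
work_hours_per_week  = 40
weeks_per_year       = 50

households = people / people_per_household
pianos     = households * pianos_per_household / household_share
work_hours = pianos * tunings_per_year * hours_per_tuning      # per year
tuners     = work_hours / (work_hours_per_week * weeks_per_year)
print(round(tuners))             # about 170 with these particular guesses

# Error-factor rule of thumb: the worst case is the product of the step errors,
# a more typical miss is roughly its square root.
step_errors = [2, 2, 4, 3, 3]
worst = math.prod(step_errors)
print(worst, round(math.sqrt(worst)))    # 144 and 12
```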

Dimensional consistency

We briefly encountered the dimensional consistency requirement in Chapter 6. It looked like another formality. In the present problem, it actually becomes extremely useful. But if it mystifies you, leave it for later – it's not essential, just helpful. We started with the city population. The larger it is, the more piano tuners there must be. But so as to find the number of households, should we multiply or divide by the number of people in a single household? We may use our good sense. If we multiply, we get more households than people, so I guess we must divide. And so on.


But introduce units, or quasi-units. City population is not just so many million – it’s so many million persons. What is the unit for the number of persons in the household? The unit is persons/household. Now, if by mistake we multiply the population and the number of persons per household, the units also multiply, resulting in

[x persons] × [y persons/household] = xy [persons]^2/household.

Persons squared? No, this is not what we want. So let us try dividing:

[x persons] / [y persons/household] = (x/y) households.

By the usual rules of arithmetic, persons/persons cancel out, and 1/[1/household]=household. Yes, this is the unit we want to have. So division must be the right way to combine the population and the number of persons per household. Overall, we want a sequence where, after the rest cancels out, only the unit “tuners” remains. I’ll drop the numbers (such as x and y above) and show only units. Go by three steps. The first is for the total number of pianos:

([persons] × [pianos/household]) / ([persons/household] × [household pianos/all pianos]) = pianos.

Check that, indeed, everything else cancels out, leaving pianos. The second step is for the total work hours/year needed:

[pianos] × [tunings/(year×piano)] × [work hours/tuning] = work hours/year.

The third step is for the number of piano tuners:

[work hours/year] / ([work hours/(tuner×week)] × [weeks/year]) = tuners.

Check that everything else cancels out, leaving tuners. We could do the whole operation in one mammoth step:

([persons][pianos/household][tunings/(year×piano)][work hours/tuning]) / ([persons/household][household pianos/all pianos][work hours/(tuner×week)][weeks/year]) = tuners.

Here we just have to insert the numbers in their proper places (multiplying or dividing). But maybe such a long single sequence is too much of a mouthful.
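Dimensional bookkeeping of this kind can even be automated. Below is a toy sketch (Python, my own construction, not from the text) that stores unit exponents in a dictionary; multiplying quantities adds the exponents, dividing subtracts them, and at the end only "tuners" should remain:

```python
class Quantity:
    """A number with unit exponents, e.g. {'persons': 1, 'household': -1}."""
    def __init__(self, value, **units):
        self.value = value
        self.units = {k: v for k, v in units.items() if v}

    def _combine(self, other, sign):
        units = dict(self.units)
        for k, v in other.units.items():
            units[k] = units.get(k, 0) + sign * v
        return {k: v for k, v in units.items() if v}

    def __mul__(self, other):
        return Quantity(self.value * other.value, **self._combine(other, +1))

    def __truediv__(self, other):
        return Quantity(self.value / other.value, **self._combine(other, -1))

    def __repr__(self):
        return f"{self.value:g} {self.units}"

population     = Quantity(8e6, persons=1)            # hypothetical inputs throughout
per_household  = Quantity(3, persons=1, household=-1)
pianos_per_hh  = Quantity(0.10, pianos=1, household=-1)
share          = Quantity(0.80)                      # treated as dimensionless here
tunings        = Quantity(0.5, tunings=1, years=-1, pianos=-1)
hours_per_tune = Quantity(2, hours=1, tunings=-1)
hours_per_week = Quantity(40, hours=1, tuners=-1, weeks=-1)
weeks_per_year = Quantity(50, weeks=1, years=-1)

result = (population / per_household * pianos_per_hh / share
          * tunings * hours_per_tune / hours_per_week / weeks_per_year)
print(result)    # the leftover unit should be {'tuners': 1}
```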

25. Examples of Models across Social Sciences

• Sociology offers examples where the number of communication channels can be used.
• Political history offers examples where reasoning by extreme values and differential equations, especially exponential, come in handy.
• In demography, differential equations, exponential and more complex, are useful.
• Economics offers examples for using allowed areas and anchor points, plus differential equations.

Most examples in this book referred to political institutions. This is so because I found the simplest examples in that subfield. Do the models presented apply beyond political science? A student asked me this. Here are some examples from my own work.

Sociology: How many journals for speakers of a language?

How would the total circulation of journals (J) published in a language increase as the number of speakers of that language increases? Language means communication. As population (P) increases, the number of communication channels (c) increases proportionally to P squared. Journals might expand at a similar rate, because with more speakers more specialized journals can be introduced. Thus we might test the model J=kP^2. Circulation per capita, j=J/P, would then be j=kP. Per capita circulation is proportional to population. Graphing on log-log scale, the slope must be 1, if the model fits.

Such a test needs a group of languages under rather similar conditions, so as to eliminate a multitude of other factors. I found such a group in the languages belonging to the Finno-Ugric language family and spoken in the Soviet Union, where the authoritarian regime made the conditions highly uniform. Circulation was weighted, giving as much weight to a thick monthly magazine as to a daily newspaper. Graphing on log-log scale does indeed lead to slope 1 for nations with more than 10,000 speakers (Figure 25.1). For smaller populations, figures become erratic. The best fit is around j = 2×10^-6 P (Taagepera 1999: 401-402). This is an example of how the basic notion of the number of communication channels applies in sociology.
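The test itself, fitting a slope on log-log scales, takes only a few lines. The sketch below (Python, my own illustration) uses invented (P, j) pairs that merely stand in for the Finno-Ugric figures:

```python
import math

# Invented per-capita circulation data, roughly following j = 2e-6 * P:
data = [(10_000, 0.022), (100_000, 0.19), (1_000_000, 2.1)]

xs = [math.log10(P) for P, j in data]
ys = [math.log10(j) for P, j in data]
n = len(data)
x_mean, y_mean = sum(xs) / n, sum(ys) / n

slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
constant = 10 ** (y_mean - slope * x_mean)

print(round(slope, 2), constant)   # slope close to 1, constant close to 2e-6
```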


Figure 25.1. Weighted per capita circulation of journals vs. number of speakers of Finno-Ugric languages. Data from Taagepera (1999: 401-402).


Political history: Growth of empires

As history unfolds, more advanced technology enables larger polities (political entities) to form, so that the total number of polities decreases. What is the pattern of this decrease? First, we have to define the number of polities, given that some are large while others are tiny. The aforementioned effective number of components (Chapter 2) comes in handy. Should we consider size by area or by population? Let us do both. But the two are interrelated, because large polities do not form in empty space – they tend to form where population is the densest. Indeed, it can be shown that the effective number of polities by area can be expected to be the square of their number by population: NA = NP^2. This follows from consideration of geometric means of extremes.

As we graph logN against time, over 5,000 years (Figure 25.2), we find that the broad pattern is linear, which means that N decreases exponentially with time. This is not surprising, because exponential change is the simplest pattern around. But the two lines are interconnected in two ways.


First, they both must reach N=1 at the same time, as this would correspond to a single polity encompassing the entire world. The statistical best-fit lines (OLS logN vs. t) cross around N=3, but it would take only a minor adjustment to make them cross at N=1. This would happen around year 4,000. So don't expect (or be afraid of) a single world empire any time soon.

Second, if NA = NP^2, then the slope for the area-based line must be double the slope for the population-based line. This is pretty much so, indeed. But why does the rate constant k (the slope in Figure 25.2) have this particular value? Why isn't the slope shallower or steeper? We don't know as yet. This is an example of how the exponential model and the use of geometric means of extremes apply in political history.

Figure 25.2. Decrease of effective number of polities over time (Taagepera 1997), based on area (NA) and population (NP). Since N is graphed on log scale, the exponential curve becomes a straight line.

Demography: World population growth over one million years

Differential equations are widely used in demography. Growth of populations often follows the exponential pattern dP/dt = kP. However, growth of world population from CE 400 to 1900 followed an even steeper pattern: P = A/(D-t)^m, where A, D and m are constants. How do we determine D? By trial and error.
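To see what "even steeper than exponential" means, one can simply tabulate the formula (Python, my own sketch; the constants below are made up for illustration, not the fitted values):

```python
# Quasi-hyperbolic growth P = A/(D - t)^m with purely illustrative constants:
A, D, m = 2.0e11, 2025, 1      # NOT fitted values, just an example

for t in (1000, 1500, 1900, 1950, 2000, 2020):
    P = A / (D - t) ** m
    print(t, round(P / 1e9, 2), "billion")
# The predicted population explodes as t approaches D, which is why the
# model must break down before that date.
```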


The nasty thing about such a "quasi-hyperbolic" growth is that, when time reaches t=D, population would tend toward infinity. World population began to veer off this model by 1900 and by 1970 clearly began to slow down. How can we modify quasi-hyperbolic growth so as to include this slowdown? The model involves population interacting with technology and the Earth's carrying capacity (Taagepera 2014). It projects toward a population ceiling at 10 billion. The corresponding graph in Figure 25.3 has not yet been published. It shows two separate phases and a puzzling kink around CE 200. This is an example of how the exponential and more complex rate equations apply in demography.

Figure 25.3. Human population during the last one million years graphed on logarithmic scale against time prior to 2200, also on logarithmic scale. Vertical error flags show the ranges of estimates.


Economics: Trade/GDP ratio

Some countries export most of what they produce; so they have a high Exports/GDP ratio and a correspondingly high Imports/GDP ratio. (GDP means Gross Domestic Product.) Some other countries have little foreign trade, compared to their GDP. Could this ratio depend on country size?

Let us carry out a thought experiment. If a country came to include the entire inhabited world, what would be its Trade/GDP ratio? It must be 0, since this country has no one to trade with. So we have an anchor point: P = Pworld → Exports/GDP = Imports/GDP = 0. At the opposite extreme, if a country consisted of a single person, what would be its Trade/GDP ratio? It would be 1, because all this person's monetary transactions would be with people outside her own country. So we have another anchor point: P = 1 → Exports/GDP = Imports/GDP = 1.

It is now time to draw a graph of forbidden areas and anchor points (Figure 25.4). Visibly, P can range only from 1 to Pworld, and Exports/GDP can range from 0 to 1 (or 100%). Given the huge difference between 1 and Pworld, we better graph P on a logarithmic scale.

Figure 25.4. Exports/GDP graphed against population. In 1970, Pworld ≈ 4×10^9.


How can we move from the top-left anchor point to the one at lower right? The simplest way would be a straight line, and indeed, a tentative

entropy-based model (Taagepera 1976) would lead to just that: A fraction logP/logPworld of GDP is not exported, and hence Exports/GDP = 1 - logP/logPworld. Trouble is, actual data do not fit. Countries with populations below one million tend to have much higher trade ratios than expected, while countries with populations above 10 million tend to have lower trade ratios than expected. The curve shown in the graph very roughly indicates the actual trend. From this point on, some skills are used which are beyond the scope of this book. The main thing is to note that it all starts with Figure 25.4: allowed area and anchor points.

Figure 25.5. Empirical approximations for dependence of Imports/GDP and Exports/GDP on population (Taagepera and Hayes 1977).


Taagepera (1976) started with the classic differential equation for absorption in physics: dI/dr = -kI. Here I is flow intensity, r is distance from the source (of neutrons in physics, or of goods in the economy), and the constant k reflects how rapidly the stuff produced is absorbed. This means simple exponential decrease: I = I0·e^(-kr), where I0 is the intensity at the source, such as a factory producing goods. The flow that reaches the country border is counted as export. There are many sources (factories) spread across the country, so that the equations become more complex. In one dimension, they can be solved. Unfortunately, countries are two-dimensional, making it even more complex. An approximate solution can be worked out, and it fits the data cloud.

This model makes outrageous simplifying assumptions. It is assumed that an infinite flat world is uniformly populated. Thus a finite world population cannot enter. All goods are assumed to be absorbed at the same rate over distance, be it milk or oil. The model can be tested using either country area or population. Not surprisingly, population yields a better fit – people absorb goods, not space. Despite such simplifications, this model still explains how trade depends on population, on the average. For practical purposes, it was found simpler to replace the complex model by empirical approximations (Taagepera and Hayes 1977):

Imports/GDP = 40/P^(1/3).     Exports/GDP = 30/P^(1/3).

How could imports exceed exports? Countries also pay for imports by revenue from shipping, tourism, exporting labor, etc. As seen in Figure 25.5, the Imports equation fits within a factor of 2, while the scatter is much wider for Exports, due to the variety of the other revenue items. Note that Figure 25.4 has the trade ratio on regular scale and in fractions of 1, while Figure 25.5 has it on logarithmic scale and in percent. My co-author for this work, Jim Hayes, was an undergraduate student.

How do those approximations fit with the logical anchor points? At populations equal to world population (around 4 billion in 1970), the equations predict Imports/GDP = 0.0006 = 0.06% and Exports/GDP = 0.04%, rather than the conceptually required 0. For a logical model, any non-zero value would be unacceptable, but for an approximation, this is quite close – as long as we keep in mind that this is an approximation. At the other extreme, Imports/GDP would reach 1 when P = 6,400 rather than 1, and Exports/GDP would do so at 2,700 persons. This is far from the anchor point of 1 person.

The detailed model and its approximations in Figure 25.5 could be part of an S-shaped curve (see Chapters 21 and 22) which joins the anchor points and goes steeply down around P = 1 million (10^6).

Indeed, we could get an imperfect fit of data to logical anchor points by using the equation used in Figure 21.7, which corresponds to adding a third central anchor point: Y/(1-Y) = [X/(1-X)]^k.

Figure 25.6 is similar to Figure 25.4, except that the x-scale has logP divided by logPworld, so that logically allowed values range from 0 to 1. We could get an imperfect fit of data to logical anchor points by using an equation similar to the one introduced in Figure 9.4: (1-Y)/Y = [X/(1-X)]^k. This symmetric "drawn-out S" curve drops too soon, compared to the data cloud. A better fit is obtained by adding a "bias exponent" b (cf. Chapter 21): (1-Y)/Y = [X^b/(1-X^b)]^k. This shifts the bending point away from (0.5, 0.5).

Why would the Exports/GDP ratio take such a path between the two anchor points, rather than a simpler path such as Y=X^k? This remains to be explained. Imagine a country with only 300 inhabitants, so that X=0.25 in Figure 25.6. We can well imagine that it would still export almost all of what it produces and import almost all of what it needs, in line with the dashed part of the curve. On the other hand, consider a country with 30 million people, so that X=0.75 in Figure 25.6. We can visualize that it could easily produce much more than 75% of what it needs, in line with the curve shown. Thus this curve makes sense. But it's a long way from what makes sense to a quantitatively predictive logical model.

Figure 25.6. Exports/GDP graphed against normalized population.



As in Chapter 21, we observe here several competing and partly contradictory approaches. They offer examples of how allowed areas, anchor points, and differential equations more complex than the exponential can apply in economics.

26. Comparing Models

• One can establish logical relationships among factors without worrying too much about causality, ability to indicate a specific process or mechanism, and distinctions between models and formats, and between general and substantive models.
• Broad models based on near-total ignorance apply in all sciences. Normal distribution and exponential change are prime examples.
• Constraints on the range of values of factors can determine the simplest forms of relationships between them. Call these relationships logical models or logical formats – the way to use them does not change.
• Substantive models are specific to a field of inquiry. They are bound to operate within the framework of broader logical formats. They may show up when data deviate from the simple ignorance-based expectations.
• Many models are "as if": Complexly integrated processes behave (produce outcomes) as if they could be broken up into simple components.
• Development of truly substantive social models, which introduce specifically social processes or mechanisms, depends on interlocking of fairly universal partial models: connections among connections.

I feel uneasy about this concluding chapter. If you feel it helps you, use it. If it doesn't, forget it. This comparative perspective may be of use to those doctoral students who have already pondered broad methodology. The hands-on skills in model building can be attained without it.

This attempt to systematize some aspects of some models should not fool us into thinking that this is the entire range of model building. When faced with a possible relationship between two or more factors or quantities, the nature of the problem should drive our thinking. Model building should not be reduced to looking for the most suitable format in an existing grab bag, even while we inevitably do give a chance to those formats we already know best. One has to think, and the model depends on the assumptions made. Whether these assumptions are adequate can be evaluated by the degree of fit to data. (Note that I prefer to use "factor" or "quantity" rather than the mathematical term "variable".)


An attempt at classifying quantitatively predictive logical models and formats

Table 26.1 tries to classify those logical models and formats mentioned in this book. This is of course only a small sample of basic forms, and they do not fit neatly into a table. Note that these are quantitatively predictive models, not merely directional ones. Note further that we do not deal with linear data fits, which some people call "empirical models"; those fits are not predictive but merely "postdictive" for the specific data on which they are based.

Table 26.1. An attempt at classifying quantitatively predictive logical models and formats mentioned in this book.

                               Static Formats                                   Rate Equations / Other Processes and Mechanisms

Ignorance-Based,               Normal distribution
1 variable                     Lognormal distribution
                               Mean of the limits (m, M)

Ignorance-Based,               2 quadrants allowed                              → Exponential: dS/dt = kS
2 variables                    Zone in 2 quadrants                              → Simple logistic: dS/dt = kS(1-S/M)
                               1 quadrant allowed: y = Ax^k
                               Zone in 1 quadrant: anchor and floor/ceiling     → Exponential, preferably
                               Box: …

Substantive                    Systematics of k in Y = X^k, k in S = S0·e^(kt), etc.

Interconnecting models: CONNECTIONS AMONG CONNECTIONS


Some distinctions that may create more quandaries than they help to solve

We may distinguish among logical models where a mechanism or process links one factor to another, and logical formats that avoid absurdities but otherwise remain black boxes: They work, but we do not pin down the process that connects the factors. We may also distinguish between deterministic relationships where a given input leads to a precise output value, and probabilistic ones where the output value is only the average output for a given input. Furthermore, we may distinguish between general formats or models that appear throughout natural and social sciences, and substantive models that are specific to one field of study, such as sociology, economics or political science – or even a subfield. And of course, we can wonder about causality. Which factor causes which one? Or are they mutually affecting each other? Or are both affected by a third factor?

Now we apply these notions to the exponential format, S = S0·e^(kt), or dS/dt = kS in its differential form. Is it a model or a format, deterministic or probabilistic, general or substantive? It depends.

Microbial growth looks like a prime example of a substantive deterministic model. Imagine a microbe dividing once a day. The process is repeated every day, as long as conditions (food, space, temperature, etc.) remain the same. The successive numbers of microbes are 1, 2, 4, 8, 16, 32,... The process of each microbe dividing each day leads to "Rate of growth is proportional to existing size", dS/dt = kS. Here we do have a mechanism: each microbe dividing. And it is deterministic, as long as each microbe does its part. It further looks substantive, specifically biological.

Here the biological dividing process does lead to the exponential equation. The reverse, however, is not always true. Not every exponential phenomenon results from a substantive deterministic process. Radioactive decay does follow the same exponential format, dS/dt = kS. (The negative value of k makes little difference.) "The rate of decay is proportional to existing size." But each atom does not do the same thing, like each microbe does. Instead of each atom doing something, they all have an equal probability of suddenly decaying. A few atoms are completely altered, while the others have not changed at all. We cannot predict which atoms will decay next – in this sense the process is probabilistic. But we can predict with high precision how many atoms will decay during the next time period – this is deterministic.


For the same exponential format, bacterial growth involves a process we can describe, but radioactive decay does not – nor does the decrease in the effective number of sovereign political entities (Figure 25.2). As the number of parties increases, interest group pluralism tends to decrease exponentially, and so does electoral disproportionality (Chapter 20). Are parties "causing" such decreases? One might as well ask whether time is "causing" some radioactive nuclei to break apart. We merely have interrelation, without one specific process, but there are probably some logical reasons connected to underlying factors.

Is the exponential model deterministic or probabilistic, process-based or not, causal or just relational? This does not matter when it comes to testing whether a data set fits the exponential equation. More broadly, such issues are of legitimate interest to philosophers of science. But one can largely do without worrying about them when the objective of science is seen as establishing connections among factors, and then connections among connections.

Ignorance, except for constraints

Some models or formats are based on near-total ignorance of details. They are general rather than specific to some substantive field such as sociology or political science, given that field-specific information does away with near-total ignorance. These models start with normal and lognormal distributions, and the mean of the extremes, all of which deal with only one factor. They continue with models connecting two factors the ranges of which may be subject to constraints. Finally, models based on ignorance and constraints on rates of change lead to a puzzling feature of the exponential function.

Normal distribution – full ignorance

This is the granddaddy of all ignorance-based models. A quantity like the size of peas has a median that is imposed by substantive reasons. Actual sizes vary randomly around this median, subject to no constraints. If they can range randomly from minus to plus infinity, a definite mathematical expression results from this very ignorance, producing a bell-shaped curve. Is the relationship deterministic or probabilistic? For each single item, it is probabilistic, but for a sufficiently large number of items, the form of the equation is deterministic: the bell-shaped curve will materialize, unless there are factors that counteract randomness.


Utterly lacking any specific information, normal distributions occur in all scientific fields, from physics to biology and social sciences. It would be pointless to ask for some substantive, specifically social reasons why some items "decide" to "arrange themselves" according to a normal distribution. They don't! There is no mechanism or process. So, is the normal distribution a model or just a format? This distinction does not seem to affect the way it is used. This ignorance-based model transcends the bounds of social sciences. Substantive implications enter in two ways. First, the mean values of normal distributions (and their standard deviations) have substantive origins – we can compare the means of different data sets. Second, if a distribution fails to be normal, additional substantive reasons must enter, and we should look for them. A two-humped distribution in heights of people informs us that both females and males may be present.

Lognormal distribution – a single constraint

Here a single constraint is introduced: The quantity x is restricted to positive values. Its logarithm can range from minus to plus infinity, without any constraints. Hence logx is expected to have a normal distribution. The equation for the lognormal distribution thus results from the equation of the normal distribution. The resulting curve starts at 0 and goes steeply up; it extends a long tail to the right. The model is common to all scientific fields. Substantive implications again enter through comparing the lognormal parameters of different data sets (which correspond to the means and standard deviations of logx) and through deviations from the lognormal expectation.

The mean of the limits – two constraints

The mean of the lower and upper limits (m and M, respectively) for a quantity x is the logical best guess for the value of x, in the absence of any other information. If x must be positive, the geometric mean is preferable: g = (mM)^(1/2). The mean-of-the-limits model is probabilistic. All we expect is that it predicts the median result over many cases. In this sense, it is analogous to the arithmetic mean in the normal distribution and the geometric mean in the lognormal. When applying g = (mM)^(1/2), we would imagine a vaguely lognormal-looking distribution of actual values around this median.


Examples in Chapter 1 include the number of parties in an assembly elected by unrestricted nationwide PR (n = T^1/2), but also riot victims and the weight of an unknown mammal. The latter example reminds us that this ignorance model transcends the social realm – it is as universal as the normal distribution. As with the normal distribution, it would be pointless to ask for reasons that are specifically political (for parties) or biological (for mammals). This is precisely the outcome in the absence of any mechanism or process.
The share of the largest component, S1 = S/n^1/2 (Chapter 4), offers another example. Using the relative share s1 = S1/S, it can be reformulated even more compactly as s1 = 1/n^1/2. Later, Figures 21.4 and 21.5 show a situation where the arithmetic mean of the limits makes more sense than the geometric.
Compared to the single constraint in the lognormal distribution – a lower limit of 0 – the model based on limits introduces two constraints on x, lower and upper. Both limits depend on the issue at hand. Thus, the lower limit is 1 in n = T^1/2, but it is S/n, the mean size, in the case of the largest component. So, while dealing with only one quantity x, like normal or lognormal distributions, we are slipping toward a relationship between 2 or even 3 factors. It’s the relationship between x and M when the lower limit is logically set at 1, or among x, M and m when the lower limit also can vary (as for the mammals or riots). All the following models deal explicitly with relationships between 2 factors, with extensions to 3 and more factors possible.

The dual source of the exponential format: constant rate of change, and constraint on range
The exponential model is the prime example of a rate equation, a slope equation, as it was first introduced. The rate of change dS/dt is proportional to existing size: dS/dt = kS. The relative rate of change is the ratio of the absolute change rate and the size: (dS/dt)/S. (The more familiar percent change rate is just 100 times that.) Now the exponential model becomes even more succinct: (dS/dt)/S = k – the relative rate of change is constant. This is the fundamental ignorance idea behind the exponential model: if we do not know whether the relative (percent) growth rate increases or decreases, the neutral guess is that it stays the same. This could be called “the rule of conservation of relative rate of change”. In contrast to this very simple and meaningful differential equation, the corresponding integrated equation is much more complex even in its simplest form, S = S0 e^kt. Figure 25.2 offers an example of exponential decrease over time – the effective number of sovereign political entities has decreased at a constant relative rate.
Remarkably, the integrated form S = S0 e^kt also emerges from a quite different approach. Restrict a factor that changes in time to only positive values, 0 < S.
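A sketch of the two faces of the exponential (hypothetical numbers; numpy assumed): stepping the rate equation dS/dt = kS forward in small time steps reproduces, to good approximation, the integrated form S = S0 e^kt.

import numpy as np

k, S0, dt, T = -0.05, 100.0, 0.01, 50.0   # constant relative rate of change
steps = int(T / dt)

S = S0
for _ in range(steps):        # crude numerical integration of dS/dt = k*S
    S += k * S * dt

print(round(S, 2))                      # stepped result at t = 50
print(round(S0 * np.exp(k * T), 2))     # closed form S0*e^(kT): nearly the same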

Simple logistic change: Zone in two quadrants
Within the two allowed quadrants, we can further impose an upper limit on y. This means 0 < y < M, where M is the ceiling.
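A sketch of the resulting curve (the shift parameter t0 is my own optional addition; the basic form matches y = M/[1+e^-kt], used again in exam question 12 of Appendix B):

import math

def logistic(t, M=1.0, k=1.0, t0=0.0):
    """Simple logistic curve: nearly exponential growth at first,
    then leveling off toward the ceiling M; always 0 < y < M."""
    return M / (1.0 + math.exp(-k * (t - t0)))

for t in (-10, -2, 0, 2, 10):
    print(t, round(logistic(t, M=100, k=0.8), 2))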

Limits on ranges of two interrelated factors: Fixed exponent format
When two interconnected factors, x and y, are restricted to positive values, then only one quarter of the entire two-dimensional field is allowed: 0 < x and 0 < y.


Why should we expect such a relationship? It does not evoke any simple process or mechanism. We are back to the broad principle of ignorance. If we have logical reasons to assume that y=0 and x=0 go together (a logical anchor point), then a constant value of k is the neutral guess, as compared to k increasing or decreasing, in y = Ax^k. An example is journal circulation vs. population for a given language. Per capita circulation is j = 2×10^-6 P (Figure 25.1). This corresponds to total circulation J = 2×10^-6 P^2. The specific value k=2 comes, of course, from the number of communication channels, a model to which we’ll come.
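A sketch of how one could check that data follow y = Ax^k (all numbers hypothetical; numpy assumed): on log-log scales the relationship becomes a straight line, so the slope of the fitted line estimates k and the intercept estimates logA.

import numpy as np

rng = np.random.default_rng(1)
P = np.array([1e6, 3e6, 1e7, 3e7, 1e8])            # hypothetical populations
J = 2e-6 * P**2 * rng.lognormal(0.0, 0.2, size=5)  # journal copies, with scatter

k, logA = np.polyfit(np.log10(P), np.log10(J), 1)  # straight-line fit on log-log scales
print(round(k, 2), f"A = {10**logA:.1e}")          # slope near 2, A near 2e-6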

Zone in one quadrant: Anchor point plus floor or ceiling lead to exponential change
A surprise awaits us when we impose a further constraint. Within one quadrant, restrict y by imposing an upper limit. We still have 0 < x and 0 < y.


In none of the cases above can we pin down a specifically political process or mechanism for how the number of parties affects the various outputs. Indeed, we should not look for such a mechanism any more than we do when encountering a normal distribution, because the reason is more universal. The exponential format is what any quantities tend to follow, on the average, when constrained to move from an anchor point to a floor or ceiling. Under these conditions, specifically social factors make themselves apparent when the exponential format is NOT followed. In all the cases considered here, random scatter is so wide that the exponential pattern cannot be rejected.

Box in one quadrant, plus two anchor points
For the zone in one quadrant, we have 0 < x and 0 < y.

Box in one quadrant, plus three anchor points
Up to now, all the constraints have been general, outside any specific scientific field. Hence the resulting models are universal, given the type of constraints imposed. When introducing a third anchor point in the box, we arguably become more substantive. In Figure 9.4, the anchor point (0.5,0.5) was introduced specifically in the context of two-party elections. The other example (Figure 21.3) deals with conservatism of voters and representatives, again in a two-party context. The third anchor point is based on parity of output, in the presence of equal inputs by two groups. Here we may deal with an issue specific to political science, given that one is hard put to offer examples outside party politics. (The relationship between urban and rural radio set ownership in Exercise 14.1 is rather hypothetical.)
Does this make the resulting model, Y/(1-Y) = [X/(1-X)]^k, a substantive one? I doubt it. Once the extra anchor point at (0.5,0.5) is posited, on whatever grounds, broad considerations again take over: what is the simplest way to wiggle through the three points, on the basis of universal concerns of smoothness and continuity? The kinky variant of a model for the representative’s positions (Chapter 21) could be said to involve more of a political process, in that the representative tries to balance off various voters. Introduction of bias (Figure 22.2 and Exercise 22.1) offers a way for substantive factors to enter. There must be something that shifts the inflexion point off center. This point stops being a logical anchor point. Once the general degree of bias is stipulated, however, through parameter b, the model Y/(1-Y) = [X^b/(1-X^b)]^k again looks for the simplest way to wiggle through. There is no substantive mechanism, beyond whatever determines b. Does this matter? If a cat does catch mice, how concerned should we be about its color? Recall that even exponential change can be connected to a mechanism such as bacterial division only in selected cases. This is no reason to refrain from making use of it.
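For concreteness, a sketch of the three-anchor-point format itself (the function name and the value of k are mine): solving Y/(1-Y) = [X/(1-X)]^k for Y gives the curve through (0,0), (0.5,0.5) and (1,1).

def y_of_x(x, k):
    """Y/(1-Y) = [X/(1-X)]^k solved for Y, valid for 0 < x < 1."""
    r = (x / (1.0 - x)) ** k
    return r / (1.0 + r)

for x in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(x, round(y_of_x(x, k=2.0), 3))    # passes through 0.5 at x = 0.5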

Communication channels and their consequences
The number of communication channels among n actors finally seems to offer us a substantively social model: c = n(n-1)/2 ≈ n^2/2. The term “actors” feels social, and the mechanism is clear: with n actors present, the addition of another one adds n new channels – one toward each of the existing actors. However, similar notions are used also in communication engineering and the biology of contagious diseases – they are not specifically social. Counting channels among actors, after all, boils down to counting links among points. When the number of actors is so large that quasi-continuity can be assumed, this corresponds to another rate equation: dc/dn ≈ n. This leads to dc/dn ≈ 2^1/2 c^1/2 – the number of channels increases with increasing number of actors at a rate proportional to the square root of the number of channels.
The most direct application is circulation of journals in a language spoken by P people. If journals mean communication channels, the number of journal copies might be proportional to population squared: J = kP^2. In the case tested (Figure 25.1) it works out. But this is a very remote approximation of what “really” ties journal copies to people. So the actual mechanism of social interaction is at best replaced by a very iffy “as if”. One may easily conclude that it was fortuitous that the data fitted such a naively logical expectation – but then we are back to square one. Or one may conclude that some essential feature was nonetheless captured – and then we may have something to build on.
In C = 42 years/N^2 for cabinet duration (Chapter 6), k=2 also derives from the number of communication channels among N actors. Shifting from real actors to the effective number of parties is a leap of faith, which paid off. The substantive mechanism of the connection is limited to the probability that the number of parties affects duration inversely. The cube root law of assembly sizes (Chapter 19) takes into account the number of communication channels at three different levels and minimizes their total. Among the models using communication channels, this one might be the most substantive, provided one tolerates the numerous simplifying assumptions involved.
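The two channel-based formulas quoted above, as a minimal sketch (function names are mine):

def channels(n):
    """Number of communication channels among n actors: c = n(n-1)/2."""
    return n * (n - 1) / 2

def cabinet_duration_years(N):
    """Coarse model from Chapter 6: C = 42 years / N^2,
    N being the effective number of parties."""
    return 42.0 / N ** 2

print(channels(5))                                   # 10 channels among 5 actors
for N in (2, 3, 5):
    print(N, round(cabinet_duration_years(N), 1))    # 10.5, 4.7 and 1.7 years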

Population and trade models
The world population model (Figure 25.3) combines the three-fold interaction of population, technology and the Earth’s carrying capacity. This is a set of 3 rate equations, which have not been presented here. This combination has some substantive basis. For the trade/GDP ratio, various approaches have been surveyed, including the basic rate equation dI/dr = -kI from physics, for absorption of any flow intensity I over distance r. The problem is eventually placed within a square box (Figure 25.6). The diagonally opposed anchor points make logical sense, but one is left wondering what imposes the S-shaped curve joining them, with its inflexion point. It isn’t clear how well the simple bias format Y/(1-Y) = [X^b/(1-X^b)]^k fits the data. In sum, substantive mechanisms enter, but more work is needed.

Substantive models: Look for connections among connections
Development of truly substantive social models, which introduce specifically social processes or mechanisms, depends very much on the interlocking of fairly universal partial models. What does this mean? Exercises 10.1 and 10.2 give a hint of how the constants k in several equations of the form Y = X^k are interconnected in the case of various measurements of centralization. Discussion of support for US Democrats in two elections (Figure 8.5) points out possibilities for comparing many elections over time and maybe in many countries. Why has more not been achieved in this direction? The database for values of k is lacking. Such a database will come about when social scientists stop putting meaningless regression lines across data clouds that obviously beg for a fit with Y = X^k, and start fitting with a format which respects logical constraints.
Empire growth (Figure 25.2) offers an example of connectedness between rate constants k in exponential patterns N = N0 e^kt. The effective number of polities on the basis of area (NA) is larger than that based on population (NP), because empires tend to form in more densely populated regions. Going by the geometric mean of extreme possibilities, the model NA = NP^2 emerges. This also means that kA = 2kP in N = N0 e^kt, which shows up as a doubly steep slope in the logN graph in Figure 25.2.
We need more interconnections on such a level – models to interconnect constants in models. This involves models of a similar type, such as all boxes or all exponentials, connecting variables of a similar type. We also need to interconnect models of different types, which use different variables. Chapter 10 offers the sequence T → n → s1 → N → C. Each arrow stands for a separate model, joining two or three variables analogous to n = T^1/2, s1 = 1/p^1/2, and C = 42 years/N^2 presented in this book. The actual string of models starts with two separate basic inputs: assembly size and electoral district magnitude. Many more than two quantities are connected here, through a string of equations. The nature of the underlying models varies: while s1 = 1/p^1/2 is rooted in the square root of the extremes, C = 42 years/N^2 results from the number of communication channels.
Such strings and networks of interconnected models form the backbone of physics, and social sciences need more of them. The existence of the string shown here proves that they can exist in social sciences. This is where the quest for specifically social mechanisms begins: connections among factors, and then, connections among connections.
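A sketch of what such interlocking means in practice, limited strictly to the partial models quoted in this book (the missing links of the full T → n → s1 → N → C string are not reproduced here, and the input values are hypothetical):

def seat_winning_parties(T):        # n = T^(1/2), T = assembly size in seats
    return T ** 0.5

def largest_relative_share(n):      # s1 = 1/n^(1/2)
    return 1.0 / n ** 0.5

def cabinet_duration_years(N):      # C = 42 years / N^2, N = effective number of parties
    return 42.0 / N ** 2

T = 100
n = seat_winning_parties(T)                      # 10.0 seat-winning parties expected
print(n, round(largest_relative_share(n), 2))    # largest relative share around 0.32
print(round(cabinet_duration_years(3.0), 1))     # about 4.7 years if N = 3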

APPENDIX A
What to Look for and Report in Multivariable Linear Regression

• Use regression only for exploratory research or for testing logical models. Don’t even think of using regression for model construction itself.
• Graph possibly meaningful relationships, so as to avoid running linear regression on curved data.
• Guard against colinearity or “co-curvilinearity”.
• Use Occam’s Razor: cut down on the number of variables.
• Distinguish between statistical significance (the “stars”) and substantive meaningfulness.
• Report not only the regression coefficients and the intercept but also the domains, medians and means for all input variables.

In Chapters 15 and 16, we considered linear regression of one variable against another – y vs. x. We can also regress an output variable against several inputs: z ← x, y – but this is a one-directional process, with all its pitfalls. Symmetric regression of 3 variables x, y and z is still being worked on.
In social science literature we often encounter tables that report the results of multi-variable linear regression. Those numerous numbers seem to involve useful information – why else would they be published? But what can we actually read out of them? We’ll first approach the issue from the viewpoint of someone who wants to make use of published multi-variable regression tables so as to gain social insights. Thereafter, we’ll ask what we should do when running such a regression ourselves. When reporting the results, we should give enough information so that readers can understand – and possibly carry out some further analysis. Only this way can bits and pieces of knowledge become cumulative science.
The principle of multi-variable regression is the same as for single-variable regression (y vs. x), but more than one input variable is fed in. The output is usually construed as y = a + b1x1 + b2x2 + …, but it really should be shown as y ← a + b1x1 + b2x2 + …, because the relationship is valid in only one direction (recall Chapter 15). Symmetric multivariable regression is difficult, and mathematicians still work on it. The equations I tentatively offered (Taagepera 2008: 174-175) are plain wrong. So we are reduced to directional regression, with all its risks. There is still quite a lot we can do, provided that major pitfalls are avoided.
The main purpose of determining a regression line is to estimate the output for given inputs. Given the values of inputs x1, x2, …, we should be able to deduce the corresponding most likely value of y from y ← a + b1x1 + b2x2 + …. But this is not all we are interested in. Maybe the scatter of y, for given x1, is so wide that the impact of x1 is “statistically insignificant” and we should omit x1. On the other hand, maybe x2 is highly significant in the statistical sense, but it varies over such a short range that it hardly affects y, given the wide range of y. Could we then overlook x2? We may also be interested in what the median value of y is, and how large or small y can get. Maybe we have the values of x2, but the values of x1 are hard to get. Can we still estimate y, and how well? The importance of such questions will become clearer as we walk through an actual example.

Making use of published multi-variable regression tables: A simple example
Table A.1 shows a part of an example presented in Making Social Sciences More Scientific (Taagepera 2008: 207), based on a regression in Lijphart (1994). The output NS is the effective number of assembly parties. It may be logically expected to depend on two factors: how high is the “effective threshold” (T) of votes at which a party is likely to win a seat, and the size of the representative assembly (S), meaning the number of seats available. As the latter can vary over a wide range of positive values (from 60 to 650), it is likely to be distributed lognormally rather than normally, and hence logS is used in the regression, rather than S itself. What do these numbers mean, in Table A.1? What can we conclude or deduce from them?

Table A.1. Effective number of assembly parties (NS) regressed on effective threshold (T) and logged assembly size (logS).

Independent variables        Domain (range)   Mean   Median   Coefficients for NS
Effective threshold (T)      0.1 to 35        11.6   7.0      -0.05**
Log assembly size (logS)     1.8 to 2.8       2.2    2.2       0.12
Intercept                                                      3.66
R2                                                             0.30
Adjusted R2                                                    0.28
*: statistically significant at the 5 per cent level.
**: statistically significant at the 1 per cent level.


These coefficients mean that NS can be estimated from

NS ← 3.66 - 0.05 T + 0.12 logS.

The two R2 values are measures of scatter around this directional equation. This equation accounts for 28% of the variation in y. The remaining 72% remains random scatter in the sense that threshold and assembly size cannot account for it, at least the way they are entered in the regression. (“Accounting for” in a statistical sense must be distinguished from “explaining” – only a logical model can explain the process by which T and S affect NS.) The two stars at T indicate that this factor has a definite impact on the output. The lack of stars at logS indicates that the scatter is so large that it is statistically uncertain whether logS has any systematic impact on the output. If so, then we might as well omit logS altogether and estimate from T alone. Scatter might be hardly reduced.
It might be tempting to use the equation above, just dropping the logS term: NS ← 3.66 - 0.05T. Wrong. This would imply that we assume logS=0, hence S=1. This would be ludicrous: no national assemblies are that small. We need to replace the omitted factor not by 0 but by its mean value. Unfortunately, mean values are all too often omitted from reported regression results. Hence, to calculate the output for a given threshold value, we also have to locate the assembly size, even while it is stated that assembly size lacks statistical significance! The table above does add the mean, so that we can drop logS, if we so wish. This mean is 2.2, corresponding to assembly size S=160. The best estimate of the output becomes NS ← 3.66 - 0.05T + 0.12(2.2). Hence

NS ← 3.92 - 0.05 T.
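The mean-substitution step can be written as a tiny routine (a sketch; the function name is mine): dropped variables are folded into the intercept at their mean (or median) values rather than at 0.

def fold_into_intercept(intercept, dropped):
    """dropped = list of (coefficient, mean value) pairs for omitted inputs."""
    return intercept + sum(b * mean for b, mean in dropped)

# Table A.1: drop logS (coefficient 0.12, mean 2.2) from NS <- 3.66 - 0.05T + 0.12 logS
print(round(fold_into_intercept(3.66, [(0.12, 2.2)]), 2))   # 3.92, hence NS <- 3.92 - 0.05T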

The difference between 3.92 and 3.66 may not appear large, but given that few countries have fewer than 2.0 or more than 5.0 parties, a difference of 0.3 is 10% of the entire range.
The table further adds the median and the domain, meaning the range from the smallest to the largest value. Why are they needed? Regression makes sense only when the variables are fairly normally distributed, so that their medians and arithmetic means coincide. This is the case for logS. (For S itself the mean would exceed the median appreciably.) For T, the mean exceeds the median by almost 5 units. Is the gap excessive? It depends on how widely a variable is observed to range – its domain. The domain of T goes from near-zero to 35, so a 5-unit discrepancy is appreciable. We might obtain a better fit for the number of parties if we carried out the linear regression on the square root of T rather than T itself. Actually, the distribution of T here is not just a peak with a longer tail in one direction – it has two separate peaks. Hence the very use of regression becomes problematic. Once we omit logS, leaving only one input variable, it would be high time to graph NS against T, see what the pattern looks like, and try to express it as an equation.
There is another reason for reporting the domain. Researchers sometimes neglect to specify the measures they use. Did “logS” mean decimal or natural logarithms? When the domain is given, it becomes clear that decimal logarithms are used, because the corresponding range of S would be 60 to 650 seats, which is reasonable. If logS stood for natural logarithms, S would range from 6 to 16 seats! Often there are several ways to measure what looks the same conceptually. For instance, cabinet duration is observed to range from a few months to 40 years by a fairly lenient measure, but only up to about 5 years by a more stringent one (which resets the clock whenever there are elections). How am I to know? Well, if the domain for cabinet duration is given as 0 to 5, then I have a pretty good idea of which indicator has been used. Also, one often talks of a “corruption index” when actually using an index of lack of corruption (so that honest countries have the highest scores). Authors often are so used to a given measure that they neglect to specify it – or they even mis-specify, as for corruption.
Exercise A.1 shows what further information one can glean from regression results, provided that the mean, median and domain are included.

Exercise A.1
Use Table A.1 for the following.
a) Calculate NS for the median values of T and logS. One can expect the median value of NS to be close to this result.
b) What is the lowest value of NS that could result from the extreme values of T and S on the basis of this regression line?
c) What is the highest value of NS that could result from the extreme values of T and S on the basis of this regression line?
d) By how much do these extreme values of NS differ from its presumed median? Comment on what this implies.
e) Compare the extents to which T and S, respectively, are able to alter NS. Could you have predicted it just by looking at their respective coefficients?
f) Given that the impact of logS is not statistically significant, we should be able to ignore it and still get basically the same result. How would the expected range of NS change?
g) Which of the previous questions could you answer, if only the columns for “Coefficients” were reported?


Guarding against colinearity
Suppose we have an output z that might depend on some factor x and possibly also on another factor y. (I owe this example, based on actual published work, to Kalev Kasemets.) We run simple OLS of z on x and find

z = 0.634(±0.066)x + 0.789(±0.197) R2=0.828 p(x)<0.001.

The “(±0.066)” indicates the possible range of error on the coefficient of x. The low value of p(x) says roughly that the chances are extremely low that the correlation is just random chance. And R2=0.828 indicates that 83% of the variation in z is accounted for by variation in x. This looks pretty good. Now run OLS of z on y, and the outcome is

z = 0.325(±0.027)y + 0.602(±0.169) R2=0.887 p(y)<0.001.

This looks like an even slightly better fit: R2=0.887 indicates that 89% of the variation in z is accounted for by variation in y. We get greedy and feed both of them into a multi-variable linear regression. With two inputs, we should obtain an even better fit. We get

z = -0.208(±0.268)x + 0.426(±0.133)y + 0.0577(±0.174)

R2=0.890    p(x)=0.447, p(y)=0.005.
This R2=0.890 is practically no higher than 0.887 for y alone. Most surprising, the impact of x now looks negative! It reduces z rather than increasing it! The p(y) is higher than before, meaning a higher probability that the correlation between y and z is random chance. The p(x) is disastrously higher than before, showing an almost 50-50 probability that the correlation between x and z is random chance. So which way is it? Does x enhance z in a highly clear-cut and significant way, or does it reduce z, in a quite uncertain way? We cannot have both.
The answer is that the two inputs must be highly correlated themselves. Indeed, the two inputs together cannot account for 83+88=171% of the variation in the output! The inputs must be correlated by at least 71%, roughly speaking. What happens if we ignore this “colinearity” of x and y, and plug them both into a multi-variable regression? They destroy and confuse each other’s impact. Among the two, y has a slightly higher correlation with z. It sort of cannibalizes the effect of x, reducing it to nothing. The small, negative and uncertain coefficient of x does NOT show the total effect of x on z – it shows the residual effect of x, once its effect through y has been accounted for. We effectively have a causal chain x→y→z, or maybe rather (y≈kx)→z. By plugging both x and y into the same regression we arbitrarily assumed that the causality is x→z←y – meaning x and y affecting z separately, which is here a blatantly false assumption. The computer program does not argue with you. If you feed in junk, it obediently processes it, but “junk in, junk out”.
What are we supposed to do? In the present case, where both inputs account for more than 80% of the variation in the output, it’s fairly simple: use only one. Which one? One might pick the one with the higher R2, in the present case y, which tacitly implies that x acts through y: x→y→z. But this is not the only consideration. If we have logical reasons to assume that the causal process is y→x→z, then a small shortfall in R2 should not deter us from using x. Also graph z vs. x, z vs. y, and x vs. y. Curvatures or irregularities in the data clouds may give you hints on how the variables are related.
The real hassle comes when the inputs are only mildly correlated – not R2=0.70 but R2=0.30. Here one of the inputs may act on the output both indirectly and also directly: x→y→z←x. There are statistical ways to handle such situations, but also graph the data. The main thing is: if it looks odd, it probably is – then it’s time to double-check or ask for help. Do not report absurdities without expressing doubts.
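The whole colinearity trap can be reproduced with made-up data (a sketch, numpy assumed; all numbers hypothetical): z is really driven by y alone, y is almost a copy of x, and the two-input regression duly shrinks the coefficient of x toward zero.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 500)
y = 2.0 * x + rng.normal(0.0, 0.3, 500)       # y is nearly a copy of x: colinear inputs
z = y + rng.normal(0.0, 0.5, 500)             # z is actually driven by y alone

inputs = np.column_stack([np.ones_like(x), x, y])
coef, *_ = np.linalg.lstsq(inputs, z, rcond=None)
print(np.round(coef, 2))    # intercept, b_x, b_y: b_x lands near 0 (y cannibalizes x),
                            # even though regressing z on x alone would give b_x near 2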

Running a multi-variable linear regression
Suppose that many factors come to mind which could conceivably have an impact on the values of some output variable y. Among these, A, B, C, D, E and F are the prime suspects. Furthermore, the output might be different for women and men. As a first exploratory step, we might run a multi-variable OLS regression. It may suggest which factors have a definite impact and thus guide our search for a logical model. Once a model is constructed, its testing may need another regression. The process includes the following stages:
• Processing data prior to exploratory regression.
• Running exploratory regression.
• Reporting its results.
• Re-running exploratory regression with fewer variables.
• Graphing the predictions of the regression equation against the actual outputs.
• Model-testing regression, once a logical model is devised.


Processing data prior to exploratory regression
It matters what we feed into the computer. Not all relationships are linear. Before applying linear analysis we better do some thinking. Instead of feeding in x, maybe linearity is more likely with 1/x – or even 1/x^2. (The latter is the case between cabinet duration and the number of parties, for instance.) When there is linearity between y and 1/x^2, then there is no linearity between y and x. If we still regress on x, we would fit a curved data cloud with a straight line. We would obtain some correlation, but not to the full extent possible. Most important, we would miss the logical nature of the connection.
But what can we do, short of working out elaborate logical models? At the very least, consider the conceptually allowed ranges. If factors A and D can in principle range from minus to plus infinity, enter them as they are. But if factors B and C can take only positive values, it is safer to enter logB and logC.
Also consider whether all factors are mutually independent. We talked about guarding against colinearity. This is worth repeating. Suppose input variables D and E are strongly connected through D=a-bE. Then one of them should be left out. Which one? Before answering this question, also consider “co-curvilinearity”: maybe D and E are even more strongly connected through D=a/E^2. How do we know? Graph each potential input variable against every other one. If the data cloud looks like a fat ellipse, almost circular, then the two variables are independent and can be used together. But if the data cloud looks like a thin tilted ellipse or a bent sausage, then the variables are interconnected and we better keep only one of the two. But which one should it be? If y may be affected by D and E, which are interrelated, then it might be that one of them acts through the other:

E → D → y    OR    D → E → y.

Which one could it be? Make graphs and run correlations y vs. E and also y vs. D. The one with the higher R2 (once data clouds are straightened out) is likely to be more directly connected, and this is most often the one to keep. True, it could be that one of the factors acts on y both indirectly and also directly: E → D → y ← E, but one of them is likely to predominate. Let us keep it simple, if we can.

Running exploratory regression
Suppose co-curvilinearity eliminates factors E and F. We are left with A, logB, logC and D. We run multivariable regression on them. Suppose we report the results, using today’s standard format (Table A.2). Instead of just R2, there may be a somewhat differently labeled coefficient. Stars indicate the strength of supposed significance in a statistical sense, which often is misinterpreted (Taagepera 2008: 77-78). The computer printout may have other statistical features that will not be discussed here.

Table A.2. Typical minimal reporting of multi-variable linear regression analysis.

Factor A         -0.03***
Factor B (log)    0.12**
Factor C (log)    0.28
Factor D          3.77*
Dummy (F=1)       0.74*
Intercept         4.07
R2                0.39

What this table means is that the values of y can best be predicted from the values of A etc. by applying an equation of the format y ← a + b1x1 + b2x2 + … and plugging in the coefficient values shown in the table. For males, it is

y ← 4.07 - 0.03A + 0.12logB + 0.28logC + 3.77D.

For females, add 0.74. Together, these variables account for 39 % of the variation in y, as R2 tells us. The number of stars suggests that A affects y most certainly, followed by logB, while the impact of D is less certain, and the impact of logC could well be random chance. Whether gender may have some impact also remains in doubt.

Lumping less significant variables: The need to report all medians and means
Occam’s Razor is a principle that tells us to try to prune off everything that is not essential. (Recall Albert Einstein’s advice: make your models as simple as possible – and no simpler.) Which factors should we discard? On the face of it, we should keep only the two most significant. Then the equation above might seem to be reduced to y ← 4.07 - 0.03A + 0.12logB – but not so fast! By so doing, we would assume that the mean values of logC and of D are 0, which might be widely off the mark. As pointed out earlier, we must plug in the average values of logC and of D. Also, assuming roughly equal numbers of females and males, we should add 0.74/2 = 0.37. Thus the reduced equation would result from

y ← 4.07 - 0.03A + 0.12logB + 0.28(aver. logC) + 3.77(aver. D) + 0.37.

But what are “averages” – medians or arithmetic means? And have we reported their values? Without these values, anyone who wants to make use of our findings to predict y would have to enter into the equation not only the values of A and B but also those of C and D. He would have to dig up the values of C and D, even while our analysis concludes that they are rather insignificant! Indeed, for this reason all too many published regression results are useless for prediction: too many variables are shown, and their average values are not. So we better report the averages. As pointed out earlier, we better report both the median and the arithmetic mean, to give the user a choice, but mainly for the following reason. If the median and the arithmetic mean differ appreciably, this would indicate that the distribution of values could not be normal. This means that the assumption of a linear relationship is on shaky grounds – and we should warn the readers. Actually, we should try to transform our data, prior to linear regression, so that the median and the arithmetic mean are roughly equal.

Table A.3. Multi-variable linear regression results, when also reporting averages. RC = Regression Coefficient. Factor median weights (median×RC) emerge.

                 Median   Mean   RC         Median×RC   RC after lumping
Factor A         1.20     1.23   -0.03***   -0.036      -0.03***
Factor B (log)   0.50     0.57    0.12**     0.06        0.12**
Factor C (log)   6.5      6.4     0.28       1.82        ---
Factor D         0.40     0.37    3.77*      1.51        ---
Dummy (F=1)      0.5      0.5     0.74*      0.37        ---
Intercept        4.07                                    7.77 (=4.07+1.82+1.51+0.37)
R2               0.39

Suppose we have done the needed transformations and can happily report the results in Table A.3, where medians and means roughly agree. I have also included the product of each factor’s median and its regression coefficient. This is the median weight it contributes to y. If we drop factors C, D and gender, then we must add these weights to the intercept. The new intercept value is 4.07+1.82+1.51+0.37 = 7.77. So the equation with some variables lumped is

y ← 7.77 - 0.03A + 0.12logB,

as reflected in the last column of Table A.3.

Re-running exploratory regression with fewer variables
If at all possible, we should now carry out a new regression, using only A and B. It should yield an intercept close to 7.77 and coefficients close to the previous ones. If this is not so, there is something in the data constellation that we should check more closely. The value of R2 can be expected to drop below 0.39, because we no longer account for the variation due to C, D or gender. If the drop is appreciable, we may have to reconsider.
Indeed, the drop in R2 may be serious. Look at the factor weights in the table above: they are large for C and D (1.82 and 1.51), while tiny for A and B (-0.036 and 0.06). How much impact could such tiny inputs have on the output? The question is justified, but to answer it, we must also take into account how widely the variables are observed to range. We saw that this is called their domain.

Report the domains of all variables!
Table A.4 adds the observed domains of the variables. How widely could the estimates of y vary? We must consider the extremes of the inputs, keeping track of the signs of their extreme values and coefficients. Apart from the constant intercept, the lowest contribution to y results from ymin = 1.5(-0.030) - 2(0.12) + 4(0.28) - 0.4(3.77) + 0 = -0.67. The highest results from ymax = 0.9(-0.030) + 3(0.12) + 9(0.28) + 1.2(3.77) + 0.74 = 8.12. The actual domain of y is likely to be somewhat smaller, because extreme values of inputs rarely coincide, but clearly y can vary over several units.


Table A.4. Multi-variable linear regression analysis, also reporting the domains.

                 Median   Mean   Domain         Regr.coeff.   Span   Span×RC
Factor A         1.20     1.23   0.9 to 1.5     -0.030***     0.6    -0.018
Factor B (log)   0.50     0.57   -2 to +3        0.12**       5       0.60
Factor C (log)   6.5      6.4    4 to 9          0.28         5       1.40
Factor D         0.40     0.37   -0.4 to +1.2    3.77*        1.6     6.03
Dummy (F=1)      0.5      0.5    0 to 1          0.74*        1       0.74
Intercept        4.07
R2               0.39

The “span” is the extent of the domain, the difference between the largest and smallest values. Then Span×RC (span times regression coefficient) is the extent by which the given factor could alter y. While it is highly likely that A has an impact on y (3 stars!), this impact is tiny. Even the extreme values of A would alter y by only 0.018. If we dropped A, our ability to predict y would hardly be affected. In contrast, variation in Factor D alone determines most of the domain of values y could take, if the impact of D is real. Trouble is, it is uncertain whether D has any definite impact.
We should distinguish between statistical significance (the “stars”) and substantive meaningfulness. In terms of health issues, suppose presently 54,000 people per year die of a given disease. Suppose some rather expensive and painful medication reliably reduces mortality – to 53,000. Is a drop of 1,000 worth subjecting 54,000 people to this treatment? Now suppose that the uncertain Factor D can easily be altered – like making drinking water slightly more or slightly less acid. If it works, it would have a large impact on mortality, but we are not quite certain it would even have an effect (single star).
Which factor should we focus on in the present example? I have no simple answer. It depends on many other considerations. But one thing is certain: reporting only the information in Table A.2 could badly mislead readers into thinking that Factor A is the best predictor, just because it has the most stars. Whenever one runs a regression, one has the possibility to determine the domains, medians and means of all the input variables. Do report this information. Omitting the explanatory columns (Span and Span×RC) in Table A.4, our table of regression results should look like the one in Table A.5. Then the reader can play around with it, weighing the impact vs. significance of various factors. When she has access to the values of only some of the factors, she can still make predictions, using the mean or median values for the missing factors.

Table A.5. Recommended format for reporting multi-variable linear regression analysis.

                 Median   Mean   Domain         Regr.coeff.
Factor A         1.20     1.23   0.9 to 1.5     -0.030***
Factor B (log)   0.50     0.57   -2 to +3        0.12**
Factor C (log)   6.5      6.4    4 to 9          0.28
Factor D         0.40     0.37   -0.4 to +1.2    3.77*
Dummy (F=1)      0.5      0.5    0 to 1          0.74*
Intercept        4.07
R2               0.39
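The Span and Span×RC columns of Table A.4 can be recomputed from the reported domains and coefficients in a few lines (a sketch; the dictionary layout is mine):

# name: (low end of domain, high end of domain, regression coefficient)
factors = {
    "Factor A":       (0.9, 1.5, -0.030),
    "Factor B (log)": (-2.0, 3.0, 0.12),
    "Factor C (log)": (4.0, 9.0, 0.28),
    "Factor D":       (-0.4, 1.2, 3.77),
    "Dummy (F=1)":    (0.0, 1.0, 0.74),
}
for name, (lo, hi, rc) in factors.items():
    span = hi - lo
    print(f"{name:15s} span = {span:4.1f}   span*RC = {span * rc:6.3f}")   # matches Table A.4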

Graphing the predictions of the regression equation against the actual outputs
The advice “Always graph the data” may become unfeasible when more than 2 factors interact. Whenever we graph two of them against each other, the interference of the other factors can completely blur out a very real relationship. But one can always graph the predictions of the regression equation against the actual outputs. We would expect the resulting “data” cloud to be evenly scattered around the equality line, yexpected = yactual, if the regression is any good. This “data” cloud should look like an ellipse. If this is not the case, we should look into what causes the irregularities. When we regress yexpected against yactual (preferably using symmetric regression), we obtain yexpected = a + b·yactual. Here a should be very close to 0, and b should be very close to 1. If this is not the case, we should look for ways to improve the original multivariable regression. Maybe some input factor should be transformed before feeding it in. Maybe some parts of the data produce “islands” far away from the equality line and should be treated separately.
Always graph the predictions of a model or data fit against the actual data – this is so easy to do that it should become standard procedure, in multivariable regression in particular. It would supply a clear and simple picture of the quality of the regression.
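A sketch of such a check (numpy assumed; the symmetric-regression slope used here is sign(r)·sd(predicted)/sd(actual), and all data are made up): a good regression should give an intercept near 0 and a slope near 1.

import numpy as np

def check_fit(y_actual, y_predicted):
    """Symmetric regression of predicted on actual: returns (intercept, slope)."""
    r = np.corrcoef(y_actual, y_predicted)[0, 1]
    b = np.sign(r) * np.std(y_predicted) / np.std(y_actual)
    a = y_predicted.mean() - b * y_actual.mean()
    return a, b

rng = np.random.default_rng(3)
y_act = rng.normal(10.0, 2.0, 200)
y_pred = y_act + rng.normal(0.0, 0.5, 200)       # predictions from a decent model
print(np.round(check_fit(y_act, y_pred), 2))     # intercept near 0, slope near 1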


Model-testing regression
All the preceding refers to exploratory regression – trying to get some idea of what might affect a given output. It may help to focus on just a couple of inputs. The next step would be to ponder how these factors might impact the output. This means trying to build a logical model. More graphing may be involved – and don’t even think of using regression for model construction itself! It is unlikely that the predictive model would include more than 1 to 3 input variables. All others are likely to act through these or be negligible, at least in a first approximation. By this time, we no longer accept just any regression coefficient values. Suppose the logical model is y = KAB/C^2. We know we have to regress logy against a=logA, b=logB and c=logC. But we also know that the result must be logy = k + 1.00a + 1.00b - 2.00c – only the intercept k=logK is not predicted. What do we do when the coefficients found differ appreciably from 1, 1, and -2? The answer depends on the specific situation. For the moment, let us just keep in mind the difference between preliminary regression (preliminary to the model-building effort) and final regression (testing the model).
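A sketch of such a model-testing regression (numpy assumed; K and all data are made up): generating data from y = KAB/C^2 and regressing log y on logA, logB and logC should return coefficients close to 1, 1 and -2, with the intercept estimating logK.

import numpy as np

rng = np.random.default_rng(4)
K = 5.0
A, B, C = (rng.lognormal(0.0, 0.5, 300) for _ in range(3))
y = K * A * B / C**2 * rng.lognormal(0.0, 0.1, 300)    # the model, plus mild scatter

X = np.column_stack([np.ones(300), np.log10(A), np.log10(B), np.log10(C)])
coef, *_ = np.linalg.lstsq(X, np.log10(y), rcond=None)
print(np.round(coef, 2))    # expect roughly: log10(5) = 0.70, then 1.00, 1.00, -2.00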

Substantive vs. statistical significance
High statistical significance alone could be pointless for making sense of the world. We must look for substantive significance. This point was briefly made in the previous chapter as well as in Taagepera (2008: 77–78 etc.). Table A.6 is an attempt to visualize a similar message in Professor McCloskey’s The Cult of Statistical Significance ().


Table A.6. Substantive vs. statistical significance

Columns: Extent (size) of effect, e.g., by how much a cure reduces the death rate for a disease – Tiny (e.g., from 39% to 38%) vs. Appreciable (e.g., from 39% to 18%).
Rows: Statistical significance – Low (p<.1) vs. High (p<.01).

Low statistical significance (p<.1), tiny effect: No substantive significance. Forget it.

Low statistical significance (p<.1), appreciable effect: Median substantive significance. If it were real, it would have high substantive significance. So try to refine and add data, so as to raise statistical significance – the potential payoff is high! Also try to build a logical model, as it would help you to refine the data.

High statistical significance (p<.01), tiny effect: Low substantive significance. A reduction by 1 percent point is peanuts, regardless of how certain it is. BUT don't yet give up. Try to elucidate a plausible underlying mechanism, i.e., build a logical model. Then you might find ways to enhance the extent of the effect.

High statistical significance (p<.01), appreciable effect: High substantive significance. Congratulations! BUT beyond happy application, also try to elucidate the underlying mechanism, i.e., build a logical model. Otherwise you would be at a loss when the effect suddenly weakens.*

* At a refractive materials plant, they had a product vastly superior to those of all competitors. The management was just happy about it and never bothered to find out why. Then a foreman died, and the quality plunged, never to rise again, no matter what they tried. By this time it was too late to ask “why?"

APPENDIX B
Some Exam Questions

These are some questions used in recent midterm tests and final exams, with open books and notes. The “points” reflect the expected time in minutes to complete them, except for questions that offer data and then only say: “Do all you can with this information, using skills developed in this course. Beyond responding to step-by-step commands, show that you can set up such steps for yourself.” If the student has grasped the “meta-skills” this book tries to transmit, the response might take little time. Lacking those skills, it would also take little time before the student runs out of ideas of what to do with the data.
Why such an open-ended format, with open books and notes? When you land a college-level job, your boss will not tell you “Do a), then b), and then c).” She will tell you: “We have this problem. Find an answer.” It’s an open-books-and-notes situation. If you return with empty words, she will not give you half-credit just because she cannot figure out whether there is something relevant hidden in your gibberish. Instead, she will wonder: “Should I keep this employee?”
These questions are presented here in increasing order of estimated length or difficulty. Of course, the exercises spread throughout the book also can serve as exam questions.

1. (4 points) How would you explain the meaning of the decimal “logarithm” of a number like 100 or 10,000 to your 8-year-old niece?

2. (4 points) A population grows in time as P = 5e^0.23t. The constant e stands for e = 2.71828182… How much is P when t=0?

3. (10 points) Both graphs on the next page (A and B), from a recent issue of the journal Electoral Studies (vol. 32, pp. 576-588, 2013), show the citizens’ perception (C) of fairness of elections vs. international expert perception (E) of fairness of the same elections. All scales are in percent. The graphs differ in time and method of survey. On p. 584, the author comments: “By contrast [to certain other countries], both the Ukrainian and Peruvian elections are regarded far worse by international experts than by citizens living in these societies.”
a) Is this so in Graph A? Why, or why not?
b) Is this so in Graph B? Why, or why not?
Fluffy words (a.k.a. BS-ing) will not take you anywhere. Do not waste your time on that. Start by applying some of the basic approaches taught in this class, and the answer will be quite short. Add to the graph, to help show your reasoning.

4. (10 points) Figure 8.5 in Taagepera, Logical Models…, based on a graph by Johnston, Hagen and Jamieson (2004), shows a curve y vs. x, where
y = support for Democrats in individual US states in the Presidential elections of 2000, and
x = support for Democrats in individual US states in the Presidential elections of 1996.
The relationship is approximately y = x^1.18, as the support for Democrats dropped in 2000. This support was up again in 2008. Assume that the pattern was z = y^0.91, where z = support for Democrats in individual US states in the Presidential elections of 2008.
a) Calculate the relationship between Democrat support in 1996 and 2008. In other words, express z in terms of x.
b) Was the support for Democrats in 2008 higher or lower than in 1996 (assuming that z = y^0.91 holds)? How can you tell it from the equation for z and x? (A graph might help.)

5. (12 points) Estimate the number of swimming pools (public and private) in Los Angeles. (Show your reasoning. No credit for just a number drawn from the hat.)

6. (12 points) In a developing country, 40% of boys attend elementary school, but only 10% of girls do. Developing countries tend to have warm climates.
a) Calculate the likely percentage of girls attending school at a later time, when 60% of boys do. (Zero points for guesswork. Place the information given in a logical framework and calculate on that basis.)
b) At the time when 40% of girls attend elementary school, calculate the likely percentage for boys.
c) Graph all three points, so as to make sure they lie on the same smooth curve. Do they?

7. (15 points) Larger countries tend to have larger populations. Examples:

              Area (thou. sq.mi.)   Population 1995 (million)
Trinidad      2.0                    1.3
Denmark       17                     5.2
Colombia      440                    37
USA           3700                   263
a) Graph on suitable scales.
b) Draw in the best-fit line or curve. If you can determine its equation, good for you, but I really do not expect it at this early stage.

c) Which slope might one expect, in the absence of any information? How does the actual slope differ? What might be the reason?

8. (16 points) The graph below shows the actual seat shares for parties plotted against the seat shares expected on the basis of a logical model (the nature of which does not concern us here).
a) For “Two-party contests”, this model looks like a satisfactory approximation, because [complete the short sentence] …
b) For “Two-party contests”, this model is far from a perfect data fit, because [complete the short sentence] …
c) For “Multiparty contests”, this model looks unsatisfactory, because [complete the short sentence] …
d) Add a smooth curve to fit the “Multiparty contests” data. Do it quite carefully, so as not to propose absurdities inadvertently.

9. (15 points) How much could a country's Gross Domestic Product (G) depend on its population (P) and on its average level of education (E, measured as the number of school years for the median person)? Suppose the following model is proposed, as a first approximation, when comparing countries: G = aPE^1/2.
a) Does the proposed relationship between G and P make sense? Why or why not? (Consider what would happen to G when a country splits into two equal parts.)
b) When the median person has at least 6 years of schooling, does the proposed relationship between G and E make sense? Why or why not? (Consider what would happen to G when E is doubled.)
c) Also consider what the model leads to in a country with absolutely no schooling. Suggest a slight correction to the model, to take care of this difficulty.

10. (14 points) The clipping from the Los Angeles Times (25 January 2013) shows the results of recent elections in Israel. [This graph, NOT REPRODUCED HERE, shows these numbers of seats for parties: 20, 12, 11, 11, 7, 19, 15, 6, 6, 4, 4, 3, 2; Conservative parties, 61; Center-Left parties, 59.] Israel is one of the few countries that allocate seats by nationwide proportional representation (with a minimal “legal threshold of votes”). Compare this picture with what you would expect when knowing only that there are 120 seats, allocated by nationwide proportional representation. If this were a final exam, these would be all the instructions given. But this being a midterm, further hints follow. What number of parties would you expect to win seats? To what extent does the actual number differ? How large would you expect the largest party to be? To what extent does the actual size differ? How large would you expect the largest party to be, if knowing the actual number of seat-winning parties? To what extent does the actual size differ?

11. (14 points) Suppose male and female literacies varied as follows, over time (they roughly do):

Time                   t1    t2    t3    t4    t5
Male literacy (%)      16    30    50    81    92
Female literacy (%)     1     5   ---    59    81

Determine the female literacy at time t3, if the general pattern holds. Make use of all you have learned in this course. If this were a final exam, these would be all the instructions given. But this being a midterm, a further hint follows: Think inside the box.


12. (16 points) Consider the equation y = M/[1+e^-kt], where t is time, and M and k are positive constants.
a) How large would y be at time t=0?
b) How large was y in the far past?
c) How large will y be in the far future?
d) Sketch a graph of y vs. time.

13. (14 points) All developed democracies tend to redistribute some wealth from wealthier to poorer people, through taxation. Some people like it, some don’t. One may well presume that the extent of redistribution is higher when more people support it. The figure below shows the extent of redistribution (y) graphed against the support for redistribution (x). Both can in principle range from 0 to 100%. (Source: Lupu and Pontusson, Am. Political Science Review 105, p. 329, 2011.)

Does the line shown in the graph make sense at all possible values of support for redistribution? Offer a better line or curve. [Neglect the SWZ and SPA points.] If this were a final exam, these would be all the instructions given. But this being a midterm, further hints follow. If no one supported redistribution, how much redistribution could we logically expect? What does the straight line shown in the graph represent? What would it predict when support for redistribution is nil? Which line or curve would satisfy both data and logical considerations? How does its degree of fit compare to the fit of the line shown?


14. (16 points) In an assembly of S=100 seats with p=10 parties represented, we usually start an estimate of the largest party share (S1) by observing that it must be between the mean share and the total: 10 ≤ S1 ≤ 100. A perceptive student protests that S1=100 is conceptually impossible, because it would not leave any seats for the other 9 parties which are also supposed to be represented. The instructor agrees that this is only a first approximation, which could be refined later on.
a) What value of S1 would result from this first approximation?
b) How would you refine the first-approximation approach, to satisfy this student?
c) What value of S1 would result then?
d) The actual largest shares in the Netherlands 1918-52, when S=100 and p was close to 10, ranged from 28 to 32, with a mean of 30.5. Compare this data piece with the predictions of the coarse and more refined models. Does the refining of the model seem worthwhile?

15. (20 points) What is the simplest model to try if you know that the relationship between y and x is subject to the following constraints? Give the name of this type of model, and also its equation, if you can (for some, we have not given it). Sketching a picture will help to guide you. Four separate cases:
a) When x=0, y must be 1. When x becomes huge, y approaches 0.
b) When x=0, y must be 0. When x=1, y must be 1. In between, x and y are not equal.
c) When x=0, y must be 0. When x becomes huge, y approaches 100%.
d) When x=0, y must be 1. When x=1, y must be 0. In between, y and x need not be equal. (This is a special challenge – a case we have not discussed explicitly. Bypass this question if you find the previous ones difficult.)

16. (14 points) Indicate the simplest model to try, when the relationship between y and x is subject to the following constraints. In each case, sketch a picture, give the name of the model, and give its equation.
a) When x=0, y must be 0. When x=1, y must be 1. When x=0.5, y must be 0.5. In between, y and x are not equal. Give a specific example where these constraints apply.
b) When x tends toward negative infinity, y tends toward 0. When x tends toward positive infinity, y tends toward a maximum value M. Give an example where these constraints apply.


17. (27 points) In the graph below, add the following, trying to be precise. [GRAPH NOT REPRODUCED HERE shows the points 3,5; 9,5; 9,8.]
a) Draw in the OLS regression line y vs. x. Label it. (Wiggly lines are marked down.)
b) Draw in the OLS regression line x vs. y. Label it.
c) Determine the coordinates (X,Y) of the center of gravity of these 3 points. (This is where precise drawing is needed.)
d) Calculate the slope of the symmetric regression line.
e) Draw in the symmetric regression line quite precisely. Label it.
f) Which data fit should you choose, in the absence of any further information on the causal relationship between y and x?
g) In the absence of any further information, what would be your best guess for y at x=0 and for y at x=10?
h) Now suppose you also know that x determines y, rather than vice versa. Which data fit should you choose now?
i) Now suppose you also know on logical grounds that y=0 goes with x=0. Add this to the graph and look at it. Which pattern makes sense now? Draw in the approximate new fit.

18. (20 points) The following data are given, with no further information on logical constraints etc.:
x     0     5     10
y     1    20    400
a) Calculate the suitable means for x and y (i.e., those most likely to reflect the medians).
b) Graph the data points, y vs. x, on the type of graph paper most likely to yield a straight line. A variety of graph papers is attached for this purpose.
c) On the same graph, also place the point for (mean x, mean y).
d) Draw in the best-fit straight line. (Wiggly lines are marked down.)
e) Write the equation that connects y and x (to the extent the line in the graph holds). If you can give the numerical values of some of the constants in the equation, so much the better, but symbolic notation will do.


19. (21 points) Describe as fully as possible what one can directly see in this graph.

20. (24 points) Costa Rica's population was 0.3 million in 1900; 0.9 million in 1950; and 4.2 million in 2000. Do all you can with this information, using skills developed in this course.

21. (40 points) The percentage of people who disagree with the statement “Most people can be trusted” is as follows, depending on their level of education, Lower or Higher. (The responses of people with Middle education tend to be in between.) Analyze these data, using all your skills, and draw conclusions. Data reprocessed from Table A165 in Ronald Inglehart et al. (2004), Human Beliefs and Values (Mexico: Siglo Veintiuno). The sample is fairly representative.
Country      Lower   Higher
Argentina    86      79
Australia    69      50
Brazil       98      95
China        50      26
Denmark      44      17
U.S.         76      58

22. (20 points) Literacy is expressed as the percentage of people over 15 who can read and write. As a population becomes more literate, female literacy (F) tends to remain lower than male literacy (M): F < M.


a) When female literacy is graphed against male literacy, which regions are conceptually allowed, and which are forbidden? (Draw a graph to show them.)
b) Suppose the following model is proposed: F = M - 20. Show it on your graph. (Do so fairly precisely, e.g., by calculating F for M=100 and M=50 and drawing a line through these points.)
c) This F = M - 20 does satisfy the two conditions above (F < M, …)

22A. (30 points) Literacy is expressed as the percentage of people over 15 who can read and write. As a population becomes more literate, the following is most often observed.
1) Female literacy (F) tends to remain lower than male literacy (M): F < M.

23. (27 points) Literacy percentages for females and males in three developing countries are F=35, M=45; F=45, M=60; F=60, M=70. Build a model, in as much detail as you can.

For Appendix A

24. (13 points) An output z is estimated from linear regression in x and y: z ≈ a + bx + cy. The mean values of the inputs are X and Y. The impact of y is found to be statistically non-significant, so one wishes to

forget about y and estimate z from x alone: z ≈ A + Bx. How are the new constants A and B derived from the previous ones?

25. (13 points) An output z is estimated from linear regression in x and y: z ≈ 0.68 + 0.56x + 0.28y. The mean values of the inputs are X=2.1 and Y=1.6. The impact of y is found to be statistically non-significant, so one wishes to forget about y and estimate z from x alone: z ≈ A + Bx. What values should we plug in for A and B?

26. (16 points) Suppose you are given the following results of a multivariable linear regression of z on x and y:

x          0.97***
y          0.55
Intercept  0.20
R2         0.50

a) Write the equation by which we can estimate z from x and y. b) Suppose the median value of x is 10.0 and the median value of y is 0.90. Calculate an estimate for median z. c) One of the inputs is statistically not significant. Omit it, and show the equation that connects z to the remaining input. d) Use it to estimate the median z. Does the result surprise you? Why, or why not?

27. (20 points) Continuation of Question 10, which is reproduced here. A country's Gross Domestic Product (G) can be expected to grow with its population (P) and also with its average level of education (E), measured as the number of school years for the median person. Suppose the following model is proposed, for a not heavily overpopulated country: G = aP·E^(1/2). a) Does the proposed relationship between G and P make sense? Why or why not? (Consider what would happen to G when P is doubled.) b) When the median person has at least 6 years of schooling, does the proposed relationship between G and E make sense? Why or why not? (Consider what would happen to G when E is doubled, and then doubled again several times.) c) Also consider what would happen in a country with absolutely no schooling. Could you suggest some slight correction? d) Rather than using this attempt at a logical model, suppose someone decides to use multivariable linear regression to elucidate the connection of G to P and E. How should he first transform the variables? (Write out the equation to be used.)


e) Assume that the outcome of such an analysis is as follows (x pertaining to P and y to E):

x          0.97***
y          0.55
Intercept  0.20
R2         0.50

Write the corresponding equation for G itself. f) The resulting equation has some similarities to G = aP·E^(1/2), as well as differences. To what extent would it confirm or disconfirm the supposition G = aP·E^(1/2)?

APPENDIX C

An Alternative Introduction to Logical Models: Basic “Graphacy” in Social Science Model Building


What is graphacy? In analogy with literacy (ability to read and write), the notion of numeracy has been introduced for ability to handle numbers. I add “graphacy”, for ability to construct and interpret graphs. Here I develop graphacy skills for building logical models, using social science examples. I try to make do without any mathematics, or at least without any equations. This is the challenge. Bright 13-year-olds should be able to understand the first two chapters. All I expect is 1) knowledge of what percentages are, and 2) ability to graph a set of data, y vs. x, on regular scales. My experience is that not all college students possess the latter skill, so my Logical Model Building and Basic Numeracy in Social Sciences (Taagepera 2015) spends a long introductory chapter on its pitfalls. Here I bypass it.

Forget about bar and pie charts. They do not connect. Science is about ability to connect factors, and then to predict. So I focus on graphing one factor (y) against another factor (x). My goal is to show how to use such graphs, adding lots of good sense, so as to build “quantitatively predictive” logical models.

Why do I see a need for the present text? Students who complain that they are “not good at math” too often fail even before reaching any math in the sense of arithmetic or algebraic equations. They fail at visualization of a problem or issue. In final exams based on Taagepera (2015), too many students who have mastered a fair amount of mathematics make disastrous mistakes in non-math picturing of logical constraints, mainly by not trying to picture them at all. This initial failure can make what follows pointless, no matter how many logarithms and exponents one piles up. Visualization seems to be a major hurdle. So such graphacy skills should be introduced at the very beginning of a course on logical models, so as to give these skills time to be digested and reinforced throughout the remaining course.

The first chapter introduces our amazing ability to deduce the broad overall pattern of y vs. x at any values, on the basis of just the logical constraints involved plus one data point. The second chapter expands to a number of the most usual constraints. Here the data points are fictitious (but realistically chosen). The third chapter presents actual data, as published but only incompletely analyzed (mostly by linear fit that risks producing absurd values of y at some values of x). This chapter asks students to establish the constraints involved and then draw the curve for the entire possible range of x. A separate document offers my answers.

How do these graphacy chapters fit in with Logical Models and Basic Numeracy in Social Sciences (http://www.psych.ut.ee/stk/Beginners_Logical_Models.pdf)?

In pilot courses, 2016 and 2017, I intend to start off with Chapter 2 of Logical Model Building and Basic Numeracy in Social Sciences, which introduces the ability to graph a set of data, y vs. x, on regular scales (including the first Home Problem). I then intend to shift to the present text. As we work out its Chapters 1 and 2 in class, Chapter 3 is assigned as Home Problems, probably in two separate batches, Cases 1 to 5 and 6 to 11. Out of these 11, fully 8 occur in Logical Models and Basic Numeracy – but here they come unencumbered with any equations. Thus the basics of graphacy should be easier to grasp, and later we can address the mathematical aspects separately. A complementary skill, the ability to “read” all features in a moderately complex graph, is not addressed here but rather in Logical Models and Basic Numeracy (Chapter 10, section “What can we see in this graph?”). Once the pilot courses have worked out, I intend to include this text in Logical Models and Basic Numeracy. However, this would mean a major rewrite.

Rein Taagepera
University of California, Irvine, and University of Tartu, Estonia
Johan Skytte Award 2008, IPSA Karl Deutsch Award 2016


1. Data: Do something with them, using thinking

Suppose I offer you the following data point: when x=60%, y=36%. Say something about this information. Do something with it. Are you at a loss? What can one do with just one data point? If your idea of science is limited (mistakenly) to computerized statistical analysis, you can do little, indeed. But if we realize that science is an interaction of data and thinking, then we can squeeze something out even of this little.

1. One data point and abstract percentages

What can we do? Compare. Place this 60%;36% in a wider framework. (Without comparisons, science could not even begin.) Still at a loss? Think in a comparative way. What is usually the lowest value percentages could have? And the highest? These are useful comparison points.

Still at a loss? OK, here’s what I am after. Percentages usually run from 0 to 100. In this range, 60 is on the high side, closer to 100, while 36 is on the low side, closer to 0. This is what I mean by “Compare and place in a wider framework”. If you think this is too obvious to be mentioned, you are on the wrong track. Science starts with making conscious note of the “obvious”.

There is still more we can do. Again at a loss? Always try to turn a problem into a picture. Still at a loss? Well, we have some x and some y. Let us graph it. Graph all we know. Maybe you drew the axes only from 0 to about 60 and then placed the point 60;36 in this field:

If this is what you did, then you did not graph all we know. We also know that the percentages usually go up to 100, and no more. So draw the entire box, x going from 0 to 100, and y going from 0 to 100. Shade in grey the area outside this box: This is the region where percentages usually cannot occur. In contrast, inside this box, any point x;y may occur, in principle.


We now can see the “conceptually allowed” region, and within it, our single data point. It is now quite explicit where this data point is located, compared to all the other locations where it could conceivably be.

If you now think I have dumbed it down beyond your dignity and patience, with stuff at best marginal to science, please reconsider. If all this is so self-evident that you might as well by-pass it, then why didn’t you draw the box above on your own, the moment you were told, “Do something with it”? As for being marginal to scientific thinking, this is dead wrong. This example is at the very center of scientific thinking, because it introduces two basic habits to develop:
• Think about combining things directly given (data point) with other knowledge you already have (percentages ranging from 0 to 100).
• Visualize – draw a picture – a picture of the entire possible range.

2. One data point with substantive content

This is how far we can go without knowing what x and y are about – just abstract x and y. Now let us give them substance. This is where we shift from mathematics to natural science, human nature included. I now tell you we are dealing with male and female literacy rates in a country – the proportions able to read and write. Make it appear in the notation: M=60%, while F=36%. Such informative notation greatly helps thinking, compared to bland x and y.

Now we can go much further with these data plus thinking. The fundamental question that we should ask at every turn is:
• But can this be so? Does it make sense?

Apply it to M=60%, F=36%. Are you at a loss? Compare. M is larger than F. Does this make sense? If we were given the reverse (M=36%, F=60%), would you sense something odd? You should. When literacy is low, it always is even lower for females than for males. This is a hard fact. So M larger than F makes historical sense.


Historical. Think about this word. What could the literacy rates have been earlier, and what could they be later on? Try to offer some estimates, and place them on the graph F vs. M. How on earth could we know? For the recent past and the close future, this would be guesswork, indeed, at this stage. But, counter-intuitively, this task becomes quite precise for the much more distant past and future. How could this be?

Way back, there must have been a time when no male in this country could read: M=0%. What could F have been at that time? No escape: It must have been F=0%. When was that? Never mind the date. (In science we must learn to overlook what is not needed.) But whenever it was that M was 0, F must have been 0 too (unless one builds a very fancy story). Place this point on the graph.

Now think ahead. Sooner or later, all countries reach practically complete literacy: M=100% and F=100%. Place this point, too, on the graph:

Such points are “conceptual anchor points”, because they anchor the entire possible path literacy can take, from no one literate to everyone literate. Anchor points are “experimental data points” in a special sense: They result from thought experiments. Hold it, some data-oriented people may scream: Science deals with reality, not hypotheticals. They are dead wrong. Thought experiments are very much part of science.

Our anchor points are even stronger data points than 60;36, in the following sense. The 60;36 is subject to random error. It could have been 61;35, for all we know. But 0;0 has no such leeway. This is a perfect 0;0. (The 100;100 could technically stop at something like 99.7;99.8. Learn when it makes sense to overlook such quibbles.)

Now our graph no longer has just a single data point. It has 3 points. “In the absence of any further knowledge” (another important term in science thinking), our best bet would be to join them with a smooth curve. (If we make it fancier, we must offer reasons why the path must deviate from the simplest path possible.)


Chances are that the curve you have drawn suggests that M=50% would correspond to F around 25%. In reverse, F=50% would correspond to M around 70%. Does it, approximately? In sum, for any given male literacy rate we can predict what the female rate would be. And vice versa. This is a main goal of science: ability to predict, with some degree of precision.

We could further add the “equality line”, F=M. The distance between this diagonal and the actual curve shows how much female literacy trails the male. As overall literacy increases, this gap (M-F) widens until M reaches 50%. Thereafter, the gap begins to decrease. Most countries actually do follow such a pattern, as literacy develops.

We could now go mathematical and develop the equation that pretty much fits the curve you have drawn, but we will not do so at this stage. (This equation is F/100 = (M/100)^2, if you insist on knowing.) The main point is that a single data point could lead to this ability to predict the female literacy rate from the male, and vice versa. This is pretty powerful! Such predictive ability becomes possible, once one trains oneself to practice thinking, using fairly broad guidelines. To wit:

• “But can this be so?” Does it make sense?
• Use informative notation (M and F rather than x and y).
• Think about extremes (like 0 and 100%).
• Compare (also with extremes).
• Graph the data.
• Graph more than the data – show the entire conceptually allowed area where data points could possibly occur.
• Consider conceptual anchor points.
• Connect anchor points and data points with a smooth curve.
• Sometimes a diagonal also makes sense, as a benchmark for comparisons.
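For those who want to check these guidelines numerically, here is a minimal Python sketch (Python itself and the name predict_F are merely illustrative choices). It passes the simplest fixed-exponent curve through the anchor points 0;0 and 100;100 and the single data point M=60, F=36, and then predicts F at any M.

    import math

    # Single data point from the text: male literacy M = 60%, female literacy F = 36%.
    M0, F0 = 60.0, 36.0

    # Simplest curve through the anchor points 0;0 and 100;100: F/100 = (M/100)^k.
    # The one data point fixes the exponent k.
    k = math.log(F0 / 100) / math.log(M0 / 100)   # about 2.0

    def predict_F(M):
        """Predicted female literacy (%) for a given male literacy (%)."""
        return 100 * (M / 100) ** k

    for M in (0, 20, 50, 70, 100):
        print(f"M = {M:3d}%  ->  F = {predict_F(M):5.1f}%")
    # M = 50% gives F near 25%, and M = 70% gives F near 50%,
    # just as the hand-drawn curve suggests.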


3. More data points: thinking vs. statistical analysis

Now assume more data points are given – like 60;36, and 50;25, and 70;50. For the thinking approach, this adds confirmation that the single point 60;36 was not a fluke or a result of erroneous measurement. All three data points visibly are close to the same smooth curve joining the anchor points. This is a pattern that makes sense, going from utter illiteracy to complete literacy. For instance, when M=20%, F is around 4% (the small “+” in the graph below).

But now we also have enough points to apply statistical analysis. This leads to a temptation to spare us the thinking. Let the computer do this dirty work of thinking! (This means we stop being scientists but still hope to get results that qualify as science.) Enter the data in the computer, push the “linear regression” button, and we get a printout with lots of impressive-looking numbers. We might figure out that some numbers in the printout mean that the best straight-line fit is F=-41+1.3M. But we also are offered much more. The printout also shows “R2”. Its value is around 0.90, which means that the line obtained is a pretty good fit – data points are scattered only slightly around this line. The printout may also have p-values, stars next to some coefficients, and maybe even more. This may look impressive.

But what about ability to predict, this major goal of science? From your computer printout, can you determine, say, what the female literacy rate would be when the male one is 20%? Or what the male literacy rate would be when the female reaches 100%? If you cannot do that, you can publish your printout and babble about R-s and stars, but your publication would fall short of what is of most interest in science.

Suppose you can hack it through the printout maze and deduce from it the line F=-41+1.3M. This does enable us to deduce F from M, and vice versa. What would the female literacy rate be when the male one is 20%? For M=20%, the computer equation predicts F=-15%. A negative

female literacy rate? Does that make sense? Of course not. Something is badly off the track. What would the male literacy rate be when the female reaches 100%? The computer equation predicts M=108.4%. More than 100% of the males being literate? Come now!

What’s happening? To make it visible, let us graph the computer-supplied results and compare the outcome with our previous one:

Yes, this is what one gets, when one does not think and does not graph, before rushing to the computer. Good data in, junk out. The computer ignored the conceptual anchor points. Why? Because no one told the computer that such anchor points exist.

Do not blindly yield to the computer printout. Ask “Does it make sense?” If it doesn’t, do not accept it. Say: I must have goofed somehow. Indeed, nothing is wrong with the computer and its linear regression program. Blame the person who misuses them. Before feeding the data into the computer, we must think about how to do it so that it makes sense. This takes some mathematics, which we avoid here. (If you insist on knowing: we must feed into linear regression the logarithms of M and F, not M and F themselves.)

For the moment, let us get more practice with what is the most important – drawing pictures to express various situations. And this does not need mathematics! – at least not what is usually viewed as mathematics.
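For those who do want a glimpse of the mathematics just mentioned, here is a minimal Python sketch (it assumes numpy is available; the variable names are illustrative). It contrasts the blind straight-line fit on the three data points with a fit to their logarithms forced through the 0;0 anchor point. The exact coefficients differ slightly from the rounded line quoted above, but the lesson is the same: the raw linear fit predicts an absurd negative F at low M, while the log fit stays sensible.

    import numpy as np

    # The three (M, F) literacy data points used above.
    M = np.array([50.0, 60.0, 70.0])
    F = np.array([25.0, 36.0, 50.0])

    # Blind straight-line fit F = a + b*M (what the printout offers).
    b, a = np.polyfit(M, F, 1)
    print(f"Linear fit: F = {a:.0f} + {b:.2f} M")
    print(f"Linear prediction at M = 20%: {a + b * 20:.0f}")   # negative -> nonsense

    # Feeding in logarithms instead: log(F/100) = k*log(M/100),
    # which forces the fit through the anchor points 0;0 and 100;100.
    x, y = np.log(M / 100), np.log(F / 100)
    k = np.sum(x * y) / np.sum(x * x)
    print(f"Log fit: F = 100*(M/100)^{k:.2f}")
    print(f"Log-fit prediction at M = 20%: {100 * (20 / 100) ** k:.0f}")   # about 4%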


2. Some usual patterns of data subject to constraints

Two factors can interact in a multitude of ways. The processes of interaction can be varied and complex. At times x may influence y, while at other times y may influence x. Or both x and y may be influenced by a hidden third factor z. Constraints such as anchor points may still impose similar patterns. We’ll cover a number of them. Do not try to memorize them. When approaching a new problem, do not try to go through the list, trying to guess which one might fit. None might. Try graphing the constraints the factors are subject to, and only one option may emerge.

1. Box with two anchor points

The pattern of male and female literacy rates repeats itself whenever two factors must be 0 jointly and also must reach their maximums jointly. One can always express their values as percentages of their respective maximums. Then 0;0 and 100;100 are anchor points. Thus, similar curves arise when a few data points are given. (Somewhat different shapes can still materialize, but let us not worry about this right now.)

Take popular approval ratings of head of state and legislative assembly. Typically, people rate heads of state higher than assemblies. (They yearn for father figures.) Then, when even the head of state’s approval rate drops to 0%, we can expect no higher rate for the assembly. This leads to anchor point 0;0. Conversely, if even the assembly rates 100%, we can expect no lower for the head of state. This leads to anchor point 100;100. If we have a single data point for a country, say head of state approval H=60%, assembly approval A=36%, we can already guess at how one of them might change when the other changes.

Now suppose we have data for 3 countries across the world, one data point for each. I use three different symbols, so as to drive home that these are different countries:


What could be our guess for the overall pattern for the world? Some students have too much respect for data and try to pass a wiggly curve through all the points, sometimes even neglecting to connect to anchor point 100;100:

Don’t do it! Different countries may be on different curves (each of them respecting the anchor points), some closer to the diagonal, A=H, some further away:

(Mathematically, if you insist on knowing, all these curves are A/100 = (H/100)^k, but the value of k is different.) Our best guess is that the world average curve might be around the central curve. With so few cases, we cannot be sure. But with the help of logical anchor points, we can do much better than say “I don’t know”.


Here’s another example. Trust in people varies enormously from country to country. In some, only 10% say people usually can be trusted; in some others 60% do. But within the same country, the average trust level of people with higher education (H) always is higher than the average trust level of those with primary education (P). Anchor points 0;0 and 100;100 still apply (if trust is in percent). When we graph P vs. H for all countries in the world (not shown here), we get a single curve with remarkably little scatter. Further examples could be piled up, but let us stop before you start thinking that a box with two anchor points is all the wisdom you need.

2. Box with three anchor points

Have a country with only two parties, where assembly seats are allocated “by plurality” in single-seat districts. This process works out in such a way that the party that gets a larger share of votes (V) wins an even larger share of seats (S). Typically, when a party has 60% of the votes, it receives 77% of the seats. If this party has 100% of the votes, it obviously wins 100% of the seats. And if it has 0 votes, it also has 0 seats. This looks like another situation with two anchor points.

But hold it. What if the votes are 50–50? In the absence of any other information (the famous expression), we cannot favor either party. So the seats must also go 50–50. We have a third anchor point in the center. Along with our single data point, 60;77, the picture is

How should we draw a curve to join the 3 anchor points and the data point? Actually, we have a hidden second data point. How come? Recall that there are only two parties. Think: What does it imply for the second one, when the first one is 60;77? Enter this extra point to the graph. Join these 5 points with a smooth curve:


We have little leeway. The simplest curve we can draw through these points is bound to be a bit more complex than in earlier cases. This is very much the pattern the UK used to have when it had a pretty clear two-party constellation. It no longer has. (The equation for this type of curve is S/(100-S) = [V/(100-V)]^k. Here k=3. This is the slope at V=50.)

Now suppose we have a direct presidential election. Whichever candidate has even a slight edge wins the entire single seat at stake. Here 51% of the votes leads to 100% of the seats. How does the graph look now?

How can we join these points? This is still the same picture of 3 anchor points, except that the smooth bend must become a sharp kink. What would happen if both candidates had exactly the same number of votes? One would have to toss a coin, and each candidate would have a 50–50 chance. So the third anchor point at 50–50 still applies. (The equation for this curve still is S/(100-S) = [V/(100-V)]^k, but here k tends toward infinity.)


In the other direction, if we had a huge number of seats, the curve would become flatter. It would approach the diagonal, S=V. (For this curve k=1 in S/(100-S) = [V/(100-V)]^k.) So the number of seats at stake matters. Within the same “family of curves”, a larger total number of seats makes the curve flatter, while a smaller number makes it kinkier.
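A minimal Python sketch of this family of curves (illustrative only; the function name is not from the text) computes S from V under S/(100-S) = [V/(100-V)]^k for a few values of k:

    def seats_from_votes(V, k):
        """Seat share S (%) from vote share V (%) under S/(100-S) = (V/(100-V))^k."""
        if V in (0, 100):               # corner anchor points
            return float(V)
        r = (V / (100 - V)) ** k        # seat ratio between the two parties
        return 100 * r / (1 + r)

    # k = 1 gives the diagonal S = V; k = 3 is the old UK-type pattern;
    # a very large k approximates the winner-take-all kink of a presidential race.
    for k in (1, 3, 100):
        row = [round(seats_from_votes(V, k)) for V in (0, 25, 50, 60, 75, 100)]
        print(f"k = {k:3d}: {row}")
    # For k = 3, V = 60% yields S close to 77%, as in the text's example.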

4. Between floor and ceiling

When the economy expands (per capita Gross Domestic Product goes up), more people tend to be satisfied with the situation. When the economy contracts, more people tend to be unhappy. Suppose that 40% of people are happy when the economy is stable (per capita GDP goes neither up nor down). Suppose that 60% of people are happy when the economy expands by 4% per year. Make a picture of Satisfaction (S) vs. Growth in GDP (G). Include the extreme possibilities. We first get this:

We have just two data points, sandwiched between a floor of 0% satisfaction and a ceiling of 100% satisfaction. How are we to draw a curve that gives us a value of S for every value of G? The allowed region is a zone between S=0 and S=100%, at any value of G. When growth is very positive, satisfaction is likely to rise toward 100%, but without ever fully reaching it. When growth is very negative (economy shrinks sharply), satisfaction is likely to drop toward 0%, but again without ever fully reaching it. Symbolize this approach to a limit as slightly bent curves very close to floor at left and to ceiling at right. They occur at very large increase or decrease in GDP:


Now it becomes a matter of joining the two data points in a way that respects the conceptual limits. Try to make it as smooth and symmetrical as possible:

This curve is often characterized as a “drawn-out S”. (The technical term for the simplest drawn-out S curve is “simple logistic”. Its equation is S = 100/(1+e^(-k(G-G'))), where G' is the value of G where S is 50. The k reflects the steepness at the midpoint, where S is 50.) Note that our previous 3-anchor-point curve could also be called a drawn-out S. But there the limits on the horizontal axis are 0 and 100, while here the yearly change in GDP could in principle range from hugely negative to hugely positive. (Technically, it could range from minus 100 to plus infinity.)

Anchor points and approaches to limits are the two major forms of constraints we encounter. Anchor points are easier to visualize: a specific value of y at a specific value of x. Approaches to limits are more awkward: a specific value of y when x “tends toward” minus or plus infinity. Infinity is awkward. Try to get used to it.

With two anchor points, a single data point could give us some idea about the shape of the curve. But with two approaches to limits, as is the case here, we need at least two data points. Otherwise the steepness of the curve near the center could not be estimated. To have a fair estimate, these two data points should be far from the limits and far from each other. Our two data points, at S=40% and S=60%, satisfy this.

The drawn-out-S curves are most usual when sizes change over time. Time ranges from minus to plus infinity. Take a bacteria colony that starts with a single bacterium and eventually fills up the entire space available. Its size increases along the drawn-out-S curve. Or take a small human group landing on a previously uninhabited island and multiplying, until arable land runs out and sets a limit to population. Or take an idea or a product that initially meets resistance but then “takes off”, until meeting some sort of a limit. These curves are not always simple and symmetric, but the broad common feature is slow takeoff from a floor and, later, a slow approach to a ceiling.
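A minimal Python sketch of this calibration (illustrative only; Gm in the code stands for the text's G', and the logit shortcut is just one convenient way to recover the two constants from the two data points):

    import math

    def logit(p):
        """Log-odds of a proportion 0 < p < 1."""
        return math.log(p / (1 - p))

    # The two data points: S = 40% at G = 0, and S = 60% at G = 4.
    (G1, S1), (G2, S2) = (0.0, 40.0), (4.0, 60.0)

    # For the simple logistic S = 100/(1 + exp(-k*(G - Gm))),
    # logit(S/100) is a straight line in G, so two points pin down k and Gm.
    k = (logit(S2 / 100) - logit(S1 / 100)) / (G2 - G1)   # steepness, about 0.20
    Gm = G1 - logit(S1 / 100) / k                          # midpoint (S = 50), about 2

    def satisfaction(G):
        return 100 / (1 + math.exp(-k * (G - Gm)))

    for G in (-10, -4, 0, 2, 4, 10):
        print(f"G = {G:+3d}%  ->  S = {satisfaction(G):5.1f}%")
    # The curve creeps toward 0% at very negative growth and toward 100%
    # at very positive growth, without ever reaching either limit.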


5. From anchor point to ceiling

Volatility of voters means their readiness to switch parties from one election to the next. The more parties there are, the more ways there are to switch. So volatility (V) should increase with an increasing number of parties (N). Suppose volatility is 24% when 3 roughly equal parties are running. What could it be when only 2 roughly equal parties are running? If you guessed at 2/3 of 24%, i.e. 16%, you overestimated a bit. Better ask again: What are the limits on N and V? And are there any anchor points or approaches to a limit?

The allowed region is a box open to the right. The number of parties must be at least 1, but could be huge, in principle. Volatility can range from 0 to 100%. If only the same one party runs in both elections, then party switching is impossible. Thus, we have an anchor point: N=1 imposes V=0. When a huge number of parties run, volatility might approach 100% but not exceed it. In the absence of any other knowledge, take V=100% as the ceiling, approached as N tends toward infinity. The picture now is:

Smoothly joining the anchor point, data point, and the approach to the limit gives us


So, with two parties, a volatility of about 12% could be expected. Note that in this case we again could make an estimate on the basis of a single data point. (The equation for this type of curve is V = 100(1-e^(-k(N-1))). Here k=0.14.)

You may be unhappy about many details. How can we draw a continuous curve, when parties come in integer numbers? We actually use an “effective number” that weights parties according to their size. If the vote shares are 60, 30 and 10%, this effective number is 2.17 parties. (The formula is N = 10,000/SUM(v_i^2), with the vote shares v_i in percent.) We make many simplifying assumptions. The data point 3;24 is typical of state elections in India, the best database available.
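A minimal Python sketch of both little calculations (illustrative only; calibrating k from the single data point 3;24 is one simple way to pin down the curve):

    import math

    def effective_number(vote_shares_percent):
        """Effective number of parties: N = 10,000 / SUM(v_i^2), shares in percent."""
        return 10000 / sum(v ** 2 for v in vote_shares_percent)

    print(round(effective_number([60, 30, 10]), 2))    # 2.17 parties, as in the text

    # Coarse volatility curve: anchor point N=1, V=0, ceiling V=100 as N grows,
    # V = 100*(1 - exp(-k*(N - 1))). Calibrate k from the data point N=3, V=24.
    k = -math.log(1 - 24 / 100) / (3 - 1)              # about 0.14

    def volatility(N):
        return 100 * (1 - math.exp(-k * (N - 1)))

    print(round(volatility(2)))    # about 13%, close to the 12% read off the curve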

6. Constant doubling times

We have dealt with anchor points and approaches to limits. Can there be utterly different kinds of constraints, to determine a pattern? Certainly. A most usual one is constant doubling times. Young living organisms have this urge to double in size at constant intervals. In the opposite direction, a chunk of pure radioactive material loses one-half of its weight at constant intervals (half-lives).

Have two data points for a sum deposited at fixed compound interest: 1000 dollars right now (at time 0) and 2000 dollars 15 years from now (t=15). (This implies a compound interest rate of 4.73% per year.) What is the doubling time in this case? ... years, of course. With the constraint of constant doubling times, this imposes 4000 dollars at 30 years from now – and, in reverse, 500 dollars 15 years ago. And so on for +45 and –30 years. The graph shows these deduced points as “+”:


Joining these points, we get a smooth curve that goes up ever faster in the future and approaches 0 dollars at times far past. (Thus, the constraint of approaching a floor at 0 is built in when we assume constant doubling times.)

Now the amount at any time can be read off from the curve. For instance, 10 years from now, the amount would be around 1590 dollars.

Constant doubling times are the very definition of “exponential” growth. (The equation for this pattern is S = S0·e^(kt). For our example, k = ln 2/15 ≈ 0.046 per year, so S = 1000·e^(0.046t).)
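A minimal Python sketch of the same deductions, written directly in the doubling-time form S = S0·2^(t/T), which is equivalent to the exponential form above (the code and its variable names are illustrative only):

    # Constant doubling time T means S = S0 * 2^(t/T), which is the same curve
    # as the exponential form S = S0*exp(k*t) with k = ln(2)/T.
    S0 = 1000.0   # dollars at t = 0
    T = 15.0      # doubling time, in years

    def amount(t):
        return S0 * 2 ** (t / T)

    for t in (-30, -15, 0, 10, 15, 30, 45):
        print(f"t = {t:+3d} yr  ->  {amount(t):7.0f} dollars")
    # t = -15 gives 500, t = +30 gives 4000, and t = +10 gives about 1590 dollars,
    # matching the deduced points and the reading taken from the hand-drawn curve.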

7. Overview of frequent combinations of constraints

The basic simple combinations are:
• two anchor points – at least one data point needed to place the curve;
• one anchor point plus one approach to a limit – at least one data point needed;
• approaches to two limits, upper and lower – at least two data points needed to place the curve;
• constant doubling times – at least two data points needed to place the curve.

Many other combinations can occur. The only extra one we dealt with was 3 anchor points. Note that we found hardly any cases where a straight line would do – just the diagonal of the square box, y=x. Straight-line relationships are rare indeed.


3. Fitting actual data with logically grounded curves

Up to now, we have seen how little data we may need to make a pretty good guess at what the overall pattern might be – once logical anchor points and approaches to limits are taken into account. Actual data may include a profusion of data points. They usually form a “data cloud”, which may look straight or like a bent sausage or something more outlandish. Some deviant points may occur far away from the dense data cloud. Take the data presented at face value, without arguing about the ways they were obtained. Our focus is on how to analyze valid data.

The game is still the same. Establish logical anchor points and approaches to limits. An equality line may offer perspective, if such an equality line makes sense. (Percentages can be equal to other percentages, but areas cannot be equal to populations, for instance.) Published data graphs may involve frames and lines that may confuse rather than help you.

Here are eleven challenges. Do not expect all of them to fit exactly with patterns previously presented. We always have to think. My suggested answers are to be found in a separate document, “Solutions for problems in Chapter 3”. Even if they should be available, do not expect any increase whatsoever in your science skills if you just read the questions and then peek at the answers. You have to sweat it through, make mistakes, and learn from your mistakes. We cannot learn from mistakes we were too lazy to make. If you get completely stuck with some of the cases, go back to Chapter 2 and study it further. There is no cheap way out, if you want to acquire science thinking.

CASE 1. How are ratings for capitalism and democracy related?

At least in the North Pacific, support for capitalism (call it C) and support for democracy (D) seem to go somewhat hand-in-hand, but with some scatter, as shown in the graph below (Dalton and Shin 2006: 250).


What is the overall pattern? Copy the graph and insert additions needed. Do not be confused by the box shown -- it has no meaning. Think of allowed and forbidden regions, logical anchor points, approaches to limits, and equality lines. Do not expect that all of these enter in every instance. In science, we need precision. Lines drawn freehand are not precise enough. Use a straightedge. A book edge might do, but a transparent straightedge makes life easier. When fitting with a smooth curve, try to make it so that roughly half the data points are above the curve and half are below. But keep the curve smooth.

CASE 2. When you lose ground, do you lose everywhere?

The US Democratic Party did pretty well in 1996. In the next elections, 2000, they did worse. Did they do worse in every US state? The graph below (Johnston, Hagen and Jamieson 2004: 50) shows the Democrat vote shares in 2000 (call it S, for second election) plotted against the Democrat vote shares in 1996 (call it F, for first election). The support figures in individual states range from below 30 to close to 70%.


What is the overall pattern? Copy the graph and insert the additions needed. Do not be confused by the box and lines shown – they have no meaning or have a questionable meaning. Think of allowed and forbidden regions, logical anchor points, approaches to limits, and equality lines. Do not expect that all of these enter in every instance.

When a scale must be extended, what is the simplest way to go? Place the edge of a sheet of paper at the scale. Copy the scale on this edge of the sheet. Then shift the sheet left or right and mark further scale marks on the original graph.

CASE 3. How are support for gender equality and tolerance of homosexuality related?

Average degree of support for gender equality (call it G) and tolerance for homosexuality (call it H) seem to go somewhat hand-in-hand, as shown in the graph below (Norris and Inglehart 2003: 27). But scatter is wide.


What is the overall pattern? Copy the graph and insert the additions needed. Neglect the blatant deviants and focus on the dense data cloud. Do not be confused by the box and line shown – they have no meaning or have a questionable meaning. Think of allowed and forbidden regions, logical anchor points, approaches to limits, and equality lines. Do not expect that all of these enter in every instance. When fitting with a smooth curve, try to make it so that roughly half the data points are above the curve and half are below. But keep the curve smooth.

Note that the Gender Equality scale runs from 0 to 100, but the Approval of Homosexuality scale runs from 0 to 10. Thus, “10” means full tolerance. To convert to percent, what do we have to do?

CASE 4. How the number of parties affects voters’ volatility

The graph below shows that voters’ volatility in Indian states, from one election to the next, tends to increase when they can choose from among more parties (adjusted from Heath 2005). But scatter is wide.


What is the overall pattern? Copy the graph and insert the additions needed. Think of allowed and forbidden regions, logical anchor points, approaches to limits, and equality lines. Do not expect that all of these enter in every instance. When fitting with a smooth curve, try to make it so that roughly half the data points are above the curve and half are below.

CASE 5. How does bicameralism connect with federalism?

Federalism means that a country is divided into “provinces” (Canada) or “states” (US) with more or less autonomy. Strength of federalism (call it F) expresses how autonomous they are. Bicameralism means that the national assembly has two separate “chambers”, with different powers. The first or only chamber is usually elected by “one person, one vote”. The second chamber may represent the federal units, or ancient aristocracy (UK), or whatever. Its strength (call it B) may range from nothing to being equal to the strength of the first chamber.

Full-fledged federalism should go with two equally powerful chambers, so that both population and federal subunits can be represented on an equal basis. If a completely unitary country has a second chamber, it might be weak. So there may be some connection between B and F. The graph (Lijphart 1999: 214) shows limited connection.


This is a fuzzier case than the previous one. Rather than measured quantities, Lijphart uses here informed estimates. He does so on a 1 to 5 point scale for F and on a 1 to 4 point scale for B. For federalism, F=1 means a purely unitary country, and F=5 means that the provinces or states are as strong as they ever get. For bicameralism, B=1 means only one chamber, while B=4 means a second chamber as powerful as the first chamber.

One confusing feature here is that this graph rates “nothing” as 1 rather than 0. Tough luck, but we sometimes encounter data in such a form. We may wish to re-label the scales, so that 1 becomes 0% and F=5 and B=4 become 100%. Then F=2 becomes F’=25%; F=3 becomes F’=50%; and F=4 becomes F’=75%. And B=2 becomes B’=33.3%, and B=3 becomes B’=66.7%. It would help to show this new scale in your graph.
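A minimal Python sketch of this re-labeling (the function name rescale is illustrative only):

    def rescale(value, low, high):
        """Re-label a low..high point scale so that low -> 0% and high -> 100%."""
        return 100 * (value - low) / (high - low)

    # Lijphart's scales: federalism F runs 1 to 5, bicameralism B runs 1 to 4.
    print([round(rescale(F, 1, 5)) for F in range(1, 6)])      # [0, 25, 50, 75, 100]
    print([round(rescale(B, 1, 4), 1) for B in range(1, 5)])   # [0.0, 33.3, 66.7, 100.0]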

What is the average overall pattern? Copy the graph and insert the additions needed. Do not be confused by the box and line shown – they have no meaning or have a questionable meaning.

Hints: One might expect that strong federal subunits would go with a strong second chamber to defend their interests. So F=5, B=4 looks like a logical anchor point – and quite a few countries inhabit this location. The case for F=1, B=1 is flimsier, as nothing prevents unitary countries from having second chambers. But quite a few countries inhabit this location (F=1, B=1), so give it a try, as an anchor point. What would be the best simple curve joining 1;1 and 5;4? (Even the best one would still be a poor fit for all too many countries. But science is about doing the most with whatever we have, rather than giving up.)

CASE 6. How does strength of judicial review of laws relate to constitutional rigidity?

Constitutional rigidity means that the constitution is hard to change. Judicial review means that courts can pass judgment on whether a law conforms to the constitution. Both show that the constitution is taken seriously, so they might go hand-in-hand. But the graph below (Lijphart 1999: 229) shows only a faint connection.

What is the average overall pattern? Copy the graph and insert the additions needed. Do not be confused by the box and line shown – they have no meaning or have a questionable meaning.

We again face the difficulty that “nothing” is rated as 1 rather than 0. We may wish to re-label the scales, so that 1 becomes 0% and 4

becomes 100%. Then 2 becomes 33.3 and 3 becomes 66.7. Tough luck, but we sometimes encounter data in such a form.

This is an even fuzzier case than the previous one. Lijphart again uses informed estimates. He does so on a 1 to 4 point scale for both factors. For rigidity, C=1 means no written constitution at all (as in the UK), and C=4 means that the constitution is extremely hard to modify. For judicial review, J=1 means no judicial review, and J=4 means a very stringent review.

One cannot declare a law unconstitutional when there is no constitution! Thus C=1 should impose J=1. This looks like a logical anchor point. But the graph shows that Iceland still manages to have some judicial review of constitutionality of laws despite not having a written constitution.

We also might set up as an ideal extreme that J=4 goes with C=4, even while India falls short on C, and Switzerland falls utterly short on J. What would be the best simple curve joining 1;1 and 4;4? (Even the best one would still be a poor fit for all too many countries. But science is about doing the most about whatever we have, rather than giving up.)


CASE 7. How do interest groups and political parties compensate for each other?

Some countries have many separate interest groups, such as separate trade unions and business associations. In some others they congregate under a few roof organizations, such as a single nationwide association of employers and a single trade union for workers. Lijphart (1999: 183) makes use of a compound index of interest group pluralism (I). How this index is formed need not concern us here. Just accept that I=0 stands for as few separate interest groups as there can be, while I=4 stands for as many separate groups as there can be.

Some countries have many political parties, while others have few. The aforementioned effective number of parliamentary parties (Example 5, volatility) is a fair measure. One might think that more parties would go with more interest groups. But the graph below (Lijphart 1999: 183) shows the reverse! Yes, surprisingly, countries that have many parties tend to have low interest group pluralism, and vice versa. It looks as if there were some complementarity. But what is the form of this relationship?

Copy the graph and insert the additions needed. Do not be confused by the box and line shown -- parts of this have no meaning or have a

questionable meaning. Think of allowed and forbidden regions, logical anchor points, approaches to limits, and equality lines. Do not expect that all of these enter in every instance. Here the fact that the index I runs from 0 to 4 rather than from 0 to 100% presents no special difficulty, as long as we keep in mind that I=4 is an upper limit, by definition.

CASE 8. How satisfaction with the economy changes with growth or decrease in GDP

When Gross Domestic Product increases, more people are likely to be satisfied with the economy, and vice versa. The graph below (Lewis-Beck et al., Electoral Studies 32: 524–528, 2013) confirms it. This graph shows on the y-axis the percentage (W) of people polled who felt that the economic situation had worsened versus, on the x-axis, the percentage (G) by which GDP had increased or shrunk, compared to the previous year. The data points are labeled with the years in which they occurred; pay no attention to these dates but just to their locations on the graph.

What is the average overall pattern? Copy the graph and insert the additions needed. Do not be confused by the lines shown – they may have no meaning or have a questionable meaning. Think of allowed and forbidden regions, logical anchor points, approaches to limits, and equality lines. Do not expect that all of these enter in every instance.


CASE 9. How do representatives reflect the ideology of their constituents?

Assume one-seat electoral districts. One might well guess that the representative of a very conservative district would vote accordingly in the assembly, and vice versa for a very non-conservative district. (The latter would be called leftist in most of the world but “liberal” in the US.) But what about a district where conservatives barely outweigh non-conservatives? Would their conservative representative take into account her non-conservative constituents and vote sort of 50-50 on issues in the assembly? Or would she vote all-out conservative, as if she represented only those constituents who voted for her? The graph (Dalton 2006: 231) shows a mixed picture.

What is the average overall pattern? Copy the graph and insert the additions needed. Do not be confused by the lines shown – their location has no meaning. Think of allowed and forbidden regions, logical anchor points, approaches to limits, and equality lines. Do not expect that all of these enter in every instance. Both district and representative conservatism can in principle range from 0 to 1 (0 to 100%).


CASE 10. How does support for redistribution affect actual redistribution?

All developed democracies tend to redistribute some income from wealthier to poorer people, through taxation. (Tax havens, indirect taxes and access to policymaking often mean, though, that wealthier people actually pay less in taxes than the poor.) Some people like it, some don’t. One may well presume that the extent of redistribution is higher when more people support it. The graph below (Lupu and Pontusson 2011) shows the extent of redistribution (call it R) plotted against the support for redistribution (S).

What is the average overall pattern? Copy the graph and insert the additions needed. Do not be confused by the box and line shown – they have no meaning or have a questionable meaning. Think of allowed and forbidden regions, logical anchor points, approaches to limits, and equality lines. Do not expect that all of these enter in every instance.

This is a tricky case. It would seem that both S and R could in principle range from 0 to 100%. This is so indeed for support. But redistribution could come close to 100% only if one person owned all the wealth and nearly all of it were redistributed to all the rest. We are not yet there. The graph above suggests that R reaches a ceiling around 32% even when support for it approaches 100%, for the average country. So assume an anchor point at S=100%, R=32%.


CASE 11. How do more parties reduce the largest seat share in government?

As the number of parties increases, those extra parties may well whittle down the shares of the largest parties – shares of their seats in the assembly and also of their ministerial seats, if they form the government. So we might expect a decreasing pattern. The graph below (Electoral Studies 37: 7, 2015) confirms it.

What is the average overall pattern? Copy the graph and insert the additions needed. Do not be confused by the lines shown – they have no meaning or have a questionable meaning. Think of allowed and forbidden regions, logical anchor points, approaches to limits, and equality lines. Do not expect that all of these enter in every instance. Call the effective number of parties N. Call the seat share of the largest party in the government L.

References

Anscombe, Francis J. (1973) Graphs in statistical analysis. The American Statistician 27: 17–21.
Comte, Auguste (18xx) Plan of Scientific Studies Necessary for Reorganization of Society.
Dalton, Russell J. (1988) Citizen Politics in Western Democracies. Chatham, NJ: Chatham House.
Dalton, Russell J. (2006) Citizen Politics. Washington, DC: CQ Press.
Dalton, Russell J. and Shin, Doh Chull (2006) Citizens, Democracy, and Markets Around the Pacific Rim: Congruence Theory and Political Culture. Oxford: Oxford University Press.
Hawking, Stephen (2010) The Grand Design.
Heath, Oliver (2005) Party systems, political cleavages and electoral volatility in India: A state-wise analysis, 1998–1999. Electoral Studies 24: 177–99.
Johnston, Richard, Hagen, Michael G. and Jamieson, Kathleen H. (2004) The 2000 Presidential Election and the Foundations of Party Politics. Cambridge: Cambridge University Press.
Kvålseth, Tarald O. (1985) Cautionary note about R2. The American Statistician 39: 279–85.
Lijphart, Arend (1984) Democracies: Patterns of Majoritarianism and Consensus Government. New Haven, CT: Yale University Press.
Lijphart, Arend (1994) Electoral Systems and Party Systems. Oxford: Oxford University Press.
Lijphart, Arend (1999) Patterns of Democracy: Government Forms and Performance in Thirty-Six Countries. New Haven, CT: Yale University Press.
McCloskey, ….. (2009) The cult of statistical significance.
Mickiewicz, Ellen (1973) Handbook of Soviet Social Science Data. New York: Free Press.
Norris, Pippa, and Inglehart, Ronald (2003) Islamic Culture and Democracy: Testing the “Clash of Civilizations” Thesis, pp. 5–33 in Ronald Inglehart, editor, Human Values and Social Change. Leiden & Boston: Brill.
Rousseau, Jean-Jacques (1762) Le contrat social. English translation by M. Cranston: The Social Contract, London: Penguin Books, 1968.
Stein, James D. (2008) How Math Explains the World. New York, NY: HarperCollins/Smithsonian Books.
Taagepera, Rein (1976) Why the trade/GNP ratio decreases with country size. Social Science Research 5: 385–404.
Taagepera, Rein (1979) People, skills and resources: An interaction model for world population growth. Technological Forecasting and Social Change 13: 13–30.

Taagepera, Rein (1997) Expansion and contraction patterns of large polities: Context for Russia. International Studies Quarterly 41: 475–504.
Taagepera, Rein (1999) The Finno-Ugric republics and the Russian state. London: Hurst.
Taagepera, Rein (2007) Predicting Party Sizes: The Logic of Simple Electoral Systems. Oxford: Oxford University Press.
Taagepera, Rein (2008) Making Social Sciences More Scientific: The Need for Predictive Models. Oxford: Oxford University Press.
Taagepera, Rein (2010) Adding Meaning to Regression. European Political Science 10: 73–85.
Taagepera, Rein (2014) A world population growth model: Interaction with Earth's carrying capacity and technology in limited space. Technological Forecasting and Social Change 82: 34–41.
Taagepera, Rein and Sikk, Allan (2007) Institutional determinants of mean cabinet duration. Early manuscript for Taagepera and Sikk (2010).
Taagepera, Rein and Sikk, Allan (2010) Parsimonious model for predicting mean cabinet duration on the basis of electoral system. Party Politics 16: 261–81.
Taagepera, Rein and Hayes, James P. (1977) How trade/GNP ratio decreases with country size. Social Science Research 6: 108–32.
Thorlakson, Lori (2007) An institutional explanation of party system congruence: Evidence from six federations. European Journal of Political Research 46: 69–95.
