
Equality of Opportunity of . Evidence from a quasi-natural experiment.∗

Sebastian Camarero Garcia †

July 2017, [Corresponding Working Paper soon available]

Abstract

The goal of this paper is to shed light on how equality of opportunity in education (Equality of Educational Opportunity (EEOp) or, respectively, Inequality of Educational Opportunity (IEOp)) may be shaped by the recent trend to accelerate and intensify the educational process. For this purpose, I analyze the impact of a controversial reform in Germany that shortened the duration of secondary school (Gymnasium) by one school year, from 9 to 8 years, while keeping the curriculum unchanged. Both the first student cohort in the new 8-year system and the last one taught for 9 years had to pass the same final university access diploma exams. The sharp, staggered introduction of this reform across the German federal states can thus be exploited as a quasi-experimental setting, which allows estimating the effect of the reform-induced increase in learning intensity on IEOp in a two-step Difference-in-Difference estimation approach (DID). To measure this effect, I take the most recent available German-specific data from the Program for International Student Assessment (PISA) studies 2003, 2006, 2009 and 2012 (PISA-I-2003-2012), which provide comparable measures of cognitive skills in Reading, Mathematics and Sciences for students tested at the end of the 9th grade. Regression findings suggest that increased learning intensity induced by the Gymnasium-8-reform (G-8-reform) did not improve EEOp. In the short term, IEOp appears not to have changed. In the medium term, however, a larger fraction of the variation in test scores can be explained by circumstances beyond the control of a 9th grade student. Thus, the analysis indicates that the reform-induced increase in learning intensity aggravated IEOp, though only after some time, because, for instance, favorable circumstances such as private tuition opportunities may have only materialized into test score differences after a period of adjustment. Moreover, results provide evidence for the existence of subject-dependent curricular flexibilities, with Maths/Sciences being more inflexible and thus more responsive to changing learning intensity than Reading. This paper is therefore one of the first to provide, based on a quasi-experimental setting, causal estimates of how a factor such as learning intensity affects IEOp (and hence also social mobility).

JEL-Classification: D39, D63, I24, I29, O52

Keywords: Equality of Opportunity, Education & Inequality, Learning Intensity, German School System

∗ I would like to thank my supervisor Andreas Peichl. Moreover, I would like to thank Felix Chopra, Cung Truong Hoang, Kilian Huber, Paul Hufe, Stephen Kastroyano, Panos Mavrokonstantis, Tim Obermeier, Federico Rossi, and David Schönholzer as well as the participants at the Public Economics and the CDSE Seminar at the ZEW/University of Mannheim for helpful suggestions and discussions. The usual disclaimer applies.

† PhD-Candidate at CDSE - University of Mannheim; E-Mail: [email protected]

List of Abbreviations

CTT common time trend.

DID Difference-in-Difference estimation approach.

EEOp Equality of Educational Opportunity.
EOp Equality of Opportunity.
ESCS PISA index of economic, social and cultural status.

FC Family Characteristics.
FE fixed effect.

G-8-model Gymnasium-8-model.
G-8-reform Gymnasium-8-reform.
G-9-model Gymnasium-9-model.
GDR German Democratic Republic (1949-1990), which consisted of the following present-day German federal states: Brandenburg (BB), East Berlin (BE), Mecklenburg-Western Pomerania (MWP), Saxony (S), Saxony-Anhalt (ST), Thuringia (TH).

IC Individual Characteristics.
IEOp Inequality of Educational Opportunity.
IOL absolute measure of Inequality of Opportunity.
IOp Inequality of Opportunity.
IOR relative measure of Inequality of Opportunity.
IQB Institut zur Qualitätsentwicklung im Bildungswesen (Institute for Educational Quality Improvement).
ISCED International Standard Classification of Education.
ISCO International Standard Classification of Occupation.
ISEI International Socio-Economic Index of Occupational Status.

OECD Organisation for Economic Co-operation and Development.
OLS Ordinary Least Squares.

PC Parental Characteristics.
PISA Program for International Student Assessment.

SC Standing Conference of the Ministers of Education and Cultural Affairs of the Länder in the Federal Republic of Germany.
SES socio-economic status.

1 Introduction

In an era of relatively high income and wealth inequality compared to the post-war decades in most Western countries (Piketty and Zucman, 2014), many people feel anger and discomfort about the economic system and their prospects. In democratic, market-oriented societies, the general belief that everyone has a fair chance to climb the social ladder by working and studying hard has been central to maintaining the cohesion and stability of the political and economic system. Many analysts therefore suggest that rising political polarization within most Western countries (e.g. Brexit, Trump’s election) may be explained by an increase in the number of both citizens who fear that their children may be worse off in the future (fear of downward mobility) and groups in society who think that the “game is rigged” (fear of a lack of upward mobility). In other words, social mobility¹ is becoming an increasingly important issue within the broader effort to understand the drivers of, and find answers to, recent trends in inequality. Knowing more about the extent of Inequality of Opportunity (IOp), in particular regarding education (IEOp)², is thus important, because IOp appears to be one central margin that influences social mobility (Chetty, Friedman, Saez, Turner, and Yagan, 2017) and education is said to be the main vehicle for upward mobility (Woessmann, Lergetporer, Kugler, and Werner, 2014). Focusing on Germany, two aspects illustrate these points. First, with the wealth distribution being quite unequal, as 10% of society own 60% of total wealth while the bottom 20% own nothing or are indebted, social justice and thus EOp have become central in the political debate (The Economist, 2016). Second, as in other industrial countries, the level of social mobility observed in the postwar decades has declined since the 1980s. For instance, the Organisation for Economic Co-operation and Development (OECD) has repeatedly shown that Germany is among the least upward-mobile countries, with educational success being highly dependent on a student’s parental education background (compare OECD (2013b) and Figure A.1). Thus, the importance of the notion of EEOp becomes clear.

In this paper, I adopt the canonical interpretation as illustrated by Ferreira and Peragine (2015): a society has achieved equality of opportunity if what individuals achieve with respect to some desirable objective is fully determined by their choices and personal efforts, rather than by circumstances that are beyond an individual’s control. Circumstances are all the factors an individual cannot control but that affect her outcome, while effort encompasses the choices an individual makes (e.g. how hard to study). According to the EOp normative criterion, inequality due to unequal circumstances is unfair because it is due to factors outside of individual control, whereas inequality due to unequal efforts is morally acceptable.

One way of leveling the playing field³ is to offer everyone access to (at least) education. However, to what extent do individuals have an equal opportunity to achieve good educational results? In times of public spending constraints, accelerating growth of scientific knowledge and economic competition, the debate on educational policies among OECD countries has become more output-oriented. In fact, public attention has shifted to the key issue of how to make a country’s educational system more effective. Therefore, in the economics of education, many aspects of both schooling quantity and quality have been investigated regarding their impact on cognitive skill formation and earnings.

¹ For an overview of the literature on social mobility, I refer to Fields and Ok (1999).

² IOp and Equality of Opportunity (EOp) refer to the same concept, putting emphasis on either what may be considered the unfair part within the distribution of opportunities (IOp) or the fair part (EOp). Thus, if opportunities depend less on factors beyond the control of an individual and more on efforts, one may state either that EOp has increased or that IOp has decreased. In the following, I will try to use both terminologies in a manner that eases the interpretation of results. Finally, instead of EOp in education, I use, similarly to Brunori, Peragine, and Serlenga (2012), the expression EEOp and, vice versa, IEOp for IOp in education.

³ The expression “leveling the field” within the context of EOp was first used by Roemer (Ferreira and Peragine, 2015).

To a much lesser extent, however, learning intensity, a factor combining characteristics of schooling quality and quantity, has been analyzed (e.g. Büttner and Thomsen (2015); Marcotte (2007); Pischke (2007)). With respect to schooling, learning intensity can be defined as the ratio of the curricular content covered to the instructional time available for it. From a social welfare perspective, it is interesting to reveal the effects of increasing learning intensity on both academic achievements and EEOp. For instance, understanding how more intense education influences the formation of skills may contribute to improving the efficiency of educational systems. Pareto-improvements may thus be achieved if learning intensity turned out to be an instrument resolving the trade-off between educational spending and output, i.e. academic merits.

Following the description of these two relevant and recent trends, the question of their interconnectedness arises: How does variation in the intensity of education affect social mobility? In this paper, however, I would like to draw the reader’s attention to the narrower question of how variation in learning intensity may affect IEOp, which is part of understanding this big-picture question.

In order to approach an answer to this specific research question, I focus on a reform in Germany that changed learning intensity. Between 2001 and 2008, the German federal states gradually, at different points in time, shortened the duration of secondary school in Gymnasium from nine to eight years (the so-called “G-8-reform”). While schooling duration was reduced, the curriculum was kept unchanged for the first affected (treated) cohorts, who thus experienced a considerable increase in learning intensity: in each state, the first treated cohort had only 8 years of schooling, in contrast to the 9 years of the preceding cohort that had entered secondary school just one year earlier. As both groups were to take the same university access diploma exams in the same final year, treated students had less time in total for homework or repetition and had to learn more material per school year (i.e. learning intensity increased). The sharp, staggered introduction of the reform across federal states therefore provides a quasi-experimental setting. This allows addressing common empirical deficiencies by applying a DID framework in order to estimate the effect of the reform-induced increase in learning intensity on IEOp.
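As a rough worked illustration of this increase (a sketch under simplifying assumptions, not a figure taken from the reform itself): let $K$ denote the unchanged curricular content and measure instructional time in school years. Both are notational assumptions introduced here, and any change in weekly instruction hours is ignored.

```latex
\[
\ell_{G9} = \frac{K}{9}, \qquad
\ell_{G8} = \frac{K}{8}, \qquad
\frac{\ell_{G8}}{\ell_{G9}} = \frac{9}{8} = 1.125 .
\]
```

Under these assumptions, treated cohorts would have to cover about 12.5% more material per school year than the last G-9 cohort.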

Thus, this paper aims to find first answers to questions such as: Is it possible to reduce the amount of schooling by increasing learning intensity without affecting EEOp? Does the reform effect vary over time as students and their environment adjust to the reform? For this purpose, I use the German-specific PISA data, which provide a representative sample of students in the 9th grade with standardized test scores in Reading, Mathematics and Sciences. Based on the associated rich set of background variables, I classify relevant circumstances similarly to Ferreira and Gignoux (2014). Furthermore, surveys show that Germans remain split on the question of whether Gymnasium should last 8 or 9 years (Woessmann, Lergetporer, Kugler, Oestreich, and Werner, 2015). In Eastern Germany, a majority supports shortened school duration, whereas the opposite is true across the Western federal states, which only recently adopted the new system. Thus, I contribute some evidence to evaluate the controversial reform. Finally, knowing how learning intensity may causally affect IEOp provides policy implications for the design of curricula, taking into account the effects on both cognitive skill formation and EEOp. Such evidence is also needed to integrate the factor of intensity into the human capital literature.

The remainder of this paper is organized as follows. Section 2 introduces how IOp is measured and defined for the purpose of this study. Section 3 provides an overview of the related literature. Section 4 illustrates the relevant institutional background and the G-8-reform on which the quasi-experimental identification strategy is based. Section 5 then discusses the data used. Section 6 presents the empirical strategy, whose results, along with robustness checks, are provided in section 7. Finally, section 8 concludes.

2 Measuring Inequality of Opportunity in Education (IEOp)

The idea that societies should distribute opportunities equally has a long tradition within political philosophy. It gained significant attention in the philosophical discourse after Rawls’ seminal contribution (Rawls, 1971) and the fruitful discussion that ensued (Sen (1980), Dworkin (1981a, 1981b), Arneson (1989), Cohen (1989)). Most importantly, this debate established the idea that a prerequisite for measuring EOp (or IOp) is distinguishing whether a form of inequality is morally acceptable within a society.⁴ Consequently, by the end of the 20th century, research in economic distributional analysis started to shift attention from outcomes to opportunities. Defining equality of opportunity as the objective (e.g. in a welfare analysis) allows taking into account that, on the one hand, people tend to accept outcome differences due to individual responsibility (efforts) given identical circumstances (the reward principle), while, on the other hand, they tend to consider compensation for differences that can be attributed only to circumstances beyond an individual’s control to be fair (the compensation principle). Regarding compensation, the literature distinguishes between an ex-ante principle (prior to the determination of the effort level) and an ex-post principle (after the determination of the effort level).⁵ Moreover, Lefranc and Trannoy (2016) have illustrated how luck may be incorporated as an intermediary category ("residual luck") between circumstances and efforts.

However, these ideas only started to capture the broader attention of economists in the 1990s, when scholars, Roemer (1998) in particular, started to translate these philosophical concepts into a more formal economic framework, establishing a kind of canonical approach to measuring EOp in practice. An empirical literature on measuring EOp followed, and several estimation methods have been proposed; recent surveys by Ramos and Van de gaer (2015) and Roemer and Trannoy (2015) examine both direct and indirect measurement approaches. For instance, following the indirect, ex-ante approach, many studies implement a parametric method proposed by Ferreira and Gignoux (2011) or Björklund, Jäntti, and Roemer (2012). Other approaches, including non-parametric estimation techniques (Checchi and Peragine, 2010), norm-based measures (Almås, Cappelen, Lind, Sørensen, and Tungodden, 2011) and stochastic dominance criteria (Lefranc, Pistolesi, and Trannoy, 2008), have also been used.⁶ In the following, I define and explain the approach taken in this paper and how EEOp or IEOp should be understood in this context.⁷ To begin with, following the "canonical" model as formalized e.g. by Roemer (1998), it is useful to lay down a set of definitions to understand the concept of EOp in education, i.e. EEOp (compare e.g. Ferreira and Peragine (2015)):

• advantage: An advantage denotes an individual achievement (usually income, but in the context of this paper, it corresponds to educational outcomes as measured by PISA-test scores).

• efforts: The vector of efforts, E, denotes the set of variables that influence the outcome variable (advantage) and over which the student has some sort of control (e.g. choice of time for studying).

• circumstances: The vector of circumstances, C, denotes the set of individual characteristics which are beyond the student’s control, for which one cannot be held responsible, e.g. socio-economic status (SES) of the household you are born into, gender, ethnicity or innate ability/talents etc.

⁴ There is, for instance, strong experimental evidence that people distinguish between acceptable (fair) and unacceptable (unfair) income inequality (see e.g. Cappelen, Sørensen, and Tungodden (2010)). Inequality appears to be acceptable if differences are due to individual responsibility (efforts), but not if they are due to luck or randomness (circumstances).

⁵ More details on the philosophical background and evolution of the EOp theory can be found in Ferreira and Peragine (2015).

⁶ A comprehensive overview is given by Ramos and Van de gaer (2015), chapter 3.

⁷ For a broad and detailed overview of the main EOp measurement methods, I refer to chapter 4.10 in Roemer and Trannoy (2015), to chapters 3 and 4 in Ferreira and Peragine (2015) as well as to the survey conducted by Ramos and Van de gaer (2015).

Given these definitions, we can formulate a conceptual framework illustrating how this paper approaches the measurement of EEOp (compare Ferreira and Gignoux (2011) and Ferreira and Gignoux (2014)). Consider a sample of $N$ students indexed by $i \in \{1, \dots, N\}$. Each student $i$ can be described by a set of attributes $\{y, C_n, E_m\}$, where $y$ denotes an advantage (here test scores), $C_n$ is a vector of $n$ discrete⁸ circumstances and $E_m$ denotes the vector of $m$ discrete efforts. Thus, we can represent the population by an $(n \times m)$ matrix $[Y_{nm}]$ with typical element $y_{nm} = g(C_n, E_m)$, $C \in \Omega$, $E \in \Theta$, where $g: \Omega \times \Theta \to \mathbb{R}$ is the advantage as a function of both circumstances and efforts. After one has agreed on which variables constitute the circumstance vector $C_i$ for each student $i$, a choice that is always open to debate (Gamboa and Waltenberg, 2012), one can split the sample into $n$ distinct groups of students who share the same circumstances (i.e. they are of the same type). At the same time, the sample can be split into $m$ distinct groups of students who invest the same level of effort but may have different circumstances (i.e. they belong to the same tranche). Together, type and tranche define a cell, the typical element of the population matrix.
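The partition into types, tranches and cells can be sketched in a few lines of Python. The records, the field names (`score`, `type`, `tranche`) and the helper `group_means` are hypothetical and serve only as illustration; in particular, effort is observed in this toy data, which is generally not the case in PISA.

```python
from collections import defaultdict

# Hypothetical toy sample: one record per student (assumed data, not PISA).
students = [
    {"score": 540.0, "type": "high_ses", "tranche": "high_effort"},
    {"score": 500.0, "type": "high_ses", "tranche": "low_effort"},
    {"score": 520.0, "type": "low_ses", "tranche": "high_effort"},
    {"score": 460.0, "type": "low_ses", "tranche": "low_effort"},
    {"score": 480.0, "type": "low_ses", "tranche": "low_effort"},
]


def group_means(records, key):
    """Mean score within each group defined by key(record)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        k = key(rec)
        totals[k] += rec["score"]
        counts[k] += 1
    return {k: totals[k] / counts[k] for k in totals}


# Cells: one entry per (type, tranche) pair, i.e. the elements of [Y_nm].
cells = group_means(students, key=lambda r: (r["type"], r["tranche"]))

# Types: students sharing the same circumstances, whatever their effort.
types = group_means(students, key=lambda r: r["type"])

# Tranches: students exerting the same effort, whatever their circumstances.
tranches = group_means(students, key=lambda r: r["tranche"])
```

Differences between the entries of `types` correspond to between-type inequality, while differences across types within one entry of `tranches` correspond to within-tranche inequality.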

Given the framework, one distinguishes between direct and indirect measurement approaches.⁹ Because direct approaches aim to model the opportunity sets explicitly, their implementation has been difficult: opportunities are not directly observable. Instead, indirect approaches, which measure EOp based on the observed joint distributions of outcomes and circumstances, dominate the empirical literature (see section 3.1). Among these, one distinguishes between an ex-ante and an ex-post approach, which differ in how EOp is evaluated and thus in which normative welfare criterion is chosen (for an overview see Fleurbaey, Peragine, and Ramos (2015)). Before effort is realized (ex-ante), following van de Gaer’s "min of means" criterion, EOp is achieved by equalizing mean outcomes across types, i.e. IOp is measured as between-type inequality, satisfying the ex-ante compensation and reward principles. After effort is realized (ex-post), following Roemer’s "mean of mins" criterion, EOp is achieved by eliminating inequality within tranches, satisfying ex-post compensation. In the education context of this paper, inequality in scores is thus considered fair, i.e. due only to efforts, if a student of a given type obtains a higher score than another student of the same type. However, a similar degree of effort exerted by students facing different circumstances (i.e. students of the same tranche) should give rise to similar outcomes; otherwise such inequality is deemed unfair. Intuitively, the concept of EEOp (respectively IEOp) can be translated as follows: assuming talents to be distributed normally across the whole population, students who work harder, i.e. put in more effort, should be rewarded by achieving good educational results regardless of their specific circumstances. Unfair IEOp thus corresponds to differences in educational achievements between students who put in the same effort and differ only in terms of their circumstances.¹⁰ In contrast, differences in educational achievements that can be attributed to individual efforts are acceptable (reward principle).¹¹ Therefore, IEOp captures differences between students that can be traced only to circumstances beyond their control.

In general, deriving a measure of IOp involves two steps: an Estimation Phase, which transforms the original distribution $[Y_{nm}]$ into a smoothed one $[\tilde{Y}_{nm}]$ reflecting only the unfair inequality in $[Y_{nm}]$, and a Measurement Phase, which applies a measure of inequality to the smoothed distribution.¹²

⁸ Note that this economic model could also be extended to the case of continuous elements in the vectors of circumstances and/or efforts (Ferreira and Peragine, 2015). However, in this paper circumstances will be discrete.

⁹ For the less often used norm-based approach, compare Ramos and Van de gaer (2015).

¹⁰ Furthermore, as Ferreira and Gignoux (2014) mention, there is also the argument that the allocation of scarce resources such as investments into a student’s education is only efficient if it is decided upon a student’s talent, and not on her circumstances.

¹¹ This is consistent with the notion of fairness in a so-called responsibility-sensitive egalitarian perspective (Brunori et al., 2012; Checchi and Peragine, 2010). Compare also with the fairness principles.

¹² For the first step, basically two estimation approaches could be taken. However, Fleurbaey and Peragine (2013) show that ex-post and ex-ante compensation are incompatible. But if effort is distributed independently of circumstances, ex-post and ex-ante EOp are equivalent (see proposition II in Ramos and Van de gaer (2015)).

Following the literature, I adopt an ex-ante, "between-types inequality" measurement approach of IOp, which is in line with the indirect approach (Ferreira and Peragine, 2015) because it is based only on the observed marginal distribution of advantages (test scores), given by the vector $y = \{y_1, \dots, y_N\}$, and on the joint distribution of advantages and circumstances over the sample population, $\{y, C_n\}$.¹³ In that regard, I follow the measurement approach of Ferreira and Gignoux (2014), because it requires fewer assumptions (e.g. on how to form tranches without direct observation of efforts). Moreover, given the high requirements for sample size and data availability, applying a non-parametric approach to conduct a within-tranche inequality decomposition (Checchi and Peragine, 2010) would hardly be feasible:¹⁴ the more finely one tries to design the partition, the smaller the cells may become, leading to bias in the measure of IEOp. Consequently, this paper adopts a parametric, ex-ante estimation approach to derive EEOp measures. Test scores, denoted by $y$, are a function of circumstances and efforts (denoted by $C$ and $E$ respectively), so I model scores as $y = f(C, E)$. Efforts, however, can also depend on circumstances, i.e. $E = E(C)$, which implies $y = f(C, E(C))$, whereas efforts cannot change circumstances. For instance, unobserved innate ability is taken into account in this framework as an unobserved circumstance that may influence test scores directly through cognitive skills, but also indirectly via its impact on work ethic and other effort characteristics. Such efforts, however, cannot change other relevant circumstances, such as gender or parental education. Moreover, as the PISA data evaluate students in the 9th grade, the individuals involved are on average about 15 years old. Therefore, they may be regarded as (if at all) only partially accountable for their choices, as argued by Gamboa and Waltenberg (2012) or Hufe, Peichl, Roemer, and Ungerer (2015). In summary, this model for measuring IEOp takes the role of circumstances, efforts and their interplay into account.

Following Ferreira and Gignoux (2014), a linear functional form is used, and I model the process as follows:

$$y_i = C_i'\beta + E_i'\gamma + e_i \qquad (1)$$

with

$$E_i = C_i'\delta + u_i \qquad (2)$$

Ci is a vector capturing circumstances variables and Ei is the unobserved vector of m efforts per student i. However, the aim being to estimate the full effect of circumstances on scores, i.e. both the direct and indirect effect on scores (via their impact on efforts), I will estimate the reduced form model:

$$y_i = C_i'(\beta + \gamma\delta) + (e_i + u_i'\gamma) \qquad (3)$$

i.e.

$$y_i = C_i'\rho + z_i, \quad \text{where } \rho = (\beta + \gamma\delta) \text{ and } z_i = (e_i + u_i'\gamma) \qquad (4)$$

The residual, zi, will include both unobserved efforts and unobserved circumstances. With the aim at this point being to estimate the mean score outcome of each type conditional on circumstances, one proceeds with:

$$\hat{y}_i = C_i'\hat{\rho} \qquad (5)$$

¹³ In general, this approach may include both elements from a parametric approach, by relying on a linear model of advantages as functions of circumstances/efforts (Bourguignon, Ferreira, and Menéndez, 2007), and from a non-parametric approach, e.g. the between-types inequality decomposition according to Checchi and Peragine (2010).

¹⁴ This approach involves basically four steps. First, one defines the advantage variable. Second, one has to choose which variables to consider to form types and tranches, and consequently the respective cells. Third, assuming that ideally the within-type distribution should be the same, one can remove the within-cell score inequality to get a smoothed distribution of scores. Fourth, one computes total inequality in the smoothed distribution in order to decompose it into a fair and an unfair part.

This creates a new, simulated distribution of scores, $\hat{y} = \{\hat{y}_1, \dots, \hat{y}_N\}$, with one value for each individual student. Thus, every $i$ is assigned the value of her opportunity set (which in a linear regression corresponds to the expected score conditional on circumstances). This linear model can be estimated by an Ordinary Least Squares (OLS) regression, which provides the vector of predicted test scores (i.e. the smoothed distribution).

Having assigned to each individual the value of their opportunity set, the second step, the Measurement Phase involves then calculating inequality in this new distribution, using a particular inequality index, I(.). To estimate IEOp, one would estimate the following ratio:

$$\hat{\theta}_{IEOp} = \frac{I(\hat{y}_i)}{I(y_i)} = \frac{I(C_i'\hat{\rho})}{I(y_i)} \qquad (6)$$

i.e. the ratio between inequality due to circumstances (the simulated distribution) and total inequality (the actual distribution of scores). Thus, instead of an absolute measure of Inequality of Opportunity (IOL), this paper uses a relative measure of Inequality of Opportunity (IOR) of IEOp. The remaining issue is which inequality index $I(\cdot)$ to use. The literature on EOp in income has used the Mean Log Deviation (MLD) index due to its desirable properties (path independence in particular) (Ferreira and Gignoux, 2011). For the reasons outlined in Ferreira and Gignoux (2014), however, the MLD is not appropriate for measuring inequality in PISA test score data, since it is not ordinally invariant to the standardization of the PISA test scores.¹⁵ Instead, these authors show that the most appropriate inequality index for IEOp is the variance. Being an absolute measure of inequality, it is ordinally invariant to the test score standardization, and it satisfies the most important axioms required of a meaningful inequality measure, i.e. (i) symmetry, (ii) continuity and (iii) the transfer principle (see section II in Ferreira and Gignoux (2014)). Overall, the variance thus satisfies the requirements for the proposed IEOp measure. Hence, inequality of opportunity in education (IEOp) can simply be calculated as:

$$\hat{\theta}_{IEOp} = \frac{\mathrm{variance}(\hat{y})}{\mathrm{variance}(y)} \qquad (7)$$

This measure is attractive for various reasons. Firstly, it is simply the $R^2$ from an OLS regression of test scores on the circumstance variables $C$ (compare equation (4)). The only caveat is that this model does not estimate causal effects of individual circumstances: individual elements of $\hat{\rho}$ may be biased due to omitted variables, and one should not interpret them as the causal effect of particular circumstances on test scores. Secondly, as shown in Ferreira and Gignoux (2011), the $R^2$ nevertheless yields a meaningful summary statistic, the lower bound of the true IEOp. If one is interested in the total joint effect of all circumstances on educational outcomes as measured by test scores, the object of interest is the percentage of the variation in scores $y$ that is causally explained by the overall effect of circumstances (directly and indirectly via efforts). With efforts being treated as generally unobserved, omitted circumstance variables, if observed, could only lead to a finer partitioning of $[Y_{nm}]$, which would further increase the IEOp measure (for more details, I refer to Hufe and Peichl (2015)). Therefore, the $R^2$ of equation (4), i.e. $\hat{\theta}_{IEOp}$ in equation (7), is a valid lower-bound estimate of the joint effect of all circumstances on educational achievements. In other words, it is the lower bound of the share of overall inequality in educational achievement that can be explained by predetermined circumstances, thus constituting a lower-bound estimate of ex-ante IEOp.¹⁶
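The two phases can be illustrated with a minimal Python sketch. It assumes a fully saturated specification (one dummy per type), in which case the OLS fitted values are simply the type means and the $R^2$ equals the between-type share of the variance; this is a simplification of the linear model in equation (4), and the scores, type labels and the function name `ieop_share` are hypothetical.

```python
from statistics import mean, pvariance


def ieop_share(scores, types):
    """Relative IEOp measure: variance(y_hat) / variance(y).

    Assumes a saturated model in type dummies, so that the fitted
    value of each student is the mean score of her type.
    """
    by_type = {}
    for y, t in zip(scores, types):
        by_type.setdefault(t, []).append(y)
    type_mean = {t: mean(ys) for t, ys in by_type.items()}
    # Estimation phase: the smoothed distribution of scores.
    y_hat = [type_mean[t] for t in types]
    # Measurement phase: ratio of variances.
    return pvariance(y_hat) / pvariance(scores)


# Hypothetical toy data (not PISA): five students in two types.
scores = [540.0, 500.0, 520.0, 460.0, 480.0]
types = ["A", "A", "B", "B", "B"]
theta = ieop_share(scores, types)

# Standardising the scores to mean 500 and standard deviation 100 is an
# affine transformation and should leave the ratio unchanged.
mu, sd = mean(scores), pvariance(scores) ** 0.5
standardised = [500.0 + 100.0 * (y - mu) / sd for y in scores]
theta_standardised = ieop_share(standardised, types)
```

Because the measure is a ratio of variances, rescaling or shifting the scores changes numerator and denominator by the same factor, which is why `theta` and `theta_standardised` coincide in this sketch.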

¹⁵ Moreover, the authors show that no meaningful inequality index can generate cardinally identical measures for pre-/post-standardization distributions of identical outcome variables, failing either scale or translation invariance.

¹⁶ Niehues and Peichl (2014) outline how an upper bound can be estimated in order to find boundaries for IOp estimates, though this method has not yet been widely applied because of its data requirements (e.g. the need for panel data).

Thirdly, $\hat{\theta}_{IEOp}$ is an IOR of IEOp that is cardinally invariant to the standardization of test scores (Ferreira and Gignoux, 2014). Moreover, one can decompose the IEOp measure into components for each individual variable in the circumstance vector, which is similar to conducting a Shapley-Shorrocks decomposition.

Finally, as Ferreira and Gignoux (2014) note, $\hat{\theta}_{IEOp}$ can be regarded as isomorphic to measuring intergenerational persistence of IEOp. For the latter, following Galton, one usually conducts a regression of child’s outcomes ($y_{it}$) on parental outcomes ($y_{i,t-1}$):

$$y_{it} = \beta y_{i,t-1} + \varepsilon_{it}, \qquad (8)$$

with $\beta$ as a measure of persistence. If one used family background variables instead of parental outcome variables for $y_{i,t-1}$, then the $R^2$ measure of immobility (equation (8)) would be similar to $\hat{\theta}_{IEOp}$ (equation (7)), as long as the circumstances vector contains mostly family background variables. In this regard, one may interpret $\hat{\theta}_{IEOp}$ to be closely connected to measures of intergenerational educational immobility.

To analyze the effect of increased learning intensity due to the so-called “G-8-reform” on IEOp, this paper applies a DID strategy using $\hat{\theta}_{IEOp}$ (equation (7)) as the outcome variable. The respective empirical strategy is outlined in section 6. Before that, I briefly review the related literature.
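To make the idea of using the IEOp measure as a DID outcome concrete, the following Python sketch computes the simplest two-group, two-period difference-in-differences. It is only an illustration under strong assumptions: the numbers are invented, the function name `did_2x2` is hypothetical, and the specification actually used in this paper (section 6) exploits the staggered introduction across states and may differ substantially.

```python
def did_2x2(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences of a group-level outcome.

    Change in the treated group minus change in the control group;
    under a common time trend this isolates the reform effect.
    """
    return (treated_post - treated_pre) - (control_post - control_pre)


# Invented IEOp shares (hypothetical, NOT estimates from this paper).
theta = {
    ("treated", "pre"): 0.20,
    ("treated", "post"): 0.26,
    ("control", "pre"): 0.22,
    ("control", "post"): 0.23,
}

effect = did_2x2(
    theta[("treated", "pre")],
    theta[("treated", "post")],
    theta[("control", "pre")],
    theta[("control", "post")],
)
```

In this invented example, the share of score variance explained by circumstances rises by five percentage points more in the treated group than in the control group.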

3 Literature Review

This section first provides a brief overview of the empirical literature estimating Equality of Opportunity (EOp) (or inequality of opportunity (IOp)). In particular, it shows how this paper contributes to the still limited branch of the literature working on EOp with respect to educational outcomes (i.e. EEOp). Second, I illustrate how the quasi-natural experiment exploited in this paper has been studied so far and how this relates, on a broader level within the economics of education, to what we know about the impact of changing learning duration/intensity at school on outcomes such as educational achievements. Thus, the scope of what can be analyzed by relying on this reform should become more evident.

3.1 Equality of Opportunity (EOp) literature

So far, most studies in the EOp literature have focused on estimating inequality of opportunity (IOp) with respect to economic well-being as measured, for instance, by labor earnings, per-household income or consumption [Ferreira and Gignoux (2011), Checchi and Peragine (2010), Björklund et al. (2012), Almås et al. (2011), Bourguignon et al. (2007), etc.]. In a survey of comparable cross-country studies, Ferreira and Peragine (2015) illustrate that estimates of the share of overall income inequality due to IOp vary significantly (from 2% in Norway to 34% in Guatemala).17 However, in their survey of empirical work on measuring IOp of income, Roemer and Trannoy (2015) list common patterns that appear to be robust despite differences in the datasets and methods used in its estimation. Furthermore, as shown by Lefranc et al. (2008), the correlation of IOp and inequality of outcomes is high. Similarly, the intergenerational income elasticity and the Gini coefficient of incomes have been shown to be highly correlated (Great Gatsby Curve), indicating a link between IEOp and intergenerational social mobility (compare equation (8)) (Brunori, Ferreira, and Peragine, 2013).

17 This variation can mainly be attributed to the fact that parametric estimation procedures yield a lower bound for the true magnitude of IOp (Ferreira and Gignoux, 2011) and are sensitive to the set of circumstances used. Moreover, assuming that choices made during childhood, i.e. before an age of consent (e.g. 16), are beyond an individual’s control, Hufe et al. (2015) show that lower-bound estimates of IOp in incomes may be even higher, e.g. up to 45% in the US.

Concerning the relationship between EOp and economic growth, Marrero and Rodríguez (2013) explain how to incorporate the notion of IOp into macroeconomic studies. For the US, they find evidence for a negative correlation between IOp and economic growth, whereas growth and inequality of efforts are positively correlated. Some other papers have examined health outcomes (see Fleurbaey and Schokkaert (2009), Rosa Dias (2014)). They propose that EOp with respect to health outcomes implies distinguishing between legitimate (fair) inequality in health (e.g. due to lifestyle decisions such as smoking) and illegitimate (unfair) inequality in health (e.g. due to SES). Hufe and Peichl (2016) have extended the investigation of EOp to political participation, suggesting that for the US the magnitude of IOp in political participation might be higher than IOp in incomes.18

So far, only a small literature has focused on measuring EOp for educational outcomes, i.e. EEOp. For this purpose, in particular to have comparability across countries and over time, the OECD PISA-test scores have recently become one educational outcome variable in this context. Most studies only focus on measuring EEOp for developing countries (e.g. Gamboa and Waltenberg(2012)).

The evidence for developed countries is still limited and often only part of cross-country comparisons. For instance, Ferreira and Gignoux(2014) use the 2006 PISA data to investigate EEOp across 57 countries. They find varying degrees of IEOp, both across countries and all three test-subjects within each country. For Germany, they find that about 35% of inequalities in test scores are unfair (36.8% for reading, 35.1% for maths, 35.2% for science). Furthermore, these authors show that IEOp is negatively correlated with spending for primary schooling, but positively with having a system of tracking students into secondary school. In that regard, Oppedisano and Turati(2015) focus on European countries and based on the PISA-2000 and 2006 data, they evaluate how IEOp changed between those years. However, they only use reading scores as outcome variable and calculate concentration indices to measure IEOp. For Germany, this index declined from 2000 to 2006, as it did for Spain, but not for France or Italy. Finally, conducting an Oaxaca decomposition, the authors suggest that between-school variance is more important in Germany than for other countries.19 The importance of family background variables on educational achievements is also shown by Carneiro(2008). Using Portuguese PISA-2000 data, the author finds that a student’s own as well as her peer-group’s parental education contribute most to the observed inequality in test scores, which reaches up to 40% (IEOp-measure). 
Moreover, the persistence of educational status seems to be a channel translating into inequality in wages.20 But similar to Oppedisano and Turati (2015), both authors admit that their framework cannot provide a clear indication of the underlying drivers, but emphasize that parental education is an important circumstance factor.21 Raitano and Vona (2016) also use PISA data, for the year 2012; jointly taking into account country-level policies, school-level policies and peer effects, they analyze the relationship between the socioeconomic gradient (or EEOp) and the characteristics of various OECD countries’ educational systems.22 They show that grouping students according to their abilities amplifies family background effects (FBEs), whereas putting together students with different SES in the same school reduces the influence of parental background on test scores.

18 It appears to be prevalent regarding contacts with officials, monetary contributions to campaigns or membership in organizations.
19 However, taking the full PISA sample of 15-year-olds, they analyze German students from different school tracks, which would partly explain the observed between-school variation. An overview of the German school system is provided in section 4.
20 Intergenerational persistence of educational status is closely linked to the concept of a socioeconomic gradient, i.e. the measurement of the likelihood of achieving certain educational outcomes given the parents’ educational background.
21 Nonetheless, Oppedisano and Turati (2015) suggest that their results provide evidence that decentralized schooling systems (as in Germany or Spain) may be beneficial in reducing IEOp in contrast to more centralized systems (France or Italy).
22 Raitano and Vona (2016) conduct three estimations with PISA test scores as outcome variable, regressed on individual control variables, country-level policies and the individual SES (or family background effect (FBE)): first, they include the interaction of country-level policies and FBE; second, they include school-level policies and their interaction with FBE; and finally, they include the interaction of FBE and peer variables. They use the PISA index of economic, social and cultural status (ESCS) as FBE variable. For instance, they confirm that postponing the tracking age for students may reduce the socio-economic gradient; however, this effect diminishes when taking into account both school sorting policies and social environment.

In this paper, I also use the PISA data to measure EEOp. But focusing on Germany and the academic secondary school track, I exploit a quasi-experimental setting with the aim of deriving some evidence on how the reform policy changed IEOp. So far, only very few papers have exploited some kind of reform to better isolate the effect of educational policies on EEOp. For instance, Figueroa and Van de gaer (2015) evaluate a social insurance program in Mexico, focusing not only on its effect on school enrollment, but also on its effect on inequality in this educational outcome variable. They provide a simple test of whether a program or intervention improves IEOp or not. This means evaluating whether the expected outcome conditional on circumstances changes due to treatment, as this allows classifying policies as equalizing or not.23 Bratti, Checchi, and de Blasio (2008), in turn, study an Italian policy of the 1990s that expanded higher education (HE) by allowing educational institutions to open new sites and offer a broader range of degrees. They find that HE expansion had a significant, positive impact on university enrollment, but not on actually completing a degree. Therefore, the reform only slightly reduced IEOp, as graduating from university remained dependent on family SES. In a related study on the Italian tertiary education system, Brunori et al. (2012) analyze the impact of a reform in 2001 that established two-year master’s degree programs. In particular, they look at how the associated reduction in the time needed to obtain a first-level degree affected EEOp. Using different measurement approaches, they consistently find that IEOp improved between 1998 and the reform year 2001. However, it is not clear whether the improvement in access to tertiary education represents a lasting effect, as results after 2001 are mixed. I conduct a similar analysis, but focus on a school reform.
In that regard, Edmark, Frölich, and Wondratschek (2014) investigate whether a 1992 Swedish school reform that considerably improved the possibilities to choose which school to attend24 may have had a heterogeneous impact on students depending on their SES. Exploiting the reform setting and conducting a DID, the authors find no evidence for the existence of differential treatment effects. The overall estimated effects of the reform on students’ outcomes, including long-term labor market outcome variables, are small. Thus, the Swedish school reform analysis suggests that EEOp was not affected. I also analyze a school reform, but my focus is on Germany. In that regard, Riphahn and Trübswetter (2013) study educational mobility for East and West Germany after reunification based on the German Mikrozensus (1991-2004). They provide evidence rejecting the hypothesis that educational mobility was initially higher in East compared to West Germany (as the socialist legacy may have originally suggested).25 More generally, Riphahn and Trübswetter (2013) reconfirm the importance of intergenerational persistence in educational achievements and show that after reunification the secondary school system in Germany did not improve regarding EEOp.26 In this paper, first, I also try to add evidence about how EEOp changed over time in a developed country, Germany. The focus will be on secondary school, in particular on the academic track (Gymnasium). To my knowledge, this paper is among the first to provide an evaluation of EEOp combining the use of comparable PISA test score data with the virtues of a quasi-experimental setting to detect causal effects. Finally, as Ramos and Van de gaer (2015) point out in their conclusion, the knowledge on how institutions influence EEOp is still limited. This paper aims to contribute to understanding this issue by exploiting a reform that, by reducing school duration, increased learning intensity, and by analyzing its impact on IEOp.

23 Their proposed method is best suited for the analysis of EEOp in the context of a randomized controlled trial (RCT) as often conducted in developing countries. Finally, decomposition methods are used to find which groups benefit most from a reform.
24 Due to the reform, students were allowed to attend a different public school than the one in their catchment area. Moreover, privately run but publicly funded schools were allowed, with a voucher system enabling students to attend them without fees.
25 They show that female students were initially better off in the former German Democratic Republic (GDR, 1949-1990), which consisted of today’s German federal states Brandenburg (BB), East Berlin (BE), Mecklenburg-Western Pomerania (MWP), Saxony (S), Saxony-Anhalt (ST) and Thuringia (TH), but that this relative advantage disappeared after reunification.
26 The OECD (2016) confirms that only 19% of 25-34-year-olds in Germany achieve a higher educational degree than their parents.

3.2 Related literature on varying learning intensity

Even though the "Gymnasium-8-reform" shows that education policy considers changing educational intensity in school, only a few studies have so far investigated the impact of such a reform. First, empirical work has concentrated on analyzing the effects of variations in schooling quantity without considering changes in learning intensity. In that regard, mostly reforms increasing the amount of educational time have been considered. For instance, policies raising compulsory minimum school duration have been exploited to estimate the returns to additional schooling in terms of earnings.27 Second, the impact of differences in instructional time on academic performance has been investigated. Relying on either cross-national or within-country differences in instructional time, such studies mostly suggest that the impact of additional time on standardized test scores is positive (e.g. Aksoy and Link (2000), Woessmann (2003), Lavy (2015)). Using the PISA-2006 data for 50 countries, Lavy (2015) shows that additional schooling time has a significant and positive influence on test scores in mathematics, sciences and reading, with the effect being even stronger for students from lower SES, which may be indicative of the equalizing characteristics of additional instructional time. Moreover, the fact that the effects of schooling time are significantly lower for developing than for developed countries suggests that the productivity of instructional time relies considerably on quality aspects of the school system and its environment.28

Only a few studies have analyzed more explicitly the impact of variations in instructional time when curricular contents can be assumed to remain constant. In this context, reforms that have shortened schooling while keeping curricular content unchanged allow evaluating the impact of increasing learning intensity. Krashinsky (2014), for instance, exploits a reform in Canada that reduced the length of high school while keeping both the curriculum and the required standards for achieving the high school diploma unchanged (i.e. learning intensity increased). Focusing on earnings as outcome variable, the study finds a temporary reduction in the returns to schooling in terms of earnings of about 10 percent for students affected by the reform. Nevertheless, the low long-term impact on wages suggests that increased learning intensity might not affect earnings negatively.29 The fact that pre-reform students could choose to complete high school in four or five years, however, raises some doubts as to whether the quasi-experimental set-up criteria are fulfilled in this study (Meyer, 1995). The results nonetheless seem to be in line with findings by Pischke (2007), who exploits a reform in Germany that changed the start of the school year in all federal states to the autumn and that appears to fulfill the quasi-experimental setting criteria. While some states already followed the targeted school year cycle, many had to adapt by implementing two short school years between April 1966 and July 1967. Findings suggest that this reform significantly increased grade repetition and that entrance into the intermediate secondary school track fell by around 10 percent. Nonetheless, only small, negligible effects on earnings persisted. Therefore, Pischke (2007) predicts, based on the short school year experience, that the G-8-reform should not be associated with adverse effects on labor market outcomes for treated students.
Marcotte(2007) estimates how variation in educational intensity affects test scores in mathematics, reading and sciences in the Maryland School Performance Assessment Program (MSPAP) by exploiting an unusual natural experiment.30

27 E.g. Angrist and Krueger (1991) and Grenet (2013) find significant, positive effects of additional schooling on earnings as long as new compulsory minimum schooling laws affect students who might otherwise drop out of school without a degree.
28 However, the study cannot distinguish between the pure time effect and the knowledge effect due to more content being taught in more time.
29 Whether this is due to schooling working primarily as a signal, or whether increased educational intensity compensates for the loss in human capital accumulation caused by reduced schooling quantity, remains an open question.
30 With snowfall varying approximately randomly across Maryland each year, snow-related school closures create random variation in the number of available school days for students from the same grade in each specific school year. Thus, quasi-experimental variability in the time available to prepare for the MSPAP test is created among students.

He finds positive, significant effects of additional school days on performance, which are largest for mathematics. Differences in effects across subjects are interpreted to be consistent with subject-related curricular flexibility. For instance, assuming mathematics to have a quite inflexible curriculum, a reduced number of school days could be less easily compensated by increasing learning intensity in mathematics compared to other subjects.31

In this paper, I exploit the G-8-reform, which provides a quasi-experimental setting to study the effect of increased learning intensity on EEOp (see section 4). Despite the public controversy about this reform, which has even partially induced federal states to reverse it (see Table A.2), only a few scientific studies have evaluated the G-8-reform and its effects on outcomes such as educational achievements. First, there have been a few studies aiming to analyze the reform by comparing Gymnasium-8-model (G-8-model) cohorts and Gymnasium-9-model (G-9-model) cohorts within one federal state. To begin with, in most federal states the respective statistical offices have conducted studies comparing grades in central exit examinations of students in the double cohort, that is, the respective year when both the last G-9-model and the first G-8-model cohorts graduated from Gymnasium (compare Table A.2). Generally, these statistical evaluations have found no systematic performance differences in central exit exams between students with 8 or 9 years of secondary school duration.32 Furthermore, for the federal state of Saxony-Anhalt (ST), a small series of papers (Büttner and Thomsen, 2015; Thiel, Thomsen, and Büttner, 2014; Meyer and Thomsen, 2016) has analyzed aspects of the G-8-reform. In summary, they analyze the reform’s effects on academic achievement using as outcome variable the results in the central exit examinations of 2007, when the double cohort graduated in Saxony-Anhalt (ST). Findings suggest that, due to more intense schooling, exam achievements in mathematics deteriorated significantly but remained unaffected for German literature, showing that learning intensity ratios differ across subjects. Moreover, no significant negative effects on students’ non-cognitive soft skills are detected, contradicting claims that increased learning intensity and the associated reduced time for non-schooling activities may have adversely affected non-cognitive skill formation.
In line with this result, Milde-Busch, Blaschek, Borggräfe, von Kries, Straube, and Heinen (2010) reject the hypothesis that, from a medical point of view, the more intense schooling experience had significant impacts on the stress levels of students. However, due to reduced leisure time, G-8-model students were less able to relax relative to their peers in the G-9-model. Finally, Meyer and Thomsen (2016) find no negative effects of the G-8-reform on the ability, motivation and likelihood to pursue university education.33

Recently, very few papers have started to use more representative data that might be more independent from school system related characteristics or relative performance measurement issues arising with marks at school (e.g. PISA-data). Moreover, identifying the G-8-reform effect by exploiting the variation in its implementation across federal states and over time, this approach allows targeting the shortcomings of previous studies, such as federal state specific trends, by applying methods taking into account variation across states (e.g. DID). For instance, the two most comparable studies to this paper not only rely on exploiting a setting with multiple

31 There have been a few similar studies exploiting quasi-random assignments of instructional time (e.g. due to the timing of the school year or absence periods of teachers) that usually find similar positive effects of additional time on test scores. Marcotte (2007) is an illustrative example of a study exploiting quasi-experimental variations in instruction time while keeping the curricular requirements unchanged, as a method to analyze learning time/intensity effects.
32 For instance, there are federal states with no observed difference (Saarland (SL), North-Rhine-Westphalia (NRW)); in some states the G-9-model students remained slightly better (Baden-Wuerttemberg (BW)), whereas in others the opposite has been the case (Hesse (HE), Berlin (BE)); and finally, in some states results differ between the two groups depending on the subject (Bavaria (BV)).
33 But the reform had some influence on post-secondary school decisions. For instance, they find significant delays in the starting dates of a first university degree for female students who graduated from a G-8-model school, because they are now more likely to first complete a type of vocational education. Moreover, in the questionnaire, students reveal that despite the G-8-reform they still continue with their hobbies. However, students also state that they work less outside school but receive more extra tuition.

federal states and several time periods, but also use the same outcome variables for educational achievement, the standardized PISA test scores in reading, mathematics and sciences for academic-track ninth-graders.34 First, Andrietti (2015) uses a data set comprising PISA-2000 to PISA-2009, representative of the 16 German federal states, in order to exploit the G-8-reform for a DID estimation of the effects of increased learning intensity on test scores.35 He finds that the average treatment effect of the reform is significant and positive for all three educational outcomes.36 Finally, in contrast to Huebener and Marcus (2015), Andrietti (2015) finds no evidence for a significant increase in general grade retention rates. Instead, his results suggest that grade repetition may have increased only for boys and students with a migration background. This may indicate that the G-8-reform caused distributional changes in educational outcomes and thus may have affected EEOp; however, Andrietti (2015) does not really address this issue. In summary, the author shows that students might benefit from increased instructional time despite higher learning intensity.37 In addition to the data used by Andrietti (2015), Huebener, Kuger, and Marcus (2016) include the PISA-2012 data, which allow more federal states to be included in the analysis. First, they show, based on state regulations of timetables for secondary school, that due to the G-8-reform weekly instruction hours for the average treated student increased by about 6.5 percent over a period of almost 5 years. Then, the authors suggest that this increased instruction time improved student performance on average in all three PISA testing domains. However, the effect size is small, at about 6 percent of a standard deviation in scores.
Moreover, for low-performing students positive effects are insignificant, whereas their high-performing peers experience significant, positive effects, indicative of a widening of the performance gap among students in Gymnasium. In that regard, Huebener et al. (2016) focus on the effect of increased instruction time, whereas Andrietti (2015) puts more emphasis on the increased learning intensity aspect of the reform. In this paper, I use data similar to Huebener et al. (2016), with PISA test scores from 2000 to 2012. However, my focus regarding the reform effect follows Andrietti (2015), with emphasis on the effects of increased learning intensity. While these studies shed light on the direct effect of the reform on test scores, they do not tackle the question of whether and how the reform, by increasing learning intensity, may have changed EEOp.
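The difference-in-differences logic that these studies apply to test scores, and that can equally be applied to an IEOp measure as outcome, can be sketched with a two-group, two-period example. This is a simplified illustration under stated assumptions, not the staggered multi-state specification of any of the cited papers: the outcome values below are invented, and `treated` and `post` are hypothetical indicators for reform states and post-reform PISA waves.

```python
import numpy as np

# Invented state-by-wave outcomes; columns: treated, post, outcome.
# The outcome could be a mean test score or an IEOp measure per state and wave.
data = np.array([
    [0, 0, 0.20], [0, 0, 0.22],   # control states, before
    [0, 1, 0.23], [0, 1, 0.25],   # control states, after
    [1, 0, 0.24], [1, 0, 0.26],   # reform states, before
    [1, 1, 0.31], [1, 1, 0.33],   # reform states, after
])
treated, post, y = data[:, 0], data[:, 1], data[:, 2]

def cell_mean(t, p):
    """Mean outcome in the (treated == t, post == p) cell."""
    return y[(treated == t) & (post == p)].mean()

# DID as a difference of before/after changes between the two groups
did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Equivalent saturated regression: y = a + b*treated + c*post + d*treated*post
X = np.column_stack([np.ones(len(y)), treated, post, treated * post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
did_reg = coef[3]   # coefficient on the interaction term

print(f"DID from cell means: {did_means:.3f}")
print(f"DID from regression: {did_reg:.3f}")
```

With these invented numbers, the control group changes by about 0.03 and the reform group by about 0.07, so both computations give a DID estimate of roughly 0.04. The regression form is the one that extends to several states, several waves and additional controls.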

In this paper, first, I try to shift the emphasis in the analysis of the G-8-reform to distributional concerns, i.e. its consequences for IEOp. In other words, this paper aims to answer the question of whether the G-8-reform may be considered a selective reform, i.e. one that at least maintains test score results while increasing IEOp, or an inclusive reform, i.e. one that at least maintains test score results while increasing EEOp (Checchi and Van De Werfhorst, 2014). In that regard, to my knowledge, this paper may be among the first to evaluate the G-8-reform based on the German-specific PISA test scores in order to analyze its impact on EEOp. Thus, the second aim of this paper is to contribute to the literature on EEOp (compare section 3.1) by providing some evidence on a potential policy channel, learning intensity, at the school level. Finally, this paper aims to draw attention to the largely neglected factor of learning intensity, which has implications for both the effectiveness and efficiency of (non)cognitive skill formation.

34 In 2012, in my Bachelor thesis "Does shortening secondary school duration affect student achievement and educational equality? Evidence from a natural experiment in Germany: the ’G-8 reform’", I, Sebastian Camarero Garcia, already used PISA test scores in reading as outcome variable to analyze in a DID estimation framework the effects of the G-8-reform on cognitive skills, finding a positive effect of about 0.15 standard deviations in test scores, with stronger effects for students with a migration background (Camarero Garcia, 2012).
35 As explained in section 5, to have more consistency and comparability across the studies used, I rely on PISA-I samples for all years and do not mix PISA-E and PISA-I samples.
36 Treated students in a G-8-model experience an improvement of about 0.095-0.145 standard deviations in PISA test scores. Furthermore, the author tries to estimate the effects of the approximate pure instruction time increase on test scores and finds similar results: a twenty-hour increase distributed over grades 5-9 or a ten-hour increase distributed over grades 8-9 corresponds on average to an improvement of 0.08-0.15 standard deviations, respectively, depending on the subject.
37 This is in line with studies finding a positive impact of additional instructional hours when learning intensity is kept constant.

4 The “G-8-reform”: shortening secondary school duration

The goal of this section is to explain the institutional background and implementation of the G-8-reform in order to illustrate how this reform can be exploited to set up a quasi-experimental estimation approach. This allows analyzing the effect of increased learning intensity on a measure of IEOp as described in section 2.

4.1 Institutional background: The German school system

In line with Germany’s federalist organization, educational policy is run by each federal state (Bundesland or Land). While secondary school systems can differ across German federal states, most features are comparable.

School usually starts at the age of six, when students enter primary school for a period of four years. Only in Berlin (BE) and Brandenburg (BB) does primary school, also starting at the age of six, comprise grades 1 to 6. After primary school, students enter a tripartite secondary school system where the choice of track is determined by their previous academic performance.38 Both the shortest track of secondary school, Hauptschule, and the intermediary track, Realschule, allow graduates to pursue apprenticeship programs after a total of 9 or 10 years of schooling. University access is restricted to those gaining the Abitur, the university access diploma, by completing the academic track, Gymnasium, which before the G-8-reform lasted for 9 years. Thus, including primary school, students achieving the university admittance qualification used to graduate after 13 years. Nevertheless, some federal states have offered the Abitur after 8 years of secondary schooling (12 years in total) for several decades. In fact, federal states that were part of the former GDR had originally developed a different secondary school system compared to federal states in West Germany. All students were taught together until the 10th grade, after which they could either follow vocational training or reach the Abitur after completing two additional years of Gymnasium. Even though most federal states in East Germany adjusted, in the process of German reunification, to the West German standard of 13 years of schooling to achieve the university access diploma, i.e. a Gymnasium-9-model, Saxony (S) and Thuringia (TH) decided to maintain the Gymnasium-8-model.
Nonetheless, coordination through the Standing Conference of the Ministers of Education and Cultural Affairs of the Länder in the Federal Republic of Germany (SC) initiated a framework for ensuring comparable nationwide academic standards.39 For a more detailed overview of the German education system, I refer the interested reader to Figure A.2 in Appendix A.4 (see also Standing Conference of the Ministers of Education (2015)).40

38 This may be regarded as a simplified illustration of the track selection process. In fact, during grade 4 (or 6), primary schools issue a recommendation for each student as to which track would be suitable, based mainly on the student’s performance and progress during primary school. These recommendations used to be binding in many federal states during the time period considered in this study. For more information on the tracking system, compare Dustmann, Puhani, and Schönberg (2014). An overview of recent regulations of the individual federal states with respect to the transition from primary to lower secondary education is available on the website of the Standing Conference, http://www.kmk.org/fileadmin/Dateien/veroeffentlichungen_beschluesse/2015/2015_02_19-Uebergang_Grundschule-SI-Orientierungsstufe.pdf, and for the period considered in this paper at https://www.kmk.org/fileadmin/Dateien/veroeffentlichungen_beschluesse/2006/2006_03_01-Uebergang-Grundschule-Sek1.pdf.
39 It is the conference consisting of the Secretaries of Education and Cultural Affairs of all 16 federal states that, for example, adopted the Uniform Examination Standards for the Abitur examination in October 2007.
40 In addition to Hauptschule, Realschule and Gymnasium, several federal states have recently started to provide a type of comprehensive schooling (Integrierte Gesamtschule or Schularten mit mehreren Bildungsgängen). In these comprehensive schools, students are not tracked into specific academic paths after primary school, but can graduate after 9, 10 or 13 years of schooling. However, the vast majority of students achieving the Abitur still attend the Gymnasium for this purpose.

4.2 The “G-8-reform” and its implementation

The first PISA study in 2000 received broad public attention in Germany because it revealed that German students achieved only below-average test scores within the OECD in the three basic competences reading, mathematics and sciences (the so-called “PISA-shock”). Debates about improving the German school system ensued (cf. Ertl (2006), Anderson, Fruehauf, Pittau, and Zelli (2015), Ammermueller (2007)). Among the reform proposals, shortening the academic track in secondary school (Gymnasium) from 9 to 8 years, the “G-8-reform”, remains controversial to this day.41 Three main reasons were given to justify its introduction. First, it aimed to reduce the relatively high age of university graduates in Germany. On the one hand, this was said to increase their competitiveness on the labor market compared to the (on average) younger graduates in other OECD countries (cf. OECD (2005a)). On the other hand, with students entering the job market one year earlier, working lifetime would be extended, such that the reform was said to contribute to stabilizing the social security system of a society facing demographic change.42 Second, as the most successful countries in the PISA ranking, e.g. Finland, had a common school duration of 12 years, reduced schooling appeared to be both successful and efficient. Third, the “G-8-reform” was regarded as a necessary adjustment of secondary school with a view to harmonizing tertiary education across Europe. As Büttner and Thomsen (2015) illustrate, the shortening of secondary school duration was also enacted with respect to the Bologna Process. This European initiative aims at creating a European Higher Education Area (EHEA) providing a more comparable, flexible European framework for tertiary education. For this purpose, adjusting secondary school duration towards the average of European counterparts appeared to be sensible.
Moreover, the reform was expected to give the then younger school graduates an incentive to pursue a university degree, thereby raising Germany’s rate of university graduates per birth cohort, which was below the OECD average.

Opponents, however, have argued that the reform-induced intensification of education might even narrow educational opportunities by strengthening the correlation between educational achievement and socio-economic family background, which in Germany is already above the OECD average, as indicated by the so-called socio-economic gradient.43 If background factors gained importance in skill formation, e.g. parental support as a resource for dealing with intensified tuition at school, such fears might be reasonable. Furthermore, parental complaints about increased stress for students due to reduced free time underline the fear of negative impacts on both academic performance and the development of non-cognitive skills that are typically formed in non-academic leisure activities (Thiel et al., 2014).

Beginning in 2001, all 14 federal states with a Gymnasium-9-model gradually decided to shorten secondary school duration from 9 to 8 years. With the respective graduation of a double cohort consisting of both the first G-8-model and the last G-9-model student cohort that had to pass the same final exams (the Abitur exam), the reform process in each federal state took 8 years to transform all grades of Gymnasium (Huebener and Marcus, 2015; Standing Conference of the Ministers of Education, 2016c). As illustrated in Table A.2, the different federal states implemented the G-8-reform one after the other between school years 2001/2002 and 2008/2009, with the associated double cohorts graduating between

41 I refer to the last column in Table A.2 for an overview of the status quo of the reform as of school year 2015/16.
42 Younger university graduates would pay contributions earlier and over a longer timespan, which stabilizes social security.
43 In this context, the socio-economic gradient refers to the correlation of parents’ and their child’s educational achievements. It is used as a descriptive measure indicating the extent to which educational outcomes can be "inherited" (Carneiro, 2008). International studies show that the German socio-economic gradient is relatively high (e.g. Prenzel, Artelt, Baumert, Blum, Hammann, Klieme, and Pekrun (2006); OECD (2005a)).

2006/2007 and 2015/2016 (see also Figure A.3 and Figure A.6 in Appendix A.4).

In general, the SC of education ministers decided that standards for the university access diploma (Abitur) were not to be lowered in response to the reform. Instead, the minimum amount of teaching required before a student can graduate from Gymnasium was maintained at a total of at least 265 weekly lessons during secondary school (Standing Conference of the Ministers of Education, 2016c). This should guarantee a comparable standard of quality of the Abitur across all 16 federal states, independent of schooling duration. Consequently, curricular content originally taught in seven years (from 5th to 11th grade) was now distributed across the remaining six years (from 5th to 10th grade), such that students in the G-8-model were supposed to enter the Gymnasiale Oberstufe, the final two school years of Gymnasium, as if they had completed the original 11th grade. This followed the reasoning that these final two years of Gymnasium (also called the qualification phase)44 are aimed at preparing students for the Abitur with a relatively comprehensive curriculum, such that the scope for adding more curricular content to these years was said to be limited. Furthermore, the first G-8-model cohort in each federal state was planned to enter a common qualification phase together with the last G-9-model cohort. For this reason, the last two years of Gymnasium were kept unchanged and the “lost” year had to be compensated for already during grades 5-10 (7-10 in BE and BB). Thus, it is plausible to assume that curricular content was not reduced for the first student cohorts affected by the “G-8-reform” in any of the federal states.45

The fact that education ministers only started to seriously consider revising curricular content in the G-8-model relative to the previous G-9-model around 2010 does not affect the validity of this assumption, as it would only influence the reform effect for later “G-8-student-cohorts” (after 2012).46 In order to keep the required total number of minimum weekly periods unchanged in the new G-8-model, instructional time increased by 2-4 weekly periods per grade (during grades 5-9) for affected students compared to previous cohorts in the G-9-model-Gymnasium (Standing Conference of the Ministers of Education, 2013). This is also shown by Huebener et al. (2016), who have collected the binding timetable regulations of each federal state and illustrate the change in the distribution of average weekly instruction hours (over grades 5-9) in Figure A.7. However, the loss of one school year was not fully compensated for by additional instructional time. To limit the amount of afternoon schooling in the 5th/6th grade, some hours, e.g. those originally planned for revision, were dropped.47 Because curricular content was not reduced for the first cohorts, the amount of material that had to be covered per week and grade increased. Teaching therefore had to be more compact, i.e. it had to convey more content per school year within a secondary school duration shortened by one year. In conclusion, it is plausible to assume that the “G-8-reform” exogenously increased learning intensity, defined as the amount of curricular content covered in a given period.

44 Only marks from the final two years, together with marks in the Abitur exam, form the total mark stated in the Abitur diploma.
45 This could be interpreted as evidence that students in the G-8-model suffered in principle no loss in overall taught knowledge compared to students in the G-9-model, because they had to pass the same final two years including the final Abitur exams and thus had to learn the same material beforehand in 6 instead of originally 7 school years.
46 For this reason, since this analysis only includes data up to 2012, changes in the implementation of the G-8-reform as indicated in the last column of Table A.2 do not affect the student cohorts analyzed in this paper (see also Section 5).
47 Andrietti (2015) offers the following rough calculation based on the regulations set by the SC and grade-level state-specific data on weekly hours. However, one should note that this is only an approximation for an average student, as the exact hourly impact depends on the federal state and school a G-8-model student attended. Nevertheless, I cite footnote 2 in Andrietti (2015), as it nicely illustrates the approximate change caused by the G-8-reform for affected 5th graders, and I also refer to Figure A.8: By the end of grade 9, G8 students have covered the curriculum corresponding to 6,460 (265/8 per week over 39 weeks for five grades) of the 10,335 instructional hours required for graduation. This means that they have accumulated on average 720 more instructional hours and only 430 less hours than G9 students at the end of grade 9 (265/9 per week over 39 weeks for five grades, i.e., 5,740 hours) and grade 10 (6,890 hours), respectively.
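The calculation quoted from Andrietti (2015) in footnote 47 can be retraced with a few lines of Python. This is only a minimal sketch: the constants are the figures given in the footnote, while the function and variable names are illustrative assumptions and not part of the original calculation. The quoted figures appear to be rounded to the nearest ten hours.

```python
# Figures taken from footnote 47 (Andrietti, 2015); all names are illustrative.
TOTAL_WEEKLY_LESSONS = 265  # minimum weekly lessons summed over all Gymnasium grades
WEEKS_PER_YEAR = 39         # instructional weeks per school year


def cumulative_hours(years_of_gymnasium: int, grades_completed: int) -> float:
    """Instructional hours accumulated after `grades_completed` grades,
    assuming the 265 weekly lessons are spread evenly over all grades."""
    weekly_per_grade = TOTAL_WEEKLY_LESSONS / years_of_gymnasium
    return weekly_per_grade * WEEKS_PER_YEAR * grades_completed


total_required = TOTAL_WEEKLY_LESSONS * WEEKS_PER_YEAR  # hours required for graduation
g8_end_grade9 = cumulative_hours(8, 5)   # G8 students, grades 5-9
g9_end_grade9 = cumulative_hours(9, 5)   # G9 students, grades 5-9
g9_end_grade10 = cumulative_hours(9, 6)  # G9 students, grades 5-10

extra_vs_g9_grade9 = g8_end_grade9 - g9_end_grade9    # roughly 720 hours more
short_vs_g9_grade10 = g9_end_grade10 - g8_end_grade9  # roughly 430 hours less

print(f"required for graduation: {total_required}")
print(f"G8 at end of grade 9:    {g8_end_grade9:.0f}")
print(f"G9 at end of grade 9:    {g9_end_grade9:.0f}")
print(f"G9 at end of grade 10:   {g9_end_grade10:.0f}")
print(f"G8 ahead of G9 (grade 9):   {extra_vs_g9_grade9:.0f}")
print(f"G8 behind G9 (grade 10):    {short_vs_g9_grade10:.0f}")
```

The even spread of lessons across grades is the simplifying assumption behind the footnote; actual timetables differ by federal state and school.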

4.3 Identification strategy: The quasi-experimental set up of the reform

The G-8-reform and its implementation at different points in time at the federal state level can be exploited as a quasi-experimental setting to derive the reform effect on a measure of IEOp (see section 2). This requires categorizing the 16 federal states into Treatment (T) and Control (C) groups for each PISA test year. Table 1 illustrates how Treatment (T) and Control (C) groups may be formed based on the reform implementation process across federal states and time.48 After describing the data used in this paper (section 5), in section 6 I will use the following definitions to explain the specific empirical strategy employed in my analysis. To begin with, I define 4 models based on the time period included in the analysis:
- Baseline-Model, medium-term perspective (Base-MT): covers the time/testing period from 2003 until 2012
- Extended-Model, medium-term perspective (Full-MT): covers the time/testing period from 2000 until 2012
- Baseline-Model, short-term perspective (Base-ST): covers the time/testing period from 2003 until 2009
- Extended-Model, short-term perspective (Full-ST): covers the time/testing period from 2000 until 2009

For the medium-term models, the following T/C-Groups with reform-time set between 2006 and 2009 exist.

• Neither T nor C: Five federal states can not be classified into either T or C-Group: Saarland (SL), North Rhine-Westphalia(NRW), Hesse(H), Saxony-Anhalt(ST), Mecklenburg-West-Pomerania(MWP). First, Saarland (SL), the first Western state that implemented the reform, has to be excluded as 9th-graders were already taught in a G-8-model-Gymnasium both in 2006 and 2009. As it changed school duration earlier (2001/2002) than most other states, it can neither be regarded as clean T nor C-Group in the Base-MT or Full-MT setting with the general reform time set between 2006 and 2009. Similarly, in the medium-term perspective (including years 2009 and 2012), North Rhine-Westphalia (NRW) has to be excluded, as 9th-graders were still taught in a G-9-model both in 2006 and 2009, such that the reform affects tested students only from 2012 onwards. Thus, NRW is also neither a clean C nor a clean T-Group state in Base-MT model. For the same reasoning, Hesse (H) must be excluded.49 Furthermore, the two Eastern states of Saxony-Anhalt (ST) and Mecklenburg-West-Pomerania (MWP) differ considerably in the way they implemented the reform from other federal states.50 Being already T-Group in 2006, they cannot be T-Group for a DID with the reform time set between 2006 and 2009.

• According to Table 1, seven federal states can be classified as the T-Group, because tested 9th graders were in a G-8-model only from 2009 onwards: Treatment-Group (T7): Baden-Wuerttemberg (BW), Bavaria (BV), Lower-Saxony (LS), Bremen (BR), Hamburg (HB), Berlin (BE), Brandenburg (BB). However, as Eastern federal states were part of the former GDR, they are likely to still differ from Western states, for instance with regard to teachers who were educated in the GDR. Thus, focusing only on Western states, one gets Treatment-Group (T5): BW, BV, LS, BR, HB. Finally, the most conservative setting is formed by Treatment-Group (T3): BW, BV, LS. It consists only of Western, territorial federal states, because such states, which comprise larger, heterogeneously populated areas, may exhibit some inherently different characteristics from city states.

48 Note that in the main specifications, the reference point for the G-8-reform is set between 2006 and 2009, since for 7 out of 13 reforming federal states, 9th graders participating in the PISA 2009 test were the first cohorts affected by the reform. Thus, it is most convenient to set the reform time between test years 2006 and 2009 in order to conduct a DID estimation.
49 Although Hesse is the only Land that did not implement the reform uniformly for Gymnasium at the start of one school year, but successively over three years, one could still classify it as C-Group in 2009, when only 10% of the students tested had already been treated. But then the reform only takes effect by 2012 and Hesse is C-Group both in 2006 and 2009, when 9th-graders were still taught in a G-9-model-Gymnasium. Thus, it should be excluded for being neither T- nor C-Group in the Base/Full-MT model.
50 They applied treatment only from the 9th grade onwards (not from the 5th grade onwards, as most other states), such that the reform impact had to be relatively stronger compared to Western states.

Table 1: "G-8-reform" Treatment/Control-Group allocation of PISA cohorts per state

| Federal state | Reform enaction | Double cohort | Treated grade | PISA cohorts affected (if) Treatment: 2000 / 2003 / 2006 / 2009 / 2012 | Cohort/grade affected 2006 | Cohort/grade affected 2009 | Cohort/grade affected 2012 |
|---|---|---|---|---|---|---|---|
| Bavaria (BV) | 2004/2005 | 2010/2011 | 6 | first 6th cohort treated not in 9th grade in a PISA test year | | | |
| Bavaria (BV) | 2004/2005 | 2011/2012 | 5 | C C C T T | - | 1st cohort (5th graders) | 4th cohort (5th graders) |
| Lower-Saxony (LS) | 2004/2005 | 2010/2012 | 6 | first 6th cohort treated not in 9th grade in a PISA test year | | | |
| Lower-Saxony (LS) | 2004/2005 | 2011/2013 | 5 | C C C T T | - | 1st cohort (5th graders) | 4th cohort (5th graders) |
| Baden-Wuerttemberg (BW) | 2004/2005 | 2011/2012 | 5 | C C C T T | - | 1st cohort (5th graders) | 4th cohort (5th graders) |
| Rhineland-Palatinate (RP) | 2008/2009 | 2015/2016 | 5 | C C C C C | - | - | - |
| Schleswig-Holstein (SH) | 2008/2009 | 2015/2016 | 5 | C C C C C | - | - | - |
| North Rhine-Westphalia (NRW) | 2005/2006 | 2012/2013 | 5 | C C C C T | - | - | 3rd cohort (5th graders) |
| Hamburg (HB) | 2002/2003 | 2009/2010 | 5 | C C C T T | - | 3rd cohort (5th graders) | 6th cohort (5th graders) |
| Bremen (BR) | 2004/2005 | 2011/2012 | 5 | C C C T T | - | 1st cohort (5th graders) | 4th cohort (5th graders) |
| Berlin (BE) | 2006/2007 | 2011/2012 | 7 | C C C T T | - | 1st cohort (7th graders) | 4th cohort (7th graders) |
| Brandenburg (BB) | 2006/2007 | 2011/2012 | 7 | C C C T T | - | 1st cohort (7th graders) | 4th cohort (7th graders) |
| Saxony-Anhalt (ST) | 2003/2004 | 2006/2007 | 9 | | | | |
| Saxony-Anhalt (ST) | 2003/2004 | 2007/2008 | 8 | | | | |
| Saxony-Anhalt (ST) | 2003/2004 | 2008/2009 | 7 | C C T T T | 1st cohort (7th graders) | | |
| Saxony-Anhalt (ST) | 2003/2004 | 2009/2010 | 6 | | | | |
| Saxony-Anhalt (ST) | 2003/2004 | 2010/2011 | 5 | C C C T T | | 2nd cohort (5th graders) | 5th cohort (5th graders) |
| Mecklenburg-Western Pomerania (MWP) | 2004/2005 | 2007/2008 | 9 | | | | |
| Mecklenburg-Western Pomerania (MWP) | 2004/2005 | 2008/2009 | 8 | C C T T T | 1st cohort (8th graders) | | |
| Mecklenburg-Western Pomerania (MWP) | 2004/2005 | 2009/2010 | 7 | | | | |
| Mecklenburg-Western Pomerania (MWP) | 2004/2005 | 2010/2011 | 6 | | | | |
| Mecklenburg-Western Pomerania (MWP) | 2004/2005 | 2011/2012 | 5 | C C C T T | | 1st cohort (5th graders) | 4th cohort (5th graders) |
| Saxony (SN) | since 1949 | | 5 | C2 C2 C2 C2 C2 | hypothetical control group: always in treatment | | |
| Thuringia (TH) | since 1949 | | 5 | C2 C2 C2 C2 C2 | hypothetical control group: always in treatment | | |
| Saarland (SL) | 2001/2002 | 2009/2010 | 5 | C C T T T | 1st cohort (5th graders) | 4th cohort (5th graders) | 7th cohort (5th graders) |
| Hesse (H) (a) | 2004/05 | 2011/2012 | 5 | C C C (T) T | - | 1st cohort (less than 10%) | 2nd/3rd/4th cohort (5th graders) |
| Hesse (H) (a) | 2005/06 | 2012/2013 | 5 | C C C C T | - | | 2nd/3rd/4th cohort (5th graders) |
| Hesse (H) (a) | 2006/07 | 2013/2014 | 5 | C C C C T | - | - | 2nd/3rd/4th cohort (5th graders) |

(a) Hesse (H) only introduced the reform gradually across 3 school years (compare Table A.2 and Figure A.6 as well as, for regional settings, Figure A.4/Figure A.5).
Notes: In this Table, Treatment/Control-Groups are highlighted by rectangular boxes in the following way: For the Base/Ext-ST/MT Models: Treatment T3 ≡ red rectangle, T5 ≡ red + magenta rectangle and T7 ≡ red + magenta + violet rectangle. For the Base/Ext-MT Models: Control-Group (C2) ≡ blue rectangle; for the Base/Ext-ST Models: Control-Group (C3) ≡ blue + green rectangle. Adding H to C3 would form Control-Group (C4). Finally, TH and S form a hypothetical Control-Group (C2hyp) that always remains in a Gymnasium-8-model.

For the medium-term models, two Control-Groups may be formed by the remaining four federal states.

• Control-Group (C2): Rhineland-Palatinate (RP), Schleswig-Holstein (SH). In these two territorial, Western federal states, student cohorts attended a G-9-model-Gymnasium during the whole time frame. They best match the T3-Group.

• Hypothetical Control-Group (C2hyp): Saxony (S), Thuringia (TH). These two Eastern federal states have followed a G-8-model since 1949, when the GDR was founded, and maintained their secondary school system beyond reunification. As they always stayed in a G-8-model, they cannot contribute to estimating the causal reform effect of shortening secondary school duration. However, they can form a hypothetical Control-Group (C2hyp) to estimate the effect of the reform relative to the counterfactual of a permanent G-8-model system.

In summary, Table 1 already indicates that the most comparable T/C-setting for the medium-term models consists of the Treatment-Group-(T3) and Control-Group-(C2), because it focuses on territorial, Western German federal states that are very comparable in relevant characteristics (see also Table 5). Thereby, this setting still accounts for 37.6 out of 80.6 million inhabitants and thus about 50% of the German population and hence, it will be the main specification for the Base-MT model in the results section 7. However, as will be illustrated in section 7.2, I also conduct robustness checks using T5, T7 and C2hyp (Figure A.4).

For the short-term models, the following T/C-Groups with reform-time set between 2006 and 2009 exist:

• Neither T nor C: For the same reasons as in the medium-term, I exclude 3 states: SL, ST, MWP. As these federal states have enacted treatment for affected students already one period before the reform change reference point, they constitute neither a clean T- nor C-Group both in 2006 and 2009.

• The Treatment-Groups remain identical as in medium-term models, as only the year 2012 will be dropped in the new short-term models with the reform time still set between 2006 and 2009 (T3 = BW, BV, LS — T5 = BW, BV, LS, BR, HB — T7 = BW, BV, LS, BR, HB, BE, BB).

• Control-Group (C2): RP, SH, and hypothetical Control-Group (C2hyp): Saxony (S), Thuringia (TH). These two Control-Groups remain the same as in Base-MT; excluding 2012 does not change the allocation of these federal states into Control-Groups.

• Control-Group (C3): RP, SH, North Rhine-Westphalia (NRW). NRW, the territorial, Western federal state with the largest population in Germany, can now be added to Control-Group C2, because in the (Full-)/Base-ST model 9th graders were taught in a G-9-model-Gymnasium across the whole time period from (2000) 2003 until 2009.

• Control-Group (C4): RP, SH, NRW, Hesse (H). One can add H to Control-Group C3 in order to consider another territorial, Western federal state as part of the Control-Group. To do so, one has to assume that H can still be classified into the Control-Group in 2009, as by then only 10% of 9th graders may have been treated (compare Table 1).

Finally, one can note that the most comparable T/C-setting for the short-term models consists of the Treatment-Group (T3) and the Control-Group (C3). It still accounts for 55.2 out of 80.6 million inhabitants and thus about 68% of the German population. Hence, I choose it as the main specification for the Base-ST model in the results section 7. However, robustness checks will be conducted using T5 and T7 as well as C2, C2hyp and C4. In section 6, I will continue using the above Model-, Treatment- and Control-Group definitions to focus on the details of the specific empirical strategy that my analysis relies on (for an overview see Appendix A.2.1).
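The grouping logic described above can be sketched in Python as follows. This is an illustrative reading of Table 1, not code used in the paper: the dictionary `STATUS`, the function names and the classification rule are assumptions made for this sketch. Hesse is left out because its 2009 status, "(T)", depends on the additional assumption discussed for C4, and Saxony and Thuringia are left out because they form the hypothetical group C2hyp.

```python
# Treatment status (C/T) of tested 9th graders in the PISA years below,
# transcribed from Table 1. Names and rule are illustrative assumptions.
YEARS = (2000, 2003, 2006, 2009, 2012)
STATUS = {
    "BW": "CCCTT", "BV": "CCCTT", "LS": "CCCTT",
    "BR": "CCCTT", "HB": "CCCTT", "BE": "CCCTT", "BB": "CCCTT",
    "RP": "CCCCC", "SH": "CCCCC", "NRW": "CCCCT",
    "SL": "CCTTT", "ST": "CCTTT", "MWP": "CCTTT",
}


def classify(status: str, last_year: int) -> str:
    """Classify one state for a reform time set between 2006 and 2009.

    `last_year` is the final PISA cycle in the window: 2012 for the
    medium-term models, 2009 for the short-term models.
    """
    window = {y: s for y, s in zip(YEARS, status) if y <= last_year}
    pre = [s for y, s in window.items() if y <= 2006]
    post = [s for y, s in window.items() if y >= 2009]
    if all(s == "C" for s in pre) and all(s == "T" for s in post):
        return "treatment"  # untreated up to 2006, treated from 2009 on
    if all(s == "C" for s in window.values()):
        return "control"    # untreated in every cycle of the window
    return "excluded"       # neither a clean T- nor a clean C-Group


def groups(last_year: int) -> dict:
    """Collect state codes per group, sorted alphabetically."""
    result = {"treatment": [], "control": [], "excluded": []}
    for state, status in STATUS.items():
        result[classify(status, last_year)].append(state)
    return {name: sorted(states) for name, states in result.items()}


medium_term = groups(2012)  # Base-MT / Full-MT window
short_term = groups(2009)   # Base-ST / Full-ST window
print("medium-term:", medium_term)
print("short-term: ", short_term)
```

Under these assumptions the rule yields T7 as the treatment group in both windows, C2 (RP, SH) as the medium-term control group and C3 (RP, SH, NRW) as the short-term control group. The narrower groups T5 and T3 follow from additionally dropping the Eastern states and the city states.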

4.4 Internal Validity of the strategy

According to Meyer (1995), the main idea of a quasi-experiment consists in finding a source of exogenous variation in an explanatory variable, since this avoids estimation biases due to omitted variables that endogenously affect both the explanatory and the dependent variable. However, in order to benefit from a quasi-experimental setting, three potential problems regarding internal validity should be discussed. A first important criterion for internal validity is the comparability of the Treatment and Control-Group in the pre-reform period. On the one hand, the German federal states share a similar legislative, cultural and economic framework, and common qualification standards are coordinated by the SC. Thus, exploiting variation in the reform implementation process across federal states can already be considered an improvement compared to relying on cross-national variation as in many existing studies (cf. Woessmann (2010)). On the other hand, by carefully comparing pre-reform characteristics of both groups, potential selection bias can be identified and ruled out (see section 6.2). Secondly, one should consider whether the reform effect is driven not only by the explanatory variable of interest (increasing learning intensity), but also by other factors that endogenously affect the outcome variable (IEOp). For instance, anticipatory effects might theoretically induce potentially affected students to move with their families into a state that has not yet implemented the Gymnasium-8-reform. If such behavior had occurred for the affected cohorts tested in 2009 with respect to a Treatment-Group (T-Group) pre-reform, i.e. before the school year 2004/2005, it might have changed the population’s composition across T- and C-Groups in a way that would bias estimation results. However, such anticipatory behavior is very unlikely. 
First, due to the fast implementation of the G-8-reform across all federal states and the fact that half of them implemented the reform within three school years (2003/2004 until 2005/2006, see Table A.2), options for moving were limited. There is no systematic pattern regarding the timing and implementation of the G-8-reform and the geographical location of federal states (Andrietti, 2015). Furthermore, direct and indirect moving costs, including bureaucratic hurdles, appear to be the reasons why the number of students changing secondary school across federal states has always remained low. In addition, strategic considerations concerning the competition for access to study programs also support the assumption that selection bias due to movements between states is unlikely. Due to the reform, several double cohorts graduating between 2009 and 2016 temporarily increase the number of applicants for university studies in Germany. As this would adversely affect the probability of immediately entering a desired study program, by completing the G-8-model a student could at least “insure” herself against the risk of having to add in total another 14th year consisting of a gap year.51 Finally, one important criterion for internal validity is the common time trend (CTT) assumption (Angrist and Pischke, 2008), which requires that in the absence of the reform, both Treatment and Control-Group would have shown a similar time trend.52 Although this cannot be tested directly, placebo tests (cf. Bertrand, Duflo, and Mullainathan (2004)) for different pre-reform time periods can be conducted as robustness checks for the internal validity of this paper’s strategy. Moreover, as argued in section 4.3, by restricting the analysis to a setting of Treatment-Group (T3/T5/T7) vs. Control-Group (C2) in model Base-MT or Control-Group (C2/C3/C4) in model Base-ST, the reform was implemented in the same school year 2004/2005 in all Treatment-Group federal states. 
Thus, the quasi-experimental setting as described in Table 1 is unlikely to suffer from estimation bias due to non-random reasons for introducing the G-8-reform slightly earlier or later among federal states.53
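In a stylized two-group, two-period form, the common time trend assumption and the resulting DID estimand can be written as below. The notation is illustrative and not taken from this paper: $Y$ denotes the outcome (here a measure of IEOp), $Y^{0}$ the outcome in the absence of the reform, $T$ and $C$ the Treatment and Control-Group, and 2006 and 2009 the PISA cycles on either side of the reform reference point. The specification actually estimated is described in section 6 and may differ from this textbook form.

```latex
% Common time trend: absent the reform, both groups would have evolved in parallel
E\left[Y^{0}_{2009} - Y^{0}_{2006} \mid T\right] = E\left[Y^{0}_{2009} - Y^{0}_{2006} \mid C\right]

% DID estimand identified under this assumption
\delta_{DID} = \left(E\left[Y_{2009} \mid T\right] - E\left[Y_{2006} \mid T\right]\right)
             - \left(E\left[Y_{2009} \mid C\right] - E\left[Y_{2006} \mid C\right]\right)
```

A placebo test in this notation applies the second expression to two pre-reform cycles, for instance 2003 and 2006, where the estimate should be close to zero if the assumption holds.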

51 Instead of spending 13 years at school and having to wait 1 additional year before entering a desired study program, a student with 12 years of schooling could enroll at university after 1 gap year that replaces the “saved” year.
52 In other words, there are no compositional changes in the student body prior to the reform and PISA test scores would have followed the same patterns in both Treatment- and Control-Group in the absence of reform-induced changes in learning intensity.
53 As explained in Appendix A.2.2, the T/C-settings in Base-ST/MT take into account the political parties governing.

5 Data

To begin with, some background information on the OECD’s PISA data, its advantages and disadvantages to measure educational outcomes as well as the standardization procedure conducted on test scores is provided in Appendix A.1.2. In this section, first, I will focus on which specific PISA data are used for the analysis of the Gymnasium-8-reform. Second, some basic descriptives based on this dataset will be provided.

5.1 PISA Data used in this project

For Germany, two types of PISA test data are available (for information on data sources see Appendix A.1.1). First, for each testing cycle, two random samples of students taking the same test on the same day were chosen. After schools from each of the 16 federal states had been randomly selected, on the one hand, about 25 students within these schools were randomly drawn on the basis of being 15 years old, as the international cross-country comparison relies on the age-based sample (compare ??); on the other hand, students were randomly chosen on the basis of being in the 9th grade. For this purpose, two classes of 9th graders with a minimum of 25 students were randomly chosen within the selected schools; thus, the grade-based samples include about twice the number of students in the typical age-based sample. For the purpose of this study, as explained in section 4, I rely on the representative, random sample of students chosen on the basis of being in the 9th grade, since the G-8-reform affected students based on their grade in a certain school year (compare Table 1). For each PISA study, the 9th-grader-based sample consists of about 10,000 students from about 225 schools (compare Table 2).

Table 2: Available grade-sample based PISA-I datasets

| | PISA-2000 (b) | PISA-2003-I | PISA-2006-I | PISA-2009-I | PISA-2012-I |
|---|---|---|---|---|---|
| Period | "before" reform | "before" reform | "before" reform | "after" reform | "after" reform |
| student-dataset | 914 variables | 1,292 variables | 1,095 variables | 1,231 variables | 1,215 variables |
| # of students (a) | 34,754 | 8,559 | 9,577 | 9,460 | 9,998 |
| test scores (d) | reading*, mathematics, sciences | reading, mathematics*, sciences | reading, mathematics, sciences* | reading*, mathematics, sciences | reading, mathematics*, sciences |
| school-dataset | 470 variables | 572 variables | 565 variables | 534 variables | 502 variables |
| # of schools | 1,342 | 216 | 226 | 226 | 230 |
| teacher-dataset (c) | - | 653 variables | - | 639 variables | 257 variables |
| # of teachers | - | 1,939 | - | 2,201 | 2,084 |

(a) Number of observations for students as included in the PISA datasets (2000, 2003, 2006, 2009, 2012) as available from the Institut zur Qualitätsentwicklung im Bildungswesen (Institute for Educational Quality Improvement) (IQB) based on the grade-based sample (see also Appendix A.1.1). Note that here the student-dataset includes both the original student questionnaire answers and their parental ones.
(b) Note that for the year 2000, there was no specific grade-based PISA-I-sample available from IQB. However, PISA-2000 being the PISA-2000-E dataset is 9th-grade-based (Baumert, Artelt, Klieme, Neubrand, Prenzel, Schiefele, Schneider, and Weiß, 2002). Therefore, it has a lower number of variables, but a higher number of observations than the other datasets.
(c) For years 2000 and 2006, the teacher-dataset was not part of the German-specific PISA dataset provided via the IQB.
(d) The test score domain marked with an asterisk has been in focus for the respective PISA test cycle.

Second, in Germany, national extensions (PISA-E-samples) were conducted for the years 2000, 2003 and 2006. Each of them consists of about 45,000-50,000 students. For this purpose, one day after the tests for the PISA-I-samples, additional students in each federal state, randomly selected according to the same two-stage randomized survey design, underwent the same testing procedures with a national questionnaire. Combined with the original PISA-I-samples, enlarged grade-based and/or age-based PISA-E-samples thus emerged. By oversampling less populated federal states, they aimed to enable more robust comparisons of educational performance among German states. However, PISA-E was discontinued in 2009, and the Standing Conference of the Ministers of Education and Cultural Affairs of the Länder (SC) replaced it with the IQB-Ländervergleichstest (federal state comparison test). From 2009 onwards, this comparison test aims to assess national educational standards determined by the SC. Notably, scores in the IQB federal state comparison test are adjusted to resemble the PISA-E testing scale, thus allowing the different extended datasets to be compared over time (Baumert and Prenzel, 2009). However, as only reading scores have been recorded for 2009 and only mathematics and science scores for 2012, a regular cross-section for all three testing domains cannot be constructed for these larger dataset versions (for an overview see Table A.1).

By contrast, for the grade-based PISA-I-samples, all three test score domains are available for all testing cycles (Table 2). The test is designed to enable reexamining the evolution of scores across federal states over time. Although smaller than the PISA-E-samples, the PISA-I-samples are still large enough, and by applying the associated weights to each student observation, the representativeness of the data can be maintained.54 To achieve more consistency and comparability across the studies used, this paper relies on the grade-based PISA-I-samples for all years available and avoids mixing PISA-E- and PISA-I-datasets. In that regard, my sample differs from that of Andrietti (2015) or Huebener et al. (2016), who combine PISA-E-2000/2003/2006 with PISA-I-2009.55 Thus, the empirical analysis undertaken in this study is based on a dataset pooling grade-based samples from PISA-I-2003/2006/2009/2012 and, for the extended time period models, PISA-E-2000 (Table 2). However, the Base-MT (2003-2012) and Base-ST (2003-2009) models will be the preferred specifications, as they do not require mixing PISA-I and PISA-E-2000 datasets.56

In summary, this paper relies on the grade-based version of the PISA-I samples to construct a representative repeated cross-section of German students in the 9th grade in Gymnasium that allows analyzing IEOp in response to the G-8-reform by using variables based on students’ PISA test scores and their background characteristics. About one third of secondary school students are in Gymnasium, and indeed about one third of the 9th-grader sample is in Gymnasium. Finally, due to resource constraints the analysis restricts to using mainly datasets including both variables derived from the questionnaire for students and their parents (student-dataset in Table 2). Given the fact, that so far the IQB does not provide access to all available teacher-datasets (esp. year 2006), this paper refrains from using data extracted from teacher questionnaires (teacher-dataset in Table 2). However, questions reappearing for cross-checking purposes in student, teacher or principal questionnaire (e.g. age, gender, etc.) are included in the dataset used for this paper.

54 Therefore, sample size, test score scales and the main background information evaluated by the questionnaires for students, parents and the school headmasters are for the most part comparable between PISA-I-studies (Baumert and Prenzel, 2009).
55 Note that the IQB only provides age-based samples for PISA-E-2006. Thus, reducing the dataset to 15-year-olds in the 9th grade would produce a much smaller sample for 2006 compared to the one for 2000. Such a dataset might be more likely to be affected by estimation bias, because the merger of two datasets originally created to be representative for different target populations raises questions about its representativeness. However, for robustness, one may try to replicate this paper’s approach using both the grade-based German-specific version of the PISA-E-studies 2000, 2003, 2006 and the PISA-I-studies 2009 and 2012.
56 Note that for the year 2000, there was no specific grade-based PISA-I sample available from IQB. However, PISA-2000 being the PISA-2000-E dataset is 9th-grade-based (Baumert et al., 2002). In that case, one has about 768 weights instead of the usual 80 replication weights. One also needs to pay more attention to weighting, as in the larger samples (PISA-E or IQB-LV) student observations per testing domain may vary and different weights may be required per testing domain.

5.2 Descriptive statistics and Control-Variables

First, the descriptive statistics of the main outcome variables, PISA test scores in the domains of Reading, Mathematics and Science, are shown in Table 3. As expected when focusing on students in the academic track of secondary school, mean PISA test scores are above the German average. A typical 9th grader in Gymnasium consistently scores about 60 points higher than the average German 9th grader, which corresponds to about an entire proficiency level (??), i.e. the value-added of two school years. Regarding the three testing areas, students in Gymnasium have performed worst in Reading. Moreover, their reading skills appear to have stagnated or rather slightly deteriorated between 2000 and 2012. This observation is in line with reports on German PISA test results for the years 2000-2009 illustrating that German students perform better in Maths relative to Reading (cf. Klieme, Artelt, Hartig, Jude, Köller, Prenzel, Schneider, and Stanat (2011)), with an average score in Maths (about 580) exceeding scores in Reading (about 570). Students perform best in Sciences, reaching up to 590 points. For Maths and Sciences, no clear time pattern in scores can be detected. Furthermore, as Table 3 shows, the median exceeds the mean test score in almost all cases. This indicates that there is more variation at the low end of the performance scale, with some students performing relatively badly and thereby pulling the mean below the median.57 As laid down in section 4.3 and given which datasets are best comparable (section 5.1), an analysis of the Gymnasium-8-reform requires comparing Treatment-Groups (T3/T5/T7) to Control-Groups (C2/C2hyp/C4h) in model Base-MT or additionally to Control-Groups (C3/C4) in model Base-ST. Table 3 gives the allocation of representative students, whose number increases over each PISA testing cycle starting in 2003, into the different T- and C-Groups.
Following this grouping, the estimation sample contains at least 6,223 observations in the Base-MT model and at least 4,297 in the corresponding Base-ST case. Finally, the dataset in this paper contains more than 60 schools per test year across all federal states, and the number of students increases on average with each testing cycle.
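The two minimum sample sizes can be traced back to the group counts in Table 3. The following sketch assumes that the lower bounds refer to the main comparison T3 vs. C2, pooled over the test cycles of the respective model; this reading is an inference from the table, not a statement made in the text.

```python
# Student counts per PISA-I cycle, copied from Table 3.
T3 = {2003: 987, 2006: 1188, 2009: 1467, 2012: 1626}  # Treatment-Group (T3)
C2 = {2003: 153, 2006: 194, 2009: 308, 2012: 300}     # Control-Group (C2)


def pooled_n(years):
    """Sum T3 and C2 students over the given test cycles."""
    return sum(T3[y] + C2[y] for y in years)


n_base_mt = pooled_n([2003, 2006, 2009, 2012])  # medium-term window
n_base_st = pooled_n([2003, 2006, 2009])        # short-term window

print(n_base_mt)  # 6223
print(n_base_st)  # 4297
```

Under this assumption the counts match the figures quoted for Base-MT and Base-ST.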

Regarding the selection of relevant control variables, this study follows the most common approaches used in the literature (compare section 3). To begin with, as illustrated by Gamboa and Waltenberg (2012) and explained in section 2, the choice of control variables when trying to measure IEOp is always open to debate, since one needs to include as controls those variables that represent circumstances, i.e. factors which are beyond the control of a student but which explain part of the dependent variable of interest, i.e. cognitive skills as measured by test scores. Table 4 provides an overview of the main control variables used in the base model specification (Base-MT (2003-2012)).58 Given data restrictions, the control variables can be divided into two main groups, namely student-level circumstances, i.e. personal characteristics, and socio-economic family background variables, such as parental household characteristics. Concerning student-level controls, students are, as expected, on average 15.43 years old and thus around the age of 15. The share of female students is slightly greater than that of male ones (53% of the sample being female). This reflects the fact that in recent years female participation in Gymnasium has been steadily higher than that of male students (Prenzel, Sälzer, Klieme, and Köller, 2013). Finally, the migration background variable indicates that about 17% of students had at least one foreign-born parent and thus reflects a different or additional cultural identification a student might associate with compared to the German one

57 The mean-median comparison and its evolution over time may be regarded as a first sign of whether IEOp changes over time. Here in Table 3, median and mean seem to deviate more after than before the reform.
58 For simplicity, Table 4 just summarizes the main control variables in the pooled version of the main model specification (Base-MT (2003-2012)).

predominant in school. The variable language spoken at home is additionally taken into account within the category of migration background as a student-level characteristic, since in combination with the birthplace-based migration variable it improves the extent to which one controls for the student's migration background. In fact, depending on the level of parental integration, one might expect that not all students with a migration background speak a language other than German at home. And indeed, less than half of the students with a migration background report speaking a different language at home (see Table 4). Clearly, all these individual characteristics (gender, age and migration background) can be classified as circumstances (as defined in section 2). Next, another set of control variables involves socio-economic family background variables. First, an important circumstance variable is a student's parental education background, i.e. the highest educational achievement among a student's parents, since, in the spirit of Todd and Wolpin (2003), parental education can be considered a key factor influencing the student's human capital production. Moreover, it seems to be an indicator of potential support opportunities available to a student; however,

Table 3: Descriptive Statistics: Outcome Variables and Sample Size

                                        "before" reform                "after" reform
                                PISA-2000  PISA-2003-I  PISA-2006-I  PISA-2009-I  PISA-2012-I

PISA test scores of 9th graders in Gymnasium
Reading Mean                       577.92       570.77       568.20       562.65       565.42
Reading SD                          55.86        51.98        56.97        55.25        52.81
Reading Median                     578.83       572.14       571.50       566.23       567.06

Mathematics Mean                   573.65       583.66       571.39       578.53       575.73
Mathematics SD                      62.18        57.85        58.48        56.59        58.52
Mathematics Median                 572.68       584.70       571.19       580.47       576.19

Science Mean                       575.14       591.15       585.01       590.48       580.44
Science SD                          67.43        60.20        61.47        58.88        58.61
Science Median                     576.35       594.80       587.12       594.68       581.07

Number of federal states               16           16           16           16           16
Number of schools                     409           62           67           68           78
Number of students                 10,276        3,017        3,356        3,473        3,910
Treatment-Group (T3)                1,917          987        1,188        1,467        1,626
Treatment-Group (T5)                3,175        1,090        1,275        1,568        1,761
Treatment-Group (T7)                4,524        1,412        1,587        1,778        2,029
Control-Group (C2)                  1,225          153          194          308          300
hyp. Control-Group (C2hyp)          1,387          312          295          118          162
Control-Group(h) (C4h)              2,612          465          489          426          462
Control-Group (C3)                  1,830          872          989        1,159            -
Control-Group (C4)                  2,434        1,062        1,272        1,458            -

Note: This table reports summary statistics for the sample of 9th-graders in Gymnasium and is weighted by the sample weights provided in the PISA dataset from the IQB. Note that the average across plausible values can be taken as a metric of individual-level performance (OECD, 2012). For further information on the test scores and their weighting procedure, I refer to Appendix A.1.2. Mean, standard deviation and median of the test scores across all federal states and for all academic track schools that are in the German PISA dataset are provided for each testing cycle (2000 (see footnotes a and b in Table 2), 2003, 2006, 2009, 2012). Finally, the number of observations for the different Treatment- and Control-Groups (section 4.3) is provided.

students at the age of 15 are unlikely to be in control of changing their parents' educational attainments.

Table 4: Descriptive Statistics: Control Variables - Circumstances

Base-MT (2003-2012)                          Mean  Std. Dev.   Comments

Student-level Characteristics
female-dummy                               0.5289     0.4989   min-max:[0-1]
Age in years                                15.43       0.49   min-max:[13.75-17.25]
migration background (Base category: German language/student and both parents born in Germany)
- language spoken at home                  0.0552     0.2285   min-max:[0-1]; missing: 0.0060 (0.0774)
- migration background                     0.1679     0.3738   min-max:[0-1]; missing: 0.0060 (0.0774)

Parental characteristics
Parental Education (highest ISCED level):
- ISCED-level (5-6)                        0.6285     0.4832   For an explanation of ISCED, see Figure A.10
- ISCED-level (3-4) [Base cat.]            0.2812     0.4495   min-max:[0-1]
- ISCED-level (1-2)                        0.0532     0.2244   missing: 0.0371 (0.1890)

Socio-Economic Status
Number of books in household:
- + 500                                    0.2029     0.4022
- 101-500 [Base category]                  0.4703     0.4991   min-max:[0-1]
- 11-100                                   0.2579     0.4375   missing: 0.0497 (0.2174)
- max. 10                                  0.0193     0.1375
Highest ISEI-level of job in the family:
- highest ISEI-level                      57.1536    17.2042   min-max:[0-90]; missing: 0.0177 (0.1317)

Family Characteristics
family structure - growing up in a single parent household?
- single parent household [Base: No]       0.1317     0.3382   min-max:[0-1]; missing: 0.0808 (0.2726)

family structure - mother/father employment status
Father
- full-time (FT) [Base category]           0.8120     0.3907
- part-time (PT)                           0.0584     0.2345   min-max:[0-1]
- unemployed (UE)                          0.0251     0.1564   missing: 0.0728 (0.2598)
- out-of-labor force (OLF)                 0.0318     0.1753
Mother
- full-time (FT) [Base category]           0.2972     0.4570
- part-time (PT)                           0.4379     0.4961   min-max:[0-1]
- unemployed (UE)                          0.0452     0.2078   missing: 0.0603 (0.2381)
- out-of-labor force (OLF)                 0.1593     0.3660

Number of students                         13,756
G-8-reform-dummy                           0.4573   (0.4982)

Note: This table reports summary statistics for the sample of 9th-graders in Gymnasium pooling the data for the medium-term basic model specification (Base-MT (PISA-I-2003/2006/2009/2012)) (see section 4.3) and is weighted by the sampling weights provided in the PISA dataset from the IQB. In the comments column, the share of missing observations is provided and standard deviations are reported in parentheses. For categorical control variables, the base category is marked in square brackets. Finally, the number of observations and the G-8-reform-dummy share are provided.

For the purpose of measuring parental education, I rely on the International Standard Classification of Education (ISCED). It serves to identify whether mother, father or at least one parent has achieved an academic degree, in which case the household can be classified as an academic household, i.e. having ISCED (level 5/6) (see Table 4). In the sample, about 63% of students grow up in an academic household. A medium education category includes students whose parents' highest educational achievement is upper-secondary/post-secondary, non-tertiary [ISCED (level 3/4)] education. The ISCED (level 1/2) category includes only students whose parents have achieved no more than lower-secondary school. An overview of the definitions for these different categories in the ISCED scale is provided in Figure A.10. In order to take into account the socio-economic status (SES) of a student's family background, I exploit, first, the number of books at home, a standard control variable generated in all PISA studies that is commonly used to indicate the SES environment in which a student grows up. As there is no information on household income, this variable has been shown to be a good alternative proxy for family SES, since household income as well as SES are highly correlated with the number of books in a household. Moreover, it is plausible to assume that students by the age of 15 are financially dependent on their parents, and access to culture appears to be hugely influenced by the opportunities offered in the household in which a child grows up. Thus, it is mostly accepted that for students aged 15 the number of books variable represents circumstances controlling for family SES.59 I take the category of having 101-500 books as base category for this variable, as about 47% of students in the sample live in such a household.
Similarly, the International Socio-Economic Index of Occupational Status (ISEI) can be taken into account as a further control variable for socio-economic background.60 Higher ISEI scores correspond to higher levels of occupational status. As parental occupational status is unlikely to be under the control of students, it also reflects circumstances due to parental SES.

Finally, to control for family structure characteristics, I first take into account whether a student grows up in a single-parent household, since this may involve, for instance, growing up in a more stressful environment. About 13% of all students are raised under such circumstances. Second, I also include employment status dummies for both mother and father, since by determining time availability and family structure they capture aspects that influence the environment in which a student can learn (for school). In the sample, the vast majority of fathers work full-time (FT) (more than 80%), whereas the largest share of mothers is employed part-time (PT) (about 44%). This is consistent with the still predominant family model in Germany during the 2000s, consisting of a breadwinning father and a mother who works only part-time and is mainly in charge of child care.61 Turning to the missing-value indicators included for all control variables, the response rate was always above 90 percent (last column in Table 4). Students affected by the Gymnasium-8-reform constitute 45.73 percent of the sample (in the basic specification of the medium-term model Base-MT).
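How a categorical circumstance such as the number of books enters a regression with a base category and a missing-value indicator can be illustrated as follows. This is a minimal sketch: the function `encode_books` and the column names are hypothetical, and it assumes that missing answers are captured by a separate dummy rather than dropped.

```python
# Answer categories as listed in Table 4; "101-500" is the base category.
BOOK_CATEGORIES = ["max. 10", "11-100", "101-500", "+500"]
BASE = "101-500"


def encode_books(answer):
    """Return dummy columns for one student's answer (None = missing)."""
    if answer is not None and answer not in BOOK_CATEGORIES:
        raise ValueError(f"unknown category: {answer!r}")
    # One dummy per non-base category plus a missing-value indicator.
    row = {f"books_{c}": 0 for c in BOOK_CATEGORIES if c != BASE}
    row["books_missing"] = 0
    if answer is None:
        row["books_missing"] = 1
    elif answer != BASE:
        row[f"books_{answer}"] = 1
    return row


print(encode_books("11-100"))
print(encode_books(None))
```

A student in the base category has all dummies equal to zero, so the coefficients on the remaining dummies are read relative to households with 101-500 books.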

59 As mentioned in section 2, Hufe et al. (2015) provide another reason why the number of books can be taken as a circumstance variable: Below the age of consent (e.g. 16), any student's potential influence on the stock of books in the household is limited and rather not choice-driven. Instead, if a student at the age of 15 or younger made parents buy books, this is likely to be indicative of growing up in more favorable SES, as most students under the age of 16 are influenced by the environment they grow up in rather than vice versa. In other words, the number of books is associated with circumstances at least during childhood. It is not likely that young children in households with few books change this fact because, being used to this environment, they may be less interested in reading and lack financial resources. Therefore, more books likely correlate with more favorable SES.
60 As an alternative, one may use the International Standard Classification of Occupation (ISCO) to determine parental SES. Parents' occupational data were obtained by asking open-ended questions, the responses to which were coded into ISCO codes. However, ISCO is not available for all PISA datasets, in contrast to the mapping of ISCO into ISEI indexes.
61 In fact, with a school system based mostly on half-day schooling, and as working parents faced only a limited supply of institutions taking care of children after school, the family structure described here, with a FT working father and a mostly PT working mother, has been predominant in Western Germany for many decades. However, since the late 2000s, a slow extension of all-day schools has started, and this may change the situation of student generations born during the 2000s, a group of students that is not part of the student sample under investigation for this paper's analysis period (2003-2012).

6 Empirical Strategy

This section, first, briefly explains the empirical strategy that allows analyzing the effects of increased learning intensity on a measure of IEOp (see section 2) by exploiting the ‘quasi-experimental’ setting of the Gymnasium-8-reform (see section 4.3). Second, section 6.2 provides evidence for the appropriateness of focusing the regression analysis in section 7 on the main specifications being Treatment/Control-Group T3 vs. C2 for the medium-term model (Base-MT) and T3 vs. C2/C3/C4 for the short-term model (Base-ST).

6.1 Methodology and Estimation Designs

Analyzing how the G-8-reform, through increasing learning intensity, changed educational opportunities for students in Gymnasium involves a two-step estimation procedure. First, appropriate measures of EEOp or IEOp need to be estimated given the available outcome and control variables in the data. Second, exploiting the quasi-experimental reform setting with the estimated IEOp measures as dependent variable, the effect of interest can be obtained. Starting with the first step, this analysis follows the argumentation of section 2 on how to measure IEOp. That is, $\hat{\theta}_{IOP}$ (equation (7)) will be the measure for IEOp. In other words, first the $R^2$ from an OLS regression of PISA test scores on circumstance variables $C$ has to be obtained for the different Treatment- and Control-Groups for each of the time period models (see sections 4.3 and 5.2). Thus, for both the medium- and short-term perspective, the following regression model is run separately for all available Treatment-Groups (T3/T5/T7) and Control-Groups (C2, C2hyp and/or C3, C4), once for the period before the reform ((2000)-2003-2006) and once for the period after the reform (2009-(2012)).62

$$Y_{ist} = \beta_0 + \beta_1 (\text{Individual Characteristics})_{ist} + \beta_2 (\text{Parental Characteristics})_{ist} + \beta_3 (\text{Socio-Economic Status})_{ist} + \beta_4 (\text{Family Characteristics})_{ist} + FE(\text{federal state/school})_s + \varepsilon_{ist} \tag{9}$$

where $Y_{ist} = \{stdpvread_{ist}; stdpvmath_{ist}; stdpvscie_{ist}\}$ is the outcome variable, test scores, of student $i$ in federal state $s$ at time $t$ in one of three PISA test domains (compare Table 2). For a better interpretation of the $\beta$ coefficients in equation (9), it is useful to standardize test scores so that coefficients can be read as effects in percentages of an international standard deviation in PISA test scores.63 Following Table 4 concerning the control variables, I restrict equation (9) to distinguishing about four (six) control variable sets to take into account relevant circumstances (Appendix A.2.1).
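The first step can be illustrated with a small, self-contained computation of the $R^2$ from an OLS regression of scores on circumstance dummies. This is only a sketch under simplifying assumptions: the data are made up, and sampling weights, fixed effects and plausible values are all ignored. The helper `ols_r2` is a hypothetical name, not part of any PISA tooling.

```python
def ols_r2(y, X):
    """R^2 of an OLS regression of y on the columns of X plus an intercept."""
    n = len(y)
    Z = [[1.0] + list(row) for row in X]  # prepend intercept column
    k = len(Z[0])
    # Normal equations (Z'Z) b = Z'y, stored as an augmented matrix.
    A = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(k)]
         + [sum(Z[i][a] * y[i] for i in range(n))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    beta = [A[c][k] / A[c][c] for c in range(k)]
    fitted = [sum(b * z for b, z in zip(beta, row)) for row in Z]
    ybar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Made-up standardized scores and two circumstance dummies per student:
# (female, academic household).
scores = [0.9, 0.4, 0.7, -0.2, -0.5, 0.1, -0.8, -0.6]
circumstances = [(1, 1), (0, 1), (1, 1), (0, 1),
                 (1, 0), (0, 0), (1, 0), (0, 0)]

r2_toy = ols_r2(scores, circumstances)
print(round(r2_toy, 3))
```

The resulting share of score variance explained by circumstances is the quantity that, computed separately per group and period, would enter the second step.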

62 Note that until section 7.2, I focus in notation on the main specifications, the Base models covering (2003-2012) (Appendix A.2.1). Furthermore, the $R^2$ is calculated, if applicable, over the pooled sample of data in the respective pre-reform years and then separately over the corresponding pooled post-reform sample, with the general reform time set to take effect between 2006 and 2009, as explained in section 4.
63 The PISA test scores in the 500 scale metric as described in ?? have thus been standardized to a mean of 0 and a standard deviation of 1. Note that, generally, three ways of standardizing appear to be reasonable. 1. Standardizing test scores for students in Gymnasium that are part of the representative grade-based German PISA test cohort in the respective test year (stdpvsubject2): This allows interpreting coefficients relative to the performance in each testing year. 2. Standardizing test scores with respect to the pooled sample of all students in Gymnasium that are part of the representative grade-based German PISA test cohort in any of the test years that form the whole sample (e.g. 2003, 2006, 2009, 2012 in Base-MT) (stdpvsubject3): This allows interpreting coefficients relative to the performance of students across the whole sample period. 3. Standardizing test scores with respect to the sample of students in Gymnasium that are part of the representative grade-based German PISA test cohort in a reference test year, the first year in the respective time period model (e.g. 2003 in Base-ST/MT: stdpvsubject2003): This allows interpreting coefficients relative to the performance of students in a reference year.
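The three standardization variants described in footnote 63 differ only in the reference sample that supplies the mean and standard deviation. A minimal sketch with made-up scores follows; it is unweighted and uses the population standard deviation, both of which are assumptions rather than details given in the text.

```python
from statistics import mean, pstdev


def standardize(scores, reference):
    """z-scores of `scores` using the mean and SD of `reference`."""
    m, s = mean(reference), pstdev(reference)
    return [(x - m) / s for x in scores]


# Made-up scores on the 500-point PISA metric, keyed by test year.
scores = {2003: [560.0, 580.0, 600.0], 2006: [550.0, 570.0, 590.0]}

# 1. Relative to the students tested in the same year.
by_year = {y: standardize(v, v) for y, v in scores.items()}

# 2. Relative to the pooled sample of all test years.
pooled = [x for v in scores.values() for x in v]
by_pool = {y: standardize(v, pooled) for y, v in scores.items()}

# 3. Relative to a reference year (here 2003).
by_ref = {y: standardize(v, scores[2003]) for y, v in scores.items()}

print(by_year[2006])
print(by_ref[2006])
```

Only the second and third variants preserve level differences between test years; within-year standardization sets every year's mean to zero by construction.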

Individual characteristics (IC) include the circumstance variables age and gender as well as migration background. As students were sampled based on being in the 9th grade, controlling for age takes into account differences in school entrance age and grade repetition potentially due to ability. Thus, a negative impact of age on test scores would be in line with that reasoning. Controlling for gender takes into account that subject-specific differences in test score performance between male and female students might be expected given the associated literature. For instance, Niederle and Vesterlund (2010) find that female students tend to perform better in verbal reasoning, but worse in mathematics compared to their male counterparts. Thus, a corresponding pattern for test scores could be anticipated. Migration background, a further fixed personal characteristic, has also been shown to be important in explaining the performance of students in standardized tests in Germany (Klieme et al., 2011). Migration traits appear to be on average negatively correlated with performance due to, for instance, adverse implications for non-cognitive skills such as self-esteem, motivation or aspirations. Socio-economic family background control variables include Parental Characteristics (PC) such as parental education levels, SES indicators such as the number of books in the household, and Family Characteristics (FC) such as family structure. They are included because parental education and SES might be correlated with human capital investments in children and thus with a student's academic performance. A more academically stimulating environment tends to have a positive impact on cognitive skill formation, and in that regard parental education can be assumed to constitute a circumstance capturing investments in a student's early childhood.
Similarly, favorable SES as measured by higher ISEI index values and/or more books available in a household should be expected to have a positive impact on a student's test scores, since higher SES of the family in which a student grows up may be indicative of better and easier access to support opportunities for dealing with school-related work, including preparation for tests. By contrast, growing up with a single parent or with unemployed parents might have a negative effect on test scores, because such family conditions may negatively impact skill formation and worsen access to out-of-school support due to e.g. economic constraints. Moreover, for all versions of equation (9), i.e. for each combination of time period models, T/C-Groups, both pre- and post-reform, as well as for each set of control variables, federal state fixed effects (FEs) or school-FEs can be included. State-FEs take into account time-invariant differences in the outcome variables between federal states, for instance due to variations in state-level spending on education or in school policies. It is plausible to argue that the federal state in which a student grows up and goes to school represents a circumstance variable, because it is very likely to be beyond a student's control where parents decide to live when their children reach the grade for entering secondary school, which is usually around the age of 10 (section 4.4).64 Additionally using school-FEs allows taking into account quality differences among schools (within a federal state) and controlling for other school-level circumstances.65

Concerning the second step, to identify the effect of the G-8-reform on a measure of IEOp as estimated in the first step, equation (9), one can apply a DID estimation method, since the gradual reform implementation at different points in time across federal states allows estimating the reform effect of increased learning intensity on IEOp by exploiting the differences between comparable T-/C-Groups (section 4.3). Thus, e.g. for model Base-MT, before denotes the pre-reform period (2003-2006) and after the post-reform period (2009-2012).

64 Evidence suggests that the vast majority of students do not change school during Gymnasium; moreover, as discussed in section 4.4, changing secondary school across federal states is uncommon and bureaucratically burdensome.
65 However, with school-level controls more caution may initially be required: if there are more discretionary factors in deciding which Gymnasium a student attends, the school would not be entirely beyond a student's control. Yet results do not change much using different forms of FEs, suggesting these concerns are not relevant. Note that by applying school-FEs without state-FEs, one can still control for characteristics both on school and state level (as federal states are in charge of school policies). As the PISA test is not conducted in the same schools across years, the school-FEs are wave-specific.

Then, estimating the second step with $D_{before} = (R^2)^T_{before} - (R^2)^C_{before} \neq 0$, we get:

$$DD = D(R^2)_{after} - D(R^2)_{before} = \left[(R^2)^T_{after} - (R^2)^C_{after}\right] - \left[(R^2)^T_{before} - (R^2)^C_{before}\right] \tag{10}$$

where $(R^2)^{T/C}_{after/before}$ is the $R^2$ from equation (9), which is the new dependent variable of interest, being a measure for IEOp in the respective group [T=Treatment, C=Control] before/after the reform. Now, for each of the three testing domains, $DD$ becomes the respective DID estimate of the reform effect on IEOp. Results for the main specification will be provided as DID tables illustrating equation (10) in section 7. However, one could also derive the DID estimate $DD$ in equation (10) within a regression framework. For instance, for model Base-MT, in the Treatment-Group-(T3)-Control-Group-(C2)-setting, one would get:

$$R^2_{ist} = \delta_0 + \delta_1 (TreatG8_{st} = AFTER_t \times Treat_s) + \gamma_t \times AFTER_t + \xi_s \times Treat_s + \varepsilon_{ist} \tag{11}$$

where $R^2_{ist} = \{R^2(read)_{ist}; R^2(maths)_{ist}; R^2(science)_{ist}\}$ contains the $R^2$ from equation (9) associated with student $i$ in state $s$ at test year $t$ in the respective test domain (Reading Comprehension, Maths or Sciences).

$Treat_s$ captures the Treatment-Group-specific effect and $AFTER_t$ the time trend. $TreatG8_{st}$ is the interaction term, being one if a student attends a Gymnasium in a treatment state after the implementation of the new Gymnasium-8-model. Consequently, its coefficient $\delta_1$ is the DID reform effect of interest on the outcome $R^2_{ist}$ as measure of IEOp. Thus, equation (10) and equation (11) are equivalent, i.e.:

$$\hat{\delta}_1^{OLS} = \left[(R^2)^T_{after} - (R^2)^C_{after}\right] - \left[(R^2)^T_{before} - (R^2)^C_{before}\right] = DD \tag{12}$$
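The equivalence between the difference of differences and the interaction coefficient can be checked numerically on the four group-period cells. The $R^2$ values below are purely hypothetical placeholders, not results from this paper; the sketch only shows that the saturated dummy specification reproduces every cell exactly, so that its interaction coefficient coincides with $DD$.

```python
# Hypothetical first-step R^2 values for the four group-period cells.
r2 = {("T", "before"): 0.10, ("T", "after"): 0.16,
      ("C", "before"): 0.12, ("C", "after"): 0.13}

# Difference of the treatment-control gaps, after minus before.
dd = ((r2["T", "after"] - r2["C", "after"])
      - (r2["T", "before"] - r2["C", "before"]))

# Saturated dummy model on the four cells: with four parameters and four
# cells the coefficients can be read off the cell values directly.
delta0 = r2["C", "before"]                        # control group, pre-reform
xi = r2["T", "before"] - r2["C", "before"]        # Treat coefficient
gamma = r2["C", "after"] - r2["C", "before"]      # AFTER coefficient
delta1 = r2["T", "after"] - delta0 - xi - gamma   # interaction coefficient


def fitted(treat, after):
    """Fitted cell value from the dummy specification."""
    return delta0 + xi * treat + gamma * after + delta1 * treat * after


# Zero residuals in all cells mean these coefficients minimise the sum of
# squared residuals, i.e. they are the OLS solution for the cell data.
residuals = [r2[g, p] - fitted(g == "T", p == "after") for g, p in r2]

print(dd, delta1)
```

With these placeholder numbers the treatment-control gap moves from -0.02 before the reform to 0.03 afterwards, so both routes give a DID estimate of about 0.05.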

It is important to note that this baseline regression model could be adjusted to take into account three issues. First, to allow for extrapolation of findings to the German-wide student population, the notion of external validity has to be considered (Meyer, 1995). This requires that the data sample be as representative as possible of the population of German 9th graders in Gymnasium during the school years in the time period under investigation, i.e. 2003 to 2012 in the case of Model Base-MT. For this reason, a Weighted Least Squares regression approach (WLS) adjusting the regression models with population weights should be employed (see Angrist and Pischke (2008)).66 A second issue concerns whether there might have been different unobserved "implementation" effects at the level of the federal states that imposed the reform on schools. For instance, certain school-system characteristics could have heterogeneously influenced the impact of the reform across federal states.67 Thus, one may adjust the DID regression approach by including federal state FEs that would capture any specific effects at the highest level of variation that are not captured by the DID group-specific means in equation (11).68 Generally, one might further like to test for federal-state-specific time trends.69

66 Baumert and Prenzel (2009) show that, to take into account that the sampling strategy might have led to over- or underrepresentation of certain student groups, a population weight has been calculated for each observation. Though originally referring to the PISA-E-2006 study, these authors provide a precise description of the data sampling strategy and the population weight generation procedure, which is similarly valid for the German-specific PISA-I dataset recommended by the IQB. There are two stratification steps that slightly distort the data sample with respect to the original data: first, based on the population distribution across federal states, the distribution of students across different school types and the size of standard errors at the state level in previous studies, the number of schools for each state was fixed. Then, schools were randomly selected accordingly as primary sampling units. Finally, 9th graders were randomly chosen from the selected schools.
67 In our particular case, curricula, though coordinated on a German-wide scale by the SC, can differ slightly across federal states. Taking into account the "between-federal states" variation would thus improve the reliability of estimation results.
68 As in the discussion about using FEs in the first stage, one may try to include federal state and/or school-FEs.
69 However, as Angrist and Pischke (2008) show, this requires more than three time periods, and thus a different estimation approach without pooling pre- and post-reform periods would be needed.

Thirdly, standard errors need to be adjusted, as given the sampling strategy there may be some correlation among observations in the error term. Therefore, it would be important to adjust regressions by applying robust standard errors (s.e.), for instance by clustering at the level of federal states. More generally, bootstrapping and in particular calculating s.e.'s based on the replication weights available in the PISA datasets may be useful.70
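How replication weights translate into a standard error can be sketched as follows. The sketch assumes balanced repeated replication with Fay's adjustment factor of 0.5, which is the convention the OECD documents for PISA replicate weights; whether exactly this variant applies to the IQB datasets is an assumption here. The statistic is a simple weighted mean on made-up data, whereas in the application it would be the $R^2$ or the DID estimate re-computed under each set of weights. The function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean of `values` with sampling weights `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def fay_brr_se(values, final_weights, replicate_weights, fay_k=0.5):
    """Standard error of a weighted mean from BRR replicate weights."""
    full = weighted_mean(values, final_weights)
    reps = [weighted_mean(values, w) for w in replicate_weights]
    g = len(reps)
    # Fay's method rescales the squared deviations by 1 / (G * (1 - k)^2).
    variance = sum((r - full) ** 2 for r in reps) / (g * (1.0 - fay_k) ** 2)
    return variance ** 0.5


# Made-up data: four students, one final weight and two replicate weights
# each (a real PISA file would typically carry 80 replicate weights).
values = [1.0, 2.0, 3.0, 4.0]
final_w = [1.0, 1.0, 1.0, 1.0]
replicate_w = [[1.5, 0.5, 1.5, 0.5],
               [0.5, 1.5, 0.5, 1.5]]

se_toy = fay_brr_se(values, final_w, replicate_w)
print(se_toy)
```

The same pattern applies to any estimator: compute it once with the final weights, once per replicate weight, and aggregate the squared deviations.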

Extending this model with background characteristics that are likely to be uncorrelated with the treatment effect should thus not change the size of the reform effect much. However, as Angrist and Pischke (2008) argue, including such control variables might be an opportunity to increase the precision of the regression estimates. The respective control-augmented DID regression model is thus:

$$R^2_{ist} = \delta_0 + \delta_1 (TreatG8_{st} = AFTER_t \times states_s) + \gamma_t \times year_t + \xi_s \times states_s + \alpha X_{ist} + \varepsilon_{ist} \tag{13}$$

where $X$ is the vector of student-, school- and potentially state-level variables. Now, federal state and time FEs would also be included. However, the main reform effect should remain similar to equation (11) as long as the quasi-experimental setting holds and Treatment- and Control-Groups are well chosen.

In summary, the analysis in this paper follows the described two-stage estimation procedure (i.e. equation (9) and equation (11)). I employ relevant population weights, cluster at the federal state level and include federal-state/school-FEs to derive the $R^2$ as IEOp measure. For the second stage, I follow equation (11), leaving the extended regression approach to be discussed in the context of what else could be done (section 8).

6.2 Treatment-Control-Group comparisons and Restrictions

For the quasi-experimental design of the G-8-reform to be internally valid, estimation of the reform effect on IEOp should not be biased by any serious selection of students based on either observable or unobservable pre-reform characteristics. Thus, it is important to test the plausibility of the common time trend (CTT) assumption by analyzing the pre-reform differences in observable characteristics between the Treatment- and Control-Groups that can be formed given the reform implementation procedure (Table 1).71

Following the simple means comparison approach of Imbens and Wooldridge(2009), in Table 5 I show standardized simple means comparisons for the Control Variable sets for the main Treatment- and Control- Group comparisons to be considered in the next section (section 7). The table displays means of control variables for the most comparable T-/C-Group combinations based on the discussion in section 4.3. For the main time period models to be considered, Model Base-MT and Model Base-ST, the general reform takes effect between 2006 and 2009. Thus, PISA data 2003 and 2006 constitute pre-reform periods. As the identification strategy relies on comparing the change in IEOp before and after the reform for 9th-graders attending Gymnasium across T-/C-Groups, significant observable pre-reform differences in the control variables sets (Table 4) might call the empirical strategy into question suggesting the existence of unobserved compositional pre-reform differences.72

70 Note, however, that when using the DID approach to estimate effects of the reform on test scores, the two most related working papers do not apply replication weights. Andrietti (2015) relies on clustering standard errors at the federal state level, and Huebener et al. (2016) claim that standard errors based on clustering do not differ much from bootstrapped ones.
71 Ideally, there should be no compositional changes in the two groups over time (section 4.4). But with only up to 3 time periods before and 2 after the reform, a graphical test of parallel trends is of limited help. For the purpose of this paper, therefore, I refrain from producing such graphs and restrict myself to pre-reform Treatment- and Control-Group comparisons.
72 Note that due to space constraints, pre-reform means comparisons between Treatment- and Control-Groups for each point in time are not provided in the paper. However, they reconfirm that there have been no relevant differential changes across T/C-Groups before the reform, thus supporting the validity of the estimation strategy taken (see also Table 7 in Andrietti (2015)).

Table 5: Descriptive Statistics: Pre-Reform Treatment vs. Control-Group Comparison of Control-Variables

                                Model Base-MT (2003-2012)     Model Base-ST (2003-2009)
                                      T3        C2     T3-C2        T3        C3     T3-C3        C4     T3-C4

Individual characteristics
female-dummy                       0.537     0.501     0.036     0.537     0.549    -0.011     0.543    -0.006
Age in years                      15.488    15.468     0.020    15.488    15.464     0.025    15.474     0.015
migration background (Base category: German language/both parents born in Germany)
- language spoken at home          0.054     0.043     0.011     0.054     0.055    -0.002     0.054    -0.001
- migration background             0.183     0.144    0.039*     0.183     0.175     0.008     0.184    -0.000

Parental characteristics
Parental Education (highest ISCED level):
- ISCED-level (5-6)                0.662     0.637     0.025     0.662     0.648     0.014     0.642     0.019
- ISCED-level (3-4) [Base]         0.288     0.326    -0.037     0.288     0.288     0.000     0.291    -0.003
- ISCED-level (1-2)                0.044     0.026     0.018     0.044     0.036     0.008     0.037     0.007
- missing                          0.006     0.012    -0.006     0.006     0.028 -0.023***     0.030 -0.024***

Socio-Economic Status
Number of books in household:
- + 500                            0.226     0.233    -0.008     0.226     0.246    -0.020     0.243    -0.017
- 101-500 [Base category]          0.509     0.516    -0.007     0.509     0.481    0.028*     0.489     0.020
- 11-100                           0.246     0.228     0.019     0.246     0.228     0.018     0.222    0.025*
- max. 10                          0.010     0.014    -0.004    -0.001     0.010     0.015     0.016   -0.006*
Highest ISEI-level of job in the family:
- highest ISEI-level              59.103    57.072   2.031**    59.103    58.471     0.633    58.698     0.406
- missing                          0.004     0.000     0.004     0.004     0.006    -0.002     0.009  -0.005**

Family Characteristics
family structure [Base: No]
- single parent fam.               0.137     0.141    -0.004     0.137     0.150    -0.013     0.156   -0.019*
- missing                          0.072     0.058     0.015     0.072     0.057    0.015*     0.064     0.008
family structure - employment status
Father
- full-time (FT) [Base cat.]       0.854     0.841     0.013     0.854     0.843     0.012     0.845     0.009
- part-time (PT)                   0.065     0.063     0.001     0.065     0.058     0.007     0.054     0.010
- unemployed (UE)                  0.024     0.032    -0.007     0.024     0.026    -0.001     0.026    -0.001
- out-of-labor force (OLF)         0.033     0.035     0.033     0.033     0.000    -0.005     0.034    -0.001
Mother
- full-time (FT) [Base cat.]       0.217     0.213     0.004     0.217     0.232    -0.015     0.232    -0.014
- part-time (PT)                   0.515     0.501     0.014     0.515     0.476   0.040**     0.476  0.039***
- unemployed (UE)                  0.061     0.075    -0.014     0.061     0.063    -0.002     0.061    -0.001
- out-of-labor force (OLF)         0.194     0.187     0.194     0.194     0.202    -0.008     0.204    -0.009

Number of students                 2,175       347         -     2,175     1,861         -     2,334         -

Notes: This table shows two-sample t-tests comparing, for the pre-reform period, the main control variables of the main specification considered in this paper between Treatment- and Control-Group (see sections 5.2 and 6.1). For both T3 vs. C2 in Model Base-MT and T3 vs. C3/C4 in Model Base-ST, the figures are the respective pooled averages of control variables in PISA-I-2003 and -2006. Stars denote significance of the simple mean difference in pre-reform characteristics in the form of p-values as follows: *** p<0.01; ** p<0.05; * p<0.1. Source: Author's calculation based on PISA-I-data 2003, 2006, 2009, 2012.

Table 5 shows instead that the T-/C-Groups have very similar characteristics in terms of the main control variable set used for this analysis (section 6.1). In particular, groups T3 and C2 appear to be very similar: no control variable differs at the 5% level apart from a small difference in the ISEI index measure of SES, while the number-of-books measure of SES shows no significant differences between the two groups (see columns 1-3 in Table 5). This supports the internal validity of the strategy, because T3 and C2, both consisting of Western federal states that are not city states, appear to be very comparable (compare Table A.2 and Table 1). Similarly, for Model Base-ST, when Control-Group C2 is extended by including NRW to form C3 (and additionally H to form C4), the Control-Group not only grows in sample size but also remains comparable for nearly all control variables. Since only Western federal states that are not city states are added, this is consistent with the discussion in section 4.3. By contrast, Table A.3 reveals that comparing Treatment-Group (T3) with the hypothetical Control-Group (C2hyp), consisting of those Eastern German federal states that always retained a G-8-model, is difficult, as those groups differ significantly in many control variables. This is not too surprising given that this Control-Group consists of federal states that were part of the former GDR. These federal states have, for instance, significantly lower shares of people with migration background and a significantly and persistently different composition in terms of maternal employment status: due to a better supply of childcare institutions, there has been a consistently higher fraction of full-time (FT) working mothers in Eastern compared to Western federal states.
Furthermore, Table A.3 shows that adding city states to Treatment-Group (T3) to form T5, or even adding some Eastern states to form T7, still yields relatively robust comparisons in combination with the standard Control-Groups (C2 and C3/C4). Pre-reform period comparisons nevertheless get worse, because the enlarged Treatment-Groups consist of states that are increasingly different with respect to the relevant circumstance variables in the pre-reform period (compare Table A.3). In summary, the pre-reform simple means comparison for the control variable sets (Table 5) suggests that the estimation approach outlined in section 6.1 might be valid, at least for the following T-/C-Group designs: for both Model Base-MT and Base-ST, Treatment (T3) vs. Control (C2), and additionally for Model Base-ST, Treatment (T3) vs. Control (C3) or Control (C4). The comparability of pre-reform characteristics supports the assumption of internal validity in exploiting the quasi-experimental “G-8-reform” design, as discussed in section 4.3 and ??, to estimate how the associated increase in learning intensity may have influenced IEOp (as measured by equation (9)).

7 Results

This section presents results based on the empirical strategy discussed in section 6. For the outcome variables, PISA test scores, I follow a procedure similar to Andrietti (2015): for each of the three testing domains (Reading, Mathematics and Sciences), the average of the five plausible values is standardized based on the distribution of test scores in the respective testing year within the sample of 9th grade students who attended a Gymnasium, a sample that is part of the representative German grade-based PISA-I-dataset (compare section 5.2).73 Section 7.1 illustrates and explains the first-stage results and section 7.2 the second-stage results for the main specifications (Base-ST and -MT). Section 7.3 provides some robustness checks, including different Treatment- and Control-Group compositions.
73 That is, in the rest of this paper I restrict the presentation of first-step estimation results to test scores that are standardized according to the first of 3 possible standardization procedures, as outlined in footnote 63 (stdpvsubject2).
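The standardization step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `standardize_scores`, the array layout (one row per student, five plausible values per row) and the toy data are hypothetical, and the sketch standardizes without sampling weights, which may differ from the procedure actually used for the paper.

```python
import numpy as np

def standardize_scores(plausible_values, years):
    """Average the five plausible values per student, then z-score within test year.

    Assumption: unweighted mean and population SD (ddof=0) within each year.
    """
    pv = np.asarray(plausible_values, dtype=float)  # shape (n_students, 5)
    years = np.asarray(years)
    mean_pv = pv.mean(axis=1)                       # one averaged score per student
    z = np.empty_like(mean_pv)
    for year in np.unique(years):
        mask = years == year
        scores = mean_pv[mask]
        z[mask] = (scores - scores.mean()) / scores.std(ddof=0)
    return z

# Toy data (made up): 50 students in each of two PISA waves
rng = np.random.default_rng(0)
years = np.repeat([2003, 2006], 50)
pvs = rng.normal(500, 100, size=(100, 5))
z = standardize_scores(pvs, years)
```

By construction, the standardized scores have mean zero and unit standard deviation within each testing year, so first-stage coefficients can be read in units of one SD of the Gymnasium sample for that year.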

7.1 Main Results: First Stage

Before focusing on the estimates of the reform effect on IEOp, it is useful to look at the first-stage regression (equation (7)), since this allows us to investigate the impact of control variables on standardized PISA test scores. In this paper, detailed regression outputs per test score domain are provided only for the main specification in the medium-term perspective (Base-MT), T3 vs. C2 (Table A.4, Table A.5 and Table A.6), and in the short-term perspective (Base-ST), T3 vs. C3 (Table A.7, Table A.8 and Table A.9), since these six output tables are sufficient to illustrate the main patterns of how the control variables affect test scores.74 Furthermore, all six sets of control variables that capture circumstances (see Table 4) are used jointly to derive the measure of IEOp according to equation (9) when presenting the main estimation results in section 7.2.75 Moreover, in all instances, standard errors for equation (9) are obtained by clustering at the federal state level, as the G-8-reform (treatment) was implemented at the level of federal states (which is thus the level on which one should cluster according to Bertrand et al. (2004)).76 Finally, sample/population weights are applied to these first-step regressions in order to take into account the stratified structure of the data and the representativeness of each observation (see section 5).

Starting with the Base-MT-Model, Table A.4 shows the OLS regression in terms of equation (9) for PISA Reading test scores, Table A.5 shows the same regressions for PISA Mathematics test scores, and Table A.6 provides the corresponding output table for PISA Sciences test scores. In each of these tables, the first four columns ((1)-(4)) refer to Control-Group C2 and the last four ((5)-(8)) to Treatment-Group T3. Within both groups, the first two columns refer to the "Before" reform period (2003-2006), while the last two repeat the regressions using only "After" reform (2009-2012) data. Each odd-numbered column includes only federal-state-FE; each even-numbered column additionally controls for school-FE.

Even though the main interest of equation (9) is to obtain the R² as a measure of IEOp, it is also important to check whether the estimated effects of circumstance variables on test scores match expectations. First, the only control variable whose effect on achievement scores changes direction depending on the testing domain is the gender dummy. Being female decreases a student’s achievement in the PISA Maths test by about 50-80% and in the Sciences test by about 35-65% of an international standard deviation (SD), even though the effect size declines slightly in the after-reform period across both groups. By contrast, being female increases performance in the PISA Reading test by about 30-45% of one SD. This observation is consistent with the aforementioned evidence suggesting the existence of gender-dependent achievement differences in educational outcomes (cf. section 5.2 and Niederle and Vesterlund (2010)). All other control variable estimates are fairly robust in their signs independent of the testing domain. As expected, the age effect is negative: on average, a student who is one year older than the typical 9th grader achieves test scores that are 15-25% of one SD lower. Since the school entrance age is usually six, most students should be about 15 in the 9th grade. Those who, for instance, had to repeat a grade before entering the 9th grade will be older and tend to do worse in school than peers who never repeated a grade.

74 Note that in the detailed output tables for these first-stage regressions in the Appendix, background colors are used to facilitate the detection of patterns in how control variables affect the outcome variable, test scores: controls showing consistently negative effects on test scores (across both Treatment- and Control-Groups) are highlighted in red, whereas consistently positive effects are highlighted in green. The detailed first-step regression results for the other Treatment-Control-Group specifications (e.g. for Table 7 and Table 8) are available upon request.
75 For robustness check purposes, for all main specifications and each testing domain, all results are shown adding control variables (covering circumstances) step by step: from (i) and (ii), constituting control set (I), up to set (VI), encompassing controls (i), (ii), (iii), (iv), (v), (vi) and (vii). I refer accordingly to section 7.3 and the overview of DID results in Appendix A.3.
76 Moreover, as Andrietti (2015) shows, clustering at the state level and applying a wild t-bootstrap procedure produce rather similar results. Thus, for the similar, but not identical, dataset in this paper, the clustering procedure is likely to be appropriate.

Similarly, having a migration background is associated with lower performance in all three testing domains. When additionally controlling for whether a foreign language is spoken at home, the effect shrinks as expected, but is still negative and often significant. This supports the interpretation that the degree to which a migrant student’s family is integrated, rather than migration background as such, may drive the effect on test scores. Looking more specifically at the SES of the household in which a student grows up, having more books than the base category (101-500) turns out to be positively correlated with test score performance, whereas having fewer books at home is mostly negative for test scores. Likewise, the higher the ISEI index of a job in the family, the higher the positive effect on educational outcomes as measured by test scores.77 Thus, the SES control variables tend to match the literature suggesting that higher family SES correlates with beneficial early childhood development, but also with resources for support during school age. In that regard, parental education should also be indicative of academic support opportunities, and indeed a positive impact on test scores for both Maths and Sciences can be found for the variable indicating that a student grew up in an academic household. However, the effect for Reading is insignificant. As Maths and Sciences are subjects that likely require more specific and targeted knowledge from parents for them to be able to support their children, this may explain the difference.78 Parental characteristics (PC) have less effect on test scores, though, once Individual Characteristics (IC) and SES circumstances are taken into account. Finally, family characteristics (FC) such as family structure and employment status show no clear patterns, although having a father who works part-time (PT) instead of full-time (FT) tends to be associated with lower test scores.

Repeating this exercise for the short-term perspective (Base-ST), with the preferred specification being T3 vs. C3, first-stage regressions for Reading test scores are shown in Table A.7 and for Mathematics test scores in Table A.8, whereas Table A.9 provides the corresponding output table for Sciences test scores. Again, the same general patterns as described for Model Base-MT can be observed. Females perform considerably better in Reading, but usually significantly worse in Maths and Sciences tests. Age and migration background tend to be negatively correlated with educational achievement across all tested domains. A more favorable SES family background, for instance growing up in an academic household, is associated with a positive impact on test scores. The effect of family structure on educational achievement, by contrast, remains less clear. In summary, the six output tables of the first-stage regression (equation (9)) demonstrate that for both the short- and medium-term horizon and with different Control-Group settings, most of the control variables affect PISA test scores in the expected directions. The fact that these patterns are consistent across different time period Models and Control-Group settings is reassuring in that the chosen control variables appear to represent important circumstances. It also suggests that such patterns exist for other T-C-Groups. Furthermore, these first-stage regressions yield R² measures in a range of about 15-40% across the different specifications. Consequently, the level of the IEOp measure found in this paper can be categorized as a lower bound within the range of the few available IEOp estimates for European countries, such as Ferreira and Gignoux (2014), who find based on PISA-I-2006 data that about 35% of the German test score variation can be attributed to circumstances, or Carneiro (2008), who finds about 40% IEOp for the case of Portugal. Moreover, the fact that using school-FE in addition to federal-state-FE increases the observed R² in first-stage regressions across all specifications (last rows in Table A.4, Table A.5, Table A.6, Table A.7, Table A.8, Table A.9) suggests that school-level characteristics have additional explanatory power.79

77 As the family’s highest job ISEI index level is 45 on average, an effect of 0.001 translates into about 4.5% of an international SD (0.001 × 45 = 0.045).
78 Furthermore, more highly educated parents might be more aware of the greater importance of Maths skills for labor market outcomes (Niederle and Vesterlund, 2010). However, growing up in an academic household (at least one parent with ISCED level 5-6) has a positive but mostly insignificant effect, whereas growing up in a family with low education has a mostly significant negative effect on test scores.
79 Implicitly, these FEs indeed additionally cover the effect of circumstances at the school level on test scores.

7.2 Main Results: Second Stage

We can now turn to the second stage of the estimation approach outlined in section 6.1, which applies a DID framework to the IEOp measure derived from the first-stage regressions, i.e. the R² of the OLS regression according to equation (9). This R² is the share of total variance in test scores that is accounted for by the student’s predetermined circumstance variables (compare equation (7)).
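To make the IEOp measure concrete, the following minimal sketch computes the R² of a weighted least-squares regression of test scores on circumstance variables. It is an illustration only: the function `weighted_r2`, the simulated data and the weights are hypothetical, and the sketch omits the federal-state or school fixed effects and the replication-weight standard errors used in the paper.

```python
import numpy as np

def weighted_r2(y, X, w):
    """R^2 of a weighted least-squares fit of y on X (intercept added).

    Assumption: w are non-negative sampling weights; fixed effects are omitted.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), X])      # add intercept column
    sw = np.sqrt(w)                                  # WLS via rescaled OLS
    beta, *_ = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)
    resid = y - X1 @ beta
    ybar = np.average(y, weights=w)
    ss_res = np.sum(w * resid ** 2)                  # weighted unexplained variation
    ss_tot = np.sum(w * (y - ybar) ** 2)             # weighted total variation
    return 1.0 - ss_res / ss_tot

# Simulated example (made-up data): two circumstance variables plus noise
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=(n, 2))
w = rng.uniform(0.5, 2.0, size=n)
y = 0.4 * x[:, 0] - 0.3 * x[:, 1] + rng.normal(size=n)
r2 = weighted_r2(y, x, w)                            # share explained by circumstances
r2_exact = weighted_r2(1.0 + 2.0 * x[:, 0], x, w)    # noise-free outcome
```

In this sketch, `r2` plays the role of the IEOp measure for one group and period; an outcome fully determined by circumstances (`r2_exact`) yields an R² of one.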

Starting with the Treatment-Control-Group setting T3 vs. C2 in the Base-MT-Model, results are shown in Table 6. The top panel shows DID estimates (according to equation 10) for Reading, the middle panel for Mathematics and the bottom panel for Sciences test scores. Note that results in the first column are based on federal-state-FE, whereas in the second column school-FE are additionally included. The DID table illustrates that the change in IEOp, as measured by the R² of the first-stage estimation, exhibits a common pattern across all three testing domains: IEOp has considerably increased due to the G-8-reform. That is, the share of inequality in test scores that can be attributed to circumstances has increased. With our R² measure being a lower bound of the true IEOp, one may interpret results as follows. For the domain of Reading, at least about 6-13% of the variation in test scores can be additionally attributed to circumstances beyond the control of a 9th grade student. For Maths, at least about 15%, and for Sciences, at least about 18-20% of educational outcomes can be additionally considered to constitute IEOp. Thus, given initial values of about 20% in IEOp, the simple DID estimates would correspond to a relative increase in IEOp of about 50% in response to the increased learning intensity induced by the G-8-reform. Zooming in, one can further note that for Control-Group C2, EEOp seems to have considerably improved, that is, IEOp decreased in the After reform (2009-2012) compared to the Before reform (2003-2006) period. In contrast, for Treatment-Group T3 the level of IEOp appears to have remained practically unchanged across all three domains and for both FE settings. Thus, in this T-C-Group setting, the role of circumstances remained constant in treated states with increased learning intensity, whereas in the absence of a shortened secondary school duration, EEOp tends to have improved (Table 6). It should be emphasized that the Base-MT-Model considers a medium-term horizon: not only the first affected cohorts are taken into account, but also data up to 2012, when the reform had already been fully enacted. By 2012, the double cohort had already graduated or was about to graduate from Gymnasium in most federal states (Table 1).
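The DID arithmetic on the R² measure can be retraced with the Reading entries of Table 6 (federal-state-FE). The helper `did_r2` below is a hypothetical name introduced for illustration; it simply subtracts the Control-Group change from the Treatment-Group change.

```python
def did_r2(before_t, after_t, before_c, after_c):
    """Return (change in treated, change in control, DID) for an R^2 measure."""
    change_t = after_t - before_t
    change_c = after_c - before_c
    return change_t, change_c, change_t - change_c

# Reading, federal-state-FE, values as reported in Table 6:
# T3: 0.113 (before) -> 0.129 (after); C2: 0.180 (before) -> 0.131 (after)
change_t3, change_c2, did = did_r2(0.113, 0.129, 0.180, 0.131)

print(f"Change T3: {change_t3:+.3f}")
print(f"Change C2: {change_c2:+.3f}")
print(f"DID:       {did:+.3f}")
```

The positive DID estimate here arises mainly because R² falls in C2 while it stays roughly constant in T3. For other panels, recomputing from the rounded table entries can differ from the reported DID in the third decimal.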

It is therefore interesting to see how DID results change in the T3 vs. C2 setting when conducting the same two-step estimation procedure for the Base-ST-Model, which covers only the years 2003 to 2009. Table 7 shows the short-term effects of increased learning intensity in response to the G-8-reform on IEOp, focusing on the first treated student cohorts. First, one can observe that across all testing domains and for all FE specifications, the DID estimates remain considerably positive. However, the increase in IEOp now only reaches a range of about 5-13% of educational outcome variation that can be additionally attributed to given circumstances. With the exception of Reading, the relative deterioration in EEOp is thus considerably lower than in the medium term as revealed in Model Base-MT (Table 6). Second, the underlying patterns of the reform effect remain robust in the short-term perspective. Educational acceleration tends to prevent students in T3 from benefiting from improvements in EEOp (T3 in Table 7). Ninth graders in C2, by contrast, experience more EEOp, as control variables lose explanatory power for academic achievement. In conclusion, the T3-C2 settings suggest that increased learning intensity aggravates IEOp, with the effect getting stronger for subjects like Mathematics and Sciences in the medium term.

For a better understanding of how the G-8-reform changed educational opportunities in Gymnasium, however, it is useful to analyze the T3 vs. C3 setting when evaluating Model Base-ST, as discussed in section 4.3 and 6.2. For instance, with C3 about 68% of the German student population can be considered for the short-term reform analysis. DID estimation results for this extended T3-C3-Group setting are presented in Table 8.

Table 6: Main Results: Model Base-MT (2003-2012) - T3 vs. C2 (Figure A.4)

Reading               T3 vs. C2: with Bundesland-FE       T3 vs. C2: with School-FE
                      C2        T3        Diff. (T3 - C2) C2        T3        Diff. (T3 - C2)
Before (2003-2006)    0.180     0.113     -0.067          0.242     0.173     -0.068
                      (0.054)   (0.025)   (0.060)         (0.057)   (0.032)   (0.065)
After (2009-2012)     0.131     0.129     -0.002          0.162     0.206     0.044
                      (0.033)   (0.022)   (0.040)         (0.033)   (0.023)   (0.040)
Change in R²          -0.049    0.016     0.065           -0.079    0.033     0.112
                      (0.063)   (0.033)   (0.071)         (0.066)   (0.039)   (0.077)

Mathematics           T3 vs. C2: with Bundesland-FE       T3 vs. C2: with School-FE
                      C2        T3        Diff. (T3 - C2) C2        T3        Diff. (T3 - C2)
Before (2003-2006)    0.300     0.158     -0.142          0.353     0.257     -0.097
                      (0.059)   (0.022)   (0.063)         (0.060)   (0.032)   (0.068)
After (2009-2012)     0.161     0.168     0.007           0.190     0.232     0.042
                      (0.039)   (0.025)   (0.046)         (0.039)   (0.028)   (0.048)
Change in R²          -0.139    0.010     0.150           -0.163    -0.025    0.139
                      (0.071)   (0.033)   (0.078)         (0.071)   (0.043)   (0.083)

Sciences              T3 vs. C2: with Bundesland-FE       T3 vs. C2: with School-FE
                      C2        T3        Diff. (T3 - C2) C2        T3        Diff. (T3 - C2)
Before (2003-2006)    0.295     0.133     -0.161          0.363     0.203     -0.160
                      (0.055)   (0.020)   (0.058)         (0.052)   (0.024)   (0.058)
After (2009-2012)     0.129     0.130     0.001           0.173     0.202     0.028
                      (0.037)   (0.018)   (0.041)         (0.047)   (0.024)   (0.053)
Change in R²          -0.166    -0.003    0.162           -0.189    -0.001    0.188
                      (0.066)   (0.027)   (0.071)         (0.071)   (0.034)   (0.078)

Notes: Table entries are R² measures of IEOp (equation (7)). Robust standard errors are in parentheses and were calculated using replication weights following the method explained in Appendix A.2.3, clustering at the level of federal states. DID results are estimated according to equation (13), taking into account population weights and the indicated fixed effects. Positive changes in R² indicate increasing IEOp/decreasing EEOp, and vice versa for negative changes.
Background variables used to derive R²:
(i) individual characteristics (IC) I: age and gender
(ii) individual characteristics (IC) II: language spoken at home and migration background (based on (parental) birth place)
(iii) parental characteristics (PC): highest parental education level (ISCED-level 1-2/ISCED-level 3-4/ISCED-level 5-6)
(iv) socio-economic status (SES) I: number of books in household (max. 11, 11-100, 101-500, more than 500)
(v) socio-economic status (SES) II: highest ISEI-level-index [0-90] of job in the family
(vi) family characteristics (FC) I: family structure (growing up in a single parent household?)
(vii) family characteristics (FC) II: mother/father working part-time (PT), mother/father unemployed (UE), mother/father out of labor force (OLF)
Compare Table A.4, Table A.5 and Table A.6 for more details on the first-step regression for T3/C2 according to equation (9). Source: Author’s calculation based on PISA-I-data 2003, 2006, 2009, 2012.
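As a reading aid for the standard errors in Table 6, the sketch below checks whether the reported standard errors of the changes and of the DID estimate are numerically consistent with a root-sum-of-squares combination of the cell-level standard errors. This is an observation about the table entries under the assumption of independent groups and periods; it is not a description of the replication-weight procedure in Appendix A.2.3, and `se_of_difference` is a hypothetical helper.

```python
import math

def se_of_difference(se_a, se_b):
    """SE of a difference of two estimates, assuming they are independent."""
    return math.sqrt(se_a ** 2 + se_b ** 2)

# Reading, federal-state-FE, cell-level SEs as reported in Table 6
se_change_c2 = se_of_difference(0.054, 0.033)   # C2: before vs. after
se_change_t3 = se_of_difference(0.025, 0.022)   # T3: before vs. after
se_did = se_of_difference(se_change_c2, se_change_t3)

print(f"SE change C2: {se_change_c2:.4f}  (table: 0.063)")
print(f"SE change T3: {se_change_t3:.4f}  (table: 0.033)")
print(f"SE DID:       {se_did:.4f}  (table: 0.071)")
```

The recomputed values agree with the table up to the rounding of the published inputs.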

Table 7: Results: Model Base-ST (2003-2009) vs. Base-MT (2003-2012): T3-C2 (Figure A.4/Figure A.5)

Reading               Base-ST (2003-2009), school-FE      Base-MT (2003-2012), school-FE
                      C2        T3        Diff. (T3 - C2) C2        T3        Diff. (T3 - C2)
Before Reform         0.242     0.173     -0.068          0.242     0.173     -0.068
                      (0.057)   (0.032)   (0.065)         (0.057)   (0.032)   (0.065)
After Reform          0.161     0.196     0.036           0.162     0.206     0.044
                      (0.060)   (0.037)   (0.071)         (0.033)   (0.023)   (0.040)
Change in R²          -0.081    0.023     0.104           -0.079    0.033     0.112
                      (0.083)   (0.048)   (0.096)         (0.066)   (0.039)   (0.077)

Mathematics           Base-ST (2003-2009), school-FE      Base-MT (2003-2012), school-FE
                      C2        T3        Diff. (T3 - C2) C2        T3        Diff. (T3 - C2)
Before Reform         0.353     0.257     -0.097          0.353     0.257     -0.097
                      (0.060)   (0.032)   (0.068)         (0.060)   (0.032)   (0.068)
After Reform          0.270     0.223     -0.047          0.190     0.232     0.042
                      (0.073)   (0.041)   (0.084)         (0.039)   (0.028)   (0.048)
Change in R²          -0.084    -0.034    0.050           -0.163    -0.025    0.139
                      (0.094)   (0.052)   (0.108)         (0.071)   (0.043)   (0.083)

Sciences              Base-ST (2003-2009), school-FE      Base-MT (2003-2012), school-FE
                      C2        T3        Diff. (T3 - C2) C2        T3        Diff. (T3 - C2)
Before Reform         0.363     0.203     -0.160          0.363     0.203     -0.160
                      (0.052)   (0.024)   (0.058)         (0.052)   (0.024)   (0.058)
After Reform          0.257     0.195     -0.062          0.173     0.202     0.028
                      (0.067)   (0.033)   (0.075)         (0.047)   (0.024)   (0.053)
Change in R²          -0.106    -0.008    0.098           -0.189    -0.001    0.188
                      (0.085)   (0.041)   (0.095)         (0.071)   (0.034)   (0.078)

Notes: Table entries are R² measures of IEOp (equation (7)). Robust standard errors are in parentheses and were calculated using replication weights following the method explained in Appendix A.2.3, clustering at the level of federal states. DID results are estimated according to equation (13), taking into account population weights and the indicated fixed effects. Positive changes in R² indicate increasing IEOp/decreasing EEOp, and vice versa for negative changes.
Background variables used to derive R²:
(i) individual characteristics (IC) I: age and gender
(ii) individual characteristics (IC) II: language spoken at home and migration background (based on (parental) birth place)
(iii) parental characteristics (PC): highest parental education level (ISCED-level 1-2/ISCED-level 3-4/ISCED-level 5-6)
(iv) socio-economic status (SES) I: number of books in household (max. 11, 11-100, 101-500, more than 500)
(v) socio-economic status (SES) II: highest ISEI-level-index [0-90] of job in the family
(vi) family characteristics (FC) I: family structure (growing up in a single parent household?)
(vii) family characteristics (FC) II: mother/father working part-time (PT), mother/father unemployed (UE), mother/father out of labor force (OLF)
Note: The first-step regression for T3/C2 is done according to equation (9); for more details, see Table A.4, Table A.5 and Table A.6 for Model Base-MT (2003-2012) and Table A.7, Table A.8 and Table A.9 for Model Base-ST (2003-2009). Source: Author’s calculation based on PISA-I-data 2003, 2006, 2009, 2012.

Interestingly, the short-term reform effect vanishes across all three testing domains and for both FE settings, i.e. there appears to be no change in EEOp or IEOp in response to the G-8-reform. However, the levels of the R² measures remain similar, ranging between 12.5% and 27% and increasing in order from Reading to Sciences to Maths.

Zooming into the federal-state-FE setting, students in both groups appear to experience similar slight increases in IEOp, such that in total the DID effect cancels out.

Table 8: Main Results: Model Base-ST (2003-2009) - T3 vs. C3/C4 (Figure A.5)

Reading               T3 vs. C3: with School-FE           T3 vs. C4: with School-FE
                      C3        T3        Diff. (T3 - C3) C4        T3        Diff. (T3 - C4)
Before (2003-2006)    0.155     0.173     0.019           0.175     0.173     -0.002
                      (0.038)   (0.032)   (0.050)         (0.029)   (0.032)   (0.043)
After (2009)          0.183     0.196     0.013           0.179     0.196     0.018
                      (0.030)   (0.037)   (0.047)         (0.021)   (0.037)   (0.042)
Change in R²          0.029     0.023     -0.006          0.004     0.023     0.019
                      (0.049)   (0.048)   (0.069)         (0.035)   (0.048)   (0.060)

Mathematics           T3 vs. C3: with School-FE           T3 vs. C4: with School-FE
                      C3        T3        Diff. (T3 - C3) C4        T3        Diff. (T3 - C4)
Before (2003-2006)    0.186     0.257     0.071           0.243     0.257     0.014
                      (0.086)   (0.032)   (0.092)         (0.031)   (0.032)   (0.045)
After (2009)          0.233     0.223     -0.010          0.222     0.223     0.001
                      (0.033)   (0.041)   (0.053)         (0.027)   (0.041)   (0.049)
Change in R²          0.047     -0.034    -0.080          -0.020    -0.034    -0.013
                      (0.092)   (0.052)   (0.106)         (0.041)   (0.052)   (0.066)

Sciences              T3 vs. C3: with School-FE           T3 vs. C4: with School-FE
                      C3        T3        Diff. (T3 - C3) C4        T3        Diff. (T3 - C4)
Before (2003-2006)    0.191     0.203     0.012           0.214     0.203     -0.011
                      (0.041)   (0.024)   (0.048)         (0.021)   (0.024)   (0.032)
After (2009)          0.215     0.195     -0.020          0.195     0.195     0.000
                      (0.039)   (0.033)   (0.051)         (0.030)   (0.033)   (0.045)
Change in R²          0.024     -0.008    -0.032          -0.020    -0.008    0.011
                      (0.056)   (0.041)   (0.070)         (0.037)   (0.041)   (0.055)

Notes: Table entries are R² measures of IEOp (equation (7)). Robust standard errors are in parentheses and were calculated using replication weights following the method explained in Appendix A.2.3, clustering at the level of federal states. DID results are estimated according to equation (13), taking into account population weights and school fixed effects. Positive changes in R² indicate increasing IEOp/decreasing EEOp, and vice versa for negative changes.
Background variables used to derive R²:
(i) individual characteristics (IC) I: age and gender
(ii) individual characteristics (IC) II: language spoken at home and migration background (based on (parental) birth place)
(iii) parental characteristics (PC): highest parental education level (ISCED-level 1-2/ISCED-level 3-4/ISCED-level 5-6)
(iv) socio-economic status (SES) I: number of books in household (max. 11, 11-100, 101-500, more than 500)
(v) socio-economic status (SES) II: highest ISEI-level-index [0-90] of job in the family
(vi) family characteristics (FC) I: family structure (growing up in a single parent household?)
(vii) family characteristics (FC) II: mother/father working part-time (PT), mother/father unemployed (UE), mother/father out of labor force (OLF)
Compare Table A.7, Table A.8 and Table A.9 for more details on the first-step regression for T3 vs. C3 according to equation (9). Due to space constraints, first-step regressions for T3 vs. C4 have been omitted, but they remain available upon request from the author. Source: Author’s calculation based on PISA-I-data 2003, 2006, 2009.

For the school-FE setting, the level of the IEOp measure is generally higher. Beyond that, the DID changes in R² are very small and mostly even slightly negative, ranging from -4% to +0.3%. In total, the DID effects remain close to zero (in some specifications they may even indicate tiny improvements in EEOp). To shed further light on Model Base-ST, one can repeat the exercise by extending the Control-Group to include both North-Rhine-Westphalia (NRW) and Hesse (H). With the main Treatment-Control-Group specification now being T3/C4, the DID results for the reform effect are presented in Table 8.

The DID estimation results for the effect of increased learning intensity on IEOp in response to the G-8-reform are very similar in the T3 vs. C3 and T3 vs. C4 settings (Table 8): these specifications suggest that there has been no short-term effect on IEOp in response to the reform. In summary, the impact of the reform on IEOp appears to depend on the time period Model used. Focusing on the short-term effects by including only one post-reform period,80 the so-called Model Base-ST (2003-2009), increased learning intensity does not appear to change the percentage of test score variation that can be explained by circumstances beyond a student’s control, i.e. unfair inequality (compare Table 8). However, once we narrow the Control-Group to include only federal states that did not plan to generally shorten the duration of their G-9-model Gymnasium, a considerable increase in IEOp of about 5-15% in terms of additional R² is observable in Model Base-ST as well (compare Table 7). Furthermore, taking a medium-term perspective on the G-8-reform by including the latest available PISA-I-2012 data in the so-called Model Base-MT, the two-step DID estimation approach only allows us to consider T3 vs. C2, which yields considerably stronger increases in IEOp, about double the size of those in the short term (compare Table 6). Consequently, for the most appropriate (clean) but somewhat limited settings relying on Control-Group C2, the findings suggest that the intensification of education induced by the reform may have considerably worsened EEOp for German students in Gymnasium. Moreover, this negative effect on EEOp tends to intensify in the medium term and may be lasting.81

Although it is beyond the scope of this paper to detect the exact mechanisms underlying the observed findings from the available data, intuitive explanations can at least be hypothesized based on two more fundamental drivers that are important beyond this educational context. This is because the key concept of EEOp or IEOp in this paper is closely related to the issue of social mobility (section 2). The connection between both concepts can be characterized by two adjoint forces, upward and downward social mobility. An increase in EEOp would be indicative of improved upward mobility if it means that circumstances, such as the SES of the family in which one grows up, become less important for a student’s academic performance. In other words, if increasing EEOp translates into more equal learning conditions such that ability, and in particular effort, are rewarded, extending EEOp would be welfare enhancing in a society with meritocratic preferences. However, while rising EEOp may lead to upward social mobility of high-performing students from disadvantaged backgrounds, it may also lead to downward social mobility of students with beneficial circumstances who lack the talent and/or effort to maintain their position as soon as circumstances become less important in determining a student’s educational outcome.

80Recall from the discussion in section 4.3 that this is also the only possibility to form relatively large Control-Groups including NRW (C3) and even H (C4); the latter only under the assumption that the 10% of 9th graders in H in 2008/2009 who are already in the G-8-model do not bias results. The motivation for having larger Control-Groups that, together with the Treatment-Group, cover most German students in Gymnasium is that this would strengthen external validity, enabling this analysis to draw conclusions about the reform’s impact on IEOp for the whole German school system.
81However, one would need to consider later data points to make claims about how long-lasting or potentially permanent the effect may be. As can be seen from the Status Quo observation of the reform in the different federal states (Table A.2), the more one shifts attention to student cohorts that are far away from the first treated ones, the more one needs to take into account any curricular adjustments and additional reforms undertaken in response to the initial Gymnasium-8-reform.

Returning to the Gymnasium-8-reform, this may allow the following explanation for the observed findings. First, the fact that increased learning intensity had only a limited impact on IEOp in the short run may indicate that the reform heterogeneously promoted both downward mobility among students with advantageous circumstances and upward mobility among those with disadvantaged circumstances who, having managed to enter Gymnasium, may already have undergone a harder selection process.82 As the reform implementation process suggests, the reform and the resulting increase in learning intensity surprised the affected students and their environment (e.g. parents), so that they could not adapt immediately. That is, the first cohort confronted with the newly intensified system finds it harder to adapt, as it cannot easily rely on the experiences of older students, as is the case for later cohorts in the new G-8-model. This may explain why IEOp increased only moderately or not at all in the short term. Thus, in the initial reform period, the lag with which favorable circumstances can adapt to support a student may imply that downward rather than upward mobility forces were more relevant for the first affected student cohorts. Second, in the medium term, after favorable circumstances had time to adapt and provide support to the associated students, upward mobility would be lessened in conjunction with downward mobility. For instance, parents may be more aware of and better prepared to deal with the increased requirements in a Gymnasium-8-model, and new forms of additional professional tuition services may become available in response to the reform, based on the experiences of the first affected cohorts. Consequently, favorable circumstances may then provide quicker, easier and better access to a support system that helps to deal with the higher learning intensity. Then, increased IEOp associated with lower upward rather than higher downward mobility may be expected in the medium term after the G-8-reform had been enacted. For this explanation to be valid, it would be important to show that this paper’s IEOp measure increases more strongly in a situation where upward mobility dominates downward mobility and that, vice versa, a situation where downward mobility dominates upward mobility is associated with a more moderate increase in IEOp. Finding answers to how the upward and downward mobility forces corresponding to different circumstances are influenced by the reform, and how their importance for educational outcomes changes over time, is beyond the scope of this paper.

Moreover, looking across the short-term evidence for Model Base-ST (Table 7, Table 8), the DID estimates of the effect of increased learning intensity on IEOp also reveal some subject-related patterns. The level of IEOp is consistently higher for both Mathematics and Sciences than for Reading across all Treatment-Control-Group specifications. This observation could be interpreted as evidence in favor of subject-dependent differences in curricular flexibility. In this context, reading skills comprise more general competencies that are learnt not only in language-related courses at school, but also indirectly in other school courses as well as in everyday life, reading often being a necessary prerequisite to comprehend, learn or interact with other people. Consequently, variations in learning intensity might have less influence on reading skills. In contrast, Mathematics and Sciences can be regarded as requiring more specific skills, which are mainly accumulated through taught courses at school and are less likely to be learnt indirectly through other courses or in everyday life. Thus, given that Mathematics and Sciences are testing domains that require rather complementary skill sets, it seems plausible that positive circumstances, such as growing up in an academic household, are relatively more important there than in the domain of Reading, since beneficial resources that improve the accumulation of skills relevant for Maths/Sciences tend to be more exclusive than those useful for Reading. Migration background, however, may rather affect Reading.

82The high correlation between parental education and the probability of entering Gymnasium has been shown in many studies to be persistent in the German school system for at least the last two decades, e.g. by Klieme et al. (2011).

On the other hand, given the broad definition of learning intensity, one may argue that this is still compatible with findings that the G-8-reform per se had smaller and sometimes even positive effects on Mathematics and Sciences test scores in contrast to Reading scores (Andrietti, 2015; Camarero Garcia, 2012; Huebener et al., 2016; Büttner and Thomsen, 2015), since the increased teaching to compensate for the reduced school duration appears to have fulfilled its purpose. However, it remains open to what degree the differential, testing-domain-dependent reform impact can be attributed either to the additional content already learned by the time of the test compared to pre-reform cohorts or to more efficient teaching and learning processes due to the changed structure. Nevertheless, these findings are remarkable, as they partly contradict the perceived negative impact on achievement emphasized by critics of the reform. Given that the implementation for the first affected cohorts did not adjust teaching-related quality factors, the positive findings might even be regarded as only a lower bound for the effects of increased learning intensity on performance, should the implementation in addition adjust curricular content and teaching structures accordingly and thus improve teaching efficiency within the new school duration framework. However, the fact that the impact on reading skills is less pronounced could also be of interest. It might raise the question of whether, if the aim is to improve reading skills, current curricula and teaching methods need to be adjusted, given that the relatively lower reform impact signals that even additional schooling and increased intensity had no positive impact on reading skills. Alternatively, it may indicate that the additional reading practice that comes with additional teaching only offsets the negative impact of increased intensity on the actual learning process, which would be another potential part of the explanation of why IEOp levels for the domain of Reading may be less pronounced than in the other domains. In any case, more research is needed to investigate the links between testing domains and learning intensity, how this translates into educational achievements, and what implications can be derived with respect to IEOp (compare Crawford, Johnson, Machin, and Vignoles (2011)).

To conclude, the findings show that increased learning intensity induced by the G-8-reform did not improve EEOp as defined in section 3. On the contrary, in the short term the reform appears to have rather increased IEOp, in particular when the Control-Group consists of federal states that decided to keep the G-9-model, and in the medium term the increases in IEOp tend to have been even stronger (compare Table 7). However, it is beyond the scope of this paper to detect the underlying channels and mechanisms explaining these findings and their connection with both upward and downward mobility. Nevertheless, the evidence raises relevant questions that need to be answered for a better understanding of how IEOp works.

7.3 Robustness Checks

In this section, I give a brief overview of some tests that illustrate how robust the results presented in section 7.1 are. I focus on three margins of interest. How do the findings change depending on which of the six available control variable sets are included in the first-step regression for deriving the R2 measure? How do the DID results change when extending the Treatment-Group? Finally, some observations based on a hypothetical Control-Group, C2hyp, i.e. states that always remained in a Gymnasium-8-model, are presented. The main output tables for the robustness checks are provided in the Supplementary Tables part of Appendix A.3: Table A.11, Table A.12, Table A.13, Table A.14, Table A.15, Table A.16, Table A.17 and Table A.18. All of these tables are structured in the same way to provide an overview of the DID estimation results for the effect of increased learning intensity, as induced by the G-8-reform, on IEOp. That is, they provide the DD estimates of equation (12) in various dimensions, as explained in the following.

First, each table shows results for just one time period Model (e.g. Model Base-MT (2003-2012)) and one Control-Group (e.g. C2). Second, in each overview table, results are provided for all three testing domains in the order Reading, Mathematics and Sciences. Third, for each combination of T-C-Group and testing domain, six rows of results are provided, as indicated by column (5), the Control-set. That is, Control-set 1 provides results based on deriving the R2 measure of IEOp by including only IC control variables (i.e. (i) and (ii), as illustrated e.g. in Table A.5). Then, step by step, additional control variables following Table 4 are added until, in set 6, all available control variables are applied together in the first-stage regression.83 Finally, in each table row, four versions of the DID reform estimate are presented: column (6) provides the standard R2 measure based DID estimate that only takes into account federal state-FE; column (8) presents this R2 measure based DID estimate but additionally takes school-FE into account.84 Columns (7) and (9) provide the corresponding DID estimates relying on adjusted R2 measures.85

First, to understand better how robust the DID results remain when changing the number of control variables chosen to cover predetermined circumstances that are relevant in the educational context of this paper, it is useful to analyze how in particular the DID estimates based on adjusted R2 measures behave (see columns (7) and (9) in Table A.11, Table A.12, Table A.13 and Table A.14). Recalling that the control variables in this estimation framework are important to derive the measure of IEOp according to equation (9), the adjusted R2 can help to detect which Control-set combination appears to have the most explanatory power among the available circumstances variables (cf. Table 4). Looking across the DID result tables, one may conclude that including IC, PC and SES as circumstances variables may be optimal among the six control variable sets. However, the analysis across different Control-sets also reveals that, for each testing domain, the final reform estimate of the effect of increased learning intensity on IEOp does not change much across Control-sets 3 to 6. This reconfirms that the DID strategy used for the main results, which relies on all available control variables (Control-set 6) in the first stage, is appropriate, as it does not yield estimates that deviate significantly from those of the Control-set combination generating the highest adjusted R2. Moreover, regression patterns stay robust in size and direction independently of which Control-set enters the first-stage regression for deriving the IEOp measures. This is supporting evidence in favor of the quasi-experimental design assumption that assignment to treatment occurred without selection on observables, but rather randomly.86
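To make the role of the adjusted R2 in this comparison concrete, the following is a minimal Python sketch and not the code used for this paper. It assumes the standard adjustment formula 1 - (1 - R2)(n - 1)/(n - k - 1); the function names, the sample size and the R2 values per Control-set are hypothetical and chosen purely for illustration.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Standard adjusted R^2 for an OLS fit with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)


def best_control_set(fits, n_obs):
    """Return the label of the control set with the highest adjusted R^2.

    `fits` maps a control-set label to (R^2, number of regressors).
    """
    return max(fits, key=lambda s: adjusted_r2(fits[s][0], n_obs, fits[s][1]))


# Hypothetical first-step results for six nested control sets (NOT from the paper).
hypothetical_fits = {
    1: (0.1000, 4),
    2: (0.1500, 8),
    3: (0.2000, 12),
    4: (0.2005, 20),
    5: (0.2008, 30),
    6: (0.2010, 45),
}
best = best_control_set(hypothetical_fits, n_obs=5000)
```

In this made-up example the raw R2 keeps rising from set 3 to set 6, while the adjusted R2 peaks at set 3 because the additional regressors add almost no explanatory power.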

Placebo tests that artificially set the reform to take effect between 2003 and 2006 reconfirm the robustness of the results, as no significant effects can be detected for the main MT-/ST-regression settings (Table A.10).
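The logic of such a placebo test can be sketched as follows. This is an illustrative simplification only: it uses a plain 2x2 contrast of group-level R2 values rather than the regression in the paper, and the function name and all numbers are hypothetical.

```python
def did_estimate(r2, treated, control, pre, post):
    """2x2 difference-in-differences of an R^2-based IEOp measure."""
    change_treated = r2[(treated, post)] - r2[(treated, pre)]
    change_control = r2[(control, post)] - r2[(control, pre)]
    return change_treated - change_control


# Hypothetical R^2 values by (group, PISA wave); NOT taken from the paper.
r2_by_cell = {
    ("T3", 2003): 0.20, ("T3", 2006): 0.21, ("T3", 2009): 0.27,
    ("C2", 2003): 0.18, ("C2", 2006): 0.19, ("C2", 2009): 0.20,
}

# Placebo: pretend the reform took effect between 2003 and 2006,
# so that only pre-reform waves enter the comparison.
placebo = did_estimate(r2_by_cell, "T3", "C2", pre=2003, post=2006)

# Toy short-term comparison around the actual reform: 2006 vs. 2009.
actual = did_estimate(r2_by_cell, "T3", "C2", pre=2006, post=2009)
```

If both groups followed parallel trends before the reform, the placebo contrast should be close to zero, as it is in this constructed example, whereas the contrast around the actual reform date need not be.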

Second, to better understand the potential implications of this paper’s results beyond the T3 vs. C2 setting, it is interesting to repeat the estimation framework with extended Treatment-Groups in order to investigate external validity. Therefore, all regressions discussed in section 7.1 have been rerun with Treatment-Group T5, which additionally includes the two Western city states Hamburg (HB) and Bremen (BR), and with Treatment-Group T7, which is T5 plus Berlin (BE) and Brandenburg (BB). As the Treatment-Group grows, the DID reform effects become smaller on average: e.g. in the regression settings with Control-Group C2 (Table A.11 and Table A.12), both in Model Base-MT and Model Base-ST, the increasing effect on IEOp declines as we move from T3 to T5 to T7, consistently within each testing domain and across all Control-sets. However, the general direction of the reform effect as found in section 7.1 remains. This is also the case when looking into Model Base-ST regressions with respect

83Note that Control-set 6 constitutes the specification used for all first-stage regressions of the main results in this paper.
84Conducting multi-level regressions confirms that school-level circumstances are indeed already accounted for by school-FEs.
85Note: green background colors indicate that EEOp improves (as R2 changes are negative), or in other words that IEOp decreases; vice versa, red background colors show that EEOp deteriorates, or in other words that IEOp increases.
86Otherwise, the choice of which Control-set to use might bias results if control variables differed significantly across groups.

to C3 (Table A.13) or C4 (Table A.14), where the zero effect is also reconfirmed for the enlarged Treatment-Groups. In summary, even with larger treatment groups, despite their increasingly heterogeneous composition, the main results in terms of direction and size of the DID estimates stay robust. This supports the potential external validity, in the context of the German school system, of the results derived from the carefully chosen Treatment-and-Control-Group settings in the previous section. In other words, focusing on the more convincing T3-Group for the reasons explained in sections 5 and 6 does not mean that the associated results no longer carry implications that are likely valid for the whole German school system.

Third, as mentioned in sections 4.3 and 5.2, one could imagine comparing Treatment-Groups that change Gymnasium from a G-9-model to a G-8-model with the hypothetical Control-Group C2hyp. However, this group does not seem to pass the pre-reform comparison test (Table 5), and thus any results based on it should be interpreted very cautiously. Nevertheless, for completeness, overview DID estimation results are provided in Table A.15 for Model Base-MT and Table A.16 for Model Base-ST. Interestingly, when using C2hyp as Control-Group, the DID results change, and in all specifications a reform-induced decrease in IEOp, or in other words an improvement in EEOp, can be found. This effect is strongest in the short term (Model Base-ST), but largely vanishes over the medium-term horizon of Model Base-MT. In that regard, at least the pattern is consistent with the regular Control-Groups. That is, also with the hypothetical Control-Group, the reform effect tends in relative terms to shift towards more IEOp or less EEOp in the longer term compared to the short term. However, in the case of the hypothetical Control-Group, the DID reform estimate is negative in Model Base-ST and becomes less negative in Model Base-MT, whereas with respect to e.g. C2 the effect is already positive in Model Base-ST and becomes even stronger in Model Base-MT.

Finally, to see what happens when PISA-E-2000 data are also taken into account, I rerun the main two-step estimation framework for Model Full-MT (2000-2012) with Control-Group C2 (Table A.17) and, for completeness, also with Control-Group C2hyp (Table A.18). Reassuringly, the results resemble in both cases the DID estimation for time period Model Base-MT (compare Table A.11 for C2 and Table A.15 for C2hyp). That is, in the case of C2, the slightly increasing or likely zero impact of increased learning intensity induced by the G-8-reform on IEOp is also found for the full time period Model Full-MT, which extends Model Base-MT by an additional pre-reform period. Likewise, for C2hyp, the slightly negative or likely zero impact on IEOp observed in Model Base-MT is also found in Model Full-MT. In conclusion, the results based on Model Base-MT and the most convincing settings in section 7.2 remain valid for a slightly broader time period setting.87

8 Conclusion

The goal of this paper is to shed light on how equality of opportunity in education (EEOp or, respectively, IEOp) may be shaped by the recent trend to accelerate and intensify the educational process. To conclude, let me summarize what this paper has been able to find and achieve. As outlined in section 3, this study aims to contribute to various strands of the literature. First, it tries to contribute to the still limited literature on measuring EOp with respect to educational outcomes (compare section 3.1 and 3) by adding evidence on how EEOp changed over time in Germany. In doing so, this paper focuses on the German secondary school system or, to be more precise, on the academic track in secondary school, Gymnasium (compare section 4.1). By exploiting the shortening of its duration from 9 to 8 years,

87Further robustness checks involve the analysis for states reforming earlier, in 2006 (SL, MWP, ST), or later, 2012 (NRW, H).

known as the Gymnasium-8-reform, and its gradual implementation process across German federal states, a quasi-experimental estimation design can be applied in order to find out whether, and if so how, the increase in learning intensity induced by this reform affected IEOp for students. Therefore, this paper also contributes to the literature on the evaluation of this very controversial and still debated German school reform. While a few studies have tried to detect the direct effect of the G-8-reform on (PISA) test scores (Andrietti, 2015; Büttner and Thomsen, 2015; Huebener et al., 2016), they do not focus on the question of whether and how the increase in learning intensity induced by this reform may have changed EEOp (compare section 3.2). Thus, this preliminary analysis tries to shift the emphasis in the evaluation of the G-8-reform to distributional concerns, i.e. its consequences for EEOp. This may be of policy relevance in the political debate on how to design secondary school in Germany. Beyond that, it may be of more general interest because, as Ramos and Van de gaer (2015) point out, knowledge of how institutions influence EEOp is still very limited. Thus, this paper tries to link the literature that evaluates institutional aspects of schooling in order to understand their impact on individual achievements (section 3.2) with the strand of literature that, starting from how one may model and define equality of opportunity at all, focuses on measuring IEOp (section 3.1). Regarding the empirical strategy used for the analysis in this paper (section 6), a two-stage estimation procedure (i.e. equation (9) and equation (11)) is used in order to derive a measure for IEOp,

$\hat{\theta}_{IOP}$ in equation (7) (compare section 3 for all details on the measurement of EEOp). As a first step, the R2 of an OLS regression of the main outcome data used in this work, the German-specific PISA-I-data, on circumstances variables according to equation (9) can serve as the IEOp measure (compare section 5 for all information on the PISA-I-data). For the second stage, I follow equation (11), which is the earlier mentioned DID estimation with Treatment-and-Control-Groups chosen according to the G-8-reform implementation setting (Table A.2) and the available data restrictions (section 5.1 and 5.2). Thereby, this is one of the first papers to combine an evaluation of the G-8-reform based on its quasi-experimental setting with PISA test data that are comparable across time in order to analyze how learning intensity affects IEOp.
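The two steps summarized above can be illustrated with a small, self-contained Python sketch. It is not the estimation code behind this paper: the data are simulated rather than PISA data, there is a single hypothetical circumstance variable, and the second step is reduced to a plain 2x2 difference-in-differences contrast instead of the regression with fixed effects in equation (11). All names and parameter values are assumptions made for illustration.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(scores, circumstances):
    """Step 1: R^2 of an OLS regression of scores on circumstances plus intercept."""
    rows = [[1.0] + list(c) for c in circumstances]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(scores) / len(scores)
    ss_res = sum((y - f) ** 2 for y, f in zip(scores, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in scores)
    return 1.0 - ss_res / ss_tot


def simulate_cell(rng, n, slope):
    """Toy data: one circumstance variable whose weight in the score is `slope`."""
    circ = [[rng.gauss(0.0, 1.0)] for _ in range(n)]
    scores = [500.0 + slope * c[0] + rng.gauss(0.0, 80.0) for c in circ]
    return scores, circ


rng = random.Random(0)
# Hypothetical slopes: circumstances matter more only for the treated group after the reform.
slopes = {("T", "pre"): 30.0, ("T", "post"): 50.0,
          ("C", "pre"): 30.0, ("C", "post"): 30.0}
ieop = {cell: r_squared(*simulate_cell(rng, 2000, s)) for cell, s in slopes.items()}

# Step 2: 2x2 DID contrast on the cell-level IEOp measures.
did = (ieop[("T", "post")] - ieop[("T", "pre")]) - (ieop[("C", "post")] - ieop[("C", "pre")])
```

Because the simulated weight of the circumstance variable rises only in the treated post-reform cell, the resulting contrast is positive, which corresponds to an increase in IEOp in the terminology used here.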

The results of this two-step DID estimation procedure (section 7.1) show that increased learning intensity induced by the G-8-reform did not improve EEOp as defined in section 2. On the contrary, in the short term the reform rather seems to have increased IEOp, in particular with respect to Control-Groups consisting of federal states that decided to keep the G-9-model Gymnasium, and in the medium-term perspective the increases in IEOp tend to have been even stronger (compare e.g. Table 7). Moreover, the results provide some evidence in favor of the existence of subject-dependent curricular flexibilities, with Maths/Sciences being more inflexible and thus more responsive to changes in curricular intensity than Reading (section 7.1). While the main focus of this paper has been to exploit the G-8-reform to derive estimates of how intensified instruction may affect IEOp in Germany, understanding the relationship between varying instruction time or learning intensity and EEOp may be of broader interest. How does intensifying the educational process affect IEOp at the school level? Are there heterogeneous treatment effects for different subgroups of students? What are the implications for social mobility, given that education is the main vehicle for career paths and opportunities to climb the social ladder? Clearly, the evidence raises relevant questions that need to be answered for a better understanding of how IEOp works and before policy recommendations can be drawn, first with respect to the design of the secondary school system in Germany, but possibly also as more general lessons. However, it is beyond the scope of this paper to detect exactly the underlying channels and mechanisms explaining these findings and their relationship to both upward and downward mobility.

To conclude, let me zoom slightly out of the narrow context of the G-8-reform to emphasize two broader issues that this paper’s topic touches on. First, as the possible interpretation of the preliminary results in section 7.1 suggests, the mechanism through which IEOp and social mobility interact is likely to be very important for understanding phenomena such as the high persistence in intergenerational educational achievement that has been observed in Germany for at least the last two decades.88 In the general debate, the fact that social mobility consists of both an upward and a downward component seems to be neglected, in the sense that the focus appears to be on how to improve upward mobility, ignoring that this cannot be discussed independently of removing rigidities that may limit downward mobility (cf. Figure A.1). Thus, it may be important to understand in more detail the effects of compressing education on social mobility, i.e. EEOp.89

Second, the factor of time compression in the context of education appears to have been largely neglected so far, and more research on this topic is needed. Education policies consider changes on this margin but, as the example of the Gymnasium-8-reform shows, such changes may have unintended and underestimated welfare costs, since intensity is a key factor in the design of an educational process or system, with implications for both the effectiveness and the efficiency of (non)cognitive skill formation. Understanding better how schooling duration, and in particular intensity, relates to EEOp would therefore also be important for shedding more light on the circumstances under which Signaling or Human Capital Theory, respectively, may be more important for evaluating the welfare benefits and costs of investments in the educational system.

Recent studies show that the costs associated with the misallocation of talent may be considerable (Rossi, 2016). Thus, it is economically desirable to achieve more EEOp. This paper shows that one policy margin that has so far often been neglected involves implementing an appropriate level of educational intensity, taking into account not only efficiency considerations but also effects on equal access to resources, in order to strive towards a fair and meritocratic educational system. Taking stock of this discussion, the paper shows that circumstances matter at school, emphasizing the relevance of variations in learning intensity for EEOp (see also Philippis and Rossi (2016)). Future research should aim at understanding the mechanisms shaping IEOp and its link to social mobility, which may then allow assessing the welfare effects of IEOp with respect to its impact on political stability, and thereby allow evaluating new policy recommendations aimed at changing recent trends.

88Thus, it is interesting to investigate whether increasing learning intensity may have more impact on mediocre but privileged students in terms of their background variables (thus potentially pushing towards downward mobility) than on students with less advantageous circumstances who have already undergone a more difficult selection process, such that only the relatively talented and resilient among them managed to enter Gymnasium. It would also be interesting to show whether, in the long term, increased learning intensity may instead inhibit upward mobility by increasing dependence on favorable circumstances (see section 7.1).
89Ideally, the theory of how learning (duration and intensity) and IEOp, as well as IEOp and social mobility, are linked would have to be developed more precisely.

References

Aksoy, T. and C. R. Link (2000, jun). A panel analysis of student mathematics achievement in the US in the 1990s: does increasing the amount of time in learning activities affect math achievement? Economics of Education Review 19 (3), 261–277.
Alesina, A., S. Stantcheva, and E. Teso (2017, jan). Intergenerational Mobility and Preferences for Redistribution.
Almås, I., A. W. Cappelen, J. T. Lind, E. Ø. Sørensen, and B. Tungodden (2011, aug). Measuring unfair (in)equality. Journal of Public Economics 95 (7-8), 488–499.
Ammermueller, A. (2007, aug). PISA: What makes the difference?: Explaining the gap in test scores between Finland and Germany. Empirical Economics 33 (2), 263–287.
Anderson, G., T. Fruehauf, M. G. Pittau, and R. Zelli (2015). Evaluating Progress Toward an Equal Opportunity Goal: Assessing the German Educational Reforms of the First Decade of the 21st Century.
Andrietti, V. (2015). The causal effects of increased learning intensity on student achievement: Evidence from a natural experiment.
Angrist, J. D. and A. B. Krueger (1991, nov). Does Compulsory School Attendance Affect Schooling and Earnings? The Quarterly Journal of Economics 106 (4), 979–1014.
Angrist, J. D. and J.-S. Pischke (2008). Mostly harmless econometrics: An empiricist’s companion. An empiricist’s companion (March), 392.
Arneson, R. J. (1989, may). Equality and equal opportunity for welfare. Philosophical Studies 56 (1), 77–93.
Baumert, J., C. Artelt, E. Klieme, M. Neubrand, M. Prenzel, U. Schiefele, K.-J. Schneider, and M. Weiß (2002). PISA 2000-Die Länder der Bundesrepublik Deutschland im Vergleich. Leske + Budrich.
Baumert, J. and M. Prenzel (2009, dec). Vertiefende Analysen zu PISA 2006, Volume 58. Wiesbaden: VS Verlag für Sozialwissenschaften.
Bertrand, M., E. Duflo, and S. Mullainathan (2004, feb). How Much Should We Trust Differences-In-Differences Estimates? The Quarterly Journal of Economics 119 (1), 249–275.
Björklund, A., M. Jäntti, and J. E. Roemer (2012, jul). Equality of opportunity and the distribution of long-run income in Sweden. Social Choice and Welfare 39 (2-3), 675–696.
Boca, D. D., D. Piazzalunga, and C. Pronzato (2016). Early Childcare, Child Cognitive Outcomes and Inequalities in the UK.
Bourguignon, F., F. H. G. Ferreira, and M. Menéndez (2007, dec). INEQUALITY OF OPPORTUNITY IN BRAZIL. Review of Income and Wealth 53 (4), 585–618.
Bratti, M., D. Checchi, and G. de Blasio (2008, jun). Does the Expansion of Higher Education Increase the Equality of Educational Opportunities? Evidence from Italy. Labour 22 (s1), 53–88.
Brunori, P., F. H. G. Ferreira, and V. Peragine (2013, jan). Inequality of Opportunity, Income Inequality and Economic Mobility: Some International Comparisons. Number January in Policy Research Working Papers. The World Bank.
Brunori, P., V. Peragine, and L. Serlenga (2012, oct). Fairness in education: The Italian university before and after the reform. Economics of Education Review 31 (5), 764–777.
Büttner, B. and S. Thomsen (2015, feb). Are We Spending Too Many Years in School? Causal Evidence of the Impact of Shortening Secondary School Duration. German Economic Review 16 (1), 65–86.

Camarero Garcia, S. (2012). Does shortening secondary school duration affect student achievement and educational equality? Evidence from a natural experiment in Germany: the ’G-8 reform’. Bachelor thesis, University of St. Gallen.
Cappelen, A. W., E. Ø. Sørensen, and B. Tungodden (2010, apr). Responsibility for what? Fairness and individual responsibility. European Economic Review 54 (3), 429–441.
Carneiro, P. (2008, apr). Equality of opportunity and educational achievement in Portugal. Portuguese Economic Journal 7 (1), 17–41.
Checchi, D. and V. Peragine (2010, dec). Inequality of opportunity in Italy. The Journal of Economic Inequality 8 (4), 429–450.
Checchi, D. and H. G. Van De Werfhorst (2014). Educational Policies and Income Inequality.
Chetty, R., J. N. Friedman, E. Saez, N. Turner, and D. Yagan (2017). Mobility Report Cards: The Role of Colleges in Intergenerational Mobility.
Cohen, G. A. (1989, jul). On the Currency of Egalitarian Justice. Ethics 99 (4), 906–944.
Crawford, C., P. Johnson, S. Machin, and A. Vignoles (2011). Social mobility: a literature review. Technical Report March.
Dustmann, C., P. A. Puhani, and U. Schönberg (2014). The Long-Term Effects of Early Track Choice.
Dworkin, R. (1981a). What is Equality? Part 1: Equality of Welfare. Philosophy & Public Affairs 10 (3), 185–246.
Dworkin, R. (1981b). What is Equality? Part 2: Equality of Resources. Philosophy & Public Affairs 10 (4), 283–345.
Edmark, K., M. Frölich, and V. Wondratschek (2014, oct). Sweden’s school choice reform and equality of opportunity. Labour Economics 30, 129–142.
Ertl, H. (2006, nov). Educational standards and the changing discourse on education: the reception and consequences of the PISA study in Germany. Oxford Review of Education 32 (5), 619–634.
Ferreira, F. H. G. and J. Gignoux (2011, dec). THE MEASUREMENT OF INEQUALITY OF OPPORTUNITY: THEORY AND AN APPLICATION TO AMERICA. Review of Income and Wealth 57 (4), 622–657.
Ferreira, F. H. G. and J. Gignoux (2014, jan). The Measurement of Educational Inequality: Achievement and Opportunity. The World Bank Economic Review 28 (2), 210–246.
Ferreira, F. H. G. and V. Peragine (2015). Equality of Opportunity Theory and Evidence. World Bank Policy Research Working Paper (March).
Fields, G. S. and E. A. Ok (1999). The Measurement of Income Mobility: An Introduction to the Literature. In J. Silber (Ed.), Handbook on income inequality measurement, pp. 557–596. Norwell, MA: Kluwer Academic Publishers.
Figueroa, J. L. and D. Van de gaer (2015). A Simple Empirical Test for Equalizing Opportunities with an Application to Progresa.
Fleurbaey, M. and V. Peragine (2013, jan). Ex Ante Versus Ex Post Equality of Opportunity. Economica 80 (317), 118–130.
Fleurbaey, M., V. Peragine, and X. Ramos (2015). Ex Post Inequality of Opportunity Comparisons.
Fleurbaey, M. and E. Schokkaert (2009, jan). Unfair inequalities in health and health care. Journal of Health Economics 28 (1), 73–90.
Gamboa, L. F. and F. D. Waltenberg (2012, oct). Inequality of opportunity for educational achievement in Latin America: Evidence from PISA 2006–2009. Economics of Education Review 31 (5), 694–708.

German Federal Statistical Office (2014). Area and population. Technical report, German Federal Statistical Office, Wiesbaden.
Grenet, J. (2013, jan). Is Extending Compulsory Schooling Alone Enough to Raise Earnings? Evidence from French and British Compulsory Schooling Laws. The Scandinavian Journal of Economics 115 (1), 176–210.
Huebener, M., S. Kuger, and J. Marcus (2016). Increased instruction hours and the widening gap in student performance.
Huebener, M. and J. Marcus (2015, oct). G8 high school reform results in higher grade repetition rates and lower graduate age, but does not affect graduation rates. DIW Economic Bulletin 5 (18), 247–255.
Hufe, P. and A. Peichl (2015). Lower bounds and the linearity assumption in parametric estimations of inequality of opportunity.
Hufe, P. and A. Peichl (2016). Beyond Equal Rights: Equality of Opportunity in Political Participation.
Hufe, P., A. Peichl, J. Roemer, and M. Ungerer (2015). Inequality of Income Acquisition: The Role of Childhood Circumstances.
Imbens, G. W. and J. M. Wooldridge (2009). Recent Developments in the Econometrics of Program Evaluation. Journal of Economic Literature 47 (1).
Klieme, E., C. Artelt, J. Hartig, N. Jude, O. Köller, M. Prenzel, W. Schneider, and P. Stanat (2011, jan). PISA 2009. PISA. OECD Publishing.
Krashinsky, H. (2014). How Would One Extra Year of High School Affect Academic Performance in University? Evidence from a Unique Policy Change. Canadian Journal of Economics 47 (1), 70–97.
Lavy, V. (2015, nov). Do Differences in Schools’ Instruction Time Explain International Achievement Gaps? Evidence from Developed and Developing Countries. Economic Journal 125 (588), F397–F424.
Lefranc, A., N. Pistolesi, and A. Trannoy (2008, dec). INEQUALITY OF OPPORTUNITIES VS. INEQUALITY OF OUTCOMES: ARE WESTERN SOCIETIES ALL ALIKE? Review of Income and Wealth 54 (4), 513–546.
Lefranc, A. and A. Trannoy (2016). Equality of Opportunity: How to encompass Fifty Shades of Luck.
Marcotte, D. E. (2007, oct). Schooling and test scores: A mother-natural experiment. Economics of Education Review 26 (5), 629–640.
Marrero, G. A. and J. G. Rodríguez (2013, sep). Inequality of opportunity and growth. Journal of Development Economics 104 (C), 107–122.
Meyer, B. (1995, apr). Natural and quasi-experiments in economics. Journal of Business & Economic Statistics 13 (2), 151.
Meyer, T. and S. Thomsen (2016). How important is secondary school duration for postsecondary education decisions? Evidence from a natural experiment. Journal of Human Capital 10 (1), 67–108.
Milde-Busch, A., A. Blaschek, I. Borggräfe, R. von Kries, A. Straube, and F. Heinen (2010, jul). Besteht ein Zusammenhang zwischen der verkürzten Gymnasialzeit und Kopfschmerzen und gesundheitlichen Belastungen bei Schülern im Jugendalter? Klinische Pädiatrie 222 (04), 255–260.
Niederle, M. and L. Vesterlund (2010, may). Explaining the Gender Gap in Math Test Scores: The Role of Competition. Journal of Economic Perspectives 24 (2), 129–144.
Niehues, J. and A. Peichl (2014, jun). Upper bounds of inequality of opportunity: theory and evidence for Germany and the US. Social Choice and Welfare 43 (1), 73–99.
OECD (2001, dec). Knowledge and Skills for Life: First Results from PISA 2000. PISA. Paris: OECD Publishing.

OECD (2004, mar). The PISA 2003 Assessment Framework: Mathematics, Reading, Science and Problem Solving Knowledge and Skills. PISA. Paris: OECD Publishing.
OECD (2005a). Education at a Glance 2005.
OECD (2005b, jul). PISA 2003 Technical Report. PISA. Paris: OECD Publishing.
OECD (2009a, jan). PISA 2006 Technical Report. PISA. Paris: OECD Publishing.
OECD (2009b, mar). PISA Data Analysis Manual: SPSS, Second Edition. Technical report.
OECD (2010, jan). PISA 2009 Assessment Framework. PISA. Paris: OECD Publishing.
OECD (2012, mar). PISA 2009 Technical Report. PISA. Paris: OECD Publishing.
OECD (2013a, feb). PISA 2012 Assessment and Analytical Framework. PISA. Paris: OECD Publishing.
OECD (2013b). PISA 2012 Results: Excellence Through Equity: Giving Every Student the Chance to Succeed, Volume II. Paris.
OECD (2016, sep). Germany. In Education at a Glance. Organisation for Economic Cooperation and Development (OECD).
Oppedisano, V. and G. Turati (2015). What are the causes of educational inequality and of its evolution over time in Europe? Evidence from PISA. Education Economics 23 (1), 3–24.
Philippis, M. D. and F. Rossi (2016). Parents, Schools and Human Capital Differences across Countries. Ph. D. thesis, London School of Economics.
Piketty, T. and G. Zucman (2014, aug). Capital is Back: Wealth-Income Ratios in Rich Countries 1700-2010. The Quarterly Journal of Economics 129 (3), 1255–1310.
Pischke, J.-S. (2007, oct). The Impact of Length of the School Year on Student Performance and Earnings: Evidence From the German Short School Years. The Economic Journal 117 (523), 1216–1242.
Prenzel, M., C. Artelt, J. Baumert, W. Blum, M. Hammann, E. Klieme, and R. Pekrun (2006). PISA 2006 in Deutschland. Die Kompetenzen der Jugendlichen im dritten Ländervergleich. pp. 1–24.
Prenzel, M., C. Sälzer, E. Klieme, and O. Köller (Eds.) (2013). PISA 2012. Fortschritte und Herausforderungen in Deutschland. Münster: Waxmann.
Raitano, M. and F. Vona (2016, jul). Assessing students’ equality of opportunity in OECD countries: the role of national- and school-level policies. Applied Economics 48 (33), 3148–3163.
Ramos, X. and D. Van de gaer (2015, jul). APPROACHES TO INEQUALITY OF OPPORTUNITY: PRINCIPLES, MEASURES AND EVIDENCE. Journal of Economic Surveys 0 (00), n/a–n/a.
Rawls, J. (1971). A Theory of Justice. Cambridge MA: Harvard University Press.
Riphahn, R. T. and P. Trübswetter (2013, aug). The intergenerational transmission of education and equality of educational opportunity in East and West Germany. Applied Economics 45 (22), 3183–3196.
Roemer, J. E. (1998). Equality of opportunity. Cambridge [u.a.]: Harvard University Press.
Roemer, J. E. and A. Trannoy (2015). Equality of Opportunity. In Handbook of Income Distribution, Volume 2, pp. 217–300.
Rosa Dias, P. (2014). Equality of Opportunity in Health. In Encyclopedia of Health Economics, pp. 282–286. Elsevier.
Rossi, F. (2016). Barriers to College Investment and Aggregate Productivity. Ph. D. thesis, London School of Economics, London.
Sen, A. (1980). Equality of What? The Tanner Lecture on Human Values I, 197–220.
Standing Conference of the Ministers of Education (2013). Vereinbarung zur Gestaltung der gymnasialen Oberstufe in der Sekundarstufe II. Beschluss der Kultusministerkonferenz vom 07.07.1972 i.d.F. vom 06.06.2013. pp. 2–21.

Standing Conference of the Ministers of Education (2015). The Education System in the Federal Republic of Germany 2013/2014: A description of the responsibilities, structures and developments in education policy for the exchange of information in Europe. Technical report, Bonn.
Standing Conference of the Ministers of Education (2016a). Basic Structure of the Education System in the Federal Republic of Germany. Technical report, Secretariat of the Standing Conference of the Ministers of Education and Cultural Affairs of the Länder in the Federal Republic of Germany, Bonn.
Standing Conference of the Ministers of Education (2016b). Sekundarstufe II / Gymnasiale Oberstufe und Abitur.
Standing Conference of the Ministers of Education (2016c). Vereinbarung zur Gestaltung der gymnasialen Oberstufe in der Sekundarstufe II, Beschluss der Kultusministerkonferenz vom 07.07.1972 i.d.F. vom 16.06.2016. pp. 2–21.
The Economist (2016, oct). Social justice is becoming a bigger issue in Germany. The Economist.
Thiel, H., S. L. Thomsen, and B. Büttner (2014, oct). Variation of learning intensity in late adolescence and the effect on personality traits. Journal of the Royal Statistical Society: Series A (Statistics in Society) 177 (4), 861–892.
Todd, P. E. and K. I. Wolpin (2003). On the Specification and Estimation of the Production Function for Cognitive Achievement. 113, 3–33.
Woessmann, L. (2003, may). Schooling Resources, Educational Institutions and Student Performance: the International Evidence. Oxford Bulletin of Economics and Statistics 65 (2), 117–170.
Woessmann, L. (2010). Institutional determinants of school efficiency and equity: German states as a microcosm for OECD countries. Jahrbücher für Nationalökonomie und Statistik 230 (2), 234–270.
Woessmann, L., P. Lergetporer, F. Kugler, L. Oestreich, and K. Werner (2015). Deutsche sind zu grundlegenden Bildungsreformen bereit – Ergebnisse des ifo Bildungsbarometers 2015. ifo Schnelldienst 68 (17), 03–24.
Woessmann, L., P. Lergetporer, F. Kugler, and K. Werner (2014). Was die Deutschen über die Bildungspolitik denken – Ergebnisse des ersten ifo Bildungsbarometers. ifo Schnelldienst 67 (18), 16–33.

Glossary

Abitur General higher education entrance qualification. It entitles the holder to admission to higher education institutions and is usually obtained at upper Gymnasium level (Gymnasiale Oberstufe) by passing the Abitur examination. The certificate of Allgemeine Hochschulreife incorporates examination marks as well as continuous assessment of the pupil’s performance in the last two years of Gymnasium (qualification phase) (Standing Conference of the Ministers of Education, 2015).

Gymnasium The academic track of secondary school education in Germany, covering both lower and upper secondary level (grades 5–13 or 5–12) and providing an in-depth general education aimed at the general higher education entrance qualification (Allgemeine Hochschulreife). Since 2012, in the majority of Länder the Abitur can be obtained after the successful completion of 12 consecutive school years (eight years at the Gymnasium) (Standing Conference of the Ministers of Education, 2015).

Gymnasiale Oberstufe The upper level of Gymnasium, which may also be established at other types of school such as the Gesamtschule. It comprises grades 11-13 (or 10-12, 11-12, depending on the Land). It is a course of general education concluded by the Abitur examination (Standing Conference of the Ministers of Education, 2015).

Gymnasium-9-model The traditional secondary school duration for the academic track (Gymnasium) which consists usually of grades 5 to 13, thus 9 years, i.e. Gymnasium-9-model (G-9-model).

Gymnasium-8-model The new shortened secondary school duration for the academic track (Gymnasium) which consists usually of grades 5 to 12, thus 8 years, i.e. Gymnasium-8-model (G-8 model).

Gymnasium-8-reform The reform that shortened secondary school duration for the academic track (Gymnasium) from the Gymnasium-9-model (G-9-model) to the Gymnasium-8-model (G-8-model).

learning intensity By learning intensity I refer to the ratio of curricular content covered to the period of time available for it. In particular, the G-8-reform increased learning intensity in such a way that students at the end of grade 9 post-reform have received about the same amount of instruction, and covered the same curriculum, as students who had completed two-thirds of grade 10 pre-reform.

city state The term refers to three German cities, Berlin, Bremen and Hamburg, that have the status of federal states.

PISA index of economic, social and cultural status Following OECD(2012): The ESCS is the PISA index of economic, social and cultural status.

International Standard Classification of Education For a definition of the International Standard Classification of Education, see Figure A.10.

International Socio-Economic Index of Occupational Status International Socio-Economic Index of Occupational Status is derived using an optimal scaling procedure that assigns scores from 16 to 90 to each of the 271 distinct occupation categories in such a way as to maximise the indirect effect of education on income through occupation and to minimize the direct effect of education on income, net of occupation (both effects being net of age). It is highly correlated with parental education and a rough proxy for parental income. See Ganzeboom, De Graaf, and Treiman (1992) for further details on this methodology.

Plausible Value Following chapter 6 of OECD (2009b): Plausible values are a representation of the range of abilities that a student might reasonably have. Instead of directly estimating a student’s ability θ, a probability distribution for a student’s θ is estimated. Thus, instead of obtaining a point estimate for θ, like a WLE, a range of possible values for a student’s θ, with an associated probability for each of these values, is estimated. Plausible values are random draws from this (estimated) distribution for a student’s θ.

cell Following Checchi and Peragine (2010) and as summarized in Gamboa and Waltenberg (2012): A cell is defined by a type and a tranche to which a group of individuals belongs. For instance, a cell is composed of the type ‘male and rich’ and the tranche ‘top 10%’ in the intra-type distribution of scores.

common time trend Following Angrist and Pischke (2008): The common time trend assumption, also known as the parallel trend assumption, is the key assumption underlying the Difference-in-Difference estimation approach.

fairness principles I take this definition from Brunori et al. (2012), section 2: Condition 1 (Reward). A measure of unfair inequality should not reflect legitimate variation in outcomes, i.e. inequalities which are caused by differences in effort. Condition 2 (Compensation). If a measure of unfair inequality is zero, there should be no illegitimate differences left; i.e. two individuals with the same effort should have the same outcome.

federal states List of German federal states (compare also Table A.2): Baden-Wuerttemberg (BW), Bavaria (BV), Berlin (BE), Brandenburg (BB), Bremen (BR), Hamburg (HB), Hesse (H), Lower-Saxony (LS), Mecklenburg-Western Pomerania (MWP), North Rhine-Westphalia (NRW), Rhineland-Palatinate (RP), Saxony (SN), Saxony-Anhalt (ST), Thuringia (TH), Schleswig-Holstein (SH), Saarland (SL).
tranche Following Checchi and Peragine (2010) and as summarized in Gamboa and Waltenberg (2012): A tranche is a given partition of the distribution of test scores, say the bottom 10% or the top 20%. It is assumed that individuals, belonging to any type, who belong to a given tranche have exerted the same ‘degree of effort’ (that is, the same effort conditional on the individual’s type). This is an empirical approximation of Roemer’s within-type percentile of the distribution of the advantage. While the larger the number of tranches, the closer one would be to Roemer’s theoretical conception, in empirical applications the researcher will be limited (by sample sizes, etc.) to working with a small number of tranches.

transfer principle Following Ferreira and Gignoux (2014): This principle is satisfied by the respective measure if it rises (strong axiom) or at least does not fall (weak axiom) as a result of any sequence of mean-preserving spreads.

type Following Checchi and Peragine (2010) and as summarized in Gamboa and Waltenberg (2012): A type is understood as a group of individuals facing equal circumstances. Further assuming that the advantage is an increasing function of effort, the within-type distribution of the advantage is then assumed to be the outcome of different degrees of effort exerted by the individuals. Types can be defined from a single circumstance (e.g. gender, giving rise to two types: male vs. female) or from a combination of circumstances (e.g. gender and wealth, giving rise to four types: male-rich, female-rich, male-poor, female-poor).
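The glossary entries for type, tranche and cell can be illustrated with a minimal Python sketch. It assumes that a type is given as a label per student and that tranches are equal-sized rank groups within each type; the function name `assign_cells`, the labels and the scores are hypothetical and are not taken from the data or code used in this paper.

```python
from collections import defaultdict

def assign_cells(students, n_tranches=4):
    """Group scores into cells, where a cell is a (type, tranche) pair.

    students: list of (type_label, score) tuples.
    Tranches are rank groups within each type; 0 is the bottom tranche.
    """
    # Collect the scores of each type (individuals facing equal circumstances).
    by_type = defaultdict(list)
    for type_label, score in students:
        by_type[type_label].append(score)

    cells = defaultdict(list)
    for type_label, scores in by_type.items():
        ranked = sorted(scores)
        n = len(ranked)
        for rank, score in enumerate(ranked):
            # Position in the intra-type distribution determines the tranche.
            tranche = rank * n_tranches // n
            cells[(type_label, tranche)].append(score)
    return dict(cells)

# Hypothetical example: two types, two tranches (bottom and top half).
students = [
    ("female-rich", 560), ("female-rich", 610),
    ("female-rich", 590), ("female-rich", 640),
    ("male-poor", 430), ("male-poor", 470),
    ("male-poor", 450), ("male-poor", 500),
]
cells = assign_cells(students, n_tranches=2)
for key in sorted(cells):
    print(key, cells[key])
```

Because tranches are formed within each type, two students in the same tranche but different types are treated as having exerted the same degree of effort, even though their absolute scores differ.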

A Appendix

A.1 Data

A.1.1 Data sources

I thank the IQB and the Research Data Center in Berlin for granting permission to conduct this secondary analysis and for their support. For further information on this data set and its availability, the reader is referred to the IQB (https://www.iqb.hu-berlin.de/fdz), where one can apply for access to the German-specific PISA data for each testing cycle.

• PISA-2000: Artelt, C., Klieme, E., Neubrand, M., Prenzel, M., Schiefele, U., Schneider, W., Tillmann, K.-J., & Weiß, M. (2009): Program for International Student Assessment 2000 (PISA 2000). Version: 1. IQB – Institut zur Qualitätsentwicklung im Bildungswesen. Datensatz. http://doi.org/10.5159/IQB_PISA_2000_v1

• PISA-2003: Prenzel, M., Baumert, J., Blum, W., Lehmann, R., Leutner, D., Neubrand, M., Pekrun, R., Rolff, H.-G., Rost, J., & Schiefele, U. (2007): Program for International Student Assessment 2003 (PISA 2003). Version: 1. IQB – Institut zur Qualitätsentwicklung im Bildungswesen. Datensatz. http://doi.org/10.5159/IQB_PISA_2003_v1

• PISA-2006: Artelt, C., Baumert, J., Blum, W., Hammann, M., Klieme, E., & Pekrun, R. (2010): Program for International Student Assessment 2006 (PISA 2006). Version: 1. IQB – Institut zur Qualitätsentwicklung im Bildungswesen. Datensatz. http://doi.org/10.5159/IQB_PISA_2006_v1

• PISA-2009: Artelt, C., Hartig, J., Jude, N., Köller, O., Prenzel, M., Schneider, W., & Stanat, P. (2013): Program for International Student Assessment 2009 (PISA 2009). Version: 1. IQB – Institut zur Qualitätsentwicklung im Bildungswesen. Datensatz. http://doi.org/10.5159/IQB_PISA_2009_v1

• PISA-2012: Sälzer, C., Klieme, E., Köller, O., Mang, J., Heine, J.-H., Schiepe-Tiska, A., & Müller, K. (2015): Program for International Student Assessment 2012 (PISA 2012). Version: 2. IQB – Institut zur Qualitätsentwicklung im Bildungswesen. Datensatz. http://doi.org/10.5159/IQB_PISA_2012_v2

A.1.2 Background Information on the PISA data

Since 2000, the OECD has used the PISA program to assess, every three years, the performance of 15-year-old students in three basic competencies (life skills) regarded as especially important for a person’s future success as they approach the end of compulsory schooling: reading, mathematics and science literacy (cf. Klieme et al. (2011); OECD (2009b, 2010, 2013a)). Rather than testing whether students master particular curricular content, PISA evaluates more general skills, such as the ability to apply knowledge in the three tested domains to solve real-world problems, i.e. skills that students should acquire before leaving school because they are essential for participating in modern society (OECD, 2001).90 Apart from general cognitive skills, PISA also collects rich information on family and school characteristics, since students, their parents, teachers and school principals are asked to fill out questionnaires.

Concerning the PISA procedure, for each testing cycle the OECD chooses an international contractor responsible for the test’s design and for its comparability across countries (e.g. ensuring that test questions are robust to cultural bias) and over time (making trend analysis possible (OECD, 2009b)). At the national level, each member state appoints a PISA National Project Manager per testing cycle who is in charge of implementing the test. The testing procedure itself follows a two-stage stratified random sampling design. First, as primary sampling units, schools with eligible students are randomly selected (with a minimum of 150 schools in each country) to obtain a sample representative of all school types across all regions within a country. Second, as second-stage sampling units, eligible students (15-year-olds91) are randomly selected within the sampled schools (to achieve a minimum of 4500 students in the sample). Each student within a school receives a different combination of approved test questions covering all three testing domains.92 The difficulty level and scope of the test, however, remain the same for each student, independent of secondary school type. The paper-based test lasts two hours, with an additional 30 minutes for students to complete the questionnaire, which provides information on their family, school and socioeconomic background as well as on their attitudes, motivations and aspirations. The test is then evaluated at the national level (with cross-checks by the international contractor), and finally the results are transferred to the OECD, which conducts cross-country comparisons that are published with the official test scores.

To obtain comparable measures of latent ability in each of the three domains across and within countries, however, the raw answers to test questions (also called items) have to undergo some processing (cf. OECD (2005b, 2009a, 2012)). As illustrated by Ferreira and Gignoux (2014), Item Response Theory (IRT) is used to back out from individual item responses the distribution of the latent variable, cognitive skills (measured as test scores per domain), taking into account, for instance, the particular difficulty of an item. However, to address small-sample measurement error, which arises because e.g. not all students answer all items, so-called Plausible Values of test results are provided for each student. First, the marginal distribution of the latent variable conditional on the item responses and a set of observables is estimated, i.e. for each student a probability distribution of test scores based on their answers is estimated. Second, M draws are taken from this distribution; these are the Plausible Values of the student’s test score.
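The two-stage sampling logic described above can be sketched in a few lines of Python. This is a simplified illustration only: it draws schools and then students by simple random sampling and ignores stratification and sampling weights. The function `two_stage_sample`, the school names and the sample sizes are hypothetical and do not reproduce the actual PISA sampling procedure.

```python
import random

def two_stage_sample(schools, n_schools, n_students_per_school, seed=0):
    """Draw schools first, then eligible students within each drawn school.

    schools: dict mapping a school id to a list of eligible student ids.
    Simplification (assumption): simple random sampling at both stages,
    without stratification or sampling weights.
    """
    rng = random.Random(seed)
    # Stage 1: schools are the primary sampling units.
    sampled_schools = rng.sample(sorted(schools), n_schools)
    sample = {}
    for school in sampled_schools:
        eligible = schools[school]
        # Stage 2: students are sampled within each selected school.
        k = min(n_students_per_school, len(eligible))
        sample[school] = rng.sample(eligible, k)
    return sample

# Hypothetical population: 5 schools with 10 eligible students each.
population = {
    f"school_{s}": [f"s{s}_{i}" for i in range(10)] for s in range(5)
}
drawn = two_stage_sample(population, n_schools=3, n_students_per_school=4)
for school, pupils in drawn.items():
    print(school, pupils)
```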

90 The underlying question of PISA is "What is important for citizens to know and be able to do?". More generally, in PISA the concept of “literacy” refers to students’ capacity to apply knowledge and skills in key subjects, and to analyze, reason and communicate effectively as they identify, interpret and solve problems in a variety of situations. For more specific definitions of the three testing domains, I also refer to OECD (2004) and in particular to chapter 1 (pages 20-22) of OECD (2009b).

91 In the age-based sample, according to OECD (2013b), this includes students who were aged between 15 years and 3 months and 16 years and 2 months at the beginning of the assessment period (plus/minus 1 month) and who were enrolled in an educational institution (grade 7 or higher), regardless of the type of institution and of whether they were in full-time or part-time education.

92 For more details on the international PISA test procedure, I also refer to Lavy (2015), section 2, as well as to the publications on the PISA Assessment Framework or the Technical Reports on the test, e.g. OECD (2013a) and OECD (2012).

For PISA, in the datasets of each testing cycle, 5 plausible values are provided for each student in all three test domains (M = 5).93 After this IRT-adjustment, the plausible test scores are standardized as follows:

$$y_{ij} = \hat{\mu} + \frac{\hat{\sigma}}{\sigma}\,(x_{ij} - \mu) \qquad \text{(A.1)}$$

where $x_{ij}$ is the post-IRT, pre-standardized score for student $i$ in country $j$; $\mu$ ($\sigma$) is the original mean (standard deviation) across all countries in the sample of the respective test year; and $\hat{\mu}$ ($\hat{\sigma}$) denotes the mean (standard deviation) set by PISA for the standardized distribution of test scores, with values of 500 (100).94 The meaning of the scores is best understood by comparing them to a standard, such as proficiency levels. For instance, in mathematics, a proficiency level is said to consist of about 70 points. This corresponds to about two years of schooling in the average OECD country (OECD, 2013b). Focusing for instance on PISA-I-2012, one can then note that the average difference in mathematics test scores between the top and bottom quarters of students in OECD countries is 128 score points. However, most performance differences related to socio-demographic characteristics of students are much smaller than an entire proficiency level. For instance, on average across all OECD members in PISA-I-2012, boys outscore girls in mathematics by 11 points and native students score about 34 points higher than their peers with migration background.95 The advantage of using PISA test scores over other measures of cognitive skills, such as GPAs, is that the design of the underlying sampling procedure allows comparability of test scores both over time and across countries/regions.
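As a rough numerical illustration of equation (A.1), the following Python sketch rescales a handful of raw scores to a mean of 500 and a standard deviation of 100. The raw scores are invented, the pooled sample is treated as unweighted, and the population standard deviation is used; these are simplifying assumptions for illustration and not a description of the official PISA scaling.

```python
from statistics import mean, pstdev

def standardize_scores(raw_scores, target_mean=500.0, target_sd=100.0):
    """Apply y = mu_hat + (sigma_hat / sigma) * (x - mu) to each raw score."""
    mu = mean(raw_scores)       # original mean of the pooled sample
    sigma = pstdev(raw_scores)  # original standard deviation (assumed: population SD)
    if sigma == 0:
        raise ValueError("scores have zero variance and cannot be standardized")
    return [target_mean + (target_sd / sigma) * (x - mu) for x in raw_scores]

# Hypothetical post-IRT, pre-standardized scores (illustration only).
raw = [-1.2, -0.4, 0.0, 0.3, 1.3]
scaled = standardize_scores(raw)
print([round(y, 1) for y in scaled])
print(round(mean(scaled), 6), round(pstdev(scaled), 6))
```

The transformation is linear, so it preserves the ranking of students and only changes the location and scale of the score distribution.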

However, three common doubts on the validity of PISA-test scores should be considered. First, if the student population from which the test participants are selected is not complete, as some students are excluded, this would threaten representativeness (Gamboa and Waltenberg, 2012). Regarding this concern, one should note that the sampling standards of PISA require participating countries to not exclude more than 5% of students from the population eligible to be tested. Permissible reasons are relatively strict and include only special cases such as serious illnesses or lack of language skills due to recent immigration (e.g. asylum seekers). For Germany, with at least 97% of students in the eligible age (or in the 9th grade, see section 5.1) being part of the initial student population, there is not much room for this concern (OECD, 2010, 2013a). Furthermore, one may be concerned that the actual participation rate of randomly selected students may be low, such that systematic selection may affect representativeness. However, for most developed countries, the rate of compliers is above 80% for students and 85% for selected schools, the OECD quality thresholds for the sampling process. For Germany, the participation rate of selected students is well above 80%, on average 92%, for schools, it has been usually even 100% (Klieme et al., 2011). Moreover, there has been no evidence that those selected who do not take the test can be systematically defined by their observables.

93 When one conducts estimations using PISA test scores, it is suggested to estimate any statistic s by using each of the M (plausible value) datasets separately (getting sˆm) and then to average these statistics over M for the final estimate sˆ (Ferreira and Gignoux, 2014). However, I take a simplified approach in line with Andrietti (2015) and base my estimations on the average of the 5 plausible test score values per testing domain for each student, thus working with just one average test score value per domain. Having initially conducted estimations for each plausible value as suggested in OECD (2010), I find that estimation results remain very similar under the simplified approach, which, however, improves the efficiency of the estimation procedures.

94 This means that across all OECD countries, the typical student scored 500 points in mathematics, and about two-thirds of students in OECD countries scored between 400 and 600 points. Thus, 100 points constitute a huge difference in skills. The PISA test scores have neither maximum nor minimum values, and there are no thresholds for passing the test, as it is designed to provide a relative measure that allows comparing skills in the three domains across students and over time. As indicated in section 2, to deal with difficulties in constructing meaningful measures of IEOp based on these standardized test score measures, the variance appears to be a useful index, as explained by Ferreira and Gignoux (2014).

95 Further illustration from PISA-2012: Socio-economically advantaged students (in the top quarter of SES in their country) score an average of 90 points higher than their disadvantaged peers (bottom quarter) (see Table II.2.4a in OECD (2013b)), and students in city schools score about 31 points higher than those in rural ones, on average (see Table II.3.3a in OECD (2013b)).
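The two ways of handling plausible values mentioned in footnote 93 can be contrasted in a short Python sketch: computing a statistic on each of the M plausible-value sets and averaging the results, versus first averaging each student's plausible values and computing the statistic once. The function names and the numbers are hypothetical, and sampling weights are ignored; the sketch is not the estimation code used in this paper.

```python
from statistics import mean, pvariance

def estimate_per_plausible_value(pv_matrix, statistic):
    """Compute the statistic on each of the M plausible-value columns, then average."""
    m = len(pv_matrix[0])
    per_pv = [statistic([row[k] for row in pv_matrix]) for k in range(m)]
    return mean(per_pv)

def estimate_from_averaged_pvs(pv_matrix, statistic):
    """Simplified approach: average each student's M plausible values first."""
    return statistic([mean(row) for row in pv_matrix])

# Hypothetical plausible values: 4 students (rows) x M = 5 draws (columns).
pvs = [
    [480, 495, 510, 500, 490],
    [520, 530, 515, 525, 535],
    [450, 440, 460, 455, 445],
    [600, 590, 610, 605, 595],
]

print(estimate_per_plausible_value(pvs, mean))
print(estimate_from_averaged_pvs(pvs, mean))
print(estimate_per_plausible_value(pvs, pvariance))
print(estimate_from_averaged_pvs(pvs, pvariance))
```

For a linear statistic such as the mean the two approaches coincide. For a nonlinear statistic such as the variance they need not: averaging the draws first removes the within-student spread, so the variance of averaged scores cannot exceed the average of the per-draw variances. How much this matters depends on the data at hand.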

Finally, another concern is that schools, or more specifically teachers, may systematically bias the comparability of scores across time and regions if they systematically train or motivate students for the test. However, Klieme et al. (2011) find such concerns not to be relevant, based on students’ information about their motivation for the test and on teachers’ information about whether and how they prepared students for it, as provided in the questionnaires of the PISA studies 2000-2009. The majority of teachers report that they tried to familiarize students with general testing strategies but did not train them specifically for the test. In fact, only half of the teachers indicate that they trained students for PISA at all, and those who did started no earlier than one month before the test. Likewise, only 25% of participating students indicate that they prepared for the reading section, only 13% for the mathematics section, and only 8% for the science section of the test.96 Thus, as Klieme et al. (2011) show in more detail, the assumption that test results for Germany are not systematically influenced by preparation appears credible. Finally, even though the questionnaires provide evidence that test motivation slightly increased between 2000 and 2009, the correlation between test motivation and test scores is very low (on average 0.05), so it also seems unlikely that test motivation biases the results (Klieme et al., 2011).97 In conclusion, the advantages of using PISA data as a measure of cognitive skills seem to outweigh its potential caveats, which is why I decided to use them, in line with the studies mentioned in section 3.2, to evaluate the effect of the G-8-reform on IEOp (cf. sections 3 and 4). Recently, the OECD has analyzed how equity and test scores have evolved across its member states (OECD, 2013b). Although these studies are related to the main topic of this paper, they mostly focus on descriptive cross-country comparisons of the relationships between PISA test scores and social background characteristics, as shown in Figure A.9. As the purpose of this paper is to focus on Germany, section 5.1 explains which specific German PISA data I use.

A.1.3 Variables Definition

Circumstances-Variables used include:

1. Individual Characteristics (IC):

• (I) gender (Base: Male) and age (in years)

• (II) migration-background (Base: none) and language spoken at home (Base: German)

2. Parental Characteristics (PC)

• (III) Education: highest ISCED-index level in 3 categories (Base: ISCED-level 3-4)

3. Socio-Economic Status (SES)

• (IV) Number of Books in household (Base: 101-500)

• (V) highest ISEI-index level

4. Family Characteristics (FC)

• (VI) single parent household (Base: none) and mother/father employment status (Base: FT)

(Alesina, Stantcheva, and Teso, 2017)

96 Note that affected students and teachers are informed only about two months before the PISA test takes place. Given this, the generally limited probability of being selected for the test, and the absence of particular incentives for either teachers or students to prepare for it, the effect of potential preparation on scores appears to be limited.

97 On the robustness of findings in economic research based on student achievement tests, see also Woessmann (2010).

A.2 Empirical Strategy and Robustness - Appendix

A.2.1 Overview of Definitions and T-/C-Groups

1. Concerning the possible time periods, one can define the following models:

• Baseline-Model, medium-term perspective (Base-MT): covers the time/testing period (2003-2012)

• Extended-Model, medium-term perspective (Full-MT): covers the time/testing period (2000-2012)

• Baseline-Model, short-term perspective (Base-ST): covers the time/testing period (2003-2009)

• Extended-Model, short-term perspective (Full-ST): covers the time/testing period (2000-2009)

2. Concerning Treatment- and Control-Groups, the following groups can be formed based on Table 1

• Treatment-Group (T3): Baden-Wuerttemberg (BW), Bavaria (BV), Lower-Saxony (LS)

• Treatment-Group (T5): BW, BV, LS, Bremen (BR), Hamburg (HB)

• Treatment-Group (T7): BW, BV, LS, BR, HB, Berlin (BE), Brandenburg (BB)

• Control-Group (C2): Rhineland-Palatinate (RP), Schleswig-Holstein (SH)

• only for the short-term models ((Full)/Base-ST):

– Control-Group (C3): RP, SH, North Rhine-Westphalia (NRW)

– Control-Group (C4): RP, SH, NRW, Hesse (H)

One can add H to the Control-Group C3 in order to include another territorial, Western Land in the Control-Group. Doing so requires the assumption that H can still be classified as part of the Control-Group in 2009, as by then only 10% of 9th graders may have been treated (compare Table 1).

• hypothetical Control-Group (C2hyp): Saxony(S), Thuringia(TH)

• hypothetical Control-Group (C4h): RP, SH, S, TH

• Neither T nor C:

– for the medium-term models: SL, ST, MWP, H, NRW

– for the short-term models: SL, ST, MWP
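The group definitions above can be written down as a small lookup structure. The following is a minimal sketch, not the author's code: the names `GROUPS`, `MODELS` and `treatment_status` are hypothetical, and the state abbreviations are taken as given in this appendix.

```python
# Hypothetical encoding of the T-/C-Groups and model periods listed above.
GROUPS = {
    "T3": {"BW", "BV", "LS"},
    "T5": {"BW", "BV", "LS", "BR", "HB"},
    "T7": {"BW", "BV", "LS", "BR", "HB", "BE", "BB"},
    "C2": {"RP", "SH"},
    "C3": {"RP", "SH", "NRW"},       # short-term models only
    "C4": {"RP", "SH", "NRW", "H"},  # short-term models only
}

# Testing periods covered by each model.
MODELS = {
    "Base-MT": (2003, 2012),
    "Full-MT": (2000, 2012),
    "Base-ST": (2003, 2009),
    "Full-ST": (2000, 2009),
}

SHORT_TERM_ONLY = {"C3", "C4"}


def treatment_status(state, treat, control, model):
    """Return 1 (treated), 0 (control) or None (neither T nor C)."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if control in SHORT_TERM_ONLY and not model.endswith("-ST"):
        raise ValueError(f"{control} is only defined for short-term models")
    if state in GROUPS[treat]:
        return 1
    if state in GROUPS[control]:
        return 0
    return None
```

For example, `treatment_status("NRW", "T3", "C2", "Base-MT")` returns `None`, which matches the "Neither T nor C" classification of NRW in the medium-term models.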

A.2.2 Further Aspects on internal validity of the quasi-experimental nature of the G-8-reform

It should be noted that there were no specific changes in the political parties forming the governments of the federal states that make up my main T-/C-Group settings in both the medium- and short-term models, i.e. T3/C2 in Base-MT (2003-2012) or T3/C3 in Base-ST (2003-2009). Although federal states governed by conservative parties (CDU) tended to be the first to introduce the G-8-reform, Table 1 shows that nearly all states implemented the reform within a short time frame and retained their governments over the period of my analysis. Therefore, for the first affected cohorts, the whole reform period considered in the analysis was usually dominated by the same government. Thus, it is plausible that controlling for federal state fixed effects within a DID estimation framework accounts for general differences due to the political parties governing the respective state and implementing the reform.

The fact that there were no systematic changes in federal state governments across the T-/C-Groups shortly before the reform implementation supports the assumption that, for the period considered, the stability of the respective federal states' educational policies was comparable.

• T3

– BW: Conservative (CDU) led government for decades until 2011, then (2011-2016) government by Greens/SPD. Thus, the reform was implemented within the analysis period by the same governing party and, given the time lag for government policy to take effect, it is plausible to assume that school policy up to school year 2012/2013 was conducted by the same party.

– BV: Conservative (CSU) led government for decades until today; thus it is plausible to assume that school policy was mainly conducted by the same political party.

– LS: Conservative (CDU) led government during the whole analysis period (2003 until 2013); government led by the SPD beforehand and afterwards. Thus, it is plausible to assume that school policy was influenced by the same political party for the whole analysis period.

• T5

– BR: Social-democratic led government for decades until today; thus it is plausible to assume that school policy was mainly conducted by the same political party.

– HB: Social-democratic led government for decades until 2001 and again since 2011; Conservative (CDU) led government in between. Thus, it is plausible to assume that for the analysis period considered (2003-2009(2012)) school policy was mainly conducted by the same political party.

• C2

– RP: Social-democratic (SPD) led government for decades until today; thus it is plausible to assume that school policy was mainly conducted by the same political party.

– SH: Social-democratic (SPD) led government for decades (1988-2005) and again today (since 2012). From 2005 until 2012, the government was led by Conservatives, but from 2010-2012 in a grand coalition with the Social-democrats. Due to the narrow majorities, school policy for Gymnasium remained similar during the analysis period.

• C3

– NRW: Social-democratic (SPD) led government for decades until 2005 and again since 2010. In between, the government was led by the conservative party (CDU). However, the reform was already enacted under the social-democratic government and, despite the intermediate change, school policy remained similar over the analysis period. In particular, as I only take NRW into account for the short-term period when it can be classified as belonging to the Control-Group, it is unlikely

• C4

– H: Social-democratic led government until 1999. From 1999 until 2009, the Conservatives led the government; after some turmoil in 2009, they continued to govern from 2010 until today. Thus, they were in charge of the reform implementation.

Thus, as mentioned in the main text, focusing on the analysis period that covers only the first affected cohorts, the main DID assumption appears to be plausible. However, given the reversal decisions taken in some federal states in recent years, after the analysis period considered in this paper, a similar evaluation may become less plausible over time as other policy changes occur: the reform has been a topic on the political agenda in most federal states from around 2010 until today (compare the status quo of the reform in Table A.2). Nevertheless, for the first cohorts affected by the reform and tested in the analysis period (2003-2012), there is no systematic change in governments when comparing treatment- and control-group federal states.

Although one may have concerns about differences in ability, the following should be taken into account. First, the measurement framework treats any unchangeable features of cognitive skills as unobserved circumstances. Second, recent literature in the neurosciences suggests that cognitive skills are malleable, in particular during early childhood through epigenetic processes. In the spirit of the Human Capital Theory literature driven by Heckman, this may explain why, e.g., Boca, Piazzalunga, and Pronzato (2016) find that attending childcare institutions can significantly improve the cognitive skills of children, in particular those from disadvantaged SES backgrounds. Thus, the measurement framework fully takes into account the role of ability, both as unobserved circumstances and as efforts. The resulting measure is therefore a lower bound, as explained in section 2.

Concerning the DID, the only assumption that I need to make is the innocuous one that the distribution of students' cognitive abilities did not change systematically between federal states in Germany between 2003 and 2012. Given that moving across federal states is unlikely to have occurred, this simply means assuming that cognitive skills did not suddenly change differently across federal states in these years for any reason other than the reform. There is no way to provide evidence on whether there are systematic ability differences across federal states. However, even if they existed, the DID framework would take them into account. Given the short time period and the controls imposed via the DID, there are few plausible reasons why cognitive skills should have changed so differently across federal states as to bias the results. In any case, these considerations should be of much less concern in this quasi-experimental setting than in published journal articles that try to measure EEOp or IEOp across countries. Moreover, as the reform only affects students from age 10 onwards, and the treatment just involves more intense instruction but no different contents, I claim that these valid concerns, which cannot be addressed by empirical methods given the available data, are minor compared to those arising in typical published returns-to-schooling analyses.

A.2.3 On the computation of standard errors including replication weights

Throughout the paper, standard errors for both steps of the DID regressions (compare Section 6) using PISA data are constructed taking into account that student performance is reported through plausible values (PVs). Even though using the average of the five plausible values as a measure of individual performance guarantees that estimates of group-level means and regression coefficients remain unbiased, measures of dispersion require taking into account the within-student variability in plausible values.

As explained in the manuals provided by the OECD (2009b), one should compute standard errors by running regressions with individual test scores as the dependent variable five times, thereby using all plausible values in turn. For each regression I employ an estimator for the sampling variance clustered at the level of federal states. The final sampling variance, SV, is given by the average of the sampling variances obtained with the five plausible values. In addition, standard errors are inflated by the imputation variance (IV), due to the fact that test scores measure the student's latent skills with error. The imputation variance is estimated as the average squared deviation between the estimates obtained with each plausible value and the final estimate (obtained using the average of the plausible values), with the appropriate degree of freedom adjustment:

$$IV = \frac{1}{4} \sum_{i=1}^{5} \left(\hat{\theta}_i - \hat{\theta}\right)^2,$$

where $\hat{\theta}_i$ is the estimate for each of the five plausible values and $\hat{\theta}$ is the final estimate. Finally, as shown in OECD (2009b), the final error variance TV can be obtained by combining the sampling and imputation variance as follows:

$$TV = SV + \left(1 + \frac{1}{K}\right) \cdot IV = SV + 1.2 \cdot IV \qquad \text{(A.2)}$$

where $K = 5$ is the number of plausible values for each student. The final standard errors are given by the square roots of the final error variances.
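The combination rule in equation (A.2) can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions, not the author's implementation: the function name `total_variance` and the input numbers in the usage note are purely illustrative. It takes the final estimate to be the mean of the plausible-value estimates, which for OLS coefficients with identical regressors coincides with the estimate obtained from the averaged plausible values, since OLS is linear in the dependent variable.

```python
from math import sqrt
from statistics import mean


def total_variance(pv_estimates, pv_sampling_variances):
    """Combine sampling and imputation variance as in equation (A.2).

    pv_estimates: one estimate per plausible value (K values).
    pv_sampling_variances: the sampling variance of each of those estimates.
    Returns (TV, standard error).
    """
    k = len(pv_estimates)
    if k < 2 or k != len(pv_sampling_variances):
        raise ValueError("need K >= 2 estimates with matching variances")
    final_estimate = mean(pv_estimates)
    # SV: average of the K sampling variances.
    sv = mean(pv_sampling_variances)
    # IV: squared deviations from the final estimate, divided by K - 1.
    iv = sum((e - final_estimate) ** 2 for e in pv_estimates) / (k - 1)
    # TV = SV + (1 + 1/K) * IV, i.e. SV + 1.2 * IV for K = 5.
    tv = sv + (1 + 1 / k) * iv
    return tv, sqrt(tv)
```

With made-up inputs such as `total_variance([0.30, 0.32, 0.31, 0.29, 0.33], [0.0004] * 5)`, the imputation variance is 0.00025 and the total variance is 0.0004 + 1.2 × 0.00025 = 0.0007.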

To estimate SV , one can apply Fay’s variant of the Balanced Repeated Replication (BRR) method, which directly takes into account the two-stage stratified sampling design of the PISA test. For this method, each regression is iterated over the 80 sets of replicate weights provided in the PISA dataset. The sampling variance estimate is then given by the average squared deviation between the replicated estimates and the estimate obtained with final weights, with a degree of freedom correction depending on the Fay coefficient (a parameter that governs the variability between different sets of replicate weights and is set to be 0.5 for the PISA study).
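The replicate-based estimate described above can be sketched as follows. The degree of freedom correction used here, 1 / (G · (1 − k)²) with G replicates and Fay coefficient k, is the form given in the OECD's PISA data analysis documentation; with G = 80 and k = 0.5 it reduces to dividing the sum of squared deviations by 20. The function name `fay_brr_variance` is a hypothetical label, and the replicated estimates are assumed to have been computed beforehand by re-running the regression with each set of replicate weights.

```python
def fay_brr_variance(full_estimate, replicate_estimates, fay_k=0.5):
    """Sampling variance from Fay's variant of Balanced Repeated Replication.

    full_estimate: estimate obtained with the final student weights.
    replicate_estimates: estimates obtained with each set of replicate
        weights (80 sets in the PISA data).
    fay_k: Fay coefficient (0.5 for PISA).
    """
    g = len(replicate_estimates)
    if g == 0:
        raise ValueError("at least one replicate estimate is required")
    squared_deviations = sum(
        (rep - full_estimate) ** 2 for rep in replicate_estimates
    )
    # With g = 80 and fay_k = 0.5 the denominator equals 20.
    return squared_deviations / (g * (1 - fay_k) ** 2)
```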

Standard errors in all first-stage and all second-stage regressions are based on this method. For computational convenience, and similar to Philippis and Rossi (2016), I follow the "unbiased shortcut" procedure described in OECD (2009b), which uses only one set of plausible values to estimate the sampling variance (while the imputation variance is estimated using all five sets, as described above).
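A compact sketch of the shortcut may help to see how the two variance components are assembled. It assumes, for illustration only, that the sampling variance is taken from the first plausible value and its 80 replicate estimates, and that the final estimate is the mean of the five plausible-value estimates; the name `shortcut_standard_error` and the numbers in the usage note are hypothetical and not taken from the paper.

```python
from math import sqrt
from statistics import mean


def shortcut_standard_error(pv_estimates, pv1_replicate_estimates, fay_k=0.5):
    """Return (final estimate, standard error) under the shortcut.

    pv_estimates: estimates with final weights, one per plausible value.
    pv1_replicate_estimates: replicate-weight estimates for the FIRST
        plausible value only (the shortcut skips the other four).
    """
    k = len(pv_estimates)
    g = len(pv1_replicate_estimates)
    if k < 2 or g == 0:
        raise ValueError("need at least two PV estimates and one replicate")
    final_estimate = mean(pv_estimates)
    # Sampling variance from one plausible value (Fay's BRR).
    sv = sum(
        (rep - pv_estimates[0]) ** 2 for rep in pv1_replicate_estimates
    ) / (g * (1 - fay_k) ** 2)
    # Imputation variance from all plausible values.
    iv = sum((e - final_estimate) ** 2 for e in pv_estimates) / (k - 1)
    return final_estimate, sqrt(sv + (1 + 1 / k) * iv)
```

The design choice is a trade-off: the shortcut needs 80 + 5 regressions instead of 5 × 81, at the cost of relying on a single plausible value for the sampling variance.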

A.3 Supplementary Tables

Table A.1: Available grade-sample based PISA-E datasets

| | PISA-2000-E | PISA-2003-E | PISA-2006-E c | IQB-LV-2008/2009 d | IQB-LV-2012 |
|---|---|---|---|---|---|
| Period | "before" reform | "before" reform | "before" reform | "after" reform | "after" reform |
| student-dataset | 914 variables | 698 variables | 883 variables | 494 variables | 911 variables |
| # of students a | 34,754 | 46,185 | 39,573 | 39,663 | 44,584 |
| test scores b | reading | reading | reading | reading | - |
| | mathematics | mathematics | mathematics | - | mathematics |
| | sciences | sciences | sciences | - | sciences |
| school-dataset | 470 variables | 633 variables | 387 variables | - | 176 variables |
| # of schools | 1,342 | 1,411 | 1,496 | - | 1,048 |
| teacher-dataset | - | - | 194 variables | 503 variables | 422 variables |
| # of teachers | - | - | 14,572 | 3,376 | 4,213 |

a Number of observations for students as included in the PISA-E-Datasets (2000, 2003, 2006) and IQB-Ländervergleichsstudie(LV)-Sprachen-2009, IQB-Ländervergleichsstudie(LV)-2012 as available from the based on the grade-based sample (see also ??). Note that here the student-dataset includes only the original student questionnaire answers, as the parental ones are only provided for PISA-2006-E.

b For year 2009, only reading scores were assessed, whereas in 2012 only mathematics and science test scores were constructed that can be compared with previous PISA-E results. The test score domains in bold letters have been in focus for the respective PISA test cycle.

c Note that for PISA-2000-E, the IQB only provides an age-based sample, which makes its use for analyzing the G-8-reform as discussed in section 4.

d For years 2000 and 2003, the teacher-dataset was not part of the German-specific PISA dataset provided via the IQB, as in the other years. Similarly, the school-dataset was not provided for the IQB-LV-Sprachen-2008/2009.

Table A.2: Overview of the "G-8-reform" across federal states, sorted by year of double cohort

| Federal state | "Type of fed. state" (Western/Eastern, city/territorial state) | Population a | School year of reform start | School year of reform completion (year of double cohort) | Reform timeline (details) | Status Quo c / Reversing the reform? (compare Figure A.6) |
|---|---|---|---|---|---|---|
| Saxony (SN) | Eastern state, territorial state | 4,0 mio. | Since 1949 (implemented in GDR) | - | Normal Gymnasium (5 to 12) | Never had a G-9-model. Stayed always in G-8-model. |
| Thuringia (TH) | Eastern state, territorial state | 2,2 mio. | Since 1949 (implemented in GDR) | - | Normal Gymnasium (5 to 12) | Never had a G-9-model. Stayed always in G-8-model. |
| Saxony-Anhalt (ST) | Eastern state, territorial state | 2,3 mio. | 2003/2004 | 2006/2007 | Start for 9th grade: Normal Gymnasium (5 to 12) | No, not in general. |
| Mecklenburg-Western Pomerania (MWP) | Eastern state, territorial state | 1,6 mio. | 2004/2005 | 2007/2008 | Start for 9th grade: Normal Gymnasium (7 to 12) | No, not in general. |
| Saarland (SL) | Western state, territorial state | 1,0 mio. | 2001/2002 | 2008/2009 | Start for 5th grade: Normal Gymnasium (5 to 12) | No: Gymnasium remains in G-8-model, but: in G-13-model |
| Hamburg (HB) | Western state, city state | 1,7 mio. | 2002/2003 | 2009/2010 | Start for 5th grade: Normal Gymnasium (5 to 12) | No: Gymnasium remains in G-8-model, but: while so called Stadtschule as comprehensive school offers a G-13-model |
| Bavaria (BV) | Western state, territorial state | 12,5 mio. | 2004/2005 | 2010/2011 | Start for 5th + 6th grade: Normal Gymnasium (5-12) b | Yes, general revision to G-9-(G-13-model) starting with school year 2019/2020 as announced in April 2017. |
| Lower-Saxony (LS) | Western state, territorial state | 7,8 mio. | 2004/2005 | 2010/2011 | Start for 5th + 6th grade: Normal Gymnasium (5-12) b | Yes, general reversion to G-9-(G-13-model) starting with school year 2015/16, but with a voluntary option for the G-8-(G-12-model). |
| Baden-Wuerttemberg (BW) | Western state, territorial state | 10,5 mio. | 2004/2005 | 2011/2012 | Start for 5th grade: Normal Gymnasium (5 to 12) | No, not in general. But: since 2012/2013: state-wide pilot project allows 44 model schools to offer a G-9-model |
| Bremen (BR) | Western state, city state | 0,7 mio. | 2004/2005 | 2011/2012 | Start for 5th grade: Normal Gymnasium (5 to 12) | No: Gymnasium remains in G-8-model, but: while so called Oberschule as comprehensive school offers a G-13-model |
| Berlin (BE) | Western state, city state | 3,4 mio. | 2006/2007 | 2011/2012 | Start for 7th grade: Gymnasium (7 to 12) | No: Gymnasium remains in G-8-model, but: integrated comprehensive schools are allowed to offer G-9-(G-13)-model |
| Brandenburg (BB) | Eastern state, territorial state | 2,5 mio. | 2006/2007 | 2011/2012 | Start for 7th grade: Gymnasium (7 to 12) | No: Gymnasium remains in G-8-model, but: integrated comprehensive schools are allowed to offer G-9-(G-13)-model |
| North Rhine-Westphalia (NRW) | Western state, territorial state | 17,6 mio. | 2005/2006 | 2012/2013 | Start for 5th grade: Normal Gymnasium (5 to 12) | No, not in general. But: in 2011/2012: a pilot project with 13/630 Gymnasien offering a G-9-model |
| Hesse (H) | Western state, territorial state | 6,0 mio. | Successive intro. in % of all Normal Gymnasium (5-12): 2004/2005: 10%; 2005/2006: 60%; 2006/2007: 30% | Double cohorts: 2011/2012, 2012/2013 and 2013/2014 | Normal Gymnasium (5-12) | Yes, since 2013/2014: students allowed to choose between G-12 or G-13 model from 5th grade onwards |
| Rhineland-Palatinate (RP) | Western state, territorial state | 4,0 mio. | 2008/2009 | 2015/2016 | Start for 5th grade: Normal Gymnasium (5 to 13) | Always maintained schools with G-9-model (G-13-model), but since 2008/2009 G-8-model offered at 19 Gymnasien |
| Schleswig-Holstein (SH) | Western state, territorial state | 2,8 mio. | 2008/2009 | 2015/2016 | Start for 5th grade: Normal Gymnasium (5 to 13) | Since 2011/2012 schools are allowed by state's school law to offer a G-9-model (11/99 schools), G-8-model or both (4/99 schools). |

a Numbers taken from the most recent census in 2011 are valid for the considered time period from 2003 to 2012 (German Federal Statistical Office, 2014).

b In Bavaria (BV) and Lower-Saxony (LS), the 6th and 5th grade were moved into the G-8-model in the same school year, suggesting that educational intensity might be slightly stronger for the then 6th graders, who had to compensate for the shortened school duration during 7 instead of 8 years, than for the then 5th grade students. However, the 9th graders in 2009 in BV and LS were affected by the reform right from the 5th grade.

c See also the Secretariat of the Standing Conference of the Ministers of Education: https://www.kmk.org/themen/allgemeinbildende-schulen/bildungswege-und-abschluesse/sekundarstufe-ii-gymnasiale-oberstufe-und-abitur.html

Table A.3: Descriptive Statistics: Pre-Reform Treatment vs. Control-Group Comparison of Control-Variables for additional Groups

Columns T3, C2hyp and T3-C2hyp refer to Model Base-MT (2003-2012); columns T5, C3, T5-C3, T7 and T7-C3 refer to Model Base-ST (2003-2009).

| Variable | T3 | C2hyp | T3-C2hyp | T5 | C3 | T5-C3 | T7 | T7-C3 |
|---|---|---|---|---|---|---|---|---|
| Individual characteristics | | | | | | | | |
| female-dummy | 0.537 | 0.560 | -0.023 | 0.533 | 0.549 | -0.016 | 0.535 | -0.014 |
| Age in years | 15.488 | 15.514 | -0.026 | 15.492 | 15.464 | 0.028* | 15.474 | 0.010 |
| migration background (Base category: German language/both parents born in Germany) | | | | | | | | |
| - language spoken at home | 0.054 | 0.018 | 0.036*** | 0.056 | 0.055 | 0.000 | 0.056 | 0.001 |
| - migration background | 0.183 | 0.059 | 0.124*** | 0.188 | 0.175 | 0.013 | 0.184 | 0.009 |
| Parental characteristics | | | | | | | | |
| Parental Education (highest ISCED level) (Base category: ISCED-level (3-4)) | | | | | | | | |
| - ISCED-level (5-6) | 0.662 | 0.641 | 0.021 | 0.654 | 0.648 | 0.006 | 0.658 | 0.011 |
| - ISCED-level (3-4) | 0.288 | 0.310 | -0.021 | 0.285 | 0.288 | -0.003 | 0.280 | -0.008 |
| - ISCED-level (1-2) | 0.044 | 0.012 | 0.033*** | 0.046 | 0.036 | 0.010 | 0.045 | 0.009 |
| - missing | 0.006 | 0.038 | -0.032*** | 0.015 | 0.028 | -0.013*** | 0.017 | -0.011*** |
| Socio-Economic Status | | | | | | | | |
| Number of books in household (Base category: Number of books (101-500)) | | | | | | | | |
| - + 500 | 0.226 | 0.153 | 0.073*** | 0.229 | 0.246 | -0.017 | 0.220 | -0.026** |
| - 101-500 | 0.509 | 0.448 | 0.061*** | 0.501 | 0.481 | 0.019 | 0.496 | 0.015 |
| - 11-100 | 0.246 | 0.341 | -0.095*** | 0.244 | 0.228 | 0.015 | 0.257 | 0.029** |
| - max. 11 | 0.010 | 0.023 | -0.013** | 0.010 | 0.015 | -0.005 | 0.011 | -0.004 |
| highest ISEI-level of job in the family | | | | | | | | |
| - highest ISEI-level | 59.103 | 55.590 | 3.514*** | 58.975 | 58.471 | 0.504 | 58.656 | 0.185 |
| - missing | 0.004 | 0.018 | -0.014*** | 0.008 | 0.006 | 0.002 | 0.008 | 0.002 |
| Family Characteristics | | | | | | | | |
| family structure - growing up in single parent household? (Base: No) | | | | | | | | |
| - single parent household | 0.137 | 0.176 | -0.039** | 0.140 | 0.150 | -0.010 | 0.168 | 0.018 |
| - missing | 0.072 | 0.069 | 0.003 | 0.076 | 0.057 | 0.019*** | 0.069 | 0.012* |
| family structure - mother/father employment status (Base: FT) | | | | | | | | |
| Father - full-time (FT) | 0.854 | 0.841 | 0.013 | 0.847 | 0.843 | 0.004 | 0.832 | -0.011 |
| Father - part-time (PT) | 0.065 | 0.036 | 0.029*** | 0.065 | 0.058 | 0.007 | 0.065 | 0.007 |
| Father - unemployed (UE) | 0.024 | 0.058 | -0.033*** | 0.025 | 0.026 | -0.001 | 0.034 | 0.009* |
| Father - out-of-labor force (OLF) | 0.033 | 0.026 | 0.007 | 0.031 | 0.033 | -0.001 | 0.031 | -0.001 |
| Mother - full-time (FT) | 0.217 | 0.614 | -0.397*** | 0.216 | 0.232 | -0.016 | 0.297 | 0.065*** |
| Mother - part-time (PT) | 0.515 | 0.198 | 0.318*** | 0.511 | 0.476 | 0.036** | 0.448 | -0.027* |
| Mother - unemployed (UE) | 0.061 | 0.096 | -0.035*** | 0.060 | 0.063 | -0.003 | 0.067 | 0.004 |
| Mother - out-of-labor force (OLF) | 0.194 | 0.063 | 0.132*** | 0.195 | 0.202 | -0.008 | 0.169 | -0.033*** |
| Number of students | 2,175 | 607 | - | 2,365 | 1,861 | - | 2,999 | - |

Notes: This table shows two-sample t-tests comparing, for the pre-reform period, the main control variables between Treatment- and Control-Group for the additional specifications not covered in Table 5. For both T3 vs. C2hyp in Model Base-MT and T5/T7 vs. C3 in Model Base-ST, the values are the respective pooled averages of control variables in PISA-I-2003 and -2006. Stars denote significance of the simple mean difference in pre-reform characteristics in the form of p-values as follows: *** p<0.01; ** p<0.05; * p<0.1. Source: Author's calculation based on PISA-I-data 2003, 2006, 2009, 2012.

Table A.4: Main Results for Model Base-MT: 1st stage to derive IEOp measure (R2) for Reading scores

Dependent variable: stdpvread2. Columns (1)-(4) refer to C2, columns (5)-(8) to T3. Standard errors are in parentheses. Set labels: IC, PC, SES, FC with sub-sets i)-vii).

| Set | VARIABLES | (1) C2 Before | (2) C2 Before | (3) C2 After | (4) C2 After | (5) T3 Before | (6) T3 Before | (7) T3 After | (8) T3 After |
|---|---|---|---|---|---|---|---|---|---|
| IC i) | female | 0.032 (0.058) | 0.079 (0.062) | 0.425* (0.065) | 0.438* (0.060) | 0.298*** (0.044) | 0.327** (0.035) | 0.437*** (0.044) | 0.453** (0.055) |
| IC i) | age in years (reference: test month May) | -0.016 (0.173) | -0.030 (0.137) | -0.276+ (0.055) | -0.274* (0.041) | -0.229*** (0.041) | -0.189* (0.044) | -0.183** (0.028) | -0.194*** (0.015) |
| IC ii) | language spoken at home is NOT German | -0.571** (0.029) | -0.554** (0.026) | -0.054 (0.020) | -0.158 (0.046) | -0.357*** (0.119) | -0.388+ (0.136) | -0.190+ (0.068) | -0.200** (0.045) |
| IC ii) | student has migration background | -0.275* (0.041) | -0.256* (0.037) | -0.213 (0.080) | -0.228+ (0.048) | -0.109** (0.053) | -0.064 (0.039) | -0.096** (0.019) | -0.076*** (0.005) |
| PC iii) | highest parental education: Baseline (ISCED 3,4) | | | | | | | | |
| PC iii) | at most lower sec. educ. (ISCED 1,2) | -0.573** (0.021) | -0.541 (0.140) | 0.107 (0.249) | 0.133 (0.318) | -0.297*** (0.097) | -0.310+ (0.122) | -0.040* (0.012) | 0.003 (0.019) |
| PC iii) | tertiary educ. (ISCED 5,6) | -0.157 (0.102) | -0.172 (0.114) | 0.198 (0.129) | 0.180 (0.131) | 0.001 (0.048) | -0.009 (0.063) | -0.024 (0.032) | -0.041 (0.044) |
| SES iv) | No. of books at home: Baseline (101-500 books) | | | | | | | | |
| SES iv) | max 10 books | 0.157 (0.052) | 0.196 (0.063) | -0.601 (0.315) | -0.492 (0.332) | -0.498** (0.212) | -0.532* (0.126) | -0.576** (0.089) | -0.497*** (0.009) |
| SES iv) | 11-100 books in household | -0.183* (0.028) | -0.144* (0.019) | -0.144 (0.056) | -0.092 (0.040) | -0.342*** (0.049) | -0.343** (0.055) | -0.198+ (0.074) | -0.160* (0.039) |
| SES iv) | more than 500 books in household | 0.212* (0.031) | 0.223 (0.084) | 0.097 (0.148) | 0.084 (0.121) | 0.109* (0.054) | 0.089 (0.046) | 0.090* (0.025) | 0.094* (0.026) |
| SES v) | Parents: highest ISEI of job (incl. mis) | 0.007 (0.003) | 0.007 (0.004) | 0.001 (0.001) | 0.000 (0.001) | 0.001 (0.001) | 0.001 (0.001) | 0.004+ (0.002) | 0.003* (0.001) |
| FC vi) | student has single parent (incl. miss) | -0.035 (0.062) | 0.089 (0.074) | 0.351* (0.047) | 0.299+ (0.056) | 0.106* (0.063) | 0.106 (0.048) | 0.123 (0.062) | 0.136 (0.079) |
| FC vi) | info on family structure missing | 0.006 (0.092) | 0.009 (0.148) | -0.046** (0.001) | 0.024 (0.008) | -0.372** (0.155) | -0.295** (0.066) | -0.205+ (0.089) | -0.139 (0.086) |
| FC vii) | employment status: Baseline (full time employed (FT)) | | | | | | | | |
| FC vii) | father: employed (PT) | -0.390 (0.272) | -0.333 (0.213) | -0.249** (0.011) | -0.261* (0.027) | -0.105 (0.092) | -0.101 (0.105) | -0.153** (0.023) | -0.122* (0.037) |
| FC vii) | father: UE at moment | 0.427 (0.715) | 0.410 (0.676) | 0.279 (0.346) | 0.346 (0.365) | 0.011 (0.147) | -0.043 (0.124) | 0.013 (0.133) | 0.072 (0.126) |
| FC vii) | father: out of labor force | 0.042 (0.093) | 0.008 (0.046) | -0.040 (0.161) | -0.086 (0.125) | 0.118 (0.089) | 0.104 (0.102) | 0.130 (0.234) | 0.127 (0.202) |
| FC vii) | mother: employed (PT) | -0.016 (0.128) | -0.009 (0.024) | -0.086 (0.190) | -0.076 (0.171) | 0.068 (0.056) | 0.014 (0.051) | 0.001 (0.014) | 0.017 (0.013) |
| FC vii) | mother: UE at moment | 0.185 (0.244) | 0.295 (0.213) | 0.150 (0.508) | 0.270 (0.397) | -0.049 (0.091) | -0.063 (0.032) | 0.204 (0.198) | 0.178 (0.205) |
| FC vii) | mother: out of labor force | -0.171*** (0.000) | -0.176 (0.061) | 0.027 (0.269) | 0.102 (0.254) | 0.038 (0.079) | -0.026 (0.024) | -0.032 (0.026) | -0.018* (0.005) |
| | Constant | 0.056 (2.537) | 0.164 (1.822) | 3.771 (1.060) | 4.089+ (0.805) | 3.328*** (0.660) | 2.752* (0.809) | 2.377** (0.397) | 1.875** (0.209) |
| | Federal States Fixed-Effects | yes | yes | yes | yes | yes | yes | yes | yes |
| | School Fixed Effects | no | yes | no | yes | no | yes | no | yes |
| | year effects | no | yes | no | yes | no | yes | no | yes |
| | state effects | no | yes | no | yes | no | yes | no | yes |
| | Observations | 346 | 346 | 608 | 608 | 2,168 | 2,168 | 3,093 | 3,093 |
| | R2 | 0.224 | 0.306 | 0.155 | 0.191 | 0.149 | 0.222 | 0.154 | 0.246 |
| | Adjusted R2 | 0.166 | 0.240 | 0.120 | 0.143 | 0.139 | 0.197 | 0.147 | 0.225 |

Notes: This table shows the first-stage OLS regressions used to derive the R2 as IEOp measure for the DID estimation approach, whose results are shown in the first sub-table of Table 6. The dependent variable is stdpvread2, i.e. PISA Reading test scores standardized for each testing year with respect to the students in Gymnasium who are part of the representative grade-based German PISA test cohort in the respective test year. Background variables used to derive R2 are those explained in section 5.2. Robust standard errors, clustered at the federal state level, are in parentheses. *** p<0.01, ** p<0.05, * p<0.10, + p<0.15. For all regressions, population weights associated with each student observation are used. Source: Author's calculation based on PISA-I-data 2003, 2006, 2009, 2012.

Table A.5: Main Results for Model Base-MT: 1st stage to derive IEOp measure (R2) for Maths test scores

Dependent variable: stdpvmath2. Columns (1)-(4) refer to C2, columns (5)-(8) to T3. Standard errors are in parentheses. Set labels: IC, PC, SES, FC with sub-sets i)-vii).

| Set | VARIABLES | (1) C2 Before | (2) C2 Before | (3) C2 After | (4) C2 After | (5) T3 Before | (6) T3 Before | (7) T3 After | (8) T3 After |
|---|---|---|---|---|---|---|---|---|---|
| IC i) | female | -0.819** (0.040) | -0.731** (0.049) | -0.621** (0.043) | -0.601** (0.029) | -0.560*** (0.038) | -0.510*** (0.036) | -0.554*** (0.028) | -0.515*** (0.044) |
| IC i) | age in years (reference: test month May) | -0.102** (0.002) | -0.126 (0.047) | -0.329 (0.094) | -0.356 (0.101) | -0.246*** (0.063) | -0.240+ (0.085) | -0.234*** (0.012) | -0.253*** (0.004) |
| IC ii) | language spoken at home is NOT German | -0.498* (0.055) | -0.429* (0.049) | -0.158** (0.003) | -0.202+ (0.036) | -0.117 (0.131) | -0.106 (0.114) | -0.207* (0.057) | -0.220* (0.067) |
| IC ii) | student has migration background | -0.122 (0.079) | -0.092 (0.057) | -0.155** (0.011) | -0.179** (0.012) | -0.206*** (0.055) | -0.125 (0.057) | -0.174*** (0.016) | -0.158** (0.026) |
| PC iii) | highest parental education: Baseline (ISCED 3,4) | | | | | | | | |
| PC iii) | at most lower sec. educ. (ISCED 1,2) | -0.518 (0.209) | -0.514 (0.304) | 0.070 (0.299) | 0.169 (0.323) | -0.228** (0.098) | -0.186+ (0.066) | -0.186* (0.052) | -0.126+ (0.049) |
| PC iii) | tertiary educ. (ISCED 5,6) | -0.225 (0.139) | -0.255 (0.167) | 0.101 (0.270) | 0.101 (0.261) | 0.031 (0.045) | 0.015 (0.038) | 0.031 (0.018) | 0.019 (0.020) |
| SES iv) | No. of books at home: Baseline (101-500 books) | | | | | | | | |
| SES iv) | max 10 books | 0.406** (0.014) | 0.379*** (0.002) | -0.526 (0.505) | -0.450 (0.491) | -0.444*** (0.151) | -0.427* (0.125) | -0.372* (0.120) | -0.331** (0.057) |
| SES iv) | 11-100 books in household | -0.161 (0.050) | -0.134 (0.059) | -0.129 (0.036) | -0.089 (0.040) | -0.291*** (0.047) | -0.282** (0.060) | -0.181** (0.036) | -0.156** (0.021) |
| SES iv) | more than 500 books in household | 0.264** (0.013) | 0.259 (0.073) | 0.206 (0.067) | 0.209 (0.061) | 0.053 (0.056) | 0.064** (0.014) | 0.128* (0.041) | 0.130* (0.038) |
| SES v) | Parents: highest ISEI of job (incl. mis) | 0.009* (0.001) | 0.008+ (0.002) | 0.001 (0.001) | 0.001 (0.001) | 0.001 (0.001) | 0.002 (0.001) | 0.004* (0.001) | 0.004** (0.000) |
| FC vi) | student has single parent (incl. miss) | -0.045 (0.070) | 0.062 (0.025) | 0.273** (0.018) | 0.224** (0.016) | 0.029 (0.062) | 0.049 (0.037) | 0.103+ (0.037) | 0.105 (0.050) |
| FC vi) | info on family structure missing | -0.226 (0.142) | -0.216 (0.169) | -0.120 (0.105) | -0.041 (0.144) | -0.306** (0.144) | -0.140*** (0.013) | -0.219 (0.100) | -0.166 (0.081) |
| FC vii) | employment status: Baseline (full time employed (FT)) | | | | | | | | |
| FC vii) | father: employed (PT) | -0.319 (0.246) | -0.261 (0.206) | -0.429* (0.050) | -0.467* (0.044) | -0.050 (0.075) | -0.026 (0.113) | -0.182+ (0.070) | -0.141 (0.074) |
| FC vii) | father: UE at moment | 0.065 (0.299) | 0.062 (0.314) | 0.127 (0.675) | 0.103 (0.678) | -0.287** (0.136) | -0.251** (0.054) | 0.034 (0.081) | 0.064 (0.072) |
| FC vii) | father: out of labor force | -0.379+ (0.062) | -0.361 (0.150) | -0.007 (0.262) | -0.006 (0.286) | -0.090 (0.141) | -0.042 (0.140) | -0.030 (0.127) | -0.029 (0.114) |
| FC vii) | mother: employed (PT) | 0.117 (0.067) | 0.106** (0.006) | 0.039 (0.116) | 0.021 (0.099) | 0.053 (0.058) | 0.004 (0.043) | 0.065*** (0.004) | 0.070** (0.011) |
| FC vii) | mother: UE at moment | 0.170 (0.084) | 0.233 (0.079) | 0.164 (0.225) | 0.146 (0.230) | -0.011 (0.080) | -0.021** (0.005) | 0.299+ (0.130) | 0.305 (0.141) |
| FC vii) | mother: out of labor force | -0.074 (0.137) | -0.082 (0.188) | 0.209 (0.133) | 0.221 (0.155) | 0.058 (0.072) | -0.022 (0.031) | 0.097** (0.016) | 0.101** (0.016) |
| | Constant | 1.572** (0.088) | 1.856 (1.001) | 5.128 (1.666) | 5.855 (1.752) | 4.021*** (1.017) | 4.001+ (1.430) | 3.523*** (0.248) | 2.914*** (0.032) |
| | Federal States Fixed-Effects | yes | yes | yes | yes | yes | yes | yes | yes |
| | School Fixed Effects | no | yes | no | yes | no | yes | no | yes |
| | year effects | no | yes | no | yes | no | yes | no | yes |
| | state effects | no | yes | no | yes | no | yes | no | yes |
| | Observations | 346 | 346 | 608 | 608 | 2,168 | 2,168 | 3,093 | 3,093 |
| | R2 | 0.335 | 0.385 | 0.191 | 0.217 | 0.184 | 0.292 | 0.198 | 0.268 |
| | Adjusted R2 | 0.286 | 0.327 | 0.158 | 0.171 | 0.174 | 0.269 | 0.191 | 0.248 |

Notes: This table shows the first-stage OLS regressions used to derive the R2 as IEOp measure for the DID estimation approach, whose results are shown in the first sub-table of Table 6. The dependent variable is stdpvmath2, i.e. PISA Mathematics test scores standardized for each testing year with respect to the students in Gymnasium who are part of the representative grade-based German PISA test cohort in the respective test year. Background variables used to derive R2 are those explained in section 5.2. Robust standard errors, clustered at the federal state level, are in parentheses. *** p<0.01, ** p<0.05, * p<0.10, + p<0.15. For all regressions, population weights associated with each student observation are used. Source: Author's calculation based on PISA-I-data 2003, 2006, 2009, 2012.

Table A.6: Main Results for Model Base-MT: 1st stage to derive IEOp measure (R2) for Sciences scores

                                            stdpvscie2 for C2                           stdpvscie2 for T3
                                            Before                After                 Before                After
VARIABLES                                   (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)

IC (i)
  female                                    -0.658*    -0.568+    -0.429*    -0.389**   -0.451***  -0.397***  -0.365**   -0.348**
                                            (0.082)    (0.102)    (0.042)    (0.017)    (0.050)    (0.038)    (0.055)    (0.065)
  age in years (reference: test month May)  -0.047     -0.090     -0.263     -0.293+    -0.210***  -0.182+    -0.192***  -0.203***
                                            (0.021)    (0.028)    (0.079)    (0.053)    (0.054)    (0.068)    (0.016)    (0.020)
IC (ii)
  language spoken at home is NOT German     -0.530*    -0.412+    -0.126     -0.186     -0.254*    -0.274+    -0.211+    -0.230**
                                            (0.044)    (0.086)    (0.077)    (0.068)    (0.149)    (0.107)    (0.082)    (0.041)
  student has migration background          -0.355     -0.329     -0.392***  -0.383**   -0.147**   -0.083     -0.214*    -0.185**
                                            (0.099)    (0.108)    (0.002)    (0.011)    (0.063)    (0.042)    (0.060)    (0.035)
PC (iii): highest parental education, baseline ISCED 3,4
  at most lower sec. educ. (ISCED 1,2)      -0.700**   -0.629     0.097      0.115      -0.277***  -0.280*    -0.105+    -0.052+
                                            (0.039)    (0.184)    (0.244)    (0.269)    (0.097)    (0.095)    (0.045)    (0.023)
  tertiary educ. (ISCED 5,6)                -0.130     -0.172     0.148      0.181      0.040      0.036      0.040      0.025
                                            (0.172)    (0.201)    (0.145)    (0.172)    (0.053)    (0.058)    (0.024)    (0.027)
SES (iv): no. of books at home, baseline 101-500 books
  max 10 books                              0.140      0.119      -0.521     -0.464     -0.251*    -0.304     -0.668*    -0.593***
                                            (0.111)    (0.111)    (0.594)    (0.590)    (0.142)    (0.173)    (0.155)    (0.057)
  11-100 books in household                 -0.183**   -0.141+    -0.159+    -0.133+    -0.327***  -0.315*    -0.271**   -0.244**
                                            (0.004)    (0.030)    (0.028)    (0.022)    (0.051)    (0.083)    (0.049)    (0.051)
  more than 500 books in household          0.199***   0.195      0.136      0.138      0.150***   0.157**    0.186**    0.185**
                                            (0.001)    (0.066)    (0.116)    (0.118)    (0.048)    (0.033)    (0.020)    (0.021)
SES (v)
  Parents: highest ISEI of job (incl. mis)  0.010**    0.010**    0.003+     0.002*     0.003*     0.003+     0.004      0.003
                                            (0.000)    (0.001)    (0.000)    (0.000)    (0.001)    (0.001)    (0.002)    (0.001)
FC (vi)
  student has single parent (incl. miss)    -0.057     0.055      0.314***   0.248+     0.028      0.026      0.123+     0.116
                                            (0.190)    (0.071)    (0.005)    (0.047)    (0.061)    (0.034)    (0.049)    (0.065)
  info on family structure missing          -0.084     -0.051     -0.250*    -0.248*    -0.307**   -0.217**   -0.156     -0.114
                                            (0.238)    (0.291)    (0.036)    (0.020)    (0.150)    (0.036)    (0.137)    (0.115)
FC (vii): employment status, baseline full time employed (FT)
  father: employed (PT)                     -0.244     -0.184     -0.246     -0.260+    -0.108     -0.073     -0.205     -0.177
                                            (0.272)    (0.236)    (0.063)    (0.051)    (0.095)    (0.119)    (0.102)    (0.106)
  father: UE at moment                      0.186      0.187      0.208      0.185      -0.018     -0.034     0.049      0.071
                                            (0.763)    (0.745)    (0.502)    (0.533)    (0.138)    (0.127)    (0.146)    (0.130)
  father: out of labor force                -0.120     -0.086     0.056      0.065      0.006      0.013      0.080      0.079
                                            (0.313)    (0.395)    (0.206)    (0.229)    (0.124)    (0.131)    (0.243)    (0.221)
  mother: employed (PT)                     -0.089     -0.089     -0.043     -0.081     0.048      -0.001     0.025      0.040*
                                            (0.084)    (0.025)    (0.174)    (0.139)    (0.056)    (0.055)    (0.014)    (0.009)
  mother: UE at moment                      0.113      0.206      0.070      0.107      -0.030     -0.014     0.276      0.257
                                            (0.150)    (0.104)    (0.299)    (0.260)    (0.087)    (0.090)    (0.142)    (0.148)
  mother: out of labor force                -0.229     -0.247     0.094      0.094      0.013      -0.057+    0.006      0.023
                                            (0.088)    (0.162)    (0.258)    (0.235)    (0.066)    (0.025)    (0.049)    (0.022)
Constant                                    0.679      1.351      4.059      4.921+     3.353***   2.845+     2.919***   2.357***
                                            (0.353)    (0.597)    (1.473)    (1.074)    (0.871)    (1.216)    (0.195)    (0.181)

Federal States Fixed-Effects                yes        yes        yes        yes        yes        yes        yes        yes
School Fixed Effects                        no         yes        no         yes        no         yes        no         yes
year effects                                no         yes        no         yes        no         yes        no         yes
state effects                               no         yes        no         yes        no         yes        no         yes
Observations                                346        346        608        608        2,168      2,168      3,093      3,093
R2                                          0.344      0.418      0.157      0.200      0.161      0.241      0.156      0.231
Adjusted R2                                 0.295      0.362      0.122      0.153      0.151      0.217      0.149      0.210

Notes: This table shows the first-stage OLS regressions used to derive the R2 as IEOp measure for the DID estimation approach, whose results are shown in the first sub-table of Table 6. The dependent variable is stdpvscie2, i.e. standardized PISA Sciences test scores for each testing year with respect to students in Gymnasium that are part of the representative grade-based German PISA test cohort in the respective test year. Background variables used to derive R2 are those explained in section 5.2. Robust standard errors, clustered at the federal state level, are in parentheses. *** p<0.01, ** p<0.05, * p<0.10, + p<0.15. Population weights associated with each student observation are used in all regressions. Source: Author's calculation based on PISA-I-data 2003, 2006, 2009, 2012.
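The first-stage step described in these notes (a population-weighted OLS regression of standardized test scores on background variables, with R2 read off as the IEOp measure) can be illustrated with a minimal sketch. The function name `weighted_r2` and the weighted definition of R2 used here are assumptions for illustration only; the sketch omits the fixed effects and the clustered standard errors reported in the tables.

```python
import numpy as np

def weighted_r2(y, X, w):
    """Return (R2, adjusted R2) from a weighted least-squares fit of y on X.

    X holds the background variables without an intercept column;
    an intercept is added here. w holds one weight per observation.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])
    # Weighted least squares: scale rows by sqrt(w), then solve ordinary LS.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xc * sw[:, None], y * sw, rcond=None)
    resid = y - Xc @ beta
    ybar = np.average(y, weights=w)
    ss_res = np.sum(w * resid ** 2)
    ss_tot = np.sum(w * (y - ybar) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Standard degrees-of-freedom adjustment with k regressors plus intercept.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2
```

Under this reading, a larger R2 means that a larger share of test-score variation is accounted for by circumstances, which is how the tables interpret it.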

Table A.7: Main Results for Model Base-ST: 1st stage to derive IEOp measure (R2) for Reading test scores

                                            stdpvread2 for C3                           stdpvread2 for T3
                                            Before                After                 Before                After
VARIABLES                                   (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)

IC (i)
  female                                    0.246+     0.275*     0.344**    0.355**    0.298**    0.327**    0.370**    0.354**
                                            (0.098)    (0.074)    (0.063)    (0.061)    (0.032)    (0.035)    (0.061)    (0.074)
  age in years (reference: test month May)  -0.184*    -0.202*    -0.339***  -0.349***  -0.229***  -0.189*    -0.320**   -0.328***
                                            (0.056)    (0.050)    (0.014)    (0.007)    (0.021)    (0.044)    (0.037)    (0.029)
IC (ii)
  language spoken at home is NOT German     -0.400***  -0.340**   -0.339**   -0.300*    -0.357     -0.388+    -0.243     -0.239
                                            (0.040)    (0.057)    (0.061)    (0.092)    (0.160)    (0.136)    (0.107)    (0.135)
  student has migration background          -0.138+    -0.119+    -0.132     -0.112+    -0.109     -0.064     -0.099     -0.078
                                            (0.056)    (0.051)    (0.066)    (0.043)    (0.048)    (0.039)    (0.065)    (0.054)
PC (iii): highest parental education, baseline ISCED 3,4
  at most lower sec. educ. (ISCED 1,2)      -0.413*    -0.380+    -0.081     -0.011     -0.297+    -0.310+    -0.154     -0.148
                                            (0.103)    (0.135)    (0.105)    (0.112)    (0.106)    (0.122)    (0.118)    (0.126)
  tertiary educ. (ISCED 5,6)                -0.006     -0.001     0.032      0.068      0.001      -0.009     0.060      0.044
                                            (0.061)    (0.073)    (0.078)    (0.070)    (0.065)    (0.063)    (0.081)    (0.088)
SES (iv): no. of books at home, baseline 101-500 books
  max 10 books                              -0.162     -0.164     -0.832***  -0.851***  -0.498+    -0.532*    -0.505**   -0.451***
                                            (0.081)    (0.121)    (0.009)    (0.064)    (0.211)    (0.126)    (0.063)    (0.033)
  11-100 books in household                 -0.220**   -0.187**   -0.189*    -0.173+    -0.342**   -0.343**   -0.248**   -0.201*
                                            (0.027)    (0.035)    (0.048)    (0.072)    (0.069)    (0.055)    (0.042)    (0.055)
  more than 500 books in household          0.105+     0.117+     0.217***   0.223**    0.109+     0.089      0.139+     0.149**
                                            (0.040)    (0.048)    (0.020)    (0.036)    (0.040)    (0.046)    (0.050)    (0.027)
SES (v)
  Parents: highest ISEI of job (incl. mis)  0.003+     0.003+     0.002      0.001      0.001      0.001      0.002      0.002
                                            (0.001)    (0.001)    (0.001)    (0.001)    (0.001)    (0.001)    (0.003)    (0.004)
FC (vi)
  student has single parent (incl. miss)    0.003      0.030      0.074      0.055      0.106      0.106      0.231*     0.238+
                                            (0.027)    (0.018)    (0.117)    (0.095)    (0.056)    (0.048)    (0.067)    (0.088)
  info on family structure missing          -0.143+    -0.145+    0.064      0.080      -0.372     -0.295**   -0.295     -0.243
                                            (0.053)    (0.052)    (0.152)    (0.137)    (0.180)    (0.066)    (0.170)    (0.175)
FC (vii): employment status, baseline full time employed (FT)
  father: employed (PT)                     -0.241+    -0.195+    -0.229     -0.200     -0.105     -0.101     -0.046     0.003
                                            (0.084)    (0.072)    (0.103)    (0.129)    (0.111)    (0.105)    (0.096)    (0.093)
  father: UE at moment                      -0.008     0.003      0.191      0.200      0.011      -0.043     0.126      0.116
                                            (0.175)    (0.165)    (0.115)    (0.124)    (0.087)    (0.124)    (0.206)    (0.178)
  father: out of labor force                -0.023     -0.033     0.019      0.004      0.118      0.104      0.081      0.115
                                            (0.035)    (0.026)    (0.029)    (0.005)    (0.085)    (0.102)    (0.392)    (0.407)
  mother: employed (PT)                     0.026      0.029      -0.083     -0.099     0.068      0.014      -0.069**   -0.069*
                                            (0.036)    (0.028)    (0.100)    (0.105)    (0.062)    (0.051)    (0.009)    (0.021)
  mother: UE at moment                      0.127*     0.142+     -0.048     0.031      -0.049*    -0.063     0.213      0.228
                                            (0.037)    (0.057)    (0.089)    (0.092)    (0.012)    (0.032)    (0.250)    (0.297)
  mother: out of labor force                -0.055     -0.059     -0.130     -0.104     0.038      -0.026     -0.081*    -0.096**
                                            (0.039)    (0.048)    (0.138)    (0.137)    (0.034)    (0.024)    (0.026)    (0.017)
Constant                                    2.572*     2.750*     5.066***   5.479***   3.328**    2.752*     4.598**    4.091**
                                            (0.831)    (0.729)    (0.389)    (0.240)    (0.393)    (0.809)    (0.608)    (0.587)

Federal States Fixed-Effects                yes        yes        yes        yes        yes        yes        yes        yes
School Fixed Effects                        no         yes        no         yes        no         yes        no         yes
year effects                                no         yes        no         yes        no         yes        no         yes
state effects                               no         yes        no         yes        no         yes        no         yes
Observations                                1,854      1,854      1,159      1,159      2,168      2,168      1,467      1,467
R2                                          0.125      0.208      0.166      0.210      0.149      0.222      0.187      0.227
Adjusted R2                                 0.112      0.183      0.148      0.180      0.139      0.197      0.174      0.200

Notes: This table shows the first-stage OLS regressions used to derive the R2 as IEOp measure for the DID estimation approach, whose results are shown in the first sub-table of Table 7. The dependent variable is stdpvread2, i.e. standardized PISA Reading test scores for each testing year with respect to students in Gymnasium that are part of the representative grade-based German PISA test cohort in the respective test year. Background variables used to derive R2 are those explained in section 5.2. Robust standard errors, clustered at the federal state level, are in parentheses. *** p<0.01, ** p<0.05, * p<0.10, + p<0.15. Population weights associated with each student observation are used in all regressions. Source: Author's calculation based on PISA-I-data 2003, 2006, 2009.

Table A.8: Main Results for Model Base-ST: 1st stage to derive IEOp measure (R2) for Maths test scores

                                            stdpvmath2 for C3                           stdpvmath2 for T3
                                            Before                After                 Before                After
VARIABLES                                   (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)

IC (i)
  female                                    -0.611**   -0.518**   -0.598**   -0.578**   -0.560***  -0.510***  -0.531***  -0.545***
                                            (0.071)    (0.076)    (0.095)    (0.086)    (0.034)    (0.036)    (0.036)    (0.052)
  age in years (reference: test month May)  -0.245**   -0.243**   -0.349***  -0.361***  -0.246+    -0.240+    -0.290**   -0.301***
                                            (0.046)    (0.038)    (0.026)    (0.027)    (0.104)    (0.085)    (0.032)    (0.014)
IC (ii)
  language spoken at home is NOT German     -0.304**   -0.240*    -0.459**   -0.352*    -0.117     -0.106     -0.316***  -0.328**
                                            (0.056)    (0.059)    (0.084)    (0.119)    (0.173)    (0.114)    (0.029)    (0.060)
  student has migration background          -0.083*    -0.051*    -0.154     -0.107     -0.206*    -0.125     -0.214**   -0.201**
                                            (0.021)    (0.017)    (0.113)    (0.087)    (0.064)    (0.057)    (0.035)    (0.028)
PC (iii): highest parental education, baseline ISCED 3,4
  at most lower sec. educ. (ISCED 1,2)      -0.246     -0.236     0.106      0.201      -0.228*    -0.186+    -0.366     -0.365
                                            (0.139)    (0.170)    (0.118)    (0.134)    (0.063)    (0.066)    (0.169)    (0.169)
  tertiary educ. (ISCED 5,6)                0.019      0.008      0.004      0.039      0.031      0.015      0.023      0.009
                                            (0.089)    (0.102)    (0.075)    (0.073)    (0.067)    (0.038)    (0.053)    (0.057)
SES (iv): no. of books at home, baseline 101-500 books
  max 10 books                              -0.081     -0.071     -0.655**   -0.664*    -0.444+    -0.427*    -0.293*    -0.232+
                                            (0.183)    (0.193)    (0.126)    (0.162)    (0.160)    (0.125)    (0.094)    (0.094)
  11-100 books in household                 -0.253**   -0.245*    -0.202+    -0.187     -0.291**   -0.282**   -0.269**   -0.210+
                                            (0.057)    (0.067)    (0.075)    (0.093)    (0.059)    (0.060)    (0.046)    (0.084)
  more than 500 books in household          0.070      0.073      0.201**    0.197*     0.053***   0.064**    0.105*     0.117**
                                            (0.058)    (0.062)    (0.045)    (0.048)    (0.005)    (0.014)    (0.028)    (0.014)
SES (v)
  Parents: highest ISEI of job (incl. mis)  0.003      0.003      0.004*     0.004*     0.001      0.002      0.004      0.004
                                            (0.002)    (0.002)    (0.001)    (0.001)    (0.001)    (0.001)    (0.003)    (0.003)
FC (vi)
  student has single parent (incl. miss)    -0.056+    -0.021     0.123+     0.120*     0.029+     0.049      0.227**    0.241*
                                            (0.020)    (0.021)    (0.042)    (0.033)    (0.013)    (0.037)    (0.044)    (0.058)
  info on family structure missing          -0.117*    -0.161**   -0.085     -0.073     -0.306     -0.140***  -0.214     -0.176
                                            (0.031)    (0.031)    (0.138)    (0.102)    (0.140)    (0.013)    (0.120)    (0.100)
FC (vii): employment status, baseline full time employed (FT)
  father: employed (PT)                     -0.246**   -0.191*    -0.257+    -0.209     -0.050     -0.026     -0.096     -0.042
                                            (0.055)    (0.050)    (0.108)    (0.144)    (0.130)    (0.113)    (0.120)    (0.117)
  father: UE at moment                      -0.102     -0.094     0.215      0.168      -0.287*    -0.251**   0.045      0.026
                                            (0.062)    (0.072)    (0.202)    (0.201)    (0.076)    (0.054)    (0.094)    (0.046)
  father: out of labor force                -0.170     -0.166*    -0.042     -0.059     -0.090     -0.042     -0.147     -0.117
                                            (0.075)    (0.044)    (0.035)    (0.046)    (0.136)    (0.140)    (0.388)    (0.382)
  mother: employed (PT)                     0.058+     0.038*     0.044      0.023      0.053      0.004      0.005      0.001
                                            (0.022)    (0.013)    (0.107)    (0.109)    (0.067)    (0.043)    (0.043)    (0.019)
  mother: UE at moment                      0.029      -0.032     -0.099     0.014      -0.011     -0.021**   0.315      0.351
                                            (0.032)    (0.078)    (0.045)    (0.045)    (0.046)    (0.005)    (0.186)    (0.219)
  mother: out of labor force                0.011      -0.008     0.052      0.061      0.058      -0.022     0.077      0.065+
                                            (0.042)    (0.048)    (0.137)    (0.138)    (0.044)    (0.031)    (0.053)    (0.024)
Constant                                    3.876**    3.719**    5.477***   5.863***   4.021+     4.001+     4.494***   3.746***
                                            (0.729)    (0.599)    (0.443)    (0.428)    (1.701)    (1.430)    (0.446)    (0.272)

Federal States Fixed-Effects                yes        yes        yes        yes        yes        yes        yes        yes
School Fixed Effects                        no         yes        no         yes        no         yes        no         yes
year effects                                no         yes        no         yes        no         yes        no         yes
state effects                               no         yes        no         yes        no         yes        no         yes
Observations                                1,854      1,854      1,159      1,159      2,168      2,168      1,467      1,467
R2                                          0.182      0.258      0.224      0.278      0.184      0.292      0.209      0.268
Adjusted R2                                 0.170      0.234      0.208      0.250      0.174      0.269      0.196      0.242

Notes: This table shows the first-stage OLS regressions used to derive the R2 as IEOp measure for the DID estimation approach, whose results are shown in the first sub-table of Table 8. The dependent variable is stdpvmath2, i.e. standardized PISA Mathematics test scores for each testing year with respect to students in Gymnasium that are part of the representative grade-based German PISA test cohort in the respective test year. Background variables used to derive R2 are those explained in section 5.2. Robust standard errors, clustered at the federal state level, are in parentheses. *** p<0.01, ** p<0.05, * p<0.10, + p<0.15. Population weights associated with each student observation are used in all regressions. Source: Author's calculation based on PISA-I-data 2003, 2006, 2009.

Table A.9: Main Results for Model Base-ST: 1st stage to derive IEOp measure (R2) for Sciences test scores

                                            stdpvscie2 for C3                           stdpvscie2 for T3
                                            Before                After                 Before                After
VARIABLES                                   (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)

IC (i)
  female                                    -0.460**   -0.418**   -0.492**   -0.454**   -0.451**   -0.397***  -0.413**   -0.419**
                                            (0.071)    (0.060)    (0.091)    (0.086)    (0.047)    (0.038)    (0.072)    (0.071)
  age in years (reference: test month May)  -0.199*    -0.219**   -0.309***  -0.320***  -0.210*    -0.182+    -0.311**   -0.319***
                                            (0.054)    (0.046)    (0.018)    (0.019)    (0.060)    (0.068)    (0.041)    (0.029)
IC (ii)
  language spoken at home is NOT German     -0.409***  -0.329***  -0.373***  -0.326**   -0.254     -0.274+    -0.262     -0.261
                                            (0.033)    (0.031)    (0.032)    (0.070)    (0.165)    (0.107)    (0.155)    (0.171)
  student has migration background          -0.143     -0.121     -0.223**   -0.199***  -0.147+    -0.083     -0.230     -0.213+
                                            (0.073)    (0.070)    (0.026)    (0.006)    (0.052)    (0.042)    (0.112)    (0.093)
PC (iii): highest parental education, baseline ISCED 3,4
  at most lower sec. educ. (ISCED 1,2)      -0.423*    -0.368+    -0.132     -0.048     -0.277**   -0.280*    -0.239     -0.229
                                            (0.122)    (0.153)    (0.122)    (0.148)    (0.060)    (0.095)    (0.188)    (0.185)
  tertiary educ. (ISCED 5,6)                0.055      0.058      0.043      0.063      0.040      0.036      0.026      0.014
                                            (0.074)    (0.097)    (0.081)    (0.082)    (0.062)    (0.058)    (0.067)    (0.071)
SES (iv): no. of books at home, baseline 101-500 books
  max 10 books                              -0.074     -0.086     -0.929***  -0.967**   -0.251     -0.304     -0.590**   -0.520**
                                            (0.076)    (0.086)    (0.063)    (0.098)    (0.174)    (0.173)    (0.062)    (0.086)
  11-100 books in household                 -0.292**   -0.259**   -0.285**   -0.251**   -0.327*    -0.315*    -0.271**   -0.222*
                                            (0.051)    (0.058)    (0.043)    (0.056)    (0.092)    (0.083)    (0.053)    (0.065)
  more than 500 books in household          0.098+     0.088      0.256***   0.256**    0.150***   0.157**    0.176*     0.186**
                                            (0.034)    (0.044)    (0.021)    (0.037)    (0.013)    (0.033)    (0.046)    (0.037)
SES (v)
  Parents: highest ISEI of job (incl. mis)  0.006*     0.006*     0.003      0.003+     0.003+     0.003+     0.002      0.002
                                            (0.002)    (0.001)    (0.002)    (0.001)    (0.001)    (0.001)    (0.004)    (0.004)
FC (vi)
  student has single parent (incl. miss)    -0.048     -0.023     0.064      0.053      0.028      0.026      0.226**    0.232*
                                            (0.039)    (0.030)    (0.095)    (0.077)    (0.022)    (0.034)    (0.035)    (0.060)
  info on family structure missing          -0.130**   -0.196*    -0.021     -0.025     -0.307     -0.217**   -0.282     -0.224
                                            (0.028)    (0.058)    (0.197)    (0.153)    (0.137)    (0.036)    (0.182)    (0.179)
FC (vii): employment status, baseline full time employed (FT)
  father: employed (PT)                     -0.152+    -0.114     -0.142     -0.116     -0.108     -0.073     -0.102     -0.070
                                            (0.064)    (0.055)    (0.096)    (0.125)    (0.152)    (0.119)    (0.132)    (0.130)
  father: UE at moment                      0.121      0.117      0.342**    0.313**    -0.018     -0.034     0.105      0.087
                                            (0.141)    (0.146)    (0.061)    (0.050)    (0.147)    (0.127)    (0.189)    (0.143)
  father: out of labor force                -0.082     -0.083     0.088+     0.073      0.006      0.013      0.053      0.067
                                            (0.082)    (0.056)    (0.038)    (0.034)    (0.099)    (0.131)    (0.388)    (0.402)
  mother: employed (PT)                     -0.022     -0.014     -0.088     -0.106     0.048      -0.001     -0.043     -0.036
                                            (0.030)    (0.032)    (0.126)    (0.119)    (0.070)    (0.055)    (0.022)    (0.024)
  mother: UE at moment                      0.013      0.013      -0.104     -0.010     -0.030     -0.014     0.345      0.364
                                            (0.030)    (0.063)    (0.051)    (0.056)    (0.065)    (0.090)    (0.205)    (0.254)
  mother: out of labor force                -0.074     -0.076     -0.126     -0.100     0.013      -0.057+    -0.074     -0.084**
                                            (0.055)    (0.069)    (0.141)    (0.141)    (0.049)    (0.025)    (0.048)    (0.015)
Constant                                    2.973*     3.266**    4.980***   5.462***   3.353*     2.845+     4.868**    4.247**
                                            (0.817)    (0.689)    (0.362)    (0.295)    (1.048)    (1.216)    (0.609)    (0.514)

Federal States Fixed-Effects                yes        yes        yes        yes        yes        yes        yes        yes
School Fixed Effects                        no         yes        no         yes        no         yes        no         yes
year effects                                no         yes        no         yes        no         yes        no         yes
Observations                                1,854      1,854      1,159      1,159      2,168      2,168      1,467      1,467
R2                                          0.175      0.244      0.210      0.255      0.161      0.241      0.182      0.232
Adjusted R2                                 0.163      0.220      0.193      0.227      0.151      0.217      0.168      0.205

Notes: This table shows the first-stage OLS regressions used to derive the R2 as IEOp measure for the DID estimation approach, whose results are shown in the first sub-table of Table 8. The dependent variable is stdpvscie2, i.e. standardized PISA Sciences test scores for each testing year with respect to students in Gymnasium that are part of the representative grade-based German PISA test cohort in the respective test year. Background variables used to derive R2 are those explained in section 5.2. Robust standard errors, clustered at the federal state level, are in parentheses. *** p<0.01, ** p<0.05, * p<0.10, + p<0.15. Population weights associated with each student observation are used in all regressions. Source: Author's calculation based on PISA-I-data 2003, 2006, 2009.

Table A.10: Robustness Checks: Placebo-Tests (2003-2006) - T3 vs. C2/C3/C4 - Mathematics

T3 vs. C2        with School-FE                        R2-adjusted measure with School-FE
Mathematics      C2        T3        Diff. (T3 - C2)   C2        T3        Diff. (T3 - C2)

Before (2003)    0.353     0.225     -0.128            0.249     0.189     -0.059
                 (0.109)   (0.049)   (0.119)           (0.127)   (0.051)   (0.137)
After (2006)     0.362     0.278     -0.084            0.267     0.250     -0.017
                 (0.054)   (0.048)   (0.072)           (0.062)   (0.049)   (0.079)

Change in R2     0.009     0.053     0.044             0.018     0.060     0.042
                 (0.122)   (0.068)   (0.139)           (0.141)   (0.071)   (0.158)

T3 vs. C3        with School-FE                        R2-adjusted measure with School-FE
Mathematics      C3        T3        Diff. (T3 - C3)   C3        T3        Diff. (T3 - C3)

Before (2003)    0.200     0.225     0.025             0.162     0.189     0.027
                 (0.050)   (0.049)   (0.070)           (0.053)   (0.051)   (0.073)
After (2006)     0.245     0.278     0.034             0.214     0.250     0.036
                 (0.035)   (0.048)   (0.059)           (0.037)   (0.049)   (0.062)

Change in R2     0.044     0.053     0.009             0.051     0.060     0.009
                 (0.061)   (0.068)   (0.092)           (0.064)   (0.071)   (0.096)

T3 vs. C4        with School-FE                        R2-adjusted measure with School-FE
Mathematics      C4        T3        Diff. (T3 - C4)   C4        T3        Diff. (T3 - C4)

Before (2003)    0.219     0.225     0.007             0.185     0.189     0.004
                 (0.042)   (0.049)   (0.064)           (0.043)   (0.051)   (0.067)
After (2006)     0.268     0.278     0.010             0.241     0.250     0.009
                 (0.043)   (0.048)   (0.064)           (0.045)   (0.049)   (0.067)

Change in R2     0.049     0.053     0.004             0.056     0.060     0.005
                 (0.060)   (0.068)   (0.091)           (0.063)   (0.071)   (0.095)

Notes: Table entries are R2 measures of IEOp (Equation (7)). Due to space constraints, only placebo-test results for Mathematics test scores are shown; as shown in the main results, these scores tend to be a good proxy for the Reading and Sciences scores. Robust standard errors are in parentheses and were calculated using replication weights following the method explained in Appendix A.2.3, clustering at the level of federal states. DID results are estimated according to equation (13), taking into account population weights and school fixed effects. Positive changes in R2 indicate increasing IEOp (decreasing EEOp), and vice versa for negative changes.
Background variables used to derive R2:
(i) individual characteristics (IC) I: age and gender
(ii) individual characteristics (IC) II: language spoken at home and migration background (based on (parental) birth place)
(iii) parental characteristics (PC): highest parental education level (ISCED-level 1-2/ISCED-level 3-4/ISCED-level 5-6)
(iv) socio-economic status (SES) I: number of books in household (max. 10, 11-100, 101-500, more than 500)
(v) socio-economic status (SES) II: highest ISEI-level index [0-90] of job in the family
(vi) family characteristics (FC) I: family structure (student living in a single-parent household)
(vii) family characteristics (FC) II: mother/father working part-time (PT), mother/father unemployed (UE), mother/father out of labor force (OLF)
Due to space constraints, first-step regressions for T3 vs. C2/C3/C4 have been omitted, but they remain available upon request from the author. Source: Author's calculation based on PISA-I-data 2003, 2006, 2009.
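The "Diff." entries in the "Change in R2" row of Table A.10 follow the usual difference-in-differences arithmetic: the change in R2 for the treatment group minus the change for the control group. The sketch below recomputes them from the rounded R2 values printed in the left-hand panels (with School-FE). The function name `did_r2` is illustrative and not taken from the paper, and the sketch does not reproduce the replication-weight standard errors.

```python
def did_r2(treat_before, treat_after, ctrl_before, ctrl_after):
    """Difference-in-differences of an R2-based IEOp measure."""
    return (treat_after - treat_before) - (ctrl_after - ctrl_before)

# Rounded R2 values with School-FE from Table A.10 as (before 2003, after 2006).
t3 = (0.225, 0.278)
controls = {"C2": (0.353, 0.362), "C3": (0.200, 0.245), "C4": (0.219, 0.268)}

estimates = {
    name: did_r2(t3[0], t3[1], before, after)
    for name, (before, after) in controls.items()
}

for name, value in estimates.items():
    print(f"T3 vs. {name}: DID in R2 = {value:.3f}")
```

Because the inputs are rounded to three decimals, a recomputed value can differ from the printed table entry in the last digit (for C3 the rounded inputs give 0.008 against a printed 0.009).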

Table A.11: Difference-in-Difference Results: Overview - Model Base-MT - Control-Group C2

(1)       (2)        (3)      (4)      (5)          (6)          (7)                (8)          (9)
Outcome   Treatment  Control  Model    Control-set  R2-DD-BL     R2adjusted-DD_BL   R2-DD_SF     R2adjusted-DD_SF
read      T3         C2       Base-MT  1            0.0486019    0.0418389          0.1026374    0.0986496
read      T3         C2       Base-MT  2            0.0508262    0.0413408          0.10818      0.1020977
read      T3         C2       Base-MT  3            0.0492137    0.0362789          0.1089052    0.1000562
read      T3         C2       Base-MT  4            0.0573307    0.0453275          0.116741     0.1091316
read      T3         C2       Base-MT  5            0.0347353    0.0203472          0.1036062    0.0938706
read      T3         C2       Base-MT  6            0.0734411    0.053641           0.1333018    0.1196875
read      T5         C2       Base-MT  1            0.0552201    0.0485851          0.1045078    0.1004037
read      T5         C2       Base-MT  2            0.053747     0.0443369          0.1076478    0.1013548
read      T5         C2       Base-MT  3            0.0515521    0.0386309          0.1092806    0.1001985
read      T5         C2       Base-MT  4            0.0594724    0.047456           0.1171495    0.1092804
read      T5         C2       Base-MT  5            0.0391092    0.0246972          0.1052288    0.0952423
read      T5         C2       Base-MT  6            0.0775819    0.0576589          0.134443     0.1204617
read      T7         C2       Base-MT  1            0.0294624    0.0224113          0.075922     0.070706
read      T7         C2       Base-MT  2            0.0205508    0.0105245          0.0749819    0.06732
read      T7         C2       Base-MT  3            0.0195176    0.0057722          0.0762089    0.065553
read      T7         C2       Base-MT  4            0.026705     0.0137465          0.0832847    0.0737119
read      T7         C2       Base-MT  5            0.0044039    -0.011066          0.071847     0.0600673
read      T7         C2       Base-MT  6            0.0440509    0.0226172          0.1031965    0.087029
math      T3         C2       Base-MT  1            0.1374442    0.1326837          0.1173572    0.1150981
math      T3         C2       Base-MT  2            0.1522048    0.1457725          0.1318355    0.1284723
math      T3         C2       Base-MT  3            0.1525054    0.1440416          0.1334687    0.1284267
math      T3         C2       Base-MT  4            0.1627268    0.155587           0.1438371    0.1403698
math      T3         C2       Base-MT  5            0.1517936    0.1433778          0.1421455    0.1376176
math      T3         C2       Base-MT  6            0.1613632    0.1488317          0.1465036    0.1384381
math      T5         C2       Base-MT  1            0.1485623    0.1439196          0.126169     0.1239403
math      T5         C2       Base-MT  2            0.1587099    0.1523483          0.1377001    0.1342778
math      T5         C2       Base-MT  3            0.1573647    0.148909           0.1397943    0.1346712
math      T5         C2       Base-MT  4            0.1678983    0.1607435          0.1502135    0.1466463
math      T5         C2       Base-MT  5            0.1596974    0.1512595          0.1496798    0.1450618
math      T5         C2       Base-MT  6            0.168932     0.1562866          0.1531902    0.1449399
math      T7         C2       Base-MT  1            0.1225039    0.1174748          0.1161204    0.1132698
math      T7         C2       Base-MT  2            0.1265655    0.1196356          0.1252417    0.1210203
math      T7         C2       Base-MT  3            0.1239317    0.1146941          0.1247105    0.1185743
math      T7         C2       Base-MT  4            0.1338416    0.1257928          0.1343879    0.1297064
math      T7         C2       Base-MT  5            0.1232632    0.1138137          0.1334823    0.1276546
math      T7         C2       Base-MT  6            0.1350299    0.1209538          0.1401597    0.1304406
science   T3         C2       Base-MT  1            0.1478232    0.1432816          0.1925849    0.1923788
science   T3         C2       Base-MT  2            0.1586137    0.1526844          0.1937428    0.1924779
science   T3         C2       Base-MT  3            0.1603698    0.1524458          0.188899     0.1859919
science   T3         C2       Base-MT  4            0.1743826    0.1680937          0.201756     0.2008021
science   T3         C2       Base-MT  5            0.1563525    0.1485939          0.1925775    0.1904328
science   T3         C2       Base-MT  6            0.1814693    0.1706351          0.2123226    0.2083999
science   T5         C2       Base-MT  1            0.148146     0.1436911          0.1885749    0.1881179
science   T5         C2       Base-MT  2            0.1568542    0.1509604          0.1887328    0.1871525
science   T5         C2       Base-MT  3            0.156604     0.1486444          0.1836784    0.1803993
science   T5         C2       Base-MT  4            0.1703102    0.1639529          0.1961935    0.1948287
science   T5         C2       Base-MT  5            0.1541647    0.14632            0.1878349    0.1852712
science   T5         C2       Base-MT  6            0.178593     0.1675509          0.2065061    0.2020138
science   T7         C2       Base-MT  1            0.1289499    0.1241082          0.1688373    0.1674972
science   T7         C2       Base-MT  2            0.1311111    0.1246503          0.1658215    0.1631425
science   T7         C2       Base-MT  3            0.131187     0.122473           0.1598517    0.1552819
science   T7         C2       Base-MT  4            0.1426166    0.1353887          0.1703004    0.167499
science   T7         C2       Base-MT  5            0.1249573    0.1161332          0.1617619    0.1576607
science   T7         C2       Base-MT  6            0.1522492    0.1398239          0.1838509    0.1775085

Notes: This table shows T3/T5/T7 vs. C2 in Model Base-MT for all 3 test score domains and for each version adding all 6 control sets from 1 = [(i) + (ii)] until 6 = [(i) + (ii) + (iii) + (iv) + (v) + (vi) + (vii)] (compare also Table 6 and section 6.1). Note that column (6) shows the DID result with federal states FE; (7) shows the same but using adjusted R2. Column (8) shows the DID result with school FE; (9) shows the same but using adjusted R2.
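The control-set numbering in these notes can be read as a cumulative scheme: set 1 contains blocks (i) and (ii), and each later set adds the next block until set 6 contains all seven. The sketch below encodes that reading. The intermediate sets 2 to 5 are an assumption inferred from the two endpoints stated in the notes, and the names `BLOCKS` and `control_sets` are purely illustrative.

```python
# Background-variable blocks (i)-(vii) as labelled in the table notes.
BLOCKS = ["i", "ii", "iii", "iv", "v", "vi", "vii"]

def control_sets(blocks=BLOCKS):
    """Map control-set numbers 1..6 to cumulative lists of blocks.

    Assumption: set k contains the first k + 1 blocks, so that
    set 1 = (i) + (ii) and set 6 = (i) + ... + (vii).
    """
    return {k: blocks[: k + 1] for k in range(1, len(blocks))}

sets = control_sets()
for number, members in sets.items():
    print(number, " + ".join(f"({m})" for m in members))
```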

Table A.12: Difference-in-Difference Results: Overview - Model Base-ST - Control-Group C2

(1)       (2)        (3)      (4)      (5)          (6)          (7)                (8)          (9)
Outcome   Treatment  Control  Model    Control-set  R2-DD-BL     R2adjusted-DD_BL   R2-DD_SF     R2adjusted-DD_SF
read      T3         C2       Base-ST  1            0.115189     0.114791           0.111681     0.111725
read      T3         C2       Base-ST  2            0.121095     0.121916           0.118177     0.119875
read      T3         C2       Base-ST  3            0.117737     0.12046            0.114672     0.11858
read      T3         C2       Base-ST  4            0.121297     0.125021           0.119677     0.124813
read      T3         C2       Base-ST  5            0.103633     0.10752            0.111402     0.117025
read      T3         C2       Base-ST  6            0.13003      0.139194           0.131452     0.143227
read      T5         C2       Base-ST  1            0.10786      0.107649           0.104675     0.104539
read      T5         C2       Base-ST  2            0.1121       0.113107           0.110076     0.111575
read      T5         C2       Base-ST  3            0.107721     0.110617           0.106849     0.110579
read      T5         C2       Base-ST  4            0.11114      0.114999           0.11136      0.116273
read      T5         C2       Base-ST  5            0.095784     0.099826           0.104421     0.109862
read      T5         C2       Base-ST  6            0.121562     0.130887           0.123486     0.13507
read      T7         C2       Base-ST  1            0.074232     0.073239           0.095674     0.094884
read      T7         C2       Base-ST  2            0.073427     0.073553           0.097302     0.098033
read      T7         C2       Base-ST  3            0.072164     0.074112           0.093102     0.096023
read      T7         C2       Base-ST  4            0.075347     0.07814            0.097446     0.101448
read      T7         C2       Base-ST  5            0.056112     0.059006           0.089524     0.094007
read      T7         C2       Base-ST  6            0.082682     0.090642           0.110492     0.121088
math      T3         C2       Base-ST  1            0.051357     0.050464           0.031523     0.030167
math      T3         C2       Base-ST  2            0.073721     0.073815           0.053254     0.053582
math      T3         C2       Base-ST  3            0.086096     0.087854           0.060405     0.062834
math      T3         C2       Base-ST  4            0.085608     0.088037           0.063293     0.066655
math      T3         C2       Base-ST  5            0.0792       0.082129           0.066116     0.070348
math      T3         C2       Base-ST  6            0.067613     0.072198           0.050883     0.057212
math      T5         C2       Base-ST  1            0.042805     0.042077           0.024931     0.023404
math      T5         C2       Base-ST  2            0.062891     0.063151           0.04483      0.044959
math      T5         C2       Base-ST  3            0.073168     0.075077           0.051544     0.053782
math      T5         C2       Base-ST  4            0.072468     0.075007           0.053327     0.056438
math      T5         C2       Base-ST  5            0.070316     0.073397           0.058665     0.062723
math      T5         C2       Base-ST  6            0.059527     0.0643             0.043525     0.049722
math      T7         C2       Base-ST  1            0.013281     0.011798           0.037425     0.035791
math      T7         C2       Base-ST  2            0.02857      0.027986           0.053992     0.053944
math      T7         C2       Base-ST  3            0.039033     0.040013           0.057108     0.0591
math      T7         C2       Base-ST  4            0.038951     0.040461           0.058802     0.061596
math      T7         C2       Base-ST  5            0.032046     0.034009           0.061858     0.065531
math      T7         C2       Base-ST  6            0.023221     0.026675           0.048596     0.054453
science   T3         C2       Base-ST  1            0.097082     0.096825           0.10027      0.100811
science   T3         C2       Base-ST  2            0.101857     0.102772           0.102155     0.104122
science   T3         C2       Base-ST  3            0.093722     0.096029           0.098357     0.102221
science   T3         C2       Base-ST  4            0.098091     0.10139            0.106631     0.111915
science   T3         C2       Base-ST  5            0.08564      0.089083           0.102506     0.108355
science   T3         C2       Base-ST  6            0.098285     0.105394           0.113678     0.124412
science   T5         C2       Base-ST  1            0.089691     0.089613           0.093076     0.093426
science   T5         C2       Base-ST  2            0.094175     0.095279           0.095024     0.096815
science   T5         C2       Base-ST  3            0.083815     0.086304           0.09062      0.094315
science   T5         C2       Base-ST  4            0.087784     0.091224           0.098081     0.103143
science   T5         C2       Base-ST  5            0.078259     0.081878           0.096034     0.101724
science   T5         C2       Base-ST  6            0.091031     0.098355           0.106575     0.117172
science   T7         C2       Base-ST  1            0.065456     0.064626           0.087105     0.086897
science   T7         C2       Base-ST  2            0.063994     0.064257           0.085631     0.08676
science   T7         C2       Base-ST  3            0.057436     0.059037           0.080508     0.083504
science   T7         C2       Base-ST  4            0.059844     0.062283           0.086441     0.09068
science   T7         C2       Base-ST  5            0.047085     0.04963            0.083111     0.087918
science   T7         C2       Base-ST  6            0.061208     0.067272           0.095825     0.105506

Notes: This table shows T3/T5/T7 vs. C2 in Model Base-ST for all 3 test score domains and for each version adding all 6 control sets from 1 = [(i) + (ii)] until 6 = [(i) + (ii) + (iii) + (iv) + (v) + (vi) + (vii)] (compare also Table 7 and section 6.1). Note that column (6) shows the DID result with federal states FE; (7) shows the same but using adjusted R2. Column (8) shows the DID result with school FE; (9) shows the same but using adjusted R2.

Table A.13: Difference-in-Difference Results: Overview - Model Base-ST - Control-Group C3

(1)       (2)        (3)      (4)      (5)          (6)          (7)                (8)          (9)
Outcome   Treatment  Control  Model    Control-set  R2-DD-BL     R2adjusted-DD_BL   R2-DD_SF     R2adjusted-DD_SF
read      T3         C3       Base-ST  1            0.010973     0.011505           0.016437     0.018523
read      T3         C3       Base-ST  2            0.01901      0.019878           0.024988     0.027575
read      T3         C3       Base-ST  3            -0.00342     -0.00234           -0.0023      -0.000051
read      T3         C3       Base-ST  4            -0.00295     -0.00186           -0.00091     0.001399
read      T3         C3       Base-ST  5            -0.00433     -0.00303           0.002639     0.005235
read      T3         C3       Base-ST  6            -0.00354     -0.00154           0.000885     0.004162
read      T5         C3       Base-ST  1            0.003644     0.004363           0.009431     0.011337
read      T5         C3       Base-ST  2            0.010015     0.011069           0.016887     0.019275
read      T5         C3       Base-ST  3            -0.01344     -0.01218           -0.01012     -0.00805
read      T5         C3       Base-ST  4            -0.01311     -0.01188           -0.00922     -0.00714
read      T5         C3       Base-ST  5            -0.01218     -0.01073           -0.00434     -0.00193
read      T5         C3       Base-ST  6            -0.01201     -0.00985           -0.00708     -0.004
read      T7         C3       Base-ST  1            -0.02998     -0.03005           0.00043      0.001682
read      T7         C3       Base-ST  2            -0.02866     -0.02848           0.004113     0.005734
read      T7         C3       Base-ST  3            -0.04899     -0.04869           -0.02387     -0.02261
read      T7         C3       Base-ST  4            -0.0489      -0.04874           -0.02314     -0.02197
read      T7         C3       Base-ST  5            -0.05185     -0.05155           -0.01924     -0.01778
read      T7         C3       Base-ST  6            -0.05089     -0.05009           -0.02008     -0.01798
math      T3         C3       Base-ST  1            -0.01862     -0.01826           -0.06333     -0.06315
math      T3         C3       Base-ST  2            -0.00817     -0.00754           -0.0525      -0.05193
math      T3         C3       Base-ST  3            -0.01696     -0.01612           -0.06427     -0.06384
math      T3         C3       Base-ST  4            -0.0187      -0.01789           -0.06419     -0.06381
math      T3         C3       Base-ST  5            -0.02194     -0.021             -0.05933     -0.05871
math      T3         C3       Base-ST  6            -0.02162     -0.02015           -0.05894     -0.05793
math      T5         C3       Base-ST  1            -0.02717     -0.02665           -0.06992     -0.06991
math      T5         C3       Base-ST  2            -0.019       -0.0182            -0.06092     -0.06055
math      T5         C3       Base-ST  3            -0.02989     -0.0289            -0.07313     -0.07289
math      T5         C3       Base-ST  4            -0.03184     -0.03092           -0.07416     -0.07402
math      T5         C3       Base-ST  5            -0.03082     -0.02973           -0.06678     -0.06634
math      T5         C3       Base-ST  6            -0.02971     -0.02805           -0.0663      -0.06542
math      T7         C3       Base-ST  1            -0.0567      -0.05693           -0.05742     -0.05753
math      T7         C3       Base-ST  2            -0.05332     -0.05337           -0.05176     -0.05157
math      T7         C3       Base-ST  3            -0.06402     -0.06396           -0.06756     -0.06757
math      T7         C3       Base-ST  4            -0.06536     -0.06546           -0.06868     -0.06887
math      T7         C3       Base-ST  5            -0.06909     -0.06912           -0.06359     -0.06353
math      T7         C3       Base-ST  6            -0.06601     -0.06568           -0.06123     -0.06069
science   T3         C3       Base-ST  1            -0.00581     -0.00538           -0.00808     -0.00657
science   T3         C3       Base-ST  2            0.003451     0.004171           0.000448     0.002385
science   T3         C3       Base-ST  3            -0.02413     -0.02336           -0.03234     -0.03102
science   T3         C3       Base-ST  4            -0.02222     -0.02144           -0.02975     -0.02836
science   T3         C3       Base-ST  5            -0.02045     -0.01951           -0.02322     -0.02152
science   T3         C3       Base-ST  6            -0.01694     -0.01543           -0.02213     -0.01987
science   T5         C3       Base-ST  1            -0.0132      -0.01259           -0.01528     -0.01396
science   T5         C3       Base-ST  2            -0.00423     -0.00332           -0.00668     -0.00492
science   T5         C3       Base-ST  3            -0.03403     -0.03309           -0.04008     -0.03892
science   T5         C3       Base-ST  4            -0.03253     -0.03161           -0.0383      -0.03713
science   T5         C3       Base-ST  5            -0.02783     -0.02671           -0.02969     -0.02815
science   T5         C3       Base-ST  6            -0.0242      -0.02247           -0.02924     -0.02711
science   T7         C3       Base-ST  1            -0.03743     -0.03757           -0.02125     -0.02049
science   T7         C3       Base-ST  2            -0.03441     -0.03434           -0.01608     -0.01498
science   T7         C3       Base-ST  3            -0.06041     -0.06036           -0.05019     -0.04973
science   T7         C3       Base-ST  4            -0.06047     -0.06055           -0.04994     -0.0496
science   T7         C3       Base-ST  5            -0.059       -0.05896           -0.04262     -0.04196
science   T7         C3       Base-ST  6            -0.05402     -0.05355           -0.03999     -0.03878

Notes: This table shows T3/T5/T7 vs. C3 in Model Base-ST for all 3 test score domains and for each version adding all 6 control sets from 1 = [(i) + (ii)] until 6 = [(i) + (ii) + (iii) + (iv) + (v) + (vi) + (vii)] (compare also Table 8 and section 6.1). Note that column (6) shows the DID result with federal states FE; (7) shows the same but using adjusted R2. Column (8) shows the DID result with school FE; (9) shows the same but using adjusted R2.

Table A.14: Difference-in-Difference Results: Overview - Model Base-ST - Control-Group C4

(1)       (2)        (3)      (4)      (5)          (6)          (7)                (8)          (9)
Outcome   Treatment  Control  Model    Control-set  R2-DD-BL     R2adjusted-DD_BL   R2-DD_SF     R2adjusted-DD_SF
read      T3         C4       Base-ST  1            0.014695     0.015222           0.02057      0.022444
read      T3         C4       Base-ST  2            0.034424     0.035185           0.039258     0.041688
read      T3         C4       Base-ST  3            0.010918     0.01172            0.017893     0.019953
read      T3         C4       Base-ST  4            0.011301     0.012165           0.018712     0.020863
read      T3         C4       Base-ST  5            0.011957     0.01294            0.022806     0.02516
read      T3         C4       Base-ST  6            0.011061     0.012372           0.019702     0.022318
read      T5         C4       Base-ST  1            0.007366     0.008079           0.013564     0.015258
read      T5         C4       Base-ST  2            0.025429     0.026377           0.031157     0.033388
read      T5         C4       Base-ST  3            0.000903     0.001877           0.01007      0.011952
read      T5         C4       Base-ST  4            0.001145     0.002144           0.010395     0.012323
read      T5         C4       Base-ST  5            0.004108     0.005246           0.015825     0.017997
read      T5         C4       Base-ST  6            0.002593     0.004064           0.011736     0.01416
read      T7         C4       Base-ST  1            -0.02626     -0.02633           0.004563     0.005603
read      T7         C4       Base-ST  2            -0.01324     -0.01318           0.018383     0.019846
read      T7         C4       Base-ST  3            -0.03465     -0.03463           -0.00368     -0.0026
read      T7         C4       Base-ST  4            -0.03465     -0.03472           -0.00352     -0.0025
read      T7         C4       Base-ST  5            -0.03556     -0.03557           0.000928     0.002142
read      T7         C4       Base-ST  6            -0.03629     -0.03618           -0.00126     0.000179
math      T3         C4       Base-ST  1            -0.02231     -0.02197           -0.02839     -0.02774
math      T3         C4       Base-ST  2            -0.00217     -0.00166           -0.01059     -0.00948
math      T3         C4       Base-ST  3            -0.00737     -0.00679           -0.01748     -0.01648
math      T3         C4       Base-ST  4            -0.00951     -0.0089            -0.01836     -0.01735
math      T3         C4       Base-ST  5            -0.00795     -0.00727           -0.01207     -0.01085
math      T3         C4       Base-ST  6            -0.01357     -0.01278           -0.01725     -0.01602
math      T5         C4       Base-ST  1            -0.03086     -0.03035           -0.03499     -0.0345
math      T5         C4       Base-ST  2            -0.013       -0.01233           -0.01901     -0.01811
math      T5         C4       Base-ST  3            -0.0203      -0.01957           -0.02634     -0.02554
math      T5         C4       Base-ST  4            -0.02265     -0.02193           -0.02832     -0.02757
math      T5         C4       Base-ST  5            -0.01683     -0.016             -0.01952     -0.01848
math      T5         C4       Base-ST  6            -0.02166     -0.02067           -0.0246      -0.02351
math      T7         C4       Base-ST  1            -0.06038     -0.06063           -0.02249     -0.02211
math      T7         C4       Base-ST  2            -0.04732     -0.04749           -0.00985     -0.00912
math      T7         C4       Base-ST  3            -0.05443     -0.05463           -0.02078     -0.02022
math      T7         C4       Base-ST  4            -0.05616     -0.05648           -0.02285     -0.02241
math      T7         C4       Base-ST  5            -0.0551      -0.05539           -0.01633     -0.01567
math      T7         C4       Base-ST  6            -0.05796     -0.0583            -0.01953     -0.01878
science   T3         C4       Base-ST  1            -0.00145     -0.00101           0.00946      0.011064
science   T3         C4       Base-ST  2            0.016339     0.016956           0.024027     0.026056
science   T3         C4       Base-ST  3            -0.00047     0.000147           0.002207     0.003794
science   T3         C4       Base-ST  4            0.001246     0.001935           0.003558     0.005233
science   T3         C4       Base-ST  5            0.006424     0.007236           0.011134     0.013081
science   T3         C4       Base-ST  6            0.00672      0.007789           0.010526     0.012724
science   T5         C4       Base-ST  1            -0.00884     -0.00822           0.002266     0.003679
science   T5         C4       Base-ST  2            0.008657     0.009463           0.016896     0.018749
science   T5         C4       Base-ST  3            -0.01038     -0.00958           -0.00553     -0.00411
science   T5         C4       Base-ST  4            -0.00906     -0.00823           -0.00499     -0.00354
science   T5         C4       Base-ST  5            -0.00096     0.0000305          0.004661     0.00645
science   T5         C4       Base-ST  6            -0.00053     0.00075            0.003423     0.005484
science   T7         C4       Base-ST  1            -0.03307     -0.03321           -0.0037      -0.00285
science   T7         C4       Base-ST  2            -0.02152     -0.02156           0.007504     0.008694
science   T7         C4       Base-ST  3            -0.03676     -0.03684           -0.01564     -0.01492
science   T7         C4       Base-ST  4            -0.037       -0.03717           -0.01663     -0.016
science   T7         C4       Base-ST  5            -0.03213     -0.03222           -0.00826     -0.00736
science   T7         C4       Base-ST  6            -0.03036     -0.03033           -0.00733     -0.00618

Notes: This table shows T3/T5/T7 vs. C4 in Model Base-ST for all 3 test score domains and for each version adding all 6 control sets from 1 = [(i) + (ii)] until 6 = [(i) + (ii) + (iii) + (iv) + (v) + (vi) + (vii)] (compare also Table 8 and section 6.1). Note that column (6) shows the DID result with federal states FE; (7) shows the same but using adjusted R2. Column (8) shows the DID result with school FE; (9) shows the same but using adjusted R2.

Table A.15: Robustness Check: Overview of DID Results - Model Base-MT - Control-Group C2hyp

(1)      (2)        (3)      (4)      (5)          (6)         (7)               (8)         (9)
Outcome  Treatment  Control  Model    Control-set  R2-DD-BL    R2adjusted-DD_BL  R2-DD_SF    R2adjusted-DD_SF
read     T3         C2h      Base-MT  1            -0.0227491  -0.0146742        0.0601538   0.0706621
read     T3         C2h      Base-MT  2            -0.0279907  -0.0146276        0.0508613   0.0668377
read     T3         C2h      Base-MT  3            -0.0518645  -0.0325074        0.0147625   0.0365238
read     T3         C2h      Base-MT  4            -0.0508516  -0.0279142        0.0162766   0.0419544
read     T3         C2h      Base-MT  5            -0.0534799  -0.0268254        0.0251033   0.0552038
read     T3         C2h      Base-MT  6            -0.0708992  -0.0308939        0.0070143   0.0511753
read     T5         C2h      Base-MT  1            -0.0161309  -0.0079281        0.0620241   0.0724162
read     T5         C2h      Base-MT  2            -0.0250699  -0.0116315        0.0503291   0.0660949
read     T5         C2h      Base-MT  3            -0.0495261  -0.0301554        0.0151379   0.0366661
read     T5         C2h      Base-MT  4            -0.0487099  -0.0257857        0.0166851   0.0421032
read     T5         C2h      Base-MT  5            -0.0491059  -0.0224755        0.0267258   0.0565755
read     T5         C2h      Base-MT  6            -0.0667583  -0.026876         0.0081554   0.0519494
read     T7         C2h      Base-MT  1            -0.0418886  -0.0341019        0.0334383   0.0427186
read     T7         C2h      Base-MT  2            -0.0582661  -0.0454439        0.0176632   0.03206
read     T7         C2h      Base-MT  3            -0.0815606  -0.0630141        -0.0179337  0.0020206
read     T7         C2h      Base-MT  4            -0.0814772  -0.0594952        -0.0171796  0.0065348
read     T7         C2h      Base-MT  5            -0.0838113  -0.0582387        -0.006656   0.0214004
read     T7         C2h      Base-MT  6            -0.1002893  -0.0619177        -0.0230911  0.0185168
math     T3         C2h      Base-MT  1            -0.0383542  -0.030447         -0.113262   -0.1077581
math     T3         C2h      Base-MT  2            -0.0382803  -0.0250819        -0.1175293  -0.1073031
math     T3         C2h      Base-MT  3            -0.0630841  -0.0441709        -0.135462   -0.1199216
math     T3         C2h      Base-MT  4            -0.0569075  -0.0343625        -0.1259881  -0.1067127
math     T3         C2h      Base-MT  5            -0.0340091  -0.0070513        -0.0894577  -0.065028
math     T3         C2h      Base-MT  6            -0.040216   0.0008469         -0.0979307  -0.0605138
math     T5         C2h      Base-MT  1            -0.0272361  -0.0192111        -0.1044502  -0.098916
math     T5         C2h      Base-MT  2            -0.0317752  -0.0185061        -0.1116647  -0.1014976
math     T5         C2h      Base-MT  3            -0.0582248  -0.0393035        -0.1291364  -0.1136771
math     T5         C2h      Base-MT  4            -0.051736   -0.029206         -0.1196117  -0.1004362
math     T5         C2h      Base-MT  5            -0.0261053  0.0008304         -0.0819234  -0.0575838
math     T5         C2h      Base-MT  6            -0.0326471  0.0083018         -0.0912441  -0.054012
math     T7         C2h      Base-MT  1            -0.0532945  -0.0456559        -0.1144988  -0.1095864
math     T7         C2h      Base-MT  2            -0.0639195  -0.0512189        -0.124123   -0.1147551
math     T7         C2h      Base-MT  3            -0.0916578  -0.0735185        -0.1442201  -0.129774
math     T7         C2h      Base-MT  4            -0.0857927  -0.0641568        -0.1354374  -0.1173761
math     T7         C2h      Base-MT  5            -0.0625395  -0.0366154        -0.0981209  -0.074991
math     T7         C2h      Base-MT  6            -0.0665493  -0.027031         -0.1042747  -0.0685113
science  T3         C2h      Base-MT  1            -0.0503647  -0.0418693        -0.0230783  -0.014873
science  T3         C2h      Base-MT  2            -0.0663383  -0.0525347        -0.0438829  -0.0311266
science  T3         C2h      Base-MT  3            -0.0790938  -0.0596141        -0.0769183  -0.0596133
science  T3         C2h      Base-MT  4            -0.0766676  -0.0537081        -0.0709693  -0.0500218
science  T3         C2h      Base-MT  5            -0.0712979  -0.0444684        -0.0517354  -0.0264631
science  T3         C2h      Base-MT  6            -0.0730955  -0.0321738        -0.0509291  -0.011915
science  T5         C2h      Base-MT  1            -0.0500419  -0.0414598        -0.0270883  -0.0191339
science  T5         C2h      Base-MT  2            -0.0680978  -0.0542588        -0.0488929  -0.036452
science  T5         C2h      Base-MT  3            -0.0828596  -0.0634155        -0.0821389  -0.0652059
science  T5         C2h      Base-MT  4            -0.08074    -0.0578488        -0.0765318  -0.0559953
science  T5         C2h      Base-MT  5            -0.0734857  -0.0467423        -0.0564781  -0.0316247
science  T5         C2h      Base-MT  6            -0.0759718  -0.035258         -0.0567455  -0.0183011
science  T7         C2h      Base-MT  1            -0.069238   -0.0610427        -0.0468258  -0.0397547
science  T7         C2h      Base-MT  2            -0.0938409  -0.0805688        -0.0718042  -0.060462
science  T7         C2h      Base-MT  3            -0.1082766  -0.0895869        -0.1059656  -0.0903233
science  T7         C2h      Base-MT  4            -0.1084335  -0.086413         -0.1024249  -0.083325
science  T7         C2h      Base-MT  5            -0.1026931  -0.0769291        -0.082551   -0.0592352
science  T7         C2h      Base-MT  6            -0.1023156  -0.062985         -0.0794007  -0.0428064

Notes: This table shows T3/T5/T7 vs. C2hyp in Model Base-MT for all 3 test score domains and for each version adding all 6 control sets from 1 = [(i) + (ii)] until 6 = [(i) + (ii) + (iii) + (iv) + (v) + (vi) + (vii)] (compare also section 6.1). Note that column (6) shows the DID result with federal state FE, (7) shows the same but using adjusted R2. Column (8) shows the DID result with school FE, (9) shows the same but using adjusted R2.

Table A.16: Robustness Check: Overview of DID Results - Model Base-ST - Control-Group C2hyp

(1)      (2)        (3)      (4)      (5)          (6)         (7)               (8)         (9)
Outcome  Treatment  Control  Model    Control-set  R2-DD-BL    R2adjusted-DD_BL  R2-DD_SF    R2adjusted-DD_SF
read     T3         C2h      Base-ST  1            -0.13167    -0.10788          -0.10471    -0.08823
read     T3         C2h      Base-ST  2            -0.1349     -0.09595          -0.101      -0.06869
read     T3         C2h      Base-ST  3            -0.18792    -0.13463          -0.16528    -0.11825
read     T3         C2h      Base-ST  4            -0.18752    -0.13006          -0.16377    -0.11234
read     T3         C2h      Base-ST  5            -0.1899     -0.12251          -0.15552    -0.09344
read     T3         C2h      Base-ST  6            -0.2207     -0.11429          -0.18673    -0.08511
read     T5         C2h      Base-ST  1            -0.139      -0.11502          -0.11172    -0.09541
read     T5         C2h      Base-ST  2            -0.1439     -0.10476          -0.1091     -0.07699
read     T5         C2h      Base-ST  3            -0.19793    -0.14447          -0.17311    -0.12625
read     T5         C2h      Base-ST  4            -0.19768    -0.14008          -0.17209    -0.12088
read     T5         C2h      Base-ST  5            -0.19775    -0.13021          -0.1625     -0.1006
read     T5         C2h      Base-ST  6            -0.22917    -0.12259          -0.19469    -0.09327
read     T7         C2h      Base-ST  1            -0.17263    -0.14943          -0.12072    -0.10507
read     T7         C2h      Base-ST  2            -0.18257    -0.14432          -0.12187    -0.09054
read     T7         C2h      Base-ST  3            -0.23349    -0.18098          -0.18685    -0.14081
read     T7         C2h      Base-ST  4            -0.23347    -0.17694          -0.186      -0.1357
read     T7         C2h      Base-ST  5            -0.23742    -0.17103          -0.1774     -0.11646
read     T7         C2h      Base-ST  6            -0.26805    -0.16284          -0.20769    -0.10725
math     T3         C2h      Base-ST  1            -0.04886    -0.02013          -0.12096    -0.10231
math     T3         C2h      Base-ST  2            -0.04189    0.005881          -0.11049    -0.07318
math     T3         C2h      Base-ST  3            -0.06617    0.003817          -0.1406     -0.0817
math     T3         C2h      Base-ST  4            -0.05855    0.017325          -0.13096    -0.06622
math     T3         C2h      Base-ST  5            -0.03955    0.050224          -0.09572    -0.01637
math     T3         C2h      Base-ST  6            -0.0551     0.090802          -0.11763    0.014335
math     T5         C2h      Base-ST  1            -0.05741    -0.02851          -0.12755    -0.10908
math     T5         C2h      Base-ST  2            -0.05272    -0.00478          -0.11891    -0.08181
math     T5         C2h      Base-ST  3            -0.0791     -0.00896          -0.14946    -0.09075
math     T5         C2h      Base-ST  4            -0.0717     0.004295          -0.14092    -0.07644
math     T5         C2h      Base-ST  5            -0.04843    0.041492          -0.10317    -0.024
math     T5         C2h      Base-ST  6            -0.06318    0.082904          -0.12499    0.006846
math     T7         C2h      Base-ST  1            -0.08693    -0.05879          -0.11506    -0.09669
math     T7         C2h      Base-ST  2            -0.08704    -0.03995          -0.10975    -0.07282
math     T7         C2h      Base-ST  3            -0.11324    -0.04402          -0.1439     -0.08543
math     T7         C2h      Base-ST  4            -0.10521    -0.03025          -0.13545    -0.07128
math     T7         C2h      Base-ST  5            -0.0867     0.002105          -0.09997    -0.02119
math     T7         C2h      Base-ST  6            -0.09949    0.045279          -0.11992    0.011576
science  T3         C2h      Base-ST  1            -0.0953     -0.06736          -0.0527     -0.03236
science  T3         C2h      Base-ST  2            -0.10149    -0.0555           -0.05884    -0.02053
science  T3         C2h      Base-ST  3            -0.15535    -0.09345          -0.12988    -0.07546
science  T3         C2h      Base-ST  4            -0.1455     -0.07842          -0.12239    -0.06264
science  T3         C2h      Base-ST  5            -0.1485     -0.07076          -0.10819    -0.0368
science  T3         C2h      Base-ST  6            -0.17429    -0.05204          -0.13762    -0.02242
science  T5         C2h      Base-ST  1            -0.10269    -0.07457          -0.0599     -0.03974
science  T5         C2h      Base-ST  2            -0.10918    -0.063            -0.06597    -0.02784
science  T5         C2h      Base-ST  3            -0.16526    -0.10317          -0.13762    -0.08336
science  T5         C2h      Base-ST  4            -0.15581    -0.08859          -0.13094    -0.07141
science  T5         C2h      Base-ST  5            -0.15588    -0.07797          -0.11466    -0.04343
science  T5         C2h      Base-ST  6            -0.18154    -0.05908          -0.14472    -0.02966
science  T7         C2h      Base-ST  1            -0.12693    -0.09956          -0.06587    -0.04627
science  T7         C2h      Base-ST  2            -0.13936    -0.09402          -0.07536    -0.0379
science  T7         C2h      Base-ST  3            -0.19163    -0.13044          -0.14773    -0.09418
science  T7         C2h      Base-ST  4            -0.18375    -0.11753          -0.14258    -0.08388
science  T7         C2h      Base-ST  5            -0.18705    -0.11021          -0.12758    -0.05724
science  T7         C2h      Base-ST  6            -0.21137    -0.09016          -0.15547    -0.04133

Notes: This table shows T3/T5/T7 vs. C2hyp in Model Base-ST for all 3 test score domains and for each version adding all 6 control sets from 1 = [(i) + (ii)] until 6 = [(i) + (ii) + (iii) + (iv) + (v) + (vi) + (vii)] (compare also section 6.1). Note that column (6) shows the DID result with federal state FE, (7) shows the same but using adjusted R2. Column (8) shows the DID result with school FE, (9) shows the same but using adjusted R2.

Table A.17: Robustness Check: Overview of DID Results - Model Full-MT - Control-Group C2

(1)      (2)        (3)      (4)      (5)          (6)         (7)               (8)         (9)
Outcome  Treatment  Control  Model    Control-set  R2-DD-BL    R2adjusted-DD_BL  R2-DD_SF    R2adjusted-DD_SF
read     T3         C2       Full-MT  1            0.033046    0.036931          0.093264    0.091588
read     T3         C2       Full-MT  2            0.032047    0.038399          0.09349     0.094568
read     T3         C2       Full-MT  3            0.025697    0.035358          0.089567    0.094009
read     T3         C2       Full-MT  4            0.031441    0.042861          0.093703    0.100081
read     T3         C2       Full-MT  5            0.009385    0.022148          0.083888    0.091618
read     T3         C2       Full-MT  6            0.029813    0.049689          0.097794    0.113363
read     T5         C2       Full-MT  1            0.035779    0.039219          0.088497    0.088696
read     T5         C2       Full-MT  2            0.031723    0.037524          0.086996    0.089778
read     T5         C2       Full-MT  3            0.025085    0.034073          0.083867    0.089847
read     T5         C2       Full-MT  4            0.030305    0.040983          0.088291    0.096144
read     T5         C2       Full-MT  5            0.010393    0.022364          0.079617    0.088778
read     T5         C2       Full-MT  6            0.030198    0.049014          0.092885    0.109591
read     T7         C2       Full-MT  1            0.00766     0.010701          0.060214    0.05935
read     T7         C2       Full-MT  2            -0.00179    0.003551          0.055756    0.057333
read     T7         C2       Full-MT  3            -0.00762    0.000854          0.051719    0.056415
read     T7         C2       Full-MT  4            -0.00316    0.006971          0.055108    0.061616
read     T7         C2       Full-MT  5            -0.02463    -0.01324          0.046732    0.054531
read     T7         C2       Full-MT  6            -0.003      0.015122          0.062835    0.078121
math     T3         C2       Full-MT  1            0.051045    0.053245          0.060234    0.045001
math     T3         C2       Full-MT  2            0.061606    0.06553           0.07126     0.05832
math     T3         C2       Full-MT  3            0.050375    0.056285          0.063509    0.052661
math     T3         C2       Full-MT  4            0.060182    0.067332          0.073002    0.06388
math     T3         C2       Full-MT  5            0.049744    0.057805          0.071991    0.063996
math     T3         C2       Full-MT  6            0.057976    0.070598          0.073331    0.070357
math     T5         C2       Full-MT  1            0.061997    0.063882          0.06555     0.054744
math     T5         C2       Full-MT  2            0.067896    0.071386          0.073521    0.0648
math     T5         C2       Full-MT  3            0.055193    0.060522          0.066284    0.059448
math     T5         C2       Full-MT  4            0.064788    0.071281          0.076159    0.070984
math     T5         C2       Full-MT  5            0.056959    0.064306          0.076386    0.072303
math     T5         C2       Full-MT  6            0.064858    0.076467          0.077088    0.077748
math     T7         C2       Full-MT  1            0.030857    0.0323            0.04556     0.034304
math     T7         C2       Full-MT  2            0.031603    0.034538          0.051558    0.042223
math     T7         C2       Full-MT  3            0.019254    0.0239            0.043209    0.035603
math     T7         C2       Full-MT  4            0.027894    0.033629          0.05175     0.045695
math     T7         C2       Full-MT  5            0.017998    0.024511          0.051604    0.046557
math     T7         C2       Full-MT  6            0.027428    0.037931          0.054375    0.053881
science  T3         C2       Full-MT  1            0.055852    0.058481          0.112282    0.10136
science  T3         C2       Full-MT  2            0.06769     0.072282          0.113213    0.104678
science  T3         C2       Full-MT  3            0.071157    0.07811           0.112054    0.106061
science  T3         C2       Full-MT  4            0.078049    0.086378          0.118382    0.114141
science  T3         C2       Full-MT  5            0.064415    0.073681          0.112392    0.109258
science  T3         C2       Full-MT  6            0.085998    0.100831          0.128497    0.131994
science  T5         C2       Full-MT  1            0.057771    0.06004           0.107569    0.101061
science  T5         C2       Full-MT  2            0.067756    0.071878          0.10737     0.103079
science  T5         C2       Full-MT  3            0.069246    0.07559           0.105562    0.103533
science  T5         C2       Full-MT  4            0.075568    0.083212          0.111904    0.111548
science  T5         C2       Full-MT  5            0.063609    0.072126          0.106835    0.107539
science  T5         C2       Full-MT  6            0.084183    0.09796           0.121978    0.128986
science  T7         C2       Full-MT  1            0.032707    0.034542          0.08496     0.078063
science  T7         C2       Full-MT  2            0.037865    0.041455          0.082288    0.077407
science  T7         C2       Full-MT  3            0.037788    0.043486          0.077657    0.074821
science  T7         C2       Full-MT  4            0.042629    0.049558          0.082498    0.081223
science  T7         C2       Full-MT  5            0.029119    0.036853          0.076974    0.076686
science  T7         C2       Full-MT  6            0.052592    0.065355          0.095818    0.101716

Notes: This table shows T3/T5/T7 vs. C2 in Model Full-MT for all 3 test score domains and for each version adding all 6 control sets from 1 = [(i) + (ii)] until 6 = [(i) + (ii) + (iii) + (iv) + (v) + (vi) + (vii)] (compare also section 6.1). Note that column (6) shows the DID result with federal state FE, (7) shows the same but using adjusted R2. Column (8) shows the DID result with school FE, (9) shows the same but using adjusted R2.

Table A.18: Robustness Check: Overview of DID Results - Model Full-MT - Control-Group C2hyp

(1)      (2)        (3)      (4)      (5)          (6)         (7)               (8)         (9)
Outcome  Treatment  Control  Model    Control-set  R2-DD-BL    R2adjusted-DD_BL  R2-DD_SF    R2adjusted-DD_SF
read     T3         C2h      Full-MT  1            -0.03324    -0.02009          0.047471    0.059464
read     T3         C2h      Full-MT  2            -0.03718    -0.01612          0.037116    0.057008
read     T3         C2h      Full-MT  3            -0.05639    -0.02624          0.008853    0.037887
read     T3         C2h      Full-MT  4            -0.05446    -0.01898          0.009533    0.044098
read     T3         C2h      Full-MT  5            -0.05972    -0.01891          0.014818    0.055168
read     T3         C2h      Full-MT  6            -0.08256    -0.02159          -0.00774    0.052991
read     T5         C2h      Full-MT  1            -0.03051    -0.0178           0.042704    0.056572
read     T5         C2h      Full-MT  2            -0.0375     -0.01699          0.030622    0.052218
read     T5         C2h      Full-MT  3            -0.057      -0.02752          0.003154    0.033724
read     T5         C2h      Full-MT  4            -0.05559    -0.02086          0.004122    0.040161
read     T5         C2h      Full-MT  5            -0.05871    -0.0187           0.010547    0.052328
read     T5         C2h      Full-MT  6            -0.08218    -0.02226          -0.01264    0.049219
read     T7         C2h      Full-MT  1            -0.05863    -0.04632          0.014421    0.027226
read     T7         C2h      Full-MT  2            -0.07102    -0.05097          -0.00062    0.019773
read     T7         C2h      Full-MT  3            -0.08971    -0.06074          -0.02899    0.000292
read     T7         C2h      Full-MT  4            -0.08906    -0.05487          -0.02906    0.005633
read     T7         C2h      Full-MT  5            -0.09373    -0.0543           -0.02234    0.018081
read     T7         C2h      Full-MT  6            -0.11537    -0.05615          -0.04269    0.017749
math     T3         C2h      Full-MT  1            -0.03333    -0.02126          -0.05968    -0.06018
math     T3         C2h      Full-MT  2            -0.02709    -0.00752          -0.06073    -0.05412
math     T3         C2h      Full-MT  3            -0.06645    -0.03872          -0.08289    -0.06811
math     T3         C2h      Full-MT  4            -0.05228    -0.01951          -0.07192    -0.05172
math     T3         C2h      Full-MT  5            -0.03878    -0.00079          -0.04753    -0.02096
math     T3         C2h      Full-MT  6            -0.04833    0.00888           -0.05938    -0.01421
math     T5         C2h      Full-MT  1            -0.02237    -0.01062          -0.05436    -0.05044
math     T5         C2h      Full-MT  2            -0.0208     -0.00166          -0.05847    -0.04764
math     T5         C2h      Full-MT  3            -0.06163    -0.03449          -0.08012    -0.06132
math     T5         C2h      Full-MT  4            -0.04767    -0.01556          -0.06877    -0.04461
math     T5         C2h      Full-MT  5            -0.03157    0.005714          -0.04313    -0.01265
math     T5         C2h      Full-MT  6            -0.04144    0.014749          -0.05562    -0.00682
math     T7         C2h      Full-MT  1            -0.05351    -0.04221          -0.07435    -0.07088
math     T7         C2h      Full-MT  2            -0.0571     -0.03851          -0.08043    -0.07022
math     T7         C2h      Full-MT  3            -0.09757    -0.07111          -0.10319    -0.08517
math     T7         C2h      Full-MT  4            -0.08456    -0.05321          -0.09318    -0.0699
math     T7         C2h      Full-MT  5            -0.07053    -0.03408          -0.06791    -0.0384
math     T7         C2h      Full-MT  6            -0.07887    -0.02379          -0.07834    -0.03069
science  T3         C2h      Full-MT  1            -0.01951    -0.00643          -0.0173     -0.01485
science  T3         C2h      Full-MT  2            -0.02011    0.000711          -0.02472    -0.01501
science  T3         C2h      Full-MT  3            -0.05126    -0.02241          -0.05877    -0.04183
science  T3         C2h      Full-MT  4            -0.04469    -0.01097          -0.05306    -0.03102
science  T3         C2h      Full-MT  5            -0.04153    -0.00267          -0.04192    -0.01443
science  T3         C2h      Full-MT  6            -0.04642    0.01207           -0.04686    -0.00046
science  T5         C2h      Full-MT  1            -0.01759    -0.00487          -0.02201    -0.01515
science  T5         C2h      Full-MT  2            -0.02004    0.000306          -0.03056    -0.01661
science  T5         C2h      Full-MT  3            -0.05317    -0.02493          -0.06526    -0.04436
science  T5         C2h      Full-MT  4            -0.04718    -0.01414          -0.05954    -0.03361
science  T5         C2h      Full-MT  5            -0.04234    -0.00422          -0.04748    -0.01615
science  T5         C2h      Full-MT  6            -0.04823    0.009199          -0.05338    -0.00347
science  T7         C2h      Full-MT  1            -0.04265    -0.03037          -0.04462    -0.03815
science  T7         C2h      Full-MT  2            -0.04993    -0.03012          -0.05565    -0.04228
science  T7         C2h      Full-MT  3            -0.08463    -0.05703          -0.09317    -0.07307
science  T7         C2h      Full-MT  4            -0.08011    -0.04779          -0.08894    -0.06394
science  T7         C2h      Full-MT  5            -0.07683    -0.03949          -0.07734    -0.04701
science  T7         C2h      Full-MT  6            -0.07983    -0.02341          -0.07954    -0.03074

Notes: This table shows T3/T5/T7 vs. C2hyp in Model Full-MT for all 3 test score domains and for each version adding all 6 control sets from 1 = [(i) + (ii)] until 6 = [(i) + (ii) + (iii) + (iv) + (v) + (vi) + (vii)] (compare also section 6.1). Note that column (6) shows the DID result with federal state FE, (7) shows the same but using adjusted R2. Column (8) shows the DID result with school FE, (9) shows the same but using adjusted R2.

A.4 Supplementary Figures

Figure A.1: Absolute educational mobility (2012)

Source: Figure taken from OECD (2013b). This figure illustrates absolute educational mobility. It shows the percentage of 25-64 year-old non-students whose educational attainment is higher than (upward mobility), lower than (downward mobility) or the same as (status quo) that of their parents, as measured in 2012 by the OECD.

Figure A.2: Structure of the German educational system

Source: Figure taken from Standing Conference of the Ministers of Education (2016a). This figure illustrates the basic structure of the German educational system. For more details on the German educational system, see Standing Conference of the Ministers of Education (2015).

Figure A.3: Overview of the G-8-Reform across federal states for students tested in PISA (2003-2012)

[Figure: four maps of the German federal states, one per PISA test year (Year 2003, Year 2006, Year 2009, Year 2012), with each federal state marked as Control (G9) or Treatment (G8).]

Source: This figure illustrates, for the 9th graders in Gymnasium tested in each PISA test year (2003, 2006, 2009, 2012), whether they were still taught in a Gymnasium-9-model (blue) or already in a reformed Gymnasium-8-model (red).

Figure A.4: Overview of the Treatment/Control-Group setting for the Medium-Term Model (2003-2012)

[Figure: two maps of the German federal states (left-hand and right-hand panel); legend: T3, T5, T7, C2, C2HYP.]

Source: The left-hand figure shows the main Treatment-Control-Group comparisons for the Medium-Term (2003-2012) model with C2 as the main Control-Group and three typical Treatment-Groups (T3/T5/T7). The right-hand figure shows the same setting including the hypothetical Control-Group C2hyp.

Figure A.5: Overview of the Treatment/Control-Group setting for the Short-Term Model (2003-2009)

[Figure: two maps of the German federal states (left-hand and right-hand panel); legend: T3, T5, T7, C2, C3, C4, C2HYP.]

Source: The left-hand figure shows the main Treatment-Control-Group comparisons for the Short-Term (2003-2009) model with C2 and C3 as the main Control-Groups and three typical Treatment-Groups (T3/T5/T7). The right-hand figure shows the same setting including the hypothetical Control-Group C2hyp.

Figure A.6: Overview of how the G-8-reform was implemented in each federal state

[Figure: timeline chart with one row per federal state: Saxony, Thuringia, Saxony-Anhalt (e), Meckl.-West. Pomerania (e), Saarland, Hamburg, Bavaria (a), Lower-Saxony (a), Baden-Wuerttemberg, Bremen, Berlin (b), Brandenburg (b), North Rhine-Westphalia, Hesse (d), Rhineland-Palatinate (c), Schleswig-Holstein (c). Horizontal axis: years 2000 to 2016, with the PISA test years marked. Legend: G9, Reform-Start, G8/G9 parallel, Double Cohort, G8.]

Notes: This figure illustrates, for each federal state and each school year, whether the graduating cohort in Gymnasium followed a G-8-model or a G-9-model, consisted of the double cohort, or whether both models were available for students to choose from in that federal state (Standing Conference of the Ministers of Education, 2016b). This figure corresponds to Table 1 and in particular to the regulations explained in Table A.2.
Notes for certain federal states:
(a) In Bavaria (BV) and Lower-Saxony (LS), the 5th and 6th grades were moved into the G-8-model in the same school year. This suggests that educational intensity might have been slightly stronger for the then 6th graders, who had to compensate for the shortened school duration over 7 years, than for the then 5th graders, who had 8 years. However, the 9th graders in 2009 in BV and LS were affected by the reform right from the 5th grade onwards.
(b) Berlin (B) and Brandenburg (BB) introduced the G-8-reform from the 7th grade onwards, as secondary school only starts at that grade in these federal states.
(c) RP and SH planned to introduce the G-8-reform in school year 2008/09, to be completed by 2015/16. In the end, both retained the G-9-model but allowed students to choose from that school year onwards. Thus, for the PISA test years considered, they always form part of the Control-Group, since in all available data the 9th graders were still taught in a G-9-model.
(d) Hesse (H) introduced the reform over 3 different school years. It thus becomes part of Control-Group (C4).
(e) MV and ST, as Eastern German federal states, introduced the reform directly from the 9th grade onwards. Thus, the first cohorts were treated more intensively than treated students in federal states that shortened school duration from the 5th grade onwards. For this reason, they are not part of the main Treatment- and Control-Groups in the DID analysis.
Source: Author's representation based on the facts illustrated in Table 1 and the regulations explained in Table A.2.

Figure A.7: Number of weekly instruction hours by school entry cohort

Source: Figure taken from Huebener et al.(2016). This figure illustrates for each federal state the number of weekly instruction hours by school entry cohort (Standing Conference of the Ministers of Education, 2016b).

Figure A.8: G-9-model vs. G-8-model: Average instructional hours per week and by grade

Source: Figure taken from Andrietti (2015), figure 3. This figure illustrates the average instructional hours per week. For more details on the German educational system, see Standing Conference of the Ministers of Education (2013). The dataset used for this figure contains the numbers of hours provided in the Wochenpflichtstunden der Schüler nach Schularten und Ländern. Grundstuden Im Schuljahr 1997/1998 - 2011/2012 series, retrieved from: Standing Conference (1997-2011) Wochenpflichtstunden der Schülerinnen und Schüler - Statistiks 1997 bis 2011.

Figure A.9: Student performance and equity across OECD countries (PISA 2012)

Source: Figure taken from OECD(2013b). This figure illustrates student performance in mathematics in the PISA-2012-test for OECD countries and the relationship to equity as measured by the percentage of variation in performance explained by ESCS.

Figure A.10: ISCED-scale explanation

Source: Figure taken from UNESCO. This figure defines the different ISCED levels in the context of the German education system. In the PISA datasets, the ISCED 1997 scale is used.
