
Strengthening the Research Base that Informs STEM Instructional Improvement Efforts: A Meta-Analysis

Kathleen Lynch Brown University

Heather C. Hill, Kathryn Gonzalez, & Cynthia Pollard Harvard Graduate School of Education

March 2019

Suggested Citation: Lynch, K., Hill, H.C., Gonzalez, K., & Pollard, C. (2019). Strengthening the Research Base that Informs STEM Instructional Improvement Efforts: A Meta-Analysis. Brown University Working Paper.

Correspondence concerning this article can be addressed to Kathleen Lynch at [email protected].

This material is based upon work supported by the National Science Foundation under Grant #1348669. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. The authors would like to thank Patrick Aquino, Wilson Boardman, Jonathan Hamel Sellman, Andrea Humez, Meghan Kelly, Jiwon Lee, Angie Luo, Manish Parmar, and Melanie Rucinski for their assistance with data preparation and analysis.


Abstract

We present results from a meta-analysis of 95 experimental and quasi-experimental preK-12 science, technology, engineering, and mathematics (STEM) professional development and curriculum programs, seeking to understand what content, activities and formats relate to stronger student outcomes. Across rigorously conducted studies, we found an average weighted impact estimate of +0.21 standard deviations. Programs saw stronger outcomes when they helped teachers learn to use curriculum materials; focused on improving teachers' content knowledge, pedagogical content knowledge and/or understanding of how students learn; incorporated summer workshops; and included teacher meetings to troubleshoot and discuss classroom implementation. We discuss implications for policy and practice.

Keywords: Professional development, curriculum, mathematics, science, STEM


Strengthening the Research Base that Informs STEM Instructional Improvement Efforts:

A Meta-Analysis

Instructional improvement efforts constitute a persistent feature of the preK-12 science, technology, engineering, and mathematics (STEM) educational landscape, with a decades-long trail of new curriculum materials and professional development programs aimed at changing how teachers interact with students around content. Such initiatives bring with them significant costs; scholars estimate that districts spend between 1 and 6% of their budgets on professional development (see, e.g., Corcoran, 1995; Miles, Odden, Fermanich & Archibald, 2004; Miller,

Lord & Dorney, 1994), and the market for instructional materials totaled $11.9 billion in 2013

(Cavanagh, 2015). Prior to 2002, scholars rarely rigorously evaluated such instructional improvement programs, often using instead cross-sectional and/or self-report data to identify

“best practices” in professional development and curricular design (e.g., Becker & Park, 2011;

Desimone, 2009; Sparks, 2002). However, following calls in the early 2000s for stronger research into the impact of educational interventions (Confrey & Stohl, 2004; Raudenbush, 2008;

Shavelson & Towne, 2001), federal research portfolios began to prioritize research methods that allow causal inference, and to use student outcomes as the major indicator of program success.

Dollars’ and scholars’ turn toward using causal methods and student-level impacts has resulted in a wealth of new studies. These new studies, we argue, permit rigorous empirical analyses linking program characteristics to outcomes, a topic long of interest to practitioners who make decisions regarding intervention design and/or adoption. In this paper, we present results from a meta-analysis of preK-12 STEM curriculum materials and professional development programs intended to improve instructional quality and student learning, seeking to understand what content, activities, and formats are linked to stronger student outcomes. Our work differs

from other recent efforts in that it is a formal meta-analysis rather than a structured review (e.g.,

Kennedy, 2015; Gersten, 2014), and because the newly available studies allow us to compile a dataset larger than past reviews, and thus to exclude studies with weaker designs.

We argue that this work is particularly timely. The Every Student Succeeds Act requires that districts receiving Title I funds must adopt “evidence-based interventions,” including programs and strategies proven to be effective in raising student achievement. However, recent null findings from large-scale studies (e.g., Borman, Cotner, Lee, Boydston, & Lanehart, 2009;

Garet et al., 2011; Santagata, Kersting, Givven & Stigler, 2010), as well as an analysis of studies curated by the What Works Clearinghouse (WWC; Malouf & Taymans, 2016), have led many to doubt the efficacy of instructional improvement programs. By contrast, we find an effect of

+0.21 standard deviations among the programs we study. We describe our study in more detail below.

Background

STEM Instructional Improvement

Recent calls for reform in STEM education have focused on increasing both student understanding of core disciplinary ideas and student engagement with key disciplinary practices such as inquiry, argumentation and proof (NRC, 2011; NGSS 2013; CCSS-M, 2010). Yet observational studies have shown that instruction in U.S. STEM classrooms tends to be lacking in disciplinary concepts and low in student cognitive demand (e.g., Authors, in press; Banilower,

Smith, Weiss & Pasley, 2006; Kane, Kerr, & Pianta, 2014; Hiebert et al., 2005).

Perhaps as a result, programs aimed at improving STEM instructional quality abound.

These programs tend to involve two main strategies for improvement: teacher professional development, which is typically intended to change some aspect of teachers’ instruction, and

new curriculum materials, which are typically intended to shape both instruction and the subject matter content teachers teach. Professional development and curriculum programs can be independent (e.g., professional development alone) or used in combination (e.g., when curriculum materials’ implementation is supported by professional development). In contrast to alternative reforms, such as standards, test-based accountability and market-based reforms, which set out broad principles and signals that schools and teachers must interpret and react to, curriculum and professional development provide direct instructional guidance, attempting to shape schools’ and teachers’ day-to-day interactions with students through lesson plans and instructional strategies (Ball & Cohen, 1996). We discuss the specific theory of action for both professional development and curriculum below.

Providers of STEM teacher professional development – often districts, but sometimes non-profits, professional associations, or university-based faculty – typically design a set of experiences intended to effect change in a variety of teacher- and classroom-level phenomena.

Most programs focus on instruction as a primary target for change, providing teachers new routines, instructional strategies, and/or ways of teaching content. However, many of these programs also hope to change teachers’ beliefs, knowledge of content, and knowledge of how students learn content in order to support instructional improvement. For instance, Garet et al.

(2010) describe a program in which teachers learn rational number concepts, learn about persistent student misconceptions related to this topic, and then receive group and individual coaching on how to apply the material learned in the content-focused sessions in their own classrooms. Schneider and Meyer (2012) describe a program in which teachers learn about assessments and formative assessment practices, then put them into practice. The STeLLA program (Taylor et al., 2015) focuses mainly on improving teachers’ capacity to conduct

analyses of instruction, but also includes sessions on science content and how students learn content. Such programs reflect the view that teachers’ instructional practice cannot change without corresponding changes in teachers’ beliefs and knowledge (Smith & O’Day, 1990).

As these examples imply, the experiences designed by professional development providers can vary substantially. As critiques of the “one-shot workshop” surfaced two decades ago (Goldenberg & Gallimore, 1991; Kennedy, 1999), providers experimented with new formats, including program delivery distributed across one or more school years, 1:1 coaching, collaborative workgroups, and teacher learning in online settings. Recent reform movements have also changed professional development foci, from programs primarily targeted toward improving teacher content knowledge (Frechtling, Sharp, Carey & Vaden-Kiernan, 1995) to more diverse topics, including improving teachers’ pedagogical content knowledge (Shulman,

1986), helping teachers develop and implement content-specific formative assessment, helping teachers address the needs of English Learners, and helping teachers use technology more effectively in STEM classrooms. With these new foci, professional development activities also changed, with newer programs offering more study of student work, more focus on the curriculum materials to be used in classrooms, and more focus on lesson analysis and planning.

The theory of action around curriculum materials also involves shaping what teachers do in classrooms, but here the focus is on both introducing specific instructional moves – supported by the activities and guidance in the text itself – as well as changing the content taught in substantial ways. For instance, the elementary curriculum materials supported by the National

Science Foundation in the 1990s were designed to engage students in mathematical practices such as problem-solving, mathematical reasoning, and precise communication, but also to include topics that had not previously been taught in elementary schools, such as data, statistics,

and early algebra (Stein, Remillard & Smith, 2007). In science, numerous projects have built units and curricula intended to replace conventional textbooks with more inquiry-oriented instruction and content that is more relevant to students’ lives or future careers (e.g., Marx et al., 2004; Schwartz-Bloom & Halpin, 2003).

It is worth noting that many providers of curriculum materials also hold a theory of action similar to that of professional developers – that teacher beliefs and knowledge must change in order to support improvements in classroom practice. In some programs, providers intentionally mix the new curriculum materials with intensive professional development. For instance, Saxe,

Gearhardt & Nasir (2001) paired a new fractions unit with a five-day summer workshop and 13 in-school sessions focused on improving teachers’ knowledge of fractions, teachers’ knowledge of how students learn fractions, and teachers’ knowledge of student motivation. Roschelle

et al. (2010) and Hand et al. (2014) followed a similar approach. Many curriculum providers also embed guidance for teachers in the written materials themselves. This may include explanations of content intended for teachers – for instance, pages explaining the conceptual ideas within the unit, how they connect to one another, how to represent those ideas, alternative solution methods to problems, and even sample dialogue or scripts. Examples of such curricula include Investigations in Number, Data, and Space in mathematics (Agodini et al.,

2013) and P-SELL (Llosa et al., 2016) in science.

In other cases, curriculum providers offer less professional development, often only a few days intended to orient teachers to routines in the curriculum, adaptations for students with specific needs, and technological enhancements to the curriculum (e.g., Pane, Griffin,

McCaffrey, & Karam, 2014; Resendez & Azin, 2006). A small number of curriculum materials

providers (e.g., Cervetti, Barber, Dorph, Pearson, & Goldschmidt, 2012) eschew professional development entirely, often in hopes of replicating real-world conditions in schools.

Syntheses of STEM Instructional Improvement Programs

Several recent syntheses examine STEM teacher professional development and curriculum improvement efforts. We review the methodology and findings from these syntheses, then comment on their characteristics.

In the area of professional development (Blank, de las Alas, & Smith, 2008; Kennedy,

1999, 2015; Yoon, 2007; Scher & O’Reilly, 2009; Wilson, 2013), prior syntheses have generally indicated positive impacts on learning. Meta-analyses suggest that effect sizes range between

+0.21 SD (Blank & de las Alas, 2010) and +0.54 SD (Yoon et al., 2007) on student outcomes, with some suggestion that content-specific (math) rather than content-general (classroom management) programs produce greater learning gains (Kennedy, 1999; Scher & O’Reilly,

2009). However, Scher and O’Reilly (2009) noted that the pool of STEM-focused articles and reports published by 2004 could not support most expert recommendations regarding “best practices” in professional development (see, e.g., Desimone, 2009). For example, although experts have frequently expressed the opinion that professional development must be longer in duration in order to be effective (e.g., Darling-Hammond & McLaughlin, 1995; Desimone,

2011), recent syntheses of the empirical literature have returned inconsistent findings on this point (e.g., Blank & de las Alas, 2010). Kennedy (2016) found that more intensive programs had weaker effects, whereas Yoon found that the three studies in his review that had the least amount of professional development (5-14 hours) showed no statistically significant impacts on student learning. However, Yoon’s review included only nine studies. Scher and


O’Reilly (2009) found that multi-year interventions were more effective than single-year interventions in math, but not in science.

Two recent structured reviews of professional development, each conducted several years after federal guidelines changed to prioritize causal research, provide another form of evidence regarding instructional improvement programs. One review specific to mathematics (Gersten,

Taylor, Keys, Rolfhus, & Newman-Gonchar, 2014) used the stringent WWC evidence standards to screen studies, and perhaps as a result returned too few (five) to discern patterns in program impacts. Another (Kennedy, 2015) identified 28 studies, split equally between English Language

Arts, science, and mathematics. Kennedy found that the characteristics often cited as best practices in professional development (content-focused, collective participation of all teachers within a grade/school, duration) were less predictive of positive outcomes than other features, including assisting teachers in gaining insight into their practice and helping teachers inject novel ideas into their practice. The latter categories included several coaching and curriculum-based programs. Finally, in science, Slavin and colleagues (Slavin, Lake, Hanley, & Thurston, 2014) found that professional development programs supporting inquiry-based elementary science raised student outcomes by an average of 0.36 SD. For secondary science, programs that emphasized inquiry through teacher professional development saw an average effect size of

+0.24 SD.

In the area of curriculum materials, two research syntheses by Slavin and colleagues

(Slavin & Lake, 2008; Slavin, Lake, & Groff, 2009) using studies published between 1971 and

2008 found fairly small effects for mathematics curriculum materials, at +0.03 SD and +0.10 SD for secondary and elementary curricula respectively. For science (Slavin, et al., 2014), elementary programs with “kits” and accompanying professional development had a near-zero


(+0.03 SD) average impact on student outcomes. The average impact of kits in secondary science proved similar (+0.04 SD), with the average impact of science textbooks not substantially larger

(+0.10 SD) (Cheung, Slavin, Kim, & Lake, 2017). A review of studies by the National Research

Council (2004) conducted in the early 2000s found stronger results for NSF-funded curricula than conventional materials, in that 59% of comparisons between materials favored the NSF materials. However, many regard vote counting as a weak and misleading synthesis procedure

(Hedges & Olkin, 1980), and Confrey (2006) herself noted that the studies included in the NRC report were of varying quality, and that as study rigor increased, effect sizes diminished. Slavin and colleagues, perhaps because they used more rigorous study selection criteria, did not find any relationship between study design and effect size, although they note that the methodological rigor of most included studies was still relatively weak, and “quality research is particularly lacking in this area” (Slavin, 2008, p. 480).

Finally, one recent study (Taylor et al., 2018) examined interventions that offered professional development, curriculum materials, and computer software intended to improve student science outcomes. Across 96 such studies they found an average impact of 0.489 SD, with programs evaluated by researcher-designed assessments posting significantly higher impacts, similar to an earlier report by Hill, Bloom, Black & Lipsey (2008). Other study and intervention characteristics did not significantly predict program outcomes.

Motivation for the Current Study

These research syntheses share several important characteristics. To start, nearly all noted that the evidence base for making recommendations was thin at the time of the review. Using the pool of articles and reports on professional development published by 2003, Yoon and colleagues (2007) could find only nine ELA, math, and science studies that met WWC evidence

standards. Years later, Gersten et al. (2014) located only five mathematics professional development studies that met the WWC’s exacting standards. Other evaluations (e.g., Scher &

O’Reilly, 2009; Slavin, Lake, Hanley, & Thurston, 2014; Cheung, Slavin, Kim & Lake 2017) achieved a greater sample size by including quasi-experimental studies with matched comparison groups (and sometimes retrospective matching, a disputed technique [Slavin, 2008]). Kennedy’s cross-subject synthetic review (2016) with 28 studies was an exception, as many studies described more recent strong quasi-experiments or actual randomized trials. In general, however, the small sample sizes typical in these reviews limited both the generalizability of findings and also the number of program features, or moderators, that could be coded and examined. We argue that recent IES- and NSF-funded classroom-level experiments provide a thicker evidence base, eliminating the need to include studies that conducted post-hoc matching and facilitating more moderator analyses.

These reviews share another important characteristic, in that many identified only a small number of program features for study. This is most apparent in the area of professional development, where research syntheses almost universally coded studies for duration, content-specificity, and delivery format. For curriculum materials, most analyses categorized programs by the degree of alignment with standards-based reform (e.g., Confrey & Stohl, 2004). With the accumulation of more recent rigorous studies, however, have come opportunities to understand how features of professional development or curriculum materials moderate effects on student outcomes. For instance, professional development activities (watching video, designing lessons, studying student work) may differentially impact program outcomes, as might curriculum design features, such as length of implementation and degree of support for teachers. In addition, the

programs studied recently contain more varied delivery methods and features (e.g., coaching; online learning components) than those of a decade ago.

Viewed from a contemporary perspective, several other issues emerge. First, existing research syntheses of professional development studies tend to find that most returned positive and significant effects. Yoon et al. (2007), for instance, found that only one of twenty effects identified across nine studies was negative, and only one was zero. Yet the more recent wave of studies tends to find more mixed impacts (e.g., Garet et al., 2008; Santagata et al., 2010). This suggests that the pool of studies included in earlier syntheses might have suffered from two issues: the “file drawer” problem, in which studies with null results are not published (Slavin,

2008); or the “boutique” problem, in which only highly promising – and often unusual – programs are studied (Author, 2004). We believe that both problems may have been ameliorated in recent years. IES and NSF funding guidelines have explicitly encouraged the evaluation of programs that have wide reach, meaning those programs may be more typical of the offerings available to teachers. Further, better reporting practices – e.g., final reports posted on websites or short reports archived on conference websites – may have rescued studies from file drawers.

Second, the practice of reviewing professional development and curriculum studies separately creates conceptual and practical difficulties. On the conceptual side, most curriculum programs also include a professional development component for teachers; likewise, some professional development programs offer teachers materials to support the implementation of new practices within classrooms. On the practical side, the combination of professional development and materials together may be especially effective, as compared to either one alone or one with a minimal dose of the other (Authors, 2001). Studying both within one review may

enhance our understanding of how these instructional improvement efforts can complement one another.

Methods

Search Procedures

We searched for studies in three phases. We began by scanning the reference lists of prior reviews of math and science professional development and curriculum improvement programs

(Blank, de las Alas, & Smith, 2008; Cheung, Slavin, Kim, & Lake, 2017; Furtak, Seidel, Iverson,

& Briggs, 2012; Gersten et al., 2014; Kennedy, 1998, 2015; Scher & O’Reilly, 2009; Slavin,

Lake, & Groff, 2009; Slavin & Lake, 2008; Slavin, Lake, Hanley, & Thurston, 2014; Timperley,

Wilson, Barrar, & Fung, 2007; Wilson, 2013; Yoon, 2007; Zaslow et al., 2010) for the years

1989 through 2004. Due to resource constraints, we did not conduct additional searches for materials dated prior to 2004. We argue that this is reasonable: prior reviewers have already searched the published and grey literature from the pre-2004 period extensively1, and rigorous studies of teacher PD and curriculum improvement were relatively rare prior to the early 2000s (e.g., Kennedy, 2016). We therefore considered the likelihood relatively low that we would have uncovered previously unknown randomized trials or sufficiently well-designed quasi-experiments from the pre-2004 period.2 See Appendix Table C13 for a descriptive comparison of the studies published pre-2004 versus studies published 2004 or later. As illustrated in the table, descriptively, studies from the earlier period had larger effect sizes on average; they were also less likely to be RCTs, and less likely to use a state standardized test as an outcome variable. As we discuss below, we control for these methodological variables in all of our primary models.


In the second search phase, we conducted an electronic search using the databases

Academic Search Premier, ERIC, Ed Abstracts, PsycINFO, EconLit, and ProQuest Dissertations

& Theses, for the period January 2004 - March 2016. Searches were conducted using subject-related keywords adapted from Yoon (2009), and methodology-related keywords designed to capture experimental and quasi-experimental methods adapted from Kim & Quinn (2012).3 We also searched the websites of Regional Education Labs, What Works Clearinghouse, the World

Bank, Inter-American Development Bank, Empirical Education, Mathematica, MDRC, and AIR for relevant materials. We also searched the abstracts of the Society for Research on Educational

Effectiveness (SREE) conference. We ceased our materials search in March 2016.

In the third search phase, we downloaded, from the NSF Community for Advancing

Discovery Research in Education (CADRE) and IES websites, a list of all STEM award grantees from the years 2002 to 2012. We then conducted electronic database and Web searches to find all studies published from each award. In the case of 29 awards, we could find no publicly available reports or information that included student outcomes, and we attempted to contact project PIs via email to obtain study results. Of these 29 grant awards, we were sent or later located reports from 17. Of these, the reports from 16 studies were excluded according to the screening criteria described below, and one was included in our analyses.

The search procedures described above netted a total of 8,099 records identified through database screening, and an additional 1,391 records identified through other sources (see Figure

1 for screening flow chart). After removing duplicates, we were left with 7,926 records.

Screening Procedures

Screening proceeded in two phases. First, two raters screened each of the studies' titles and abstracts to identify potentially relevant studies, passing studies into the second phase when

they covered grades preK-12, included student outcomes, included quantitative data, and focused on math and science-specific content and/or instructional strategies. All studies flagged as potentially relevant by either rater were then reviewed by one of the authors, who made a final determination about moving the study forward. A total of 656 studies met the initial relevance criteria and were advanced to full-text screening.

In the second screening phase, two authors examined the full text of each study and applied more detailed content and methodological criteria. In order to qualify for inclusion in the final dataset, we required that studies use a randomized or quasi-experimental research design with a comparison group. Following Slavin and Lake (2008), we included studies that assembled comparison groups via prospective, but not post-hoc, matching. Slavin and Lake (2008) have argued that “Prospective studies, in which experimental and control groups were designated in advance and outcomes are likely to be reported whatever they turn out to be, are always to be preferred to post hoc studies, other factors being equal.” According to Slavin and Lake (2008), post-hoc designs have several characteristic problems. First, when researchers attempt to examine the efficacy of a program by simply comparing the test scores of schools or classrooms who completed the program versus those that did not, only “survivors” are included in the treatment group. This can lead to upwardly biased estimates of the treatment impact, if more schools or classrooms began the treatment but dropped it, potentially because it was not working.

Second, when researchers construct post-hoc comparisons of treated and matched comparison schools, they generally have many potential ‘comparison’ schools to choose from; this raises the concern that researchers may inadvertently (or even intentionally) choose ‘comparison’ schools that have made less academic progress over the study period than the treatment schools. Third, post-hoc matching studies are often commissioned by textbook publishers and curriculum

developers because they are low-cost and easy to conduct. For these same reasons, study authors may easily abandon them if they do not show positive results, and thus the retrievable studies may be especially skewed toward positive findings (Slavin & Lake, 2008). To classify a study as using prospective matching, we required that the study authors explicitly report that they matched treatment and comparison groups on pretest data prior to the intervention’s commencement.

Studies that explicitly stated that matching was done retrospectively, or that were silent on this point, were excluded. We also excluded studies in which treatments had been in place prior to pretesting. We also required that studies present pretest means, and that pretest differences between groups be less than 50% of a standard deviation. Despite this relatively liberal threshold for allowable pretest differences, these data requirements resulted in the inclusion of only nine quasi-experiments.

Studies had to be published in 1989 or later, be written in English, and have participating students in grades preK-12. The program had to focus on classroom-level STEM instructional improvement through professional development, curriculum materials, or both. We excluded programs with no instructional improvement component, such as after-school peer tutoring or computerized at-home skills practice, as well as studies that examined the impact of variables, rather than programs.4 Studies had to include sufficient data to calculate an effect size for student outcomes. The most common reasons for exclusion were characteristics of the intervention

(e.g., off-topic; not a classroom-level intervention) (N = 561); methodology issues (e.g., no control group; post-hoc design) (N = 310); and sample issues (e.g., did not have at least two teachers and fifteen students in each condition) (N = 45) (note that some studies had multiple reasons for exclusion; see Figure 1).


Ninety-five studies met the review inclusion criteria and advanced to the study coding phase. In cases where we encountered multiple versions of the same study, we used all available reports to glean information about the study, and used the most recent version (often the peer-reviewed publication version) for impact estimates. Of the studies that met the inclusion criteria,

44 percent were from peer-reviewed journal articles, 23 percent were conference papers or presentations, 18 percent were technical reports, 5 percent were district, state, or federal government reports, 4 percent were doctoral dissertations, and 5 percent were from other sources. Many of these studies reported multiple effect sizes due to the inclusion of multiple outcome measures, multiple versions of the same program with a common control group, multiple samples, or multiple programs.

The final analysis sample includes 258 effect sizes nested within these 95 studies. This includes a separate effect size for each treatment contrast, each assessment of math or science achievement, and each sample of students and teachers reported by the study.5 To account for the nested nature of our data, our analysis used the robust variance estimation (RVE) approach

(Tanner-Smith & Tipton, 2013), as discussed below.

Study Coding

To develop content codes, we began by reviewing prior meta-analyses and reviews of instructional improvement interventions, naming both broad categories (professional development; curriculum; technology; subject matter) and specific codes (teachers studied curriculum materials; teachers solved math problems) that commonly appeared in the literature.

We then used this skeleton structure to code a sample of studies, modifying existing codes to be more clear and adding codes as needed to cover additional program features. We also added a set of codes that captured study size, design, and quality. Finally, our technical advisory board,

composed of university faculty and other experts in teacher professional development and curriculum materials, reviewed our codes, making suggestions for amending and adding as necessary.6

For programs involving professional development, four primary coding categories emerged. A first pair of codes captured professional development duration, indexed as the number of contact hours teachers spent in professional development experiences (both for professional development and curriculum materials programs) and as the timespan over which those hours were spread. A second set of codes recorded the main focus (or foci) of the professional development, including emphases on improving teacher content knowledge, exploring how students learn, integrating technology into the classroom, and learning how to use curriculum materials. Third, we coded for specific activities that teachers engaged in during the professional development, such as observing a demonstration of instruction or working through student curriculum materials. A final set of codes captured the format of the professional development, for instance whether it was delivered during a summer workshop, contained coaching, or involved online learning. Within each of the three substantive coding categories, we also created an indicator of the number of codes applied, hypothesizing that professional development programs with multiple foci, activities or formats (e.g., a focus on both how students learn and how to use curriculum materials) might be more effective compared to more narrowly-focused programs.

For programs that involved new curriculum materials, we coded for whether the curriculum provided implementation guidance (e.g., text describing possible student-teacher dialogues around content) or supplied teachers with kits for implementation. We also recorded

the proportion of a lesson the curriculum was intended to replace (i.e., one to 100 percent), and the total number of minutes the curriculum was intended for use in a school year.

We also designed codes to capture the rigor of the design as well as aspects of its implementation. Specifically, to record potential selection bias of participants into groups, we coded whether treatment and control groups were formed via random assignment or by prospectively matching individuals/classrooms/schools on baseline data. To explore potential impacts of general and differential attrition bias, we included two measures of potential effect size-level attrition problems: (1) author-reported attrition of more than 20% at the student or cluster level, and (2) differential attrition between treatment and control groups of more than

10%. Finally, we coded whether authors simply reported a given outcome as “non-significant” in order to capture the potential for bias due to selective reporting.

We note we were unable to capture other potential sources of bias such as performance and detection bias, that is, biases stemming from participants’ and researchers’ knowledge of whether participants were in the treatment or control group. Given that it is evident to participants and researchers alike whether teachers are engaged in new professional development and curriculum programs, in most cases it is not possible to blind participants and researchers to condition.

We also coded for a variety of other study descriptors, such as publication type (peer- reviewed vs. not), and whether the study took place in the U.S. or abroad. We examine whether any of these study descriptors relate to effect sizes. Based on evidence that effect sizes vary by type of test (Hill et al., 2008; Taylor et al., 2018), we also coded for the nature of the student assessment, including state or district standardized tests (e.g., the Texas Essential Knowledge and Skills assessment), standardized tests available through commercial vendors (e.g.,

Woodcock-Johnson), and researcher-designed assessments. We expected the last to be more

sensitive to the content of instructional improvement programs, and thus potentially provide larger effect sizes.

Coding of full-text studies was conducted by the study authors, along with two trained research assistants. Before beginning operational coding, we went through a process of establishing inter-rater reliability. Each week, each member of the team coded two studies, then met to reconcile disagreements and refine the code descriptions. This process continued until coders reached 80% agreement (four weeks for low-inference codes such as bibliographic information, five weeks for higher-inference codes surrounding program characteristics). Once we achieved a stable set of codes and the 80% threshold, each study was coded by two researchers, including at least one study author, working as a pair. Each researcher coded studies independently; the pair then met to reconcile discrepant records. All disagreements were resolved through discussion.
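As a simple illustration of the agreement check used during coder training, the sketch below computes percent agreement between two coders in R; the function name and example values are hypothetical, not the project's actual reliability script.

# Sketch: percent agreement between two coders across the codes applied to a study.
percent_agreement <- function(coder1, coder2) mean(coder1 == coder2)
percent_agreement(c(1, 0, 1, 1, 0), c(1, 0, 1, 0, 0))   # returns 0.80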

Effect Size Calculation

We calculated standardized mean difference effect sizes using Hedges’ g:

g = J \left( \bar{Y}_E - \bar{Y}_C \right) / S^*

In this formula, Ȳ_E represents the average treatment group outcome, Ȳ_C represents the average control group outcome, and S* represents the pooled within-group standard deviation. J is a correction factor that adjusts the standardized mean difference to avoid bias in small samples:

J = 1 - \frac{3}{4(N_E + N_C - 2) - 1}

In this formula, N_E represents the number of students in the treatment group and N_C represents the number of students in the control group. Effect sizes were calculated using the software package Comprehensive Meta-Analysis for the majority of cases. We used the following decision rules to calculate effect sizes: If the authors reported a standardized mean difference effect size,

such as Cohen’s d or Glass’s delta, we converted author-reported effect sizes to Hedges’ g (52 percent of effect sizes).7 If authors did not report a standardized mean difference effect size but did report a covariate-adjusted unstandardized mean difference (e.g., a coefficient from a multilevel model) and raw standard deviations, we calculated a standardized mean difference effect size and converted it to Hedges’ g (15 percent). If covariate-adjusted mean differences were not reported, we calculated effect sizes based on raw posttest means and standard deviations (22 percent). If neither standardized mean differences nor raw means and standard deviations were reported, effect sizes were calculated based on the coefficients and standard errors of multilevel or regression models (4 percent).8 In the remaining cases, effect sizes were calculated from other results (e.g., studies that reported the results of ANOVAs; 7 percent).
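To make these formulas and decision rules concrete, the following is a minimal R sketch of the Hedges' g computation from reported means, standard deviations, and sample sizes; the function and argument names are ours and purely illustrative, not the scripts used in this analysis.

# Sketch: Hedges' g using the small-sample correction J defined above.
hedges_g <- function(mean_t, mean_c, sd_t, sd_c, n_t, n_c) {
  s_pooled <- sqrt(((n_t - 1) * sd_t^2 + (n_c - 1) * sd_c^2) / (n_t + n_c - 2))
  d <- (mean_t - mean_c) / s_pooled                 # unadjusted standardized mean difference
  J <- 1 - 3 / (4 * (n_t + n_c - 2) - 1)            # small-sample correction factor
  g <- J * d
  v <- J^2 * ((n_t + n_c) / (n_t * n_c) + d^2 / (2 * (n_t + n_c)))  # approximate sampling variance
  list(g = g, v = v)
}

# Example: treatment mean 0.52, control mean 0.40, SD = 1.0 in both groups, 100 students each
hedges_g(0.52, 0.40, 1.0, 1.0, 100, 100)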

We applied corrections to study-reported standard errors when necessary to account for the clustering of students within classrooms or schools (Littell, Corcoran, & Pillai, 2008).9 For the five studies that did not report the number of clusters, we followed Scher and O’Reilly’s

(2009) procedure for imputing that number based on available information, typically estimating the total number of classrooms from the total number of students. When intraclass correlations could not be inferred from the study report, we also followed Scher and O’Reilly (2009) in adopting the WWC-recommended default intraclass correlation of 0.20.
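One common way to implement this type of clustering correction is a design-effect adjustment to the reported standard error; the sketch below assumes students nested in classrooms and defaults to the WWC intraclass correlation of 0.20 when none is reported. The function name is ours, and this is an assumed approach rather than the exact correction used in the study.

# Sketch: inflate a study-reported standard error for clustering via the design effect.
cluster_adjust_se <- function(se, n_students, n_clusters, icc = 0.20) {
  m <- n_students / n_clusters          # average cluster (classroom) size
  deff <- 1 + (m - 1) * icc             # design effect
  se * sqrt(deff)
}

# Example: 600 students in 30 classrooms
cluster_adjust_se(se = 0.05, n_students = 600, n_clusters = 30)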

We had 29 outcomes where the study did not provide enough information to calculate an effect size.10 All came from studies that did report an effect size for at least one additional achievement outcome, and so no studies were dropped from the analysis due to missing outcomes. Furthermore, many of these missing outcomes were subscales, meaning they were a smaller selection of items already represented in the reported effects. Therefore, we ignore these missing outcomes in the main analyses. However, we also examine the sensitivity of our results

to the inclusion of these outcomes by imputing a range of values for these missing effect sizes and re-estimating our primary models.11

Model Selection

A common problem in meta-analyses occurs when single studies yield multiple effect sizes, as authors often measure more than one outcome. These nested effect sizes are likely to be correlated, violating the assumption of statistical independence. To address this issue, prior meta-analyses in this area have averaged effect sizes (Kennedy, 2016; Scher & O’Reilly, 2009; Slavin,

Lake, Hanley & Thurston, 2014; Yoon et al., 2007). Here, we argue that a robust variance estimation (RVE) approach developed by Tanner-Smith and Tipton (2014) more properly models our data. This approach adjusts standard errors to account for the dependencies between effect sizes, and is analogous to methods for adjusting standard errors in OLS regression models for heteroskedasticity (e.g., using Huber-White standard errors) or to account for the nesting of data within clusters (e.g., clustered standard errors). Importantly, this approach allows for the inclusion of multiple effect sizes from the same study within the meta-analysis, and therefore avoids the loss of information that would occur by either dropping effect sizes or including a single average effect size for each study (for a more detailed discussion, see Tanner-Smith &

Tipton, 2014). RVE models can also control for methodological features known to affect outcomes (e.g., Hill et al., 2008; Taylor et al., 2018). This approach has been used in multiple recent meta-analyses where individual studies report multiple effect sizes (e.g., Clark, Tanner-

Smith, & Killingsworth, 2016; Dietrichson, Bøg, Filges, & Jørgensen, 2017; Gardella, Fisher, &

Teurbe-Tolon, 2017).

The RVE approach is designed to account for two types of dependencies among effect sizes. The first type of dependency is correlated effects, which arises when a single study

provides multiple effect size estimates for the same underlying construct or for correlated underlying measures, or when the same control group is used for multiple treatment contrasts.

The second type of dependency is hierarchical effects, which arises when multiple treatment-control contrasts are nested within a larger cluster of experiments (e.g., a research group conducts multiple evaluations of the same program). When both of these dependencies are present, Tanner-Smith and Tipton (2014) recommend selecting a method based on the more frequent type of dependence. Because correlated effects predominated in our data, we used the correlated-effects inverse variance weights recommended by Tanner-Smith and Tipton (2014).

The weight for effect size i in study j is calculated by the following:

w_{ij} = \frac{1}{\left( v_{*j} + \tau^2 \right)\left[ 1 + (k_j - 1)\rho \right]}

where v_*j is the mean of the within-study sampling variances (SE_ij^2), τ² is the estimate of the between-studies variance component, k_j is the number of effect sizes within each study, and ρ is the assumed correlation between all pairs of effect sizes within each study. Therefore, effect sizes from studies with more effect sizes and higher sampling variances are given lower weight.
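A brief R sketch of these weights follows, assuming a data frame with one row per effect size and columns study_id and var_g (both names hypothetical):

# Sketch: correlated-effects RVE weights, w_ij = 1 / ((v_*j + tau^2) * (1 + (k_j - 1) * rho)).
rve_weights <- function(dat, tau2, rho = 0.80) {
  v_bar <- ave(dat$var_g, dat$study_id, FUN = mean)    # mean sampling variance within each study
  k_j   <- ave(dat$var_g, dat$study_id, FUN = length)  # number of effect sizes in each study
  1 / ((v_bar + tau2) * (1 + (k_j - 1) * rho))
}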

It is assumed that ρ is constant across studies, and we use the recommended default value of ρ = 0.80 (Tanner-Smith & Tipton, 2014). However, simulation studies suggest that results using the RVE approach are not sensitive to values of ρ (e.g., Tanner-Smith & Tipton, 2014; Tanner-Smith et al., 2013; Wilson et al., 2011). We also conduct a series of sensitivity checks to determine whether our results are sensitive to alternative values of ρ. We use the robumeta package in Stata 15 (developed by Tanner-Smith & Tipton, 2013) to estimate our RVE models, and incorporate the small-sample correction proposed by the developers of the RVE approach

(Tanner-Smith & Tipton, 2014; Tipton & Pustejovsky, 2015). We also report the results of F-tests of the joint significance of the program features included in our RVE models. These

F-tests were conducted using the robumeta and clubSandwich packages in R (Fisher & Tipton,

2014; Tipton & Pustejovsky, 2015).
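As a hedged illustration of this estimation workflow, an intercept-only model and a moderated model might be fit in R as follows; the data frame and column names (dat, g, var_g, study_id, and the moderator indicators) are assumptions for illustration, not the actual analysis files.

library(robumeta)
library(clubSandwich)

# Correlated-effects RVE model with rho = 0.80 and the small-sample correction
m0 <- robu(g ~ 1, data = dat, studynum = study_id, var.eff.size = var_g,
           rho = 0.80, small = TRUE, modelweights = "CORR")
print(m0)

# Conditional model with hypothetical methodological controls and PD-format moderators;
# the joint F-test of the three format moderators uses clubSandwich with CR2 standard errors.
m1 <- robu(g ~ rct + researcher_test + math + summer_workshop + online_component + coaching,
           data = dat, studynum = study_id, var.eff.size = var_g,
           rho = 0.80, small = TRUE, modelweights = "CORR")
Wald_test(m1, constraints = constrain_zero(5:7), vcov = "CR2")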

As noted above, we grouped potential moderators of program impact into five categories indicating the time/duration, focus, activities, and format of professional development, and the characteristics of new curriculum materials. To examine whether specific features in each category moderated program impact, we fit four sets of conditional meta-regression models with robust variance estimation (RVE), including the coded features as moderators and treating these moderators as fixed. Within each category, we first modeled the effect of each code separately.

To further understand their joint relationships, we then fit a model entering all codes together.

All models controlled for whether the study was a randomized controlled trial, whether the effect size estimate controlled for covariates, the type of student assessment used, whether the program focused on mathematics (rather than science), and an indicator variable for whether the study was conducted at the preschool level.

In some cases, there is within-study variability in program features (e.g., among studies with multiple treatment arms). Following the recommendation of Tanner-Smith & Tipton (2014), we include the study-level mean value of each covariate and moderator. For the two covariates where there is within-study variability in at least 10 percent of studies (state standardized test, other standardized test), we also include a within-study version of the covariate that is calculated by subtracting the study-level mean values from the original covariate values. The full dataset is available from the authors on request.
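A short sketch of this centering step in R, again with assumed column names:

# Sketch: study-level means and within-study deviations for covariates that vary
# within studies (here, the two test-type indicators), per Tanner-Smith & Tipton (2014).
dat$state_test_mean   <- ave(dat$state_test, dat$study_id, FUN = mean)
dat$state_test_within <- dat$state_test - dat$state_test_mean
dat$other_std_mean    <- ave(dat$other_std_test, dat$study_id, FUN = mean)
dat$other_std_within  <- dat$other_std_test - dat$other_std_mean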

Finally, we note that the RVE method addresses heterogeneity of effect sizes differently compared to traditional meta-analyses. As described by the RVE developers, the primary aim of this method is to estimate fixed effects, such as meta-regression coefficients, rather than to model

variation in effect sizes. As a result, tests for heterogeneity used in traditional meta-analysis are not available with the RVE approach (Tanner-Smith & Tipton, 2014; Tanner-Smith, Tipton &

Polanin, 2016). However, we do report the method-of-moments estimate of τ² for each of our primary models as a measure of between-study heterogeneity in effect sizes.

Results

Descriptives and Overall Average Impacts

Table 1 provides descriptive statistics regarding the study designs and programs included in our dataset. Of the included studies, 22% focused on professional development alone (without the introduction of new curriculum materials), 9% focused on curriculum materials alone

(without the provision of professional development) and 75% included both new curriculum materials and professional development. Most included studies had randomized designs; only nine quasi-experimental studies met the criteria for inclusion. Roughly two-thirds of studies focused on mathematics rather than science. Table 1 also shows study-level frequencies for the format, timing and duration, foci, and activities of programs with any professional development component, and for characteristics of curriculum materials programs.12 Because we had a relatively small sample size, when two variables correlated at the 0.40 level or above, we combined them into a single predictor indicating whether the program had either feature. This occurred in two cases (a focus on content knowledge/pedagogical content knowledge and knowledge of how students learn; and, under activities, solving problems and working through student materials).
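The combining rule can be implemented with a simple check of the pairwise correlation followed by an either/or indicator; the sketch below uses hypothetical column names rather than our actual codes.

# Sketch: if two binary program-feature codes correlate at 0.40 or above, replace them
# with a single indicator equal to 1 when either feature is present.
cor(dat$ck_pck_focus, dat$how_students_learn_focus)
dat$ck_or_hsl <- as.integer(dat$ck_pck_focus | dat$how_students_learn_focus)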

Across all included studies, we found an average weighted impact estimate of +0.21 standard deviations (see Table 2). To contextualize the magnitude of this effect, a typical treatment group student would be expected to rank about 8 percentile points higher than a typical

control group student (Lipsey et al., 2012). We found a method-of-moments estimate of the between-study variance in study average effect sizes of 0.037, with a between-study standard deviation of 0.191, indicating that most studies had small to moderate positive impacts. Of the

258 effect sizes included in the 95 identified studies, 214 effect sizes (83 percent) were positive in sign and 88 effect sizes (34 percent) were statistically significant. Only 40 effect sizes were negative in sign (16 percent) and only 4 were negative and statistically significant (2 percent).

Four effect sizes had point estimates of zero (2 percent). The unweighted average effect size for math outcomes was +0.27 SD and the average effect size for science outcomes was +0.18 SD.

However, the results of a conditional meta-regression model estimated using RVE indicate no statistically significant difference in mean effect sizes based on whether programs focused on math or science (see Table 3). We therefore include both math and science outcomes in all analyses.
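For reference, the percentile-point interpretation of the overall +0.21 SD estimate reported above follows directly from the normal distribution; a one-line check in R:

# A +0.21 SD effect moves the median treatment student to roughly the 58th percentile
# of the control distribution, i.e., about 8 percentile points higher.
100 * (pnorm(0.21) - 0.50)   # approximately 8.3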

As shown in Table 3, the type of assessment employed was a strong predictor of effect size magnitude, with impacts for both standardized (e.g., Woodcock-Johnson) and state standardized tests lower by about 0.27 SD (p<0.01) relative to impacts for researcher-designed assessments. Table 3 also shows that there were not statistically significant differences in effect sizes based on study design (i.e., whether the study was an RCT), whether study authors adjusted for covariates such as student demographic indicators, whether the study sample included preschool students, and subject matter.

Table 4 shows that the average effect size for programs that incorporated both professional development and new curriculum materials was approximately 0.10 SD (p<0.05) larger than the average ES for programs that included only professional development or only new curriculum materials.


Features That Moderate Program Impacts

Next, we turn to features that may moderate impacts on student outcomes, beginning with the professional development models. We did not find a significant relationship between professional development contact hours and student outcomes (see Table 5). We first examined whether there was a linear association between contact hours and effect size (column 1). Next, to look for non-linearities in the contact hours data, the model presented in column 2 compares programs with few hours (i.e., below the 25th percentile; omitted) and those with a larger number of contact hours (i.e., 25th to 50th percentile; 50th to 75th percentile; and 75th percentile and above). We did not find evidence of a significant association in either model. To investigate a threshold effect, the model presented in column 3 examined programs with 16 hours or more and found no relationship; this finding held at various threshold levels (not shown). In a separate analysis, we did not find a significant relationship between the timespan (e.g., one month, one year) over which professional development activities were conducted and effect sizes (models not shown).

We next examined the associations between the focus of the professional development and effect sizes via a series of multilevel regression models (Table 6). Average effect sizes were larger when programs focused on how to use curriculum materials (+0.12 SD, p<0.10) and when they focused on improving teachers' content and pedagogical content knowledge and/or how students learned the content (+0.09 SD, p<0.05). Both of these associations remained significant and similar in size when all predictors were included in the model. Average effect sizes were also larger when the program focused on formative assessment (+0.13 SD, p<0.10), although this did not retain its size or significance in the final model. A focus on content-generic instructional strategies was not a significant predictor of the effect size magnitude. Finally, we also find that

programs that included more of the foci listed in Table 6 had larger effect sizes, on average, compared to programs focused on a narrower set of topics (+0.08 SD, p<0.01).

Next, we turned to professional development activities (Table 7). No activities for which we coded – observing demonstrations, reviewing generic student work (problems or investigations completed by students outside the teachers’ classes), solving math or science problems/working through student materials, developing curriculum or lesson plans, and reviewing teachers’ own students’ work – were significant predictors of effect size magnitude.

However, we find that the number of PD activities incorporated in the program, defined as the total number of the five activities listed in Table 7, significantly predicted effect size magnitude

(+0.05 SD, p<0.10).

Table 8 examines the relationship between effect sizes and professional development formats. On average, PD programs that had teachers participate alongside other teachers in their school, which we refer to as same-school collaboration, yielded outcomes 0.12 SD larger

(p<0.10) than programs without such collaboration. This result is significant in the final model only. In addition, programs that included PD with implementation meetings yielded significantly larger effect sizes on average than programs without these meetings (+0.12 SD in the final model, p<0.05). These implementation meetings typically allowed teachers to convene briefly with other activity participants to troubleshoot and discuss obstacles and aids to putting the program into practice. Meanwhile, programs that included professional development that incorporated an online component yielded significantly smaller impacts on average relative to programs that did not involve any online components (-0.15 SD in the final model, p<0.05). Additionally, in the final model, programs that included a summer workshop had larger effect sizes, on average, as compared with those that lacked this component (+0.07 SD, p<0.10). The remaining formats

examined – whether the program featured coaching from experts or professional development led by researchers – were not significant predictors of effect size magnitude. The number of PD features was also not a significant predictor of effect size magnitude.

Table 9 shows the associations between features of new curriculum materials and effect sizes. None of the features examined were significantly associated with average effect sizes.

It is important to note that the above moderator analyses provide estimates of the associations between the presence of specific program features and average effect sizes; however, programs that lack the specific features that positively predict impacts may still improve student outcomes on average. Thus in Table 10 we present the results of these moderator tests summarized in terms of regression-adjusted mean effect sizes. First, we present mean effect sizes based on subgroup analyses without controls for additional program features.

These mean effect sizes are based on unconditional meta-regression models estimated using

RVE to account for the nesting of effect sizes within studies. Next, we present mean effect sizes based on conditional meta-regression models corresponding to our main moderation analyses with each predictor included separately and controlling for the program features listed in Table 3.

Finally, we present mean effect sizes corresponding to our final moderation analyses with all predictors within each category entered simultaneously. For brevity, we focus on only those features that were statistically significant predictors of program impact in our main models.

The mean effect sizes presented in Table 10 indicate that the overall impacts of programs with and without the moderators of interest were generally positive. Estimated mean effect sizes for STEM professional development and curriculum improvement programs are generally positive even among programs that lacked the intervention features we identified as associated with larger effect sizes. For example, average effect sizes were positive among programs

including either professional development or new curriculum materials but not both components (ḡ_c = 0.156, ḡ_uc = 0.136, p_uc < 0.01), professional development programs that had an online component (ḡ_c+ = 0.096, ḡ_c = 0.103, ḡ_uc = 0.116, p_uc < 0.05), and professional development that did not contain a focus on improving teachers’ content knowledge, pedagogical content knowledge or knowledge of how students learn (ḡ_c+ = 0.179, ḡ_c = 0.175, ḡ_uc = 0.095, p_uc < 0.01). Although impacts were largest on intervenor-developed assessments, estimated mean effect sizes are still positive for impacts on state standardized assessments (ḡ_c = 0.0101, ḡ_uc = 0.060, p_uc < 0.01) and other standardized assessments (ḡ_c = 0.087, ḡ_uc = 0.084, p_uc < 0.01). Here, ḡ_uc denotes the mean effect size from the unconditional model, ḡ_c the mean from the conditional model with the predictor entered separately, and ḡ_c+ the mean from the final model with all predictors entered simultaneously.

Furthermore, the differences in mean effect sizes within each category based on estimating unconditional models, with the exception of same-school collaboration, are generally comparable in direction and magnitude to those based on conditional models.

Study Design Moderators

Next, we examined whether a wider range of study design characteristics and implementation contexts were associated with effect size magnitudes (Table 11). Studies that included credit incentives for participating teachers showed smaller impacts on average (-0.17

SD in the final model, p<0.05). We found that the unit of assignment to the program (schools vs. teachers) did not predict effect sizes. We did not find significant relationships between effect size magnitudes and whether teacher participation was voluntary or mandatory, or whether teachers received monetary incentives; however, this information was unreported in many studies.

Finally, we found that the level of attrition did not predict the size of impacts.

To further examine the potential role of attrition bias, we compared the adjusted mean effect sizes for studies that do and do not meet our attrition standards. We see

little evidence that studies with attrition problems reported larger impacts (see Appendix Table

A2).

In separate analyses, no other study design features were significantly related to outcomes, including whether the study involved one district, multiple districts and/or states; whether the study was conducted in the U.S. or abroad; whether the study was conducted in an urban vs. non-urban setting; and whether the study sample was majority low-income. We also found that study size (defined as the average number of treatment clusters across effect sizes within a study) was not a significant predictor of study outcomes (results not shown).

Publication Bias

Finally, we considered whether effect sizes from peer-reviewed sources differed in magnitude, on average, relative to effect sizes from other sources. In Table 12, we present results from a meta-regression model estimated using RVE that includes whether the effect size was from a peer-reviewed publication as a moderator. We found that effect sizes from peer-reviewed studies were larger than those from other sources by 0.05 SD in an uncontrolled model (Model 1) and by 0.07 SD after controlling for study design, study sample, subject area, and outcome measure type (Model 2). However, these differences were not statistically significant.

We also conducted two additional tests for publication bias. First, we assessed publication bias using Egger's regression test (Egger et al., 1997). Given the multiple effect sizes within each study, we conducted this test at the study level by regressing the study-average standard normal deviate (the average effect size divided by the average standard error) on the inverse of the study-average effect size standard error. This approach tests the null hypothesis that the intercept is zero; if the null hypothesis is rejected, this indicates that smaller (or less

precise) studies have systematically larger or smaller reported effect sizes relative to larger (or more precise) studies, which could be due to publication bias. We then used the "trim and fill" method to examine the magnitude of potential publication bias (Duval & Tweedie, 2000).

Although the models estimated are not precisely analogous to our preferred estimation approach, results indicate potential publication bias in the full sample: studies with larger effect size standard errors (i.e., smaller and less precise studies) have larger effects (p<0.001). A comparison of the adjusted and unadjusted estimated average effect sizes from the trim-and-fill method indicates that the magnitude of potential publication bias is substantial. Second, to account for the nested structure of the data, we also used a modification of Egger's regression test, adding the standard errors of the effect sizes as a moderator to the unconditional RVE meta-regression model. If effect size standard errors are significant predictors of effect sizes, this could similarly indicate the presence of publication bias. As above, results indicate potential publication bias in the full sample (p<0.001).

However, conducting both methods separately for peer-reviewed and non-peer-reviewed studies detects potential publication bias only among peer-reviewed studies. Results of both Egger's regression test and the RVE approach indicate that smaller (or less precise) peer-reviewed studies report larger impacts (p<0.001), whereas there is less evidence of systematic differences in reported effect sizes based on study size or precision among studies from other sources (p=0.116; p=0.050). This suggests that the association between study size or precision and reported impacts among peer-reviewed studies may result from publication bias rather than from other factors (e.g., more effective interventions being more costly and therefore evaluated with smaller samples). This highlights the importance of including the "grey literature" in

systematic reviews and meta-analyses of instructional improvement efforts (Polanin, Tanner-

Smith, & Hennessy, 2016). For full results of these tests, see Appendix Table A3 and Figure A1.
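To illustrate the publication-bias diagnostics described above (these are not reproductions of our exact models), the following R sketch runs a study-level Egger-type regression and a trim-and-fill analysis; the data frame studies and its columns (study_es, study_se) are hypothetical, and metafor is used here purely for convenience.

# Study-level publication-bias checks on a hypothetical data frame 'studies'
# (one row per study: average effect size and average standard error).
library(metafor)

# Egger-type test: regress the standard normal deviate on precision (1/SE);
# rejecting H0 (intercept = 0) suggests small-study effects.
studies$snd <- studies$study_es / studies$study_se
studies$precision <- 1 / studies$study_se
summary(lm(snd ~ precision, data = studies))

# Trim-and-fill: compare adjusted and unadjusted average effect sizes.
res <- rma(yi = study_es, sei = study_se, data = studies)
trimfill(res)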

Sensitivity Checks

We conducted additional analyses to examine the robustness of our results. First, we addressed the fact that in 20 cases, authors reported some impacts as 'not statistically significant' but did not provide enough information to calculate an effect size. One concern is that failing to provide this information may be correlated with program features. In addition, effect sizes that are not statistically significant but negative in sign may be less frequently reported than effect sizes that are positive in sign. We therefore tested the robustness of our results to the inclusion of these 20 missing effect sizes by imputing a range of values for missing effect sizes

(g = 0.00, g = -0.10, and g = -0.20) and using the study-level mean of the effect size standard error to calculate their weights. If our results are driven by differential reporting of information regarding effect sizes that are not statistically significant based on whether point estimates are positive or negative in sign, including these impacts should attenuate our results. In general, including these effect sizes did not substantively change our results. In the majority of cases where our primary results showed a significant association between program features and outcomes, results are comparable in magnitude and remain statistically significant. The only exceptions are the number of professional development activities and use of same-school collaboration, which are no longer significant predictors of program impact, although the associations are comparable in magnitude to our main estimates.
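A minimal sketch of this imputation check follows, using the hypothetical data frame introduced in the earlier sketch; the flag missing_es and other names are placeholders, and the call mirrors the unconditional model rather than every moderator model we re-estimated.

# Sensitivity check: impute values for the 20 unreported, nonsignificant impacts.
library(robumeta)

for (g_impute in c(0.00, -0.10, -0.20)) {
  dat_imp <- dat

  # Use the study-level mean standard error to weight the imputed effect sizes.
  study_mean_se <- ave(dat_imp$se_es, dat_imp$study_id,
                       FUN = function(x) mean(x, na.rm = TRUE))
  dat_imp$es[dat_imp$missing_es]    <- g_impute
  dat_imp$se_es[dat_imp$missing_es] <- study_mean_se[dat_imp$missing_es]
  dat_imp$var_es <- dat_imp$se_es^2

  fit <- robu(es ~ 1, data = dat_imp, studynum = study_id,
              var.eff.size = var_es, rho = 0.80, small = TRUE)
  print(fit)
}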

We also found that our results were largely robust to excluding studies in settings outside the U.S., and to excluding studies with weaker designs (i.e., studies that were not RCTs).

Associations are generally comparable in magnitude to our main results, although in some cases

less precisely estimated. The only exceptions are that a focus on how to use new curriculum materials is not a significant predictor of impacts after excluding non-RCT studies, and that a focus on how to use new curriculum materials and the number of professional development activities are not significant predictors of impacts after excluding studies outside the U.S. These associations are comparable in magnitude to our main estimates. We also examined the sensitivity of our results to choosing different values of the within-study correlation between effect sizes. In our primary models, we specify the correlation to be 0.80, the value recommended by Tanner-Smith and Tipton (2014). Using alternative specifications (ρ = 0.50,

0.70, and 0.90) did not change our results. Full results of these sensitivity checks are available from the authors on request.
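The correlation sensitivity check can be sketched in the same hypothetical setup by re-estimating the model across a grid of ρ values:

# Re-estimate the unconditional RVE model under alternative within-study correlations.
library(robumeta)

for (r in c(0.50, 0.70, 0.80, 0.90)) {
  fit <- robu(es ~ 1, data = dat, studynum = study_id,
              var.eff.size = var_es, rho = r, small = TRUE)
  cat("rho =", r, "\n")
  print(fit)
}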

Discussion and Conclusions

To summarize, we found that studies of STEM instructional improvement programs had, on average, positive effects on student achievement, with a mean pooled effect size across studies of 0.21 SD. This pooled effect size sits in the middle range of those found in prior reviews: smaller than those identified in the Yoon et al. (2007) review (0.57 SD in math and 0.51 SD in science) and the Taylor et al. (2018) review (0.489 SD in science), and somewhat larger than those identified in the Scher and O'Reilly (2009) review (0.14 SD in math and 0.13 SD in science).

Although the effect sizes differ, our results confirm earlier reviewers' findings that studies of

STEM instructional improvement efforts tend to show positive results.

We conducted a series of analyses to examine the relationships between program characteristics and the size of achievement impacts. The characteristics that were significantly associated with improved student learning across the current set of studies included the following:


• The use of professional development along with new curriculum materials;
• A focus on improving teachers' content/pedagogical content knowledge, or understanding of how students learn;
• Specific formats, including:
  • meetings to troubleshoot and discuss classroom implementation of the program;
  • the provision of summer workshops to begin the professional development learning process;
  • same-school collaboration.

We also found that on average, programs that provided any component of the PD online had poorer student outcomes than programs that did not use an online PD component. In general, there was not a statistically significant difference in the magnitudes of these associations depending on whether programs focused on mathematics or science.

Components Associated with STEM Program Effectiveness

Taken together, we generally find that providing teachers with opportunities to learn about the materials they will use with students and/or to participate in programs that seek to improve their content or pedagogical content knowledge is associated with improved student outcomes. These findings accord with prior cross-sectional research. For example, Boyd,

Grossman, Lankford, Loeb, and Wyckoff (2009) found that teacher preparation programs that focused closely on teachers' classroom practice, including reviewing the district curriculum, produced teachers with better student outcomes in their first year of teaching. Authors (1998) also found that teachers' participation in curriculum-centered professional development was related to student achievement. These findings also lend support to the conclusions drawn in

prior reviews conducted by Scher and O'Reilly (2009) and Kennedy (1999). Scher and O'Reilly (2009) found that interventions that focused on both content and pedagogy posted stronger student outcomes as compared with interventions that focused on pedagogy alone, while Kennedy

(1999) concluded that programs were more effective when they focused on how to teach specific content and on how students learn the same content.

Most studies of curriculum materials in our dataset included at least some component of professional development, and vice versa. However, examining these elements jointly shows that, on average, programs that incorporated both professional development and new curriculum materials had larger impacts than programs that included only one of these components. These findings lend support to the argument advanced by some scholars that curriculum materials alone, even those designed to be educative and supportive of teacher implementation, may be insufficient to change teaching practice, given the complexity of classroom instructional interactions and the often-ingrained nature of traditional initiation-response-evaluation teaching practices (Alozie, Moje, & Krajcik, 2010). Meanwhile, when professional development is provided without reference to a specific curriculum, teachers may struggle to implement what they have learned while using existing curricular materials and textbooks.

We also found a positive association between student outcomes and teachers' participation in implementation meetings, defined as meetings in which teachers met formally or informally with other activity participants to discuss enacting intended practices. We additionally found a significant relationship, in our final model, between same-school collaboration

(teachers participating in PD alongside their colleagues) and student outcomes. These findings align with prior work (e.g., Darling-Hammond & McLaughlin, 1995) emphasizing the

importance of providing teachers with opportunities to discuss instructional innovations with colleagues (e.g., Penuel, Sun, Frank, & Gallagher, 2012) and to troubleshoot issues that arise when implementing new instructional approaches.

Perhaps more surprisingly, the inclusion of a summer workshop was positively related to student outcomes. Prior syntheses (Scher & O'Reilly, 2009; Yoon et al., 2007) did not examine this variable specifically. The summer workshop format for professional development has been critiqued in the past for its typically 'one-shot' nature, but intensive summer professional development may provide an effective 'springboard' for school-year implementation. On the other hand, programs in which participants completed a portion of the professional development online had weaker student outcomes, on average, as compared with programs that did not include any online PD. Dede, Ketelhut, Whitehouse, Breit, and McCloskey

(2009) point out that although online teacher professional development is burgeoning, relatively little rigorous research has examined its effectiveness. Although these analyses are correlational in nature, the positive results associated with both the summer professional development and the teacher implementation meetings are consistent with the notion that teachers may have benefited from both intensive and ongoing opportunities to interact with one another around program content, opportunities that may have been less salient in the online format.

In contrast to two earlier reviews (Scher & O'Reilly, 2009; Yoon et al., 2007), we find no evidence of a positive association between the duration of professional development and program impacts. Yoon et al. (2007) compared the effectiveness of interventions that included more than versus fewer than 14 hours of professional development, finding that the former had larger impacts on student achievement. Scher and O'Reilly (2009) compared the effectiveness of interventions conducted over two or more years versus those conducted over one year, and found that those

conducted over two or more years were more effective on average among math-focused, but not science-focused, interventions. However, both of these reviews noted that the small number of included studies limited their ability to draw firm conclusions. The current findings, using a continuous measure of contact hours and a separate measure of timespan, suggest that programs that were limited in duration nonetheless generally had positive impacts on average. For example, several programs that combined new curriculum materials with a short amount of professional development documented moderate to large impacts on student achievement (e.g.,

Arnold et al., 2002; Clements & Sarama, 2007; Presser, Vahey, & Dominguez, 2015). In contrast, some studies of highly intensive professional development programs showed little or no impact on student learning (e.g., Authors, 2016; Devlin-Scherer et al., 1998; Van Egeren et al.,

2014). Our findings echo those of Kennedy (1999, 2016), who did not find a clear benefit of contact hours or program duration, and concluded that the core condition for program effectiveness was valuable content; more hours of a given intervention will not help if the intervention content is not useful.

Similar to Taylor et al. (2018), our analysis did not detect any relationship between student outcomes and study design, science subject matter, or grade level. However, we found that average impacts varied considerably depending on the type of student test used. On average, student outcomes were larger on researcher-designed assessments as compared with standardized tests. While standardized tests may have benefits such as face validity and broad content representation, they may not be especially sensitive to instructional improvement efforts, due to differences between the skills measured by the tests and those targeted in the intervention (see also Hill et al., 2008; Taylor et al., 2018; Sussman & Wilson, 2018). If this is the case, in order to understand how instructional improvement interventions influence

student learning, researchers may need to include both assessments that are closely tied to the intervention's student learning goals and broader standardized tests.

Limitations and Future Directions

We note several limitations to the current review. One limitation is missing studies. First, although we attempted to search both the published and unpublished literatures, some studies may have eluded our grasp. Second, many programs and interventions that are routinely conducted in schools are never evaluated, and these programs may differ in unknown ways from those that are formally evaluated. Most of the programs studied are boutique programs, often designed, operated, and evaluated by university or contract researchers. We know little about the efficacy of professional development that reaches typical teachers. The characteristics we identified as potentially effective here may not carry over to typical conditions or to the resources conventionally available in school districts.

Missing data in study reports posed another limitation. We initially hoped to code the included studies for a number of additional features that have been hypothesized in the literature to influence the effectiveness of instructional improvement programs, including district and school leadership support, competing instructional improvement initiatives, teacher recruitment methods, and the financial resources provided to support the intervention (Wilson, 2013).

However, we found that few study reports contained sufficient detail on these matters, making tracking their impact impossible. This omission is striking; in nearly all published and unpublished reports, the district context is simply a black box, or merely a site for teacher recruitment and service delivery. The inclusion of more contextual information in study reports is pressing, especially because many district administrators evaluate proposed programs based on the extent to which studies were carried out in “districts like ours.” Gathering contextual

information would also provide insight into the conditions necessary for instructional improvement programs to thrive.

In addition, it is important to note that, as is generally the case in meta-analyses, the moderator analyses we have conducted are correlational. The authors of the included studies generally did not randomly manipulate the variables that we identified as associated with improved student outcomes in the moderator analyses, such as by randomly assigning teachers to participate versus not participate in a summer workshop. As a result, moderator analyses are potentially confounded by unobserved or inadequately measured study and student characteristics. The characteristics of professional development and curriculum programs that appear promising in the current study's moderator analyses thus are not definitive, but point toward potentially productive areas for future experimental work. Future randomized experiments comparing instructional programs that do versus do not have these components are warranted, to estimate causally the effects of robust STEM professional development and curriculum interventions on student learning. Additionally, our use of binary indicators rather than raw attrition rates constitutes a limitation, as continuous attrition measures would have allowed us to estimate the impacts of attrition more precisely.

Despite these limitations, we were able to distill findings from dozens of recent experimental and strong quasi-experimental evaluations of instructional improvement programs in STEM. Based on the extant literature, the types of practices that were associated in our review with improved student learning, such as providing teachers with opportunities to engage with the curriculum they teach, develop their content knowledge, pedagogical content knowledge, and understanding of how students learn, and discuss classroom implementation, likely occur infrequently in the instructional improvement programs typically offered in US schools (e.g.,


Author, 2009). Future experimental studies that build on the current findings are warranted, to advance researchers' and policymakers' understanding of core instructional reform practices that improve student learning in STEM.

Endnotes

1 For example, in the area of middle and secondary math curriculum materials, Slavin and colleagues searched the published and grey literature dating back to 1970, conducting "[a] broad literature search ... in an attempt to locate every study that could possibly meet the inclusion requirements. This included obtaining all of the middle school studies cited by the What Works

Clearinghouse (2008b) and the middle and high school studies cited by NRC (2004), by Clewell et al. (2004), and by other reviews of mathematics programs, including technology programs that teach math (e.g., Chambers, 2003; Kulik, 2003; Murphy et al., 2002). Electronic searches were made of educational databases (JSTOR, Education Resources Information Center, EBSCO,

PsycINFO, Dissertation Abstracts), Web-based repositories (, Yahoo, Google Scholar), and education publishers’ Web sites. Citations of studies appearing in the first wave of studies were also followed up. A particular effort was made to find non-U.S. studies." They performed a similar search for elementary math curriculum materials (Slavin & Lake, 2008), elementary science curriculum materials (Slavin et al., 2014), and secondary science materials (Cheung,

Slavin, Kim, & Lake, 2016). We refer the reader to those articles for specifics. In the area of professional development, Yoon et al. (2007) searched for studies dating from 1986 and explain that "Studies were gathered through an extensive electronic search of published and unpublished research literature. The review protocol included a list of keywords that guided the literature search. Seven electronic databases were core data sources: ERIC, PsycINFO, ProQuest,


EBSCO’s Professional Development Collection, Dissertation Abstracts, Sociological Collection, and Campbell Collaboration. These databases were searched separately for each of the three subjects under review (mathematics, science, and reading and English/language arts). In consultation with a reference librarian, search parameters were developed using database- specific keywords ... A deliberately wide net captured literature on professional development and student achievement, broadly defined ... Fourteen key researchers were also asked to identify research for the study. Eight researchers responded, recommending additional studies that fit the study purpose. Finally, existing literature reviews and research syntheses were consulted to ensure that no key studies were omitted."

2As another point of comparison, the What Works Clearinghouse protocols have generally limited their scope to studies from the past twenty years, with some categories such as conference proceedings limited to the past seven years (Brown, Card, Dickersin, Greenhouse,

Kling, & Littell, 2008).

3The specific search strings applied to searches of the titles and abstracts were as follows:

(“professional development” OR “faculty development” OR “Staff development” OR “teacher improvement” OR “inservice teacher education” OR “peer coaching” OR “teachers’ institute*”

OR “teacher mentoring” OR “Beginning teacher induction” OR “teachers’ Seminar*” OR

“teachers’ workshop*” OR “teacher workshop*” OR “teacher center*” OR “teacher mentoring”

OR curriculum OR instruction*) AND ( “Student achievement” OR “academic achievement”

OR “mathematics achievement” OR “math achievement” OR “science achievement”

OR “Student development” OR “individual development” OR “student learning” OR

“intellectual development” OR “cognitive development” OR “cognitive learning” OR “Student

Outcomes” OR “Outcomes of education” OR “educational assessment” OR “educational

measurement” OR “educational tests and measurements” OR “educational indicators” OR

“educational accountability”) AND ("*experiment*" OR "control*" OR "regression discontinuity” OR “compared” OR “comparison” OR “field trial*” OR “effect size*” OR

“evaluation”) AND (“Math*” OR “*Algebra*” OR “Number concepts” OR “Arithmetic” OR

“Computation” OR “Data analysis” OR “Data processing” OR “Functions" OR “Calculus” OR

“Geometry” OR “Graphing” OR “graphical displays” OR “graphic methods” OR “Science*”

OR “Data Interpretation” OR “Laboratory Experiments” OR “Laboratory Procedures” OR

“Experiment*” OR “Inquiry” OR “Questioning” OR “investigation*” OR “evaluation methods”

OR “laboratories” OR “biology” OR “observation” OR “physics” OR “chemistry” OR

“scientific literacy” OR “scientific knowledge” OR “empirical methods” OR “reasoning” OR

“hypothesis testing”).

4As Slavin (2008) defined this contrast, “A program is defined here as any set of replicable procedures, materials, professional development, or service configurations that educators could choose to implement to improve student outcomes. A program is distinct from a variable in consisting of a specific, well-specified set of procedures and supports. Class size, assigning homework, or provision of bilingual education are variables, for example, whereas programs typically are based on particular textbooks, computer software, and/or instructional processes and usually have a name and a specific provider, such as a company, university, or individual.”

5 For studies with multiple treatment arms, this includes separate effect sizes from each treatment contrast. For studies that reported impacts for multiple groups of teachers or students (e.g., studies that reported impacts separately by grade), this includes separate impacts for each teacher and student sample. However, we do not include multiple effect sizes for impacts on the same assessment for the same teacher and student sample (e.g., cases where studies administered the

same assessment to examine follow-up impacts over time). We also include a separate effect size for each assessment of math or science achievement reported by each study. For example, if a study reported impacts on two separate assessments (e.g., a standardized test and a researcher- designed assessment), both effect sizes were coded and included in the analysis. We did not include impacts on assessment subscales or sub-scores if impacts on total scores on the assessment were also reported. We included impacts on subscales or sub-scores only if impacts on total scores were not reported.

6 Our technical advisory board consisted of five members – three with expertise in instructional improvement and program evaluation, and two with expertise in meta-analysis. We met in person with the first group to gather feedback on our proposed coding system. We consulted with and sent drafts of our paper to the latter group for feedback on our data and models.

7 Some author-supplied effect sizes were described only as being similar to a standardized mean difference (including standardized coefficients from multilevel models). These were treated as Cohen’s d effect sizes and converted to Hedges’s g.

8 For these cases, a standardized effect size was calculated as the ratio of the unstandardized regression coefficient to an estimate of the standard deviation of the outcome. The outcome standard deviation was estimated using the formula provided by Higgins and Deeks (2008):

SD = SE / √(1/N_T + 1/N_C), where N_T represents the number of students in the treatment group, N_C represents the number of students in the control group, and SE represents the coefficient standard error.
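As a quick numerical illustration of this conversion (with made-up values, not drawn from any included study):

# Hypothetical worked example of the Higgins and Deeks (2008) conversion.
se  <- 0.20            # coefficient standard error
n_t <- 100; n_c <- 100 # treatment and control group sizes
sd_outcome <- se / sqrt(1 / n_t + 1 / n_c)  # approximately 1.41
b <- 0.35              # unstandardized regression coefficient
b / sd_outcome         # standardized effect size of roughly 0.25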

9 These include cases where the authors did not take into account the nesting of students within classrooms and/or schools, cases where effect sizes were reported without the associated

standard errors, and cases where the authors reported cluster-adjusted impact estimates, but it was necessary to calculate a standardized effect size based on other information. For example, this includes cases where the authors reported that cluster-adjusted regression results were significant below a given value (e.g., p<0.05) but neither standard errors nor p-values nor other test statistics (e.g., t-statistics) were reported. Of the 258 effect sizes included in this study, we applied a correction to adjust the standard error for clustering in 114 cases. The majority of effect sizes that did not require the standard error correction for clustering were based on results of multilevel models and regression models with clustered standard errors (e.g., using t-statistics, standard errors and regression coefficients, and p-values). Other effect sizes were reported at the cluster level (e.g., differences in mean classroom performance); no clustering adjustment was necessary in these cases.

10 These include 20 cases where the authors referred to “no significant effect” on one or more outcomes but did not report an effect size, and 9 cases where authors reported some outcome information but did not provide enough information to calculate an effect size (e.g., cases where the outcome information included only raw means).

11 Specifically, we assumed that the standardized effect size (Hedges’s g) took on a range of values (g = 0.00, g = -0.10, and g = -0.20), and used the study-level mean standard error based on non-missing effect sizes.

12 Some studies contained multiple treatment arms where the characteristic was present in at least one arm, but not present in at least one other arm. In Table 1, we consider whether the feature was present in any treatment arm of the study. In all subsequent analyses, we use the study-level mean.


References

Aaronson, D., Barrow, L., & Sander, W. (2007). Teachers and student achievement in the

Chicago public high schools. Journal of Labor Economics, 25(1), 95-135.

Agodini, R., Harris, B., Remillard, J., & Thomas, M. (2013). After two years, three elementary

math curricula outperform a fourth. Washington, DC: National Center for Education

Evaluation and Regional Assistance.

Alozie, N. M., Moje, E. B., & Krajcik, J. S. (2010). An analysis of the supports and constraints

for scientific discussion in high school project‐based science. Science Education, 94(3),

395-427.

Arnold, D. H., Fisher, P. H., Doctoroff, G. L., & Dobbs, J. (2002). Accelerating math

development in Head Start classrooms. Journal of Educational Psychology, 94(4), 762-

770.

Ball, D. L., & Cohen, D. K. (1996). Reform by the book: What is – or might be – the role of

curriculum materials in teacher learning and instructional reform? Educational

Researcher, 25(9), 6-14.

Banilower, E., Smith, P. S., Weiss, I. R., & Pasley, J. D. (2006). The status of K-12 science

teaching in the United States: Results from a national observation survey.

In D.Sunal & E.Wright (Eds.), The impact of the state and national standards on K-12

science teaching (pp. 83–122). Greenwich, CT: Information Age Publishing.

Becker, K., & Park, K. (2011). Effects of integrative approaches among science, technology,

engineering, and mathematics (STEM) subjects on students' learning: A preliminary

meta-analysis. Journal of STEM Education: Innovations and Research, 12(5/6), 23-37.

Blank, R. K., & De Las Alas, N. (2010, March). Effects of teacher professional development on


gains in student achievement: How meta-analysis provides scientific evidence

useful to education leaders. Paper presented at the Society for

Research on Educational Effectiveness Spring Conference, Washington, DC.

Borman, K. M., Cotner, B. A., Lee, R. S., Boydston, T. L., & Lanehart, R. (2009, March).

Improving Elementary Science Instruction and Student Achievement: The Impact of a

Professional Development Program. Paper presented at the Society for Research on

Educational Effectiveness Spring Conference, Washington, DC.

Boyd, D. J., Grossman, P. L., Lankford, H., Loeb, S., & Wyckoff, J. (2009). Teacher preparation

and student achievement. Educational Evaluation and Policy Analysis, 31(4), 416-440.

Cavanagh, S. (2015, August 17). Spending on instructional materials, construction climbing in

K-12. EdWeek. Retrieved from https://marketbrief.edweek.org/marketplace-k-

12/instructional_materials_construction_spending_climbing_in_k-12/

Cervetti, G. N., Barber, J., Dorph, R., Pearson, P. D., & Goldschmidt, P. G. (2012). The impact

of an integrated approach to science and literacy in elementary school classrooms.

Journal of Research in Science Teaching, 49(5), 631-658.

Cheung, A., Slavin, R. E., Kim, E., & Lake, C. (2017). Effective secondary science programs: A

best‐evidence synthesis. Journal of Research in Science Teaching, 54(1), 58-81.

Clark, D. B., Tanner-Smith, E. E., & Killingsworth, S. S. (2016). Digital games, design, and

learning: A systematic review and meta-analysis. Review of Educational Research, 86(1),

79-122.

Clements, D. H., & Sarama, J. (2007). Effects of a preschool mathematics curriculum:

Summative research on the Building Blocks project. Journal for Research in

Mathematics Education, 38(2), 136-163.


Clewell, B. C., Campbell, P. B., & Perlman, L. (2004). Review of evaluation studies mathematics

and science curricula and professional development models. Washington, DC: The

Urban Institute.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:

Lawrence Erlbaum Associates.

Confrey, J. (2006). Comparing and contrasting the National Research Council report on

evaluating curricular effectiveness with the What Works Clearinghouse approach.

Educational Evaluation and Policy Analysis, 28(3), 195-213.

Confrey, J., & Stohl, V. (Eds.). (2004). On evaluating curricular effectiveness: Judging the

quality of K-12 mathematics evaluations. Washington, DC: National Academies Press.

Cooper, H. (2010). Research synthesis and meta-analysis: A step-by-step approach. Los

Angeles, CA: Sage.

Corcoran, T. B. (1995). Helping teachers teach well: Transforming professional development.

Philadelphia, PA: CPRE Policy Briefs. Retrieved from

https://files.eric.ed.gov/fulltext/ED388619.pdf

Darling-Hammond, L., & McLaughlin, M. W. (1995). Policies that support professional

development in an era of reform. Phi Delta Kappan, 76(8), 597-604.

Davis, E. A., & Krajcik, J. S. (2005). Designing educative curriculum materials to promote

teacher learning. Educational Researcher, 34(3), 3-14.

Dede, C., Ketelhut, D., Whitehouse, P., Breit, L., & McCloskey, E. M. (2009). A research

agenda for online teacher professional development. Journal of Teacher Education,

60(1), 8-19.


Desimone, L. M. (2009). Improving impact studies of teachers' professional development:

Toward better conceptualizations and measures. Educational Researcher, 38(3), 181-199.

doi: 10.3102/0013189X08331140

Desimone, L. M. (2011). A primer on effective professional development. Phi Delta

Kappan, 92(6), 68-71.

Devlin-Scherer, W., Spinelli, A. M., Giammatteo, D., Johnson, C., Mayo-Molina, S., McGinley,

P., ... & Zisk, L. (1998, February). Action research in professional development schools:

Effects on student learning. Paper presented at the Annual Meeting of the American

Association of Colleges for Teacher Education, New Orleans, LA.

Dietrichson, J., Bøg, M., Filges, T., & Klint Jørgensen, A. M. (2017). Academic interventions for

elementary and middle school students with low socioeconomic status: A systematic

review and meta-analysis. Review of Educational Research, 87(2), 243-282.

Duval, S., & Tweedie, R. (2000). Trim and fill: a simple funnel‐plot–based method of testing and

adjusting for publication bias in meta‐analysis. Biometrics, 56(2), 455-463.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by

a simple, graphical test. British Medical Journal, 315(7109), 629-634.

Fisher, Z., & Tipton, E. (2015). robumeta: An R-package for robust variance estimation in meta-

analysis. Retrieved from https://arxiv.org/pdf/1503.02220.pdf

Frechtling, J., Sharp, L., Carey, N., & Vaden-Kiernan, N. (1995). Professional development

programs: A perspective on the last four decades. Washington, DC: National Science

Foundation, Division of Research, Evaluation and Communication.

Furtak, E. M., Seidel, T., Iverson, H., & Briggs, D. C. (2012). Experimental and quasi-


experimental studies of inquiry-based science teaching: A meta-analysis. Review of

educational research, 82(3), 300-329.

Gardella, J. H., Fisher, B. W., & Teurbe-Tolon, A. R. (2017). A systematic review and meta-

analysis of cyber-victimization and educational outcomes for adolescents. Review of

Educational Research, 87(2), 283-308.

Garet, M. S., Wayne, A. J., Stancavage, F., Taylor, J., Eaton, M., Walters, K., ... & Sepanik, S.

(2011). Middle school mathematics professional development impact study: Findings

after the second year of implementation. Washington, DC: National Center for Education

Evaluation and Regional Assistance.

Gersten, R., Taylor, M. J., Keys, T. D., Rolfhus, E., & Newman-Gonchar, R. (2014). Summary of

research on the effectiveness of math professional development approaches. (REL 2014–

010). Washington, DC: U.S. Department of Education, Institute of Education Sciences,

National Center for Education Evaluation and Regional Assistance, Regional Educational

Laboratory Southeast. Retrieved from http://ies.ed.gov/ncee/edlabs

Goldenberg, C., & Gallimore, R. (1991). Changing teaching takes more than a one-shot

workshop. Educational Leadership, 49(3), 69-72.

Hand, B., Norton-Meier, L.A., Gunel, M., & Akkus, R. (2016). Aligning teaching to learning: A

3-year study examining the embedding of language and argumentation into elementary

science classrooms. International Journal of Science and Mathematics Education, 14(5),

847-863.

Hedges, L. V., & Olkin, I. (1980). Vote-counting methods in research synthesis. Psychological

Bulletin, 88, 359–369.


Hiebert, J., Stigler, J. W., Jacobs, J. K., Givvin, K. B., Garnier, H., Smith, M., ... & Gallimore, R.

(2005). Mathematics teaching in the United States today (and tomorrow): Results from

the TIMSS 1999 video study. Educational Evaluation and Policy Analysis, 27(2), 111-

132.

Higgins, J. P. T & Deeks, J. J. (2008). Selecting studies and collecting data. In J.P.T. Higgins &

S. Green (Eds.), Cochrane handbook for systematic reviews of interventions (151-186).

Chichester, U.K.: John Wiley & Sons.

Higgins, J. P. T., Deeks, J. J., & Altman, D.G. (2008). Special topics in statistics. In J.P.T.

Higgins & S. Green (Eds.), Cochrane handbook for systematic reviews of interventions

(481-529). Chichester, U.K.: John Wiley & Sons.

Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2008). Empirical benchmarks for

interpreting effect sizes in research. Child Development Perspectives, 2(3), 172-177.

Kane, T., Kerr, K., & Pianta, R. (2014). Designing teacher evaluation systems: New guidance

from the measures of effective teaching project. John Wiley & Sons.

Kennedy, M. M. (1999). Form and substance in in-service teacher education. (Research

Monograph No. 13). Arlington, VA: National Science Foundation.

Kennedy, M. M. (2016). How does professional development improve teaching? Review of

Educational Research, 86(4), 945-980.

Kim, J. S., & Quinn, D. M. (2012). A meta-analysis of K-8 summer reading interventions: The

role of socioeconomic status in explaining variation in treatment effects. Paper presented

at the annual meeting of the Society for Research on Educational Effectiveness.

Lipsey, M. W., Puzio, K., Yun, C., Hebert, M. A., Steinka-Fry, K., Cole, M. W., ... & Busick, M.

D. (2012). Translating the Statistical Representation of the Effects of Education


Interventions into More Readily Interpretable Forms. National Center for Special

Education Research.

Littell, J. H., Corcoran, J., & Pillai, V. (2008). Systematic reviews and meta-analysis. New York,

NY: Oxford University Press.

Llosa, L., Lee, O., Jiang, F., Haas, A., O’Connor, C., Van Booven, C. D., & Kieffer, M. J.

(2016). Impact of a large-scale science intervention focused on English language

learners. American Educational Research Journal, 53(2), 395-424.

Malouf, D. B., & Taymans, J. M. (2016). Anatomy of an Evidence Base. Educational

Researcher, 45(8), 454-459.

Marx, R. W., Blumenfeld, P. C., Krajcik, J. S., Fishman, B., Soloway, E., Geier, R., & Tal, R. T.

(2004). Inquiry‐based science in the middle grades: Assessment of learning in urban

systemic reform. Journal of research in Science Teaching, 41(10), 1063-1080.

Miller, B., Lord, B., & Dorney, J. A. (1994). Staff development for teachers: A study of

configurations and costs in four districts. Boston, MA: Education Development Center.

Miles, K. H., Odden, A., Fermanich, M., & Archibald, S. (2004). Inside the black box of school

district spending on professional development: Lessons from five urban districts. Journal

of Education Finance, 30(1), 1-26.

National Governors Association Center for Best Practices, Council of Chief State School

Officers. (2010). Common Core State Standards. Washington, DC: Author.

National Research Council. (2004). On evaluating curricular effectiveness: Judging the quality

of K-12 mathematics evaluations. Washington, DC: The National Academies Press.


National Research Council. (2011). Successful K-12 STEM education: Identifying effective

approaches in science, technology, engineering, and mathematics. National Academies

Press.

National Research Council, Board on Science Education. (2012). A framework for K-12

science education: Practices, crosscutting concepts, and core ideas. Washington, DC:

The National Academies Press. Retrieved from

http://www.nap.edu/catalog.php?record_id=13165

National Research Council, National Committee for Science Education Standards and

Assessment. (1996). National science education standards. Washington, DC: The

National Academies Press. Retrieved from http://www.nap.edu/catalog/4962.html

Pane, J. F., Griffin, B. A., McCaffrey, D. F., & Karam, R. (2014). Effectiveness of Cognitive

Tutor Algebra I at scale. Educational Evaluation and Policy Analysis, 36(2), 127-144.

Penuel, W. R., Sun, M., Frank, K. A., & Gallagher, H. A. (2012). Using social network analysis

to study how collegial interactions can augment teacher learning from external

professional development. American Journal of Education, 119(1), 103-136.

Presser, A. L., Vahey, P., & Dominguez, X. (2015, March). Improving mathematics learning by

integrating curricular activities with innovative and developmentally appropriate digital

apps: Findings from the Next Generation Preschool Math Evaluation. Paper presented at

the Society for Research on Educational Effectiveness Spring Conference, Washington,

DC.

Polanin, J. R., Tanner-Smith, E. E., & Hennessy, E. A. (2016). Estimating the difference between

published and unpublished effect sizes: A meta-review. Review of Educational

Research, 86(1), 207-236.


Raudenbush, S. W. (2008). Advancing educational policy by advancing research on instruction.

American Educational Research Journal, 45(1), 206-230.

Raudenbush, S. W., & Bryk, A. S. (1985). Empirical Bayes meta-analysis. Journal of

Educational Statistics, 10(2), 75-98.

Resendez, M., and Azin, M., (2006). 2005 Prentice Hall Science Explorer randomized control

trial. Jackson, WY: PRES Associates, Inc.

Roschelle, J., Shechtman, N., Tatar, D., Hegedus, S., Hopkins, B., Empson, S., ... & Gallagher,

L. P. (2010). Integration of technology, curriculum, and professional development for

advancing middle school mathematics: Three large-scale studies. American Educational

Research Journal, 47(4), 833-878.

Santagata, R., Kersting, N., Givvin, K. B., & Stigler, J. W. (2010). Problem implementation as a

lever for change: An experimental study of the effects of a professional development

program on students’ mathematics learning. Journal of Research on Educational

Effectiveness, 4(1), 1-24.

Saxe, G. B., & Gearhart, M. (2001). Enhancing students' understanding of mathematics: A study

of three contrasting approaches to professional support. Journal of Mathematics Teacher

Education, 4(1), 55-79.

Scher, L., & O'Reilly, F. (2009). Professional development for K–12 math and science teachers:

What do we really know? Journal of Research on Educational Effectiveness, 2(3), 209-

249.

Schneider, M. C., & Meyer, J. P. (2012). Investigating the efficacy of a professional

development program in formative classroom assessment in middle-school English

language arts and mathematics. Journal of Multidisciplinary Evaluation, 8(17), 1-24.


Schochet, P. Z. (2011). Do typical RCTs of education interventions have sufficient statistical

power for linking impacts on teacher practice and student achievement outcomes?

Journal of Educational and Behavioral Statistics, 36(4), 441-471.

Schwartz‐Bloom, R. D., & Halpin, M. J. (2003). Integrating pharmacology topics in high school

biology and chemistry classes improves performance. Journal of Research in Science

Teaching, 40(9), 922-938.

Shadish, W. R., & Haddock, C. K. (2009). Combining estimates of effect size. In H. Cooper, L.

V. Hedges, & J. C. Valentine, (Eds.), The handbook of research synthesis and meta-

analysis (Second edition). New York: Russell Sage.

Shavelson, R. J., & Towne, L. (Eds.). (2001). Scientific research in education. Washington, DC:

The National Academies Press.

Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational

researcher, 15(2), 4-14.

Slavin, R. E. (2008). Perspectives on evidence-based research in education—What works? Issues

in synthesizing educational program evaluations. Educational Researcher, 37(1), 5-14.

Slavin, R., & Lake, C. (2008). Effective programs in elementary mathematics: A best-evidence

synthesis. Review of Educational Research, 78(3), 427-515.

Slavin, R. E., Lake, C., & Groff, C. (2009). Effective programs in middle and high school

mathematics: A best-evidence synthesis. Review of Educational Research, 79(2), 839-

911.

Slavin, R. E., Lake, C., Hanley, P., & Thurston, A. (2014). Experimental evaluations of

elementary science programs: A best-evidence synthesis. Journal of Research in Science

Teaching, 51(7), 870-901.


Smith, M. S., & O’Day, J. (1990). Systemic school reform. Politics of Education Association

Yearbook (pp. 233-267). London: Taylor & Francis Ltd.

Sparks, D. (2002). Designing powerful professional development for teachers and principals.

Oxford, OH: National Staff Development Council.

Stein, M. K., Remillard, J., & Smith, M. S. (2007). How curriculum influences student learning.

In F.K. Lester, Jr. (Ed.), Second handbook of research on mathematics teaching and

learning (319-369). New York, NY: Macmillan.

Tanner-Smith, E. E., & Tipton, E. (2014). Robust variance estimation with dependent effect

sizes: Practical considerations and a software tutorial in Stata and SPSS. Research

Synthesis Methods, 5(1), 13–30.

Tanner-Smith, E. E., Tipton, E., & Polanin, J. R. (2016). Handling complex meta-analytic data

structures using robust variance estimates: A tutorial in R. Journal of Developmental and

Life-Course Criminology, 2(1), 85-112.

Taylor, J. A., Getty, S. R., Kowalski, S. M., Wilson, C. D., Carlson, J., & Van Scotter, P. (2015).

An efficacy trial of research-based curriculum materials with curriculum-based

professional development. American Educational Research Journal, 52(5), 984-1017.

Taylor, J. A., Kowalski, S. M., Polanin, J. R., Askinas, K., Stuhlsatz, M. A., Wilson, C. D., ... &

Wilson, S. J. (2018). Investigating Science Education Effect Sizes: Implications for

Power Analyses and Programmatic Decisions. AERA Open, 4(3), 2332858418791991.

Timperley, H., Wilson, A., Barrar, H., & Fung, I. (2008). Teacher professional learning and

development.


Tipton, E., & Pustejovsky, J. E. (2015). Small-sample adjustments for tests of moderators and

model fit using robust variance estimation in meta-regression. Journal of Educational

and Behavioral Statistics, 40(6), 604-634.

What Works Clearinghouse. (2010). Procedures and standards handbook (Version 2.1).

Washington, DC: US Department of Education.

Wilson S.J., Tanner-Smith E.E., Lipsey, M.W., Steinka-Fry, K., Morrison, J. (2011) Dropout

prevention and intervention programs: Effects on school completion and dropout among

school aged children and youth. Campbell Systematic Reviews 2011:8 DOI:

10.4073/csr.2011.8

Wilson, S. M. (2013). Professional development for science teachers. Science, 340(6130),

310-313. doi: 10.1126/science.1230725

Yoon, K. S., Duncan, T., Lee, S. W.-Y., Scarloss, B., & Shapley, K. (2007). Reviewing the

evidence on how teacher professional development affects student achievement (Issues &

Answers Report, REL 2007–No. 033). Washington, DC: U.S. Department of Education,

Institute of Education Sciences, National Center for Education Evaluation and Regional

Assistance, Regional Educational Laboratory Southwest. Retrieved from

http://ies.ed.gov/ncee/edlabs

Zaslow, M., Tout, K., Halle, T., Whittaker, J. V., & Lavelle, B. (2010). Toward the Identification

of Features of Effective Professional Development for Early Childhood Educators.

Literature Review. Office of Planning, Evaluation and Policy Development, US

Department of Education.


Tables and Figures

Table 1

Categories and descriptions of codes

Code | Code description | Code present (a)

Intervention Type
Professional development only | Study included only professional development. | 22%
Professional development and curriculum materials | Study included both professional development and new curriculum materials. | 75%
Curriculum materials only | Study included only new curriculum materials. | 9%

Research Design and Sample Characteristics
RCT | Study is a randomized controlled trial. | 91%
Subject matter – Math | Subject matter focus of study is math or math/science. | 64%
Preschool | Study sample included preschool students. | 19%

Effect Size Type
State standardized test | Outcome is a state standardized test. | 17%
Other standardized test | Outcome is from another standardized test. | 30%
Adjusted for covariates | Effect size is adjusted for covariates (e.g., pretest score). | 76%

PD Format (b)
Same-school collaboration | Teachers participated in professional development with other teachers from their own school. | 74%
Implementation meetings | Teachers met formally or informally with other activity participants to discuss classroom implementation (e.g., troubleshooting meeting). | 35%
Online professional development | Part or all of the professional development was conducted online. | 18%
Summer workshop | The professional development included a summer workshop. | 54%
Expert coaching | The professional development involved coaching or mentoring from experts who observed instruction and provided feedback (e.g., a debriefing meeting; via video or live). | 20%
PD led by researchers/intervention developers | The PD was led by the intervention developers and/or the study authors. | 64%


Code | Code description | Code present (a)

PD Timing and Duration (c)
Contact hours | Total number of PD contact hours. | 45 hours
Timespan over which professional development was conducted:
Less than one week | The PD was conducted over less than one week. | 15%
One week | The PD was conducted over one week. | 3%
One month (8 days to 30 days) | The PD was conducted over 8–30 days. | 3%
One semester (31 days to 4 months) | The PD was conducted over 31 days to 4 months. | 13%
One year | The PD was conducted over 4 months to 1 year. | 49%
More than one year | The PD was conducted over more than 1 year. | 16%

PD Focus (b)
Generic instructional strategies | The professional development was focused on content-generic instructional strategies (e.g., improving classroom climate and student motivation). | 11%
How to use curriculum materials | The PD focused on how to use curriculum materials. | 75%
Integrate technology | The PD focused on how to integrate technology into the classroom. | 11%
Content-specific formative assessment | The PD focused on formative assessment strategies specific to mathematics and science teaching (e.g., strategies to elicit student understanding of fractions or the scientific method). | 17%
Improve content knowledge/Pedagogical content knowledge/How students learn | The PD focused on improving teachers' pedagogical content knowledge (e.g., how students learn mathematics or science). | 55%

PD Activities (b)
Review sample student work | Teachers studied examples of students’ work (including watching videos of students). | 16%
Observed demonstration | Teachers observed a video or live demonstration/modeling of instruction. | 35%
Solved problems/Worked through student materials | Teachers solved problems or exercises during the PD or worked through student materials during the PD. | 42%
Developed curriculum/lesson plans | Teachers developed curricula or lesson plans during the PD. | 19%
Reviewed own student work | Teachers studied examples of their own students' work during the PD. | 11%


Code | Code description | Code present (a)

Curriculum Materials (d)
Implementation guidance | The curriculum materials provided teachers with implementation guidance (e.g., support for student-teacher dialogues around the content). | 43%
Laboratory/Hands-on experience or curriculum kits | The curriculum materials included materials/guidance that supported inquiry-oriented explorations (e.g., science laboratory or hands-on mathematics kits). | 28%
Curriculum dosage (hours) | Total number of hours that the curriculum was intended to be used. | 66.6 hours
Curriculum proportion replaced (percent) | The proportion of each lesson that the new curriculum was intended to replace. | 91%

Note: N = 95 studies.
a Figures in the third column report the percent of studies featuring the row code for binary variables, or the sample average calculated at the study level for continuous variables (e.g., contact hours and curriculum dosage). For studies that had the feature present in one treatment arm but not another, the code is counted as present if it is present in any treatment arm (e.g., a study with one treatment arm including only curriculum materials and a second treatment arm including both professional development and curriculum materials would be included in both rows).
b Codes for PD focus, activities, and format were counted as “Not Present” for studies that did not involve a PD component.
c PD timing/duration excludes studies without a PD component.
d Codes for features of interventions involving new curriculum materials were counted as “Not Present” for studies that did not involve new curriculum materials.


Table 2

Results of estimating an unconditional meta-regression model with robust variance estimation (RVE)

Dependent variable: Effect Size (Hedges's g)
Constant: 0.209*** (0.025)
N effect sizes: 258
N studies: 95
τ² (a): 0.037
95% prediction interval (b): (-0.165, 0.583)
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. (a) τ² is the method of moments estimate of the between-study variance in the underlying effects provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). (b) The 95% prediction interval is calculated as the estimated average effect size +/- 1.96*τ. *p < .10, **p < .05, ***p < .01.
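For readers wishing to reproduce this style of model, the sketch below shows how an intercept-only RVE meta-regression with correlated-effects weights, and the corresponding 95% prediction interval, could be fit in R with the robumeta package under the working assumption of ρ = 0.80. The data frame dat and the variables es, var_es, and study_id are illustrative placeholders rather than the analysis files used here.

    # Minimal sketch (not the authors' code): intercept-only (unconditional)
    # RVE meta-regression with correlated-effects weights, rho = 0.80.
    library(robumeta)

    uncond <- robu(es ~ 1,
                   data = dat,              # one row per effect size (hypothetical)
                   studynum = study_id,     # study identifier
                   var.eff.size = var_es,   # sampling variance of each effect size
                   rho = 0.80,
                   modelweights = "CORR",
                   small = TRUE)
    print(uncond)                           # reports the average effect and tau-squared

    # Approximate 95% prediction interval: average effect +/- 1.96 * tau
    g_bar <- uncond$reg_table$b.r[1]
    tau   <- sqrt(uncond$mod_info$tau.sq)
    c(g_bar - 1.96 * tau, g_bar + 1.96 * tau)

With the estimates in Table 2 (0.209 and τ² = 0.037), this calculation yields an interval close to the one reported above, up to rounding of τ².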


Table 3.

RVE results including controls for study design, study sample, subject area and outcome measure type

Dependent variable: Effect Size (Hedges's g)
Between-study effects
  RCT: -0.022 (0.104)
  State standardized test: -0.264*** (0.055)
  Other standardized test: -0.277*** (0.053)
  Grade - preschool: 0.133 (0.084)
  Effect size adjusted for covariates: -0.040 (0.055)
  Subject matter - math: -0.009 (0.044)
Within-study effects
  State standardized test: -0.208*** (0.052)
  Other standardized test: -0.203*** (0.059)
Constant: 0.395*** (0.111)
N effect sizes: 258
N studies: 95
τ² (a): 0.025
Results of joint F test (b): F = 5.75, df = 26.9, p < 0.001
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. (a) τ² is the method of moments estimate of the between-study variance in the underlying effects provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). (b) Results of the joint F test are from a test of the joint significance of all study characteristics included in the model. The F test was estimated using the robumeta and clubSandwich packages in R (Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015). *p < .10, **p < .05, ***p < .01.
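The conditional models summarized in Table 3 and the joint F tests reported throughout can be sketched in the same framework. In the illustration below the moderator names (rct, state_test, and so on) are stand-ins for the coded study characteristics, and the sketch omits the within-study/between-study decomposition of the outcome-type indicators shown above.

    # Sketch (illustrative variable names): conditional RVE meta-regression with
    # study-design, sample, subject, and outcome-type controls, plus an
    # approximate small-sample (HTZ) F test of their joint significance.
    library(robumeta)
    library(clubSandwich)

    cond <- robu(es ~ rct + state_test + other_test + grade_pre +
                   adj_covs + subj_math,
                 data = dat, studynum = study_id, var.eff.size = var_es,
                 rho = 0.80, modelweights = "CORR", small = TRUE)

    # Joint test that all six study-characteristic coefficients are zero,
    # using a cluster-robust (CR2) covariance matrix.
    Wald_test(cond, constraints = constrain_zero(2:7), vcov = "CR2")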


Table 4

RVE results including intervention characteristics (professional development and/or curriculum materials) as moderators

Dependent variable: Effect Size (Hedges's g)
Between-study effects
  Professional development only: -0.084 (0.051); -0.090* (0.052)
  New curriculum materials only: -0.129 (0.118)
  Professional development only/New curriculum materials only: -0.099** (0.046)
N effect sizes: 258; 258; 258
N studies: 95; 95; 95
τ² (a): 0.025; 0.025; 0.025
Raw mean effect sizes
  Professional development only: 0.254
  New curriculum materials only: 0.091
  Professional development only/New curriculum materials only: 0.214
  Both professional development and new curriculum materials: 0.252
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. Models include controls for the following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. (a) τ² is the method of moments estimate of the between-study variance in the underlying effects provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). *p < .10, **p < .05, ***p < .01.


Table 5

RVE results including professional development contact hours as moderators

Dependent variable: Effect Size (Hedges's g)
Between-study effects
  PD contact hours (a): 0.005 (0.007)
  PD contact hours between 25th and 50th percentile (16 – 34.5 hours): 0.012 (0.050)
  PD contact hours between 50th and 75th percentile (35 – 68 hours): -0.033 (0.065)
  PD contact hours above 75th percentile (> 68 hours): 0.069 (0.067)
  PD contact hours at or above 25th percentile (> 16 hours): 0.017 (0.043)
N effect sizes: 231; 231; 231
N studies: 85; 85; 85
τ² (b): 0.022; 0.024; 0.023
Results of joint F test (c): F = 0.647, df = 35.2, p = 0.590
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. All models include only studies and/or treatment arms with a professional development component. All models include controls for the following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. (a) PD contact hours measured as raw PD contact hours / 10. (b) τ² is the method of moments estimate of the between-study variance in the underlying effects provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). (c) Results of the joint F test are from a test of the joint significance of the PD contact hours predictors from the second model. The F test was estimated using the robumeta and clubSandwich packages in R (Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015). *p < .10, **p < .05, ***p < .01.


Table 6

RVE results including professional development foci as moderators

Dependent variable: Effect Size (Hedges's g)
Between-study effects
  Generic instructional strategies: -0.014 (0.076); -0.074 (0.072)
  How to use curriculum materials: 0.118* (0.060); 0.119* (0.063)
  Integrate technology: 0.185 (0.117); 0.139 (0.102)
  Content-specific formative assessment: 0.132* (0.067); 0.105 (0.062)
  Improve pedagogical content knowledge/how students learn: 0.094** (0.043); 0.096** (0.043)
  Number of PD features: 0.077*** (0.024)
N effect sizes: 237; 237; 237; 237; 237; 237; 237
N studies: 89; 89; 89; 89; 89; 89; 89
τ² (a): 0.025; 0.025; 0.024; 0.024; 0.026; 0.025; 0.023
Results of joint F test (b): F = 3.89, df = 20.4, p = 0.012
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. All models include only studies and/or treatment arms with a professional development component. All models include controls for the following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. (a) τ² is the method of moments estimate of the between-study variance in the underlying effects provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). (b) Results of the joint F test are from a test of the joint significance of the predictors for all professional development foci included in the model. The F test was estimated using the robumeta and clubSandwich packages in R (Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015). *p < .10, **p < .05, ***p < .01.


Table 7

RVE results including professional development activities as moderators

Dependent variable: Effect Size (Hedges's g)
Between-study effects
  Observed demonstration: 0.033 (0.049); 0.025 (0.046)
  Review generic student work: 0.047 (0.087); -0.014 (0.085)
  Solved problems/Worked through student materials: 0.077 (0.047); 0.070 (0.047)
  Developed curriculum materials/lesson plans: 0.064 (0.061); 0.036 (0.063)
  Reviewed own student work: 0.033 (0.064); 0.017 (0.071)
  Number of PD features: 0.046* (0.024)
N effect sizes: 237; 237; 237; 237; 236; 236; 237
N studies: 89; 89; 89; 89; 88; 88; 89
τ² (a): 0.025; 0.021; 0.024; 0.025; 0.018; 0.020; 0.021
Results of joint F test (b): F = 0.883, df = 21.2, p = 0.509
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. All models include only studies and/or treatment arms with a professional development component. All models include controls for the following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. (a) τ² is the method of moments estimate of the between-study variance in the underlying effects provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). (b) Results of the joint F test are from a test of the joint significance of the predictors for all professional development activities included in the model. The F test was estimated using the robumeta and clubSandwich packages in R (Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015). *p < .10, **p < .05, ***p < .01.


Table 8

RVE results including professional development formats as moderators

Dependent variable: Effect Size (Hedges's g)
Between-study effects
  Same-school collaboration: 0.109 (0.076); 0.123* (0.067)
  Implementation meetings: 0.085* (0.049); 0.117** (0.045)
  Any online PD: -0.161*** (0.053); -0.153** (0.057)
  Summer workshop: 0.093** (0.043); 0.074* (0.038)
  Expert coaching: 0.035 (0.049); 0.053 (0.051)
  PD leaders – researchers: -0.019 (0.043); -0.037 (0.038)
  Number of PD features: 0.023 (0.024)
N effect sizes: 237; 237; 237; 236; 237; 231; 231; 237
N studies: 89; 89; 89; 88; 89; 86; 86; 89
τ² (a): 0.020; 0.022; 0.024; 0.023; 0.026; 0.025; 0.015; 0.023
Results of joint F test (b): F = 2.72, df = 26.2, p = 0.035
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. All models include only studies and/or treatment arms with a professional development component. All models include controls for the following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. (a) τ² is the method of moments estimate of the between-study variance in the underlying effects provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). (b) Results of the joint F test are from a test of the joint significance of the predictors for all professional development formats included in the model. The F test was estimated using the robumeta and clubSandwich packages in R (Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015). *p < .10, **p < .05, ***p < .01.


Table 9

RVE results including characteristics of interventions involving new curriculum materials as moderators

Dependent variable: Effect Size (Hedges's g)
Between-study effects
  Implementation guidance: 0.060 (0.053); 0.062 (0.056)
  Laboratory/hands-on experience, curriculum kits: -0.029 (0.055); -0.026 (0.054)
  Curriculum dosage (number of hours): 0.000 (0.001); 0.000 (0.001)
  Curriculum proportion replaced (0.00–1.00): -0.060 (0.151)
N effect sizes: 193; 193; 193; 192; 193
N studies: 77; 77; 77; 76; 77
τ² (a): 0.027; 0.031; 0.030; 0.025; 0.031
Results of joint F test (b): F = 0.415, df = 24.4, p = 0.744
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. All models include only studies and/or treatment arms that included new curriculum materials. All models include controls for the following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. (a) τ² is the method of moments estimate of the between-study variance in the underlying effects provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). (b) Results of the joint F test are from a test of the joint significance of the predictors for all characteristics of interventions involving new curriculum materials included in the model. The F test was estimated using the robumeta and clubSandwich packages in R (Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015). *p < .10, **p < .05, ***p < .01.


Table 10

Regression-adjusted mean effect sizes based on unconditional and conditional RVE meta-regression models

Columns report, for each row subgroup: ḡ_uc (a), with significance from p_uc (b), from the unconditional (subgroup) RVE model; ḡ_c (c), with significance from p_c (d), from the conditional RVE model; and ḡ_c+ (e), with significance from p_c+ (f), from the conditional RVE model with controls for other program features.

Overall effect size: ḡ_uc = 0.209***; ḡ_c = --; ḡ_c+ = --

Program type
  Professional development and curriculum materials: 0.235***; 0.254**; --
  Professional development only/Curriculum materials only+: 0.136***; 0.156; --

Outcome type
  State standardized test: 0.060***; 0.101***; --
  Other standardized test: 0.084***; 0.087***; --
  Intervenor-developed test+: 0.371***; 0.365; --

Program feature category: Professional development focus
  Focus on how to use curriculum materials
    Yes: 0.228***; 0.258*; 0.260*
    No: 0.134***; 0.140; 0.141
  Focus on content-specific formative assessment
    Yes: 0.387***; 0.340*; 0.321
    No: 0.178***; 0.208; 0.216
  Focus on improving content knowledge/pedagogical content knowledge/how students learn
    Yes: 0.303***; 0.269**; 0.275**
    No: 0.095***; 0.175; 0.179

Program feature category: Professional development formats
  PD includes same-school collaboration
    Yes: 0.198***; 0.248; 0.247*
    No: 0.263***; 0.139; 0.125
  PD includes implementation meetings
    Yes: 0.283***; 0.284*; 0.297**
    No: 0.169***; 0.199; 0.180
  Any online professional development
    Yes: 0.116**; 0.103***; 0.096**
    No: 0.235***; 0.264; 0.249
  PD includes summer workshop
    Yes: 0.218***; 0.266**; 0.253*
    No: 0.190***; 0.174; 0.179

Note: First column: regression-adjusted mean effect sizes are based on subgroup analyses; unconditional RVE meta-regression models were estimated including only effect sizes with the row feature. (a) ḡ_uc = estimated mean effect size from the unconditional RVE meta-regression model. (b) p_uc = p-value on the constant from the unconditional meta-regression model. Second column: regression-adjusted mean effect sizes are based on conditional RVE meta-regression models that included each row feature separately as a moderator; all models included controls for the variables listed in Table 3, and adjusted means were calculated using the overall average of the study-level values of each included covariate. (c) ḡ_c = estimated regression-adjusted mean effect size from the conditional RVE meta-regression model. (d) p_c = p-value on the coefficient for the program feature of interest from the meta-regression model. Third column: regression-adjusted mean effect sizes are based on conditional RVE meta-regression models that included all row features in a given category simultaneously as moderators, with the same controls and the same covariate averaging. (e) ḡ_c+ = estimated regression-adjusted mean effect size from that model. (f) p_c+ = p-value on the coefficient for the program feature of interest from that model. *p < .10, **p < .05, ***p < .01.
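The regression-adjusted means in the second and third columns can be recovered from any conditional model by predicting the effect size at feature = 1 and feature = 0 while holding the Table 3 covariates at their overall study-level averages. A sketch of that calculation, with illustrative object and variable names (the one-row-per-study data frame study_level is hypothetical), follows.

    # Sketch (illustrative names): regression-adjusted mean effect sizes for
    # studies with and without a given program feature, holding the Table 3
    # controls at their overall study-level means.
    library(robumeta)

    mod <- robu(es ~ feature + rct + state_test + other_test + grade_pre +
                  adj_covs + subj_math,
                data = dat, studynum = study_id, var.eff.size = var_es,
                rho = 0.80, modelweights = "CORR", small = TRUE)

    b    <- mod$reg_table$b.r            # coefficients: intercept, feature, controls
    # study-level covariate means, in the same order as the control terms above
    xbar <- colMeans(study_level[, c("rct", "state_test", "other_test",
                                     "grade_pre", "adj_covs", "subj_math")])

    adj_mean_with    <- b[1] + b[2] + sum(b[-(1:2)] * xbar)  # feature present
    adj_mean_without <- b[1]        + sum(b[-(1:2)] * xbar)  # feature absent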


Table 11

RVE results including other research design elements as moderators

Dependent variable: Effect Size (Hedges's g)
Between-study effects
  Unit of assignment is teacher: -0.054 (0.043); -0.060 (0.052)
  Business as usual control group: 0.008 (0.079); 0.076 (0.085)
  Teacher participation – Voluntary: -0.035 (0.045); -0.032 (0.054)
  Teacher participation – Missing: 0.111 (0.089); 0.114 (0.103)
  Teacher incentive – Credit: -0.167** (0.074); -0.177* (0.097)
  Teacher credit incentive – Missing: -0.050 (0.055); -0.137 (0.097)
  Teacher incentive – Monetary: -0.054 (0.042); 0.016 (0.057)
  Teacher monetary incentive – Missing: -0.019 (0.063); 0.073 (0.110)
  High cluster attrition: -0.056 (0.051); 0.022 (0.057)
  Cluster attrition – Missing: -0.070 (0.064); -0.069 (0.078)
  Higher student attrition: -0.068 (0.046); -0.054 (0.052)
  Student attrition – Missing: -0.031 (0.054); 0.031 (0.065)
N effect sizes: 258; 258; 258; 258; 258; 258; 258; 258
N studies: 95; 95; 95; 95; 95; 95; 95; 95
τ² (a): 0.024; 0.022; 0.026; 0.024; 0.026; 0.026; 0.027; 0.028
Results of joint F test (b): F = 1.30, df = 23.2, p = 0.284
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. All models include controls for the following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. (a) τ² is the method of moments estimate of the between-study variance in the underlying effects provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). (b) Results of the joint F test are from a test of the joint significance of the predictors for all additional research design elements in the model. The F test was estimated using the robumeta and clubSandwich packages in R (Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015). *p < .10, **p < .05, ***p < .01.


Table 12

RVE results including peer-reviewed publication type as a moderator

Dependent variable: Effect Size (Hedges's g)
Between-study effects
  Peer-reviewed source: 0.046 (0.050); 0.067 (0.041)
N effect sizes: 258; 258
N studies: 95; 95
τ² (a): 0.038; 0.026
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. The model in column 2 includes controls for the following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. (a) τ² is the method of moments estimate of the between-study variance in the underlying effects provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). *p < .10, **p < .05, ***p < .01.


Figure 1. PRISMA study screening flow chart (PRISMA, 2009).

Identification: Records identified through database searching (n = 8,099); additional records identified through other sources (n = 1,391); records after duplicates removed (n = 7,926).

Screening: Records screened (n = 7,926); records excluded (n = 7,270).

Eligibility: Full-text articles assessed for eligibility (n = 656); full-text articles excluded (n = 561). Reasons for exclusion: did not meet intervention characteristics (e.g., off-topic, not a classroom-level intervention), n = 248; methodology issues (e.g., no control group, post-hoc design), n = 310; sample issues (e.g., did not have at least two teachers and fifteen students in each condition), n = 45; report was subsumed under another study, n = 11; full text of study could not be located, n = 11; study characteristics issues (published before 1989 or not written in English), n = 4.

Included: Studies included in quantitative synthesis (meta-analysis) (n = 95).


Online Appendix A:

Tables and Figures


Table A1

Included studies and effect sizes

Authors Content Area Study Design Setting Grades Effect Effect Effect Served Size Size Size SE Variance (SE^2) Agodini, Harris, Remillard, & Thomas (2013) Mathematics Random Elementary 2 assignment -0.210 0.064 0.004 0.000 0.175 0.031 0.040 0.068 0.005 Argentin, Pennisi, Vidoni, Abbiati, & Caputo Mathematics Random Middle 6, 7, 8 (2014) assignment School -0.001 0.044 0.002 Arnold, Fisher, Doctoroff, & Dobbs (2002) Mathematics Random Preschool PK assignment 0.436 0.364 0.132 Authors (2016) Mathematics Random Elementary 4, 5 assignment -0.060 0.085 0.007 -0.030 0.068 0.005 -0.020 0.044 0.002 0.000 0.161 0.026 0.000 0.155 0.024 0.020 0.086 0.007 0.030 0.085 0.007 0.040 0.062 0.004 0.050 0.147 0.022 0.080 0.297 0.088 0.080 0.052 0.003 0.100 0.073 0.005 Batiza, Luo, Zhang, Gruhl, Nelson, Hoelzer, Science Random High School 9 … & Marcey (2016) assignment 0.549 0.096 0.009 0.699 0.259 0.067


Battistich, Alldredge, & Tsuchida (2003) Mathematics Random Elementary 2 assignment -0.404 0.601 0.361 -0.363 0.476 0.226 -0.157 0.498 0.248 0.133 0.405 0.164 0.206 0.616 0.379 0.292 0.588 0.346 0.755 0.570 0.324 1.007 0.529 0.280 Berlinski & Busso (2015) Mathematics Random Middle 7 assignment School -0.247 0.081 0.007 -0.171 0.080 0.006 Beuermann, Naslund-Hadley, Ruprah, & Science Random Elementary 3 Thompson (2013) assignment -0.030 0.060 0.004

0.030 0.070 0.005

0.180 0.080 0.006 Borman, Gamoran & Bowdon (2008) Science Random Elementary 4,5 assignment -0.270 0.092 0.008 -0.080 0.104 0.011 0.010 0.085 0.007 Bottge, Ma, Gassaway, Toland, Butler, & Mathematics Random Middle 6,7,8, 9, Cho (2014) assignment School, High 10, 11 School 0.040 0.188 0.035 0.390 0.190 0.036 0.440 0.191 0.037 1.000 0.200 0.040 Bottge, Toland, Gassaway, Butler, Cho, Mathematics Random Middle 6,7,8 Griffen, & Ma (2015) assignment School -0.220 0.180 0.032 0.100 0.112 0.013 0.150 0.112 0.013 0.260 0.180 0.032


0.290 0.178 0.032 0.380 0.110 0.012 0.470 0.175 0.031 0.610 0.114 0.013 Bradshaw (2011) Science Random Middle 6,7,8 assignment School 0.251 0.131 0.017 0.310 0.130 0.017

Brendefur, Strother, Thiede, Lane, & Surges- Mathematics Random Preschool PK Prokop (2013) assignment 0.872 0.288 0.083

Brown, Greenfield, Bell, Juárez, Myers, & Science Random Preschool PK Nayfield (2013) assignment 0.121 0.066 0.004

Carpenter, Fennema, Peterson, Chiang, & Mathematics Random Elementary 1 Loef (1989) assignment 0.202 0.810 0.657 0.426 0.818 0.669 0.452 0.819 0.671 0.453 0.819 0.671 0.511 0.822 0.675 0.545 0.824 0.678 0.551 0.824 0.679 0.692 0.833 0.694 0.747 0.837 0.701 Cervetti, Barber, Dorph, Pearson, & Science and Random Elementary 4 Goldschmidt (2012) Reading assignment 0.220 0.073 0.005 0.400 0.102 0.010 0.650 0.083 0.007 Clark, Arens & Stewart (2015) Mathematics Random Middle 7 assignment School 0.040 0.153 0.023 0.056 0.153 0.023 Clarke, Smolkowski, Baker, Fien, Doabler, & Mathematics Random Elementary K 0.189 0.099 0.010


Chard (2011) assignment 0.232 0.097 0.009 Clements & Sarama (2007) Mathematics Random Preschool PK assignment 0.851 0.517 0.267 1.464 0.558 0.311 Clements & Sarama (2008) Mathematics Random Preschool PK assignment 0.637 0.215 0.046 1.066 0.116 0.013 Clements, Sarama, Spitler, Lange, & Wolfe Mathematics Random Preschool, PK, K (2011) assignment Elementary 0.720 0.099 0.010 Dash, De Kramer, O'dwyer, Masters, & Mathematics Random Elementary 5 Russell, (2012) assignment 0.132 0.068 0.005 DeBarger, Penuel, Moorthy, Beauvineau, Science Quasi- Middle Not Kennedy, & Boscardin (2016) experimental School specifie d 0.637 0.249 0.062 Devlin-Sherer, Spinelli, Giamatteo, Johnson, Mathematics Quasi- Elementary 3, 4, 5 Mayo-Molina, McGinley, … & Zisk (1998) experimental -0.343 0.695 0.482 -0.227 0.692 0.478 -0.136 0.682 0.465 0.297 0.683 0.466 Dominguez, Nicholls, & Storandt (2006) Mathematics Random Elementary 3, 4, 5 assignment 0.082 0.107 0.012 0.100 0.108 0.012 Eddy & Berry (2006) Mathematics Random High School 9, 10, assignment 11, 12 0.006 0.053 0.003 Eddy, Ruitman, Sloper, & Hankel (2010) Mathematics Random High School 9, 10, assignment 11 0.175 0.189 0.036 Garet, Wayne, Stancavage, Taylor, Walters, Mathematics Random Middle 7 Song, … & Doolittle (2010) assignment School 0.020 0.044 0.002 0.050 0.066 0.004 Granger, Bevis, Saka, Southerland, Sampson, Science Random Elementary 4, 5 & Tate (2012) assignment 0.067 0.051 0.003 78


0.573 0.089 0.008 Greenleaf, Litman, Hanson, Rosen, Science Random High School 9, 10 Boscardin, Herman, … & Jones (2011) assignment 0.280 0.109 0.012 Gropen, Clark-Chiarelli, Chalufour, Science Random Preschool PK Hoisington, & Eggers-Piérola, (2009) assignment 0.078 0.112 0.012 0.141 0.112 0.012 0.144 0.112 0.012 0.222 0.112 0.013 Gropen, Clark-Chiarelli, Ehrlich, & Thieu Science Random Preschool PK (2011) assignment 0.379 0.119 0.014 Hand, Therrien, & Shelley (2013) Science Random Elementary, 3, 4, 5, assignment Middle 6 School -0.001 0.132 0.018 0.426 0.134 0.018 Harris, Penuel, D'Angelo, Haydel, DeBarger, Science Random Middle 6 Gallagher, Kennedy, … & Krajcik (2015) assignment School 0.220 0.102 0.010 0.250 0.124 0.015 Heller (2012) Science Random Middle 8 assignment School 0.029 0.064 0.004 0.109 0.052 0.003 Heller, Curtis, Rabe-Hesketh, & Verboncoeur Mathematics Random Elementary, 2, 4, 6 (2007) assignment Middle School 0.010 0.198 0.039 0.111 0.179 0.032 0.352 0.220 0.048 0.429 0.137 0.019 0.659 0.155 0.024 Heller, Daehler, Wong, Shinohara, & Miratrix Science Random Elementary 4 (2012) assignment 0.010 0.090 0.008


0.070 0.095 0.009 0.310 0.090 0.008 0.370 0.084 0.007 0.570 0.086 0.007 0.600 0.092 0.008 Heller, Hanson, & Barnett-Clarke (2010) Mathematics Random Elementary 4, 5 assignment 0.040 0.094 0.009 0.080 0.079 0.006 0.090 0.083 0.007 0.150 0.088 0.008 0.180 0.088 0.008 0.190 0.087 0.008 Hinerman, Hull, Chen, Booker, & Naslund- Mathematics Random Elementary, K, 1, 2, Hadley (2014) assignment Middle 3, 4, 5, School 6, 7 0.280 0.135 0.018 Jaciw, Hegseth, Ma, & Lai (2012) Mathematics Random Elementary 3, 4, 5 assignment 0.050 0.082 0.007 0.120 0.061 0.004 0.140 0.085 0.007 Jacobs, Franke, Carpenter, Levi, & Battey Mathematics Random Elementary 1, 2, 3, (2007) assignment 4, 5 0.277 0.197 0.039 0.316 0.230 0.053 0.348 0.187 0.035 0.766 0.339 0.115 0.830 0.256 0.065 0.841 0.233 0.054 1.016 0.348 0.121 Jerrim & Vignoles (2015) Mathematics Random Elementary, UK1, 7 assignment Middle School 0.055 0.046 0.002 0.099 0.054 0.003 Kaldon & Zoblotsky (2014) Science Random Elementary, 3 0.020 0.038 0.001


assignment Middle School 0.040 0.027 0.001 Kim, Van Tassel-Baska, Bracken, Feng, Science Random Elementary K, 1, 2, Stambaugh, & Bland (2012) assignment 3 0.168 0.184 0.034 Kinzie, Whittaker, Williford, Decoster, Mathematics Random Preschool PK Mcguire, Lee, & Kilday (2014) and Science assignment -0.076 0.060 0.004 -0.025 0.039 0.002 -0.010 0.056 0.003 -0.002 0.062 0.004 0.016 0.204 0.042 0.062 0.067 0.005 0.070 0.076 0.006 0.073 0.062 0.004 0.081 0.037 0.001 0.131 0.056 0.003 Kisker, Lipka, Adams, Rickard, Andrew- Mathematics Random Elementary 2 Ihrke, Yanez, & Millard (2012) assignment 0.390 0.118 0.014 0.819 0.157 0.025 Klein, Starkey, Clements, Sarama, & Iyer Mathematics Random Preschool PK (2008) assignment 0.229 0.178 0.032 0.549 0.180 0.033 Klein, Starkey, DeFlorio, & Brown (2012) Mathematics Random Preschool, PK, K assignment Elementary 0.331 0.118 0.014 0.450 0.114 0.013 0.699 0.127 0.016 0.829 0.118 0.014 Lafferty (1994) Mathematics Quasi- Middle experimental School 0.428 0.345 0.119 Lanehart, Borman, Boydston, Cotner, & Lee Science Random Elementary 3,4,5 (2010) assignment 0.188 0.063 0.004


Lang, Schoen, LaVenia, & Oberlin (2014) Mathematics Random Elementary K, 1 assignment 0.200 0.090 0.008 0.240 0.080 0.006 Lara-Alecio, Tong, Irby, Guerrero, Huerta, & Science Quasi- Elementary 5 Fan (2012) experimental -0.080 0.306 0.094 0.052 0.306 0.094 0.361 0.311 0.097 0.394 0.312 0.097 0.396 0.312 0.097 Lehrer (2015) Mathematics Random Middle 6 assignment School 0.418 0.097 0.009 0.568 0.082 0.007 Lewis & Perry (2015) Mathematics Random Elementary 2, 3, 4, assignment 5 0.496 0.136 0.018 Llorente, Pasknik, Moorthy, Hupert, Mathematics Random Preschool PK Rosenfeld, & Gerard (2015) assignment 0.000 0.078 0.006 0.010 0.039 0.002 0.150 0.081 0.007 0.240 0.048 0.002 Llosa, Lee, Jiang, Haas, O’Connor, Science Random Elementary 5 Van Booven, & Kieffer (2016) assignment 0.150 0.051 0.003 0.250 0.113 0.013 Maerten-Rivera, Ahn, Lanier, Diaz, & Lee Science Random Elementary 5 (2014) assignment -0.011 0.151 0.023 0.136 0.155 0.024 0.170 0.155 0.024 Martin, Braisel, & Turner, & Wise (2012) Mathematics Random Middle 6 assignment School 0.020 0.071 0.005 0.090 0.069 0.005 McCoach, Gubbins, Foreman, Rubenstein, & Mathematics Random Elementary 3 Rambo-Hernandez (2014) assignment 0.030 0.077 0.006


Miller, Jaciw, & Ma (2007) Science Random Elementary 3,4,5 assignment -0.020 0.040 0.002 Montague, Krawec, Enders, & Dietz (2014) Mathematics Random Middle 7 assignment School 0.613 0.250 0.063 Mutch-Jones, Puttick, & Demers (2014) Science Random Elementary 5, 6, 7, assignment and Middle 8 School 0.082 0.039 0.001 Newman, Finney, Bell, Turner, Jaciw, Mathematics Random Elementary 4,5,6,7, Zacamy, & Gould (2012) assignment and Middle 8 School 0.048 0.015 0.000 0.050 0.028 0.001 Niess (2005) Mathematics Random Elementary 3, 4, 5, assignment and Middle 6, 7, 8 School 0.129 0.185 0.034 0.362 0.212 0.045 Oh, Lachapelle, Shams, Hertel, & Science Random Elementary Not Cunningham (2016) assignment specifie d -0.070 0.111 0.012 -0.030 0.125 0.016 0.010 0.070 0.005 0.030 0.072 0.005 0.070 0.092 0.008 0.100 0.058 0.003 0.110 0.134 0.018 0.140 0.081 0.006 0.170 0.072 0.005 0.180 0.080 0.006 0.200 0.124 0.015 0.220 0.077 0.006 Pane, Griffin, Mccaffrey, Karam (2014) Mathematics Random High School 6, 7, 8, assignment 9, 10, -0.100 0.100 0.010


11, 12 -0.030 0.110 0.012 0.190 0.130 0.017 0.220 0.090 0.008 Penuel, Gallagher & Moorthy (2011) Science Random Middle 6,7,8 assignment School 0.180 0.183 0.034 0.290 0.114 0.013 0.340 0.124 0.015 Piasta, Logan, Pelatti, Capps, & Petrill (2015) Mathematics Random Preschool PK and Science assignment -0.080 0.307 0.094 -0.010 0.332 0.110 0.080 0.301 0.091 0.130 0.266 0.071 Presser, Clements, Ginsburg, & Ertle (2015) Mathematics Random Preschool, PK, K assignment Elementary 0.318 0.247 0.061 0.319 0.147 0.022 0.491 0.251 0.063 0.555 0.253 0.064 Presser, Vahey, & Dominguez (2015) Mathematics Random Preschool PK assignment 0.508 0.228 0.052 Pyke, Lynch, Kuipers, Szesze, & Driver Science Random Middle 8 (2004) assignment School 0.320 0.166 0.028 0.383 0.165 0.027 Pyke, Lynch, Kuipers, Szesze, & Watson Science Random Middle 6, 7, 8 (2006) assignment School 0.230 0.318 0.101 Pyke, Lynch, Kuipers, Szesze, & Watson Science Random Middle 7 (2005) assignment School -0.216 0.237 0.056 Reid, Chen, & McCray (2014) Mathematics Quasi- Preschool, PK, K experimental Elementary 0.060 0.079 0.006 Resendez & Azin (2006) Mathematics Random Elementary 3,5 assignment -0.070 0.057 0.003


0.050 0.086 0.007 Resendez & Azin (2008) Mathematics Random Elementary 2, 3, 4, assignment 5 0.200 0.063 0.004 0.210 0.021 0.000 0.240 0.102 0.011 Rimbey (2013) Mathematics Random Elementary 2 assignment 0.245 0.323 0.104 Roschelle, Shechtman, Tatar, Hegedus, Mathematics Random Middle 7,8 Hopkins, Empson, … & Gallagher (2010) assignment School 0.379 0.063 0.004 0.606 0.059 0.003 Roth, Wilson, Taylor, Hvidsten, Stennett, Science Random Elementary 4,5 Wickler, … & Bintz (2015) assignment 0.680 0.041 0.002 San Antonio, Morales, & Moral (2011) Mathematics Random Middle 6 assignment School 0.268 0.280 0.078 Santagata, Kersting, Givvin, & Stigler (2010) Mathematics Random Middle 6 assignment School -0.022 0.134 0.018 Sarama, Clements, Starkey, Klein, & Mathematics Random Preschool PK Wakeley (2008) assignment 0.618 0.166 0.028 Saxe, Gearhart, & Nasir (2001) Mathematics Random Elementary 4, 5, 6 assignment 0.770 1.251 1.565 1.450 1.366 1.867 Schneider (2013) Mathematics Random Middle 8, 9 assignment School, High School 0.298 0.135 0.018 0.400 0.184 0.034 Schneider & Meyer (2012) Mathematics Random Middle 6 assignment School 0.030 0.049 0.002 Schwartz‐Bloom & Halpin (2003) Science Random High School 9, 10, assignment 11, 12 0.380 0.138 0.019 0.580 0.145 0.021 Shannon & Grant (2012) Science Random High School 9, 10, assignment 11, 12 0.060 0.083 0.007


0.120 0.097 0.009 Sophian (2004) Mathematics Quasi- Preschool PK experimental 0.406 0.431 0.185 0.794 0.442 0.196 Supovitz (2013) Mathematics Random Elementary 1, 2, 3, assignment 4, 5 0.001 0.046 0.002 0.035 0.046 0.002 0.059 0.046 0.002 , Pollack, Durkin, Rittle-Johnson, Lynch, Mathematics Random Middle 8, 9 Newton, & Gogolen (2015) assignment School, High School -0.058 0.120 0.014 0.047 0.134 0.000 Starkey, Klein, & DeFlorio (2013) Mathematics Random Preschool PK assignment 0.589 0.165 0.027

1.078 0.173 0.030 Tatar, Roschelle, Knudsen, Shechtman, Mathematics Random Middle 7, 8 Kaput, & Hopkins (2008) assignment School 1.249 0.735 0.540 Tauer (2002) Mathematics Quasi- High School 10 experimental 0.208 0.484 0.234 Taylor, Getty, Kowalski, Wilson, Carlson, & Science Random High School 9, 10 Van Scotter (2015) assignment 0.090 0.040 0.002 Taylor, Roth, Wilson, Stuhlsatz, & Tipton Science Random Elementary 4, 5 (2016) assignment 0.520 0.071 0.005 Thompson, Senk, & Yu (2012) Mathematics Quasi- Middle 7 experimental School -0.125 0.247 0.061 -0.086 0.247 0.061 0.199 0.248 0.061 Vaden-Kiernan, Borman, Caverly, Bell, Ruiz Mathematics Random Elementary K, 1, 3, de Castilla, Sullivan, & Rodriguez (2016) assignment 4 -0.069 0.036 0.001 0.059 0.035 0.001


Van Egeren, Schwarz, Gerde, Morris, Pierce, Science Random Preschool PK Brophy-Herb, … & Stoddard (2014) assignment -0.269 0.193 0.037 -0.030 0.192 0.037 0.017 0.192 0.037 0.150 0.192 0.037 0.251 0.193 0.037 Walsh-Cavazos (1994) Mathematics Quasi- Elementary 5 experimental 0.554 0.439 0.193


Table A2

Estimated mean effect sizes based on unconditional RVE meta-regression models: Differences by presence of attrition problems

Subgroup analysis (unconditional RVE model), by attrition problems (a)
  Attrition problem at the cluster level: 0.180***
  Differential attrition between treatment and control groups: 0.131***
  No attrition problems: 0.243***
Note: Table includes estimated mean effect sizes from unconditional RVE models. Mean effect sizes are based on subgroup analyses in which unconditional RVE meta-regression models were estimated using only effect sizes with (or without) attrition problems. (a) In some studies, attrition problems were only present for some outcomes (e.g., in studies with multiple treatment arms, there may have been differential attrition between the control group and one treatment arm, but not between the control group and a different treatment arm). In these cases, only effect sizes from contrasts with (or without) the relevant attrition problems were included. Studies may appear in more than one row if they had both cluster-level attrition and differential attrition between treatment and control groups. *p < .10, **p < .05, ***p < .01.


Table A3

Results of publication bias tests among full sample, peer-reviewed, and non-peer-reviewed studies

Columns: Full sample; Peer-reviewed studies; Non-peer-reviewed studies

Egger's regression test
  Precision: 0.034 (0.031) [0.276]; 0.006 (0.035) [0.866]; 0.064 (0.051) [0.212]
  Intercept: 1.488 (0.389) [<0.001]; 1.875 (0.434) [<0.001]; 1.052 (0.657) [0.116]
  N studies: 95; 46; 49
  Average impact, based on random effects meta-analysis: 0.207***; 0.227***; 0.189***
  Average impact, trim-and-fill results based on random effects meta-analysis: 0.085***; 0.122***; 0.189***

Meta-regression model estimated using RVE, including effect size standard error as a predictor
  Standard error: 1.078 (0.219) [<0.001]; 1.412 (0.258) [<0.001]; 0.691 (0.329) [0.050]
  Intercept: 0.085 (0.035) [0.015]; 0.065 (0.043) [0.143]; 0.111 (0.050) [0.037]
  N effect sizes: 258; 129; 129
  N studies: 95; 46; 49

Note: Standard errors in parentheses; p-values in brackets. Models in the first column include all studies; models in the second column include only peer-reviewed studies; models in the third column include only non-peer-reviewed studies. Top panel: Observations are at the study level. Outcomes are study-level average standard normal deviations (the study-average effect size divided by the study-average standard error). Egger's regression test carried out using the metabias package in Stata 15; trim-and-fill method carried out using the metatrim package in Stata 15. Bottom panel: Observations are at the effect size level. Outcomes are effect sizes. *p < .10, **p < .05, ***p < .01.
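The publication-bias diagnostics above were run in Stata (metabias, metatrim). An analogous check can be sketched in R with the metafor package, using one study-average effect size and variance per study; the data frame study_level and its columns below are illustrative rather than the actual analysis files.

    # Sketch (alternative to the Stata commands cited in the note): Egger-type
    # asymmetry test, trim-and-fill adjustment, and funnel plot in metafor,
    # with one average effect size (es_mean) and variance (var_mean) per study.
    library(metafor)

    re <- rma(yi = es_mean, vi = var_mean, data = study_level, method = "REML")

    regtest(re)    # regression test for funnel-plot asymmetry
    trimfill(re)   # trim-and-fill adjusted random-effects estimate
    funnel(re)     # funnel plot: effect sizes against standard errors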


[Funnel plots for impacts on math and science achievement, in three panels: all studies, peer-reviewed studies, and non-peer-reviewed studies. Each panel plots the standard error against the effect size (Hedges's g).]

Figure A1. Funnel plot of included effect sizes.


Online Appendix B:

References for Included Studies

*Agodini, R., Harris, B., Remillard, J., & Thomas, M. (2013). After two years, three elementary

math curricula outperform a fourth. Washington, DC: National Center for Education

Evaluation and Regional Assistance.

*Argentin, G., Pennisi, A., Vidoni, D., Abbiati, G., & Caputo, A. (2014). Trying to raise (low)

math achievement and to promote (rigorous) policy evaluation in Italy: Evidence from a

large-scale randomized trial. Evaluation Review, 38(2), 99-132.

*Arnold, D. H., Fisher, P. H., Doctoroff, G. L., & Dobbs, J. (2002). Accelerating math

development in Head Start classrooms. Journal of Educational Psychology, 94(4), 762-

770.

**Batiza, A. F., Gruhl, M., Zhang, B., Harrington, T., Roberts, M., LaFlamme, D., ... &

Hagedorn, E. (2013). The effects of the SUN project on teacher knowledge and self-

efficacy regarding biological energy transfer are significant and long-lasting: Results of a

randomized controlled trial. CBE-Life Sciences Education, 12(2), 287-305.

*Batiza, A., Luo, W., Zhang, B., Gruhl, M., Nelson, D., Hoelzer, M., ... & LaFlamme, D. (2016,

March). Regular Biology Students Learn Like AP Students with SUN. Paper presented at

the Society for Research on Educational Effectiveness Spring Conference, Washington,

DC.

*Battistich, V., Alldredge, S. and Tsuchida, I. (2003). Number power: An elementary school

program to enhance students' mathematical and social development. In S.L. Senk, & D.R.

Thompson (Eds.) Standards-based school mathematics curricula: What are they? What

do students learn? (133–160). Mahwah, NJ: Erlbaum.


**Berlinski, S.G., & Busso, M. (2013). Pedagogical change in mathematics teaching: Evidence

from a randomized control trial. Retrieved from

http://www.webmeets.com/files/papers/lacea-lames/2013/438/BB2013_september.pdf.

*Berlinski, S.G.; Busso, M. (2015). Challenges in Educational Reform: An Experiment on Active

Learning in Mathematics (IDB Working Paper Series, No. IDB-WP-561). Washington,

DC: Inter-American Development Bank. Retrieved from

http://hdl.handle.net/11319/6825

*Beuermann, D. W., Naslund-Hadley, E., Ruprah, I. J., & Thompson, J. (2013). The pedagogy of

science and environment: Experimental evidence from Peru. The Journal of Development

Studies, 49(5), 719-736.

*Borman, K. M., Cotner, B. A., Lee, R. S., Boydston, T. L., & Lanehart, R. (2009, March).

Improving Elementary Science Instruction and Student Achievement: The Impact of a

Professional Development Program. Paper presented at the Society for Research on

Educational Effectiveness Spring Conference, Washington, DC.

*Borman, G. D., Gamoran, A., & Bowdon, J. (2008). A randomized trial of teacher development

in elementary science: First-year achievement effects. Journal of Research on

Educational Effectiveness, 1(4), 237-264.

*Bottge, B. A., Ma, X., Gassaway, L., Toland, M. D., Butler, M., & Cho, S. J. (2014). Effects of

blended instructional models on math performance. Exceptional children, 80(4), 423-437.

*Bottge, B. A., Toland, M. D., Gassaway, L., Butler, M., Choo, S., Griffen, A. K., & Ma, X.

(2015). Impact of enhanced anchored instruction in inclusive math classrooms.

Exceptional Children, 81(2), 158-175.


*Bradshaw, T. J. (2012). Impact of inquiry based distance learning and availability of classroom

materials on physical science content knowledge of teachers and students in Central

Appalachia (Doctoral dissertation). Retrieved from ProQuest Dissertations & Theses

Global. (Order no. 1506611403) http://search.proquest.com.ezp-

prod1.hul.harvard.edu/docview/1506611403?accountid=11311

*Brendefur, J., Strother, S., Thiede, K., Lane, C., & Surges-Prokop, M. J. (2013). A professional

development program to improve math skills among preschool children in Head Start.

Early Childhood Education Journal, 41(3), 187-195.

*Brown, J. A., Greenfield, D. B., Bell, E., Juárez, C. L., Myers, T., & Nayfeld, I. (2013, March).

ECHOS: Early Childhood Hands-On Science efficacy study. Paper presented at the

Society for Research on Educational Effectiveness Spring Conference, Washington, DC.

*Carpenter, T. P., Fennema, E., Peterson, P. L., Chiang, C. P., & Loef, M. (1989). Using

knowledge of children’s mathematics thinking in classroom teaching: An experimental

study. American Educational Research Journal, 26(4), 499-531.

*Cervetti, G. N., Barber, J., Dorph, R., Pearson, P. D., & Goldschmidt, P. G. (2012). The impact

of an integrated approach to science and literacy in elementary school classrooms.

Journal of Research in Science Teaching, 49(5), 631-658.

*Clark, T. F., Arens, S. A., & Stewart, J. (2015, March). Efficacy study of a pre-algebra

supplemental program in rural Mississippi: Preliminary findings. Paper presented at the

Society for Research on Educational Effectiveness Spring Conference, Washington, DC.

*Clarke, B., Smolkowski, K., Baker, S. K., Fien, H., Doabler, C. T., & Chard, D. J. (2011). The

impact of a comprehensive Tier I core kindergarten program on the achievement of

students at risk in mathematics. The Elementary School Journal, 111(4), 561-584.


*Clements, D. H., & Sarama, J. (2007). Effects of a preschool mathematics curriculum:

Summative research on the Building Blocks project. Journal for Research in

Mathematics Education, 38(2), 136-163.

*Clements, D. H., & Sarama, J. (2008). Experimental evaluation of the effects of a research-

based preschool mathematics curriculum. American Educational Research Journal,

45(2), 443-494.

*Clements, D. H., Sarama, J., Spitler, M. E., Lange, A. A., & Wolfe, C. B. (2011). Mathematics

learned by young children in an intervention based on learning trajectories: A large-scale

cluster randomized trial. Journal for Research in Mathematics Education, 42(2), 127–

166.

*Dash, S., De Kramer, R.M., O’Dwyer, L. M., Masters, J., & Russell, M. (2012). Impact of

online professional development on teacher quality and student achievement in fifth grade

mathematics. Journal of Research on Technology in Education, 45(1), 1-26.

*Debarger, A. H., Penuel, W. R., Moorthy, S., Beauvineau, Y., Kennedy, C. A., & Boscardin, C.

K. (2017). Investigating purposeful science curriculum adaptation as a strategy to

improve teaching and learning. Science Education, 101(1), 66-98.

*Dominguez, P. S., Nicholls, C., & Storandt, B. (2006). Experimental methods and results in a

study of PBS TeacherLine math courses. Syracuse, NY: Hezel Associates. Retrieved from

https://files.eric.ed.gov/fulltext/ED510045.pdf

*Eddy, R. M., & Berry, T. (2006). A randomized control trial to test the effects of Prentice

Hall’s Miller and Levine (2006) Biology curriculum on student performance: Final

report. Claremont, CA: Claremont Graduate University.


*Eddy, R. M., Ruitman, H. T., Sloper, M., & Hankel, N. (2010). The effects of Miller & Levine

Biology (2010) on student performance: Final report. LaVerne, CA: Cobblestone

Applied Research & Evaluation, Inc. Retrieved from

http://assets.pearsonschool.com/asset_mgr/pending/2013-

09/FinalEfficacyReport_ML%20Bio.pdf

*Garet, M. S., Wayne, A. J., Stancavage, F., Taylor, J., Eaton, M., Walters, K., ... & Sepanik, S.

(2011). Middle school mathematics professional development impact study: Findings

after the second year of implementation. Washington, DC: National Center for Education

Evaluation and Regional Assistance.

*Granger, E. M., Bevis, T. H., Saka, Y., Southerland, S. A., Sampson, V., & Tate, R. L. (2012).

The efficacy of student-centered instruction in supporting science learning. Science,

338(6103), 105-108.

*Gropen, J., Clark-Chiarelli, N., Chalufour, I., Hoisington, C., & Eggers-Pierola, C. (2009,

March). Creating a successful professional development program in science for Head

Start teachers and children: Understanding the relationship between development,

intervention, and evaluation. Paper presented at the Society for Research on Educational

Effectiveness Spring Conference Washington, DC.

*Gropen, J., Clark-Chiarelli, N., Ehrlich, S., & Thieu, Y. (2011, March). Examining the efficacy

of Foundations of Science Literacy: Exploring contextual factors. Paper presented at the

Society for Research on Educational Effectiveness Spring Conference, Washington, DC.

**Hand, B., Norton-Meier, L.A., Gunel, M., & Akkus, R. (2016). Aligning teaching to learning:

A 3-year study examining the embedding of language and argumentation into elementary


science classrooms. International Journal of Science and Mathematics Education, 14(5),

847-863.

*Hand, B., Therrien, W., & Shelley, M. (2013, March). Examining the impact of using the

science writing heuristic approach in learning science: A cluster randomized study.

Paper presented at the Society for Research on Educational Effectiveness Spring

Conference, Washington, D.C.

*Harris, C. J., Penuel, W. R., D'Angelo, C. M., DeBarger, A. H., Gallagher, L. P., Kennedy, C.

A., ... & Krajcik, J. S. (2015). Impact of project‐based curriculum materials on student

learning in science: Results of a randomized controlled trial. Journal of Research in

Science Teaching, 52(10), 1362-1385.

*Harris, C. J., Penuel, W. R., DeBarger, A., D’Angelo, C., & Gallagher, L. P. (2014).

Curriculum Materials make a difference for next generation science learning: Results

from year 1 of a randomized control trial. Menlo Park, CA: SRI International.

* Heller, J. I. (2012). Effects of Making Sense of SCIENCE™ professional development on the

achievement of middle school students, including English language learners. (NCEE

2012-4002). Washington, DC: National Center for Education Evaluation and Regional

Assistance, Institute of Education Sciences, U.S. Department of Education.

* Heller, J. I., Curtis, D. A., Rabe-Hesketh, S., & Verboncoeur, C. (2007). The effects of Math

Pathways and Pitfalls on students’ mathematics achievement: National Science

Foundation final report. Washington, DC: U.S. Department of Education, National

Center for Education Statistics. http://eric.ed.gov/?id=ED498258

*Heller, J. I., Daehler, K. R., Wong, N., Shinohara, M., & Miratrix, L. W. (2012). Differential

effects of three professional development models on teacher knowledge and student


achievement in elementary science. Journal of Research in Science Teaching, 49(3), 333-

362.

*Heller, J., Hanson, T., Barnett-Clarke, C. (2010). The impact of Math Pathways & Pitfalls on

students’ mathematics achievement and mathematical language development: A study

conducted in schools with high concentrations of Latino/a student and English learners.

Oakland, CA: WestEd. Retrieved from https://mpp.wested.org/wp-

content/uploads/2013/05/mpp-ies-report.pdf

*Hinerman, K. M., Hull, D. M., Chen, Q., Booker, D. D., & Naslund-Hadley, E. I. (2014,

March). Teacher-led math inquiry in Belize: A cluster randomized trial. Paper presented

at the Society for Research on Educational Effectiveness Spring Conference,

Washington, DC.

*Jaciw, A. P., Hegseth, W., Ma, B., & Lai, G. (2012). Assessing impacts of Math in Focus, a

‘Singapore Math’ program for American schools: A report of findings from a randomized

control trial. Palo Alto, CA: Empirical Education.

**Jaciw, A. P., Hegseth, W., & Toby, M. (2015, March). Assessing impacts of "Math in Focus,"

a ‘Singapore Math’ program for American schools. Paper presented at the Society for

Research on Educational Effectiveness Spring Conference, Washington, DC.

*Jacobs, V. R., Franke, M. L., Carpenter, T. P., Levi, L., & Battey, D. (2007). Professional

development focused on children's algebraic reasoning in elementary school. Journal for

Research in Mathematics Education, 38(3), 258-288.

*Jerrim, J., & Vignoles, A. (2015). The causal effect of East Asian “mastery” teaching methods

on English children’s mathematics skills. UCL Institute of Education, University College

London.


*Kaldon, C. R., & Zoblotsky, T. A. (2014, March). A randomized controlled trial validating the

impact of the LASER model of science education on student achievement and teacher

instruction. Paper presented at the Society for Research on Educational Effectiveness

Spring Conference, Washington, DC.

*Kim, K. H., VanTassel-Baska, J., Bracken, B. A., Feng, A., Stambaugh, T., & Bland, L. (2012).

Project Clarion: Three years of science instruction in Title I schools among K – third-

grade students. Research in Science Education, 42(83), 1-17.

*Kinzie, M. B., Whittaker, J. V., Williford, A. P., DeCoster, J., McGuire, P., Lee, Y., & Kilday,

C. R. (2014). MyTeachingPartner-Math/Science pre-kindergarten curricula and teacher

supports: Associations with children's mathematics and science learning. Early

Childhood Research Quarterly, 29(4), 586-599.

*Kisker, E. E., Lipka, J., Adams, B. L., Rickard, A., Andrew-Ihrke, D., Yanez, E. E., & Millard,

A. (2012). The potential of a culturally based supplemental mathematics curriculum to

improve the mathematics performance of Alaska Native and other students. Journal for

Research in Mathematics Education, 43(1), 75-113.

*Klein, A., Starkey, P., Clements, D., Sarama, J., & Iyer, R. (2008). Effects of a pre-kindergarten

mathematics intervention: A randomized experiment. Journal of Research on

Educational Effectiveness, 1(3), 155-178.

*Klein, A., Starkey, P., Deflorio, L., Brown, E.T., (2012, March). Establishing and sustaining an

effective pre-kindergarten math intervention at scale. Paper presented at the Society for

Research in Educational Effectiveness Spring Conference, Washington, DC.


*Lafferty, J. F. (1994). The links among mathematics text, students’ achievement, and students’

mathematics anxiety: A comparison of the incremental development and traditional texts.

Unpublished doctoral dissertation, Widener University, Wilmington, DE.

*Lanehart, R.E., Borman, K.M., Boydston, T. L., Cotner, B. A., & Lee, R. S. (2010, March).

Improving gender, racial, and social equity in elementary science instruction and student

achievement: The impact of a professional development program. Paper presented

Society for Research on Educational Effectiveness Spring Conference, Washington, DC.

*Lang, L. B., Schoen, R. R., LaVenia, M., & Oberlin, M. (2014). Mathematics Formative

Assessment system – Common Core State Standards: A randomized field trial in

kindergarten and first grade. Paper presented at the Society for Research on Educational

Effectiveness Spring Conference, Washington, DC.

*Lara‐Alecio, R., Tong, F., Irby, B. J., Guerrero, C., Huerta, M., & Fan, Y. (2012). The effect of

an instructional intervention on middle school English learners' science and English

reading achievement. Journal of Research in Science Teaching, 49(8), 987-1011.

*Lehrer, R. (2010). Assessing data modeling and statistical reasoning project: IES final report.

Washington, DC: Institute for Educational Sciences.

**Lewis, C.C., & Perry, R. (2014). Lesson study with mathematical resources: A sustainable

model for locally-led teacher professional learning. Mathematics Teacher Education and

Development, 16(1).

**Leuzzi, A. & Pennisi, A. (no date). Learning about the effectiveness of EU structural fund

projects in Italian schools: A large-scale RCT for math teachers. [Presentation slides].

Retrieved from http://www.esparama.lt/c/document_library/get_file?uuid=cbebbfe8-

e2a9-4486-8422-fbded985dfca&groupId=19002


*Lewis, C. C., & Perry, R. R. (2015). A randomized trial of lesson study with mathematical

resource kits: Analysis of impact on teachers’ beliefs and learning community. In

Middleton, J.A., Cai, J., & Hwang, S. (Eds). Large-scale studies in mathematics

education (133-158). New York, NY: Springer International Publishing.

*Llorente, C., Pasnik, S., Moorthy, S., Hupert, N., Rosenfeld, D., & Gerard, S. (2015, March).

Preschool teachers can use a PBS KIDS Transmedia Curriculum Supplement to support

young children's mathematics learning: Results of a randomized controlled trial. Paper

presented at the Society for Research on Educational Effectiveness Spring Conference,

Washington, DC.

*Llosa, L., Lee, O., Jiang, F., Haas, A., O’Connor, C., Van Booven, C. D., & Kieffer, M. J.

(2016). Impact of a large-scale science intervention focused on English language

learners. American Educational Research Journal, 53(2), 395-424.

*Maerten-Rivera, J., Ahn, S., Lanier, K., Diaz, J., & Lee, O. (2016). Effect of a multiyear

intervention on science achievement of all students including English language learners.

The Elementary School Journal, 116(4), 600-624.

*Martin, T., Brasiel, S. J., Turner, H., & Wise, J. C. (2012). Effects of the Connected

Mathematics Project 2 (CMP2) on the mathematics achievement of Grade 6 students in

the Mid-Atlantic region: Final report. Washington, DC: National Center for Education

Evaluation and Regional Assistance.

*McCoach, D. B., Gubbins, E. J., Foreman, J., Rubenstein, L. D., & Rambo-Hernandez, K. E.

(2014). Evaluating the efficacy of using pre-differentiated and enriched mathematics

curricula for grade 3 students: A multisite cluster-randomized trial. Gifted Child

Quarterly, 58(4), 272-286.


*Miller, G., Jaciw, A., Ma, B., & Wei, X. (2007). Comparative effectiveness of Scott Foresman

Science: A report of randomized experiments in five school districts. Palo Alto, CA:

Empirical Education.

*Montague, M., Krawec, J., Enders, C., & Dietz, S. (2014). The effects of cognitive strategy

instruction on math problem solving of middle-school students of varying ability. Journal

of Educational Psychology, 106(2), 469.

*Mutch-Jones, Puttick, & Demers (2014, April). Differentiating science instruction: Teacher and

student findings from the Accessing Science Ideas project. Poster presented at the

American Educational Research Association Annual Meeting, Philadelphia, PA.

* Newman,D., Finney, P.B., Bell, S., Turner, H., Jaciw, A.P., Zacamy, J.L., & Feagans Gould, L.

(2012). Evaluation of the Effectiveness of the Alabama Math, Science, and Technology

Initiative (AMSTI). (NCEE 2012–4008). Washington, DC: National Center for Education

Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of

Education. Retrieved from http://dx.doi.org/10.2139/ssrn.2511347.

*Oh, Y., Lachapelle, C. P., Shams, M. F., Hertel, J. D., & Cunningham, C. M. (2016, April).

Evaluating the efficacy of Engineering is Elementary for student learning of engineering

and science concepts. Paper presented at the American Educational Research Association

Annual Meeting, Washington, DC.

*Pane, J. F., Griffin, B. A., McCaffrey, D. F., & Karam, R. (2014). Effectiveness of Cognitive

Tutor Algebra I at scale. Educational Evaluation and Policy Analysis, 36(2), 127-144.

*Penuel, W. R., Bates, L., Pasnik, S., Townsend, E., Gallagher, L. P., Llorente, C., & Hupert, N.

(2010, June). The impact of a media-rich science curriculum on low-income preschoolers'


science talk at home. In Proceedings of the 9th International Conference of the Learning

Sciences, (238-245). International Society of the Learning Sciences, Chicago, IL.

*Penuel, W. R., Gallagher, L. P., & Moorthy, S. (2011). Preparing teachers to design sequences of instruction in systems science: A comparison of three professional development programs. American Educational Research Journal, 48(4), 996-1025.

*Piasta, S. B., Logan, J. A., Pelatti, C. Y., Capps, J. L., & Petrill, S. A. (2015). Professional development for early childhood educators: Efforts to improve math and science learning opportunities in early childhood classrooms. Journal of Educational Psychology, 107(2), 407-422.

*Presser, A. L., Clements, M., Ginsburg, H., & Ertle, B. (2012). Effects of a preschool and kindergarten mathematics curriculum: Big Math for Little Kids. New York, NY: Center for Children and Technology. Retrieved from http://cct.edc.org/sites/cct.edc.org/files/publications/BigMathPaper_Final.pdf

*Presser, A. L., Vahey, P., & Dominguez, X. (2015, March). Improving mathematics learning by integrating curricular activities with innovative and developmentally appropriate digital apps: Findings from the Next Generation Preschool Math Evaluation. Paper presented at the Society for Research on Educational Effectiveness Spring Conference, Washington, DC.

*Pyke, C., Lynch, S., Kuipers, J., Szesze, M., & Driver, H. (2004). Implementation study of Chemistry that Applies (2002-2003): SCALE-uP Report No. 2. George Washington University, Washington, DC.

*Pyke, C., Lynch, S., Kuipers, J., Szesze, M., & Watson, W. (2006). Implementation study of Exploring Motion and Forces (2005-2006): SCALE-uP Report No. 13. George Washington University, Washington, DC.

*Pyke, C., Lynch, S., Kuipers, J., Szesze, M., & Watson, W. (2005). Implementation study of The Real Reasons for Seasons (2004-2005): SCALE-uP Report No. 7. George Washington University, Washington, DC.

*Reid, E. E., Chen, J. Q., & McCray, J. (2014, March). Achieving high standards for pre-K – Grade 3 mathematics: A whole teacher approach to professional development. Paper presented at the Society for Research on Educational Effectiveness Spring Conference, Washington, DC.

*Resendez, M., & Azin, M. (2006). 2005 Prentice Hall Science Explorer randomized control trial. Jackson, WY: PRES Associates, Inc.

*Resendez, M., & Azin, M. (2008). A study of the effects of Pearson’s 2009 enVision Math Program. Jackson, WY: PRES Associates, Inc.

*Rethinam, V., Pyke, C., & Lynch, S. (2008). Using multilevel analyses to study the effectiveness of science curriculum materials. Evaluation & Research in Education, 21(1), 18-42.

*Rimbey, K. A. (2013). From the Common Core to the classroom: A professional development efficacy study for the Common Core State Standards for Mathematics (Doctoral dissertation). Retrieved from ProQuest. (Order No. 3560379)

**Roschelle, J., Shechtman, N., Tatar, D., Hegedus, S., Hopkins, B., Empson, S., ... & Gallagher, L. P. (2010). Integration of technology, curriculum, and professional development for advancing middle school mathematics: Three large-scale studies. American Educational Research Journal, 47(4), 833-878.

*Roth, K., Wilson, C., Taylor, J., Hvidsten, C., Stennett, B., Wickler, N., ... & Bintz, J. (2015, March). Testing the consensus model of effective PD: Analysis of practice and the PD research terrain. Paper presented at the International Conference of the National Association of Science Teacher Researchers, Chicago, IL.

*San Antonio, D. M., Morales, N. S., & Moral, L. S. (2011). Module-based professional development for teachers: A cost-effective Philippine experiment. Teacher Development, 15(2), 157-169.

*Santagata, R., Kersting, N., Givvin, K. B., & Stigler, J. W. (2010). Problem implementation as a lever for change: An experimental study of the effects of a professional development program on students’ mathematics learning. Journal of Research on Educational Effectiveness, 4(1), 1-24.

*Sarama, J., Clements, D. H., Starkey, P., Klein, A., & Wakeley, A. (2008). Scaling up the implementation of a pre-kindergarten mathematics curriculum: Teaching for understanding with trajectories and technologies. Journal of Research on Educational Effectiveness, 1(2), 89-119.

*Sarama, J., Lange, A., Clements, D. H., & Wolfe, C. B. (2012). The impacts of an early mathematics curriculum on emerging literacy and language. Early Childhood Research Quarterly, 27, 489-502. doi: 10.1016/j.ecresq.2011.12.002

*Saxe, G. B., & Gearhart, M. (2001). Enhancing students' understanding of mathematics: A study of three contrasting approaches to professional support. Journal of Mathematics Teacher Education, 4(1), 55-79.

*Schneider, S. (2013). Final Report for IES R305A70105: Algebra Interventions for Measured Achievement – Full Year. San Francisco, CA: WestEd.

*Schneider, M. C., & Meyer, J. P. (2012). Investigating the efficacy of a professional development program in formative classroom assessment in middle-school English language arts and mathematics. Journal of Multidisciplinary Evaluation, 8(17), 1-24.

Schochet, P. Z. (2011). Do typical RCTs of education interventions have sufficient statistical power for linking impacts on teacher practice and student achievement outcomes? Journal of Educational and Behavioral Statistics, 36(4), 441-471.

*Schwartz-Bloom, R. D., & Halpin, M. J. (2003). Integrating pharmacology topics in high school biology and chemistry classes improves performance. Journal of Research in Science Teaching, 40(9), 922-938.

*Shannon, L., & Grant, B. (2012). A final evaluation report of Houghton Mifflin Harcourt’s Holt McDougal Biology. Charlottesville, VA: Magnolia Consulting, LLC.

*Sophian, C. (2004). Mathematics for the future: Developing a Head Start curriculum to support mathematics learning. Early Childhood Research Quarterly, 19(1), 59-81.

*Star, J. R., Pollack, C., Durkin, K., Rittle-Johnson, B., Lynch, K., Newton, K., & Gogolen, C. (2015). Learning from comparison in algebra. Contemporary Educational Psychology, 40, 41-54.

*Starkey, P., Klein, A., & Deflorio, L. (2013). Changing the developmental trajectory in early math through a two-year preschool math intervention. Paper presented at the Society for Research on Educational Effectiveness Spring Conference, Washington, DC.

Sussman, J., & Wilson, M. R. (2018). The use and validity of standardized achievement tests for evaluating new curricular interventions in mathematics and science. American Journal of Evaluation, 1098214018767313.

*Tatar, D., Roschelle, J., Knudsen, J., Shechtman, N., Kaput, J., & Hopkins, B. (2008). Scaling up innovative technology-based mathematics. The Journal of the Learning Sciences, 17(2), 248-286.

*Tauer, S. (2002). How does the use of two different mathematics curricula affect student achievement? A comparison study in Derby, Kansas. Retrieved from http://www.cpmponline.org/pdfs/CPMP_Achievement_Derby.pdf

*Taylor, J. A., Getty, S. R., Kowalski, S. M., Wilson, C. D., Carlson, J., & Van Scotter, P. (2015). An efficacy trial of research-based curriculum materials with curriculum-based professional development. American Educational Research Journal, 52(5), 984-1017.

*Thompson, D. R., Senk, S. L., & Yu, Y. (2012). An evaluation of the Third Edition of the University of Chicago School Mathematics Project: Transition Mathematics. Chicago, IL: University of Chicago School Mathematics Project.

*Vaden-Kiernan, M., Borman, G., Caverly, S., Bell, N., Ruiz de Castilla, V., Sullivan, K., & Rodriguez, D. (2016, March). Findings from a multi-year scale-up effectiveness trial of Everyday Mathematics. Paper presented at the Society for Research on Educational Effectiveness Spring Conference, Washington, DC.

*Van Egeren, L. A., Schwarz, C., Gerde, H., Morris, B., Pierce, S., Brophy-Herb, ... & Stoddard, D. (2014, August). Cluster-randomized trial of the efficacy of early childhood science education with low-income children: Years 1-3. Poster presented at the 2014 Discovery Research K-12 PI Meeting, Arlington, VA.

*Walsh-Cavazos, S. (1994). A study of the effects of a mathematics staff development module on teachers' and students' achievement (Doctoral dissertation). Retrieved from ProQuest. (Order No. 9517241)

Online Appendix C:

Full Results of Sensitivity Checks

Table C1

Regression-adjusted mean effect sizes based on unconditional and conditional RVE meta-regression models: Differences by subject (math vs. science)

                                                Unconditional RVE model    Conditional RVE model
Subgroup analysis                               ḡ_uc a,b                   ḡ_c c        p_c d
Subject Matter: Focus of intervention e
  Science                                       0.208***                   0.225        0.848
  Math (including Math and Science)             0.201***                   0.216
Subject Area: Outcome measure f
  Science                                       0.188***                   0.223        0.942
  Math                                          0.201***                   0.226
Note: Mean effect sizes in the unconditional column are based on subgroup analyses in which unconditional RVE meta-regression models were estimated separately for studies focused on math/science and for math/science outcomes. a ḡ_uc = estimated mean effect size from the unconditional RVE meta-regression model. b p_uc = p-value on the constant from the unconditional meta-regression model. Mean effect sizes in the conditional column are based on RVE meta-regression models including all variables listed in Table 3, or including an indicator for math outcome (using the study-level mean) together with the other variables listed in Table 3; regression-adjusted mean effect sizes were calculated using the overall average of the study-level values of each included covariate. c ḡ_c = estimated regression-adjusted mean effect size from the conditional RVE meta-regression model including the covariates listed in Table 3. d p_c = p-value on the coefficient for the program feature of interest from the meta-regression model including the relevant set of controls. e Subject Matter: Focus of intervention refers to whether the intervention focused on math (including 3 studies that focused on both math and science). f Subject Area: Outcome measure refers to the subject area of the outcome measure; this accounts for the fact that some studies focus on both math and science and include both math and science outcomes.
*p < .10, **p < .05, ***p < .01.
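Concretely, each regression-adjusted mean in the conditional column can be read as the conditional model's predicted effect size evaluated at the overall covariate means, a sketch of the calculation described in the table note, where the x_k are the study-level covariates listed in Table 3 and the β̂ terms are the conditional RVE meta-regression coefficients:

\[ \bar{g}_c = \hat{\beta}_0 + \sum_{k} \hat{\beta}_k \, \bar{x}_k \]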

Table C2

Estimated mean effect sizes based on unconditional RVE meta-regression models: Differences by grade level

Subgroup analysis                    Unconditional RVE model
Grade level of participants
  Preschool                          0.390***
  Kindergarten                       0.133**
  Early Elementary                   0.200***
  Upper Elementary                   0.176***
  Middle School                      0.159***
  High School                        0.166**
Note: Estimated mean effect sizes from unconditional RVE meta-regression models. Mean effect sizes are based on subgroup analyses in which unconditional meta-regression models were estimated using RVE separately for studies that included participants at each grade level. Studies may appear in more than one row if they included students across multiple grades.
*p < .10, **p < .05, ***p < .01.

Table C3

Sensitivity tests: Results of estimating an unconditional meta-regression model with robust variance estimation (RVE) and imputing missing effect sizes

                        Imputing missing effect     Imputing missing effect     Imputing missing effect
                        sizes as zero               sizes as negative           sizes as negative
                        (g = 0.00)                  (g = -0.10)                 (g = -0.20)
Constant                0.197*** (0.024)            0.193*** (0.024)            0.190*** (0.024)
N effect sizes          278                         278                         278
N studies               95                          95                          95
ρ a                     0.80                        0.80                        0.80
Note: Standard errors in parentheses. a ρ represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.

Table C4

Sensitivity tests: RVE results including professional development foci as moderators

                                                        Imputing missing       Imputing missing       Imputing missing
                                                        effect sizes as zero   effect sizes as        effect sizes as
                                                        (g = 0.00)             negative (g = -0.10)   negative (g = -0.20)
Between-study effects: Professional development foci
  Generic instructional strategies                      -0.078 (0.067)         -0.079 (0.067)         -0.080 (0.067)
  How to use curriculum materials                       0.118* (0.062)         0.114* (0.063)         0.110* (0.064)
  Integrate technology                                  0.135 (0.104)          0.134 (0.105)          0.132 (0.107)
  Content-specific formative assessment                 0.097 (0.062)          0.095 (0.063)          0.093 (0.064)
  Improve pedagogical content knowledge/
    how students learn                                  0.094** (0.042)        0.095** (0.042)        0.096** (0.043)
  N effect sizes                                        257                    257                    257
  N studies                                             89                     89                     89
  ρ a                                                   0.80                   0.80                   0.80
Between-study effects: Number of PD foci
  Number of PD features                                 0.074*** (0.023)       0.073*** (0.024)       0.072*** (0.024)
  N effect sizes                                        257                    257                    257
  N studies                                             89                     89                     89
  ρ a                                                   0.80                   0.80                   0.80
Note: Standard errors in parentheses. All models include only studies and/or treatment arms with a professional development component. All models include controls for the following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a ρ represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.
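One way to write the moderator models underlying Tables C4 through C6 (and, analogously, Tables C9 through C11) is the general form below, offered as a sketch rather than the exact estimating equation: g_ij denotes effect size i in study j, the Z_mj are indicators for the professional development features examined in each panel, the X_cj are the study-level controls listed in the table notes, u_j is a between-study random effect, and ε_ij is sampling error, with robust variance estimation used for the standard errors.

\[ g_{ij} = \beta_0 + \sum_{m} \beta_m Z_{mj} + \sum_{c} \gamma_c X_{cj} + u_j + \varepsilon_{ij} \]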

Table C5

Sensitivity tests: RVE results including professional development activities as moderators

                                                        Imputing missing       Imputing missing       Imputing missing
                                                        effect sizes as zero   effect sizes as        effect sizes as
                                                        (g = 0.00)             negative (g = -0.10)   negative (g = -0.20)
Between-study effects: Professional development activities
  Observed demonstration                                0.033 (0.048)          0.033 (0.048)          0.033 (0.049)
  Review generic student work                           -0.026 (0.070)         -0.029 (0.070)         -0.032 (0.069)
  Solved problems/Worked through student materials      0.076 (0.046)          0.074 (0.047)          0.072 (0.048)
  Developed curriculum materials/lesson plans           0.046 (0.062)          0.046 (0.062)          0.047 (0.063)
  Reviewed own student work                             -0.002 (0.070)         -0.007 (0.073)         -0.011 (0.076)
  N effect sizes                                        256                    256                    256
  N studies                                             88                     88                     88
  ρ a                                                   0.80                   0.80                   0.80
Between-study effects: Number of PD activities
  Number of PD features                                 0.039* (0.022)         0.038 (0.023)          0.036 (0.024)
  N effect sizes                                        257                    257                    257
  N studies                                             89                     89                     89
  ρ a                                                   0.80                   0.80                   0.80
Note: Standard errors in parentheses. All models include only studies and/or treatment arms with a professional development component and include controls for: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a ρ represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.

Table C6

Sensitivity tests: RVE results including professional development formats as moderators

                                                        Imputing missing       Imputing missing       Imputing missing
                                                        effect sizes as zero   effect sizes as        effect sizes as
                                                        (g = 0.00)             negative (g = -0.10)   negative (g = -0.20)
Between-study effects: Professional development formats
  Same-school collaboration                             0.106 (0.066)          0.104 (0.065)          0.101 (0.064)
  Implementation meetings                               0.104** (0.047)        0.102** (0.047)        0.099** (0.047)
  Any online PD                                         -0.137** (0.054)       -0.136** (0.053)       -0.135** (0.053)
  Summer workshop                                       0.072** (0.038)        0.071* (0.038)         0.071* (0.038)
  Expert coaching                                       0.054 (0.051)          0.057 (0.051)          0.060 (0.051)
  PD leaders – researchers                              -0.039 (0.037)         -0.043 (0.037)         -0.048 (0.037)
  N effect sizes                                        251                    251                    251
  N studies                                             86                     86                     86
  ρ a                                                   0.80                   0.80                   0.80
Between-study effects: Number of PD formats
  Number of PD features                                 0.020 (0.023)          0.019 (0.023)          0.018 (0.023)
  N effect sizes                                        257                    257                    257
  N studies                                             89                     89                     89
  ρ a                                                   0.80                   0.80                   0.80
Note: Standard errors in parentheses. All models include only studies and/or treatment arms with a professional development component and include controls for: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a ρ represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.

Table C7

Sensitivity tests: RVE results including curriculum materials features as moderators

                                                        Imputing missing       Imputing missing       Imputing missing
                                                        effect sizes as zero   effect sizes as        effect sizes as
                                                        (g = 0.00)             negative (g = -0.10)   negative (g = -0.20)
Between-study effects
  Implementation guidance                               0.075 (0.058)          0.080 (0.058)          0.086 (0.059)
  Laboratory/hands-on experience, curriculum kits       -0.026 (0.053)         -0.027 (0.054)         -0.028 (0.054)
  Curriculum dosage (Number of hours)                   0.000 (0.001)          0.000 (0.001)          0.000 (0.001)
  N effect sizes                                        205                    205                    205
  N studies                                             77                     77                     77
  ρ a                                                   0.80                   0.80                   0.80
Note: Standard errors in parentheses. All models include only studies and/or treatment arms with new curriculum materials. All models include controls for: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a ρ represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.

Table C8

Sensitivity tests: Results of estimating an unconditional meta-regression model with robust variance estimation (RVE)

                     Preferred          Excluding          Excluding          Alternative values of correlation between
                     model              non-US studies     non-RCT studies    pairs of observed effect sizes within studies
                                                                              ρ = 0.50         ρ = 0.70         ρ = 0.90
Constant             0.209*** (0.025)   0.223*** (0.025)   0.208*** (0.025)   0.209*** (0.025)  0.209*** (0.025)  0.209*** (0.025)
N effect sizes       258                248                239                258               258               258
N studies            95                 89                 86                 95                95                95
ρ a                  0.80               0.80               0.80               0.50              0.70              0.90
Note: Standard errors in parentheses. a ρ represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.
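For readers who want to see the mechanics behind the unconditional estimate, the following is a minimal illustrative sketch (not the authors' code) of an intercept-only correlated-effects RVE estimate in the spirit of Hedges, Tipton, and Johnson (2010). The data are hypothetical and τ² is supplied rather than estimated; in the correlated-effects weighting scheme, the assumed within-study correlation ρ enters only through the estimate of τ², which helps explain why the estimates in Table C8 barely change across the alternative values of ρ.

import numpy as np

def rve_intercept(effects, variances, tau2=0.05):
    # effects, variances: one sequence per study (possibly several effect sizes per study).
    # Correlated-effects weights: each effect size in study j receives weight
    # 1 / (k_j * (mean sampling variance in study j + tau2)).
    num, den, weights = 0.0, 0.0, []
    for T, v in zip(effects, variances):
        T, v = np.asarray(T, float), np.asarray(v, float)
        k, v_bar = len(T), v.mean()
        w = 1.0 / (k * (v_bar + tau2))
        weights.append(w)
        num += w * T.sum()
        den += w * k
    beta = num / den  # RVE-weighted average effect size (the "Constant" row)
    # Robust (sandwich) standard error, clustering residuals by study.
    meat = sum((w * (np.asarray(T, float) - beta).sum()) ** 2
               for w, T in zip(weights, effects))
    return beta, np.sqrt(meat) / den

# Toy example: three hypothetical studies contributing 2, 1, and 3 effect sizes.
effects = [[0.25, 0.18], [0.10], [0.32, 0.28, 0.30]]
variances = [[0.02, 0.02], [0.03], [0.01, 0.01, 0.01]]
print(rve_intercept(effects, variances))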

Table C9

Sensitivity tests: RVE results including professional development foci as moderators

                                              Preferred         Excluding         Excluding         Alternative values of correlation between
                                              model             non-US studies    non-RCT studies   effect sizes within studies
                                                                                                    ρ = 0.50          ρ = 0.70          ρ = 0.90
Between-study effects: Professional development foci
  Generic instructional strategies            -0.074 (0.072)    0.071 (0.074)     -0.079 (0.077)    -0.074 (0.072)    -0.074 (0.072)    -0.074 (0.072)
  How to use curriculum materials             0.119* (0.063)    0.077 (0.057)     0.110 (0.066)     0.119* (0.063)    0.119* (0.063)    0.119* (0.063)
  Integrate technology                        0.139 (0.102)     0.203* (0.095)    0.136 (0.104)     0.139 (0.102)     0.139 (0.102)     0.139 (0.102)
  Content-specific formative assessment       0.105 (0.062)     0.109 (0.069)     0.104 (0.067)     0.105 (0.062)     0.105 (0.062)     0.105 (0.062)
  Improve pedagogical content knowledge/
    how students learn                        0.096** (0.043)   0.078* (0.039)    0.101** (0.044)   0.096** (0.043)   0.096** (0.043)   0.096** (0.043)
  N effect sizes                              237               227               222               237               237               237
  N studies                                   89                83                82                89                89                89
  ρ a                                         0.80              0.80              0.80              0.50              0.70              0.90
Between-study effects: Number of PD foci
  Number of PD features                       0.077*** (0.024)  0.069*** (0.024)  0.075*** (0.025)  0.077*** (0.024)  0.077*** (0.024)  0.077*** (0.024)
  N effect sizes                              237               227               222               237               237               237
  N studies                                   89                83                82                89                89                89
  ρ a                                         0.80              0.80              0.80              0.50              0.70              0.90
Note: Standard errors in parentheses. All models include only studies and/or treatment arms with a professional development component. All models include controls for: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a ρ represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.

Table C10

Sensitivity tests: RVE results including professional development activities as moderators

                                                        Preferred         Excluding         Excluding         Alternative values of correlation between
                                                        model             non-US studies    non-RCT studies   effect sizes within studies
                                                                                                              ρ = 0.50         ρ = 0.70         ρ = 0.90
Between-study effects: Professional development activities
  Observed demonstration                                0.025 (0.046)     0.004 (0.048)     0.017 (0.047)     0.025 (0.046)    0.025 (0.046)    0.025 (0.046)
  Review generic student work                           -0.014 (0.085)    -0.057 (0.081)    0.015 (0.093)     -0.014 (0.085)   -0.014 (0.085)   -0.014 (0.085)
  Solved problems/Worked through student materials      0.070 (0.047)     0.098** (0.045)   0.062 (0.048)     0.070 (0.047)    0.070 (0.047)    0.070 (0.047)
  Developed curriculum materials/lesson plans           0.036 (0.063)     -0.001 (0.073)    0.044 (0.065)     0.036 (0.063)    0.036 (0.063)    0.036 (0.063)
  Reviewed own student work                             0.017 (0.071)     0.046 (0.085)     0.009 (0.073)     0.017 (0.071)    0.017 (0.071)    0.017 (0.071)
  N effect sizes                                        236               226               221               236              236              236
  N studies                                             88                82                81                88               88               88
  ρ a                                                   0.80              0.80              0.80              0.50             0.70             0.90
Between-study effects: Number of PD activities
  Number of PD features                                 0.046* (0.024)    0.041 (0.024)     0.048* (0.024)    0.046* (0.024)   0.046* (0.024)   0.046* (0.024)
  N effect sizes                                        237               227               222               237              237              237
  N studies                                             89                83                82                89               89               89
  ρ a                                                   0.80              0.80              0.80              0.50             0.70             0.90
Note: Standard errors in parentheses. All models include only studies and/or treatment arms with a professional development component. All models include controls for: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a ρ represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.

Table C11

Sensitivity tests: RVE results including professional development formats as moderators

                                              Preferred          Excluding          Excluding          Alternative values of correlation between
                                              model              non-US studies     non-RCT studies    effect sizes within studies
                                                                                                       ρ = 0.50           ρ = 0.70           ρ = 0.90
Between-study effects: Professional development formats
  Same-school collaboration                   0.123* (0.067)     0.152** (0.067)    0.127* (0.068)     0.123* (0.067)     0.123* (0.067)     0.122* (0.067)
  Implementation meetings                     0.117** (0.045)    0.091** (0.0435)   0.109** (0.046)    0.117** (0.045)    0.117** (0.045)    0.117** (0.045)
  Any online PD                               -0.153** (0.057)   -0.169** (0.063)   -0.160** (0.059)   -0.153** (0.057)   -0.153** (0.057)   -0.153** (0.057)
  Summer workshop                             0.074* (0.038)     0.069* (0.038)     0.077* (0.039)     0.074* (0.038)     0.074* (0.038)     0.074* (0.038)
  Expert coaching                             0.053 (0.051)      0.016 (0.049)      0.051 (0.053)      0.053 (0.051)      0.053 (0.051)      0.053 (0.051)
  PD leaders – researchers                    -0.037 (0.038)     -0.025 (0.034)     -0.040 (0.038)     -0.037 (0.038)     -0.037 (0.038)     -0.037 (0.038)
  N effect sizes                              231                221                216                231                231                231
  N studies                                   86                 80                 79                 86                 86                 86
  ρ a                                         0.80               0.80               0.80               0.50               0.70               0.90
Between-study effects: Number of PD formats
  Number of PD features                       0.023 (0.024)      0.029 (0.026)      0.022 (0.024)      0.023 (0.024)      0.023 (0.024)      0.023 (0.024)
  N effect sizes                              237                227                222                237                237                237
  N studies                                   89                 83                 82                 89                 89                 89
  ρ a                                         0.80               0.80               0.80               0.50               0.70               0.90
Note: Standard errors in parentheses. All models include only studies and/or treatment arms with a professional development component. All models include controls for: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a ρ represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.

Table C12

Sensitivity tests: RVE results including curriculum materials features as moderators

                                                        Preferred         Excluding         Excluding         Alternative values of correlation between
                                                        model             non-US studies    non-RCT studies   effect sizes within studies
                                                                                                              ρ = 0.50         ρ = 0.70         ρ = 0.90
Between-study effects
  Implementation guidance                               0.062 (0.056)     0.028 (0.051)     0.057 (0.060)     0.062 (0.056)    0.062 (0.056)    0.062 (0.056)
  Laboratory/hands-on experience, curriculum kits       -0.026 (0.054)    -0.028 (0.059)    -0.026 (0.056)    -0.026 (0.054)   -0.026 (0.054)   -0.026 (0.054)
  Curriculum dosage (Number of hours)                   0.000 (0.001)     0.000 (0.001)     0.000 (0.001)     0.000 (0.001)    0.000 (0.001)    0.000 (0.001)
  N effect sizes                                        193               185               175               197              197              197
  N studies                                             77                73                69                78               78               78
  ρ a                                                   0.80              0.80              0.80              0.50             0.70             0.90
Note: Standard errors in parentheses. All models include only studies and/or treatment arms with new curriculum materials. All models include controls for: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a ρ represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.

Table C13

Descriptions of studies published before and after 2004

                                                                   Code present a
Code                                                               Studies published    Studies published        p-value of
                                                                   pre-2004 (N = 9)     2004 or later (N = 86)   difference e
Effect Size (Hedges’ g)                                            0.360                0.224                    0.023
Intervention Type
  Professional development only                                    22%                  22%                      0.993
  Professional development and curriculum materials                67%                  76%                      0.563
  Curriculum materials only                                        22%                  8%                       0.173
Research Design and Sample Characteristics
  RCT                                                              56%                  94%                      0.001
  Subject matter – Math                                            89%                  62%                      0.107
  Preschool                                                        11%                  20%                      0.533
Effect Size Type
  Intervenor developed test                                        69%                  51%                      0.063
  State standardized test                                          3%                   19%                      0.039
  Other standardized test                                          28%                  31%                      0.743
  Adjusted for covariates                                          10%                  84%                      <0.001
PD Format b
  Same-school collaboration                                        67%                  74%                      0.620
  Implementation meetings                                          33%                  35%                      0.927
  Online professional development                                  0%                   20%                      0.144
  Summer workshop                                                  67%                  53%                      0.437
  Expert coaching                                                  0%                   22%                      0.117
  PD led by researchers/intervention developers                    78%                  65%                      0.449
PD Timing and Duration c
  Contact hours                                                    56 hours             44 hours                 0.371
  Timespan over which professional development was conducted:
    Less than one week                                             25%                  14%                      0.419
    One week                                                       13%                  3%                       0.148
    One month (8 days to 30 days)                                  0%                   4%                       0.578
    One semester (31 days to 4 months)                             13%                  13%                      0.980
    One year                                                       13%                  53%                      0.031
    More than one year                                             38%                  14%                      0.090
PD Focus b
  Generic instructional strategies                                 22%                  9%                       0.234
  How to use curriculum materials                                  78%                  74%                      0.828
  Integrate technology                                             0%                   12%                      0.284
  Content-specific formative assessment                            22%                  16%                      0.654
  Improve content knowledge/Pedagogical content
    knowledge/How students learn                                   56%                  55%                      0.959
PD Activities b
  Review sample student work                                       22%                  15%                      0.583
  Observed demonstration                                           0%                   38%                      0.021
  Solved problems/Worked through student materials                 44%                  42%                      0.883
  Developed curriculum/lesson plans                                22%                  19%                      0.795
  Reviewed own student work                                        0%                   12%                      0.281
Curriculum Materials d
  Laboratory/Hands-on experience or curriculum kits                22%                  29%                      0.669
  Curriculum dosage (hours)                                        71.2                 66.2                     0.850
  Curriculum proportion replaced (percent)                         87%                  91%                      0.651
  Implementation guidance                                          22%                  45%                      0.186
Note: N = 95 studies. a Figures show the percent of studies featuring the row code for binary variables, or the sample average calculated at the study level for continuous variables (e.g., contact hours and curriculum dosage). For studies that had the feature present in one treatment arm but not another, the code is counted as present if it is present in any treatment arm (e.g., a study with one treatment arm including only curriculum materials and a second treatment arm including both professional development and curriculum materials would be included in both rows). b Codes for PD focus, activities, and format were counted as "Not Present" for studies that did not involve a PD component. c PD timing/duration excludes studies without a PD component. d Codes for features of interventions involving new curriculum materials were counted as "Not Present" for studies that did not involve new curriculum materials. e p-values from t-tests. All t-tests were conducted at the study level, except t-tests for variables in the Effect Size Type category, which were conducted at the effect size level.
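As a concrete illustration of the comparison described in note e, p-values of this kind can be obtained from a two-sample t-test on study-level values. The sketch below uses made-up study-level effect sizes and assumes the default pooled-variance test, so it illustrates the mechanics only, not the values reported above.

from scipy import stats

# Hypothetical study-level mean effect sizes (Hedges' g) for the two groups;
# the actual analysis would use one value per study (9 pre-2004, 86 post-2004).
pre_2004 = [0.45, 0.30, 0.38, 0.25, 0.52, 0.31, 0.40, 0.29, 0.34]
post_2004 = [0.20, 0.15, 0.31, 0.05, 0.28, 0.22, 0.18, 0.25, 0.30, 0.12]

t_stat, p_value = stats.ttest_ind(pre_2004, post_2004)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")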
