Educational Psychologist, 2020, Vol. 55, No. 2, pp. 88–105. https://doi.org/10.1080/00461520.2020.1744150. Published online: 03 Apr 2020.

Dynamic measurement: A theoretical–psychometric paradigm for modern educational psychology

Denis Dumas (Department of Research Methods and Information Science, University of Denver), Daniel McNeish (Department of Psychology, Arizona State University), and Jeffrey A. Greene (Learning Sciences and Psychological Studies, University of North Carolina at Chapel Hill)

ABSTRACT
Scholars have lamented that current methods of assessing student performance do not align with contemporary views of learning as situated within students, contexts, and time. Here, we introduce and describe one theoretical–psychometric paradigm—termed dynamic measurement—designed to provide a valid representation of the way students respond to school-based instruction by estimating the learning potential of those students given their observed learning trajectory. We examine the century-long history of dynamic measurement, from its nascent theoretical beginnings, through its limited initial implementations, to the current development of a specialized modeling framework that allows it to be applied to large-scale educational data. We illustrate how dynamic measurement models (DMMs) can realize the goals of modern learning theory and measurement using a large longitudinal dataset of mathematics assessment data (i.e., 11,368 students nested within 98 schools and measured at 6 time points). This historical review and methodological demonstration show the value of dynamic measurement, including better modeling of learning and of contextually dependent influences on student learning.

Modern educational psychology theory and research are built upon the assumption that educational outcomes are the result of dynamic interactions among intraindividual factors (e.g., prior knowledge, motivation) and the contexts in which people develop, learn, are taught, and perform (Bronfenbrenner & Morris, 2006; National Academies of Sciences, Engineering, and Medicine, 2018). Most researchers and educators espouse the beliefs that educators should focus on student potential and that teaching should target students' zone of proximal development, dynamically adapting as students internalize instruction and improve their performance (Vygotsky, 1931/1997). These beliefs imply that student abilities change and grow over time, and that different students can display differing trajectories of growth in response to instruction, context, and history (Alexander et al., 2009). By extension, judgments about students' future potential based upon a single observation of their ability, without information about student learning trajectories or the contexts from which those learning trajectories emerged, may perpetuate social inequities (Haertel, 2018; NRC & NAE, 2010). Yet, despite the acknowledgment that an individual student's performance is a complex emulsion of potential and opportunity, educational practice and policy continue to rely upon static or oftentimes resource-intense assessments that fail to properly account for the effects of context, and likely underestimate students' capacity to learn in the future.

Here, we review and describe one recently developed methodological procedure—dynamic measurement models (DMM; Dumas & McNeish, 2017; McNeish et al., 2020)—as one potential solution to the psychometric and resource challenges that have otherwise hampered educational psychology's aspirational goals of fully accounting for the interactions of student individual differences, history, and context that result in performance. DMMs provide researchers and educators a powerful, flexible, and scalable way to assess not just what students can do on a given day, in a given context, with a given assessment, but also how their history and context led to that performance. DMMs also estimate what students could achieve in the future if their current circumstances persist and, critically, provide some insight into how that future could be different if instruction were optimized to their potential.

In this article, first we provide a century-spanning history of dynamic assessment, reviewing its potential to realize educational psychology's theoretical and aspirational goals, as well as the challenges that limited its realization. Then we introduce DMM, showing how this modern psychometric and statistical technique can provide better support for inferences about learning and education from large-scale test data, including inferences about how former and current conditions have shaped individual performance. Finally, we describe new directions for educational psychology theory, measurement, and advocacy that leverage DMMs to address pressing questions in research (e.g., capturing interactions among individuals and context), measurement (e.g., modeling learning capacity as a function of multiple assessments with non-linear trajectories), policy (e.g., shaping contexts to foster optimal growth), and critical thought (e.g., leveraging multiple methods and theories to advance the goals of social justice).

A few caveats are necessary before we begin our review. First, we do not seek to advance any particular existing theory of learning within the educational psychology literature (e.g., constructivism, situated learning). Instead, we follow Alexander et al. (2009) in conceptualizing learning pragmatically, as a phenomenon that occurs in ways that are described across each extant theory, and perhaps cannot be specifically described across all learning instances by any existing theory. In doing this, we use the term learning to mean a change in student performance in response to instruction within the school context. Questions of how learning occurs and how to facilitate it are beyond the scope of this article but can be explored in novel ways using DMMs. Second, in this article we use the terms static and dynamic in specific ways. We use the term static to refer to assessment or measurement practice that targets abilities that are current at the time of testing (e.g., fourth grade mathematics in fourth grade) and tends to do so without emphasis on contextual factors that support the development of those abilities. Using that terminology, static measures could be either summative or formative, depending on the inferences made from their scores and their application in the educational setting. In contrast, we use the term dynamic to refer to measurement that targets developing constructs (e.g., fourth grade students' future capacity to learn mathematics) through a procedure that couples testing with instruction in order to observe student improvement, with an eye toward the instructional and environmental contexts that support the further growth of those developing constructs (Elliott et al., 2018). Given its focus on prospective inferences about students and how to optimize their fit with learning contexts, dynamic measurement has tended to be associated with formative assessment goals, although decisions made from dynamic assessment scores have at times been used for relatively high-stakes purposes (e.g., development of individual education plans in a special education setting; Swanson & Howard, 2005).

Historical review of dynamic measurement

In the following review and empirical example, we show that dynamic measurement practice has a clear superiority over static measurement practice when it comes to estimating students' capacity to learn. So, why then have educators, psychologists, and policy makers continued to rely on static forms of assessment? The answer lies in historical issues of resource-driven compromises, a lack of statistical tools and knowledge, and the challenges of extending psychometric models to data from repeated measurement occasions. We discuss each of those issues below to provide both an understanding of where the field is and how it can move beyond the inequities inherent in much of current measurement policy and practice.

Earliest attempts

Many of the progenitors of educational and psychological measurement and testing were aware that the inferences about learning derived from static testing methods suffered from serious issues, and could result in problematic societal consequences when used in high-stakes educational decisions. For example, as early as 1909, Alfred Binet advocated for a process-assessment paradigm in which students would be observed learning new material as a means of assessing their capacity for performance, although he never developed such a procedure. Despite Binet's later misgivings about static testing, and perhaps because of the less resource-intensive nature of static testing compared to the demands of dynamic assessment, the decade of the 1910s saw a massive growth of static testing procedures. This was especially evident in North America, where psychometric batteries developed by Thorndike et al. (i.e., the Alpha and Beta tests; Thorndike, 1919) were used to sort military recruits into positions during World War I.

After World War I ended, scholars in North America and Europe began to specifically critique the static testing paradigm that had prevailed during the war. Chief among these critics was W.E.B. DuBois. In his essay Race Intelligence, DuBois (1920/2013) laid out a specific argument for the critical need to incorporate information about the dynamic process of learning into large-scale tests, and offered the prophecy that static testing—with its conflation of current knowledge and future potential to learn—would always benefit individuals who had experienced the richest learning opportunities early in life.
In addition, DuBois made the additional point that, if students who scored lower on static tests were then sorted into contexts where they never received the instruction required to change their educational trajectory (e.g., low-expectation classrooms), the test would constitute a self-fulfilling prophecy. Such critiques called into question the existing predictive validity evidence of static test scores by illustrating how high-stakes testing practices could create a student's future, rather than predict it. Very soon after DuBois published his major critique of static testing practice, Thorndike (1921, 1924) conceded that static testing can never validly quantify a student's capacity to learn from instruction and theorized about an approach to testing that would incorporate student growth in response to school-based teaching. At the same time, Dearborn (1921) also envisioned an educational testing paradigm that involved observations of the process of learning, not just student performance. A few years later, De Weerdt (1927) offered a perspective on measurement in which the improvability of individual students in response to instruction was the construct of interest, rather than static ability. Despite these early theoretical forays into the nascent area of dynamic measurement during the 1920s, the static testing paradigm continued to gain in popularity among psychologists and educational researchers, likely in part due to the need to calculate psychometric scores by hand (Thurstone, 1926). By the 1930s in North America, static testing was the de facto accepted method for the quantification of educationally relevant constructs, despite the theoretical and moral misgivings of the creators of that method (e.g., Binet, 1909/1975; Thorndike, 1921, 1924).
However, at the same time in Western Europe, scholars were conducting major theoretical and empirical work related to dynamic measurement. For example, the German psychologist Kern (1930) demonstrated that the intelligence tests of the time showed very different results in terms of the rank-order of participant scores when the tests were administered more than once, with participant scores not reaching a relatively stable rank-order until after the fifth testing occasion. Rey (1934/2012) presented a method of testing interspersed with instruction that was designed to quantify the educability of students, similar to De Weerdt's (1927) idea of improvability, while also indicating the properties of a dynamic educational process. Contemporary with Rey's (1934/2012) work, it was the theoretical and empirical perspectives of the Russian psychologist Vygotsky and his colleagues that would allow for the fuller conceptualization of dynamic measurement as a way to answer questions about not just what students could do given their current circumstances, but also what they might achieve in the future.

Vygotsky's perspective

During the first half of the 1930s, Vygotsky (e.g., Vygotsky, 1931/1997) published the details of his theoretical perspective on education and learning. One principal feature of Vygotsky's theoretical work was the conceptualization of the critical difference in psychological meaning between student unassisted performance, as was and is typically required on static educational and psychological measures, and student performance when instructed by a more knowledgeable other.
In Vygotsky's terminology, the difference between student unassisted and assisted performance on a task was termed the zone of proximal development (ZPD; Vygotsky, 1935/2011). Although many applied usages of the theoretical ZPD exist in educational psychology, the most relevant to measurement has been in differentiating between the quantification of developed abilities that are current at the time of testing and are indicated by student unassisted performance, and developing capacities that lie in the future at the time of testing and can only be indicated by student performance in response to instruction or assistance (Grigorenko & Sternberg, 1998). Immediately following Vygotsky's major theoretical work on the ZPD in the early 1930s, Soviet educational measurement practices in the identification of intellectually disabled students shifted to focus on developing capacities, or student performance in response to instruction and assistance, instead of developed abilities (Brozek, 1972). In fact, in a move that appears to echo the 1920 argument of W.E.B. DuBois, beginning in 1936, static testing was prohibited in the Soviet Union because it was considered antithetical to the revolutionary values of the Communist party. This was because Soviets perceived the goal of static testing to be the perpetuation of class distinctions by labeling children who had not experienced educational opportunity as permanently deficient (Sternberg & Grigorenko, 2002).

The Feuerstein method

Immediately following World War II, Israeli educational psychologist Reuven Feuerstein worked with recently immigrated child survivors of Nazi concentration camps (Feuerstein et al., 1974). Many of these students had experienced extreme physical and social deprivation during their development and had, of course, missed educational opportunities that they would have otherwise received by their age. For this reason, identifying the appropriate grade level for any individual student was a complex challenge. Feuerstein initially utilized a static testing paradigm to determine students' grade level, with traditional cognitive tasks developed by Binet (Binet & Simon, 1908/1948) or Piaget (1960). However, he found that such static testing procedures systematically underestimated the appropriate level of educational challenge for students. Although at the time Feuerstein was apparently unaware of Vygotsky's theorizing about the ZPD, he had been a student of Andre Rey and therefore had been exposed to Rey's (1934/2012) conceptualization of the difference between ability and educability. From this understanding, Feuerstein concluded that his students' deprivation in the concentration camps had severely attenuated their current abilities and knowledge but had left their educability, or their potential to learn, intact.

In order to quantitatively estimate individual students' potential to learn, Feuerstein developed a testing framework that eventually came to be called Dynamic Assessment (DA), in which a student responded to test items multiple times, with targeted learning opportunities provided by a clinician, often Feuerstein himself, interspersed with testing (Feuerstein et al., 1979). Feuerstein and his colleagues adapted existing types of cognitive tasks (e.g., verbal analogies, categorization tasks) to fit the DA framework by calibrating specific instructional techniques to each type of task, and packaged both the instruction and the tasks themselves as the Learning Potential Assessment Device (LPAD; Feuerstein et al., 1979, 1987). As Feuerstein et al.'s work progressed, the term cognitive modifiability was adopted as a descriptor of the meaning of student scores on a dynamic test (Feuerstein et al., 1981).

In the U.S., Feuerstein's DA method was generally met with resistance from educational researchers (e.g., Frisby & Braden, 1992). The most common and serious objection concerned the unstandardized instruction used in DA practice; students were traditionally assessed one-on-one by a clinician or other instructor, sometimes even a child's parent (e.g., Tzuriel & Caspi, 2017). Therefore, the actual instruction or assistance that the students received as a part of DA could be very different depending on what the clinician or instructor deemed necessary for that particular child. Further, different clinicians may have differed widely in their judgments of particular learning situations and provided different assistance, even to the same student. For these reasons, it was difficult or impossible to isolate the variance in DA scores that was accounted for by the individual student, the instructor themselves, or the method of instruction used. This major critique precluded U.S. measurement experts' adoption of DA as a valid method for the quantification of student individual differences in potential to learn or cognitive modifiability, despite its general acceptance in Israel. In response, those interested in the quantification of learning capacity have attempted a series of other modified or augmented dynamic assessment methodologies in order to yield meaningfully interpretable scores.
Late 20th century improvements

Unlike the earliest attempts at dynamic measurement, which were generally focused on improving the paradigms of educational measurement for all students, DA researchers after Feuerstein were almost uniformly inspired by the desire to improve psychometric practice for those students who are at the greatest disadvantage from static tests, such as economically impoverished students or those students with intellectual disabilities. In this way, DuBois's (1920/2013) prediction that static testing would not support efforts to create educational equality seems to have taken root at the heart of what DA meant, especially in the U.S., during the latter part of the 20th century. One example of this social-justice oriented pattern was the research program of Milton Budoff who, beginning in the late 1960s and continuing through the 1980s, worked to uncover how intellectually disabled students may be able to achieve better outcomes in school instructional contexts than individually administered and static intelligence tests would imply (i.e., the difference between developed abilities and developing learning capacities; Budoff & Friedman, 1964; Budoff & Pagell, 1968). Budoff (1987a, 1987b) developed a number of dynamic tests designed to measure learning potential, in which students were administered test stimuli multiple times separated by instruction. Similarly to Feuerstein, Budoff developed these measures by dynamizing existing cognitive tasks such as Raven's Progressive Matrices (RPM; Raven, 1941), which essentially entailed designing instruction specifically to support improvement on the RPM, and then administering the items interspersed with that instruction while observing student growth in order to quantify student learning capacity.

In a more practice-oriented line of work, the approach of Haywood (2008) and Lidz (1992) was more specifically focused on evaluating the efficacy of instructional methods for a particular student (Haywood & Lidz, 2006), especially regarding the application of DA to school-psychology contexts in which an individualized education plan is being written for a particular student. In a more specifically psychometric form of DA, the testing-the-limits method (Carlson & Wiedl, 1978, 1979) adapted existing cognitive measures (e.g., RPM; Cattell Culture-Fair Test; Cattell, 1979) to be dynamic using systematic instruction and assistance. In an effort to create comparable dynamic scores across individual students, the testing-the-limits method incorporated very specific graduated levels of instruction and assistance, which allowed researchers to make more specific inferences about student cognitive processing (Carlson & Wiedl, 1992). Interestingly, despite specifically measuring a developing construct, the researchers who created this method conceptualized it entirely within an information processing theory approach, and generally eschewed more social cognitive or situated explanations of the learning phenomenon (Kliegl et al., 1989). One indicator of this theoretical viewpoint is the use of the term cognitive plasticity to describe the target construct (Baltes & Kliegl, 1992).

Contemporary with the testing-the-limits approach but featuring a much longer period of instruction for students (i.e., up to seven days), as well as a more qualitative analysis of the errors that students made on test items, the LernTest approach was developed in Germany with the goal of determining the gaps in students' understanding or cognitive processes (Guthke, 1992; Guthke & Stein, 1996). The LernTest approach often utilized previously created cognitive stimuli (e.g., RPM), but administered those stimuli multiple times interspersed with instruction (Guthke & Beckmann, 2000). In addition, the LernTest approach was often specifically focused on social-justice related applications of educational testing, including the identification of students with high learning potential from historically marginalized populations (e.g., Hessels & Hamers, 1993).

During the mid-1990s, Swanson (1995a, 1995b) developed a program of research aimed at uncovering the ways in which static measurement with cognitive batteries underestimated the potential of students with learning disabilities to perform in school. From this general body of research, the Swanson Cognitive Processing Test (SCPT; Swanson, 2000) was developed. As with nearly every other major dynamic test, the SCPT is essentially a package of previously existing cognitive stimuli, including verbal and visuospatial working memory tasks, coupled with instructional methods designed to support students' improvement on those tasks. However, it is worth noting that the SCPT distinguishes itself from other previous dynamic tests in that it is currently distributed by a major test publisher, and as such is available for wide use in practical settings for a fee.
The general construct of interest on the SCPT is termed processing potential (Swanson, 1995b). Swanson also pushed DA methods further than previous researchers by defining a number of component scores for the general construct of processing potential (e.g., gain scores, residualized scores; Swanson, 2000). In addition, Swanson et al. were more closely concerned with some traditional psychometric evidence (e.g., reliability indices) about their test scores than were previous DA researchers. In addition to reliability information, Swanson et al. also systematically investigated the validity-related correlations among their dynamic scores and other static tests, finding that the addition of their gain or residualized scores into various predictive models significantly improved the explained variance in school achievement. It should be noted that, although many instances of DA required only one day of testing, the test items were interspersed with instruction and practice opportunities for students, meaning that, within one day of testing, multiple observations of student abilities (and therefore student improvement and growth) were able to be recorded.

In the final years of the 20th century and first years of the 21st, Sternberg et al. (e.g., Grigorenko & Sternberg, 1998; Sternberg et al., 2002) conducted a series of investigations into the efficacy of DA to reveal learning potential that would be otherwise hidden using traditional static testing methods. Sternberg et al. (2002) endeavored to directly reevaluate early 20th century claims, based on evidence from static tests, about the low intellectual potential of Black sub-Saharan Africans. After applying a DA method, Sternberg et al. (2002) were able to provide specific evidence that static testing procedures had indeed failed to uncover the learning potential of sub-Saharan African children who had limited experience with Western-style schooling.
Recent innovations

The first two decades of the 21st century have seen an expansion of systematic work designed to measure developing constructs (e.g., Poehner & van Compernolle, 2011). Although the DA methodological perspective certainly cannot be said to have moved fully into the mainstream of educational research (Stringer, 2018), some concerted efforts have been made to make DA simultaneously more feasible and standardized (e.g., by offering computerized instruction; Yang & Qian, 2019). For example, DA researchers from around the world, including and especially the Dutch research team led by psychologist Wilma Resing (e.g., Resing, 2013), have made major strides in identifying solutions to the more finely grained problems related to the process of DA, such as identifying the particular sequences of graduated prompts for specific cognitive tasks that demonstrate the greatest usefulness for the most students (Veerbeek et al., 2019), the identification of the shifts in strategic processing that drive student growth throughout a DA (Resing et al., 2009), and the capability of DA scores to predict subsequent classroom performance over and above preexisting individual differences (Stevenson et al., 2013). Another major methodological improvement in DA research in recent years has been the use of item-response theory (IRT) to score student responses across time (Stevenson et al., 2013).
Previous methods incorporating classical test theory have struggled to demonstrate reliable and valid psychometric properties for dynamic scores, especially those designed to indicate change over time, an issue that has been discussed in the literature for decades (Cronbach & Furby, 1970). But with longitudinal IRT, scores from individual time-points of assessment can be appropriately scaled over time, allowing for analysis of growth over the course of instruction.

One serious tension in the DA literature over the course of the 20th century was the issue of standardization of the instruction in order to make the dynamic scores relatively comparable across students (e.g., Budoff, 1987a; Swanson, 1995b). In the 21st century, this issue has begun to be addressed through the incorporation of automated and computerized instruction as part of a DA, allowing either all students to receive precisely the same instruction, or for that instruction to be adapted to student ability in a way that is specifically reportable and reproducible (Poehner et al., 2015). In addition, the incorporation of virtual-reality systems has shown promise in improving the standardization and decreasing the cost of administering DA to students (Passig et al., 2016). The last two decades have also seen a major expansion in the types of populations that are targeted using DA. Although the original DA foci of students with intellectual disabilities or other social disadvantages have remained (Aravena et al., 2016), the scope of DA work has widened substantially to include gerontological populations (Navarro & Calero, 2009), students with language deficits early in development (e.g., deaf children; Lederberg & Spencer, 2009), and students who are identified for gifted education (Calero et al., 2011), among others. In addition, the scope of socially nondominant or economically disadvantaged populations that have been included in DA research has also widened, especially as concerns indigenous populations around the world (Chaffey, 2009).

Also in the last 20 years, the field of special education and intellectual disability identification has expanded from its historic focus on static test scores to a response-to-intervention approach (RTI; Fuchs & Fuchs, 2006). For many, RTI is essentially synonymous with DA (e.g., Grigorenko, 2009), although the researchers who use only one of the two terms in their work sometimes have different emphases and foci (Lidz & Peña, 2009). In any case, the explicit mission of RTI is to use a student's responsiveness to the instruction they receive, rather than that student's static test scores, as the indicator of learning disability for sorting into special education programs (Compton et al., 2010). As such, both the DA and RTI perspectives use "instruction as test" (Fuchs et al., 2011, p. 339) to quantify the degree to which a student improves over the course of instruction.

However, RTI methodology features many of the same problematic issues as DA in terms of disentangling student responsiveness (i.e., a student individual difference) from the effectiveness of the instruction administered. Moreover, analytical issues with both DA and RTI include the need for a formal psychometric framework for producing and interpreting the quality of dynamic scores, as well as the ongoing need to conceptualize the functional form (i.e., the shape of the growth trajectory) learning takes over the course of an intervention. One advanced methodological take on RTI was recently published by Bose et al. (2019), who used a random effect mixture model to quantify the slope of growth for individual students within an RTI sample, and then identified subclasses of students based on that growth in order to provide more targeted recommendations regarding student needs. This very recent work highlights the promise of RTI, but the fact remains that both DA and RTI have major barriers (e.g., the time and monetary cost associated with administering both instruction and assessment) to their application on a wide scale, barriers that require a shift in analytic paradigm in order to alleviate.

Summary of the barriers to adoption of dynamic measurement modeling

This review of dynamic measurement, in relation to static testing, clearly reveals the history of educational measurement in general as one of compromise between feasibility, cost, and modeling theoretical assumptions regarding the effects of context and history upon learners' current and future performance. For example, even as early as Binet (1909/1975), DuBois (1920/2013), and Thorndike (1921, 1924), the inherent shortcomings of static testing were identified. Why then the century-old pressure to compromise, and to apply static measurement methods when dynamic measurement has been perennially preferred, at least theoretically?
Importantly, DA as it has been traditionally applied is not just a little more expensive than static testing; it is much more resource intensive in terms of time and personnel, meaning that, in order to collect samples of equal size, static testing is far and away more affordable than is DA. This is because DA typically requires a clinician (e.g., a school psychologist; Haney & Evans, 1999) to individually test every student in a sample, and those clinicians must also be specifically trained in the instructional method designed for use with that particular DA. Further, although some researchers in the DA literature who worked to design dynamic measures (e.g., Swanson, 2000) reported improved predictive validity of their dynamic assessments over and above traditional static tests, that improvement in prediction of future student outcomes appears to have not been enough to justify the greatly increased cost of DA compared to static tests for most practitioners and researchers. For this reason, the cost-benefit ratio of DA has been a weakness of the method in the past, implying the need to create methods that could potentially allow for the estimation of quantities associated with DA (e.g., learning capacity) but at a greatly reduced cost to individual researchers.

Moreover, though the theory of DA is elegant and conceptually appealing, researchers have generally struggled to quantify its components into a formal statistical modeling framework (e.g., Embretson, 1987, 1991; Sijtsma, 1993). This is through no fault of the researchers working on DA, as the theory includes complex assumptions about growth, deceleration, and individual differences. Ways of modeling each of these aspects of DA have been relatively recent additions to the statistical literature: linear mixed-effects models that accommodate individual differences in change were not popularized until Laird and Ware (1982), and the extension to nonlinear models that can incorporate decelerating curves did not appear until the mid-1990s with Davidian and Giltinan (1995) and Pinheiro and Bates (1995). Software-based applications of these statistical procedures also did not appear in programs like SAS until 1999 (Wolfinger, 1999). Software packages that are most popular in educational research (e.g., SPSS) still offer no support for nonlinear mixed-effects models.

Further, traditional psychometric quantities are well-defined in the static testing context, but generalizations to multivariate contexts where examinees are repeatedly measured are not common and continue to require scholarly advances in the psychometric literature (e.g., Deonovic et al., 2018). Methods for inspecting the longitudinal invariance of tests (Meredith, 1993) and longitudinal item response theory such that dependencies across time are accounted for (Embretson, 1991) are also relatively recent additions to the psychometric literature and continue to be under-developed in comparison to the established techniques for addressing such issues in static testing. For this reason, DA methods have struggled to enter the mainstream of educational and psychological research. In order to remedy this situation, a formal psychometric paradigm designed for the quantification of developing constructs must be established and defended as tractable, scalable, and superior to static methods.

A modern approach: Dynamic measurement modeling

DMMs, which integrate the goals and practices of DA with modern psychometric and statistical methods, allow educational psychologists new ways to explore and enact their theoretical, psychometric, and aspirational goals of helping all students realize their capacity while also advancing social justice (i.e., paying special attention to the educational needs of historically marginalized students). At its core, a DA is composed of repeated assessments of a particular ability or skill, interspersed with targeted instruction designed to support student development of that ability. Although DA has historically focused on targeted assessment of individual students one-on-one with a school psychologist or other instructor, the advent of widespread longitudinal standardized testing for academic abilities (e.g., reading and math) means that schools themselves constitute a DA environment, in that they regularly assess student knowledge and abilities, and administer instruction designed to improve those abilities. When assessments are administered at multiple time points interspersed with instruction, the degree to which individual students responded to that instruction can be revealed. Given this scenario, the relatively newly developed (Dumas & McNeish, 2017, 2018) DMM paradigm capitalizes on already existing instruction, typically administered by classroom teachers, as well as existing testing programs (e.g., state or nationally sponsored longitudinal testing programs) in order to quantify the rate and capacity of these school-based dynamic learning systems for every student.

The process of fitting and interpreting a DMM can be statistically complex, but the fundamental goal of DMM is the same as DA: to provide reliably estimated scores that validly indicate learning performance and capacity both within and across students, as well as learning contexts (e.g., schools). In addition, because DMM is based on a formalized statistical framework, it better aligns with the goals of educational psychology by allowing for specifically and quantitatively testable hypotheses about contextualized student learning.

Like any other methodological development, DMM is built on technologies and processes that came before it. For example, with the benefit of hindsight, it can be shown that the well-known Rasch model is, at its statistical core, a logistic regression model with a random effect on the intercept that allows it to estimate student-specific ability scores (Strauss, 1992). In the same way, confirmatory factor analysis (CFA) can be demonstrated mathematically to be a linear regression with latent predictors, although of course such CFA methods have contributed in major ways to the research capacity of the field of educational psychology, both theoretically and methodologically (Hancock & Mueller, 2013).

Analogously, DMM is, when boiled down to its mathematical underpinnings, a nonlinear mixed-effects model that incorporates an upper asymptote on growth, although we contend it has the potential to contribute to educational psychology research in important ways. In IRT or CFA, individual indicators of student performance (i.e., items) are scored using a theoretically and statistically determined factor structure, and latent scores are predicted. In DMM, a theoretically relevant growth function (i.e., decelerating growth) is fit to indicators of student performance over time, and the upper limit, or capacity, of that growth is predicted for each student, along with the effect of any contextual or demographic covariates of interest. Of course, the mathematics, statistical formulations, and computational estimators that are utilized in DMM have all existed in various forms previously but have yet to be combined and described in a way that allows educational psychologists to explore and test the complex relations posited in their theories.

DMMs address the cost-benefit concerns associated with DA. A recently published piece (McNeish et al., 2020) has demonstrated that DMM-estimated capacity scores can predict a distal learning outcome three times as effectively (i.e., the R-square is three times the magnitude) as scores from traditional static measurement models. This suggests that DMM may be a substantial step toward providing the increase in predictive power that previously developed DAs were not necessarily capable of attaining. The DMM framework developed by McNeish and Dumas (2017, 2018, 2019), and this paper, make strides toward bringing this powerful method to educational psychologists, addressing concerns about the tractability and implementation of DA.
Though DMM has a similar name to the recently developed dynamic SEM framework (DSEM; Asparouhov et al., 2018), the two approaches have quite distinct goals, and the similarity of the name stems from the broad definition of "dynamic" across different disciplines. "Dynamic" in DSEM stems from stationary time-series modeling popularized in econometrics, whose goal is to distinguish between states and traits. These models are interested in moment-to-moment variability around a constant mean and are not designed for applications to growth modeling (see McNeish & Hamaker, 2020, pp. 2–4 for a more thorough discussion). "Dynamic" in DMM stems from its initial appearance in "dynamic assessment," whereby "dynamic" referred to the process of assessment occurring repeatedly over time, interspersed with instruction, to observe a monotonic increase in student ability. So while DSEM and DMM contain similar words in their respective names, the only similarity between the goal and data structure of the models is that they are applied to longitudinal data. In the next section, the stages of fitting and interpreting a DMM are demonstrated and explained.

Example data set: MAP Growth assessments

To demonstrate how DMM can apply modern statistical methods to embody the ideas of DA, we walk through how the model can be fit using data from MAP Growth assessments produced by the nonprofit educational testing organization NWEA.¹ NWEA has contracts with school systems in nearly every U.S. state and administers tests to millions of students each year across a number of academic domains (i.e., mathematics, reading, language usage, and science) to provide a description of growth within and between grade levels for both students and schools. MAP Growth assessments are described in NWEA (2011), and achievement and growth norms for the U.S. public school student population are provided in Thum and Hauser (2015). The test scores used here are vertically scaled Rasch Unit scores taken from NWEA's MAP Growth mathematics test. Vertically scaled means that scores from different grades can be directly compared to one another and could be used to model student growth (see Briggs & Weeks, 2009 for a full discussion of vertical scaling issues in modeling student learning and growth).

¹ Beginning in 2015, NWEA is the corporate name for Northwest Evaluation Association. MAP, Measures of Academic Progress, and MAP Growth are registered trademarks, and Northwest Evaluation Association and NWEA are trademarks of Northwest Evaluation Association in the U.S. and other countries.

For simplicity and clarity, we will demonstrate how to fit the DMM using a subset of the extant data. The subset we use contains data collected during the years 2007–2016, from students who were administered the MAP Growth mathematics tests in each year from Grade 3 to Grade 8 (so that there are no missing data on a per annum basis). We also included only those students who attended the same school (i.e., did not change schools) across all six time-points. This choice was made to simplify the data structure and to avoid methodological issues that could arise from cross-classification or multiple cluster-membership of students non-hierarchically nested within multiple schools across the time points. Such school-level cross-classification is a currently important methodological issue in the educational research literature (e.g., Leroux & Beretvas, 2018; Moerbeek & Safarkhani, 2018), and DMM is theoretically capable of handling that complexity, but as an explanatory example, we thought it prudent to avoid cross-classification in schools to allow for the clearest interpretation of DMM output and parameters. We also only retained schools which had at least 50 students tested at each of the time-points, in order to alleviate concerns about small-sample bias (McNeish, 2017). In addition, NWEA does administer MAP Growth tests to students multiple times per year, but those measurement occasions can occur at varying times throughout the school year, depending on the needs of the school or district. In order to simplify the data structure for this demonstration, only the final test administration for each student in each grade was retained, so our analytic sample included one mathematics score per year for each student. Such a methodological choice allowed us to avoid treating the time-points in the model as unstructured, therefore simplifying the estimation. Because these data constitute an explanatory example, we are not attempting to draw substantive conclusions from these data but do intend to demonstrate how substantively interesting DMM parameters could be estimated in future research work. These methodological choices resulted in an analytic sample of 68,208 tests administered to 11,368 students within 98 schools across 6 time points.

Visualizing learning growth

To visualize the patterns of growth in the MAP Growth dataset, Figure 1 shows a plot of MAP Growth mathematics score growth for individual students in gray, along with a semiparametric marginal trajectory (created from a penalized B-spline) imposed in black (i.e., each gray line is a unique student, the black line is the average). Given the size of the data, we used a random sample of 50 students to facilitate interpretation of the student-specific learning trajectories (otherwise, a plot of 11,000 trajectories would have appeared as a solid mass). Because there is much inter-student variability in the observed growth trajectories in this dataset, the average growth in the left panel of Figure 1 appears to increase almost linearly. However, the right panel shows the marginal (i.e., average) curve in isolation with an adjusted scale to demonstrate that the curve expresses clear nonlinearity over time. The marginal curve in the left panel of Figure 1 does not seem to demonstrate any inflection, meaning that it follows a J-shaped learning curve trajectory that rapidly increases during the early time-points, eventually decelerating to an upper asymptote. However, a number of specific growth functions are compatible with a J-shaped trajectory, and the next section compares the fit of these curves to the data to determine which growth function might be the most appropriate to describe the change in MAP Growth scores over time from grades 3–8.

Figure 1. Spaghetti plots of student-specific growth trajectories on the MAP Growth mathematics tests. The marginal (average) curve is superimposed in black on the left panel, and the nonlinearity of the marginal curve is highlighted in the reduced scale of the y-axis in the right panel.
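A plot in the style of Figure 1 can be produced with standard SAS graphics. The following is a minimal sketch, assuming a long-format dataset and variable names (mapmath, math, grade, studentid) that are hypothetical stand-ins rather than names used by the authors:

  /* Spaghetti plot of student trajectories with a penalized B-spline overlay. */
  /* As the authors did, first subset to roughly 50 students for readability.  */
  proc sgplot data=mapmath noautolegend;
    series x=grade y=math / group=studentid lineattrs=(color=lightgray);
    pbspline x=grade y=math / nomarkers lineattrs=(color=black thickness=2);
  run;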

Comparison of nonlinear trajectories

Any growth modeling framework, including DMM, relies on an implicit assumption that the functional form of the growth in the dataset is appropriately represented in the model (Grimm et al., 2016). For this reason, we compared the fit of three different nonlinear functions to the MAP Growth dataset in order to determine which trajectory was best-fitting. The specific three curves that were fit here were: (a) an exponential decay curve, (b) a Michaelis-Menten curve, and (c) a logistic curve (see McNeish & Dumas, 2017 or McNeish et al., 2020 for a full statistical presentation and discussion of the various nonlinear functions that are compatible with DMM). To compare the fit of these models, we began with a marginal nonlinear model using SAS Proc Nlin, which does not estimate student-specific curves, because this model is computationally efficient and can yield rough approximations of fit via the mean squared error. See Figure 2 for a visual depiction of the way these three marginal models differed in the way they fit the empirical MAP Growth mathematics data. The mean squared errors for the Michaelis-Menten and exponential curves were nearly identical (235.6 and 235.5, respectively) whereas the mean squared error for the logistic model was much higher at 266.0 (higher values indicate worse fit).

Figure 2. Fitted trajectory plot of empirical MAP Growth mathematics score means (black dots) compared to three marginal nonlinear curves.
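For reference, one common parameterization of each candidate family is sketched below, written so that β_0 plays the role of the intercept (or lower asymptote), β_C the capacity (upper asymptote), and β_R a rate parameter; these forms are illustrative, and the exact parameterizations used for DMM are given in McNeish and Dumas (2017):

  Exponential:       y_t = \beta_C + (\beta_0 - \beta_C)\exp(-\beta_R t)
  Michaelis-Menten:  y_t = \beta_0 + (\beta_C - \beta_0)\, t / (\beta_R + t)
  Logistic:          y_t = \beta_0 + (\beta_C - \beta_0) / (1 + \exp[-\beta_R (t - \delta)])

Here t is time since the first measurement occasion and δ locates the inflection point of the logistic curve; all three functions rise toward β_C as t grows, which is what makes them compatible with the J-shaped trajectories seen in Figure 1.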


Because of the similarity in the fit of the Michaelis-Menten and exponential curves to these data, we fit full nonlinear mixed-effects models to our analytical dataset with each of these trajectories. The models were fit in SAS Proc Nlmixed using maximum likelihood via Gaussian quadrature with 3 quadrature points and the conjugate gradient optimization with automatic restart update to best accommodate the large size of the data (Powell, 1977). The BIC of the Michaelis-Menten model was 503,566 compared to a BIC of 503,471 for the exponential model. Lower BIC values indicate better fit, so the exponential model is preferred and was retained for further analysis. We expand our explanation of this model and its interpretation in the next section. Also note that all SAS code used to run the DMM analysis in this demonstration is available from the authors' Open Science Framework page (https://osf.io/chn95/) or from the authors via email.

Student-specific curves

The exponential mixed-effects model estimates the rate and capacity of learning for each student in the dataset. For clarity of presentation, for the moment we will only consider the two-level model where repeated measures are nested within students, ignoring that students are then nested within schools. The exponential model is a three-parameter nonlinear model with many possible parameterizations. Here, we present the parameterization previously used by Grimm et al. (2011). Following with the purposes of DMM, this parameterization features an intercept, an upper asymptote, and a rate. The intercept carries its standard interpretation: it is the expected value of the curve at the first time-point (i.e., third grade in these data). The upper asymptote is the expected value of the outcome measure (i.e., the MAP Growth mathematics test score) when time approaches infinity. As the predicted limit on the learning system, this asymptote parameter is representative of the capacity of the learning system to benefit that particular student. The third parameter is the rate, which describes how quickly the data grow from the intercept toward the asymptote as a function of time. In the two-level model, each of these parameters is given a student-specific random effect, which allows every student to have potentially unique values. The random effects are modeled to have an unstructured covariance matrix, meaning that random effects can covary (e.g., students who have higher intercepts might tend to also have higher asymptotes, depending on the empirical data). We initially fit the model with unique variances at each time-point, but the estimated variances were very similar, which is also corroborated visually by the spaghetti plot in the left panel of Figure 1. Therefore, in order to simplify the estimation of the model, we fit the model so that the time-point specific residual variances were equal across time.

In multilevel model notation, this model can be written as

  \mathrm{Math}_{ti} = \beta_{Ci} + (\beta_{0i} - \beta_{Ci}) \exp(-\beta_{Ri}\,\mathrm{Grade}_t) + \epsilon_{ti}
  \beta_{0i} = \gamma_0 + u_{0i}
  \beta_{Ci} = \gamma_C + u_{Ci}                                   (1)
  \beta_{Ri} = \gamma_R + u_{Ri}

where

  \mathbf{u}_i \sim \mathrm{MVN}\left(\mathbf{0},\ \begin{bmatrix} \tau_0 & & \\ \tau_{C0} & \tau_C & \\ \tau_{R0} & \tau_{RC} & \tau_R \end{bmatrix}\right)    (2)

  \boldsymbol{\epsilon}_i \sim \mathrm{MVN}(\mathbf{0},\ \sigma^2 \mathbf{I})    (3)

The first expression in Equation (1) shows that the MAP Growth mathematics score for student i at time t is equal to the ith student's intercept β_0i plus the difference between the ith student's capacity and their intercept (β_Ci − β_0i) times one minus e (the irrational constant approximately equal to 2.718) raised to the grade at time t times the opposite of student i's rate parameter β_Ri, plus a time-point specific residual error term ε_ti; this is an algebraically equivalent way of writing the first line of Equation (1). The DMM described by Equation (1) can also be depicted as a conceptual path model within the SEM framework (Figure 3). Inspection of Figure 3 will reveal that the inclusion of a theoretically determined and statistically estimated upper asymptote (i.e., learning capacity) sets this model apart from more conventional growth models.
It should also be noted that fitting this model in SEM programs such as Mplus or lavaan requires linearizing the model, a process that can obscure the interpretation of student-specific score estimates (Blozis & Harring, 2018). For this reason, we focus on the mixed-effects framework in this article.
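As a concrete sketch of that mixed-effects specification, the two-level model in Equation (1) could be set up in SAS Proc Nlmixed roughly as follows. The dataset and variable names (mapmath, math, grade3, studentid) are hypothetical stand-ins, the starting values are loosely based on Table 1, and this simplified version omits the Cholesky parameterization the authors report using for numerical stability; their production code on the OSF page cited above is the authoritative version:

  proc nlmixed data=mapmath qpoints=3 tech=congra update=pb;
    /* fixed effects, random-effect (co)variances, and residual variance */
    parms gC=260 g0=200 gR=0.2
          t0=90 tC0=0 tC=150 tR0=0 tRC=0 tR=0.01
          s2e=200;
    /* student-specific growth parameters (Equation 1) */
    b0 = g0 + u0;    /* intercept at Grade 3         */
    bC = gC + uC;    /* capacity (upper asymptote)   */
    bR = gR + uR;    /* rate of approach to capacity */
    /* grade3 = grade - 3, so time runs 0 through 5 for Grades 3-8 */
    mean = bC + (b0 - bC)*exp(-bR*grade3);
    model math ~ normal(mean, s2e);
    random u0 uC uR ~ normal([0,0,0],
                             [t0,
                              tC0, tC,
                              tR0, tRC, tR]) subject=studentid;
    /* empirical Bayes capacity score for each student */
    predict bC out=capacity_scores;
  run;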

Figure 3. Conceptual path model of the DMM described by Equation (1), with 5 hypothetical time-points. This conceptual path diagram depicts how this DMM can be conceptualized and visualized within the SEM framework. If fitting the model in the structural equation modeling framework as a structured latent curve model, the loadings would be constrained to values based on the first partial derivatives with respect to the latent variable of interest. It should be noted that, in this article, the DMMs were fit within the mixed-effects framework and not as structured latent curves, but this figure is included here to aid in conceptualizing the general configuration of the model and to suggest how it could be fit in a separate application setting.
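To make the loading constraints mentioned in the Figure 3 caption explicit, the first partial derivatives of the exponential function in Equation (1) are as follows (a sketch of the standard structured-latent-curve result; see Blozis & Harring, 2018 for the general treatment):

  \partial \mathrm{Math}_t / \partial \beta_0 = \exp(-\beta_R\,\mathrm{Grade}_t)
  \partial \mathrm{Math}_t / \partial \beta_C = 1 - \exp(-\beta_R\,\mathrm{Grade}_t)
  \partial \mathrm{Math}_t / \partial \beta_R = -\mathrm{Grade}_t\,(\beta_0 - \beta_C)\exp(-\beta_R\,\mathrm{Grade}_t)

Evaluated at the fixed-effect estimates, these derivatives supply the time-specific loadings for the intercept, capacity, and rate factors, respectively.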

Table 1. Estimated parameters from the two-level exponential dynamic measurement model for MAP Growth mathematics data.

  Parameter                  Notation          Estimate
  Fixed effects
    Capacity                 γ_C               259.35
    Intercept                γ_0               199.53
    Rate                     γ_R               0.19
  Random effect variances
    Capacity                 τ_C               157.47
    Intercept                τ_0               94.18
    Rate                     τ_R               0.0039
  Random effect correlations
    Intercept, Capacity      Corr(u_0, u_C)    0.85
    Intercept, Rate          Corr(u_0, u_R)    0.11
    Capacity, Rate           Corr(u_C, u_R)    0.44

The second expression in Equation (1) shows that the ith student's growth parameters are equal to a fixed effect that captures the average of the whole sample (γ) plus a student-specific random effect (u_i). Equation (2) shows that these student-specific random effects are not estimated directly, but rather are assumed to follow a multivariate normal distribution with a mean vector of 0 and an unstructured covariance matrix to capture how much between-student variability there is in the growth parameters. Equation (3) shows that the time-point specific residuals are assumed to come from a multivariate normal distribution with a 0 mean vector and a constant variance, σ². Because the growth parameters are on different scales, we used a Cholesky decomposition (Kohli et al., 2019) when estimating the random effect covariance matrix in order to provide improved numerical stability.

Table 1 shows the parameter estimates from this two-level model. For example, results show that students display an average intercept around 199.53 on the MAP Growth mathematics score scale, and their average capacity is estimated to be 259.35. The average rate of learning in this dataset is 0.19, indicating that 19% of the remaining difference between the current outcome value and the asymptote is achieved for each one-unit increase in time.² In order to visualize these patterns, Figure 4 displays student-specific DMM growth functions across a random subset of 50 students in our analytic dataset, with the marginal growth function overlain. As can also be seen in Table 1, the between-student variance in the asymptotes is 157.47, the between-student variance in the intercepts is 94.18, and the between-student variance in the rates is 0.0039. The results also show that there is a strong correlation between student mathematics scores in Grade 3 and the estimated eventual capacity of their mathematics learning (0.85), meaning that students who have high scores in Grade 3 are also estimated to reach higher achievement in the future. There is a moderate correlation between their rate and their capacity (0.44), meaning that students who exhibit faster growth tend to have higher asymptotes. The correlation between the intercepts and rates was rather weak (0.11), indicating that there is not much of a relation between students' mathematics scores in Grade 3 and how fast they grow in their mathematics skill throughout elementary and middle school; this correlation was statistically significant, but this is likely attributable to the very large sample.

Figure 4. A random sample of 50 student-specific fitted trajectory plots (in gray) from the exponential model and the marginal exponential curve superimposed (in black).

² To further explain how the rate parameter works in an exponential decay model: at 3rd grade, the difference between the expected outcome and the asymptote is 259.4 − 199.5 = 59.9. For a one-unit increase in time (i.e., by Grade 4), 19% of this difference will be achieved. That is, there will be 0.19 × 59.9 = 11.38 points of growth at Grade 4, on average. At Grade 4, the distance to the asymptote is now 259.4 − (199.5 + 11.38) = 48.52. By Grade 5, 19% of this remaining difference will be achieved, or 48.52 × 0.19 = 9.22 points, on average. The amount of growth changes across time but the proportional change of the remaining difference is constant.

McNeish and Dumas (2018) derived a reliability coefficient for DMM asymptote scores. Similar to the IRT context, the reliability of capacity scores is conditional in that it can change across the range of possible values. The conditional reliabilities can also be integrated to yield a single marginal reliability. Generally, the reliability of the capacity scores in this dataset was good, with values falling roughly between 0.80 and 0.90 across the range of capacity scores and a marginal reliability of 0.855.

In practical terms, DMMs provide several advantages over static measurement models. First, they incorporate multiple measurement occasions to better account for learning over time and context, as opposed to static snapshots of performance. Second, by taking both inter- and intra-student differences across time into account, the DMM is able to estimate three parameters (i.e., intercept, rate, and capacity) that can be interpreted as indicators of the quality of student learning in response to their context and instruction (see Dumas & McNeish, 2017, 2018 for evidence of the consequential validity of DMM scores). Because these three parameters are estimated with random effects, the generation of student-specific scores on each of them (especially learning rates and capacities, which are expected to be theoretically important) can be readily accomplished. From the inspection of these student-specific scores based on the rate and capacity random effects, substantive researchers can make inferences about student response to instruction, learning capacity, and the contextual or psychological influences that shape them. Such indicators account for context in ways previous methods aspired to but could not achieve.
In order to visualize these patterns, Figure 4 displays student-specific DMM growth functions across a random subset of 50 students in our analytic dataset, with the marginal growth function overlain.

Figure 4. A random sample of 50 student-specific fitted trajectory plots (in gray) from the exponential model and the marginal exponential curve superimposed (in black).

As can also be seen in Table 1, the between-student variance in the asymptotes is 157.47, the between-student variance in the intercepts is 94.18, and the between-student variance in the rates is 0.0039. The results also show a strong correlation between students' mathematics scores in Grade 3 and the estimated eventual capacity of their mathematics learning (0.85), meaning that students who have high scores in Grade 3 are also estimated to reach higher achievement in the future. There is a moderate correlation between students' rates and capacities (0.44), meaning that students who exhibit faster growth tend to have higher asymptotes. The correlation between the intercepts and rates was rather weak (0.11), indicating that there is not much of a relation between students' mathematics scores in Grade 3 and how fast they grow in their mathematics skill throughout elementary and middle school; this correlation was statistically significant, but that is likely attributable to the very large sample.

McNeish and Dumas (2018) derived a reliability coefficient for DMM asymptote scores. Similar to the IRT context, the reliability of capacity scores is conditional in that it can change across the range of possible values. The conditional reliabilities can also be integrated to yield a single marginal reliability. Generally, the reliability of the capacity scores in this dataset was good, with values falling roughly between 0.80 and 0.90 across the range of capacity scores and a marginal reliability of 0.855.

In practical terms, DMMs provide several advantages over static measurement models. First, they incorporate multiple measurement occasions to better account for learning over time and context, as opposed to static snapshots of performance. Second, by taking both inter- and intra-student differences across time into account, the DMM is able to estimate three parameters (i.e., intercept, rate, and capacity) that can be interpreted as indicators of the quality of student learning in response to context and instruction (see Dumas & McNeish, 2017, 2018 for evidence of the consequential validity of DMM scores). Because these three parameters are estimated with random effects, the generation of student-specific scores on each of them (especially the learning rates and capacities, which are expected to be theoretically important) can be readily accomplished. From the inspection of these student-specific scores based on the rate and capacity random effects, substantive researchers can make inferences about student response to instruction, learning capacity, and the contextual or psychological factors that shape them. Such indicators account for context in ways previous methods aspired to but could not achieve. In addition, by allowing the random effects to covary, DMM also specifically incorporates the inherent dependency among the knowledge that students bring to schooling, their improvement over the course of instruction, and their estimated capacity to benefit from instruction as time progresses. Finally, as we describe in detail next, when school-level contextual variables are included in a DMM, additional advantages are manifest, including the ability to realize Vygotsky's conceptualization of the ZPD and to quantify the effects of context on students' current performance and potential.
Incorporating school-level context

Several extant theories of learning emphasize the effect of the environment on student learning (Cobb & Bowers, 1999), but partitioning the variance explained by the individual learner and the instructional context has been a major challenge in previous dynamic assessment work (Frisby & Braden, 1992). In the analytic sample used for demonstration purposes here, students are nested within schools, and therefore a three-level DMM can be fit in order to partition the variance in the intercepts, rates, and capacity scores between individual-student-level and school-level variance components. The three-level model shown in Equation (4) is similar to the two-level model presented in Equation (1) in that the growth trajectory similarly follows an exponential form. The difference between Equation (1) and Equation (4) is that each of the growth parameters now contains two random effects to partition the variance into student-level and school-level components: the u_ij terms represent student-level random effects, whereas the r_j terms represent the school-specific random effects. The random effects within a level (e.g., student or school) were allowed to covary, but the random effects across levels are assumed to be independent (Humphrey & LeBreton, 2019). Therefore, we modeled the student-level random effects with an unstructured covariance matrix (Equation (5)) and the school-level random effects with a separate unstructured covariance matrix (Equation (6)). As in the two-level model, the time-point-specific residuals are modeled with a constant variance over time (Equation (7)).

$$\mathrm{Math}_{tij} = \beta_{Cij} + (\beta_{0ij} - \beta_{Cij})\exp(-\beta_{Rij}\,\mathrm{Grade}_t) + e_{tij}$$
$$\beta_{0ij} = \gamma_0 + u_{0ij} + r_{0j} \qquad \beta_{Cij} = \gamma_C + u_{Cij} + r_{Cj} \qquad \beta_{Rij} = \gamma_R + u_{Rij} + r_{Rj} \tag{4}$$

where

$$\mathbf{u}_{ij} \sim \mathrm{MVN}\left(\mathbf{0},\; \begin{bmatrix} \tau_0 & & \\ \tau_{C0} & \tau_C & \\ \tau_{R0} & \tau_{RC} & \tau_R \end{bmatrix}\right) \tag{5}$$

$$\mathbf{r}_{j} \sim \mathrm{MVN}\left(\mathbf{0},\; \begin{bmatrix} \upsilon_0 & & \\ \upsilon_{C0} & \upsilon_C & \\ \upsilon_{R0} & \upsilon_{RC} & \upsilon_R \end{bmatrix}\right) \tag{6}$$

$$\mathbf{e}_{ij} \sim \mathrm{MVN}(\mathbf{0},\; \sigma^2\mathbf{I}) \tag{7}$$

The parameter estimates from this three-level model are shown in Table 2.

Table 2. Estimated parameters from three-level exponential dynamic measurement model for MAP Growth mathematics data.

Parameter                          Notation          Estimate
Fixed effects
  Capacity                         γC                253.44
  Intercept                        γ0                198.25
  Rate                             γR                0.25
Student-level random effect variances
  Capacity                         τC                157.63
  Intercept                        τ0                82.15
  Rate                             τR                0.0026
Student-level random effect correlations
  Intercept, Capacity              Corr(u0, uC)      0.76
  Intercept, Rate                  Corr(u0, uR)      0.08
  Capacity, Rate                   Corr(uC, uR)      0.24
School-level random effect variances
  Capacity                         υC                234.88
  Intercept                        υ0                14.84
  Rate                             υR                0.0047
School-level random effect correlations
  Intercept, Capacity              Corr(r0, rC)      0.58
  Intercept, Rate                  Corr(r0, rR)      −0.24
  Capacity, Rate                   Corr(rC, rR)      −0.85

The three-level model affords the opportunity to determine the proportion of the variance in the intercept, capacity, and rate parameters that is attributable to schools and to students. This is conceptually similar to calculating the intraclass correlation, except that the traditional intraclass correlation is defined in terms of an unconditional model that includes only an intercept and no growth parameters.

In the case of the three-level DMM fit here, the total amount of variance in the capacity estimates is 392.51, which is calculated by adding the variance at both the student and school levels (i.e., 157.63 + 234.88). The percentage of variance in capacity scores attributable to the student level is 157.63/392.51 = 40.2%, whereas the percentage attributable to the school level is 234.88/392.51 = 59.8%. For the particular subset of students used in this analysis, this finding demonstrates that the total variance in the eventual asymptote of learning in mathematics is driven more by school-level characteristics than by student-level characteristics, although student-level characteristics still account for a sizeable proportion of the variance. To attribute these between-school differences to specific school attributes (e.g., curricular choices), school-level covariates would need to be included in the model, a possibility that is highlighted in the Future Directions section of the current article. A similar partition is found for the rate parameters, where 64% (i.e., 0.0047/[0.0047 + 0.0026]) of the variance is attributable to the school level. Because the intercept occurs very early in students' educational careers, at third grade, the large majority of its variance is attributable to student-level characteristics (85%; 82.15/[82.15 + 14.84]).
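As a quick check on this partitioning arithmetic, the short Python sketch below (ours, for illustration) recomputes the school- and student-level shares from the Table 2 variance estimates; each share is analogous to an intraclass correlation computed on a growth parameter rather than on raw scores.

```python
# Recompute the variance partitions reported in the text from the
# Table 2 random effect variance estimates (student + school levels).

variances = {
    "capacity":  {"student": 157.63, "school": 234.88},
    "rate":      {"student": 0.0026, "school": 0.0047},
    "intercept": {"student": 82.15,  "school": 14.84},
}

for parameter, v in variances.items():
    total = v["student"] + v["school"]
    school_share = v["school"] / total  # ICC-like proportion at the school level
    print(f"{parameter:9s}: total variance = {total:9.4f}, "
          f"school share = {school_share:5.1%}, "
          f"student share = {1 - school_share:5.1%}")
```

This reproduces the reported 59.8% school-level share for capacity, approximately 64% for the rates, and approximately 85% at the student level for the intercepts.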
Of course, in future substantively-oriented DMM research, these variance components could also be explained by predictor variables at both the student and school levels, allowing specific hypotheses about learning and education to be tested and particular theories to be explicitly compared. For example, Dumas and McNeish (2017, 2018) showed that student SES substantially predicted DMM intercepts but yielded negligible effect sizes when predicting DMM capacities, indicating that impoverished students in the U.S. may not have developed the same knowledge as their more affluent peers at the time of testing, but their capacity to learn in the future was essentially equal. Such a finding is in line with early hypotheses from those who conceptualized and first worked in DA (e.g., Rey, 1934/2012; Vygotsky, 1931/1997) that the measurement of properties associated with students' improvement in response to instruction would be far less influenced by demographic background and past traumatic experiences than are single-time-point static test scores.

The findings in the example analysis presented here suggest the majority of the variance in DMM scores could be accounted for at the school level, and therefore a promising line of future research would involve identifying the pedagogical or contextual properties of schools that most influence learning trajectories. For instance, one extant study that used DMM to test the efficacy of an instructional intervention showed that the intervention itself significantly influenced the rate of student learning, but not students' asymptotic capacity scores (Dumas, McNeish, Sarama, et al., 2019). This finding speaks to the malleability of pedagogy and its effects upon students' learning, while also further illustrating how capacity scores better capture what all students could do, given time for beneficial conditions to exert their effects.

In our school-level model with the MAP Growth dataset, the pattern of the correlations for the random effects is also different across the student and school levels. As in the solely two-level model above, at the student level of the three-level model, all the correlations among the DMM parameters are positive. There is a strong correlation (0.76) between intercepts and capacity scores: students with high scores in 3rd grade tend to have high capacities. There is also a moderate correlation (0.24) between capacity and rate: students who learned faster over elementary and middle school tended to have higher capacities. The correlation between intercepts and rates is weakly positive (0.08) and technically statistically significant due to the large sample, but it is not likely meaningful substantively: growth in mathematics is not heavily dictated by scores in 3rd grade. In contrast, at the school level, two of the three random effect correlations are negative. The correlation between intercepts and school-level capacity scores remains moderately strong and positive (0.58), meaning that schools with higher average third-grade math scores also tend to have higher predicted capacities. However, the correlation between capacity and rate estimates at the school level is strongly negative (−0.85), meaning that schools with higher average capacities for mathematics learning tend to have slower average growth rates through elementary and middle school. The correlation between the intercepts and the rates is moderately negative (−0.24), meaning that schools with higher average intercepts tend to have lower average growth rates during elementary and middle school in this sample. This pattern of correlation at the school level could indicate the possibility of a ceiling effect for schools, but not for students, where the correlations follow a very different pattern. At the school level in these data, high intercepts may have meant that schools had less room to grow toward their capacities, and therefore their rates are estimated to be lower by the model. This finding indicates both the promise and challenge of DMM: it can reveal complicated but more accurate depictions of how contextual factors can moderate students' current and predicted future performance. It is important to note that this pattern of random effect correlations is specific to the analytic sample used here, and readers who are interested in applying DMM to their own data could expect to find meaningfully different patterns of covariance among the DMM parameters.

Delimitations and future directions for DMM

As with any theoretical, methodological, or empirical work in educational psychology, DMM currently has specific delimitations to its capabilities, as well as areas of further development and application that are pertinent. Several of these delimitations and future directions are discussed here.

DMM scores are dependent on valid items

Regardless of the complexity of the measurement model, the validity of all psychometrically calculated student scores is dependent on the validity of the actual stimuli presented to participants. As the process of measurement modeling has gotten more complex over the course of the last century, moving from classical test theory to latent variable models and now to DMMs, the requirement for validity of the item stimuli themselves has remained constant. DMM-generated capacity scores show many advantages over single-time-point ability scores, but those DMM scores must be calculated based on student behavior in response to psychologically valid stimuli. Determining valid stimuli is no trivial matter; debate already exists in the field concerning whether abstract stimuli designed to measure highly domain-general constructs (e.g., reasoning) or more concrete stimuli designed to measure domain-specific constructs (e.g., fractions) are more useful in educational psychology (Haertel, 2018). Although the technique was not included in the empirical demonstration in this article, DMM can be directly built onto a latent variable measurement model (e.g., CFA, IRT), allowing the growth trajectory to be estimated based on latent quantities at each time-point. This method was demonstrated by McNeish and Dumas (2017), with the added caveat that it greatly increases the computational run-time of DMMs.

Modeling diverse psychological constructs

Currently, DMM methodology has been entirely focused on performance measurement, where students respond to stimuli that can be scored as correct, incorrect, or by degrees of correctness. The reason for this is that DMM is designed for use with constructs that grow in a monotonic and decelerating way (i.e., as a learning curve) in response to instruction. From our perspective, this growth pattern is most typically observed in educational data for constructs measured with performance items (e.g., mathematics, reading).
However, the use of DMM to generate capacity scores for self-reported constructs (e.g., self-efficacy) may be an interesting future direction to pursue as part of this research agenda, assuming that the construct grows monotonically within that research context. If the self-report construct of interest is instead hypothesized to vacillate around a stable mean (i.e., as in time-series modeling), then the dynamic structural equation modeling (DSEM) framework may be more relevant than DMM (McNeish & Hamaker, 2020). At this point in the development of DMM, it is unknown whether capacity scores for self-reported constructs could be statistically estimated, or whether they would be psychologically meaningful.

Student- and school-level covariates

Although not included in the example analysis in this article, DMM, like other mixed-effects growth models (Grimm et al., 2016), is capable of including covariates on the parameters in order to ascertain the influence of a predictor variable on the intercepts, rates, or capacities estimated by the model. For example, the student-level influence of a cognitive variable such as working memory on learning capacities in mathematics could be incorporated relatively easily by setting it as a covariate on the student-level capacity parameter. In the future, such a method for the investigation of student-level influences on DMM parameters could include a variety of performance measures (e.g., cognitive strategies or skills), self-report attributes (e.g., motivational constructs), or sociocultural aspects (e.g., student demographics, SES). In addition, this method could be extended to model the influence of specific attributes of students' learning context by including covariates on the third level of the DMM, whether that level be classroom, teacher, or school (as it is in the example here). For example, the inclusion of categorical covariates at the classroom level that indicate the administration of a curricular intervention appears to be a highly important future direction for DMM, because such an analysis would allow for testing the impact of an intervention at the cluster level. In future work, it may be important to further consider relevant contextual covariates at the level of several clustering variables, such as psychological attributes of teachers (e.g., burnout, implicit bias), school-level funding designations (e.g., public, charter, magnet), or other socioeconomic school-level factors (e.g., indicators of school violence or entrenched poverty in the community). Like other mixed-effects models, DMM can include time-varying covariates on the individual time-points of the learning curve, but the DMM parameters (i.e., learning rate and capacity) are not time-varying in the model, so the covariates placed on those model parameters should be theoretically relevant across the full window of observation. A notational sketch of this covariate extension follows.
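As an illustration (our notation, not an equation estimated in the article), a hypothetical student-level covariate, such as a working-memory score WM_ij, and a hypothetical school-level covariate, such as an intervention indicator Int_j, could enter the capacity equation of Equation (4) as:

$$\beta_{Cij} = \gamma_C + \gamma_{C1}\,\mathrm{WM}_{ij} + \gamma_{C2}\,\mathrm{Int}_{j} + u_{Cij} + r_{Cj}$$

where γ_C1 captures the expected difference in estimated student capacity per one-unit difference in working memory, and γ_C2 captures the expected difference in average school capacity associated with the intervention. Analogous terms could be placed on the intercept or rate equations.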
Possibility for individual assessment

Given the deep history of the core ideas related to dynamic assessment reviewed in this article, it is apparent that one of the principal intentions of the early designers and proponents of the dynamic measurement paradigm (e.g., Feuerstein et al., 1979; Vygotsky, 1931/1997) was to improve the inferences that are made not just by educational psychologists concerning students on average, but also those inferences made in more practical settings (e.g., by school psychologists) and individual student assessment contexts. Indeed, DA as it was originally conceived was quintessentially an individual assessment technique, in which a clinician would administer the instruction along with the item stimuli to students while observing their improvement (Haywood & Lidz, 2006). As previously reviewed, this individual-assessment DA paradigm did show encouraging results, especially for improving the consequential validity of measures of children from historically disadvantaged groups, but issues related to cost have kept DA applications to individual assessment limited, especially in the U.S. In the future, dynamic measures that are designed for individual assessment could be psychometrically calibrated and normed using DMM. This process would necessarily involve identifying items or scales designed to tap the same construct, to be administered in a particular order over time, while also designing and implementing a particular instructional method intended to produce the most meaningful student improvement on the items being administered. On the other hand, existing school-based instruction could continue to be used as part of a dynamic measurement framework, as a way to save on cost and leverage existing efforts by schools to improve student performance.

Uses and interpretation of dynamic scores

Despite the rich evidence provided in the current demonstration for the suitability of DMM in modeling educational data, as well as the existing validity and reliability evidence published elsewhere (Dumas & McNeish, 2017, 2018; Dumas, McNeish, Schreiber-Gregory, et al., 2019; McNeish et al., 2020; McNeish & Dumas, 2017, 2018, 2019), much specific future work would be required to fully validate the uses and appropriate interpretations of DMM capacity scores from a particular measurement context. Related to this issue is the theoretical and terminological disentangling of the DMM-estimated capacity and the 20th-century conceptualization of aptitude (Bracht, 1970; Swanson, 1990). In our reading of the literature, aptitude was conceptualized as an estimate of student capability to learn that was, if not fully innate, then inherent to a student before a particular learning experience (e.g., undergraduate education) began. In contrast to this, capacity as it is described here and estimated from the MAP dataset via DMM is an estimate of a student's predicted performance after the learning experience (in this case, school-based math instruction) is complete. So, aptitude was conceptualized as an attribute that exists in students before instruction is administered, whereas capacity is an estimate of a student's future performance, given their recent learning trajectory. This difference between aptitude and capacity is subtle, but we believe it is crucial for applications of DMM to be effective: student capacity emerges from student learning trajectories, not the other way around.
Cost of dynamic measurement and data availability

As reviewed here, one of the main barriers to the adoption of dynamic assessment, both in educational practice and in educational research, has been its cost. The development of DMM may help to lower the cost barrier to the inclusion of such methods in educational psychology because it opens up the possibility of estimating DA-related developing constructs (e.g., learning capacity) from existing large-scale longitudinal data. However, this strategy of using large-scale secondary data is not possible for all educational psychologists, who may be interested in constructs not generally assessed in large-scale testing, or who may not have access to such datasets. For these researchers, the cost barriers associated with DA still exist with DMM, and the need for either industry funding or a relatively large federal grant is likely if researchers are to design and calibrate their own dynamic testing procedure. At this point in the development of DMM, it is not known for certain how many observations are needed to estimate reliable capacities, and the number of necessary time points varies depending on the construct being assessed, the participants, and the efficacy of the instruction. Indeed, even in this relatively large-scale application, only one measurement occasion per year was available for the sample of students we used, and it is possible, and even likely, that more testing occasions could have improved the reliability of the DMM capacity scores estimated in this study: capacity scores exhibited a marginal reliability of 0.855, but that value may increase as testing occasions are added to the model.

However, it has been shown (i.e., Timmons & Preacher, 2015) that the timing of the measurement occasions is highly relevant to the estimation of nonlinear mixed-effects models and that, in order to best facilitate the estimation of these models, the greatest density of measurement should occur in the time window closest to the maximum curvature of the hypothesized growth function. DMM models operate in a nonlinear mixed-effects framework that can be fit to most datasets with a lower-bound sample size of around 50 participants (McNeish, 2016), and with a sample size of 100 if the normality assumption is violated (Jacqmin-Gadda et al., 2007). Future methodological research is needed to determine precisely how much greater the reliability of capacity scores can be expected to be if a certain number of additional time points are collected (analogous to the Spearman-Brown formula in classical testing), and such research would be useful for planning costly data collection procedures a priori.

Conclusion

Public and private debates about what educational psychologists do, and should do, to foster educational excellence and equity have persisted throughout the field's history. In Zimmerman and Schunk's text on the history of educational psychology, Pressley and Roehrig (2003) wrote:

    … we would be less than candid if we did not mention a curious slippage between the scholarly literature in educational psychology and the public face of educational psychology. The most prominent educational psychology topic in the popular media is the analysis of standardized test results, with a typical theme being that American students' achievement lags behind students in other countries. For the most part, this work is not theoretically driven, and hence, such data are not collected in designs that permit testing of theoretical possibilities… We should probably be thinking hard about how educational psychologists can provide data for the current great debates about student achievement that address the concerns of policymakers and the public as well as the professional educational psychology research community. Whether we like it or not, our worth as a profession in the eyes of the public and those who control the research purse strings probably depends to some extent on the perception that educational psychologists can provide important information about trends in student achievement and how to improve achievement as indexed by the standardized tests that are the gold standard of the day. (p. 362)

These historical claims highlight the unique position educational psychology holds at the nexus of scholarship on learning, assessment, and achievement. With this position comes a responsibility to the public, one that educational psychologists have endeavored to meet in the last 17 years. Pressley and Roehrig (2003) echoed the thoughts and values of many educational psychologists who believe in the potential of all students and the importance of accounting for interactions among history, context, and individual achievement or performance. We argue that innovations within educational psychology's theories and methods have indeed enabled progressive insights about how to conceptualize and positively affect the ways students learn within school or informal learning contexts, with important and novel contributions to the current great debates in education.

In this article, we have argued that DMMs represent a way for educational psychologists to more equitably and accurately model student outcomes (i.e., developed abilities) as well as potential (i.e., developing learning capacities) via a modern theoretical-psychometric framework that accounts for students' history, their current context, and how various outcomes would, and perhaps should, develop in that context. Such a framework allows educational psychologists to more completely explore the dynamic interactions among person, context, and instruction posited by Vygotsky and other prominent scholars, via multiple measurement occasions that can capture the effects of changes in instruction and context more completely and accurately than previous methodologies.

Specifically, the DMM framework, as we have conceptualized, reviewed, and demonstrated it in this article, is much more capable than previous methods of aligning the theoretical interests and goals of educational psychology with the methodological procedures by which educational psychologists do their work. By extension, DMM may also help to better incorporate phenomena that educational psychologists have long struggled to study (e.g., pedagogical effects on student potential; dependence of student past and present achievement; nonlinear improvement in student performance) into psychometric testing procedures.
For example, some in the psychometric literature (von Davier et al., 2019) have recently called for the better incorporation of learning (i.e., student improvement over time) into large-scale educational testing, echoing the historical discussions reviewed here (e.g., Binet, 1909/1975; DuBois, 1920/2013; Thorndike, 1921). Given its ability to model the complex interactions researchers have posited but have not been able to measure, DMM appears to be a theoretical-psychometric paradigm that the field of educational psychology, in both its basic and applied instantiations, has long needed. Ideally, DMM may be a prime example of how educational psychology theory and research can inform the development of methodology, and then in turn how that newly developed methodology can aid both large-scale educational testing applications and more targeted empirical research done by educational psychologists. Although a number of growth modeling methods exist in the educational research literature (see Castellano & Ho, 2013 for an overview), and some of those existing methods can be used to predict students' future performance based on their growth trajectories, DMM has the capability of directly estimating (i.e., as a parameter in the model) a student's capacity given their learning trajectory, as well as doing so while statistically accounting for school-level characteristics.

In sum, the DMM theoretical-psychometric paradigm has deep theoretical roots stretching back to many of the most prominent educational thinkers of the early 20th century, including Edward Thorndike (1921), W. E. B. DuBois (1920/2013), and Lev Vygotsky (1931/1997). Throughout the latter half of the 20th century, dynamic assessment methods were conceptualized, enacted, and refined by scholars working within the educational psychology literature, such as Feuerstein et al. (1979), Lee Swanson (1995a), and Sternberg et al. (2002). However, the field lacked a scalable, resource-viable way of modeling student potential and how it can change based upon context. Recently, we developed the DMM method to do just that: model dynamic learning in large-scale educational datasets (Dumas & McNeish, 2017, 2018; Dumas, McNeish, Schreiber-Gregory, et al., 2019; McNeish & Dumas, 2017, 2018, 2019). Here, we have demonstrated DMM in a large dataset of MAP Growth mathematics assessments and further extended the model to incorporate the clustering of students in schools. With between 60 and 70% of the variance in students' learning rate and learning capacity scores located at the school level, and therefore attributable to measurable aspects of context, we believe the time is ripe for the use of such DMM methods to explain how schools and other learning contexts support or attenuate student learning.

Funding

This research was supported by a postdoctoral research award from the National Academy of Education/Spencer Foundation awarded to Denis Dumas.

References

Alexander, P. A., Schallert, D. L., & Reynolds, R. E. (2009). What is learning anyway? A topographical perspective considered. Educational Psychologist, 44(3), 176–192. https://doi.org/10.1080/00461520903029006
Aravena, S., Tijms, J., Snellings, P., & van der Molen, M. W. (2016). Predicting responsiveness to intervention in dyslexia using dynamic assessment. Learning and Individual Differences, 49, 209–215. https://doi.org/10.1016/j.lindif.2016.06.024
Asparouhov, T., Hamaker, E. L., & Muthén, B. (2018). Dynamic structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 25(3), 359–388. https://doi.org/10.1080/10705511.2017.1406803
Baltes, P. B., & Kliegl, R. (1992). Further testing of limits of cognitive plasticity: Negative age differences in a mnemonic skill are robust. Developmental Psychology, 28(1), 121–125. https://doi.org/10.1037/0012-1649.28.1.121
Binet, A. (1975). Modern ideas about children. Suzanne Heisler. (Original work published 1909)
Binet, A., & Simon, T. (1948). The development of the Binet-Simon Scale, 1905–1908. In W. Dennis (Ed.), Readings in the history of psychology (pp. 412–424). Appleton-Century-Crofts. (Original work published 1908)
Blozis, S. A., & Harring, J. R. (2018). Fitting nonlinear mixed-effects models with alternative residual covariance structures. Sociological Methods & Research. Advance online publication. https://doi.org/10.1177/0049124118789718
Bose, M., Kohli, N., Newell, K. W., & Christ, T. J. (2019). Response to intervention: Empirical demonstration of a dual-discrepancy population via random effects mixture models. Learning and Individual Differences, 71, 23–30. https://doi.org/10.1016/j.lindif.2019.03.004
Bracht, G. H. (1970). Experimental factors related to aptitude-treatment interactions. Review of Educational Research, 40(5), 627–645. https://doi.org/10.3102/00346543040005627
Briggs, D. C., & Weeks, J. P. (2009). The impact of vertical scaling decisions on growth interpretations. Educational Measurement: Issues and Practice, 28(4), 3–14. https://doi.org/10.1111/j.1745-3992.2009.00158.x
Bronfenbrenner, U., & Morris, P. A. (2006). The ecology of developmental processes. In R. M. Lerner (Ed.), The handbook of child psychology (Vol. 1, pp. 793–828). Wiley & Sons.
Brozek, J. (1972). To test or not to test: Trends in the Soviet views. Journal of the History of the Behavioral Sciences, 8(2), 243–248. https://doi.org/10.1002/1520-6696(197204)8:2<243::AID-JHBS2300080212>3.0.CO;2-3
Budoff, M. (1987a). Measures for assessing learning potential. In C. S. Lidz (Ed.), Dynamic assessment: An interactional approach to evaluating learning potential (pp. 173–195). Guilford Press.
Budoff, M. (1987b). The validity of learning potential assessment. In C. S. Lidz (Ed.), Dynamic assessment: An interactional approach to evaluating learning potential (pp. 53–81). Guilford Press.
Budoff, M., & Friedman, M. (1964). "Learning potential" as an assessment approach to the adolescent mentally retarded. Journal of Consulting Psychology, 28(5), 434–439. https://doi.org/10.1037/h0040631
Budoff, M., & Pagell, W. (1968). Learning potential and rigidity in the adolescent mentally retarded. Journal of Abnormal Psychology, 73(5), 479–486. https://doi.org/10.1037/h0026219
Calero, M. D., García-Martín, M. B., & Robles, M. A. (2011). Learning potential in high IQ children: The contribution of dynamic assessment to the identification of gifted children. Learning and Individual Differences, 21(2), 176–181. https://doi.org/10.1016/j.lindif.2010.11.025
Carlson, J. S., & Wiedl, K. H. (1978). Use of testing-the-limits procedures in the assessment of intellectual capabilities in children with learning difficulties. American Journal of Mental Deficiency, 82(6), 559–564.

Carlson, J. S., & Wiedl, K. H. (1979). Toward a differential testing approach: Testing-the-limits employing the Raven matrices. Intelligence, 3(4), 323–344. https://doi.org/10.1016/0160-2896(79)90002-3
Carlson, J. S., & Wiedl, K. H. (1992). Principles of dynamic assessment: The application of a specific model. Learning and Individual Differences, 4(2), 153–166. https://doi.org/10.1016/1041-6080(92)90011-3
Castellano, K. E., & Ho, A. D. (2013). A practitioner's guide to growth models. Council of Chief State School Officers.
Cattell, R. B. (1979). Are culture fair intelligence tests possible and necessary? Journal of Research and Development in Education, 12(2), 3–13.
Chaffey, G. W. (2009). Gifted but underachieving: Australian indigenous children. In T. Balchin, B. Hymer, & D. J. Matthews (Eds.), The Routledge international companion to gifted education (pp. 106–114). Routledge.
Cobb, P., & Bowers, J. (1999). Cognitive and situated learning perspectives in theory and practice. Educational Researcher, 28(2), 4–15. https://doi.org/10.3102/0013189X028002004
Compton, D. L., Fuchs, D., Fuchs, L. S., Bouton, B., Gilbert, J. K., Barquero, L. A., Cho, E., & Crouch, R. C. (2010). Selecting at-risk first-grade readers for early intervention: Eliminating false positives and exploring the promise of a two-stage gated screening process. Journal of Educational Psychology, 102(2), 327–340. https://doi.org/10.1037/a0018448
Cronbach, L. J., & Furby, L. (1970). How we should measure "change": Or should we? Psychological Bulletin, 74(1), 68–80. https://doi.org/10.1037/h0029382
Davidian, M., & Giltinan, D. M. (1995). Nonlinear models for repeated measurement data. CRC Press.
De Weerdt, E. H. (1927). A study of the improvability of fifth grade school children in certain mental functions. Journal of Educational Psychology, 18(8), 547–557. https://doi.org/10.1037/h0073097
Dearborn, W. F. (1921). Intelligence and its measurement: A symposium–XII. Journal of Educational Psychology, 12(4), 210–212. https://doi.org/10.1037/h0065003
Deonovic, B., Yudelson, M., Bolsinova, M., Attali, M., & Maris, G. (2018). Learning meets assessment. Behaviormetrika, 45(2), 457–474. https://doi.org/10.1007/s41237-018-0070-z
DuBois, W. E. B. (2013). W. E. B. DuBois on sociology and the Black community. University of Chicago Press. (Original work published 1920)
Dumas, D., & McNeish, D. M. (2017). Dynamic measurement modeling: Using nonlinear growth models to estimate student learning capacity. Educational Researcher, 46(6), 284–292. https://doi.org/10.3102/0013189X17725747
Dumas, D., & McNeish, D. M. (2018). Increasing the consequential validity of reading assessment using dynamic measurement modeling. Educational Researcher, 47(9), 612–614. https://doi.org/10.3102/0013189X18797621
Dumas, D., McNeish, D., Sarama, J., & Clements, D. (2019). Preschool mathematics intervention can significantly improve student learning trajectories through elementary school. AERA Open, 5(4). https://doi.org/10.1177/2332858419879446
Dumas, D., McNeish, D., Schreiber-Gregory, D., Durning, S. J., & Torre, D. M. (2019). Dynamic measurement in health professions education: Rationale, application, and possibilities. Academic Medicine, 94(9), 1323–1328. https://doi.org/10.1097/ACM.0000000000002729
Elliott, J. G., Resing, W. C. M., & Beckmann, J. F. (2018). Dynamic assessment: A case of unfulfilled potential? Educational Review, 70(1), 7–17. https://doi.org/10.1080/00131911.2018.1396806
Embretson, S. E. (1987). Toward development of a psychometric approach. In C. S. Lidz (Ed.), Dynamic assessment: An interactional approach to evaluating learning potential (pp. 141–170). Guilford Press.
Embretson, S. E. (1991). A multidimensional latent trait model for measuring learning and change. Psychometrika, 56(3), 495–515. https://doi.org/10.1007/BF02294487
Feuerstein, R., Krasilowsky, D., & Rand, Y. (1974). Innovative educational strategies for the integration of high-risk adolescents in Israel. The Phi Delta Kappan, 55(8), 556–558.
Feuerstein, R., Miller, R., Hoffman, M. B., Rand, Y., Mintzker, Y., & Jensen, M. R. (1981). Cognitive modifiability in adolescence: Cognitive structure and the effects of intervention. The Journal of Special Education, 15(2), 269–287. https://doi.org/10.1177/002246698101500213
Feuerstein, R., Rand, Y., & Hoffman, M. (1979). The dynamic assessment of retarded performers: The learning potential assessment device, theory, instruments, and techniques. University Park Press.
Feuerstein, R., Rand, Y., Jensen, M. R., Kaniel, S., & Tzuriel, D. (1987). Prerequisites for assessment of learning potential: The LPAD model. In C. S. Lidz (Ed.), Dynamic assessment: An interactional approach to evaluating learning potential (pp. 35–51). Guilford Press.
Frisby, C. L., & Braden, J. P. (1992). Feuerstein's dynamic assessment approach: A semantic, logical, and empirical critique. The Journal of Special Education, 26(3), 281–301. https://doi.org/10.1177/002246699202600305
Fuchs, D., & Fuchs, L. S. (2006). Introduction to response to intervention: What, why, and how valid is it? Reading Research Quarterly, 41(1), 93–99. https://doi.org/10.1598/RRQ.41.1.4
Fuchs, D., Compton, D. L., Fuchs, L. S., Bouton, B., & Caffrey, E. (2011). The construct and predictive validity of a dynamic assessment of young children learning to read: Implications for RTI frameworks. Journal of Learning Disabilities, 44(4), 339–347. https://doi.org/10.1177/0022219411407864
Grigorenko, E. L. (2009). Dynamic assessment and response to intervention: Two sides of one coin. Journal of Learning Disabilities, 42(2), 111–132. https://doi.org/10.1177/0022219408326207
Grigorenko, E. L., & Sternberg, R. J. (1998). Dynamic testing. Psychological Bulletin, 124(1), 75–111. https://doi.org/10.1037/0033-2909.124.1.75
Grimm, K. J., Ram, N., & Estabrook, R. (2016). Growth modeling: Structural equation and multilevel modeling approaches. Guilford Publications.
Grimm, K. J., Ram, N., & Hamagami, F. (2011). Nonlinear growth curves in developmental research. Child Development, 82(5), 1357–1371. https://doi.org/10.1111/j.1467-8624.2011.01630.x
Guthke, J. (1992). Learning tests: The concept, main research findings, problems and trends. Learning and Individual Differences, 4(2), 137–151. https://doi.org/10.1016/1041-6080(92)90010-C
Guthke, J., & Beckmann, J. F. (2000). Learning test concepts and dynamic assessment. In A. Kozulin & Y. Rand (Eds.), Experience of mediated learning: An impact of Feuerstein's theory in education and psychology (pp. 175–190). Pergamon Press.
Guthke, J., & Stein, H. (1996). Are learning tests the better version of intelligence tests? European Journal of Psychological Assessment, 12(1), 1–13. https://doi.org/10.1027/1015-5759.12.1.1
Haertel, E. H. (2018). Tests, test scores, and constructs. Educational Psychologist, 53(3), 203–216. https://doi.org/10.1080/00461520.2018.1476868
Hancock, G. R., & Mueller, R. O. (2013). Structural equation modeling: A second course (2nd ed.). IAP.
Haney, M. R., & Evans, J. G. (1999). National survey of school psychologists regarding use of dynamic assessment and other nontraditional assessment techniques. Psychology in the Schools, 36(4), 295–304. https://doi.org/10.1002/(SICI)1520-6807(199907)36:4<295::AID-PITS3>3.0.CO;2-G
Haywood, H. C. (2008). Twenty years of IACEP, and a focus on dynamic assessment: Progress, problems, and prospects. Journal of Cognitive Education and Psychology, 7(3), 419–442. https://doi.org/10.1891/194589508787724042
Haywood, H. C., & Lidz, C. S. (2006). Dynamic assessment in practice: Clinical and educational applications. Cambridge University Press.
Hessels, M. G. P., & Hamers, J. H. M. (1993). The learning potential test for ethnic minorities. In J. H. M. Hamers, K. Sijtsma, & A. J. J. M. Ruijssenaars (Eds.), Learning potential assessment: Theoretical, methodological and practical issues (pp. 285–311). Swets & Zeitlinger Publishers.
Humphrey, S. E., & LeBreton, J. M. (Eds.). (2019). The handbook of multilevel theory, measurement, and analysis. American Psychological Association. https://doi.org/10.1037/0000115-000
Jacqmin-Gadda, H., Sibillot, S., Proust, C., Molina, J. M., & Thiébaut, R. (2007). Robustness of the linear mixed model to misspecified error distribution. Computational Statistics & Data Analysis, 51(10), 5142–5154. https://doi.org/10.1016/j.csda.2006.05.021
Kern, B. (1930). Wirkungsformen der Übung [Effects in training]. Helios.
Kliegl, R., Smith, J., & Baltes, P. B. (1989). Testing-the-limits and the study of age differences in cognitive plasticity of a mnemonic skill. Developmental Psychology, 25(2), 247–256. https://doi.org/10.1037/0012-1649.25.2.247
Kohli, N., Peralta, Y., & Bose, M. (2019). Piecewise random-effects modeling software programs. Structural Equation Modeling: A Multidisciplinary Journal, 26(1), 156–164. https://doi.org/10.1080/10705511.2018.1516507
Laird, N. M., & Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38(4), 963–974.
Lederberg, A. R., & Spencer, P. E. (2009). Word-learning abilities in deaf and hard-of-hearing preschoolers: Effect of lexicon size and language modality. Journal of Deaf Studies and Deaf Education, 14(1), 44–62. https://doi.org/10.1093/deafed/enn021
Leroux, A. J., & Beretvas, S. N. (2018). Estimation of a latent variable regression growth curve model for individuals cross-classified by clusters. Multivariate Behavioral Research, 53(2), 231–246. https://doi.org/10.1080/00273171.2017.1418654
Lidz, C. S. (1992). Dynamic assessment: Some thoughts on the model, the medium, and the message. Learning and Individual Differences, 4(2), 125–136. https://doi.org/10.1016/1041-6080(92)90009-4
Lidz, C. S., & Peña, E. D. (2009). Response to intervention and dynamic assessment: Do we just appear to be speaking the same language? Seminars in Speech and Language, 30(2), 121–134. https://doi.org/10.1055/s-0029-1215719
McNeish, D. (2016). Estimation methods for mixed logistic models with small sample sizes. Multivariate Behavioral Research, 51, 790–804.
McNeish, D. (2017). Challenging conventional wisdom for multivariate statistical models with small samples. Review of Educational Research, 87(6), 1117–1151.
McNeish, D., & Dumas, D. (2017). Nonlinear growth models as measurement models: A second-order growth curve model for measuring potential. Multivariate Behavioral Research, 52(1), 61–85. https://doi.org/10.1080/00273171.2016.1253451
McNeish, D., & Dumas, D. (2018). Calculating conditional reliability for dynamic measurement model capacity estimates. Journal of Educational Measurement, 55(4), 614–634. https://doi.org/10.1111/jedm.12195
McNeish, D., & Dumas, D. (2019). Scoring repeated standardized tests to estimate capacity, not just current ability. Policy Insights from the Behavioral and Brain Sciences, 5, 19–24. https://doi.org/10.1177/2372732219862578
McNeish, D., Dumas, D., & Grimm, K. (2020). Estimating new quantities from longitudinal test scores to improve forecasts of future performance. Multivariate Behavioral Research. https://doi.org/10.1080/00273171.2019.1691484
McNeish, D., & Hamaker, E. L. (2020). A primer on two-level dynamic structural equation modeling for intensive longitudinal data in Mplus. Psychological Methods. Advance online publication. https://doi.org/10.1037/met0000250
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525–543. https://doi.org/10.1007/BF02294825
Moerbeek, M., & Safarkhani, M. (2018). The design of cluster randomized trials with random cross-classifications. Journal of Educational and Behavioral Statistics, 43(2), 159–181. https://doi.org/10.3102/1076998617730303
National Academies of Sciences, Engineering, and Medicine. (2018). How people learn II: Learners, contexts, and cultures. National Academies Press.
National Research Council and National Academy of Education. (2010). Getting value out of value-added: Report of a workshop (H. Braun, N. Chudowsky, & J. Koenig, Eds.). Committee on Value-Added Methodology for Instructional Improvement, Program Evaluation, and Educational Accountability; Center for Education, Division of Behavioral and Social Sciences and Education. The National Academies Press.
Navarro, E., & Calero, M. D. (2009). Estimation of cognitive plasticity in old adults using dynamic assessment techniques. Journal of Cognitive Education and Psychology, 8(1), 38–51. https://doi.org/10.1891/1945-8959.8.1.38
NWEA. (2011). Technical manual: Measures of Academic Progress (MAP) and Measures of Academic Progress for Primary Grades (MPG). NWEA.
Passig, D., Tzuriel, D., & Eshel-Kedmi, G. (2016). Improving children's cognitive modifiability by dynamic assessment in 3D Immersive Virtual Reality environments. Computers & Education, 95, 296–308. https://doi.org/10.1016/j.compedu.2016.01.009
Piaget, J. (1960). The psychology of intelligence. Littlefield, Adams.
Pinheiro, J. C., & Bates, D. M. (1995). Approximations to the log-likelihood function in the nonlinear mixed-effects model. Journal of Computational and Graphical Statistics, 4(1), 12–35. https://doi.org/10.2307/1390625
Poehner, M. E., & van Compernolle, R. A. (2011). Frames of interaction in Dynamic Assessment: Developmental diagnoses of second language learning. Assessment in Education: Principles, Policy & Practice, 18(2), 183–198. https://doi.org/10.1080/0969594X.2011.567116
Poehner, M. E., Zhang, J., & Lu, X. (2015). Computerized dynamic assessment (C-DA): Diagnosing L2 development according to learner responsiveness to mediation. Language Testing, 32(3), 337–357. https://doi.org/10.1177/0265532214560390
Powell, M. J. D. (1977). Restart procedures for the conjugate gradient method. Mathematical Programming, 12, 241–254. https://doi.org/10.1007/BF01593790
Pressley, M., & Roehrig, A. (2003). Educational psychology in the modern era: 1960 to present. In B. Zimmerman & D. H. Schunk (Eds.), Educational psychology: A century of contributions (pp. 333–366). Lawrence Erlbaum Associates.
Raven, J. C. (1941). Standardization of progressive matrices, 1938. British Journal of Medical Psychology, 19(1), 137–150. https://doi.org/10.1111/j.2044-8341.1941.tb00316.x
Resing, W. C. M. (2013). Dynamic testing and individualized instruction: Helpful in cognitive education? Journal of Cognitive Education and Psychology, 12(1), 81–95. https://doi.org/10.1891/1945-8959.12.1.81
Resing, W. C. M., de Jong, F. M., Bosma, T., & Tunteler, E. (2009). Learning during dynamic testing: Variability in strategy use by indigenous and ethnic minority children. Journal of Cognitive Education and Psychology, 8(1), 22–37. https://doi.org/10.1891/1945-8959.8.1.22
Rey, A. (2012). A method for assessing educability: Some applications in psychopathology (C. Haywood, Trans.). Journal of Cognitive Education and Psychology, 11(3), 274–300. https://doi.org/10.1891/1945-8959.11.3.274 (Original work published 1934)
Sijtsma, K. (1993). Psychometric issues in learning potential assessment. In J. H. M. Hamers, K. Sijtsma, & A. J. J. M. Ruijssenaars (Eds.), Learning potential assessment: Theoretical, methodological and practical issues (pp. 175–193). Swets & Zeitlinger Publishers.
Sternberg, R. J., & Grigorenko, E. L. (2002). Dynamic testing: The nature and measurement of learning potential. Cambridge University Press.
Sternberg, R. J., Grigorenko, E. L., Ngorosho, D., Tantufuye, E., Mbise, A., Nokes, C., Jukes, M., & Bundy, D. A. (2002). Assessing intellectual potential in rural Tanzanian school children. Intelligence, 30(2), 141–162. https://doi.org/10.1016/S0160-2896(01)00091-5
Stevenson, C. E., Hickendorff, M., Resing, W. C. M., Heiser, W. J., & de Boeck, P. A. L. (2013). Explanatory item response modeling of children's change on a dynamic test of analogical reasoning. Intelligence, 41(3), 157–168. https://doi.org/10.1016/j.intell.2013.01.003
Strauss, D. (1992). The many faces of logistic regression. The American Statistician, 46, 321–327. https://doi.org/10.2307/2685327
Stringer, P. (2018). Dynamic assessment in educational settings: Is potential ever realised? Educational Review, 70(1), 18–30. https://doi.org/10.1080/00131911.2018.1397900
Swanson, H. L. (1990). Influence of metacognitive knowledge and aptitude on problem solving. Journal of Educational Psychology, 82(2), 306–314. https://doi.org/10.1037/0022-0663.82.2.306
Swanson, H. L. (1995a). Using the Cognitive Processing Test to assess ability: Development of a dynamic assessment measure. School Psychology Review, 24(4), 672–693.
Swanson, H. L. (1995b). Effects of dynamic testing on the classification of learning disabilities: The predictive and discriminant validity of the Swanson-Cognitive Processing Test. Journal of Psychoeducational Assessment, 13(3), 204–229. https://doi.org/10.1177/073428299501300301
Swanson, H. L. (2000). Swanson Cognitive Processing Test: Review and applications. In C. Lidz & J. G. Elliott (Eds.), Dynamic assessment: Prevailing models and applications (pp. 71–107). Elsevier Science.
Swanson, H. L., & Howard, C. B. (2005). Children with reading disabilities: Does dynamic assessment help in the classification? Learning Disability Quarterly, 28(1), 17–34. https://doi.org/10.2307/4126971
Thorndike, E. L. (1919). A standardized group examination of intelligence independent of language. Journal of Applied Psychology, 3(1), 13–32. https://doi.org/10.1037/h0070037
Thorndike, E. L. (1921). Intelligence and its measurement: A symposium–I. Journal of Educational Psychology, 12(3), 124–127. https://doi.org/10.1037/h0064596
Thorndike, E. L. (1924). Measurement of intelligence. Psychological Review, 31(3), 219–252. https://doi.org/10.1037/h0073975
Thum, Y. M., & Hauser, C. H. (2015). NWEA 2015 MAP norms for student and school achievement status and growth (NWEA Research Report). NWEA.
Thurstone, L. L. (1926). The scoring of individual performance. Journal of Educational Psychology, 17(7), 446–457. https://doi.org/10.1037/h0075125
Timmons, A. C., & Preacher, K. J. (2015). The importance of temporal design: How do measurement intervals affect the accuracy and efficiency of parameter estimates in longitudinal research? Multivariate Behavioral Research, 50(1), 41–55. https://doi.org/10.1080/00273171.2014.961056
Tzuriel, D., & Caspi, R. (2017). Intervention for peer mediation and mother-child interaction: The effects on children's mediated learning strategies and cognitive modifiability. Contemporary Educational Psychology, 49, 302–323. https://doi.org/10.1016/j.cedpsych.2017.03.005
Veerbeek, J., Vogelaar, B., Verhaegh, J., & Resing, W. C. M. (2019). Process assessment in dynamic testing using electronic tangibles. Journal of Computer Assisted Learning, 35(1), 127–142. https://doi.org/10.1111/jcal.12318
von Davier, A. A., Deonovic, B., Yudelson, M., Polyak, S. T., & Woo, A. (2019). Computational psychometrics approach to holistic learning and assessment systems. Frontiers in Education, 4, 69. https://doi.org/10.3389/feduc.2019.00069
Vygotsky, L. S. (1997). The history of the development of higher mental functions. In R. W. Rieber (Ed.), The collected works of L. S. Vygotsky (Vol. 4). Plenum Press. (Original work published 1931)
Vygotsky, L. S. (2011). The dynamics of the schoolchild's mental development in relation to teaching and learning (A. Kozulin, Trans.). Journal of Cognitive Education and Psychology, 10(2), 198–211. https://doi.org/10.1891/1945-8959.10.2.198 (Original work published 1935)
Wolfinger, R. D. (1999, April). Fitting nonlinear mixed models with the new NLMIXED procedure. In Proceedings of the 24th Annual SAS Users Group International Conference (SUGI 24) (pp. 278–284).
Yang, Y., & Qian, D. D. (2019). Promoting L2 English learners' reading proficiency through computerized dynamic assessment. Computer Assisted Language Learning. Advance online publication. https://doi.org/10.1080/09588221.2019.1585882