EFFECTS OF METALINGUISTIC KNOWLEDGE AND LANGUAGE APTITUDE ON LEARNING

A Dissertation Submitted to the Temple University Graduate Board

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

By Brian Wistner

January 2014

Examining Committee Members

Jim Sick, Advisory Chair, International Christian University
David Beglar, CITE/TESOL
Steven Ross, External Member, University of Maryland
Edward Schaefer, External Member, Ochanomizu University
Marshall Childs, CITE/TESOL

© Copyright 2014 by Brian Wistner


ABSTRACT

The purpose of this study was to investigate the effects of metalinguistic knowledge and language learning aptitude on second language (L2) procedural knowledge. Three lines of inquiry were undertaken: (a) confirming the factorial structure of metalinguistic knowledge and language learning aptitude; (b) testing the relative effects of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge; and (c) assessing the relative contributions of receptive and productive metalinguistic knowledge and components of language learning aptitude to L2 procedural knowledge.

Two hundred forty-nine Japanese university students participated. One receptive and two productive tests of metalinguistic knowledge related to metalinguistic terminology and English grammatical rules were administered. Learners’ language learning aptitude was measured using the Lunic Language Marathon, which consisted of four scales: number learning, sound-symbol association, vocabulary learning, and language analytical ability. Participants’ L2 procedural knowledge was assessed through performance on a timed writing task. The writing samples were scored for overall quality, L2 complexity, accuracy, and fluency.

The scores from each test were subjected to Rasch analyses to investigate the construct validity and unidimensionality of the instruments. The results of the Rasch analyses indicated that the test items fit the Rasch model, supporting the construct validity of the instruments. The unidimensionality of each instrument was established through Rasch principal component analyses. Interval-level Rasch measures were used for the subsequent analyses.

The results of exploratory and confirmatory factor analyses indicated that metalinguistic knowledge and language learning aptitude were distinct constructs. A two-factor model showed good model fit and explained the relationship between the two constructs. Structural equation modeling revealed that metalinguistic knowledge significantly predicted L2 procedural knowledge, complexity, accuracy, and fluency.

Language learning aptitude, however, was not a statistically significant predictor of the L2 procedural knowledge variables. The results of a path model analysis indicated that productive metalinguistic knowledge was the strongest predictor of L2 procedural knowledge, language analytical ability predicted receptive metalinguistic knowledge, and number learning was negatively associated with L2 procedural knowledge. The findings point to the facilitative role of metalinguistic knowledge in L2 learning and the viability of L2 declarative knowledge becoming proceduralized through practice.


ACKNOWLEDGMENTS

First, I would like to acknowledge the helpful advice I received from my dissertation defense committee. Drs. Jim Sick, David Beglar, Steven Ross, Edward Schaefer, and Marshall Childs provided valuable feedback and insightful comments that helped to improve many aspects of this dissertation.

I would like to thank Hideki Sakai for his contribution to this research. In addition to providing advice throughout the research process, Hideki played a pivotal role in the design of the Japanese testing instruments. He generously provided feedback on previous versions of the tests, offering essential advice on the content and phrasing of the items.

I would also like to express my gratitude to the students who participated in this research and to my colleagues who cooperated with data collection and test scoring.

Finally, I would like to express my heartfelt appreciation to my family for their support and patience throughout my doctoral studies.


TABLE OF CONTENTS

Page

ABSTRACT ...... iv

ACKNOWLEDGMENTS ...... vi

LIST OF TABLES ...... xii

LIST OF FIGURES ...... xiv

CHAPTER

1. INTRODUCTION ...... 1

The Background of the Issue ...... 1

Statement of the Problem ...... 5

Purposes of the Study...... 7

The Audience for the Study ...... 9

Delimitations ...... 10

Key Terminology ...... 11

The Organization of the Study ...... 13

2. REVIEW OF THE LITERATURE ...... 15

Declarative and Procedural Knowledge ...... 15

Characteristics of L2 Implicit and Explicit Knowledge ...... 19

Measuring Procedural and Declarative Knowledge ...... 22

Previous Studies of L2 Implicit and Explicit Knowledge ...... 26

Language Learning Aptitude ...... 39

Conceptualizations of Foreign Language Aptitude ...... 40


Operationalizations of Language Aptitude ...... 46

Modern Language Aptitude Test ...... 46

Pimsleur Language Aptitude Battery ...... 47

Language Aptitude Battery for the Japanese ...... 48

Lunic Language Marathon ...... 50

Cognitive Ability for Novelty in Acquisition of Language as Applied to Foreign Language Test ...... 50

Effects of Language Aptitude on L2 Learning ...... 52

Studies of Metalinguistic Knowledge, Proficiency, and Aptitude ...... 61

Knowledge and Theory-Based Gaps ...... 70

Analytical Gaps ...... 71

Purposes of the Study...... 72

Research Questions ...... 76

3. METHODS ...... 82

Participants ...... 82

Instrumentation ...... 85

Receptive Metalinguistic Knowledge Test ...... 85

Productive Metalinguistic Knowledge Test ...... 88

Language Learning Aptitude Test ...... 91

Procedural Knowledge Test ...... 94

Procedures ...... 99

Analysis...... 100

Rasch Analysis ...... 100

Structural Equation Modeling ...... 103


Analytical Procedures and Evaluative Criteria ...... 106

Modeling Procedures ...... 109

Issues of Model Complexity and Sample Size ...... 110

4. PRELIMINARY ANALYSES ...... 111

Analysis...... 111

Receptive Metalinguistic Knowledge Test ...... 112

Productive Metalinguistic Knowledge Test ...... 119

Technical Terminology Scale ...... 120

Rule Explanation Scale ...... 125

Language Learning Aptitude Test ...... 129

Number Learning Scale ...... 132

Sound-Symbol Association Scale ...... 135

Vocabulary Learning Scale ...... 141

Language Analytical Ability Scale ...... 146

L2 Procedural Knowledge Test ...... 150

Facets Analysis ...... 151

PCA of Complexity, Accuracy, and Fluency Measures ...... 154

Summary of the Preliminary Analyses ...... 160

5. RESULTS ...... 162

Data Screening ...... 162

Descriptive Statistics ...... 168

Research Question 1: Relationship Between Metalinguistic Knowledge and Language Learning Aptitude...... 170

Research Question 2: Effects of Metalinguistic Knowledge and Language Learning Aptitude on L2 Procedural Knowledge ...... 177


Research Question 3: Effects of Components of Metalinguistic Knowledge and Language Learning Aptitude on L2 Procedural Knowledge ...... 191

Summary of the Results ...... 198

6. DISCUSSION ...... 200

Research Question 1: Relationship between Metalinguistic Knowledge and Language Learning Aptitude...... 200

Summary of the Results for Research Question 1 ...... 201

Interpretation of the Results for Research Question 1 ...... 202

Research Question 2: Effects of Metalinguistic Knowledge and Language Learning Aptitude on L2 Procedural Knowledge ...... 208

Summary of the Results for Research Question 2 ...... 208

Interpretation of the Results for Research Question 2 ...... 210

Role of Metalinguistic Knowledge in Instructed L2 Learning ...... 210

Language Learning Aptitude and L2 Procedural Knowledge ...... 221

Research Question 3: Effects of Components of Metalinguistic Knowledge and Language Learning Aptitude on L2 Procedural Knowledge ...... 228

Summary of the Results for Research Question 3 ...... 229

Interpretation of the Results for Research Question 3 ...... 230

Theoretical Implications ...... 234

Pedagogical Implications ...... 236

7. CONCLUSION ...... 240

Summary of the Findings ...... 240

Limitations ...... 245

Suggestions for Future Research ...... 247


Final Conclusions...... 249

REFERENCES ...... 251

APPENDICES

A. RECEPTIVE METALINGUISTIC KNOWLEDGE TEST (JAPANESE VERSION) ...... 269

B. RECEPTIVE METALINGUISTIC KNOWLEDGE TEST (ENGLISH TRANSLATION) ...... 270

C. PRODUCTIVE METALINGUISTIC KNOWLEDGE TEST ...... 273

D. PROCEDURAL KNOWLEDGE TEST ...... 275

E. PRODUCTIVE METALINGUISTIC KNOWLEDGE TEST SCORING RUBRIC (JAPANESE VERSION) ...... 276

F. PRODUCTIVE METALINGUISTIC KNOWLEDGE TEST SCORING RUBRIC (ENGLISH TRANSLATION) ...... 279

G. BACKGROUND QUESTIONNAIRE (JAPANESE VERSION)...... 282

H. BACKGROUND QUESTIONNAIRE (ENGLISH TRANSLATION) ...... 283

I. CORRELATION MATRIX OF DECLARATIVE KNOWLEDGE, LANGUAGE APTITUDE, AND L2 PROCEDURAL KNOWLEDGE ...... 284


LIST OF TABLES

Table Page

1. Example Target Structures of the Productive Metalinguistic Knowledge Test ...... 90

2. Rasch Item Statistics for the Receptive Metalinguistic Knowledge Test (Measure Order) ...... 116

3. Rasch Principal Components Analysis for the Receptive Metalinguistic Knowledge Test ...... 118

4. Rasch Item Statistics for the Productive Metalinguistic Technical Terminology Scale (Measure Order) ...... 122

5. Rasch Principal Components Analysis for the Productive Metalinguistic Technical Terminology Scale ...... 123

6. Rasch Item Statistics for the Productive Metalinguistic Rule Explanation Scale (Measure Order) ...... 127

7. Rasch Principal Components Analysis for the Productive Metalinguistic Rule Explanation Scale ...... 129

8. Rasch Item Statistics for the Number Learning Scale (Measure Order) ...... 134

9. Rasch Principal Components Analysis for the Number Learning Scale ...... 137

10. Rasch Item Statistics for the Sound-Symbol Association Scale (Measure Order) ...... 139

11. Rasch Principal Components Analysis for the Sound-Symbol Association Scale ...... 141

12. Rasch Item Statistics for the Vocabulary Learning Scale (Measure Order) ...... 144

13. Rasch Principal Components Analysis for the Vocabulary Learning Scale ...... 146

14. Rasch Item Statistics for the Language Analytical Ability Scale (Measure Order) ...... 148

15. Rasch Principal Components Analysis for the Language Analytical Ability Scale ...... 150

16. Descriptive Statistics for the Essay Ratings ...... 151


17. Descriptive Statistics for the Rasch Procedural Knowledge Measures ...... 152

18. Descriptive Statistics for the CAF Measures ...... 156

19. Rotated Component Loadings for the First PCA ...... 157

20. Rotated Component Loadings for the Fifth PCA ...... 159

21. Skewness and Kurtosis Statistics for the Metalinguistic Knowledge, Language Aptitude, and L2 Procedural Knowledge Variables ...... 165

22. Descriptive Statistics for Declarative Knowledge ...... 169

23. Descriptive Statistics for Language Aptitude ...... 169

24. Descriptive Statistics for L2 Procedural Knowledge ...... 169

25. Pattern Matrix of the Rotated Factor Loadings for Metalinguistic Knowledge and Language Learning Aptitude ...... 172

26. Factor Matrix for a One-Factor Solution ...... 173

27. Rotated Factor Loadings for Metalinguistic Knowledge, Language Learning Aptitude, and L2 Developmental Measures ...... 180


LIST OF FIGURES

Figure Page

1. Two-factor model of metalinguistic knowledge and language learning aptitude...... 77

2. Structural model of metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge...... 79

3. Path model of metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge...... 80

4. Item-person map of the receptive metalinguistic construct ...... 114

5. Item-person map of the technical terminology construct ...... 124

6. Item-person map of the rule explanation construct ...... 128

7. Item-person map of the language learning aptitude construct...... 131

8. Item-person map of the number learning construct ...... 136

9. Item-person map of the sound-symbol association construct ...... 140

10. Item-person map of the vocabulary learning construct ...... 145

11. Item-person map of the language analytical ability construct ...... 149

12. Facets vertical ruler of Rasch L2 procedural knowledge measures...... 153

13. CFA of metalinguistic knowledge and language learning aptitude ...... 177

14. Structural model of the effects of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge ...... 183

15. Structural model of the effects of metalinguistic knowledge and language learning aptitude on L2 complexity ...... 185

16. Structural model of the effects of metalinguistic knowledge and language learning aptitude on L2 accuracy ...... 187

17. Structural model of the effects of metalinguistic knowledge and language learning aptitude on L2 fluency ...... 190


18. Path model of the effects of components of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge ...... 192

19. Modified path model of the effects of components of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge ...... 197


CHAPTER 1

INTRODUCTION

The Background of the Issue

Researchers working in the field of second language acquisition (SLA) have placed great importance on investigating the structure and representation of second language (L2) knowledge. Learners’ L2 knowledge has been hypothesized to be a combination of different types of knowledge; the combination most often invoked is that of implicit and explicit knowledge. Although some researchers label the constructs differently, most models of L2 knowledge and acquisition include L2 implicit and explicit knowledge or similarly labeled constructs. Where most researchers differ is in the identification and description of the processes involved in the processing and acquisition of L2 implicit and explicit knowledge, not in the existence of the two constructs. Early theories of L2 acquisition (e.g., Krashen, 1981) posited differing roles for L2 implicit and explicit knowledge and differing acquisition-related processes, and a number of studies reported theoretical accounts and investigations of L2 linguistic representation (e.g., R. Ellis, 1993; Gass, 1983; Hulstijn & Hulstijn, 1984; Rubin, 1981; Sharwood Smith, 1981; Zobl, 1995). Later theories of L2 learning posited the existence of declarative and procedural knowledge of an L2 and described how one type of knowledge might transform into or facilitate the acquisition of the other (e.g., DeKeyser, 1998; N. C. Ellis, 2005; R. Ellis, 1994a; Hulstijn, 2005), the development and acquisition of L2 knowledge (e.g., R. Ellis, 2002; R. Ellis, Loewen, & Erlam, 2006; Macaro & Masterman, 2006; Robinson, 2005b; White, Munoz, & Collins, 2007), psychological constraints on processing (Hu, 2002), integration of knowledge (Jiang, 2007), the role of metalinguistic knowledge in oral proficiency development (Golonka, 2006), the use of metalinguistic knowledge (Roehr, 2006), bilingual declarative and metalinguistic processing (Bialystok, 2001; Bialystok, Craik, & Luk, 2008), and studies of event-related potentials (Tokowicz & MacWhinney, 2005).

Recent conceptualizations of L2 declarative knowledge (e.g., Alderson, Clapham, & Steel, 1997; Ranta, 2002; Roehr, 2008a, 2008b) have considered the construct to include aspects of language learning aptitude. Roehr (2008b) argued that metalinguistic ability (i.e., aptitudinal aspects of metalinguistic knowledge) is part of the multidimensional construct of declarative knowledge. She reported a one-factor solution for factor analyses of scores from L2 grammatical knowledge tests and L2 metalinguistic ability tests. These findings cast doubt on previous interpretations of the factorial structure of linguistic knowledge and language aptitude. Is L2 declarative knowledge multifaceted? If so, to what extent do various factorial structures account for the relationships among the factors? Is language aptitude situated within a declarative knowledge construct, or does it constitute a unique construct? Numerous substantive and empirical questions remain unanswered regarding these fundamental components of SLA.

These implicit and explicit constructs have become the center of numerous debates in the SLA literature, which to date have remained unresolved. Before the acquisition of the two types of knowledge can be theoretically defined and empirically tested, there is a need to identify the underlying structure of the two types of L2 knowledge and to examine the dimensionality and relationships that are theorized to exist among the numerous latent variables that have been implicated in SLA processes and L2 knowledge representation.

I became interested in the topics of L2 knowledge representation, acquisition, and individual differences through the experiences I gained while teaching English in Japan. I first taught English as a foreign language at junior high schools in Japan, which enabled me to observe the materials, activities, and learners’ orientations to the pedagogic approaches applied in the classrooms. The majority of the students were engaging with an L2 for the first time, and they seemed to take a direct, explicit approach to L2 learning. The language classes could be characterized as L1-mediated, deductive, and pedagogically explicit. The teachers and students did not hesitate to use Japanese—many of the lessons involved more interaction in Japanese than English. Theoretically, this type of interaction should result in primarily declarative knowledge of the L2. Moreover, this knowledge should predominantly consist of L1 representations of rules and terminology related to the L2. Indeed, the teachers that I worked with used a vast range of metalinguistic terminology in Japanese when explaining English grammar. In turn, the students often responded in Japanese to the teachers’ directions. When they did speak in English, the utterances seemed to be explicitly controlled and monitored. These observations led me to consider the role of explicit knowledge and instruction in L2 acquisition.

A further observation from teaching in Japanese schools was that few students displayed the characteristics and behaviors associated with successful language acquisition in instructed L2 learning (e.g., linguistic risk-taking, meaning-focused input and output). The majority of the students only spoke when called upon, and their utterances usually included only a morphological aspect of the L2, one- or two-word answers, or, at times, a complete sentence. During group work, students commonly relied on their L1 to negotiate task instructions and even task demands and completion.

Obviously schools and teachers exert influence on the linguistic environment in which students learn and acquire an L2. However, students’ characteristics and preferences also affect the learning environment and acquisition-related cognitive processes.

Individual differences in L2 learning mediate the development of L2 proficiency and explain variation in learning outcomes. One of the most researched L2 individual difference variables is motivation. Scholars consider this construct to wield significant influence on adult L2 learning and acquisition. However, researchers also consider language learning aptitude to be one of the most influential learner variables in L2 learning. Despite this recognition, language learning aptitude has received considerably less attention in the L2 literature. I could more concretely understand the differences in motivational intensity among the learners with whom I interacted daily. It was unclear to me how these same learners might differ in language learning aptitude. Surely there were meaningful differences in learning aptitudes at the beginning of their L2 studies. How, then, did language learning aptitude affect the learners’ L2 development? Did language learning aptitude explain variation exhibited by Japanese learners of English? These were questions to which I did not have unequivocal answers.


Considering the observations and experiences described above, I became interested in the effects of metalinguistic knowledge and language learning aptitude on L2 learning and acquisition. The former is relevant to L2 acquisition and pedagogy as an instructional choice. When and how much metalinguistic instruction should be provided to learners? Does the provision of metalinguistic knowledge result in positive learning outcomes? The latter could explain variation in the development of L2 knowledge. Do learners who differ in their aptitudinal profiles exhibit different learning outcomes? Finally, combining the two constructs results in empirical research questions that are related to central issues in L2 acquisition: Controlling for language learning aptitude, what is the contribution of metalinguistic knowledge to L2 proficiency? Research questions such as these captured my interest and led me to undertake research in the areas of L2 knowledge, acquisition, and individual differences.

Statement of the Problem

There exist two considerable gaps in the research literature: knowledge and theory-based gaps and analytical gaps. The former relates to the effects of declarative knowledge and language learning aptitude on L2 procedural knowledge, and the latter concerns statistical methods of analysis.

First, regarding the effects of declarative knowledge on L2 learning, the findings of previous studies have been mixed. Specifically, the role of metalinguistic knowledge in L2 learning is unclear. Substantial evidence exists for the beneficial effects of an explicit approach to L2 learning (e.g., Norris & Ortega, 2000). However, recent studies have found weak to moderate relationships between metalinguistic knowledge and L2 proficiency. Empirical evidence is needed to clarify the contribution of metalinguistic knowledge to the development of L2 procedural knowledge.

Second, regarding the effects of language learning aptitude on L2 learning, although much research has been conducted on this phenomenon, relatively little evidence exists regarding the effects of language aptitude on declarative and procedural knowledge. That is, studies of language learning aptitude need to be situated within a framework that accounts for the development of declarative and procedural knowledge of an L2.

Third, the analytical gaps in the literature concern the measurement models and statistical analyses employed in the study of metalinguistic knowledge and language learning aptitude on L2 learning outcomes. True-score theory is ubiquitous in the L2 research literature. It is based on a deterministic approach to the measurement of latent constructs. However, Rasch modeling approaches the measurement of human performance differently. It encompasses a stochastic approach to the measurement of latent variables. This measurement model provides interval-level ability measures which can be used in statistical analyses, resulting in a high degree of methodological rigor.
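To make the stochastic approach concrete, the dichotomous Rasch model expresses the probability that person n answers item i correctly solely as a function of the difference between person ability (θ) and item difficulty (δ). The formula below is a standard presentation from the Rasch literature, shown here for reference; the notation is illustrative rather than specific to this study:

\[
P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{e^{\theta_n - \delta_i}}{1 + e^{\theta_n - \delta_i}}
\]

Because the model is probabilistic, observed responses are expected to vary around its predictions, and the estimated θ and δ values are expressed on a common logit scale, which is the source of the interval-level measures referred to above.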

Furthermore, structural equation modeling has been used infrequently in the study of L2 declarative and procedural knowledge. This type of modeling enables researchers to specify hypotheses a priori and to test empirically the predicted research outcomes.


Purposes of the Study

The first purpose of the current study is to model the effects of metalinguistic knowledge on L2 procedural knowledge. Metalinguistic terminology is commonly used in foreign language classrooms, and L2 learners often expect explicit rule explanations. The role and effectiveness of this declarative knowledge in L2 learning outcomes are unclear. Although explicit L2 instruction facilitates L2 learning (Norris & Ortega, 2000), the degree to which metalinguistic knowledge contributes to this process is uncertain. Empirical evidence is needed to inform pedagogical approaches to L2 instruction that incorporate metalinguistic explanations and explicit instruction of grammatical rules. For these reasons, this line of inquiry could contribute to L2 theory development and L2 pedagogy. Specifically, testing models related to declarative and procedural knowledge can contribute to skill theory explanations of L2 learning, and clarifying the role of metalinguistic knowledge can inform approaches to L2 teaching and curriculum design.

The second purpose of the current study is to examine empirically the effects of language learning aptitude on L2 procedural knowledge. The investigation of the effects of language learning aptitude has relevance to L2 theory development and pedagogy. Central to debates in the field of SLA are the critical period hypothesis and learning mechanisms. Opinions vary on the existence and importance of a critical period for language learning. Language learning aptitude is believed to be relatively ineffectual in L1 acquisition. However, preliminary evidence points to a prominent role of language learning aptitude in adult L2 acquisition (e.g., Abrahamsson & Hyltenstam, 2008; DeKeyser, 2000; DeKeyser, Alfi-Shabtay, & Ravid, 2010; Harley & Hart, 1997). Testing the effects of language learning aptitude on procedural knowledge can bring empirical evidence to bear on the theoretical contribution of this construct to L2 learning. This is significant in that language learning aptitude is one of the most researched individual differences, yet few theories of L2 learning have incorporated it as an explanatory variable. Situating language learning aptitude in a model of declarative and procedural knowledge could shed new light on the role of this construct in L2 learning.

The third purpose of the present study is to investigate metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge using the analytic methods of Rasch modeling and structural equation modeling, which addresses the two analytical gaps. Applying these analytic methods is significant in that Rasch modeling provides evidence of how well data fit the Rasch model, which takes the opposite approach from most statistical modeling. Assessing the degree to which data fit the expectations of the Rasch model provides evidence of construct validity, test-sample targeting, and unidimensionality. These measurement-relevant qualities are seldom examined in studies of metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge.
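For reference, fit to the Rasch model is conventionally quantified with infit and outfit mean-square statistics computed from standardized residuals. The definitions below are the standard ones from the Rasch measurement literature; the symbols are generic rather than specific to this study:

\[
z_{ni} = \frac{x_{ni} - P_{ni}}{\sqrt{P_{ni}(1 - P_{ni})}}, \qquad
\text{Outfit}_i = \frac{1}{N}\sum_{n=1}^{N} z_{ni}^{2}, \qquad
\text{Infit}_i = \frac{\sum_{n=1}^{N} W_{ni}\, z_{ni}^{2}}{\sum_{n=1}^{N} W_{ni}}
\]

where \(x_{ni}\) is the observed response, \(P_{ni}\) the model-predicted probability, and \(W_{ni} = P_{ni}(1 - P_{ni})\) the model variance of the response; mean-square values near 1.0 indicate responses consistent with model expectations.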

Similarly, the use of structural equation modeling enables the testing of a priori hypotheses. This is important in that it forces the researcher to clearly specify the relationships in the data before subjecting the data to statistical analyses. Furthermore, the regression weight of each operationalization of the variables can be examined, which provides key information relating to the measurement of the latent variables. The use of these two methods of analysis provides a rigorous statistical investigation of these three constructs.
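To illustrate the form such a priori specifications take, a structural equation model separates a measurement component, which links each observed indicator to its latent variable, from a structural component, which encodes the hypothesized regressions among latent variables. In conventional notation (the symbols here are generic, not the exact model tested later in this study):

\[
x_j = \lambda_j \xi + \delta_j, \qquad
\eta = \gamma_1 \xi_1 + \gamma_2 \xi_2 + \zeta
\]

Here the \(\lambda_j\) are the loadings of the observed variables on their latent variable, and the \(\gamma\) coefficients are the structural regression weights whose magnitudes and statistical significance are evaluated against the hypothesized relationships.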


The Audience for the Study

The current study will be of interest to researchers working in the field of SLA and L2 pedagogy. Researchers will most likely be interested in the results pertaining to the confirmed or rejected relationships found among language learning aptitude, metalinguistic knowledge, and L2 procedural knowledge. Clarifying the relationships among these constructs enables researchers to posit relationships among the variables and to focus on certain aspects of a multidimensional construct that might have been confounded in previous studies.

Likewise, L2 teachers, program administrators, and curriculum designers will most likely be interested in the identified components that comprise L2 declarative knowledge—the way in which these components could be included in a course of study might be of interest. This interest could also extend to language teaching policy, which is often decided at the governmental level, such as a ministry of education. If the relationships among declarative aspects of L2 knowledge are better defined and implications for language teaching are found, educational stakeholders would be able to conceptualize pedagogic interventions and approaches that focus on specific aspects of L2 explicit knowledge. For instance, delineating between instructional interventions that focus on developing processing procedures based on previously taught declarative knowledge and those that focus on teaching metalinguistic knowledge could assist instructors in developing a rationale for the activities and approaches that are used in L2 classrooms.


Delimitations

While the issues under investigation in the current study are theorized to be relevant to all L2 learners, only Japanese learners of English were recruited to participate in the study. These learners grew up in Japan and attended Japanese schools. The instructional styles employed in L2 English classes in these schools can include aspects that are unique to Japanese society. Large amounts of explicit, L1-mediated instruction characterize many of these classes.

Second, the majority of these learners were English majors at universities in Japan. This group of learners could be considered unique in that they share distinctive characteristics that are not common to all L2 learners. For instance, the learners shared an L1 (i.e., Japanese), and the range of ages of the participants was constrained—the majority of the participants (98%) were between the ages of 18 and 23. They were also admitted to universities based on entrance examination scores. Students with comparable backgrounds or knowledge might have performed similarly on the examinations, resulting in a group with similar characteristics.

Third, metalinguistic knowledge was tested in Japanese. The learners who participated in this study received L1-mediated instruction and used L1 technical terminology to study English. Learners in other environments or countries can have differing degrees of knowledge of related metalinguistic terminology, and different aspects of metalinguistic terminology can be emphasized in other educational milieux. Thus, while the results of the present study should be generalizable to Japanese university-level learners of English, generalizing to other populations should be done cautiously—careful consideration of the characteristics of this study’s participants is required.

Fourth, the focus of this study was on metalinguistic knowledge, which the participants had been exposed to in L2 classrooms and through explicit instruction and study. Naturalistic learners often acquire an L2 within the context of their daily lives, with relatively little explicit instruction. The results, then, might not generalize to all L2 learners.

Key Terminology

Terminology from the fields of SLA, L2 pedagogy, and psychology is used throughout the present study. The terms declarative knowledge, language learning aptitude, procedural knowledge, and implicit knowledge are defined as follows:

1. Declarative knowledge / explicit knowledge / metalinguistic knowledge: This knowledge is part of the declarative memory system, which distinguishes it from the nondeclarative system (Squire & Zola, 1996). R. Ellis (2004) defined L2 explicit knowledge as “the conscious awareness of what a language or language in general consists of and/or the roles that it plays in human life” (p. 229). Explicit learning and knowledge are defined by the presence of awareness of learning processes and the ability to reflect on the stored knowledge. Explicit knowledge can be brought into awareness, whereas implicit knowledge cannot be brought into consciousness. More specifically, metalinguistic knowledge refers to explicit knowledge related to the declarative aspects of a language. These aspects include grammatical rules, knowledge of other linguistic rules and features (e.g., discourse and pragmatics), and technical terminology used to describe language. This knowledge is considered to be verbalizable (Paradis, 1994). Explicit effort and reflection, and sometimes instruction, are needed to gain declarative knowledge.

2. Language learning aptitude: This term refers to the cognitive processes and abilities that facilitate L2 learning. These are commonly divided between memory and analytical abilities. These abilities are most likely stable over time, but skill development could be observed for some aspects of language aptitude, such as L1 grammatical sensitivity. Language learning aptitude is thought to influence the rate and ultimate attainment of L2 acquisition.

3. Procedural knowledge: This knowledge is often described as knowledge of how to do something. In skill acquisition theory (e.g., Anderson, 2002), this knowledge is the result of practice, which involves recalling and applying declarative rules. Through practice, declarative knowledge is chunked, and its recall and application become proceduralized. That is, the recall of a single chunk of information could initiate the application of a set of procedures, which facilitates task completion and skill development. The activation of proceduralized knowledge can become automatized through practice. Such automatized task performance can appear to be implicit due to the lack of conscious manipulation of stimuli. However, proceduralized declarative knowledge is qualitatively different from implicit linguistic competence (Paradis, 2009).

4. Implicit knowledge: Knowledge that a person possesses, but the content of which is not open to direct inspection. It is often the product of implicit learning, which happens without direct application of attention. An example is L1 acquisition, which proceeds effortlessly and without explicit instruction. Implicit knowledge and learning are defined by the absence of awareness of learning processes and the lack of ability to reflect on the stored knowledge. Explicit knowledge can be brought into awareness, whereas implicit knowledge cannot be brought into consciousness.

The Organization of the Study

Chapter 2, Review of the Literature, is made up of three main sections: declarative and procedural knowledge, language learning aptitude, and theory-based and analytical gaps in the literature. The purposes of the study and the research questions that guide the study are presented at the end of the chapter. In Chapter 3, Methods, I present information about the participants, instruments, procedures, and statistical analyses employed in this study. In Chapter 4, Preliminary Analyses, I report the results of analyses of the receptive metalinguistic knowledge test, the productive metalinguistic knowledge test, the language learning aptitude test, and the L2 procedural knowledge test. In Chapter 5, Results, I report the results of analyses performed to answer each research question. In Chapter 6, Discussion, I compare the results of the present study with those of previous studies and situate the findings within procedural and declarative theories of L2 learning. In Chapter 7, Conclusion, after restating the purpose of the study and the research questions that guided it, I provide a brief summary of the findings, the limitations of the study, suggestions for future research, and final comments.


CHAPTER 2

REVIEW OF THE LITERATURE

The review of the literature is organized as follows. First, the concepts of declarative and procedural knowledge are reviewed. Second, the notions of implicit and explicit knowledge of a second language are defined, and the characteristics of each type of knowledge are explained. Third, empirical studies of L2 implicit and explicit knowledge are reviewed. Fourth, theories of language learning aptitude are reviewed. Fifth, the relationship between L2 knowledge and language aptitude is examined. Sixth, empirical studies of L2 declarative and procedural knowledge and language aptitude are reviewed. Finally, the purposes, research questions, and hypotheses that guide the current study are presented.

Declarative and Procedural Knowledge

In a review of the structure and function of memory systems, Squire and Zola (1996) divided long-term memory into two broad categories: declarative and nondeclarative. Declarative memory stores factual knowledge and information about events. Nondeclarative memory consists of four components: procedures (i.e., skills and habits), priming, simple classical conditioning, and nonassociative learning. Each component of declarative and nondeclarative memory is mapped to specific regions of the brain. These subsystems subserve the learning, processing, and recall of linguistic knowledge.


Declarative and nondeclarative knowledge stores are often referred to as explicit and implicit knowledge, respectively. Implicit knowledge is also commonly referred to as procedural knowledge due to the processes engaged in the development and use of nondeclarative knowledge stores. These are cognitive processing procedures that are not open to explicit inspection (Paradis, 2009). Implicit and explicit knowledge and learning involve the concept of awareness; thus, the use of those terms might highlight the presence or absence of awareness of learning processes and the ability to reflect on the stored knowledge. Explicit knowledge can be brought into awareness, whereas implicit knowledge cannot be brought into consciousness. Studies of declarative and procedural knowledge often focus on the distinction of knowing what from knowing how. In skill acquisition theory (e.g., Anderson, 2002), declarative knowledge is proceduralized, which results in chunks that are activated in automatized performance. Such automatized task performance can appear to be implicit due to the lack of conscious manipulation of stimuli. However, proceduralized declarative knowledge is qualitatively different from implicit linguistic competence (Paradis, 2009). It should be noted, though, that the use of implicit linguistic competence or highly proceduralized, chunked declarative knowledge can result in similar task completion or linguistic performance. That is, the use of one type of knowledge might be indistinguishable from the other in observable performance. In practice, this is not problematic because if a learner is relying on implicit knowledge or highly proceduralized knowledge, the resultant task performance should be acceptable.

When comparing linguistic performance between learners who rely on implicit or procedural knowledge and those who employ only nonprocedural declarative knowledge, the task performance of the former group should be superior, assuming other variables are controlled for. For present purposes, the important difference between implicit knowledge and procedural knowledge is that implicit knowledge cannot be brought into consciousness, while procedural knowledge, which began as declarative knowledge, is most likely still verbalizable.

The evidence for distinct, separate declarative and procedural knowledge memory systems comes primarily from amnesic patients’ task performance. Squire and Zola (1996) reviewed the findings from three experimental tasks: probabilistic classification learning, perceptuomotor skill learning, and artificial grammar learning. Amnesic patients were gradually able to develop procedural knowledge that aided in task completion, but they were relatively less successful at correctly answering questions regarding factual components of the task.

In the L2 literature, views of knowledge representation have changed over the last few decades, with early research focusing on representation and control (e.g., Bialystok & Sharwood Smith, 1985) and later research concerned with representation and consciousness in learning (e.g., Hulstijn, 2005). This trend becomes evident when viewing the summaries of L2 knowledge representation in SLA textbooks. For instance, Larsen-Freeman and Long (1991) focused on the acquisition and learning distinction posited in monitor theory. This discussion was mainly framed within a criticism of Krashen’s theory of L2 learning and acquisition. Gass and Selinker (2001) provided a cursory overview of implicit and explicit knowledge in a section titled “alternative modes of knowledge representation” (p. 206). In a revised edition of the same text, Gass and Selinker (2008) expanded the implicit and explicit discussion and added an explanation of L2 declarative and procedural knowledge. R. Ellis (1994b, 2008b) reviewed not only the L2 implicit and explicit knowledge literature, but also considered the concept of proceduralization and the status of the putative relationships between the two types of knowledge. Ortega (2009) reviewed the application of skill acquisition theory to L2 learning and covered the concepts of implicit and explicit learning of an L2. The discussion of L2 knowledge representation has also transitioned from the L2 acquisition literature to publications related to L2 teaching (e.g., DeKeyser, 2009). Thus, views of L2 knowledge have changed over time, as can be inferred from the shift from the cover term acquisition to the specification of the types of knowledge that have been learned or acquired, or of the degree of proceduralization of linguistic knowledge.

Paradis (1994, 2009) has argued that there are two distinct knowledge stores (i.e., implicit and explicit) for linguistic knowledge. These systems are separate and hold unique, complementary knowledge about language and the processes used in linguistic performance. However, there have been claims that linguistic knowledge can move between the implicit and explicit systems. For instance, proponents of the strong interface position state that explicit knowledge can be transformed into implicit knowledge (e.g., DeKeyser, 1998). Likewise, scholars have also posited that a weak interface exists, allowing for declarative and procedural components to influence each other (e.g., R. Ellis, 1994a). Research by Paradis (1994, 2004, 2009) and Ullman (2005) refutes all claims of an interactive interface through which knowledge transforms from one type into the other. However, questions remain as to whether explicit knowledge of an L2 is facilitative to the development of implicit linguistic competence, and as to whether language learning abilities equally influence the learning of declarative and procedural knowledge.

Characteristics of L2 Implicit and Explicit Knowledge

R. Ellis (2004) defined L2 explicit knowledge as “the conscious awareness of what a language or language in general consists of and/or the roles that it plays in human life” (p. 229). This knowledge generally is considered to be verbalizable (Paradis, 1994), but R. Ellis cautioned that the ability to verbalize knowledge should be considered separately from a working definition of explicit knowledge. R. Ellis proceeded to identify eight defining features of L2 explicit knowledge: (a) explicit knowledge is conscious; (b) explicit knowledge is declarative; (c) L2 learners’ declarative rules are often imprecise and inaccurate; (d) L2 learners develop breadth and depth of explicit knowledge; (e) explicit knowledge is implicated in controlled processing; (f) difficult tasks encourage the use of explicit knowledge; (g) explicit knowledge could be verbalizable; and (h) explicit knowledge is learnable (pp. 235-240). These characteristics can be further divided between the dimensions of linguistic representation and processing; the eight characteristics are grouped accordingly and summarized below.

A basic assumption underlying the construct of explicit knowledge is that it is learnable. Explicit knowledge of an L2 is consciously held and declarative in nature. Learners are generally aware of what they know regarding the definitions of metalinguistic terminology and the structure and content of grammatical rule explanations. These definitions and rules, however, are often imprecise, resulting in production and rule-application errors (cf. mistakes, which are possibly noticed through monitoring). These knowledge stores can be refined through the increase of linguistic information (i.e., breadth) and the development of increased accuracy of the knowledge (i.e., depth). Increases in breadth and depth lead to increases in the accuracy and precision of the knowledge.

Explicit knowledge is accessed and processed using controlled processing. These processes do not form a continuum with procedural knowledge—that is, the stores are neurobiologically separate (Paradis, 1994, 2009), and knowledge does not transfer from one area to the other. The only continuum is in the degree of use of controlled or automatic processing (Paradis, 2009). Task demands can influence the degree of use of explicit knowledge. If a learner finds an L2 task to be challenging, this disposition can lead to increased use of declarative knowledge (R. Ellis, 2004). Finally, explicit knowledge is potentially verbalizable (R. Ellis, 2004). Given sufficient breadth and depth of the knowledge stores, learners are likely to be able to verbalize their knowledge, as is possible with other declaratively held information.

Implicit linguistic competence develops through the processing of input, which results in intake. Linguistic intake, however, is not just the part of the input that is noticed—it is the implicit linguistic representations that are abstracted from the input (Paradis, 2009). Indeed, if implicit linguistic competence consists of procedures that operate outside of consciousness, the implicit representations should reflect linguistic properties, not the lexical items or phrases themselves. Paradis has argued, then, that intake is doubly implicit (pp. 56-57). The tallying of grammatical structures in the input is carried out implicitly, and the resulting linguistic abstractions are held in implicit memory. This process results in implicit linguistic competence.

Paradis (2004, 2009) claimed that learners seldom internalize an L2 grammar. The grammatical properties that are internalized can be used in real-time communication and are processed in a way similar to L1 linguistic features. However, many properties of the L2 remain as declarative knowledge. Learners can then proceduralize this declarative knowledge and most likely use a combination of implicit and explicit knowledge when communicating in the L2. Declarative knowledge is thought to be accessed when gaps in implicit knowledge are encountered. However, no matter how proceduralized the declarative knowledge might be, it remains in declarative knowledge stores and does not interface with or transform into implicit knowledge. Thus, the development of implicit linguistic competence is implicated in the processing of input (Paradis, 2009), and procedural knowledge is constructed through the practice of the application of declarative knowledge (DeKeyser, 2007). If it is improbable that L2 learners internalize new grammars, then developing L2 procedural knowledge through explicit study and practice remains a prudent approach to L2 learning.

Paradis (2009, pp. 16-22) also distinguished the implicit and explicit properties of the lexicon. L1 speakers hold morphosyntactic properties of lexical items in implicit memory. Morphological properties are applied to lexical items during online communication. People are not aware of these features until explicitly noticing a pattern or learning about them in, for example, a linguistics course. Phonological properties are also subserved by implicit procedural memory, and L1 speakers apply phonological features using implicit linguistic procedures.

In opposition to the implicit lexicon, vocabulary items are part of declarative knowledge. Their meanings are held in declarative memory, and their sound-meaning and sound-symbol associations are explicit. Identifying a picture or processing words in isolation calls on declarative memory. Native speakers simultaneously develop an implicit lexicon and an explicit vocabulary. However, “Second language learners usually gain knowledge of a vocabulary before they acquire a lexicon, and often explicitly learn, at least partially, the syntactic properties of (some) words” (Paradis, 2009, p. 17). Considering the different acquisition routes and processing procedures of native speakers and adult L2 learners, cognitive abilities such as language learning aptitude, working memory, noticing, and hypothesizing appear to exert greater influence in post-critical period L2 learning than in L1 acquisition.

Measuring Declarative and Procedural Knowledge

Measuring declarative knowledge is a relatively straightforward affair. This knowledge is explicit and potentially verbalizable, and given time, learners should be able to recall what is declaratively known about an L2. Tests that tap receptive declarative knowledge can be designed to elicit responses that serve as indicators of levels of metalinguistic knowledge—that is, knowledge about language. Indeed, this type of knowledge includes L1 representations of metalinguistic knowledge that are abstractly or indirectly related to actual L2 processing and use. Likewise, productive tests can also be used. Asking learners to explain grammatical rules or to respond using metalinguistic terminology can yield indicators of productive declarative knowledge.

Measuring procedural knowledge is more complex due to the nature of the knowledge. If the procedures are implicit, conscious reflection on the processes is impossible. One cannot explain the processes involved in applying L1 grammar to concepts (see Levelt [1989] for a discussion of speech production and Kormos [2006] for a discussion of L2 speech production). Grammatical judgment tests (GJTs) have often been used to test learners’ development. These tests are designed to measure learners’ intuitions about L2 grammar. That is, scores on GJTs represent the degree to which learners accept or reject a variety of L2 grammatical structures. Problems arise, however, in the interpretation of the test scores. Learners’ scores are difficult to interpret when, for example, responses are varied for sentences that contain the same grammatical structure. Learners can accept a certain structure in one sentence, yet reject it in a different sentence. Furthermore, it is difficult to know what part of a sentence initiates a judgment of ungrammaticality. Without addressing these issues, GJT score interpretation becomes difficult and muddied. Moreover, the use of timed GJTs is rare in the literature. Untimed GJT scores are associated with explicit knowledge, and timed GJT scores are associated with implicit knowledge (R. Ellis, 2005).

One method for overcoming these issues is through the elicitation of learner language. L2 tasks can be used to elicit spoken or written data from L2 learners. These data can then be described and analyzed. Quantifying aspects of the output enables researchers to create linguistic measures of L2 procedural knowledge. Common elicitation tasks include role-plays, picture descriptions, interviews, and prompts. In addition to a focus on meaning, the inclusion of time pressure in L2 task conditions can predispose learners to rely on implicit or procedural knowledge for language production (R. Ellis, 2005).

L2 task production has often been assessed through the study of three aspects of output: accuracy, fluency, and complexity. These concepts serve as descriptors and measures of learners’ L2 procedural knowledge. Individuals and groups can be compared on these linguistic dimensions to examine L2 proficiency, task effects, and psycholinguistic orientations. In input processing, VanPatten (1990, 2007) framed the discussion as a distinction between meaning and form. That is, when learners process input, attention is first directed to process meaning. If processing procedures have been developed, learners then have the capacity to process form in the input. Skehan (1998) expanded the discussion to include output. In his model of limited processing capacity, fluency and accuracy are in competition. That is, more fluent language production often results in lower grammatical accuracy. Complexity can tax the cognitive resources of learners who are focused on restructuring and linguistic risk-taking, but fluent production could simultaneously be complex depending on whether a learner is relying on exemplar-based linguistic knowledge or rule-based processing. Likewise, accurate production could also be grammatically complex.

In opposition to Skehan’s (1998) view, Robinson (2001b) asserted that complexity results from task-directing resources or task-depleting resources. That is, task design features can lead learners to produce more complex linguistic structures, and different components of cognitive resources are utilized in task performance—limited processing capacity does not account for task performance. While no clear consensus has been reached as to which theoretical position is preferable, the evidence to date favors Skehan’s proposal (R. Ellis, 2003; R. Ellis & Barkhuizen, 2005).

Recent conceptualizations of the definitions and operationalizations of L2 complexity, accuracy, and fluency have resulted in a range of theoretical positions. Skehan (2009) argued that in assessing task performance, more nuanced measures of fluency are needed in addition to a measure of lexis. Furthermore, accuracy and complexity can be disentangled in analyses of task characteristics and conditions. Information organization and manipulation can lead to the use of more complex language, and task structure can facilitate accuracy. When these characteristics are present in a task, complexity and accuracy could be enhanced. However, incorporating these design features in a single task does not inherently result in increased task complexity.

In a review of the research into the effects of planning, R. Ellis (2009) emphasized that planning can result in greater fluency and complexity and possibly accuracy. When learners employ strategic planning, greater fluency is found in task performance; however, “results are more mixed where complexity and accuracy are concerned, possibly because there is a trade-off in these two aspects” (p. 501). These results lend support to Skehan’s (1998) trade-off hypothesis.

Norris and Ortega (2009) called for researchers to present empirical evidence in support of the warrants and interpretations drawn from learner language analyses. One part of that argument involves the analysis of complexity, accuracy, and fluency measures. In the studies reviewed, they found that L2 fluency and complexity measures loaded on two distinct factors, illustrating the unique contribution of those aspects to language performance.
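Although operationalizations vary across studies, commonly used indices in the complexity, accuracy, and fluency literature take forms such as the following; these are illustrative examples of such ratio measures, not necessarily the specific measures adopted in this study:

\[
\text{Accuracy} = \frac{\text{error-free clauses}}{\text{total clauses}}, \qquad
\text{Fluency} = \frac{\text{words produced}}{\text{minutes of production}}, \qquad
\text{Complexity} = \frac{\text{clauses}}{\text{T-units}}
\]

Ratio measures of this kind allow performances of different lengths to be compared on a common footing.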

Previous Studies of L2 Implicit and Explicit Knowledge

Numerous studies of L2 implicit and explicit knowledge have been conducted regarding the nature of the knowledge stores, the facilitative roles the two types of knowledge might provide, the acquisition and development of L2 knowledge, and the processing constraints on the access to declarative and proceduralized knowledge. In this section, the studies that are directly related to the current study are reviewed.

In one of the first attempts to examine the respective roles of L2 implicit and explicit knowledge in the judgment of grammaticality, Bialystok (1979) set up conditions under which L2 judgments might differ. Although the sample size differed for the various analyses, approximately 285 learners participated in the study. Grade level was used as a measure of proficiency: the 10th-grade students were in their fourth year of French study, the 12th-grade students were in their sixth year, and the adult group was considered to be at a higher level of proficiency than the 12th-grade students.

Task-related factors and learner-related factors were manipulated to create three experimental conditions: The first group had to judge only grammaticality; the second group was asked to determine grammaticality and to identify the part of speech implicated in the erroneous sentences; and the third group, in addition to judging grammaticality, had to choose the violated rule from a list of nine rules for each sentence. Each group was given all test items in two time conditions: The first condition required participants to respond within 3 seconds, and the second condition required participants to respond within 15 seconds. Learner-related variables were grade level and language competence. The testing instrument was an aural GJT consisting of 24 sentences. Six of the sentences were grammatically correct, and 18 were constructed to contain errors involving a verb, adjective, or pronoun.

The results indicated that the 15-second time condition resulted in better performance, and there was an interaction effect for time and identification of ungrammatical sentences. As for rule violation identification, rules that pertained to lexical items were easier than rules related to tense and agreement. Overall, Bialystok (1979) interpreted the results as suggestive of a processing model in which implicit knowledge is used initially to judge grammaticality, and explicit knowledge is then referenced to determine the locus of the errors. This finding is significant in that it implies that even on untimed GJTs, learners can rely on implicit knowledge to judge grammatical sentences. This implication, however, does not suggest that learners' developing systems are target-like—overgeneralized forms can be implicated in accessing the learners' implicit knowledge.

Green and Hecht (1992) sought to identify the extent to which L1 German learners of English who had been taught explicit grammatical rules could state the rules that explained grammatical violations. The learners were also required to correct the test sentences, rewriting them to conform to target-language norms. Three hundred university-level German L1 learners of English and 50 university-level native speakers of English completed a test in which they were presented with 12 English sentences; each sentence contained one grammatical error, and the German L1 participants wrote in German, for each sentence, the rule that they believed had been violated. The participants were also asked to rewrite the erroneous sentences to correct the errors. The grammatical rules that were targeted included a variety of morphological and syntactic structures. Two raters judged the acceptability of each rule given by the participants for each erroneous sentence. Acceptable rule explanations ranged from detailed metalinguistic explanations to abbreviations for syntactic structures (e.g., SPO for subject, predicate, object). If the two raters disagreed, a third rater was asked to arbitrate. The rating process resulted in 12 dichotomously scored rule explanations. Seven hypotheses were posited regarding participants' ability to provide rules and correct sentences. The expected results were as follows:

1. As the rules were commonly taught (in school or university, presumably), the majority of the participants were expected to have learned them.

2. More able or experienced learners should have a better command of the rules than less able learners.

3. All learners who can access an appropriate rule should succeed in correcting the erroneous sentences.

4. If learners cannot access an appropriate rule, they were predicted to fail in correcting the erroneous sentences.

5. Rule provision should vary as a function of rule complexity.

6. L2 learners of English should be more successful in providing appropriate rules because native speakers of English receive a qualitatively different type of rule instruction.

7. Native speakers of English, regardless of access to explicit metalinguistic rules, should succeed in correcting all grammatical errors.

The results suggested that, overall, the more experience the German learners had with English, the more success they had on the error correction task. For each expectation, the results differed somewhat. Expectation 1 was not supported—the German learners of English were able to provide an appropriate rule in only 46% of the cases. Even though they had been explicitly taught the rules in language classes, they were unable to provide the rules for the majority of the test items. Expectation 2 was partially supported.

Except for a correlation coefficient between successful rule explanation and error correction for the L2 English learners (r = .53, p < .01), no hypotheses were statistically tested. Therefore, the results, expressed as percentages correct, should be viewed as suggestive evidence—they cannot be cited as a rigorous test of the research expectations.
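
For reference, the single statistic that was tested, a Pearson correlation and its associated p value, can be computed as in the sketch below; the score vectors are randomly generated placeholders, not Green and Hecht's data.

    # Computing a Pearson correlation and its p-value; the scores below
    # are randomly generated placeholders, not data from the study.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    rule_scores = rng.integers(0, 13, size=300).astype(float)   # 0-12 rules stated correctly
    corrections = rule_scores + rng.normal(0, 2.5, size=300)    # related correction scores

    r, p = stats.pearsonr(rule_scores, corrections)
    print(f"r = {r:.2f}, p = {p:.4f}")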

Methodologically, some issues could be raised with the study. First, the errors in each sentence were underlined. In this task, learners were not required to actively reflect on their interlanguages to identify and locate the grammatical errors in the target sentences. This implies that the employed task primarily required the use of only metalinguistic knowledge. Second, another assumption that requires scrutiny is the inference from task results to the manifestation of implicit knowledge. The researchers assumed that learners who could correct the sentences but could not state the violated rule possessed implicit knowledge of the target structures. Although this is certainly a possible explanation for the observed results, other plausible explanations exist. Namely, the lack of a time constraint on task performance allowed the participants to actively reflect on their declarative knowledge—that is, the participants could have completed the correction task without accessing L2 implicit knowledge. Differentiating between L2 implicit knowledge and proceduralized L2 analyzed knowledge would require different task conditions, which were not employed in their research design, even if measures can only be relatively more implicit or explicit (DeKeyser, 2003). Thus, this study revealed some insights into learners' metalinguistic knowledge of English grammar, but any implications related to implicit linguistic competence need to be interpreted cautiously.

As a means of addressing the issue of modes through which implicit knowledge can be acquired, De Jong (2005) conducted an experiment to ascertain the effects of aural input processing. The participants were 52 Dutch students who learned a miniature artificial language based on Spanish. The participants were divided among three conditions: receptive training, receptive and productive training, and a control. The results indicated that both experimental treatments developed a knowledge base that could be used for comprehension but was not available to the same degree for production. Both experimental conditions outperformed the control group. The implications were that some grammar might be learned through listening, but that the short duration of the experiment created difficulties in assessing what type of knowledge (i.e., implicit or explicit) was developed.

Han and R. Ellis (1998) focused on the problem of measuring L2 implicit and explicit grammatical knowledge and the roles of these types of L2 knowledge in the development of L2 language proficiency. To these ends, the researchers employed various types of tests designed to distinguish between L2 implicit and explicit knowledge.

A timed picture description task and a timed GJT were used to tap L2 implicit knowledge, and an untimed GJT and interview-based rule provision test were employed to tap L2 explicit knowledge. Forty-eight learners of English participated in the study—all of the participants were enrolled in an intensive English program at a U.S. university, and their ages ranged widely (19 to 47 years). Most notably, some of the participants had been in the U.S. for only ten days before participating, while others had been in residence for eight years. Five instruments were used in the study: (a) oral production test, (b) GJT, (c) interviews, (d) TOEFL, and (e) SLEP.

The oral production test consisted of a picture description task with cues written on the pictures to prompt the participants to supply a verb complement in their descriptions of each picture. There were 14 dichotomously scored items. The second test was a timed GJT, which consisted of 34 sentences (20 grammatical and 14 ungrammatical). Participants were given 3.5 seconds to respond to the stimulus sentences, which were presented by computer. All items were scored dichotomously. The same timed GJT was given one week later, resulting in two timed GJT scores, although the test content was the same and no treatment was given between the testing sessions. The third test involved individual interviews during which participants were presented with the same GJT items that were used for the timed GJTs. Participants were asked to judge the sentences, to report whether their judgments were informed by the use of rule or feel, and, if the judgment was rule-based, to explain the rule that was used. Judgment of the sentences was scored dichotomously, and rule explanations were scored using a six-point scale (0 = No awareness of target structure or unable to explain employed rule; 5 = States correct rule using appropriate technical language). Interview data were available for 30 of the participants. TOEFL and SLEP scores were available for 26 and 46 of the participants, respectively.

The test scores were factor analyzed (PCA; oblique rotation), which resulted in a five-factor solution. The final reported factorial structure was a two-factor solution; thus, presumably because only two factors had eigenvalues over 1.00, the factor analysis was rerun with two factors set for extraction. Scores from the untimed GJT and the metalinguistic comments loaded strongly on the first factor, while scores from the oral production test, TGJT1, and TGJT2 loaded on the second factor. These two factors were interpreted as representing L2 explicit and implicit knowledge, respectively. The factor analysis results were interpreted as evidence for the claim that relatively independent measures of L2 implicit and explicit knowledge can be constructed. Furthermore, scores from the SLEP test were statistically correlated with scores from the oral production test, TGJT1, and the untimed GJT; scores on the TOEFL were statistically correlated with scores on the untimed GJT. These correlation coefficients were interpreted as reflecting the degrees to which the two proficiency tests required the use of L2 implicit knowledge or L2 explicit knowledge.

Pursuing a similar agenda, R. Ellis (2005) sought to validate the use of scores from various L2 tests that purport to provide relatively independent measures of L2 implicit and explicit knowledge. Twenty native speakers of English and 91 learners of L2 English participated in the study. The learner group was of mixed L2 proficiency; the learners had studied L2 English for ten years on average. The majority of the sample were native speakers of Chinese (70.5%). The battery of tests was designed to test knowledge of 17 grammatical structures that are problematic for learners to acquire.

As tests of implicit knowledge, participants completed an oral imitation test, an oral narrative test, and a timed GJT. The oral imitation test contained 34 items. After listening to an item, the participants were asked to respond in two parts: (a) to express agreement or disagreement with the content of the statement; and (b) to repeat the sentence in correct English. All items were scored dichotomously and reported as percentage correct. For the oral narrative test, participants were required to read a story twice and then to tell the story within three minutes. The retellings were judged using obligatory occasion analysis for the target structures. Each participant's percentage correct for each target structure was averaged to create total scores. A 68-item computer-based timed GJT was also given. There were four items targeting each of the 17 grammatical structures. After determining English native speakers' average times on the timed GJT items, 20% more time was added to each average time to account for L2 learners' processing speed. Times ranged from 1.8 to 6.24 seconds per item. A percentage correct was calculated for the dichotomously scored items.
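
As an illustration of obligatory occasion scoring of this general kind, the sketch below tallies correct suppliance of each target structure in contexts coded as obligatory. The input coding format is an assumption made for this example, not the scheme used by R. Ellis (2005).

    # Generic obligatory occasion analysis; the (structure, supplied-correctly)
    # coding of each obligatory context is an assumed input format.
    from collections import defaultdict

    def obligatory_occasion_scores(occasions: list[tuple[str, bool]]) -> dict[str, float]:
        """Percentage of obligatory contexts in which each structure was correctly supplied."""
        supplied: dict[str, int] = defaultdict(int)
        total: dict[str, int] = defaultdict(int)
        for structure, correct in occasions:
            total[structure] += 1
            supplied[structure] += int(correct)
        return {s: 100.0 * supplied[s] / total[s] for s in total}

    coded = [("past -ed", True), ("past -ed", False), ("3sg -s", True), ("3sg -s", True)]
    per_structure = obligatory_occasion_scores(coded)
    print(per_structure)                                     # {'past -ed': 50.0, '3sg -s': 100.0}
    print(sum(per_structure.values()) / len(per_structure))  # average across structures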

To test L2 explicit knowledge, an untimed GJT and a metalinguistic knowledge test were administered. The untimed GJT was comprised of the same content as the timed GJT, but there was no time restriction for answering. In addition to judging the target sentences, participants had to rate the certainty of their judgments and report whether they used rule or feel to answer. A percentage correct for the judgments was used as a score. The second explicit knowledge measure, the metalinguistic knowledge test, required participants to respond to multiple-choice questions about metalinguistic terminology and to find examples of target structures in a reading passage. Scores were calculated as the total percentage correct.

R. Ellis (2005) hypothesized that the oral imitation test, oral narrative test, and timed GJT would load on an implicit factor, and that the untimed GJT and metalinguistic knowledge test would load on an explicit factor. The results largely confirmed the predictions: The imitation, oral narrative, and timed GJT tests loaded on one factor, the untimed GJT loaded on both factors, and the metalinguistic knowledge test loaded on the second factor. After dividing the untimed GJT into two parts, grammatical items and ungrammatical items, and discarding the grammatical items, the remaining 34 ungrammatical items loaded on the second factor. The results were interpreted as supporting the feasibility of designing relatively independent measures of L2 implicit and explicit knowledge.

As this study is directly related to the present study, further examination and critique of its methodological rigor is warranted. A number of issues can be raised concerning the psychometric orientation and the methodological decisions informing the study. First, R. Ellis (2005) requested a two-factor solution in a confirmatory study. Requesting a two-factor solution in SPSS always results in a two-factor solution; this analytic method should not be used for hypothesis testing. Framing the study as confirmatory and then requesting factors in an exploratory analytic method does not result in a test of a hypothesis about the latent structure of L2 knowledge. The requested two-factor solution was also suspect in that the second eigenvalue did not reach 1.00, which is often considered a lower limit for factor extraction.
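
The eigenvalue criterion at issue is straightforward to check before fixing the number of factors: the eigenvalues of the correlation matrix of the test scores can be inspected and only components above 1.00 retained. The following sketch assumes a participants-by-tests score matrix and uses placeholder data; it illustrates the check itself, not a reanalysis of R. Ellis's data.

    # Checking the Kaiser criterion (eigenvalues of the correlation matrix > 1.00)
    # before deciding how many factors to extract; placeholder data only.
    import numpy as np

    def kaiser_n_factors(scores: np.ndarray) -> int:
        corr = np.corrcoef(scores, rowvar=False)      # correlations among the tests
        eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted largest first
        print("eigenvalues:", np.round(eigenvalues, 2))
        return int(np.sum(eigenvalues > 1.0))

    rng = np.random.default_rng(0)
    fake_scores = rng.normal(size=(91, 5))  # placeholder: 91 learners x 5 tests
    print("factors to retain:", kaiser_n_factors(fake_scores))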

Second, higher-order factors were not investigated. All of the manifest variables were theorized to load directly on one of two latent variables: L2 implicit or explicit knowledge. Indeed, these two latent variables theoretically account for the types of L2 knowledge that have been hypothesized in many previous studies. Second-order factors, however, are theoretically implied when considering the supporting evidence for an implicit-explicit model of L2 knowledge. Viewing L2 analyzed knowledge and metalinguistic knowledge as equal or similar operationalizations of L2 explicit knowledge is difficult to support theoretically. Positing two first-order factors of analyzed knowledge and metalinguistic knowledge and one second-order factor of L2 explicit knowledge is more harmonious with current theoretical accounts of L2 explicit knowledge.

Finally, metalinguistic knowledge was tested in the L2. This testing method could confound metalinguistic knowledge with L2 proficiency. No warrants were provided in R. Ellis (2005) for why the sample would have developed rich knowledge stores of metalinguistic terminology in the L2. To posit this claim, the linguistic environment in which the students studied and the classroom instruction to which they were exposed would need to be explained in detail. If the sample had primarily studied the L2 explicitly through their L1, then testing the participants' metalinguistic knowledge in the L2 would confound it with their L2 proficiency. The tests used in the study might have tapped primarily L2 proficiency and not the target construct of metalinguistic knowledge.

In a study that focused on the levels of metalinguistic knowledge among Australian university students, Elder, Warren, Hajek, Manwaring, and Davies (1999) asked what first-year undergraduates know metalinguistically about language, and what, if any, relationship exists between levels of metalinguistic knowledge and language learning in university. Multiple groups of native speakers of English participated in the study. Each group was studying a different L2 (Italian, n = 94; Chinese, n = 57; beginning French, n = 78; advanced French, n = 105). All groups sat for a metalinguistic knowledge test in English, the words-in-sentences subtest of the MLAT, and an inductive language learning test. The L2 French learners also took a metalinguistic knowledge test in French (n = 87) and various French proficiency tests (n = 32).

Comparisons of the scores on the various tests revealed that first-year university students tended to have low levels of metalinguistic knowledge. Grammatical terms such as subject, noun, verb, and adjective were relatively easy for the sample, but other items such as predicate were difficult. Similar results were reported for the L2 French learners' levels of L2 French metalinguistic knowledge. Learners were better able to identify examples of metalinguistic terms than they were to use the terms to explain grammatical rules. Regarding learners' background factors, previous study of an L2 other than English was positively related to scores on the English metalinguistic knowledge test. Finally, the relationship between metalinguistic knowledge and L2 learning success in university was examined. Statistically significant correlations were found for the advanced learners of L2 French on the French metalinguistic knowledge test (r = .56) and the English metalinguistic knowledge test (r = .43). The results were interpreted as supporting the findings of Alderson et al. (1997) in that overall, metalinguistic knowledge is not related to L2 learning in university, but that as learners' proficiency levels increase, so do their levels of metalinguistic knowledge.

R. Ellis (2008) examined grammatical difficulty in terms of L2 implicit and explicit knowledge. He tested the claims of Pienemann (1998) as they relate to the learning difficulty of various system- and item-based grammatical structures. The participants and data were the same as those reported in R. Ellis (2006).

For the part of the study that used implicational scaling, 20 participants were randomly selected from the data set. R. Ellis then attempted to test implicit and explicit grammatical difficulty by using implicational scaling to analyze experimentally elicited data. The study is unique in that tests of Processability Theory to date have used extended, free constructed response data from learners. Experimentally elicited data are often characterized as brief, nonrepresentative samples of learners' true competence. If experimentally elicited samples of learner data prove to be robust for testing theories of SLA development and acquisition, the validity of the tests of L2 implicit and explicit knowledge would be further strengthened and supported.

The grammatical structures were examined through descriptive statistics and implicational scaling. The results supported previous claims that L2 implicit and explicit knowledge are distinguishable through scores on elicited imitation tests (implicit) and the ungrammatical items on untimed grammaticality judgment tests (explicit). The scores on the L2 implicit knowledge measures supported the claims of Processability Theory; that is, grammatical difficulty as predicted by Processability Theory was observed in the implicit scores. The explicit scores, as R. Ellis (2008) predicted, however, did not align with the predicted hierarchical structure of difficulty. The implications of this study are that researchers need to carefully define acquisition and use a measure of implicit knowledge to test and measure L2 acquisition. Tests of L2 explicit knowledge are distinct from tests of L2 implicit knowledge—they measure a different construct which is developmentally and qualitatively different from L2 implicit knowledge. Additionally, R. Ellis provided support for the use of experimentally elicited samples of spoken learner data for measuring L2 implicit knowledge.
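
Implicational scaling of this kind is typically evaluated with Guttman's coefficient of reproducibility: learners and structures are ordered by their totals, and the proportion of responses violating the implied cumulative pattern is counted. The sketch below is a generic illustration of that procedure under these assumptions, not a reconstruction of R. Ellis's (2008) analysis.

    # Generic Guttman implicational scaling check on a learners-x-structures
    # 0/1 acquisition matrix; the data are an invented illustration.
    import numpy as np

    def coefficient_of_reproducibility(acquired: np.ndarray) -> float:
        m = acquired[np.argsort(acquired.sum(axis=1))[::-1]]   # sort learners by total
        m = m[:, np.argsort(m.sum(axis=0))[::-1]]              # sort structures by total
        errors = 0
        for row in m:
            k = int(row.sum())
            ideal = np.array([1] * k + [0] * (len(row) - k))   # perfect cumulative pattern
            errors += int(np.sum(row != ideal))
        return 1.0 - errors / m.size

    data = np.array([[1, 1, 1, 0],
                     [1, 1, 0, 0],
                     [1, 0, 1, 0],   # this row contains scaling errors
                     [1, 0, 0, 0]])
    print(round(coefficient_of_reproducibility(data), 2))  # 0.88; >= .90 is conventionally scalable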

In summary, there is empirical support for a declarative and nondeclarative framework of human memory-related systems, which is synonymous with an implicit-explicit description of long-term memory. Declarative memory holds knowledge pertaining to facts and events, while nondeclarative memory stores procedures related to skills, priming, perception, conditioning, and nonassociative learning. The latter has been referred to as procedural memory in the literature, and these procedures are implicated in the application of implicit linguistic competence. Many studies reporting on investigations of implicit and explicit knowledge can be found, and the results lend support to the view that the two memory systems are implicated in different tasks and that the relative degree of difficulty of developing implicit or explicit knowledge of an L2 differs according to grammatical structure. However, previous studies have operationalized implicit and explicit knowledge in ways that make direct comparisons of results difficult, and in some cases the language of the testing instruments could have confounded the results, weakening the validity of the interpretations and conclusions regarding the factorial structure and the relative influence of other variables on declarative and procedural L2 knowledge.

Language Learning Aptitude

From its psychometric beginnings to recent conceptualizations, language aptitude has been one of the most researched individual difference variables in L2 acquisition research. Indeed, Dörnyei (2005) referred to L2 language aptitude research as a "success story" (p. 62), not only because of the historical span over which research trends have continued, but also because of the theorizing of the role of language aptitude in L2 acquisition and the robust findings that inform current views of L2 acquisitional processes. Sawyer and Ranta (2001) highlighted its importance in a discussion of influential variables on L2 acquisition, arguing that "language aptitude is the one that contributes most to accounting for differential success of individual learners" (p. 320). In the following review, first, the ways in which language aptitude has been conceptualized and operationalized are described. Second, empirical studies of the effects of language aptitude on L2 acquisition are reviewed. Third, studies of metalinguistic knowledge, language aptitude, and L2 proficiency are examined. Finally, the reviewed studies are examined within a declarative-procedural model of L2 acquisition.

Conceptualizations of Foreign Language Aptitude

Carroll’s (1962) conceptualization of foreign language learning aptitude resulted from the psychometric evaluation of instruments designed to test aspects of language aptitude. Based on the results of statistical analyses, Carroll settled on four factors that underlie foreign language aptitude:

1. Phonetic coding ability: The ability to identify sounds in a speech stream and associate them with a written form. The sound-symbol associations must be encoded into long-term memory.

2. Grammatical sensitivity: The ability to identify and distinguish grammatical roles and functions of words and other linguistic elements in sentence-level discourse.

3. Rote learning ability: The ability to remember sounds or written words in a foreign language and commonly their L1 counterparts.

4. Inductive language learning ability: The ability to infer rules in a new language from a limited sample of linguistic material and to make generalizations based on the induced linguistic systematicity.

One of the earliest and most widely used language aptitude tests, the Modern Language Aptitude Test (Carroll & Sapon, 1959; MLAT), was conceived to tap these facets of language aptitude; however, aspects of the original theoretical description of the aptitude factors were not represented in the original test (Skehan, 1998).

Skehan (1989, 1998) proposed a three-factor model of language aptitude. The three components were as follows:

1. Auditory ability: This factor includes sound-symbol associations and is "essentially the same as Carroll's (1965) phonemic coding ability" (Skehan, 1998, p. 201).

2. Linguistic ability: This factor integrates grammatical sensitivity and inductive language learning abilities. These two abilities are implicated in passive and active language abilities, respectively.

3. Memory ability: This factor accounts for the role of memory in language learning. Previous theories focused primarily on the encoding of associations. Skehan (1998) called for a stronger emphasis on the coding, storage, and retrieval of linguistic material and knowledge.

These three components help explain SLA through their connection to processing stages of acquisition. Auditory ability is implicated in the processing of input, linguistic ability is central to processing during implicit and explicit learning, and memory for language is highly relevant for output (Skehan, 1998).

Sawyer and Ranta (2001) addressed some of the criticisms of L2 language aptitude research by outlining five oft-mentioned problems and offering solutions or describing research-related attempts to overcome the highlighted controversies. The first problem was the lack of connection between the proposed latent factors of language aptitude and the instruments used to test those constructs. The researchers proposed that inductive language learning and grammatical sensitivity could be collapsed into a latent language analytical ability factor. Indeed, studies specifically addressing language analytical ability have been conducted in recent years, and language analytical ability has been shown to correlate moderately with L2 learning.

The second problem addressed the claim that language aptitude tests measure only intelligence and not a distinct language aptitude. This criticism strikes at the viability of a distinct language aptitude variable and the construct validity of tests of foreign language aptitude. However, research has shown that although there is overlap with intelligence, language aptitude accounts for different proportions of the variance in intelligence and L2 proficiency (Sasaki, 1993). This result is not unexpected or problematic in that foreign language learning is primarily cognitive in nature, as is most learning, and overall intelligence can be expected to influence learning outcomes regardless of the subject matter. The issue of concern is whether language learning ability is synonymous with intelligence. Empirical research has demonstrated the unique contribution that language aptitude brings to the L2 learning process.

The third problem described by Sawyer and Ranta (2001) is related to the stability of aptitude over time—that is, whether aptitude is a trait or whether it can be developed as a skill. The evidence so far supports a conceptualization of language aptitude that is more trait-like than learned. Indeed, the findings of recent studies suggest that aptitude is relatively stable over time (e.g., DeKeyser, 2000; Harley & Hart, 1997, 2002).

The fourth concern of language aptitude research is that language aptitude only plays a part in formal L2 learning. As the field of SLA is concerned with both naturalistic and formal L2 acquisition, the applicability of language aptitude to all learning environments is paramount. However, in studies of instructed L2 acquisition with participants who have only experienced instructed language learning, the relevance of this concern is greatly diminished.

The final problem outlined by Sawyer and Ranta (2001) speaks to the lack of a clear connection between language aptitude and L2 acquisition processes. The explanatory power of language aptitude would be undermined if a clear relationship could not be found between aptitude and the primarily cognitive processes implicated in L2 acquisition. Researchers have begun to turn their attention to this issue.

In a review article, Skehan (2002) considered the relationship of language aptitude and L2 acquisition. He divided the discussion into four areas of inquiry: input processing capacity, the representation of memory, aptitude and output, and the role of focus-on-form approaches to L2 instruction. Based on this discussion, he proposed a conceptualization of language aptitude revolving around four categories: (a) noticing, (b) patterning, (c) controlling, and (d) lexicalizing. Within each category of this framework, language aptitude components are matched to SLA processing stages. The nine processing stages are noticing, pattern identification, extending, complexifying, integrating, becoming accurate, creating a repertoire, automatizing rules, and lexicalizing (see Skehan, pp. 88-92, for a detailed discussion of these stages). Of interest here is the construct coverage of current language aptitude tests as related to the putative processing stages and their aptitude component counterparts. Sections of the major language learning aptitude tests cover components of stage 1 (phonemic coding), stage 2 (fast analysis, grammatical sensitivity), stage 3 (inductive language learning), and stage 4 (grammatical sensitivity). These stages build upon one another in a hierarchical structure. If a learner has low aptitude in the first five stages, high aptitude in the latter stages might not be enough to compensate for learning difficulties occurring in the first five stages. The MLAT and LABJ subtests are concentrated within the first four SLA processing stages. As further learning is predicated upon these initial determinants of acquisition, high aptitude in these abilities should lead to advantages in rate and ultimate attainment of L2 learning.

Originally based on research into L1 abilities, Sparks and Ganschow proposed the Linguistic Coding Differences Hypothesis (Ganschow, Sparks, & Javorsky, 1998; Sparks & Ganschow, 1991, 2001). The hypothesis was formed from the observation that L1 reading and writing abilities seemed to predict L2 proficiency, but no effect was found in the processing of meaning. Thus, L1 orthographic and phonological abilities were viewed as the primary building blocks for subsequent L2 learning, but semantic processing was largely unaffected by these abilities. Numerous studies have reported on the role of L1 abilities in L2 learning (e.g., Bialystok & Luk, 2007; Dufva & Voeten, 1999; Koda, 2000; Sparks, Patton, Ganschow, & Humbach, 2009).

Robinson (2001a, 2005a, 2007) proposed a conceptualization of language learning aptitude that consists of a network of aptitude complexes. The theory draws on research that claims aptitude complexes and aptitude-treatment interactions need to be integrated into a theory of aptitude (e.g., Snow, 1992). The first component specifies abilities that are considered relevant for input processing. These include, for example, processing speed, pattern recognition, and phonological memory (Williams & Lovatt, 2005), which combine to influence learners' ability to perform cognitive functions on the linguistic input such as those implicated in interactional approaches to L2 acquisition (Gass, 1997; Long, 1996; Pica, 1994). In addition to gap-noticing abilities, the second-tier aptitude complexes consist of memory for contingent speech, deep semantic processing, memory for contingent text, and metalinguistic rule rehearsal. In the third tier of the model, the aptitude complexes are connected to task features such as planning time, background knowledge, and here-and-now. Learners exhibiting certain aptitude profiles are hypothesized to benefit from L2 tasks that match their strengths. Finally, these components interact with the fourth tier of pragmatic and interactional abilities, which include mind reading, social insight, and emotional intelligence.

Relevant to the current study are the memory and analytical components of the model. In Japanese junior and senior high schools, L2 English instruction focuses primarily on rote memorization of grammatical rules and on explicit instruction and practice. Memory for language, grammatical sensitivity, and pattern recognition should therefore contribute more to learning in such a linguistic environment than aptitude complexes related to implicit learning. Furthermore, phonological memory-related constructs have been shown to account for L2 development at beginning and later stages of learning (O'Brien, Segalowitz, Collentine, & Freed, 2006; O'Brien, Segalowitz, Freed, & Collentine, 2007).

In a review of recent research into language aptitude, Dörnyei (2005) proposed that language aptitude could be defined as any combination of a number of cognitive processes linked to L2 learning. This definition takes a pragmatic and conservative view of language aptitude, but it leaves room for a broad interpretation of language learning ability.

Operationalizations of Language Aptitude

Researchers have operationalized language aptitude through the development of language aptitude tests. These instruments ascertain the extent to which learners exhibit abilities in various processes related to foreign language learning. To date, five language aptitude tests have been widely reported on in the L2 learning literature: the Modern Language Aptitude Test, the Pimsleur Language Aptitude Battery, the Language Aptitude Battery for the Japanese, the Lunic Language Marathon, and the Cognitive Ability for Novelty in Acquisition of Language as Applied to Foreign Language Test. These tests are reviewed chronologically.

Modern Language Aptitude Test. The MLAT (Carroll & Sapon, 1959) is comprised of five subtests. Part 1, Number Learning, which targets timed associative learning, asks participants to learn the names of 1-, 2-, and 3-digit numbers in a new language and transcribe the numbers that they hear. Part 2, Phonetic Script, tests participants' phonemic coding ability. Test takers study a phonetic script and choose the word that they hear from choices written in phonetic script. Part 3, Spelling Cues, taps participants' L1 vocabulary knowledge and their ability to handle novel spellings of known words. This section targets L1 knowledge and phonetic coding skills. Part 4, Words in Sentences, targets language analytic ability and grammatical sensitivity. Test takers are required to choose a word that serves the same grammatical function as a specified word. All words are embedded in sentences. Part 5, Paired Associates, targets rote memory learning. Test takers study a list of Kurdish vocabulary items and their English translations, and then they complete a multiple-choice test of the word pairs without referring to the original lists.

The MLAT has been used in numerous studies over the last five decades, and there is a growing consensus regarding its validity when used as a test of language aptitude. Indeed, DeKeyser (2000) stated that the MLAT "is usually considered the best verbal aptitude test in terms of its predictive validity for L2 learning" (p. 509).

Pimsleur Language Aptitude Battery. The Pimsleur Language Aptitude Battery (Pimsleur, 1966; PLAB) is a test of language aptitude that is comprised of six components. Section 1 asks test takers to report the grades they received in a variety of school subjects. The rationale for this is that success in some school subjects is predictive of success in other subjects. Section 2 consists of items related to students' interest in learning a foreign language. Test takers respond on a five-point scale. The third section requires test takers to choose the correct synonym from four choices for a given adjective.

Section 4 is a test of language analysis. Test takers are given words and phrases in an unknown language and English translations for the linguistic data. After studying the lists and example phrases, participants must choose from four choices the correct translation of English sentences or target language sentences. This subtest targets language analytic ability and inductive language learning ability. Section 5 asks test takers to identify words in a foreign language. Three similar sounding foreign language words are presented, and participants must indicate which of the three words was included in each of 30 sentences. This subtest targets the ability to discriminate among sounds in a foreign language. In the final section, Sound-Symbol Association, test takers listen to nonsense words that differ in the number of syllables, and they must choose the string they heard from a list of four choices. This subtest targets the ability to make sound-symbol associations.

Language Aptitude Battery for the Japanese. The Language Aptitude Battery for the Japanese (LABJ) was created to fill the need for a test of L1 Japanese speakers' language aptitude. To create this language aptitude test, Sasaki (1991) translated Part 5 of the MLAT so that the word pair translations were in Japanese. Part 2 of the LABJ, Language Analysis, is a Japanese version of the PLAB Language Analysis section. The third section of the LABJ is an original sound-symbol association test that is similar in form to the sound-symbol association subtest of the PLAB.

This 61-item aptitude test was designed to measure three aspects of language aptitude: rote vocabulary learning ability, inductive language learning ability, and sound-symbol association ability. Sasaki developed the test in the participants' L1 because testing language aptitude in an L2 can confound proficiency with aptitude, and testing aptitude in a language that is grammatically or orthographically similar to the participants' L1 might bias the results in favor of specific groups of learners. Sasaki modeled the test after the Modern Language Aptitude Test and the Pimsleur Language Aptitude Battery.

The first subtest of the LABJ is rote vocabulary learning of an assumed previously unstudied language. Participants are given approximately five minutes to memorize 24 Kurdish words. The words are written in katakana with the Japanese translations provided in list format. They then answer 24 multiple-choice questions that ask the test-takers to choose the correct Japanese translation for a given Kurdish word. Participants are not permitted to refer back to the original vocabulary list when answering the multiple-choice items.

The second subscale targets grammatical sensitivity through the analysis of vocabulary items and example sentences. Learners must apply their abilities to induce syntactic rules and morphological properties of a newly presented language. This subtest consists of 15 multiple-choice items.

The third subtest is designed to measure learners’ phonemic sensitivity by asking them to make sound-symbol associations. The 22 multiple-choice items present participants with four nonsense words written in either katakana or romaji. After listening to a stimulus word, learners are asked to circle the word that they heard. The stimuli are read once at a fast pace for each item.

A number of previous researchers have used the LABJ to test language aptitude (e.g., Robinson, 2002; Sasaki, 1996). Robinson reported a reliability coefficient of .85 (KR-21) for the entire test. Sasaki reported reliability coefficients of .84, .80, and .63 for the paired associates, language analysis, and sound-symbol association scales, respectively.
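
For reference, KR-21 can be estimated from only the number of items and the mean and variance of the total scores. The sketch below illustrates the formula with placeholder totals; it is not a reanalysis of the LABJ data.

    # KR-21 reliability for dichotomously scored tests, computed from the
    # number of items and the mean and variance of total scores; the score
    # vector is a placeholder, not LABJ data.
    import numpy as np

    def kr21(total_scores: np.ndarray, n_items: int) -> float:
        k = n_items
        m = total_scores.mean()
        var = total_scores.var(ddof=1)  # sample variance of the totals
        return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))

    totals = np.array([48, 52, 39, 55, 44, 50, 58, 41, 47, 53])  # placeholder totals, 61 items
    print(round(kr21(totals, n_items=61), 2))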

Lunic Language Marathon. The Lunic Language Marathon (LLM) is a test of language learning aptitude that was developed by Sick and Irie (1998). The test instructions are given in Japanese, so the test is suitable for use by researchers working with Japanese-speaking participants. The LLM is designed around the learning of an artificial language. It is based on linguistic universals but not on Japanese or English word order. Test administration is conducted via recorded instructions, which provide explanations of the tasks and control the testing time for each section. The LLM is comprised of the following four subtests: number learning, sound-symbol association, vocabulary learning, and language analytical ability. As this test was used in the present study to assess language learning aptitude, a description of the subtests is included in Chapter 3: Methods.

Cognitive Ability for Novelty in Acquisition of Language as Applied to Foreign Language Test. Grigorenko, Sternberg, and Ehrman (2000) reconceptualized language aptitude within a psychological theory of knowledge acquisition. They proposed that the ability to cope with novelty and ambiguity was central to language learning. This ability involves five processes of knowledge acquisition: (a) selective encoding, (b) accidental encoding, (c) selective comparison, (d) selective transfer, and (e) selective combination. These processes operate on the lexical, morphological, semantic, and syntactic levels. Learners receive input on these levels in either a visual or oral mode. Learners' knowledge retention and management of novelty can be tested through immediate and delayed recall tasks. This construct definition and method of testing language aptitude underpin the design of the Cognitive Ability for Novelty in Acquisition of Language as Applied to Foreign Language Test (CANAL-FT).

The CANAL-FT is composed of five sections that involve immediate recall of stimuli, and four sections from the original five that require delayed recall related to the learning of an artificial language, Ursulu. The five sections were described by Grigorenko, Sternberg, and Ehrman (2000, p. 394) as follows:

1. Learning meanings of neologisms from context

2. Understanding the meaning of passages

3. Continuous paired-associate learning

4. Sentential inference

5. Learning language rules (immediate recall only)

Section 1 presents paragraphs that are comprised of a varying number of unknown words (5%, 10%, or 20%). Test-takers have to show understanding of the neologisms via multiple-choice questions and successful storage in working memory by answering two multiple-choice questions after every passage. To assess long-term memory storage, one item is presented at least 30 minutes after the introduction of the relevant passage.

Section 2 is identical to Section 1 in that unknown words are presented within paragraphs, but learners are tested on their oral and reading comprehension. The questions target learners’ comprehension of main ideas, details, and their ability to infer and apply knowledge of the Ursulu language.

Section 3 is similar in form to a traditional paired associates test. Learners are asked to memorize and recall 60 word pairs. One unique aspect of this subtest is that some system items were included—that is, some items to be learned have systematic characteristics that could facilitate learning.

In Section 4, Ursulu sentences are presented with their corresponding English translations. Test-takers then have to choose the correct translation of an English or Ursulu sentence from five choices. Half of the 20 stimuli are presented visually, and half are presented aurally.

Section 5 tests learners’ acquisition of the rules of the Ursulu language. The basis of the expected learning comes from the input provided throughout the test. This section is comprised of 12 items.

While the CANAL-F theory advances the study of language aptitude, the resulting test is a unique application of the theory. Most subtests assess learners' acquisition of novel words in difficult English passages. The test was designed for native speakers of English and was targeted at advanced university students. As there have been no other studies of the CANAL-FT, the validity of score use and interpretation in different populations remains an open question.

Effects of Language Aptitude on L2 Learning

Language learning aptitude is theorized to affect the rate and ultimate attainment of L2 learning. Studies have shed light on these issues through the investigation of the effects of components of language aptitude on L2 learning outcomes. Research results have been reported for naturalistic and classroom language learning involving a diverse pool of participants.

In an oft-cited study, Harley and Hart (1997) posited that there would be a positive relationship between young learners' memory-related abilities and L2 proficiency and a positive relationship between older learners' analytical abilities and L2 proficiency. Young learners were defined as L2 learners who began an L2 immersion program from the first grade, and late immersion learners were L2 learners who began an L2 immersion program from the seventh grade. The rationale for these hypotheses was based on the idea that young learners rely more on memory-based implicit learning processes in L2 acquisition, while older learners rely more on analytical abilities during instructed L2 learning. The early immersion group included 36 participants, and the late immersion group was comprised of 29 participants. However, a number of participants in each group were excluded from many of the analyses due to the differing number of participants who completed the various tests. The language aptitude measures were comprised of the MLAT Word Pairs subtest, a memory for text task, which required participants to write down the idea units contained in two listening passages, and the PLAB language analysis subtest. Six L2 proficiency measures were used: (a) vocabulary recognition, (b) multiple-choice listening comprehension, (c) cloze test, (d) timed writing task, (e) elicited imitation task, and (f) cartoon description task. The results indicated that on average, the late immersion group (M = 9.1, SD = 3.7) had a significantly higher language analysis score than the early immersion group (M = 7.4, SD = 3.0) (p = .045).

For average proficiency scores, the early immersion group outperformed the late immersion group on the vocabulary recognition task and the sentence repetition task (equivalent scoring method), while the late immersion group scored significantly higher on the writing task. For the early immersion learners, vocabulary recognition was statistically correlated with two aptitude measures: word pairs (r = .37) and memory for text (r = .53); listening comprehension correlated with all three aptitude measures (word pairs, r = .49; memory for text, r = .50; language analysis, r = .48); and cloze scores were associated with memory for text (r = .40). The pattern of proficiency-aptitude correlations for the late immersion learners was somewhat different. Language analysis scores were statistically correlated with vocabulary recognition (r = .41), cloze (r = .41), task fulfillment in writing (r = .43), and written accuracy (r = .44), while word pairs was correlated with written accuracy (r = .44). However, in addition to these correlations, there were many negative or near-zero correlation coefficients. The authors then conducted regression analyses to determine which aspects of language aptitude predicted L2 proficiency. For the early immersion learners, memory for text predicted vocabulary recognition, memory for text and language analysis predicted listening comprehension, and memory for text predicted cloze test performance. For the late immersion learners, language analysis predicted (a) vocabulary recognition, (b) cloze, (c) written task fulfillment, and (d) written accuracy. The results were interpreted as evidence for the implication of different facets of language aptitude at different stages of learning. Early immersion learners' L2 proficiency was predicted by language aptitude-related memory measures, while late immersion learners' L2 performance was predicted by language analysis measures.

In a related study, Harley and Hart (2002) examined whether naturalistic learning outside of the classroom was more closely associated with memory-based learning approaches than with analytical-based learning approaches. A sample of 31 English-speaking eleventh and twelfth graders completed two language aptitude subtests and a number of L2 proficiency measures; however, the sample size varied as not all participants completed all tests. The language aptitude tests consisted of a memory for text task, which was a measure of memory for language learning, and the language analytic section of the PLAB, which represented a measure of language analytic ability.

The L2 proficiency measures were the same as the ones described in Harley and Hart (1997). The researchers also administered a post-immersion questionnaire that asked participants to report on a 5-point scale how much on average they interacted daily in the L2, how they self-evaluated their language learning progress, and how they approached language learning in an immersion environment (i.e., the extent to which they used memory-based approaches or analytical approaches). Reliability estimates were reported for the measures, but the estimates for the French listening comprehension pre- and posttests were low (.45 and .56, respectively). Among the correlations between language aptitude and L2 proficiency, a number of statistically significant relationships were observed. Cloze scores correlated with language analysis (r = .39, p < .05, n = 28); error count in writing was related to memory for text (r = .41, p < .05, n = 28); and in the sentence repetition task, language analysis was related to exact scoring (r = .61, p < .01, n = 27), equivalent scoring (r = .47, p < .05, n = 27), and sentence features (r = .44, p < .05, n = 27). Linear regression was used to test whether aspects of language aptitude predicted aspects of L2 proficiency. Language analysis predicted cloze and sentence repetition (exact scoring, equivalent scoring, and sentence features), while memory for text predicted errors. The researchers interpreted the findings as support for the hypothesis that late-immersion learners tend to use language analytic abilities, and that language analytical abilities influence language learning outcomes.

In a study of language aptitude, age of acquisition, and ultimate attainment, DeKeyser (2000) predicted negative correlations between age of beginning English acquisition and ultimate attainment, and, considering possible critical period effects, only adults with high verbal aptitude were predicted to score in the range of participants who began acquiring English from an early age. The participants were 57 native speakers of Hungarian—42 of them were over 16 years of age at the time of arrival in the U.S., and 15 were under 16. Participants completed a 200-item grammaticality judgment test, which included numerous subscales divided by grammatical structure, and they also sat for a 20-item multiple-choice aptitude test given in Hungarian that was adapted from the words-in-sentences subtest of the MLAT (M = 4.7, SD = 2.79). The results implied that ultimate attainment was negatively correlated with age of acquisition (r = -.61, p < .001, N = 57). For the participants who were over 16 upon arrival, grammaticality judgment scores and language aptitude scores were statistically correlated (r = .33, p < .05, n = 42), while for the under-16 group, those scores were not correlated (r = .07, ns, n = 15). Language aptitude was not related to age of arrival (r = .09, ns, N = 57). These findings lend support to the findings of Harley and Hart (1997) that language aptitude tends to be stable over time, but that older learners rely on analytical approaches to language learning, while younger learners rely on memory-based, implicit learning mechanisms. Findings such as these lend support to the hypothesis that, of the language learning aptitude components, language analytical ability might be the strongest or exclusive predictor of L2 declarative and procedural knowledge.

Ross, Yoshinaga, and Sasaki (2002) explored the effects of early and late immersion, foreign language learning, and language aptitude on the acquisition of English. The participants were Japanese university students who were divided into three groups based on the age at which they were first exposed to English and the number of years they lived overseas: Child SLA Group (n = 34); Teen SLA Group (n = 38); and Teen Foreign Language Learning Group (n = 57). A 24-item grammaticality judgment test rated on a 7-point scale and two language aptitude tests (MLAT, Words in Sentences, and LABJ, Part 2, Language Analysis) were administered. For some aspects of grammar, no exposure or aptitude effects were found, but for adjunct island extractions, high aptitude learners in the Teen Foreign Language Learning Group rejected sentences at the same rate as Child SLA learners. The findings provide additional support for the hypothesis that language aptitude influences classroom language learning, most likely due to the analytical approach to teaching and learning commonly used in instructional settings.

Kiss and Nikolov (2005) developed a language aptitude test for Hungarian-speaking young language learners. The test was modeled after the MLAT and PLAB. After piloting the test with 36 12-year-old children, a revised version of the test was used with a sample of 398 Hungarian sixth graders. In addition to the revised aptitude test, an English language proficiency test and a motivation questionnaire were administered. Based on the descriptions of the test tasks, the test seems to have tapped primarily L2 explicit knowledge. The sound-symbol association scale produced the lowest reliability estimate of the aptitude subscales (α = .65); the other aptitude subscales exhibited reliability coefficients ranging from .72 to .76. The aptitude test scores and proficiency test scores exhibited a strong correlation (r = .63, p < .01), which was stronger than those between aptitude and motivation (r = .37, p < .01) and motivation and proficiency (r = .48, p < .01).

Erlam (2005) examined possible interaction effects among instructional treatments and language learning aptitude. Ninety-two New Zealand high school students in their second year of L2 French study were assigned to one of four instructional groups: deductive instruction (n = 21), inductive instruction (n = 22), structured input (n = 23), or control (n = 26). The researcher taught three 45-minute sessions targeting French direct object pronouns. The participants completed pre-, post-, and delayed posttests that focused on receptive and productive skills. Six months after the end of the study, 60 of the participants sat for a language learning aptitude test. The language aptitude test consisted of the sound discrimination subtest from the PLAB as a test of phonemic coding (n = 60, M = 19.97, SD = 4.77, α = .73), the words-in-sentences subtest of the MLAT as a test of language analytic ability (n = 43, M = 15.26, SD = 4.40, α = .60), and a test of working memory, which required participants to write from memory word lists that they had been shown (n = 60, M = 19.97, SD = 4.52, α = .73).

Correlational analyses revealed that for the deductive group, sound discrimination scores were related to immediate posttest listening comprehension scores (r = .52, p = .05, n = 19) and were negatively related to immediate posttest oral production gain scores (r = -.46, p = .05, n = 19). For the inductive instruction group, scores on the words-in-sentences subtest were related to immediate posttest listening comprehension gain scores (r = .51, p = .05, n = 20) and delayed posttest written production gain scores (r = .59, p = .01, n = 20), and were negatively related to delayed oral production gain scores (r = -.61, p = .01, n = 19). For the structured input group, scores on the words-in-sentences subtest were associated with gain scores on the immediate written production test (r = .49, p < .05, n = 21) and the delayed written production test (r = .51, p = .05, n = 21). Scores on the working memory test were correlated with immediate written production scores (r = .49, p = .05, n = 21) and with delayed written production scores (r = .57, p = .01, n = 21). The results of regression analyses revealed that MLAT words-in-sentences scores predicted written production immediate posttest scores, and working memory scores and words-in-sentences scores predicted written production delayed posttest scores. The results should be interpreted cautiously due to the low reliability of the listening comprehension test (α = .43 for aggregated test scores from three versions of the test).

In a study of the benefits of starting foreign language learning at an early age, Larson-Hall (2008) compared two groups of Japanese university students who differed in the age at which they started studying English (early starters, n = 61; late starters, n = 139). The early starters began studying English before entering junior high school, while the late starters began their English studies upon entering junior high school. Participants completed a grammaticality judgment test adapted from DeKeyser (2000), a phonemic discrimination task, and Parts 3 and 4 of the LABJ (Language Analysis, K-R 20 = .80; Sound-Symbol Association, K-R 20 = .63). For the early starters, statistically significant correlations were obtained for language aptitude and phonemic discrimination (r = .37, p < .05, n = 58) and language aptitude and grammaticality judgment scores (r = .32, p < .05, n = 59). Regarding group differences, no statistically significant difference in language aptitude was found between the early starters and the late starters. This result implies that language aptitude does not develop as a skill, regardless of whether more time is spent on domain-specific tasks.

In a study that examined the relationships among language aptitude, phonological memory, and L2 proficiency, Hummel (2009) had 77 French-speaking university students sit for the short form of the MLAT (spelling cues, words in sentences, and paired associates; α = .55), which was given in the participants’ L1 (French); a phonological memory task, which required participants to repeat unknown Arabic words (α = .86)—the stimuli varied from 3 to 9 syllables, with 4 items at each syllable length for a total of 28 items—and the Michigan Test of English Language Proficiency (20 vocabulary items, 40 grammar items, and 20 reading comprehension items; α = .87). Not all participants completed all tests, which resulted in varying n-sizes for the analyses. A number of statistically significant correlations were observed: L2 proficiency and phonological memory (r = .35, p < .01); L2 proficiency and aptitude (r = .25, p < .05); phonological memory with the vocabulary subtest (r = .36, p < .01) and the grammar subtest (r = .33, p < .01); aptitude and reading (r = .29, p < .05); and aptitude and grammar (r = .25, p < .05). However, a number of nonsignificant relationships were also observed: paired associates and L2 proficiency (r = .03, ns); paired associates and phonological memory (r = .06, ns); and aptitude and phonological memory (r = .08, ns). An exploratory factor analysis resulted in three factors (i.e., proficiency, aptitude, and phonological memory). It is notable, however, that the words-in-sentences test cross-loaded on the proficiency factor, and only one variable defined the phonological memory factor. A multiple regression analysis revealed that language aptitude and phonological memory statistically predicted L2 proficiency, accounting for 29% of the variance in the dependent variable. Individually, words-in-sentences predicted 14.5% of the variance, phonological memory accounted for 10%, spelling cues explained 3.1% of the variance, and paired associates accounted for an additional 1.5% of the variance. The results were interpreted as confirming the importance of aptitude and phonological memory in L2 learning and as suggesting that the language aptitude subtests might be related to L1 reading skills, as predicted by the Linguistic Coding Differences Hypothesis (Sparks & Ganschow, 1991). Hummel also posited that phonological memory was distinct from language aptitude as operationalized in the study, given that the test scores loaded on a separate factor, and that phonological memory might play an important role in the beginning stages of acquisition, when initial input processing and phonological mapping occur.

Studies of Metalinguistic Knowledge, Proficiency, and Aptitude

For classroom learning, language aptitude is thought to influence learning processes. That is, in instructed SLA, metalinguistic terminology is used extensively and explicit rules are often taught in classes. Language aptitude might mediate the learning of these explicit terms and rules. However, controversy exists over the factorial structure of L2 proficiency, metalinguistic knowledge, and language aptitude, and conflicting results have been reported in empirical studies of these constructs.

In a large-scale study of metalinguistic knowledge, Alderson et al. (1997) examined the relationships among metalinguistic knowledge, language aptitude, and language proficiency. Five-hundred-nine university students who were native speakers of English completed a metalinguistic assessment test consisting of three main sections: Section 1 asked learners to identify parts of speech in example sentences given in English and French; Section 2 was modeled after Bialystok (1979) and asked learners to judge the grammaticality of French sentences and to state the rules that were broken; Section 3 was modeled after Sorace (1985) and asked learners to make corrections to erroneous English sentences and to state the grammatical rules that had been violated. Section 1 was scored dichotomously. On Section 2, learners were given one point for identifying and correcting any errors, one point for correct usage of metalanguage, and one point for stating the rules that had been broken. Section 3 was scored by giving one point for correcting an error, one point for stating the rule that had been broken, and one point for giving the reason that explained the rule.

To test the learners’ L2 French proficiency, numerous tests were administered, including a cloze test, a C-test, a reading comprehension test, a listening test, and a writing test. Different n-sizes were reported for each test. The main motivation for choosing the various proficiency tests seemed to stem from their use in previous studies and the desire to enable comparisons with those studies.


Two aptitude tests were employed in the study: the words-in-sentences test from the MLAT (n = 121) and an inductive language learning test (n = 150). In the latter test, learners were given a text written in Swahili, and English equivalents were given for some of the sentences. The participants then had to translate the remaining sentences from Swahili to English. Each of the 30 items was scored dichotomously.

The results relevant to the present study are summarized. First, the mean scores on the metalinguistic tests were between 33% and 65%. This result was interpreted as showing that most learners had difficulty with providing or identifying metalinguistic terms or with providing explanations of rules. Furthermore, the standard deviations were high in relation to the means, implying that a wide range of scores was observed.

Second, learners’ ability to correct errors in French was greater than their ability to state rules that had been broken, which was greater than their ability to use metalinguistic terminology to explain the violated rules. The interpretation of this finding focused on the efficacy of metalanguage in that if learners can state rules without using metalanguage, then the value of metalanguage becomes questionable. This finding is also interesting when compared with the results of a questionnaire administered to 80 participants in the study: The learners reported that they occasionally or often needed to use metalanguage, but that the terminology was not fully explained in their language classes. Thus, learners were aware of many terms and had to use the terms for their studies even though they often did not fully understand the terms.

Third, the correlation coefficients among the metalinguistic knowledge tests and the proficiency tests were low. The highest observed correlation was between the total metalinguistic test score (without the error identification score) and the C-test (r = .47).

Alderson et al. (1997) found no theoretical rationale for the variation in the observed correlations, which led them to the conclusion that listening ability is not related to metalinguistic knowledge, and that metalinguistic knowledge and language proficiency are, for the most part, unrelated.

Fourth, three factor analyses were carried out with select subtests of the metalinguistic knowledge test, certain proficiency tests, the words-in-sentences subtest of the MLAT, and the inductive language learning test. In the first analysis of the metalinguistic knowledge tests and proficiency tests, after applying a Varimax rotation, two factors emerged: metalinguistic knowledge and language proficiency. Test scores from one metalinguistic knowledge test, which required learners to identify errors in French sentences, cross-loaded on both factors. This complex loading was attributed to the role of both metalinguistic knowledge and proficiency in the identification of L2 errors. These results, however, should be interpreted cautiously, as there were only 27 cases in the factor analysis of nine variables.

The second factor analysis was also focused on metalinguistic knowledge and proficiency (n = 111), but the scores from the listening test were removed from the analysis because they were not statistically correlated with any of the metalinguistic knowledge tests or language aptitude tests. The removal resulted in the French errors test scores loading strongly on the proficiency factor (.72) and moderately on the metalinguistic factor (.48).


The third factor analysis included the metalinguistic knowledge test scores, one language proficiency test score (the C-test), and the two language aptitude test scores (the number of cases was not reported). A two-factor solution was reported, but the factorial structure was not interpreted. Interestingly, the words-in-sentences test scores loaded weakly on the first factor (.14) and moderately on the second factor (.53). In the first analysis, Alderson et al. (1997) interpreted a factor loading of .52 as high, but in the third factor analysis, a loading of .53 was interpreted as not high. The inductive language learning test scores loaded moderately on the first factor (-.40) and strongly on the second factor (.64). The language aptitude tests, then, did not load strongly on either resultant factor and did not have high enough loadings to stand as a third factor.

Alderson et al. (1997) interpreted and summarized the results as follows: Metalinguistic knowledge and language proficiency are relatively uncorrelated, and where there is a relationship, it is weak. Metalinguistic knowledge does not appear to play a role in language proficiency; that is, the two constructs were not highly correlated in their study. Finally, the relationship among metalinguistic knowledge, L2 proficiency, and language aptitude is unclear. Language aptitude test scores did not correlate strongly with any of the other tests, and they failed to distinguish themselves as a unique factor. Thus, the study sheds some light on the factorial structure of metalinguistic knowledge, L2 proficiency, and language aptitude, but questions remain as to the relationships within each factor, the operationalization of the constructs, and overall methodological rigor.

The results need to be interpreted cautiously because metalinguistic knowledge was tested in the participants’ L1 and L2. Learners’ metalinguistic knowledge could be qualitatively different in their L1 and L2. When testing metalinguistic knowledge, the language used to present the items, the language in which the participants respond, and the linguistic background of the participants require close scrutiny. If a metalinguistic knowledge test is given in the participants’ L2, a theoretical rationale for why the sample would have significant L2 mental representations of metalinguistic knowledge needs to be articulated. If classroom L2 grammatical instruction and explanation had been carried out in the L1, metalinguistic knowledge related to the L2 would be represented in L1 memory stores. Testing learners’ knowledge of metalinguistic terminology in the L2 thus confounds metalinguistic knowledge and L2 proficiency. Furthermore, considering the number of variables used in the study, larger samples might be needed to observe clear factorial structures. Indeed, as L2 metalinguistic knowledge has been shown to increase with proficiency (Elder et al., 1999), low-level learners could have rich, elaborated L1 representations of metalinguistic knowledge that are related to the L2.

Alternative operationalizations of metalinguistic knowledge have been employed in empirical studies. Commonly, metalinguistic knowledge has been operationalized as learners’ ability to provide, understand, or explain grammatical structures and themes in correct and erroneous sentences. The ability to correct the erroneous sentences is often included in the construct definition of metalinguistic knowledge. Roehr (2008b) extended the operational definition of metalinguistic knowledge to include a measure of L2 analytic ability. The purpose of her study was two-fold: (a) to examine the relationship between the proficiency and metalinguistic knowledge of university-level native English-speaking learners of L2 German, and (b) to examine the relationship between the ability to correct, describe, and explain L2 errors and the ability to identify parts of speech in L2 sentences.

Three instruments were created to test learners’ L2 proficiency, L2 representations of metalinguistic knowledge, and L2 analytic ability. First, a 42-item gap-fill and multiple-choice L2 German proficiency test was administered. Next, a 15-item L2 metalinguistic knowledge test was given, which required learners to correct, describe, and explain highlighted errors. Descriptions and explanations were scored separately, resulting in a maximum possible score of 30. The third test was considered to be part of the metalinguistic knowledge test. It consisted of 15 items that tested L2 analytic ability. Learners were asked to identify the grammatical functions of highlighted parts of sentences, similar to the words-in-sentences subtest of the MLAT. It should be noted that this test was given in the learners’ L2 (i.e., German). No metalinguistic terminology was necessary to complete the test. Fifty-two participants completed the L2 proficiency test, and 54 participants completed the metalinguistic and language analytic tests.

The results of correlational analyses revealed that the proficiency and metalinguistic knowledge tests were statistically correlated. A principal components analysis of the metalinguistic-related items and the L2 analytic items resulted in a one-factor solution, which explained 82% of the variance.

Roehr (2008b) discussed the implications of the results in terms of L2 metalinguistic knowledge and L2 analytic ability being parts of a complex construct. She proposed that a reevaluation of the components of L2 metalinguistic knowledge was needed; that is, L2 metalinguistic knowledge is a multifaceted construct that includes a dimension related to L2 analytic ability.

A careful comparison and critique of Alderson et al. (1997) and Roehr (2008b) reveal other factors that might account for the findings of the two studies. First, low correlations between L2 proficiency and metalinguistic knowledge (Alderson et al.) could be explained by the operationalizations of L2 proficiency. If the proficiency measures focused on listening and speaking ability or L2 implicit knowledge, low correlations with tests of metalinguistic knowledge would be expected if the learners had domain-relevant implicit linguistic competence that could be called upon during the testing tasks. Thus, to further clarify the relationship between proficiency and metalinguistic knowledge, Roehr utilized a narrow measure of L2 proficiency—the construct was operationalized as grammatical and lexical knowledge. Constraining the definition thus turns the focus to the specific relationship between L2 grammatical and lexical knowledge and metalinguistic knowledge. Roehr hypothesized that the use of a wider definition of proficiency (i.e., one that includes the four skills or other competencies) explained the low correlations previously found between proficiency and metalinguistic knowledge.

Furthermore, Roehr included a measure of language analytic ability in the measure of metalinguistic knowledge. The rationale for the inclusion was that researchers (e.g., Ranta, 2002) have found evidence for the claim that the effects of instruction are not powerful enough to compensate for individual differences in language analytic ability. Therefore, Ranta proposed that metalinguistic ability and language analytic ability are a unidimensional construct. The rationale for this claim, however, is unclear with respect to the definition of metalinguistic ability. There arises a need to define and differentiate metalinguistic knowledge and metalinguistic ability. Proposing that these concepts are the same would require theorizing and describing declarative metalinguistic knowledge and proceduralized metalinguistic knowledge (i.e., metalinguistic ability). The differences between these concepts, if any, warrant explanation, as these ideas form the foundation of Roehr’s arguments. An alternative proposal can be advanced positioning metalinguistic ability as a component of language aptitude. Indeed, theories and operationalizations of language aptitude often include measures of metalinguistic ability. The MLAT, PLAB, LLM, and LABJ include language analysis, analytical ability, or inductive language learning subtests. Tests of inductive language learning measure metalinguistic ability in that they require test-takers to process linguistic stimuli and models for syntactic, morphological, and lexical aspects of language. Rules must be noticed, controlled, and applied to successfully complete the tests. The role of metalinguistic ability, therefore, can be situated as a component of language aptitude. Learners’ individual differences in metalinguistic ability can then be viewed as mediating their development of explicit knowledge, interlanguage restructuring through the provision and processing of feedback, and development of procedural knowledge.

The operationalizations of metalinguistic knowledge and language analytical ability, however, could account for the one-factor solution that Roehr (2008b) found. Both constructs were tested in the L2, which could confound the measurement of the two variables with that of L2 proficiency. Thus, a one-factor solution might be an artifact of an encompassing L2 proficiency construct. Furthermore, conflicting findings have been reported in the literature. While some studies reported one-factor solutions for metalinguistic knowledge and language learning aptitude (e.g., Alderson et al., 1997; Roehr), others have reported that metalinguistic knowledge is a unique construct and is divergent from language learning aptitude (e.g., Roehr & Gánem-Gutiérrez, 2009).

Knowledge and Theory-Based Gaps

There exist two primary gaps in the research literature: (a) the effects of metalinguistic knowledge on L2 procedural knowledge; and (b) the effects of language learning aptitude on L2 procedural knowledge.

Declarative knowledge is theorized to play a significant role in L2 learning, with explicit instruction and knowledge facilitating L2 learning (Norris & Ortega, 2000). However, the effects of metalinguistic knowledge have received relatively little attention. Although metalinguistic knowledge is theorized to play a significant role in L2 learning (R. Ellis, 2012, p. 174), empirical evidence supporting this claim is sparse, and the findings have been contradictory. Few studies have directly investigated the impact of metalinguistic knowledge on L2 outcomes. Furthermore, most studies of L2 explicit knowledge have not situated L2 outcomes within a specific theory of L2 acquisition. There exists a need to examine the effects of metalinguistic knowledge on L2 learning outcomes situated within a declarative-procedural model of L2 acquisition.

As reviewed above, language learning aptitude has been the focus of numerous studies that examined the relationships between aptitude and variables relevant to L2 learning, but few studies have directly examined the effects of language learning aptitude on L2 declarative and procedural outcomes. A theoretical gap exists in the design and analysis of the effects of language learning aptitude on L2 learning in that previous studies have not examined the role of language learning aptitude within a specific theory of L2 knowledge representation. Specifically, recent conceptualizations of declarative and procedural determinants of L2 learning have not been applied and tested using models that specify a priori the type of L2 knowledge that is being examined and the predicted effects of different aspects of language learning aptitude on L2 procedural skills. Research focused on modeling the effects of language learning aptitude on L2 procedural knowledge would fill this substantive void.

Analytical Gaps

Two significant analytical gaps exist in the L2 procedural, declarative, and language aptitude literature: The first relates to measurement models and the second to empirical evidence. With the exception of Sick (2007), Rasch modeling has not been applied in validating the use of scores from tests of declarative knowledge and language learning aptitude. It follows that previous studies of implicit and explicit knowledge and language aptitude were based on analyses using classical test theory (i.e., true score theory and raw scores). The current study implemented an analytic method that employs interval-level ability measures for each variable under investigation. The Rasch model (Rasch, 1960) suits this purpose in that it takes a strict approach to measurement: The Rasch approach is unique in that it tests how well data fit an idealized measurement model, as opposed to the more common model-to-data fitting approach. Through the inspection of person reliability and separation indices, item and person fit statistics, and tests of unidimensionality, the degree to which useful measures have been obtained can be examined.
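
To make the model concrete, the dichotomous Rasch model specifies the probability of a correct response as a function of the difference between person ability and item difficulty, both expressed in logits:

P(X_ni = 1) = exp(B_n - D_i) / [1 + exp(B_n - D_i)],

where B_n is the ability of person n and D_i is the difficulty of item i. Fit statistics quantify how far the observed responses depart from this expectation.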

Similarly, reports of analyses using confirmatory factor analysis or structural equation modeling are scarce in the literature on metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge. Confirmatory factor analysis has been used only once (R. Ellis & Loewen, 2007), and that study was reported only as a response to a critique (Isemonger, 2007) of an exploratory study (R. Ellis, 2005). There is a theoretically motivated rationale for disaggregating the measured variables analyzed in R. Ellis (2005) and R. Ellis and Loewen (2007) and testing predictions drawn from L2 skill theory. Furthermore, path models and structural models could be used to test the effects of metalinguistic knowledge and language learning aptitude, accounting for possible relationships among latent factors. Structural models representing the theoretical relationships posited among implicated variables related to declarative and procedural knowledge and language learning aptitude have yet to be tested. Using these analytical techniques to test hypotheses can produce empirical evidence, which in turn serves as the basis for the development of theories of L2 knowledge representation, processing, and acquisition.

Purposes of the Study

There are three primary purposes of the current study: (a) to examine the effects of metalinguistic knowledge on L2 procedural knowledge; (b) to investigate the influence of language learning aptitude on L2 procedural knowledge; and (c) to apply the Rasch model and structural equation modeling in the investigation of the effects of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge.

The first purpose is to test empirically the role of metalinguistic knowledge in L2 learning. This type of declarative knowledge is commonly taught in L2 classrooms, but its contribution to the L2 learning process is unclear. Although explicit L2 instruction facilitates L2 learning (Norris & Ortega, 2000), the degree to which metalinguistic knowledge contributes to this process is uncertain. Alderson et al. (1997) found that metalinguistic knowledge was unrelated to L2 proficiency, which implies that metalinguistic knowledge serves little purpose in foreign language learning. However, metalinguistic knowledge has also been shown to increase with L2 proficiency (Elder et al., 1999). Empirical evidence is needed to create conceptual clarity regarding the role of metalinguistic knowledge in L2 learning. Evidence related to the impact of metalinguistic knowledge on L2 outcomes could inform pedagogical approaches to L2 instruction that incorporate metalinguistic explanations and explicit instruction of grammatical rules. For these reasons, this line of inquiry could contribute to L2 theory development and L2 pedagogy. Specifically, testing models related to declarative and procedural knowledge can contribute to skill theory explanations of L2 learning, and clarifying the role of metalinguistic knowledge can inform approaches to L2 teaching and curriculum design. Examining the influence of metalinguistic knowledge on L2 procedural knowledge can thus lead to informed approaches to L2 teaching and learning.

The second purpose of this study is to investigate the effects of language learning aptitude on L2 procedural knowledge. As reviewed above, various theoretical positions exist in the L2 research literature as to the role of language learning aptitude in L2 learning. Language learning aptitude is believed to be relatively ineffectual in L1 acquisition. However, preliminary evidence points to a prominent role for language learning aptitude in adult L2 acquisition (e.g., Abrahamsson & Hyltenstam, 2008; DeKeyser, 2000; DeKeyser, Alfi-Shabtay, & Ravid, 2010; Harley & Hart, 1997). More specifically, grammatical sensitivity and analytical ability have been associated with highly proficient adult language learners (e.g., DeKeyser, 2000). Harley and Hart found differential effects of aspects of language learning aptitude on the L2 learning outcomes of early and late immersion learners. Other studies have reported weak correlations between L2 proficiency and language learning aptitude (e.g., Hummel, 2009), and weak to moderate correlations between language aptitude and grammaticality judgments (e.g., Larson-Hall, 2008). Unlike the relatively invariant acquisition of an L1, adult L2 acquisition is highly variable in its outcomes, and different cognitive functions subserve the qualitatively different acquisitional pathways. Clarifying the role of language learning aptitude in L2 acquisition can contribute to the explanation of variable L2 learning outcomes. This is significant in that language learning aptitude is one of the most researched individual differences, yet few theories of L2 learning have incorporated it as an explanatory variable. Situating language learning aptitude in a model of declarative and procedural knowledge could shed new light on the role of this construct in L2 learning.

The third purpose of the present study is to investigate metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge using the analytic methods of Rasch modeling and structural equation modeling, which addresses the two analytical gaps. Applying these analytic methods is significant in that Rasch modeling provides evidence of the extent to which data fit the Rasch model, which takes the opposite approach from most statistical modeling. Assessing the degree to which data fit the expectations of the Rasch model provides evidence of construct validity, test-sample targeting, and unidimensionality. These measurement-relevant qualities are seldom examined in studies of metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge; however, assessing these aspects of testing instruments is a necessary step in measuring L2 knowledge and ability. In an oft-cited study, R. Ellis (2005) examined numerous tests of implicit and explicit knowledge; however, the data were not subjected to item-level analyses, and a true-score theory of measurement was implicit in the statistical analyses. It should be noted that such an approach is fairly standard in the field of L2 learning and teaching. Thus, the application of the Rasch model could provide unique insights into L2 learners’ performances on tests of declarative and procedural knowledge, and into the construct validity of those tests.

Similar to the Rasch approach to measurement (i.e., applying a priori expectations to the data), the use of structural equation modeling enables the testing of a priori hypotheses. This is important in that it forces the researcher to clearly specify the relationships in the data before subjecting the data to statistical analyses. Furthermore, the regression weight of each operationalization of the latent variables can be examined, which provides key information relating to the measurement of the constructs under investigation. In structural equation modeling, it is common for each latent variable to be operationalized in multiple ways, using various tests or tasks designed to tap a given construct. This provides a thorough assessment of each construct. Furthermore, error terms are incorporated in the models, which account for the unsystematic variance in the manifest variables. Thus, the use of Rasch modeling and structural equation modeling provides a rigorous statistical investigation of metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge.

Research Questions

Three research questions guided this study.

1. What is the relationship between metalinguistic knowledge and language learning aptitude? And to what extent do these variables converge onto a single factor?

2. To what extent do metalinguistic knowledge and language learning aptitude influence the development of L2 procedural knowledge?

3. What are the relative contributions of components of metalinguistic knowledge and language learning aptitude to the development of L2 procedural knowledge?

Regarding the first research question, it is hypothesized that metalinguistic knowledge and language learning aptitude are separate constructs. Figure 1 represents this hypothesis. In this model, language learning aptitude is posited to covary with metalinguistic knowledge. The rationale for this model comes from the theoretical view that language aptitude is developed or held in relation to metalinguistic knowledge, and that aptitude does not mediate or directly influence the acquisition of metalinguistic knowledge. The three manifestations of metalinguistic knowledge are hypothesized to load only on the latent metalinguistic knowledge variable, and the four observed language aptitude variables only on the latent aptitude factor. The paths from the latent variables to the manifest variables indicate that the observed variables serve as indicators of their latent variables, and the lack of paths to the other latent variable indicates hypothesized zero-loadings. These loadings will be checked using confirmatory factor analysis, which also serves as a confirmation of the measurement models for metalinguistic knowledge and language learning aptitude.

Figure 1. Two-factor model of metalinguistic knowledge and language learning aptitude. Metaling = metalinguistic knowledge; LangAptitude = language learning aptitude; Rule = rule explanation; Tech = technical terminology; RcpMtLng = receptive metalinguistic knowledge; NumLn = number learning; SndSym = sound-symbol association; Vocab = vocabulary learning; LngAn = language analytical ability.
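
In equation form, the measurement model in Figure 1 can be sketched as follows, where each λ is a freely estimated loading, each δ is a residual, and the hypothesized zero-loadings are simply omitted:

Rule     = λ1(Metaling) + δ1
Tech     = λ2(Metaling) + δ2
RcpMtLng = λ3(Metaling) + δ3
NumLn    = λ4(LangAptitude) + δ4
SndSym   = λ5(LangAptitude) + δ5
Vocab    = λ6(LangAptitude) + δ6
LngAn    = λ7(LangAptitude) + δ7

The covariance between Metaling and LangAptitude is freely estimated, reflecting the hypothesis that the two constructs are related but distinct.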

With regard to the second research question, Figure 2 represents the hypothesis that metalinguistic knowledge and language learning aptitude affect the development of L2 procedural knowledge. Based on the theorized impact of language aptitude on L2 learning and the facilitative effects of metalinguistic knowledge on the proceduralization of L2 knowledge posited in skill-based approaches to L2 acquisition (e.g., DeKeyser, 1997, 1998), these two latent variables account for portions of the observed variance in L2 procedural knowledge. The paths that point from language learning aptitude and metalinguistic knowledge to L2 procedural knowledge represent influential effects. The lack of a path between language aptitude and metalinguistic knowledge represents the assumption that these two constructs are independent factors in the model. The two constructs are hypothesized to affect the development of L2 procedural knowledge because language aptitude is considered to exert significant influence on L2 development, and explicit instruction and learning have been shown to be beneficial in adult L2 learning (e.g., Norris & Ortega, 2000). This model is extended to test the effects of metalinguistic knowledge and language learning aptitude on L2 developmental measures. It is also hypothesized that these latent variables predict L2 complexity, accuracy, and fluency. To test these effects, these three developmental measures are inserted in the position of L2 procedural knowledge in the model shown in Figure 2.
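
In equation form, the structural portion of the model in Figure 2 can be written as

L2PrcdrlKnw = γ1(Metaling) + γ2(LangAptitude) + ζ,

where γ1 and γ2 are the structural coefficients of interest and ζ is the disturbance term; the same equation applies when complexity, accuracy, or fluency is substituted for L2 procedural knowledge.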

Figure 2. Structural model of metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge. L2PrcdrlKnw = L2 procedural knowledge; Metaling = metalinguistic knowledge; LangAptitude = language learning aptitude; Rule = rule explanation; Tech = technical terminology; RcpMtLng = receptive metalinguistic knowledge; NumLn = number learning; SndSym = sound-symbol association; Vocab = vocabulary learning; LngAn = language analytical ability.

Regarding the third research question, Figure 3 represents the hypothesis that components of language learning aptitude independently influence the development of L2 procedural knowledge. It is predicted that language analytical ability exerts a stronger influence than number learning, sound-symbol association, or vocabulary learning on the procedural language skills of L2 learners who have primarily experienced classroom-based learning (Abrahamsson & Hyltenstam, 2008; DeKeyser, 2000; DeKeyser, Alfi-Shabtay, & Ravid, 2010; Harley & Hart, 1997).

In this path model, receptive and productive metalinguistic knowledge are hypothesized to predict L2 procedural knowledge. Of receptive metalinguistic knowledge and productive rule explanation, productive knowledge is hypothesized to exert the larger effect on L2 procedural knowledge. The rationale for this hypothesis is that declarative knowledge that is available for production might have a stronger association with L2 procedural knowledge. Language analytical ability and vocabulary learning are hypothesized to influence the development of receptive and productive metalinguistic knowledge and L2 procedural knowledge due to the memory- and analytical-related aspects of those aptitude subtests. In the model, language analytical ability and sound-symbol association ability influence the development of L2 procedural knowledge, but language analytical ability is hypothesized to exhibit the most influence on L2 procedural knowledge. No influential relationships are hypothesized among the metalinguistic knowledge variables, number learning, and sound-symbol association. Receptive and productive metalinguistic knowledge were tested in Japanese, but number learning and sound-symbol association were tested in a language unknown to the participants. Thus, higher levels of metalinguistic knowledge are not thought to influence performance on the language aptitude subtests. However, language analytical ability can predict receptive metalinguistic knowledge due to the test method and the cognitive abilities associated with language analysis and the development of receptive metalinguistic knowledge. On the receptive metalinguistic knowledge test, for example, participants analyze English sentences and select appropriate Japanese metalinguistic terms from a list of options. Participants’ analytical ability might facilitate performance on this test.

Figure 3. Path model of metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge. L2PrcdrlKnw = L2 procedural knowledge; Rule = rule explanation; RcpMtLng = receptive metalinguistic knowledge; Vocab = vocabulary learning; LngAn = language analytical ability; NumLn = number learning; SndSym = sound-symbol association.


CHAPTER 3

METHODS

Participants

All of the participants were recruited from two Japanese universities. Two-hundred-forty-nine participants were assessed for eligibility. As the focus of this study was on the effects of metalinguistic knowledge and language learning aptitude, only L2 English learners who had studied English in Japan were eligible. Therefore, nine of the participants were excluded because they had studied abroad for more than 12 months. Of the remaining 240 participants, there were 94 men (39.2%) and 123 women (51.2%). The other 23 participants (9.6%) did not specify their sex on the background questionnaire or did not complete all of the tests; that is, they took, for example, only the language learning aptitude test and were absent on the other test days—this applies to all of the following background data. There were 26 freshmen (10.8%), 47 sophomores (19.6%), 43 juniors (17.9%), and 35 seniors (14.6%); the other 89 participants (37.1%) were unspecified. The majority of the participants ranged in age from 18 to 23 (98%), with two 24-year-olds (0.8%), one 27-year-old (0.4%), and one 33-year-old participant (0.4%). The average age was 20.17 years (SD = 1.62). The participants grew up, lived, and studied in Japan; the average length of study abroad was 0.67 months. One-hundred-sixty-nine participants (70.4%) had never studied abroad, and only 10 participants (4.2%) had studied abroad for six to eleven months.


Due to the teaching approaches commonly implemented in Japanese junior and senior high schools, it can be assumed that the participants had experienced approximately six years of form-focused instruction. Most English classes in Japanese junior and senior high schools involve teachers using Japanese to explain English grammar. Previous studies have described English instruction at these educational levels as focused on reading skills and grammar-translation (e.g., Gorsuch, 2000).

The participants were asked to report their scores on any standardized proficiency tests they had taken. Forty participants reported TOEFL ITP scores (M = 465, SD = 42). Six participants had taken the TOEFL iBT (M = 73, SD = 12). Fifty participants reported TOEIC scores (M = 659, SD = 119). On average, the sample was intermediate to high-intermediate; however, it is probable that the lower-level learners had not taken a proficiency test, which would result in only the higher proficiency learners reporting standardized test scores.

The participants were attending one of two universities: a large private university or a large national university in the Kanto region of Japan. Hensachi rankings for the 2011 and 2012 university entrance exam results for the relevant faculties were averaged as indicators of the levels of the universities. University A, a large, private university in Tokyo, had an average hensachi of 56.26. University B, a national university in the Kanto region, had an average hensachi of 54.70. These rankings are only general estimates of the selectivity of the universities for a given time period. Some universities might try to artificially inflate their rankings by excluding students who were admitted through alternative methods such as recommendation agreements with high schools and interviews. Therefore, there is some inherent unreliability in these university rankings.

One-hundred-fifty-three participants (63.7%) were attending University A, and 87 participants (36.3%) were enrolled in University B. One-hundred-forty-eight of the participants from University A were studying in the Department of English. The other five participants belonged to the Faculty of Intercultural Communication. All of the participants from University B were enrolled in the Faculty of Education.

Before collecting the data, a sample size of 150 to 200 participants was judged to be sufficiently large. This size was determined by considering the complexity of the statistical models to be tested and by consulting the literature regarding appropriate sample sizes for the specific statistical analyses employed in the current study. For structural equation modeling (SEM) approaches, Kline (2005) described sample sizes of less than 100 participants as “small,” N-sizes of 100 to 200 participants as “medium,” and sample sizes over 200 as “large” (p. 15). Breckler (1990, as cited in Kline) reported a median sample size of 198 based on a meta-analysis of 72 published SEM studies.

Participation in this study was voluntary, and informed consent was obtained from each participant. The data used in this study were collected during the participants’ university classes. Informed consent was obtained through oral instructions given before data collection. The following was read to the participants: “This is a research project about second language learning. Participation is voluntary and unrelated to your course grades. You do not have to participate if you do not want to. Also, you can withdraw from the research project at any time. Your answers will be kept anonymous.” Participants wrote “okay,” “yes,” or “you can use these data” at the top of a test to express confirmation of their consent.

Instrumentation

Receptive Metalinguistic Knowledge Test

In previous studies, metalinguistic knowledge has been operationalized through the use of tests given in the participants’ L1 and L2. Some oft-cited studies (e.g., R. Ellis, 2005; Roehr, 2008b) have tested metalinguistic knowledge only in the participants’ L2; this type of operationalization casts doubt on the interpretations drawn from those studies regarding the unidimensionality of proficiency and metalinguistic knowledge/ability. To tease apart the latent structure of these complex and related constructs, a test of metalinguistic knowledge given in the participants’ L1 is needed. To this end, a 22-item Japanese test of metalinguistic knowledge was created to test Japanese university students’ levels of L1 representations of metalinguistic knowledge related to English (Appendix A; see Appendix B for an English translation).

The development of the test progressed through three stages: (a) mapping the construct, (b) designing the items, and (c) applying a measurement model. First, in mapping a construct, a test designer must consider the indicators of the construct. As Wilson (2005) explained, the initial step “is for the measurer to focus on the essential feature of what is to be measured—in what way does an individual show more of it and less of it” (p. 38). For this test, participants’ knowledge runs along a continuum from a higher number of known English rules and Japanese metalinguistic terms to a lower number of known rules and terms. Thus, items that span a range of easy to difficult were needed to assess learners’ levels of declarative metalinguistic knowledge.

The next step was to identify Japanese metalinguistic terms that varied in difficulty. Some studies (e.g., Berry, 2009; Elder et al., 1999) have used metalinguistic terms such as noun and adjective. While these terms would certainly fall into the easy band of the difficulty continuum, I deemed terms such as these to be too easy for Japanese learners of English. If the participants were to encounter these terms in Japanese, little to no variation would be found in the test scores. Thus, valuable time would be spent on items that would not provide measurement-relevant information about the participants’ metalinguistic knowledge. To avoid such a scenario, I consulted English grammar books written in Japanese to identify Japanese metalinguistic terminology that went beyond common terms such as noun, verb, and adjective. Over 500 candidate terms were identified from the indexes of topics and terms included in the grammar books. To assess the difficulty of the various terms, two experienced Japanese teachers of English were recruited and asked to judge each term for its relative difficulty. I asked the teachers to judge the perceived difficulty of 513 Japanese metalinguistic terms using a 3-point scale (1 = Easy; 2 = Moderately difficult; 3 = Difficult). The teachers were asked to base their ratings on whether high school students would be familiar with the Japanese terms. As the teachers had attended and taught in public schools, they possessed a sense of the familiarity of the various terms.


Based on these ratings, technical terms that varied in perceived difficulty were chosen as candidates for inclusion on the test. To avoid creating a test consisting of items that were too easy, I started by examining the terms that were rated difficult by both raters and some that were rated moderately difficult. The terms were evaluated for their adaptability to a receptive metalinguistic knowledge test. It was necessary to identify terms that were objectively definable and not self-evident; that is, if the term itself consisted of a phrase or included obvious hints as to its meaning, it would be difficult to paraphrase or to create distractors that were not noticeably incorrect.

In creating the items, a multiple-choice format was chosen because this test was designed to tap receptive metalinguistic knowledge. Participants should not be required to produce any metalinguistic terminology; I was interested only in receptive identification of L1 metalinguistic terms. Initially, 18 multiple-choice test items were created and piloted with 49 participants. The Rasch person reliability (separation) for the 18-item test was .49 (.98), and the item reliability (separation) was .92 (3.41). The results of the pilot test revealed that the initial 18 items were functioning appropriately; therefore, an additional four items were created, resulting in a 22-item test. Two of the items asked test-takers to choose the correct Japanese metalinguistic term for a linguistic rule given in Japanese. Nine of the items required participants to identify the correct metalinguistic term that described or explained the structures or words included in English sentences. An English translation of an item targeting identification of interrogative adverbs is as follows:

1. When does the show begin?
   What grammatical terminology can be used to explain this when?
   a. Interrogative adverb
   b. Interrogative pronoun
   c. Interrogative adjective
   d. Relative adverb

Eleven of the items required participants to choose the English word, phrase, or sentence that represented an example of the metalinguistic term provided in the item stem. An English translation of this type of item is as follows:

2. Which of the following sentences contains a nominal clause?
   a. If we hadn’t spent all of our money, we could go to the concert.
   b. She will finish her homework if she studies all day.
   c. Cooking dinner together has been important to us.
   d. I don’t know if the train has arrived.

The Rasch person reliability and separation coefficients for the 22-item test were .67 and 1.43, respectively. Cronbach’s alpha was .67. These reliability estimates were slightly lower than the common criterion of .70 or above for dichotomous test reliability coefficients. One reason for the slightly low reliability could be the number of test items—the test consisted of only 22 dichotomous items. Increasing the number of test items could result in increased reliability estimates for future test administrations.
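
The expected benefit of lengthening the test can be projected with the Spearman-Brown prophecy formula, where ρ is the current reliability and k is the factor by which the test is lengthened:

ρ' = kρ / [1 + (k - 1)ρ]

For example, doubling the test to 44 comparable items (k = 2) with ρ = .67 projects a reliability of 2(.67) / (1 + .67) ≈ .80.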

Productive Metalinguistic Knowledge Test

The productive metalinguistic knowledge test is a test of metalinguistic technical terms and grammatical rule knowledge (Appendix C). It requires participants to identify ungrammatical portions of sentences and to explain why the identified parts of the sentences are ungrammatical. The instructions of the test ask participants to perform three tasks: (a) identify the mistake in each sentence by underlining the ungrammatical portion of the sentence; (b) rewrite each sentence in English so that the mistake is corrected; and (c) use Japanese technical terminology (i.e., metalinguistic terminology that describes characteristics of language) to explain why the original sentence is ungrammatical. Requiring metalinguistic descriptions reveals participants’ productive knowledge of metalinguistic terminology and declarative rule awareness of English grammatical features.

This test consists of 17 items targeting various English rules. The target structures were modeled on those from R. Ellis (2005). R. Ellis chose to focus on grammatical structures that varied in difficulty, timing of instruction in schools, and linguistic features (i.e., morphological or syntactical rule violations). For example, declarative rules are relatively easy for forms such as possessive –s and regular past tense, although these forms can be difficult in production. Other constructions, however, are difficult from a declarative knowledge perspective; the dative alternation and unreal conditionals are relatively harder to explain. The timing of instruction is also different. Plural –s, possessive –s, and the past tense are usually taught within the first or second year of junior high school in Japan. Embedded questions, ergative verbs, and unreal conditionals are introduced relatively later. The target structures also varied by linguistic category: they represented morphological or syntactical rule violations, thus covering a variety of grammatical constructions. Sample structures, test sentences, and their characteristics are shown in Table 1.


This test produces two scores: a score for technical metalinguistic language and a score for rule explanation. Higher scores represent higher levels of metalinguistic technical language and declarative knowledge of linguistic rules. Responses to each sentence are scored on a 3-point rating scale for use of technical language (0 = The participant does not use technical terms; 1 = The participant uses one technical term; 2 = The participant uses two or more technical terms) and on a 5-point scale for rule correctness (0 = The participant does not demonstrate awareness of the target rule; 1 = The participant states an incorrect rule; 2 = The participant states a partially correct rule; 3 = The participant states a correct rule, but the explanation lacks some detail or is not exhaustively complete or fully elaborated; 4 = The participant states a completely correct rule; the stated rule is correct and exhaustively complete). The scoring rubrics are included in Appendix E, and an English translation is provided in Appendix F.

Table 1. Example Target Structures of the Productive Metalinguistic Knowledge Test

Structure            Stimuli                                          Rule violation
Possessive -s        Hiroyuki bicycle was expensive.                  M
Regular past tense   Hiroshi play baseball at school yesterday.       M
Modals               Shohei wants buying a new car.                   M
Dative               Junko described Ayumi her vacation.              S
Tag questions        You have enough money, won’t you?                S
For / Since          She has been learning French since nine years.   S
Unreal conditionals  If he had been faster, he will win the race.     S
Yes/No questions     Did Shoko ate lunch?                             M

Note. M = morphological; S = syntactical.

Two Japanese-English bilingual professors rated the productive metalinguistic knowledge tests. The raters taught undergraduate- and graduate-level applied linguistics and English courses. The first rater scored all 215 tests, and the second rater scored a set of five randomly selected items (29%) on a randomly selected sample of 34 tests (16%).

For technical language, the inter-rater agreement coefficient was high at .96 (163/170). In the seven scoring disagreements, the attributed scores differed by one point on the rating scale. These results indicate that the scores given by the first rater were reliable.

For rule explanation, the inter-rater agreement coefficient was high at .92 (157/170). Of the 13 disagreements, eight differed by one point on the rating scale. These results indicate that the scores given by the first rater were reliable.
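
For reference, exact inter-rater agreement of this kind is the proportion of double-scored responses that received identical scores. A minimal sketch in Python (the function and variable names are illustrative, not part of the study materials):

    def exact_agreement(rater1, rater2):
        """Proportion of responses that two raters scored identically."""
        if len(rater1) != len(rater2):
            raise ValueError("Both raters must score the same responses")
        matches = sum(a == b for a, b in zip(rater1, rater2))
        return matches / len(rater1)

    # 163 identical scores out of 170 double-scored responses yields .96,
    # the technical-language agreement reported above.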

As the inter-rater agreement coefficients were high for the two productive metalinguistic knowledge scales, the scores from rater 1 were considered to be accurate estimates of the participants’ productive metalinguistic knowledge. These raw scores were then subjected to a Rasch analysis to produce interval-level measures. Cronbach’s alpha for the technical terminology scale was sufficient at .82, and the Rasch reliability and separation estimates for the person measures were .78 and 1.91, respectively. Cronbach’s alpha for the rule explanation scale was .86, and the Rasch reliability and separation estimates for the person measures were .81 and 2.05, respectively.
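
Cronbach’s alpha, reported throughout this section, can be computed directly from a persons-by-items matrix of raw scores. A minimal sketch in Python, assuming one row per participant and one column per item (illustrative code, not the analysis scripts used in this study):

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for a persons-by-items matrix of raw scores."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                          # number of items
        item_var = scores.var(axis=0, ddof=1).sum()  # sum of item variances
        total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
        return (k / (k - 1)) * (1 - item_var / total_var)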

Language Learning Aptitude Test

The Lunic Language Marathon (LLM) was used as the measure of language learning aptitude. It was developed by Sick and Irie (1998) to serve as a test of language learning aptitude for researchers working with Japanese-speaking participants. In this aptitude test, test-takers learn an artificial language, Lunic, which is thought to be new to the participants in that it is based on neither Japanese nor English word order; the artificial language was constructed with consideration of linguistic universals. Using an artificial language provides a number of benefits. Most notably, testing language aptitude in the target language can confound proficiency with aptitude, and testing aptitude in a language that is grammatically or orthographically similar to the participants’ L1 might bias the results in favor of specific groups of learners.

The test is administered using an audio file that presents the instructions in Japanese and imposes the time constraints for each section. The four subtests that comprise the LLM are as follows:

1. Lunic Numbers: Test-takers have 4 minutes to listen to and repeat numbers in the Lunic language. All instruction and practice is conducted aurally/orally; note taking is not permitted. Test-takers then complete number dictation questions. This subtest targets phonological memory. As the Lunic language has some phonological rules, test-takers who can notice, encode, and apply phonological patterns are theorized to be at an advantage.

2. Lunic Writing: This subtest targets sound-symbol associations, similar to the corresponding subtests of the MLAT and LABJ. However, test-takers must make sound-symbol associations between sounds in a previously unknown language and a new orthographic system. After an aural practice session with no note taking, test-takers select the character from four choices that corresponds to the sound that they heard.

3. Lunic Vocabulary: This subtest targets rote memory learning of 20 Lunic vocabulary items and their Japanese translations. Test-takers are given 4 minutes to memorize the target words. Twenty multiple-choice questions are then presented. Test-takers cannot refer to the vocabulary items or the translations when answering these questions.

4. Language Analytical Ability/Inductive Language Learning: The final section of the test targets learners’ ability to infer and use grammatical rules in a test of inductive language learning. Samples of the Lunic language are provided using vocabulary items encountered in the previous subtests. Grammatical rules are given and need to be applied to select the correct translations of target sentences. Some items require test-takers to build upon rules presented in previous items. This subtest comprises 15 items.

Previous studies that have used the LLM as a test of language learning aptitude include Sick (2007), Sick and Irie (2001), and Takada (2003).

The reliability of the 100-item test as measured by Cronbach’s alpha was .92. The Rasch person measure reliability (separation) was .91 (3.24). Cronbach’s alpha for the number learning scale was .89, and the Rasch person measure reliability (separation) was .87 (2.59). Cronbach’s alpha for the sound-symbol association scale was .83, and the Rasch person measure reliability (separation) was .65 (1.36). The relatively lower Rasch reliability estimate could have been caused by a ceiling effect, as many participants had high scores on this scale. Cronbach’s alpha for the vocabulary learning scale was .84, and the Rasch person measure reliability (separation) was .79 (1.97). Finally, Cronbach’s alpha for the language analytical ability scale was .79, and the Rasch person measure reliability (separation) was .68 (1.45).

Procedural Knowledge Test

A timed writing task was used to measure L2 procedural knowledge (Appendix D). This test was similar to the TOEFL Independent Writing Task. Participants were given a topic that required them to choose between two options and to support their choice using supporting details and explanations. The writing prompt was as follows: “Do you agree or disagree with the following statement? Children should begin learning a foreign language as soon as they start elementary school. Use specific reasons and examples to support your position.” A 25-minute time limit was imposed for the writing.

The rationale for implementing a time limit was that participants would be more likely to rely on L2 procedural knowledge when performing under timed conditions. Previous studies have found evidence that scores from timed performances are relatively more representative of implicit linguistic competence than scores obtained under untimed conditions (e.g., Bialystok, 1979, 1982; R. Ellis, 2005).

The essays were first scored by two raters using a holistic rating rubric. The first rater was a native speaker of English with advanced degrees who had been teaching English as a second language for over 20 years. The second rater was a native speaker of English with advanced degrees who had been teaching English as a second language for over 12 years. The TOEFL writing rating rubric was adopted for this study (see Educational Testing Service, 2008, for a complete description of the original rating rubric). This rubric has scores ranging from 0 (an essay that is not related to the writing prompt or written in a foreign language) to 5 (an effective, well organized and developed essay). However, the scale did not provide enough rating points to differentiate a relatively homogeneous group of L1 Japanese learners in terms of their English ability. To overcome this limitation, the rating scale was expanded into a 12-point scale, and an iterative rating procedure was adopted.

The raters rated the essays at least two times: the first time was to categorize the essay into one of the original TOEFL rating scale categories, and the second was to assign a rating within each TOEFL rating band. Thus, instead of using only the ratings of 1, 2, 3, 4, and 5 (a rating of zero was not used because a writing performance at that level would have been removed from the data set), additional categories were added above and/or below categories 2, 3, 4, and 5. This resulted in the following rating scale: 1, 2-, 2, 2+, 3-, 3, 3+, 4-, 4, 4+, 5-, 5. These ratings were then converted into a numerical scale (i.e., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, & 12). The iterative rating process resulted in each essay being read multiple times to place it in an initial category, and then to assign it a score based on the quality of the essay within a certain rating band. The first rater scored all 214 essays, and the second rater scored a random sample of 52 essays. The two sets of ratings were strongly correlated (r = .94, p < .001).
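
To make the conversion concrete, it can be expressed as a simple lookup table. The following Python sketch is illustrative only; the conversions in this study were performed by hand, and the names below are hypothetical.

# Hypothetical mapping from the iterative TOEFL-band ratings to the
# 12-point numerical scale described above.
RATING_TO_SCORE = {
    "1": 1, "2-": 2, "2": 3, "2+": 4,
    "3-": 5, "3": 6, "3+": 7,
    "4-": 8, "4": 9, "4+": 10,
    "5-": 11, "5": 12,
}

def to_numeric(rating):
    # Look up the numerical score for a band rating such as "3+".
    return RATING_TO_SCORE[rating]

print(to_numeric("3+"))  # prints 7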

In addition to the holistic ratings, the essays were also scored for characteristics of complexity, accuracy, and fluency (CAF), and these scores were subjected to a principal component analysis (PCA) to produce CAF component scores. As described in Chapter 2, these three aspects of L2 language production are considered to represent distinct aspects of L2 performance (Norris & Ortega, 2009; Skehan, 1998), and they have been used as dependent variables in studies of L2 acquisition (Pallotti, 2009).

The essays were scored for characteristics of CAF by two raters. The first rater scored all 214 essays, and the second rater scored a random sample of 30 essays. The raters scored the essays for the frequencies of words, T-units, dependent clauses, clauses, error-free T-units, error-free dependent clauses, and error-free clauses. These characteristics were chosen because they represent unique aspects of language production and can be used independently as measures of CAF or in the computation of ratio measures. The raters discussed the definitions of the CAF characteristics, which are provided in this section, so that they had a shared understanding of what constituted valid production of each characteristic. For training and rater alignment, the raters scored five essays together; these essays were not used in the final data set.

For the word counts, every word was counted. Contractions were counted as one word (e.g., don’t), and no problematic issues were encountered. In one essay the sentence I learned A, B, C, D was used. This sentence was counted as three words because if each letter, which represented a hypothetical example, had been counted as a word, the fluency score would have been artificially inflated. The word everyday was counted as two words (i.e., every day) if it was not used to modify another word (e.g., I studied everyday [sic]). Similarly, high school written as one word was counted as two words. It was not common to find words written in Japanese, but in the few cases in which participants wrote words in Japanese, those words were not included in the word counts.
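
These counting rules can be approximated programmatically. The Python sketch below assumes simple whitespace tokenization; the context-dependent adjustments (e.g., splitting everyday only when it is not a modifier) were made by hand in this study and are not automated here.

import re

JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def count_words(text):
    # Whitespace tokens are counted, so contractions (e.g., don't)
    # count as one word; tokens containing Japanese characters are
    # excluded from the count.
    return sum(1 for token in text.split() if not JAPANESE.search(token))

print(count_words("I don't study every day"))  # prints 5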


A T-unit was defined as an independent clause and all attached dependent clauses. Adverbial, adjective/subordinate, and nominal clauses were counted as dependent clauses. Adverbial clauses that were reduced to participle phrases were also counted as dependent clauses (e.g., After entering junior high school,…). Nominal clauses reduced to infinitives were not counted as dependent clauses. Clause frequency was calculated as the total number of independent and dependent clauses. Any T-unit or clause with a grammatical error in it was not counted as an error-free T-unit/clause. These criteria resulted in very strict scoring of accuracy. Some participants wrote T-units that contained multiple dependent clauses and were fairly complex; within a three-clause T-unit, for example, a single article error resulted in the whole T-unit being scored as inaccurate. In addition to grammaticality, semantic appropriateness was also considered in the accuracy judgments. Clauses that were grammatically accurate but were not comprehensible in context were scored as inaccurate. The raters discussed how T-units and dependent clauses would be classified and scored so that they would be working from the same criteria. Only a few discrepancies were observed in the raters’ accuracy scores, and those occurred primarily in the participants’ use of articles in phrases such as should study foreign language compared to should study a foreign language. The first rater accepted only the latter phrase because that is how it was phrased in the writing prompt, while the second rater accepted both phrases.

The inter-rater reliability coefficients were high for all of the ratings. The Pearson correlations for the scores from the two raters were as follows: (a) words, r = 1.00, p < .001; (b) T-units, r = 1.00, p < .001; (c) dependent clauses, r = 1.00, p < .001; (d) clauses, r = 1.00, p < .001; (e) error-free T-units, r = .99, p < .001; (f) error-free dependent clauses, r = 1.00, p < .001; and (g) error-free clauses, r = .99, p < .001. Considering the high correlation coefficients, which indicate a high degree of agreement and similarity in the ratings, the scores from rater 1 were used as the measures of CAF.

The CAF scores were used as frequencies or to compute ratio measures. Initially, 16 CAF measures were computed: (a) T-units, (b) dependent clauses, (c) clauses, (d) error-free T-units, (e) error-free dependent clauses, (f) error-free clauses, (g) words per minute, (h) words per T-unit, (i) words per dependent clause, (j) words per clause, (k) words per error-free T-unit, (l) words per error-free dependent clause, (m) clauses per T-unit, (n) dependent clauses per clause, (o) error-free T-units per T-unit, and (p) error-free clauses per clause. As the essay writing was timed and all participants had only 25 minutes to write, words per minute was calculated by dividing the total number of words by 25.
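
To illustrate how the ratio measures were derived from the frequency counts, the following Python sketch computes a representative subset of the 16 measures; the function name and the example values are hypothetical.

def caf_measures(words, t_units, dep_clauses, clauses,
                 ef_t_units, ef_clauses, minutes=25):
    # All participants wrote for exactly 25 minutes, so words per
    # minute is simply words / 25.
    return {
        "words_per_minute": words / minutes,             # fluency
        "words_per_t_unit": words / t_units,             # complexity
        "words_per_clause": words / clauses,             # complexity
        "clauses_per_t_unit": clauses / t_units,         # complexity
        "dep_clauses_per_clause": dep_clauses / clauses, # complexity
        "ef_t_units_per_t_unit": ef_t_units / t_units,   # accuracy
        "ef_clauses_per_clause": ef_clauses / clauses,   # accuracy
    }

# Hypothetical essay: 300 words, 20 T-units, 12 dependent clauses,
# 32 clauses, 15 error-free T-units, and 24 error-free clauses.
print(caf_measures(300, 20, 12, 32, 15, 24))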

Controversy exists as to which aspects of L2 performance some of these measures capture. Norris and Ortega (2009) recommended the use of mean length of T-unit, clauses per T-unit (i.e., T-unit complexity ratio), and mean length of clause, which were included in the complexity measures. Error-free T-units has been suggested to be a productive measure of L2 accuracy (Polio, 1997). Under timed conditions, the number of words produced serves as a measure of fluency (Wolfe-Quintero, Inagaki, & Kim, 1998), and this measure has been shown to capture fluency in L2 written production (Norris & Ortega). However, Wolfe-Quintero, Inagaki, and Kim claimed that some measures (e.g., T-unit length) were indices of fluency, while Norris and Ortega argued that they measured complexity. Norris and Ortega introduced unpublished sources showing a factor analysis of a limited number of CAF measures, pointing to the need for more evidence of the validity of CAF measures.

In the present study, to shed empirical light on this matter, the CAF measures were subjected to a PCA (a) to identify the developmental aspect of each measure and (b) to compute CAF component scores for use in subsequent analyses. The variables that loaded on the complexity component were (a) words per T-unit, (b) words per dependent clause, (c) clauses per T-unit, and (d) dependent clauses per clause. The variables that loaded on the fluency component were (a) T-units, (b) clauses, and (c) words per minute. The variables that loaded on the accuracy component were (a) words per error-free T-unit, (b) error-free T-units per T-unit, and (c) error-free clauses per clause.

Based on the final component structure obtained in the fifth PCA (see Chapter 4: Preliminary Analyses), component scores were computed using the regression method. This resulted in three component scores (i.e., complexity, accuracy, & fluency) for each participant who completed the procedural knowledge test. These component scores were used as measures of CAF in subsequent analyses.
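
A minimal sketch of this step in Python using scikit-learn, with randomly generated data standing in for the actual CAF measures; for unrotated components, regression-method scores correspond (up to a per-component rescaling) to the projections computed below.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: rows are participants, columns are the CAF
# measures retained in the final PCA (values are random placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(214, 10))

Z = StandardScaler().fit_transform(X)    # standardize the measures
pca = PCA(n_components=3)                # complexity, accuracy, fluency
component_scores = pca.fit_transform(Z)  # one score per component per person

print(component_scores.shape)            # (214, 3)
print(pca.explained_variance_ratio_)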

Procedures

The tests were administered to intact university classes over portions of two or three 90-minute class periods. First, the language learning aptitude test was administered during the first 50 minutes of a given class period. Second, the L2 procedural knowledge task was carried out a week later. A period of approximately 35 minutes was needed to complete the writing task; this estimate includes five minutes to explain the task, 25 minutes for writing, and five minutes to collect the completed writing samples. Third, the productive metalinguistic knowledge test, the receptive metalinguistic knowledge test, and the background questionnaire (Appendix G; see Appendix H for an English translation) were administered one week after the writing task. The productive metalinguistic knowledge test was administered before the receptive metalinguistic knowledge test because the receptive metalinguistic knowledge test contained many technical terms that could have facilitated performance on the productive metalinguistic knowledge test. The metalinguistic knowledge tests and the background questionnaire were not timed, and it took approximately 60 minutes to explain, complete, and collect the tests.

Analysis

Two primary analytical techniques were used in the present study: (a) Rasch modeling and (b) structural equation modeling. First, each technique is briefly described. Second, the analytical procedures and evaluative criteria are explained.

Rasch Analysis

The Rasch model (Rasch, 1960) is a measurement model that analyzes the probability of success or failure of a given person on a given test item. That is, a person’s success or failure on each test item provides information as to the location of that individual on the latent variable. Scores of pass/fail on a set of items are the necessary and sufficient conditions for transforming the scores into a log-odds scale. This results in a stochastic measurement model that produces interval-level measures from raw scores.

The formula for the Rasch model is as follows:

Pni(x = 1) = f(Bn - Di)

This equation can be interpreted as “the probability (Pni) of Person n getting a score (x) of 1 on a given item (i) is a function (f) of the difference between a person’s ability (Bn) and an item’s difficulty (Di)” (Bond & Fox, 2007, p. 278).
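
In the dichotomous Rasch model, f is the logistic function, so the probability of success can be computed directly from the ability and difficulty estimates; a minimal sketch in Python:

import math

def rasch_probability(ability, difficulty):
    # P(x = 1) = exp(B - D) / (1 + exp(B - D))
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A person located exactly at an item's difficulty has a 50% chance of
# success; a person one logit above it has roughly a 73% chance.
print(rasch_probability(0.0, 0.0))  # 0.5
print(rasch_probability(1.0, 0.0))  # ~0.731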

The Rasch model provides quality control statistics for judging the functioning of test items and the influence of responses on item calibration. These take the form of fit statistics, of which the model provides two types: infit and outfit. These statistics provide information related to the construct validity of the test items. That is, items that fit the expectations of the Rasch model tend to be interpreted as tapping the target construct.

Infit mean square statistics represent the difference between the predicted and observed values in the item responses. This statistic is weighted to give more value to responses from persons who are estimated to have an ability that is commensurate with the difficulty of a given item (i.e., Wni). Outfit statistics provide the same information as infit statistics; however, they are computed based on responses from the whole sample and are not weighted based on a restricted range of responses (Bond & Fox, 2007). Fit statistics are calculated as follows (Bond & Fox, pp. 285-286):

infit = Σn (Wni × zni²) / Σn Wni

outfit = Σn zni² / N

where zni = (xni - Eni) / √Wni is the standardized residual of the observed response xni from its expected value Eni, Wni is the model variance of the response of person n to item i, and N is the number of persons.
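
A minimal sketch of these computations in Python for dichotomous data, assuming that person ability and item difficulty estimates are already available (in this study, Winsteps performed the estimation):

import numpy as np

def rasch_fit(responses, abilities, difficulties):
    # responses: persons x items matrix of 0/1 scores.
    B = np.asarray(abilities)[:, None]
    D = np.asarray(difficulties)[None, :]
    P = 1.0 / (1.0 + np.exp(-(B - D)))   # expected response E_ni
    W = P * (1.0 - P)                    # model variance W_ni
    sq_resid = (responses - P) ** 2      # squared residuals
    outfit = (sq_resid / W).mean(axis=0)          # unweighted mean square
    infit = sq_resid.sum(axis=0) / W.sum(axis=0)  # information-weighted
    return infit, outfit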


The model also reports standardized infit and outfit statistics. These values are based on a distribution that has a mean of 0 and a standard deviation of 1. The statistical significance of fit statistics can be computed at p < .05. However, because of the nature of the distribution and the effects of sample size, the meaningfulness of the misfit should be considered before judging its statistical significance. Additionally, the model produces standard error estimates at the item and person level, and point-measure biserial correlation coefficients.

After inspecting the item functioning of a given instrument, Rasch person ability measures can be obtained. These interval-level person ability measures produced by the model can be used in statistical techniques to investigate research questions and test hypotheses. Rasch person and item reliability and separation estimates are also computed in the analysis. Rasch reliability indicates the reproducibility of the measures, defined as the variance explained by the Rasch model divided by the total variance in the performances. The reliability formula is

reliability = SA² / SD²

where SA is the standard deviation of the measures adjusted for measurement error and SD is the observed standard deviation of the measures. Rasch separation is calculated by placing the adjusted standard deviation of the measures (SA) in the numerator and the average standard error of measurement (SE) in the denominator (Bond & Fox, 2007, p. 284). The formula for separation is

separation = SA / SE
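
Expressed in Python, assuming a vector of person measures and their standard errors from the calibration (a sketch of the definitions above, not Winsteps’s exact implementation):

import numpy as np

def reliability_and_separation(measures, standard_errors):
    observed_var = np.var(measures)                  # SD^2
    error_var = np.mean(np.square(standard_errors))  # average SE^2
    adjusted_var = observed_var - error_var          # SA^2 (true variance)
    reliability = adjusted_var / observed_var        # SA^2 / SD^2
    separation = np.sqrt(adjusted_var) / np.sqrt(error_var)  # SA / SE
    return reliability, separation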


Structural Equation Modeling

Structural equation modeling (SEM) is a statistical technique that involves the a priori specification of a hypothesized model. This model commonly consists of two main parts: a measurement model and a structural model. The measurement model involves the specification of observable variables that serve as indicators of a latent, unobserved variable. The observable variables commonly take the form of test scores or responses to questionnaire items. Best practices suggest that three or more observed variables should be used as indicators of a single latent variable. Each indicator should be measured on a continuous scale. An example taken from the present study would be the subscales of language learning aptitude that serve as manifest variables of a latent language learning aptitude construct. Changes in levels of the latent variable are expected to be observable in the manifest variables.

The structural component of the model is constructed by hypothesizing relationships among latent variables. A latent variable could be modeled as influencing another latent variable, covarying with a latent variable, or existing independently of other latent variables. Each dependent variable in the model has an error term associated with it to account for the residual after predicting the variable. Error in manifest variables is represented by error terms, and error in latent variables is depicted as disturbance terms, which represent the influence of variables not accounted for in the model.
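
For illustration, a model of this kind can be specified in lavaan-style syntax using the open-source Python package semopy; the variable names below are hypothetical, and this sketch is not the AMOS setup used in the present study.

import pandas as pd
import semopy

# Measurement model: four aptitude subtests load on a latent aptitude
# factor. Structural model: aptitude predicts a latent L2 procedural
# knowledge factor. All observed-variable names are illustrative.
MODEL_DESC = """
aptitude =~ num_learning + sound_symbol + vocab_learning + lang_analysis
procedural =~ holistic + complexity + accuracy + fluency
procedural ~ aptitude
"""

def fit_sem(data: pd.DataFrame):
    model = semopy.Model(MODEL_DESC)
    model.fit(data)          # estimate parameters (ML-based by default)
    return model.inspect()   # parameter estimates and significance tests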

Measurement models and structural models are evaluated using goodness-of-fit statistics. These indices provide estimates of the fit between the observed covariance matrix and the population covariance matrix. The indices that are favored by researchers are often in flux (Kline, 2005), and the breadth of choice of fit indices makes it difficult to compare reported research or to provide clear guidelines as to which fit indices are most preferable.

Researchers have tended to prefer the chi-square statistic as an indicator of overall model fit. As this statistic is sensitive to sample size, its interpretation should be supplemented with other fit indices. The root mean square error of approximation provides additional evidence of fit because it is a parsimony-adjusted index that is based on a noncentral chi-square distribution (Kline). Commonly, values around .05 are considered indicative of good model fit. Finally, a hypothesized model can be compared to a baseline model using the comparative fit index. This index expresses the degree to which a hypothesized model fits the data compared with an independence model, which assumes zero correlations among the variables when analyzed using the population parameters.
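
Both indices can be computed directly from the model and baseline chi-square statistics; the following sketch uses hypothetical values for illustration.

def rmsea(chi2_model, df_model, n):
    # Parsimony-adjusted index based on the noncentral chi-square.
    return (max(chi2_model - df_model, 0) / (df_model * (n - 1))) ** 0.5

def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    # Compares the hypothesized model with the independence model.
    d_model = max(chi2_model - df_model, 0)
    d_baseline = max(chi2_baseline - df_baseline, d_model)
    return 1 - d_model / d_baseline

# Hypothetical example: chi-square = 48 on 40 df with N = 249, against
# a baseline (independence) chi-square of 600 on 55 df.
print(round(rmsea(48, 40, 249), 3))    # 0.028 -> good fit
print(round(cfi(48, 40, 600, 55), 3))  # 0.985 -> good fit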

There are six basic categories of fit indices: absolute fit measures, relative fit measures, parsimony adjusted measures, fit measures derived from the noncentral chi-square distribution, information-based measures, and adequacy of sample-size measures.

Absolute fit measures assess the fit of the observed covariance matrix to the estimated population covariance matrix. These measures include chi-square, the root mean-square residual (RMR), the goodness-of-fit index (GFI; the amount of variance explained by the model), the adjusted goodness-of-fit index (AGFI; GFI adjusted for the number of estimated parameters), and the parsimony ratio of GFI (PGFI; GFI controlled for degrees of freedom). Even though the chi-square test is overly sensitive to sample size, it is the standard absolute fit index. Nonsignificant values indicate good fit.


Relative fit measures are based on the comparison of the observed model with the independence model and the saturated model; the independence model represents a model of unrelated variables, while the saturated model reflects a model with zero degrees of freedom. The observed covariance matrix is situated along the continuum of fit between these models (Tabachnick & Fidell, 2007). These indices include the normed fit index (NFI), the relative fit index (RFI), the incremental fit index (IFI), the Tucker-Lewis index (TLI), and the comparative fit index (CFI). CFI values of .95 and above indicate good model fit (Hu & Bentler, 1999).

Parsimony adjusted fit indices take into consideration the number of model parameters. Adding too many parameters to a model can cause it to fit better than a model with fewer parameters. These measures include the parsimony ratio (PRATIO), which can be applied to other measures to control for the degrees of freedom, such as in the parsimony-ratio normed fit index (PNFI) and the parsimony-ratio comparative fit index (PCFI). Values greater than .60 usually indicate good-fitting models (Blunch, 2008).

The next set of fit indices is based on the noncentral chi-square distribution. These measures include the noncentrality parameter (NCP), the minimum discrepancy function (FMIN), the population discrepancy (FO), and the root mean-square error of approximation (RMSEA). For this set of indices, smaller values indicate better model fit. For RMSEA, values < .06 indicate good-fitting models (Hu & Bentler, 1999).

Information-based measures of model fit represent estimates of how well a model will be validated in samples from the same population. These indices are usually most relevant for comparing two or more models. These measures include Akaike’s Information Criterion (AIC), the Browne-Cudeck Criterion (BCC), the Bayes Information Criterion (BIC), the consistent version of AIC (CAIC), the expected cross-validation index (ECVI), and the maximum likelihood/modified ECVI (MECVI). When comparing models, smaller values tend to indicate better-fitting models.

The last set of indices represents the adequacy of the sample size used to test a model. Critical N, which is labeled as HOELTER in AMOS, indicates an estimate of the necessary sample size for a model that will not generate a statistically significant chi-square value.

In addition to these fit indices, structural equation modeling provides researchers with regression and correlation coefficients for the relationships tested in the hypothesized model. These coefficients represent the effects exerted by the variables in the model.

Analytical Procedures and Evaluative Criteria

The scores from the instruments were used as measured variables in structural models designed to test the hypotheses under investigation. First, scores for each measured variable were subjected to a Rasch analysis to investigate the reliability of the measures and the validity of measure use and interpretation. All missing responses to items were marked as incorrect. The fit of each item to the Rasch model on each test was examined. This part of the analysis was based on Rasch fit statistics, item calibrations, standard errors, Rasch PCA, reliability, and separation statistics. Infit or outfit mean-square values of less than 1.50 were used as an initial criterion of fit to the Rasch model, with .50 as a lower limit (Linacre, 2011a; Wright & Linacre, 1994). These criteria were chosen based on recommendations in the literature (e.g., Linacre). A fit statistic of .50 or 1.50 represents 50% less or more variation, respectively, in the responses to a given item than the Rasch model expects. If responses are too deterministic, the relevant item might not be well targeted to the sample. If there is too much unsystematic variance in the responses, the relevant item might not be targeting the latent variable, or there might be problematic aspects of the wording of the item. When unsystematic variance exceeds 50%, there is more noise in the responses than measurement-relevant information. Thus, items with fit values greater than 1.50 might be unproductive for measurement. However, their effects might be so minimal as to result in no substantive change to the measurement of the participants. Therefore, the presence of fit values greater than 1.50 indicates the need to investigate the effects of those items on the measurement of the sample. The value of 1.50, however, can fall outside of ±2 standard deviations. Thus, in addition to the first criterion, as a stricter test of data-to-model fit, a range of ±2 standard deviations from the mean of the fit statistics was used to judge fit to the Rasch model (McNamara, 1996); here, it is assumed that the fit statistics were sampled from a normally distributed set of values. Correlations between person measures computed with and without the suspect items reveal the measurement-relevant effects.
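
A sketch of the two screening criteria applied to a vector of mean-square fit statistics (the values below are illustrative only):

import numpy as np

def flag_misfit(mnsq, lower=0.50, upper=1.50):
    # Criterion 1: fixed mean-square range of .50 to 1.50.
    # Criterion 2: within ±2 SD of the mean of the fit statistics.
    mnsq = np.asarray(mnsq)
    outside_fixed = (mnsq < lower) | (mnsq > upper)
    mean, sd = mnsq.mean(), mnsq.std()
    outside_2sd = np.abs(mnsq - mean) > 2 * sd
    return outside_fixed, outside_2sd

fixed, two_sd = flag_misfit([1.04, 0.91, 1.52, 0.98, 1.06])
print(fixed)  # the third value trips the fixed 1.50 criterion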

Regarding dimensionality, a loading of three or more items on the first contrast in a Rasch PCA was considered a possible indication of a second dimension in the data (Linacre, 2011a). Finally, separation statistics > 2 are considered desirable when dividing a sample or setting cut-scores (Linacre). Separation statistics are highly influenced by the targeting of item difficulty estimates to person ability estimates, so the diversity of participant abilities and item difficulties influences Rasch separation statistics. An ungrouped design was used in the present study, but high separation statistics indicate diversity in the measures.

Descriptive statistics for each scaled variable were computed from the Rasch and CAF measures. The data were screened for the presence of outliers (i.e., standardized scores ≥ ±3.29). The normality of the distributions was checked by examining skewness and kurtosis statistics and through the inspection of histograms and P-P plots. Furthermore, the linearity and homoscedasticity of the variables were assessed through the inspection of bivariate scatter plots. Mahalanobis distance was used to detect multivariate outliers at the p < .001 level.
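
This multivariate screening can be sketched as follows; the squared Mahalanobis distances are compared against a chi-square critical value with degrees of freedom equal to the number of variables, and p < .001 corresponds to the .999 quantile.

import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.001):
    # Squared Mahalanobis distance of each row from the centroid.
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    # Flag cases exceeding the chi-square cutoff at p < .001.
    return d2 > chi2.ppf(1 - alpha, df=X.shape[1])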

Path and structural models were used to answer the research questions that guided this study. The goodness-of-fit of each model was judged using three fit statistics: chi-square, the comparative fit index, and the root mean square error of approximation. However, all fit indices provided by AMOS were reported to allow researchers to reinterpret the results based on changes in standards of goodness-of-fit interpretation. If sufficient model fit was not obtained, modification indices were examined, and theoretically defensible model modifications were carried out. The resulting model fit was then reported. If models are nested, the fit of each model can be compared by calculating the chi-square difference between the two models. For nonnested models, cross-validation indices can be compared (Browne & Cudeck, 1989). Furthermore, coefficients can be compared, and the statistical significance of the coefficients can be assessed. In SEM, multiple r-squared coefficients serve as effect sizes; they represent the amount of variance in the dependent variable accounted for by the independent variable(s). Furthermore, regression coefficients can be interpreted as effect sizes in their standardized or unstandardized forms (Grissom & Kim, 2012). The alpha level for statistical tests (e.g., correlation coefficients, regression coefficients, chi-square values) was set at .05. The comparative fit index and root mean square error of approximation were judged using Hu and Bentler’s (1999) criteria of .95 and .06, respectively.
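
For nested models, the chi-square difference test mentioned above can be computed in a few lines; a sketch with scipy:

from scipy.stats import chi2

def chi_square_difference(chi2_full, df_full, chi2_nested, df_nested):
    # The more constrained (nested) model has the larger chi-square
    # and degrees of freedom; the difference is itself chi-square
    # distributed under the null hypothesis of equal fit.
    diff = chi2_nested - chi2_full
    diff_df = df_nested - df_full
    return diff, diff_df, chi2.sf(diff, diff_df)

# Hypothetical comparison: 48 on 40 df versus 60 on 44 df.
print(chi_square_difference(48, 40, 60, 44))  # p for a 12-point change on 4 df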

Modeling procedures. The analytical methods of confirmatory factor analysis (CFA), path analysis, and SEM were used to assess the models designed to answer the research questions posited in this study.

Numerous recommendations for useful fit indices and fit criteria exist in the literature, but the use of chi-square, CFI, and RMSEA is commonly recommended (e.g., Blunch, 2008). The fit indices reported by AMOS were reported in this study, but model comparison was not the primary objective. The research questions and hypotheses were concerned with the relative strength of the regression coefficients and the variance accounted for by the independent variables in the models. Considering this focus, model fit was viewed as a necessary condition for valid interpretation of the estimated regression weights in the models. Thus, while model fit was reported and interpreted, estimates of covariance, regression weights, and effect sizes provided answers to the research questions. Exploratory factor analysis was conducted in SPSS 19, as was the computation of descriptive statistics, and path and structural models were tested in AMOS 19.

Issues of model complexity and sample size. Complex models require larger sample sizes to obtain stable parameter estimates and to achieve enough statistical power to detect significant relationships. Model complexity is, however, somewhat subjective. The number of parameters to be estimated, the number of measured variables, and the ways in which variables are operationalized are three characteristics to consider when evaluating model complexity. The models that were tested in the present study were simple to moderate in complexity. All of the latent variables were measured using at least three operationalizations of the constructs, and each operationalization consisted of a multiple-item test. Therefore, considering the range of possible structural models, the specified models are judged as relatively noncomplex and the variables as measured with a high level of methodological rigor.


CHAPTER 4

PRELIMINARY ANALYSES

In this chapter, the results of analyses of the instruments are reported to examine the test items and to offer validity evidence for the use of test scores in the examination of L2 declarative and procedural knowledge and language learning aptitude. The fundamental research question underlying the analyses of the tests is as follows: To what extent do the test items conform to the expectations of the Rasch model? More specifically, these analyses sought to determine (a) the degree to which the test items exhibited construct coverage along the Rasch dimension; (b) the extent to which the test items fit the Rasch model; and (c) the unidimensionality of the instruments.

Analysis

Rasch scaling was employed to examine the test performances. Additionally, principal component analysis (PCA) was used to analyze L2 complexity, accuracy, and fluency (CAF) measures. Rasch analyses were conducted with Winsteps 3.72.3 (Linacre, 2011b), and PCA was carried out in SPSS 19. Current versions of Winsteps use an updated algorithm when computing Rasch PCA of item residuals, which results in less variance being explained by the Rasch model than in the same analysis carried out in previous versions of the software (Linacre, 2011a).


Receptive Metalinguistic Knowledge Test

Two-hundred-five participants completed the test of receptive metalinguistic knowledge. The average of the person measures (M = -.45; SD = 1.04) was slightly lower than the arbitrarily set mean of the item estimates (M = .00; SD = 1.25). This figure implies that this set of items was slightly difficult on average for this sample of learners; however, .45 is less than half of a logit from the mean of the item difficulty estimates, which could be interpreted as fairly good targeting. The Rasch person measure reliability was .67 (person measure separation = 1.43). Cronbach’s alpha was also .67. The interpretation of Rasch person reliability estimates is the same as that of true-score approaches to reliability estimation. An estimate of .67 does not reach the widely accepted value of .70 for score reproducibility. However, only 156 cases out of 205 could be used in the reliability analysis due to the way the test was administered. After an initial pilot test, four additional items were added to the test, which resulted in an initial sample of 49 participants taking an 18-item test, and 156 participants taking a 22-item test. Responses to the additional four items were marked as missing data for the initial 49 participants. As these data were treated as missing, the number of valid cases for the reliability analysis was lowered. The Rasch model can compute person measures regardless of missing data on a small subset of items. The Rasch separation statistic indicates that, due to the targeting of the items to the sample or the number of items, two statistically distinct strata were not observable in the distribution of person ability estimates; values of 2.0 or higher are desirable, and the sample performed relatively similarly on the test. The Rasch item difficulty reliability was .98 (item measure separation = 6.61). These figures indicate that the item difficulty estimates were stable; that is, enough measurement-relevant information was obtained from the sample to accurately and precisely estimate the difficulty of the items. The item difficulty estimates spanned a wide range of 4.67 logits (min = -2.27, max = 2.40). A visual representation of the distribution of person ability estimates and item difficulty estimates is shown in Figure 4. In this Rasch item-person map, persons are placed on the left side of the map and are represented by number signs (#). They are placed according to their Rasch ability estimates, with more able persons being placed at the upper end of the vertical continuum. Likewise, the test items are shown on the right side of the vertical line. They are arranged according to their Rasch difficulty estimates. A person with an ability estimate of x has a 50% chance of succeeding on an item with a difficulty estimate of x. As the gap widens between a person ability estimate and an item difficulty estimate, the probability of success on that item increases (i.e., ability estimate > difficulty estimate) or decreases (i.e., ability estimate < difficulty estimate).

The most difficult item was Item 13 (question emphasis) at 2.40 logits, while the easiest item was Item 8 (participle construction) at -2.27 logits. The empirical item difficulty hierarchy was aligned with the expected values for the items. Items that dealt with metalinguistic knowledge commonly taught in Japanese junior high school English classes were at the lower end of the construct (e.g., items testing participle constructions, interrogative pronouns, and interrogative adverbs), while items tapping metalinguistic knowledge of grammatical points and structures less commonly taught or taught in the later grades of high school were more difficult (e.g., objective complements, relative adjective clauses, and question emphasis). The content of the easier items dealt mostly with single words that performed common functions in sentences or question formation, such as interrogative adverbs or pronouns. The items targeting the middle of the construct in terms of difficulty were varied among function words and longer constructions. For example, collective nouns were below the mean of item difficulty estimates and indefinite articles were above the mean. Similarly, participial constructions were below the mean, while nominal clauses were above it. The higher end of the construct was defined by less commonly taught structures, including incomplete intransitive verbs, rhetorical questions, and objective complements. Thus, the items defined a metalinguistic knowledge construct that ranged from easy to difficult.

Figure 4. Item-person map of the receptive metalinguistic construct (persons on the left and items on the right, ordered by Rasch estimates; each # represents three persons, each . represents one to two persons; M = mean; S = 1 SD; T = 2 SD).

Most notable when examining Figure 4 are the gaps between many of the item difficulty estimates. Small gaps in construct coverage were observed between Items 13 (question emphasis) and 3 (relative adjective clause), and between Item 3 and Item 11 (objective complement). While few participants were estimated to be in those ranges, ideally one item would fit in each of those gaps. Participants were observed in two more gaps, between Item 22 (collective noun) and Item 15 (negation emphasis), and between Item 5 (adjective order) and Item 1 (interrogative adverb), which implies that one or two additional items at those difficulty levels would have contributed to the overall coverage of the construct. Ideally, all of the items would have been staggered along the continuum of item difficulty estimates, increasing from the lower ranges by .20 logits, so that person abilities could be estimated with greater precision.


The exact item difficulty estimates are shown in Table 2. The precision of the item difficulty estimates varied depending on the location of the item. Items estimated to be at the higher and lower ends of the Rasch variable had higher error estimates due to the lack of information received about those items. Few participants were estimated to fall in the same ranges as those items, resulting in less measurement precision and less measurement-relevant information. Twenty-one of the 22 items had error estimates from .15 to .22 logits (M = .18, SD = .03). Wright (1977) suggested that for tests of 20 or 30 items calibrated on a sample of 100 participants, error estimates around the mean of .25 or lower contribute positively to the measurement of the sample. The precision of the current item measures was acceptable, and those items provided useful information for measurement.

Table 2. Rasch Item Statistics for the Receptive Metalinguistic Knowledge Test (Measure Order)

Item  Measure   SE  Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD  PMC
 13     2.40   .27     1.04         0.2        1.20          0.7      .18
  3     1.86   .22     0.91        -0.6        0.91         -0.3      .35
 11     1.29   .22     1.01         0.1        0.92         -0.3      .28
 14     1.13   .18     1.03         0.4        0.97         -0.1      .30
 19     1.11   .21     1.08         0.7        1.08          0.4      .22
  7     0.96   .18     0.91        -1.0        0.78         -1.4      .43
 21     0.68   .19     0.89        -1.3        0.81         -1.2      .43
 16     0.67   .17     1.07         1.0        1.13          0.9      .28
 20     0.64   .19     1.06         0.7        1.09          0.6      .26
 10     0.38   .16     1.11         1.6        1.06          0.6      .28
  6     0.30   .16     0.85        -2.5        0.77         -2.2      .51
  4    -0.06   .15     1.03         0.6        1.20          2.1      .34
 18    -0.20   .15     0.99        -0.2        0.98         -0.1      .39
  9    -0.25   .15     1.01         0.2        0.99          0.0      .38
 12    -0.37   .15     0.94        -1.1        0.93         -0.8      .44
 22    -0.56   .17     0.98        -0.4        0.93         -0.7      .42
 15    -1.01   .16     1.05         0.8        1.06          0.6      .35
  5    -1.13   .16     1.09         1.4        1.52          4.3      .29
  1    -1.71   .18     0.99        -0.1        0.93         -0.4      .40
  2    -1.84   .18     1.10         1.0        1.11          0.7      .30
 17    -2.01   .19     0.97        -0.2        0.99          0.0      .40
  8    -2.27   .20     0.84        -1.3        0.65         -1.8      .52
  M     0.00   .18     1.00         0.0        1.00          0.1
 SD     1.25   .03     0.08         1.0        0.18          1.3
Note. PMC = point-measure correlation.

Table 2 also shows the fit statistics for the items. Infit and outfit mean-square statistics should be centered on 1.00 for items that fit the Rasch model. Fit statistics over 1.00 (i.e., underfit) represent more variation in the responses to that item than predicted by the Rasch model, while figures under 1.00 represent overfit, which is less detrimental to measurement than underfit. Items showing a high degree of overfit do not adversely affect the measurement of the sample; too much overfit indicates an intensification of the Rasch variable (i.e., the responses are overly deterministic). The criterion of 1.50 was used to judge goodness-of-fit to the Rasch model, and an additional criterion of two standard deviations above or below the mean of the mean-square statistics was also employed. Based on these criteria, only one item exhibited misfit. Item 5 (adjective order) had an outfit mean-square of 1.52, thus underfitting the Rasch model. However, the infit mean-square for this item was within the acceptable range at 1.09. To examine the effects of this item on the measurement of the sample, one set of person measures based on all the items and one set calibrated without Item 5 were computed. The Rasch person reliability (.67) and separation (1.43) coefficients and the item measure reliability (.98) and separation (6.61) coefficients were the same for both sets of measures. Furthermore, the two sets of measures were strongly correlated (r = .99, p < .001, two-tailed). Considering these results, Item 5 was retained for the analyses.


The point-measure correlation coefficients provided additional evidence of the construct validity of the items. All of the coefficients were positive, and seven of the coefficients were over .40, indicating strong relationships between answers to each of those items and overall test scores. Thus, the items could be considered accurate, and there was empirical evidence that they were targeting the latent variable.

The dimensionality of the items was investigated through the use of Rasch PCA of item residuals (Table 3). The results indicated that 32% of the variance (eigenvalue = 10.3) was explained by the measures. Of the unexplained variance, 4.9% was explained by the first contrast (eigenvalue = 1.6). When judging the strength of the loading on the first contrast, Linacre (2011a) suggested the criterion of three items or more for initial diagnosis of a possible second dimension and recommended comparing the loading to the amount of variance explained by the item measures. As the strength of the loading was less than two items and more variance was explained by the Rasch item difficulty estimates (20.2%, eigenvalue = 6.5), no systematic effects were found in the unexplained variance. These results support the interpretation that the test items are unidimensional.

Table 3. Rasch Principal Components Analysis for the Receptive Metalinguistic Knowledge Test

Item  Loading  Measure  Infit MNSQ  Outfit MNSQ
  8     .47     -2.27      0.84         0.65
 20     .34      0.64      1.06         1.09
  1     .34     -1.71      0.99         0.93
 22     .29     -0.56      0.98         0.93
  9     .29     -0.25      1.01         0.99
  6     .28      0.30      0.85         0.77
  7     .26      0.96      0.91         0.78
 21     .24      0.68      0.89         0.81
 13     .09      2.40      1.04         1.20
 18     .08     -0.20      0.99         0.98
  3     .05      1.86      0.91         0.91
  4     .02     -0.06      1.03         1.20
  5    -.41     -1.13      1.09         1.52
 14    -.37      1.13      1.03         0.97
  2    -.35     -1.84      1.10         1.11
 15    -.29     -1.01      1.05         1.06
 12    -.27     -0.37      0.94         0.93
 17    -.27     -2.01      0.97         0.99
 11    -.18      1.29      1.01         0.92
 19    -.15      1.11      1.08         1.08
 16    -.14      0.67      1.07         1.13
 10    -.07      0.38      1.11         1.06
Note. Measures are Rasch logits.

Productive Metalinguistic Knowledge Test

Two-hundred-fifteen participants completed the test of productive metalinguistic knowledge. All of the participants’ responses to the 17 test items were scored by one rater for technical language use and rule explanation. A second rater scored a portion of the data. Out of the 17 items, five items (29%; Items 3, 4, 10, 13, & 16) and 34 tests (16%) were randomly selected and scored by the second rater for technical language use and rule explanation.

For technical language use, the inter-rater agreement coefficient was high at .96 (163/170). In the seven scoring disagreements, the attributed scores differed by one point on the rating scale. For correctness of rule explanation, the inter-rater agreement coefficient was high at .92 (157/170). Out of the disagreements, eight differed by one point on the rating scale.

As the inter-rater agreement coefficients were high for the two productive metalinguistic knowledge tests, the scores from rater 1 were considered to be accurate estimates of participants' productive metalinguistic knowledge. These sets of scores were then subjected to independent Rasch analyses using the rating-scale model. Scores given by the two raters were also analyzed using Facets to examine the effects of rater severity on the estimation of person ability measures. The item difficulty estimates and fit statistics were nearly identical to those of the rating-scale analysis. Furthermore, the person measures from the rating-scale analysis and the Facets analysis exhibited a perfect correlation (r = 1.00, p < .001). Considering these findings, it was concluded that the analysis method would not impact the item and person measures, and the Rasch rating-scale model was used for the analyses.

Technical Terminology Scale

Rasch item statistics for the technical terminology scale were computed and examined (Table 4). Rasch person measure reliability (separation) was sufficient at .78 (1.91). The separation statistic was slightly below the recommended value of 2.0, indicating homogeneity of the person measures relative to their standard errors. Cronbach’s alpha for the person measures was .82. The point-measure correlations were all positive and acceptable (.34-.56). The Rasch item reliability (separation) was .98 (7.17). The most difficult item was Item 13, which had a difficulty estimate of 2.00. This item tested technical terminology related to the dative alternation. The easiest item was Item 11, with a difficulty measure of -1.42. This item tested technical terminology for the regular past tense. All of the items showed acceptable infit and outfit MNSQ, except for Item 15 (since/for distinction), which slightly underfit the Rasch model. The outfit MNSQ statistic for this item was 1.61; this value exceeded the 1.50 criterion and was just over two standard deviations from the mean of the outfit statistics. To assess the effect of this item, a separate Rasch analysis was conducted without Item 15. The Rasch person reliability and separation statistics were identical to the initial analysis, and the two sets of person measures were strongly correlated (r = .99, p < .001). Thus, Item 15 was retained for further analyses.

The item-person map of the technical language construct (Figure 5) shows the empirical item difficulty hierarchy. The item difficulty measures were interpretable and were aligned as expected. Items testing relatively difficult structures (e.g., dative alternation, ergative verbs, embedded questions) were higher in the hierarchy than items focused on relatively easier structures (e.g., regular past tense, possessive –s, yes/no questions). These latter structures are also commonly introduced earlier in instructed L2 learning environments. The lower end of the construct consisted of regular past tense and possessive –s. These two structures occur with high frequency, which aligns with their placement in the lower range of the construct. The middle of the construct was defined by items that focused on modal verbs, verb complements, adverb placement, unreal conditionals, and tag questions. These structures are covered in Japanese junior and senior high school English courses and are sufficiently comprehensible or commonly encountered, which supports the validity of the interpretation of this range of the construct. The upper range of the construct was defined by items that targeted more complex grammatical structures. Learners often have difficulty with the since/for distinction, creating embedded questions, and explaining ergative verbs. Thus, the items defined a technical terminology construct that ranged from easy to difficult.

Table 4. Rasch Item Statistics for the Productive Metalinguistic Technical Terminology Scale (Measure Order)

Item  Measure   SE  Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD  PMC
 13     2.00   .17     0.86        -0.6        0.59         -1.4      .34
  3     0.73   .10     0.67        -4.0        0.70         -2.2      .45
 15     0.66   .10     1.45         4.4        1.61          3.6      .36
  2     0.48   .10     0.75        -3.2        0.72         -2.3      .49
  8     0.47   .10     1.22         2.5        1.08          0.7      .45
  5     0.44   .10     1.17         2.0        1.00          0.1      .52
 14     0.36   .10     0.73        -3.8        0.68         -2.9      .56
 16     0.31   .10     0.64        -5.3        0.72         -2.5      .52
  7     0.15   .09     0.77        -3.3        0.74         -2.5      .55
 12    -0.10   .09     1.17         2.2        1.09          0.9      .53
  9    -0.42   .09     1.05         0.7        1.05          0.5      .53
  1    -0.61   .10     1.41         4.6        1.43          3.5      .43
  4    -0.68   .10     1.20         2.3        1.27          2.3      .43
  6    -0.68   .10     0.70        -4.1        0.80         -1.9      .56
 17    -0.73   .10     0.87        -1.7        1.05          0.5      .54
 10    -0.96   .10     1.12         1.4        1.05          0.4      .56
 11    -1.42   .11     1.23         2.0        1.39          2.2      .46
  M     0.00   .10     1.00        -0.2        1.00         -0.1
 SD     0.80   .02     0.26         3.1        0.29          2.1
Note. PMC = point-measure correlation.

The mean of the Rasch person measures (-0.32) was close to the mean of the Rasch item measures (.00). This relationship indicates that the items were slightly difficult for this sample of learners. A few gaps were noticeable in the item difficulty hierarchy. The largest gap was between Item 13 (dative alternation) at 2.00 logits and Item 3 (ergative verbs) at 0.73 logits. These gaps are less worrisome when using a rating scale because the response points of each item act as response locations along the Rasch developmental pathway, resulting in adequate construct coverage.


The dimensionality of the items was investigated through a Rasch PCA of item residuals (Table 5). The results indicated that 39% of the variance (eigenvalue = 10.9) was explained by the measures. Of the unexplained variance, 6.0% was explained by the first contrast (eigenvalue = 1.7). This amount of variance is less than the amount explained by the Rasch item difficulty estimates (26.1%, eigenvalue = 7.3). As the strength of the loading was less than two items, no systematic effects were found in the unexplained variance. These results support the interpretation that the test items are unidimensional.

Table 5. Rasch Principal Components Analysis for the Productive Metalinguistic Technical Terminology Scale

Item  Loading  Measure  Infit MNSQ  Outfit MNSQ
  5     .52      0.44      1.17         1.00
  3     .44      0.73      0.67         0.70
 12     .42     -0.10      1.17         1.09
  9     .33     -0.42      1.05         1.05
 14     .22      0.36      0.73         0.68
  2     .14      0.48      0.75         0.72
 10     .10     -0.96      1.12         1.05
 15     .08      0.66      1.45         1.61
 13     .03      2.00      0.86         0.59
  4    -.52     -0.68      1.20         1.27
 11    -.45     -1.42      1.23         1.39
  1    -.42     -0.61      1.41         1.43
  8    -.42      0.47      1.22         1.08
 16    -.09      0.31      0.64         0.72
  7    -.08      0.15      0.77         0.74
 17    -.06     -0.73      0.87         1.05
  6    -.04     -0.68      0.70         0.80
Note. Measures are Rasch logits.


Figure 5. Item-person map of the technical terminology construct (persons on the left and items on the right, ordered by Rasch estimates; each # represents three persons, each . represents one to two persons; M = mean; S = 1 SD; T = 2 SD).


Rule Explanation Scale

Rasch item statistics for the rule explanation scale were computed and examined. Rasch person measure reliability (separation) was sufficient at .81 (2.05). The separation statistic was slightly above the recommended value of 2.0, indicating diversity of the person measures relative to their standard errors. Cronbach’s alpha for the person measures was .86. The point-measure correlations were all positive and acceptable (.39-.63). The Rasch item reliability (separation) was .98 (6.95). The most difficult item was Item 13, which had a difficulty estimate of 0.95. This item tested rule knowledge related to the dative alternation. The easiest item was Item 10, with a difficulty measure of -0.79. This item tested rule knowledge of possessive -s. All of the items showed acceptable infit and outfit MNSQ; no items misfit the Rasch model. Item measures and fit statistics are shown in Table 6.

The item-person map of the rule explanation construct shows the empirical item difficulty hierarchy (Figure 6). The item difficulty measures were interpretable and were aligned as expected. Items testing relatively difficult structures (e.g., dative alternation, indefinite article, unreal conditionals) were higher in the hierarchy than items focused on relatively easier structures (e.g., possessive –s, regular past tense, modal verbs). These latter structures are also commonly introduced earlier in instructed L2 learning environments. The lower end of the construct consisted of possessive –s and regular past tense. These two structures are frequently encountered, which aligns with their placement in the lower range of the construct. The middle of the construct was defined by items that focused on verb complements, comparatives, relative clauses, and adverb placement. These structures are covered in Japanese junior and senior high school English courses and are sufficiently comprehensible or commonly encountered, which supports the validity of the interpretation of this range of the construct. The upper range of the construct was defined by items that targeted more complex grammatical structures. Learners often have difficulty with the dative alternation, determiners, unreal conditionals, and ergative verbs. Thus, the items defined a rule explanation construct that ranged from easy to difficult.

The mean of the Rasch person measures (0.14) was close to the mean of the Rasch item measures (.00). This relationship indicates that the items were slightly easy for this sample of learners. There was a gap of .32 logits between Item 13 (dative alternation) at 0.95 logits and Item 8 (indefinite article) at 0.63 logits. This gap is less worrisome when using a rating scale because the response points of each item act as response locations along the Rasch developmental pathway, resulting in adequate construct coverage. Overall, the item difficulty estimates and the person ability estimates were adequately aligned.

The dimensionality of the items was investigated through a Rasch PCA of item residuals (Table 7). The results indicated that 42.7% of the variance (eigenvalue = 12.7) was explained by the measures. Of the unexplained variance, 5.2% was explained by the first contrast (eigenvalue = 1.5). This amount of variance is less than the amount explained by the Rasch item difficulty estimates (28.1%, eigenvalue = 8.3). As the strength of the loading was less than two items, no systematic effects were found in the unexplained variance. These results support the interpretation that the test items are unidimensional.


Table 6. Rasch Item Statistics for the Productive Metalinguistic Rule Explanation Scale (Measure Order)

Item  Measure   SE  Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD  PMC
 13     0.95   .06     0.72        -3.2        0.61         -2.9      .51
  8     0.63   .05     1.09         1.1        1.01          0.1      .50
 16     0.49   .05     0.63        -5.5        0.74         -2.6      .52
  3     0.46   .05     0.98        -0.2        0.92         -0.7      .48
  7     0.20   .05     0.74        -3.5        0.78         -2.3      .53
 14     0.18   .05     0.60        -5.9        0.58         -4.7      .63
  2     0.18   .05     0.93        -0.8        0.88         -1.2      .56
 15     0.16   .05     1.37         4.1        1.44          3.7      .46
  5    -0.05   .06     1.10         1.1        1.07          0.6      .59
  1    -0.20   .06     1.40         3.6        1.46          3.3      .39
 12    -0.23   .06     1.19         1.8        1.06          0.5      .60
  6    -0.24   .06     0.76        -2.6        0.72         -2.4      .56
  4    -0.24   .06     1.23         2.1        1.31          2.2      .50
 17    -0.33   .06     1.13         1.2        1.19          1.4      .51
  9    -0.54   .07     0.98        -0.1        0.94         -0.3      .55
 11    -0.64   .07     1.36         2.6        1.48          2.8      .46
 10    -0.79   .08     1.44         2.9        1.26          1.5      .56
  M     0.00   .06     1.04        -0.1        1.03         -0.1
 SD     0.46   .01     0.27         3.0        0.29          2.3
Note. PMC = point-measure correlation.


Figure 6. Item-person map of the rule explanation construct (persons on the left and items on the right, ordered by Rasch estimates; each # represents two persons, each . represents one person; M = mean; S = 1 SD; T = 2 SD).


Table 7. Rasch Principal Components Analysis for the Productive Metalinguistic Rule Explanation Scale

Item  Loading  Measure  Infit MNSQ  Outfit MNSQ
 12     .57     -0.23      1.19         1.06
 14     .33      0.18      0.60         0.58
  9     .28     -0.54      0.98         0.94
 16     .26      0.49      0.63         0.74
  3     .25      0.46      0.98         0.92
  7     .20      0.20      0.74         0.78
 17     .16     -0.33      1.13         1.19
  5     .12     -0.05      1.10         1.07
  8     .09      0.63      1.09         1.01
  2     .05      0.18      0.93         0.88
 11    -.54     -0.64      1.36         1.48
  1    -.54     -0.20      1.40         1.46
  4    -.26     -0.24      1.23         1.31
  6    -.25     -0.24      0.76         0.72
 15    -.23      0.16      1.37         1.44
 10    -.13     -0.79      1.44         1.26
 13    -.13      0.95      0.72         0.61
Note. Measures are Rasch logits.

Language Learning Aptitude Test

The reliability of the 100-item test as measured by Cronbach’s alpha was .92. The test performances were subjected to a Rasch analysis. The Rasch reliability of the person measures was .91 with a separation index of 3.24. The Rasch reliability of the item measures was .98 with a separation index of 6.31. The mean of the person measure estimates was .42. This statistic implies that the test was relatively easy for the sample. Only two items slightly underfit the Rasch model. Figure 7 shows the item-person map for all of the test items. The items were sufficiently spread along the construct. Only seven of the 224 participants who completed the test were estimated to have ability measures above the most difficult item. The upper range of the construct was defined primarily by number learning items. The amount of information that had to be processed and the strict time limit imposed for each number learning item resulted in a challenging set of items. The middle of the construct was defined by a mix of items from each of the four language aptitude scales. The lower portion of the construct was defined primarily by the easier language analytical ability items and the sound-symbol association items. The vocabulary learning items were distributed along the Rasch developmental pathway, but they were relatively scarce in the upper range.

The dimensionality of the items was investigated through a Rasch PCA of item residuals. The results indicated that 24.2% of the variance (eigenvalue = 31.9) was explained by the measures. Of the unexplained variance, 4.0% was explained by the first contrast (eigenvalue = 5.3). This amount of variance is less than the amount explained by the Rasch item difficulty measures (16.9%, eigenvalue = 22.3). Although the loading on the first contrast had the strength of slightly more than three items, the loading accounted for less than 5% of the unexplained variance. Moreover, this analysis considered the dimensionality of all 100 items of the language aptitude test—these items can be divided into four scales and can be assumed to be somewhat multidimensional. Thus, no systematic effects were found in the unexplained variance. After this initial examination of the test items, each scale was analyzed separately.


[Item-person map of the 100 aptitude items. The most difficult items were number learning items (e.g., LNum8b, LNum7b at the top of the map); the easiest were language analytical ability items (LngAn1 through LngAn6 at the bottom), with sound-symbol association and vocabulary items in between.]

Figure 7. Item-person map of the language learning aptitude construct. Each # represents two persons. Each . represents one person. M = mean; S = 1 SD; T = 2 SD.


Number Learning Scale

Rasch item statistics for the 45-item number learning scale were computed and examined (Table 8). Rasch person measure reliability (separation) was sufficient at .87 (2.59). The separation statistic was above the recommended value of 2.0, indicating heterogeneity of the person measures relative to their standard errors. Cronbach’s alpha for the person measures was .89. The point-measure correlations were all positive and acceptable (.23-.50). The Rasch item reliability (separation) was .95 (4.36). The most difficult item was LNum 8b, which had a difficulty estimate of 1.35. The easiest item was LNum 1a, with a difficulty measure of -1.41. All of the items showed acceptable infit and outfit MNSQ, except for four items. LNum 15b had an outfit MNSQ statistic (1.38) that was more than two standard deviations from the mean. Similarly, LNum 15c (outfit MNSQ = 1.26), LNum 11b (outfit MNSQ = 1.25), and LNum 9b (infit MNSQ = 1.17) slightly underfit the model. To assess the effects of these items, a separate Rasch analysis was conducted without those four items. The Rasch person reliability and separation statistics were identical to the initial analysis, and the two sets of person measures were strongly correlated (r = .99, p < .001). Thus, those items were retained for subsequent analyses.
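This retention check can be reproduced in a few lines. A sketch under the assumption that two vectors of person measures, one from the full item set and one from the trimmed set, are available (random data stand in for them here):

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(1)
    measures_full = rng.normal(0.0, 1.0, 224)                    # all items kept
    measures_trimmed = measures_full + rng.normal(0, .05, 224)   # misfits dropped

    r, p = pearsonr(measures_full, measures_trimmed)
    # An r near 1.0 indicates that removing the misfitting items barely changes
    # the person ordering, which supports retaining them.
    print(f"r = {r:.2f}, p = {p:.3g}")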

The item-person map of the number learning construct shows the empirical item difficulty hierarchy (Figure 8). The mean of Rasch person measures (-0.28) was close to the mean of the Rasch item measures (.00). This relationship indicates that the items were slightly difficult for this sample of learners. No gaps were noticeable in the item difficulty hierarchy. However, some participants had ability estimates that exceeded the most difficult item or fell below the easiest item. The items were well-aligned with the majority of the sample. The upper range of the construct was defined primarily by items that tapped the second digit of a three-digit number. This result is interpretable in that the first and last pieces of aurally presented information are usually easier to remember than the information in the middle. The middle and lower end of the construct were defined primarily by the first and third digits presented in the items. These digits could have been more salient to the participants, which would aid in their recognition. Thus, the items defined a number learning construct that ranged from easy to difficult.

The dimensionality of the items was investigated through a Rasch PCA of item residuals (Table 9). The results indicated that 22.5% of the variance (eigenvalue = 13.1) was explained by the measures. Of the unexplained variance, 6.1% was explained by the first contrast (eigenvalue = 3.5). This amount of variance is less than the amount explained by the Rasch item difficulty estimates (14%, eigenvalue = 8.1). The three digits comprising LNum 4a, 4b, and 4c, along with LNum 8a and 11b, loaded positively on the first contrast, while LNum 5a, 5b, and 5c loaded negatively. An analysis of these items revealed no similarities that would suggest a secondary factor in the data. As the strength of the loading was just on the borderline of the initial criteria, the Rasch PCA results supported the interpretation that the test items were unidimensional.


Table 8. Rasch Item Statistics for the Number Learning Scale (Measure Order)

Item      Measure   SE   Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD  PMC
LNum8b      1.35   .18      0.99        -0.1         0.97         -0.1     .35
LNum7b      1.26   .17      1.03         0.3         1.07          0.5     .32
LNum15b     1.14   .17      1.11         1.2         1.38          2.4     .23
LNum9a      1.08   .17      1.13         1.5         1.06          0.4     .28
LNum12b     1.03   .17      1.00         0.0         1.09          0.7     .34
LNum9b      0.97   .16      1.17         2.0         1.13          1.0     .25
LNum3b      0.89   .16      0.97        -0.3         0.95         -0.4     .39
LNum12c     0.82   .16      1.02         0.3         1.02          0.2     .35
LNum13b     0.79   .16      0.98        -0.3         0.99          0.0     .38
LNum4b      0.69   .16      1.00         0.0         1.01          0.2     .37
LNum9c      0.69   .16      0.90        -0.4         0.94         -0.5     .40
LNum13a     0.54   .15      0.89         0.4         1.02          0.3     .36
LNum11c     0.52   .15      0.99        -1.7         1.03          0.3     .45
LNum2b      0.50   .15      1.15        -1.7         0.82         -1.9     .47
LNum4c      0.33   .15      1.16        -0.2         0.97         -0.3     .40
LNum15c     0.29   .15      0.94         2.5         1.26          2.7     .27
LNum11b     0.27   .15      1.03         2.7         1.25          2.6     .27
LNum5a      0.24   .15      0.92        -1.0         0.95         -0.6     .43
LNum5b      0.15   .15      0.95         0.6         1.04          0.5     .37
LNum14c    -0.02   .15      1.14        -1.5         0.88         -1.5     .46
LNum8c     -0.04   .15      1.13        -1.1         0.91         -1.0     .44
LNum8a     -0.11   .15      0.99         2.6         1.20          2.3     .29
LNum3a     -0.24   .15      0.95         2.4         1.10          1.3     .31
LNum10b    -0.24   .15      1.12        -0.2         0.97         -0.3     .42
LNum11a    -0.24   .15      1.03        -1.0         0.91         -1.1     .44
LNum4a     -0.28   .15      0.97         2.4         1.21          2.4     .30
LNum14b    -0.28   .15      1.02         0.7         1.05          0.6     .38
LNum5c     -0.30   .15      0.97        -0.6         0.94         -0.7     .43
LNum10c    -0.33   .15      1.02         0.4         1.01          0.2     .39
LNum10a    -0.35   .15      0.97        -0.6         0.96         -0.5     .43
LNum7c     -0.39   .15      0.93        -1.4         1.05          0.6     .45
LNum14a    -0.48   .15      0.95        -0.9         0.90         -1.1     .45
LNum7a     -0.50   .15      0.99        -0.3         1.00          0.0     .41
LNum2c     -0.55   .15      0.90        -1.8         0.85         -1.8     .49
LNum3c     -0.63   .15      0.93        -1.4         0.91         -1.0     .46
LNum13c    -0.63   .15      0.91        -1.7         0.86         -1.6     .48
LNum15a    -0.63   .15      1.02         0.4         1.00          0.1     .40
LNum2a     -0.70   .15      0.89        -2.1         0.82         -2.0     .50
LNum6b     -0.70   .15      1.02         0.4         1.02          0.2     .39
LNum1c     -0.72   .15      0.96        -0.7         0.94         -0.6     .44
LNum6a     -0.75   .15      0.99        -0.2         0.93         -0.7     .43
LNum6c     -0.84   .15      0.92        -1.3         0.84         -1.6     .48
LNum1b     -1.05   .16      1.05         0.7         1.03          0.3     .38
LNum12a    -1.13   .16      1.01         0.1         0.99          0.0     .41
LNum1a     -1.41   .17      0.92        -1.0         0.85         -1.0     .48
M           0.00   .15      1.00         0.0         1.00          0.0
SD          0.70   .01      0.08         1.3         0.12          1.2

Note. PMC = point-measure correlation. Each item is labeled with a number, which represents the item number, and a letter, which represents the first, second, or third digit (a, b, & c, respectively) of a three-digit number.


Sound-Symbol Association Scale

Rasch item statistics for the sound-symbol association scale were computed and examined (Table 10). Rasch person measure reliability (separation) was slightly low at .65 (1.36). The separation statistic was below the recommended value of 2.0, indicating homogeneity of the person measures relative to their standard errors. However, Cronbach’s alpha for the person measures was adequate at .83. The number of items (k = 20) and the targeting of the items to the sample likely resulted in a Rasch person reliability estimate below .70. The point-measure correlations were all positive and acceptable (.34-.56). The Rasch item reliability (separation) was .83 (2.18). The most difficult item was SndSym 18, which had a difficulty estimate of 0.88. The easiest item was SndSym 17, with a difficulty measure of -0.67. All of the items showed acceptable infit and outfit MNSQ, except for SndSym 18, which slightly underfit the Rasch model.

The outfit MNSQ statistic for this item was 1.41—this value was just over two standard deviations from the mean of the outfit statistics. To assess the effect of this item, a separate Rasch analysis was conducted without SndSym 18. The Rasch person reliability and separation statistics were lower with SndSym 18 removed (.62 and 1.27, respectively), as were the Rasch item reliability and separation estimates (.80 and 1.98, respectively).


[Item-person map. Item difficulties ranged from LNum7b and LNum8b at the top to LNum1a at the bottom, with the persons distributed across the full range of the items.]

Figure 8. Item-person map of the number learning construct. Each # represents two persons. Each . represents one person. M = mean; S = 1 SD; T = 2 SD.


Table 9. Rasch Principal Components Analysis for the Number Learning Scale

Item      Loading  Measure  Infit MNSQ  Outfit MNSQ
LNum4b      .65      0.69      1.00        1.01
LNum4a      .58     -0.28      0.97        1.21
LNum4c      .58      0.33      1.16        0.97
LNum8a      .54     -0.11      0.99        1.20
LNum11b     .48      0.27      1.03        1.25
LNum3a      .31     -0.24      0.95        1.10
LNum8b      .31      1.35      0.99        0.97
LNum15b     .24      1.14      1.11        1.38
LNum13b     .19      0.79      0.98        0.99
LNum12b     .17      1.03      1.00        1.09
LNum7b      .15      1.26      1.03        1.07
LNum11a     .14     -0.24      1.03        0.91
LNum10b     .08     -0.24      1.12        0.97
LNum6a      .07     -0.75      0.99        0.93
LNum3b      .07      0.89      0.97        0.95
LNum11c     .07      0.52      0.99        1.03
LNum13a     .06      0.54      0.89        1.02
LNum10c     .05     -0.33      1.02        1.01
LNum15c     .04      0.29      0.94        1.26
LNum12a     .00     -1.13      1.01        0.99
LNum5a     -.58      0.24      0.92        0.95
LNum5b     -.58      0.15      0.95        1.04
LNum5c     -.40     -0.30      0.97        0.94
LNum8c     -.34     -0.04      1.13        0.91
LNum9a     -.33      1.08      1.13        1.06
LNum9b     -.27      0.97      1.17        1.13
LNum1b     -.26     -1.05      1.05        1.03
LNum2b     -.25      0.50      1.15        0.82
LNum14b    -.21     -0.28      1.02        1.05
LNum12c    -.21      0.82      1.02        1.02
LNum7a     -.20     -0.50      0.99        1.00
LNum15a    -.18     -0.63      1.02        1.00
LNum1a     -.17     -1.41      0.92        0.85
LNum9c     -.16      0.69      0.90        0.94
LNum7c     -.11     -0.39      0.93        1.05
LNum1c     -.09     -0.72      0.96        0.94
LNum3c     -.09     -0.63      0.93        0.91
LNum14a    -.09     -0.48      0.95        0.90
LNum6b     -.08     -0.70      1.02        1.02
LNum2a     -.08     -0.70      0.89        0.82
LNum13c    -.07     -0.63      0.91        0.86
LNum6c     -.07     -0.84      0.92        0.84
LNum10a    -.05     -0.35      0.97        0.96
LNum14c    -.05     -0.02      1.14        0.88
LNum2c     -.03     -0.55      0.90        0.85

Note. Measures are Rasch logits.


The two sets of person measures were strongly correlated (r = .98, p < .001). Considering the lower reliability and separation estimates and the strong correlation between the sets of measures, SndSym 18 was retained for further analyses.

The item-person map of the sound-symbol association construct shows the empirical item difficulty hierarchy (Figure 9). The item difficulty hierarchy was difficult to interpret in that no one item should have been inherently more difficult for all learners. The number of phonemes in each item could be used to predict item difficulty. However, as the number of phonemes increases, clusters of phonemes can become more recognizable. And if the clusters become too long, words and sentences would be tested instead of the smaller units of sounds and symbols. Furthermore, in a sound-symbol association test, the distinctiveness of the symbols could also affect performance. Even in items that target one or two phonemes, learners could have difficulty in making sound-symbol associations if the symbols associated with the given phonemes, or those provided as distractors, have similar orthographic properties. These constraints are likely to influence the reliability of sound-symbol association tests—note the relatively low reliability estimates in the literature (e.g., Robinson, 2002; Sasaki, 1996). Thus, it is important to include a variety of phonemes and symbols in the items, while controlling for the length of the items and orthographic familiarity and distinctiveness.

The mean of Rasch person measures (1.67) was much higher than the mean of the Rasch item measures (.00). This relationship indicates that the items were easy for this sample of learners. Few gaps were noticeable in the item difficulty hierarchy. However, the items were not targeted well to the sample. Numerous participants were estimated to have ability measures higher than the most difficult item. More difficult sound-symbol association items were needed to fully measure the construct for this sample.

Table 10. Rasch Item Statistics for the Sound-Symbol Association Scale (Measure Order)

Item       Measure   SE   Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD  PMC
SndSym18     0.88   .16      1.28         3.7         1.41          3.5     .34
SndSym2      0.66   .17      1.13         1.7         1.07          0.6     .43
SndSym3      0.58   .17      0.88        -1.6         0.82         -1.5     .56
SndSym15     0.58   .17      1.06         0.8         1.11          0.9     .45
SndSym16     0.37   .17      1.00         0.0         1.02          0.2     .48
SndSym14     0.31   .18      1.02         0.3         1.08          0.6     .46
SndSym7      0.28   .18      0.98        -0.1         0.97         -0.2     .49
SndSym5      0.22   .18      1.04         0.4         1.04          0.3     .46
SndSym4      0.09   .18      0.96        -0.4         0.89         -0.6     .50
SndSym11     0.09   .18      1.03         0.4         1.06          0.4     .45
SndSym20    -0.09   .19      1.06         0.6         1.25          1.3     .42
SndSym19    -0.16   .19      0.98        -0.1         1.09          0.6     .46
SndSym6     -0.20   .19      0.89        -1.0         0.80         -1.0     .53
SndSym9     -0.36   .20      1.01         0.1         0.82         -0.8     .47
SndSym1     -0.40   .20      1.08         0.7         0.86         -0.6     .44
SndSym12    -0.48   .21      0.92        -0.6         0.71         -1.3     .51
SndSym13    -0.48   .21      0.96        -0.3         0.95         -0.2     .47
SndSym10    -0.57   .21      0.87        -1.0         0.75         -1.1     .53
SndSym8     -0.67   .22      0.86        -1.0         0.70         -1.2     .53
SndSym17    -0.67   .22      0.90        -0.7         0.84         -0.6     .50
M             .00   .19      1.00         0.1         0.96          0.0
SD            .46   .02      0.10         1.1         0.18          1.1

Note. PMC = point-measure correlation.

The dimensionality of the items was investigated through a Rasch PCA of item residuals (Table 11). The results indicated that 20.1% of the variance (eigenvalue = 5.0) was explained by the measures. Of the unexplained variance, 6.5% was explained by the first contrast (eigenvalue = 1.6). This amount of variance is less than the amount explained by the Rasch item difficulty estimates (8.6%, eigenvalue = 2.1).


[Item-person map. Item difficulties ranged from SndSym18 at the top to SndSym17 and SndSym8 at the bottom, with most persons located above the most difficult item.]

Figure 9. Item-person map of the sound-symbol association construct. Each # represents three persons. Each . represents one or two persons. M = mean; S = 1 SD; T = 2 SD.


Table 11. Rasch Principal Components Analysis for the Sound-Symbol Association Scale

Item       Loading  Measure  Infit MNSQ  Outfit MNSQ
SndSym6      .58     -0.20      0.89        0.80
SndSym10     .53     -0.57      0.87        0.75
SndSym8      .49     -0.67      0.86        0.70
SndSym3      .39      0.58      0.88        0.82
SndSym4      .17      0.09      0.96        0.89
SndSym7      .15      0.28      0.98        0.97
SndSym19     .07     -0.16      0.98        1.09
SndSym2     -.52      0.66      1.13        1.07
SndSym1     -.32     -0.40      1.08        0.86
SndSym5     -.27      0.22      1.04        1.04
SndSym20    -.20     -0.09      1.06        1.25
SndSym14    -.14      0.31      1.02        1.08
SndSym15    -.14      0.58      1.06        1.11
SndSym9     -.10     -0.36      1.01        0.82
SndSym12    -.09     -0.48      0.92        0.71
SndSym11    -.07      0.09      1.03        1.06
SndSym16    -.07      0.37      1.00        1.02
SndSym17    -.04     -0.67      0.90        0.84
SndSym13    -.02     -0.48      0.96        0.95
SndSym18    -.01      0.88      1.28        1.41

Note. Measures are Rasch logits.

As the strength of the loading was less than two items, no systematic effects were found in the unexplained variance. These results support the interpretation that the test items are unidimensional.

Vocabulary Learning Scale

Rasch item statistics for the vocabulary learning scale were computed and examined (Table 12). Rasch person measure reliability (separation) was adequate at .79 (1.97). The separation statistic was slightly below the recommended value of 2.0, indicating homogeneity of the person measures relative to their standard errors. Cronbach’s alpha for the person measures was .84. The point-measure correlations were all positive and acceptable (.37-.56). The Rasch item reliability (separation) was .97 (5.91). The most difficult item was Voc 18, which had a difficulty estimate of 1.21. The easiest item was Voc 8, with a difficulty measure of -2.31. All of the items showed acceptable infit and outfit MNSQ, except for Voc 1 and 3, which slightly underfit the Rasch model. The infit MNSQ statistic for Voc 1 was 1.20. This value was just over two standard deviations from the mean of the infit statistics. The infit MNSQ for Voc 3 was 1.22. To assess the effects of these items, a separate Rasch analysis was conducted without Voc 1 and 3. The Rasch person reliability estimate was the same for both sets of measures, but person separation (1.92) was lower with Items 1 and 3 removed. The two sets of person measures were strongly correlated (r = .99, p < .001). Considering the lower separation estimate and the strong correlation between the sets of measures, Voc 1 and 3 were retained for subsequent analyses.

The item-person map of the vocabulary learning construct shows the empirical item difficulty hierarchy (Figure 10). The mean of Rasch person measures (0.41) was slightly higher than the mean of the Rasch item measures (.00). This relationship indicates that the items were relatively easy for this sample of learners. A few gaps were noticeable in the item difficulty hierarchy. Numerous participants were estimated to have ability measures that were higher than the most difficult items. Similarly, there were gaps between Voc 11 and 7, and 15 and 11—many participants were estimated to be in those ranges. More difficult vocabulary items and a few items in the range of -1.90 to -0.45 would have improved the measurement of the construct for this sample. The participants had been exposed to the characters in the sound-symbol association portion of the test, which should have facilitated their learning of the symbols used in the Luna writing system. The item hierarchy was interpretable based primarily on the length of the vocabulary items. The lower range of the construct was defined by words that consisted of a single character. These words were easier to encode and recall. The middle portion of the construct was comprised of words consisting of two, three, or four characters. The upper range of the variable was defined exclusively by vocabulary items consisting of three and four characters. This result provides evidence for the interpretation that the items were tapping memory-related aspects of aptitude. Thus, the items defined a vocabulary learning construct that ranged from easy to difficult.

The dimensionality of the items was investigated through a Rasch PCA of item residuals (Table 13). The results indicated that 29.7% of the variance (eigenvalue = 8.4) was explained by the measures. Of the unexplained variance, 6.3% was explained by the first contrast (eigenvalue = 1.8). This amount of variance is less than the amount explained by the Rasch item difficulty estimates (15.7%, eigenvalue = 4.5). As the strength of the loading was less than two items, no systematic effects were found in the unexplained variance. These results support the interpretation that the test items are unidimensional.


Table 12. Rasch Item Statistics for the Vocabulary Learning Scale (Measure Order)

Item    Measure   SE   Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD  PMC
Voc18     1.21   .16      0.95        -0.6         0.96         -0.3     .51
Voc3      1.11   .16      1.22         2.9         1.35          2.7     .37
Voc20     0.98   .16      0.94        -0.8         0.91         -0.8     .52
Voc17     0.90   .16      0.89        -1.7         0.80         -2.0     .56
Voc5      0.72   .16      1.13         2.0         1.29          2.6     .41
Voc12     0.72   .16      1.05         0.8         1.12          1.2     .46
Voc16     0.68   .16      1.06         1.0         1.34          3.1     .45
Voc6      0.60   .16      1.05         0.7         1.04          0.5     .47
Voc19     0.46   .16      1.02         0.4         0.95         -0.5     .48
Voc2      0.26   .16      0.96        -0.6         0.95         -0.5     .50
Voc4      0.16   .16      0.91        -1.4         0.90         -1.0     .53
Voc13     0.14   .16      1.05         0.8         1.07          0.7     .46
Voc9      0.12   .16      0.96        -0.6         1.03          0.3     .50
Voc1      0.02   .16      1.20         3.1         1.26          2.3     .37
Voc10    -0.08   .16      0.90        -1.7         0.83         -1.6     .54
Voc7     -0.36   .16      0.95        -0.7         1.01          0.2     .50
Voc11    -1.09   .16      0.98        -0.2         0.89         -0.5     .47
Voc15    -1.98   .18      0.82        -1.2         0.53         -1.7     .54
Voc14    -2.25   .24      0.89        -0.6         0.57         -1.3     .51
Voc8     -2.31   .25      0.88        -0.6         0.63         -1.0     .51
M         0.00   .17      0.99         0.0         0.97          0.1
SD        1.06   .03      0.10         1.4         0.23          1.5

Note. PMC = point-measure correlation.


[Item-person map. Item difficulties ranged from Voc18 and Voc3 at the top to Voc15, Voc14, and Voc8 at the bottom, with many persons located above the most difficult items.]

Figure 10. Item-person map of the vocabulary learning construct. Each # represents two persons. Each . represents one person. M = mean; S = 1 SD; T = 2 SD.


Table 13. Rasch Principal Components Analysis for the Vocabulary Learning Scale

Item    Loading  Measure  Infit MNSQ  Outfit MNSQ
Voc14     .64     -2.25      0.89        0.57
Voc8      .57     -2.31      0.88        0.63
Voc9      .33      0.12      0.96        1.03
Voc20     .25      0.98      0.94        0.91
Voc10     .23     -0.08      0.90        0.83
Voc17     .21      0.90      0.89        0.80
Voc3      .18      1.11      1.22        1.35
Voc18     .16      1.21      0.95        0.96
Voc2      .06      0.26      0.96        0.95
Voc4     -.53      0.16      0.91        0.90
Voc5     -.52      0.72      1.13        1.29
Voc13    -.22      0.14      1.05        1.07
Voc11    -.21     -1.09      0.98        0.89
Voc16    -.20      0.68      1.06        1.34
Voc6     -.13      0.60      1.05        1.04
Voc19    -.12      0.46      1.02        0.95
Voc15    -.11     -1.98      0.82        0.53
Voc1     -.03      0.02      1.20        1.26
Voc7     -.02     -0.36      0.95        1.01
Voc12    -.01      0.72      1.05        1.12

Note. Measures are Rasch logits.

Language Analytical Ability Scale

Rasch item statistics for the language analytical ability scale were computed and examined (Table 14). Rasch person measure reliability (separation) was slightly low at .68 (1.45). The separation statistic was below the recommended value of 2.0, indicating homogeneity of the person measures relative to their standard errors. However, Cronbach’s alpha for the person measures was adequate at .79. The number of items (k = 15) and the targeting of the items to the sample likely resulted in a Rasch person reliability estimate below .70. The point-measure correlations were all positive and acceptable (.37-.59). The Rasch item reliability (separation) was .97 (6.18). The most difficult item was LngAn 13, which had a difficulty estimate of 2.76. The easiest item was LngAn 6, with a difficulty measure of -1.97. All of the items showed acceptable infit and outfit MNSQ statistics except for LngAn 13, which had an outfit mean-square of 1.66.

To assess the effect of this item, a separate Rasch analysis was conducted without LngAn 13. The Rasch person reliability and separation statistics were lower with LngAn 13 removed (.63 and 1.32, respectively), as was the Rasch item separation estimate (5.51). The two sets of person measures were strongly correlated (r = .98, p < .001). Considering the lower reliability and separation estimates and the strong correlation between the sets of measures, LngAn 13 was retained for further analyses.

The item-person map of the language analytical ability construct shows the empirical item difficulty hierarchy (Figure 11). The item difficulty hierarchy was interpretable in that items that tested a combination of rules were generally more difficult than items that required the analysis or application of a single rule. For example, the most difficult item, LngAn 13, required analysis of a subject, object, adverb, and relative clause in a new language, and the content of the item differed from the examples provided on the test. In contrast, the easiest item, LngAn 6, required only the analysis of a subject, object, and present-tense verb, and the content of the item matched the content of the examples. The upper range of the construct was defined by items that required analysis of linguistic categories (i.e., similar to LngAn 13) and the application of grammatical rules. The amount of information and the complexity of the rules resulted in items that tapped high levels of analytical ability. The lower portion of the variable was defined by items that were similar to LngAn 6; that is, the items focused on the analysis and application of relatively simple rules. Thus, the items defined an analytical ability construct that ranged from simple to complex linguistic rule analysis.


The mean of the Rasch person measures (1.65) was much higher than the mean of the Rasch item measures (.00). This relationship indicates that the items were easy for this sample of learners. Many gaps were noticeable in the item difficulty hierarchy. However, the items were spread across a range of difficulties. Some participants were estimated to have ability measures higher than the most difficult item. There were only 15 items, so including more items might have improved the measurement of the construct for this sample.

Table 14. Rasch Item Statistics for the Language Analytical Ability Scale (Measure Order)

Item      Measure   SE   Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD  PMC
LngAn13     2.76   .17      1.20         2.4         1.66          3.2     .37
LngAn11     1.78   .16      0.93        -1.1         1.20          1.6     .52
LngAn15     1.56   .16      1.05         0.8         1.09          0.8     .48
LngAn12     1.53   .16      0.92        -1.3         0.86         -1.3     .54
LngAn10     0.94   .16      1.05         0.8         1.18          1.5     .47
LngAn14     0.58   .17      1.11         1.3         1.22          1.5     .44
LngAn9      0.55   .17      0.92        -0.9         0.81         -1.4     .55
LngAn7     -0.04   .19      0.94        -0.6         0.85         -0.8     .53
LngAn8     -0.35   .21      1.12         1.0         1.17          0.8     .45
LngAn1     -1.34   .27      1.15         0.8         1.45          1.1     .44
LngAn2     -1.34   .27      0.79        -1.1         0.49         -1.5     .59
LngAn5     -1.41   .28      0.80        -1.0         0.57         -1.1     .58
LngAn3     -1.49   .29      0.84        -0.7         0.65         -0.8     .57
LngAn4     -1.76   .31      0.88        -0.4         0.55         -0.9     .56
LngAn6     -1.97   .34      0.86        -0.5         1.56          1.1     .54
M           0.00   .22      0.97         0.0         1.02          0.2
SD          1.46   .06      0.13         1.1         0.36          1.4

Note. PMC = point-measure correlation.


[Item-person map. Item difficulties ranged from LngAn13 at the top to LngAn6 at the bottom, with many persons located above the most difficult item.]

Figure 11. Item-person map of the language analytical ability construct. Each # represents four persons. Each . represents one to three persons. M = mean; S = 1 SD; T = 2 SD.


The dimensionality of the items was investigated through a Rasch PCA of item residuals (Table 15). The results indicated that 36.9% of the variance (eigenvalue = 8.8) was explained by the measures. Of the unexplained variance, 7.0% was explained by the first contrast (eigenvalue = 1.7). This amount of variance is less than the amount explained by the Rasch item difficulty estimates (19.0%, eigenvalue = 4.5). As the strength of the loading was less than two items, no systematic effects were found in the unexplained variance. These results support the interpretation that the test items are unidimensional.

Table 15. Rasch Principal Components Analysis for the Language Analytical Ability Scale

Item      Loading  Measure  Infit MNSQ  Outfit MNSQ
LngAn2      .63     -1.34      0.79        0.49
LngAn3      .44     -1.49      0.84        0.65
LngAn1      .41     -1.34      1.15        1.45
LngAn9      .31      0.55      0.92        0.81
LngAn6      .18     -1.97      0.86        1.56
LngAn4      .17     -1.76      0.88        0.55
LngAn8      .14     -0.35      1.12        1.17
LngAn10     .11      0.94      1.05        1.18
LngAn5      .11     -1.41      0.80        0.57
LngAn7      .10     -0.04      0.94        0.85
LngAn13    -.50      2.76      1.20        1.66
LngAn12    -.39      1.53      0.92        0.86
LngAn11    -.38      1.78      0.93        1.20
LngAn15    -.37      1.56      1.05        1.09
LngAn14    -.05      0.58      1.11        1.22

Note. Measures are Rasch logits.

L2 Procedural Knowledge Test

L2 procedural knowledge was assessed by means of a timed essay. The essays were judged by two raters using a 12-point rating scale; the first rater scored all 214 essays, and the second rater scored a random sample of 52 essays. The descriptive statistics for the ratings are shown in Table 16. The average of the ratings given by rater 1 was 5.58, while the average for rater 2 was slightly higher at 6.08. A visual inspection of histograms for the two sets of ratings revealed no clear deviations from normality. A correlation analysis showed a strong relationship between the two sets of ratings (r = .94, p < .001). This result indicates that the ratings given by the two raters were aligned and consistent for each essay.

Table 16. Descriptive Statistics for the Essay Ratings

            Rater 1        Rater 2
M           5.58           6.08
SE          0.14           0.29
95% CI      [5.30, 5.86]   [5.50, 6.66]
SD          2.06           2.09
Skewness    0.34           0.42
SES         0.17           0.33
Kurtosis    0.27           0.99
SEK         0.33           0.65

Note. CI = confidence interval.

Facets Analysis

The essay ratings were subjected to a Facets (Linacre, 2012b) analysis to examine rater severity and to produce interval-level procedural knowledge measures. The person ability measures were initially estimated based on a rater severity mean of 50 with 10 units between logits. The average of the Rasch person measures was slightly lower than this mean (see Table 17). A visual inspection of a histogram of the Rasch person measures revealed that they were distributed sufficiently across the construct. The skewness and kurtosis statistics indicated that the distribution did not deviate from normality (z-skewness = -.53; z-kurtosis = -1.18).


Table 17. Descriptive Statistics for the Rasch Procedural Knowledge Measures

            Procedural Knowledge
M           45.48
SE          1.05
95% CI      [43.41, 47.56]
SD          15.39
Skewness    -0.09
SES         0.17
Kurtosis    -0.39
SEK         0.33

Note. CI = confidence interval. n = 214.

Figure 12 shows the distribution of Rasch measures for the participants, raters, and the rating scale. The Rasch measure for rater 1 was 50.54 (.05 logits), while the measure for rater 2 was 49.46 (-.05 logits). This result indicates that rater 1 was a slightly more severe judge. That is, participants were more likely to receive a lower score from rater 1 than from rater 2. If the raw scores from the raters were to be used as indicators of L2 procedural knowledge, a more comprehensive scoring system would be needed to account for the difference in the leniency of the raters. However, the multifaceted Rasch model accounts for rater severity and adjusts the Rasch person measures to reflect the severity of the raters, and the raters showed a similar pattern in their ratings. A chi-square test indicated that the raters were not statistically different, χ2(1) = 0.7, p > .05.
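This adjustment follows from the structure of the many-facet rating scale model. In a common formulation (a sketch of the general model, not necessarily the exact Facets specification used here), the log-odds of essay n receiving category k rather than category k - 1 from rater j is

    log(P_njk / P_nj(k-1)) = B_n - C_j - F_k

where B_n is the ability of participant n, C_j is the severity of rater j, and F_k is the difficulty of rating-scale step k. Because rater severity C_j is an explicit parameter, the estimated person measures B_n are already adjusted for how harsh or lenient each rater was.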


[Facets vertical ruler. Participant measures spanned roughly 0 to 88 on the 0-100 scale; rater 1 and rater 2 were located close together near 50, and the rating-scale categories (1) through (12) were distributed along the measure column.]

Figure 12. Facets vertical ruler of Rasch L2 procedural knowledge measures. Measr = measure; RS = rating scale. Each “*” is approximately four persons, and each “.” is one person.


PCA of Complexity, Accuracy, and Fluency Measures

Additionally, each essay was scored for characteristics of complexity, accuracy, and fluency (CAF). The inter-rater reliability coefficients were high for all of the ratings. The Pearson correlations for the scores from the two raters were as follows: (a) words, r = 1.00, p < .001; (b) T-units, r = 1.00, p < .001; (c) dependent clauses, r = 1.00, p < .001; (d) clauses, r = 1.00, p < .001; (e) error-free T-units, r = .99, p < .001; (f) error-free dependent clauses, r = 1.00, p < .001; and (g) error-free clauses, r = .99, p < .001. Initially, 16 CAF measures were computed: (a) T-units, (b) dependent clauses, (c) clauses, (d) error-free T-units, (e) error-free dependent clauses, (f) error-free clauses, (g) words per minute, (h) words per T-unit, (i) words per dependent clause, (j) words per clause, (k) words per error-free T-unit, (l) words per error-free dependent clause, (m) clauses per T-unit, (n) dependent clauses per clause, (o) error-free T-units per T-unit, and (p) error-free clauses per clause. The descriptive statistics for these measures are shown in Table 18.

These measures were subjected to a principal component analysis (PCA) to identify the component underlying each CAF measure and to reduce the data to CAF component scores.
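For concreteness, the ratio measures can be derived from the raw counts as in the following sketch (the counts and the essay timing are hypothetical, not taken from the study):

    # Hypothetical counts for a single timed essay.
    words, minutes = 250, 20.0
    t_units, clauses, dep_clauses = 12, 18, 6
    ef_t_units, ef_clauses = 6, 10

    caf = {
        "WPM":   words / minutes,         # fluency: words per minute
        "W/T":   words / t_units,         # complexity: words per T-unit
        "C/T":   clauses / t_units,       # complexity: clauses per T-unit
        "DC/C":  dep_clauses / clauses,   # complexity: dependent clauses per clause
        "EFT/T": ef_t_units / t_units,    # accuracy: error-free T-units per T-unit
        "EFC/C": ef_clauses / clauses,    # accuracy: error-free clauses per clause
    }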

Before conducting the analysis, the data were screened for outliers. Participants with z-scores ≥ 3.29 were judged to be outliers. Fourteen unique outliers across 16 variables were identified. The extreme scores were removed from the data set before conducting the PCA. After examining the component structure of the variables, the outliers were returned to the data set, and the analyses were conducted again. All of the relevant values (e.g., component loadings) were nearly identical for both analyses. As one of the purposes of the PCA was to compute component scores for the participants, the outliers were included in the data set to compute these scores. The results of the PCA that included the outliers are reported here.

In factor analysis, multicollinearity is a concern because only the shared variance is analyzed. If the variables are linear combinations of each other, the correlation matrix could be nonpositive definite, which would result in the inability to invert the matrix. However, in PCA, all the variance in the data is analyzed; thus, “multicollinearity is not a problem because there is no need to invert a matrix” (Tabachnick & Fidell, 2007, p. 614).

The CAF data were analyzed to reveal the underlying component structure of the 16 measures, with all items included. An orthogonal rotation (Varimax) was selected to obtain interpretable components because it was not clear whether the target variables should be highly correlated or not. Some learners could have high fluency, accuracy, and complexity, while others could have high ability in one or two of these developmental measures. Indeed, these measures are hypothesized to be in competition when learners focus on meaning. Irrespective of the theoretical standing of these measures, the analyses were repeated using oblique rotations, and the results were identical to those of the orthogonal rotations.

An initial examination of a scree plot revealed four possible components for extraction. The eigenvalues supported this interpretation as the first four components had eigenvalues over Kaiser’s criterion of 1.0 (6.20, 3.28, 2.69, & 1.75, respectively). The four components accounted for 87% of the variance. A criterion of .40 was used to determine significant component loadings (Stevens, 2002).
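The extraction and rotation steps can be sketched with standard linear algebra. In the fragment below, random data stand in for the CAF matrix, and the varimax routine is a common textbook implementation rather than the study's software:

    import numpy as np

    def varimax(L, tol=1e-8, max_iter=500):
        """Rotate a loading matrix L (variables x components) toward varimax simplicity."""
        p, k = L.shape
        R = np.eye(k)
        objective = 0.0
        for _ in range(max_iter):
            Lr = L @ R
            u, s, vt = np.linalg.svd(
                L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
            R = u @ vt
            if s.sum() - objective < tol:
                break
            objective = s.sum()
        return L @ R

    rng = np.random.default_rng(0)
    X = rng.standard_normal((214, 16))            # stand-in for the 16 CAF measures

    C = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    n_components = int((eigvals > 1.0).sum())     # Kaiser's criterion
    loadings = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])
    rotated = varimax(loadings)
    salient = np.abs(rotated) > .40               # Stevens's (2002) cutoff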


Table 18. Descriptive Statistics for the CAF Measures

            M       SE    95% CI           SD     Skewness  Kurtosis
T          10.53    .30   [9.93, 11.13]     4.44     .53       .54
DC          4.77    .21   [4.5, 5.19]       3.12    1.27      2.79
C          15.29    .46   [14.3, 16.19]     6.68     .83      2.01
EFT         4.61    .21   [4.20, 5.02]      3.03    1.17      2.88
EFDC        2.81    .17   [2.48, 3.15]      2.50    1.34      2.61
EFC         8.17    .35   [7.48, 8.86]      5.11    1.26      4.03
WPM         4.95    .15   [4.66, 5.25]      2.19     .68      1.60
W/T        11.97    .21   [11.57, 12.38]    3.01    1.02      3.05
W/DC       31.75   1.25   [29.28, 34.22]   18.30    2.23      8.45
W/C         8.14    .10   [7.95, 8.34]      1.45     .47       .96
W/EFT      32.12   1.70   [28.78, 35.47]   24.84    2.57      9.20
W/EFDC     47.16   2.86   [41.52, 52.81]   41.87    1.59      3.02
C/T         1.47    .02   [1.44, 1.51]       .27    1.01      1.47
DC/C         .30    .01   [.28, .32]         .12    -.01      -.20
EFT/T        .43    .01   [.41, .46]         .22     .12      -.35
EFC/C        .52    .01   [.49, .55]         .21    -.09      -.35

Note. SES = .17; SEK = .33. n = 214. CI = confidence interval; T = T-units; DC = dependent clauses; C = clauses; EFT = error-free T-units; EFDC = error-free dependent clauses; EFC = error-free clauses; WPM = words per minute; W/T = words per T-unit; W/DC = words per dependent clause; W/C = words per clause; W/EFT = words per error-free T-unit; W/EFDC = words per error-free dependent clause; C/T = clauses per T-unit; DC/C = dependent clauses per clause; EFT/T = error-free T-units per T-unit; EFC/C = error-free clauses per clause.

Based on this criterion, multiple complex loadings were observed in the rotated component matrix (Table 19). A preliminary examination of the loadings indicated that component 1 seemed to be related to fluency, component 2 to complexity, and component 3 to accuracy—component 4 was not clearly defined with only two variables loading at the criterion level. As error-free dependent clauses loaded on two components, which were difficult to interpret, and words per error-free dependent clause did not load on any components, these two variables were removed and a second PCA was conducted.

In a second PCA, four components were identified, and more variance was accounted for (91.61%). There were still multiple complex loadings. Among those variables, dependent clauses loaded strongly on two components (.74 on component 1 and .64 on component 2).

Table 19. Rotated Component Loadings for the First PCA

            Rotated Factor Loadings
Variable      1      2      3      4
T            .96
DC           .73    .65
C            .98
EFT          .69           .68
EFDC         .59           .60
EFC          .76           .56
WPM          .94
W/T                 .57           .80
W/DC               -.79
W/C                               .96
W/EFT                     -.69
W/EFDC
C/T                 .95
DC/C                .95
EFT/T                      .94
EFC/C                      .90

Note. Only component loadings > .40 are shown. n = 214. T = T-units; DC = dependent clauses; C = clauses; EFT = error-free T-units; EFDC = error-free dependent clauses; EFC = error-free clauses; WPM = words per minute; W/T = words per T-unit; W/DC = words per dependent clause; W/C = words per clause; W/EFT = words per error-free T-unit; W/EFDC = words per error-free dependent clause; C/T = clauses per T-unit; DC/C = dependent clauses per clause; EFT/T = error-free T-units per T-unit; EFC/C = error-free clauses per clause.

This variable could be considered a fluency measure in that more words might be needed to produce an essay that contains multiple dependent clauses compared to an essay with fewer dependent clauses. However, it is most likely an indicator of complexity. Due to this cross-loading, the variable was removed from the analysis.

In a third PCA, four components were still identified in the data, accounting for 91.43% of the variance. Only two variables, words per T-unit and words per clause, had loadings over .40 on the fourth component. Words per T-unit also loaded on component 2, which was most likely related to complexity. Words per clause loaded only on the fourth component—a single component loading is insufficient for defining a component. Thus, words per clause was removed from the analysis.

A fourth PCA was conducted, which resulted in a clear three-component solution. Three components had eigenvalues well over 1.0, and the point of inflexion in the scree plot also indicated a three-component solution. This component structure was interpretable and aligned with the expected structure drawn from L2 acquisition theory. The three extracted components accounted for 84.77% of the variance. However, complex loadings were observed. Most notably, error-free T-units and error-free clauses cross-loaded on two components. Thus, these two variables were removed from the analysis.

A fifth PCA was conducted, and a clear three-component solution was observed. The three components explained 83.07% of the variance. The three components were clearly defined—no cross-loadings were found, and each variable loaded strongly on a single component (Table 20). The component structure was interpretable: Component 1 was related to complexity, component 2 to fluency, and component 3 to accuracy. The fluency component included frequency and ratio measures, which suggests that the component structure was not determined by the type of measure. The variables that loaded on the complexity component were (a) words per T-unit, (b) words per dependent clause, (c) clauses per T-unit, and (d) dependent clauses per clause. The variables related to fluency were (a) T-units, (b) clauses, and (c) words per minute. The variables that loaded on the accuracy component were (a) words per error-free T-unit, (b) error-free T-units per T-unit, and (c) error-free clauses per clause.


Based on the final component structure obtained in the fifth PCA, component scores were computed using the regression method. This resulted in three component scores (i.e., complexity, accuracy, & fluency) for each participant who completed the procedural knowledge test. The negative loadings of words per error-free T-unit and words per dependent clause were an artifact of the method of calculation used. These measures were computed as the ratio of occurrence within the number of words produced, and as the denominator increases, the resultant ratio decreases. Computing them in this way controls for essay length. Therefore, lower scores indicate a higher rate of error-free T-unit and dependent clause production.
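The regression method itself reduces to a single matrix expression. A sketch, with random data standing in for the 10 retained measures and their rotated loadings:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((214, 10))     # stand-in for the retained CAF measures
    L = rng.uniform(-.8, .8, (10, 3))      # stand-in for the rotated loadings

    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables
    R = np.corrcoef(X, rowvar=False)
    W = np.linalg.solve(R, L)              # regression weights: R^{-1} L
    scores = Z @ W                         # one score per component per participant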

Table 20. Rotated Component Loadings for the Fifth PCA

            Rotated Factor Loadings
Variable      1      2      3
T                   .97
C                   .98
WPM                 .96
W/T          .76
W/DC        -.71
W/EFT                     -.72
C/T          .96
DC/C         .97
EFT/T                      .94
EFC/C                      .90

Note. Only component loadings > .40 are shown. n = 214. T = T-units; C = clauses; WPM = words per minute; W/T = words per T-unit; W/DC = words per dependent clause; W/EFT = words per error-free T-unit; C/T = clauses per T-unit; DC/C = dependent clauses per clause; EFT/T = error-free T-units per T-unit; EFC/C = error-free clauses per clause.

An initial inspection of histograms for the three CAF variables revealed no problematic deviations from normality. However, possible outliers were observable in the tails of the distributions. Analyses of univariate and multivariate outliers are reported in the Results chapter.

Summary of the Preliminary Analyses

In this chapter, three tests of L2 declarative knowledge, one test of language learning aptitude comprised of four scales, and one test of L2 procedural knowledge were analyzed. Rasch modeling, correlation, and PCA were conducted to examine the instruments. The results of the preliminary analyses provided evidence for the validity of score use for these instruments.

First, regarding declarative knowledge, the Rasch analysis of the receptive metalinguistic knowledge test revealed that the items varied in difficulty and were sufficiently spread along the variable. The items showed good fit to the Rasch model and were unidimensional. The reliability estimate was slightly low at .67, but this figure approached the widely accepted value of .70. As for the two productive metalinguistic knowledge tests, the inter-rater agreement coefficients were high for the technical terminology scale (.96) and the rule explanation scale (.92). Rasch rating-scale analyses indicated that the test items fit the Rasch model and that the tests were unidimensional.

Second, the language learning aptitude test as a whole was reliable (.92), and the person measures were sufficiently separated (3.24). Overall, the test was relatively easy for the participants. The person measures for each scale were reliable. A few slightly misfitting items were identified across three of the scales, but cross-plots of person measures computed with and without the misfitting items suggested that their inclusion did not affect the measurement of the participants. The test as a whole and each of the four scales were sufficiently unidimensional.

Third, L2 procedural knowledge was assessed using a timed essay judged by two raters. A strong relationship was found between the two sets of ratings, and the distribution of Rasch measures did not deviate from normality. The essays were also scored for complexity, accuracy, and fluency. These measures were subjected to a PCA to identify component loadings for each developmental measure. The analysis resulted in a three-component structure with each measure loading on a single component. Component scores were computed for use in subsequent analyses.


CHAPTER 5

RESULTS

First, the results of data screening are reported. Second, each research question is restated and is followed by the results of the analyses conducted to answer it.

Data Screening

The data from the 249 participants on the background questionnaire, declarative knowledge tests, L2 procedural knowledge test, and language learning aptitude test were screened. Following the data screening procedures in Tabachnick and Fidell (2007) for ungrouped data, the following checks were conducted: (a) the accuracy of the data entry, (b) the suitability of the participants based on background variables, (c) the treatment of missing values, (d) the presence of univariate outliers, (e) the normality of the distributions, (f) the linearity and homoscedasticity of the variables, and (g) the presence of multivariate outliers.

First, the accuracy of the data entry was checked. Searches in Excel were conducted to check for missing scores and scores outside of the possible ranges—no unexpected scores were found. The data were also matched to the original test scores. No mistakes in the data entry were found.

Second, based on the background data, nine participants who had studied abroad for more than 12 months were identified. As the target population was Japanese L2 learners of English who had studied in Japanese junior and senior high schools, these nine participants were removed from the data set. This resulted in an N-size of 240 participants.

Third, the data set was checked for missing values. As the tests were given over three weeks, some participants were absent on some of the test days, resulting in missing data for some of the participants. Efforts were made for these participants to take the tests on different days, but due to scheduling constraints and a lack of contact information, it was not possible to schedule appointments. The data were missing completely at random. These data were test scores from multiple-item tests; therefore, it would be difficult to predict participants’ knowledge and ability. Likert-type questionnaire data are more suitable for data imputation than data derived from multiple-item tests. The missing data constituted 14% of the data set, well over the 5% figure suggested by Tabachnick and Fidell (2007). Therefore, list-wise deletion was applied to the data set. This process resulted in an N-size of 167.

Fourth, z-scores were computed for each participant on each variable. Participants with absolute z-scores ≥ 3.29 were considered to be univariate outliers. There were no outliers on L2 procedural knowledge, complexity, accuracy, fluency, sound-symbol association, number learning, or language analytical ability.
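A sketch of the screening rule (3.29 corresponds to a two-tailed p < .001 under normality; the data here are simulated, not the study's):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(50, 10, 167)                        # hypothetical Rasch measures
    z = (x - x.mean()) / x.std(ddof=1)
    outlier_indices = np.where(np.abs(z) >= 3.29)[0]   # two-tailed p < .001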

One outlier was identified on the receptive metalinguistic knowledge variable. Case 34 had a z-score of 3.63. This participant had the highest measure on the variable. Deleting or adjusting outlying scores are common methods for dealing with extreme scores. When adjusting scores, a preferred method is to add one unit to the second highest score or to subtract one from the second lowest score (Tabachnick & Fidell, 2007). The identified extreme measure was adjusted by adding one to the second highest measure to bring the measure within 3.29 standard deviations. Case 42 had a z-score of -3.21. This case was not technically an outlier, but the measure was adjusted by subtracting one from the second lowest measure to bring it nearer to the distribution. After these adjustments, there were no univariate outliers on the variable.
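The adjustment amounts to pulling the extreme measure in next to the rest of the distribution; a sketch with hypothetical values:

    import numpy as np

    x = np.array([42., 44., 47., 51., 55., 83.])   # hypothetical; 83 is the outlier
    second_highest = np.sort(x)[-2]
    x[np.argmax(x)] = second_highest + 1           # 83 becomes 56
    # The case keeps its rank as the highest score but is no longer extreme.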

One outlier was identified on the technical terminology variable. Case 128 had a z-score of -4.02. This test asked participants to use technical terminology to describe grammatical rule violations. As this participant did not attempt many of the test items, the case was removed from the data set. After removal of this participant’s data, there were no univariate outliers on the variable.

Case 128 also had an extreme score on the rule explanation variable. However, this case was deleted from the data set after examination of the technical terminology variable. Thus, there were no univariate outliers on the rule explanation variable.

Two outliers were found on the number learning variable. Case 157 had a z-score of 3.30, and case 23 had a z-score of -4.77. The former measure was adjusted by adding one to the second highest measure. The latter measure was adjusted by subtracting one from the second lowest measure. After these adjustments, there were no univariate outliers on the variable.

Fifth, the normality of the distributions was examined by calculating standardized skewness and kurtosis statistics for each variable, and by visually inspecting histograms and P-P plots. When the N-size reaches approximately 200, many tests of normality are too sensitive to slight deviations from normality: They report statistically significant differences between the observed distribution and a normal distribution when the difference is small enough that it does not affect the outcome of the statistical tests—this includes statistical tests of skewness and kurtosis. To overcome these limitations, visual inspections of histograms and P-P plots are recommended (Field, 2009; Tabachnick & Fidell, 2007). Furthermore, as skewness and kurtosis statistics can be difficult to interpret and statistically significant in large samples, Meyers, Gamst, and Guarino (2006) recommended using the criterion of ±1.00 to judge the magnitude of the statistics. Table 21 shows the skewness and kurtosis statistics for the declarative knowledge, language learning aptitude, and L2 procedural knowledge variables.
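The standard errors and z statistics in Table 21 follow directly from the sample size. A sketch using the common large-sample approximations, which reproduce the tabled values at N = 166:

    import math

    N = 166
    SES = math.sqrt(6 / N)    # standard error of skewness, approx. 0.19
    SEK = math.sqrt(24 / N)   # standard error of kurtosis, approx. 0.38

    z_skew = 0.47 / SES       # language learning aptitude: approx. 2.47
    z_kurt = 1.25 / SEK       # number learning: approx. 3.29
    # |z| > 1.96 flags statistically significant skewness or kurtosis at p < .05.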

Table 21. Skewness and Kurtosis Statistics for the Metalinguistic Knowledge, Language Aptitude, and L2 Procedural Knowledge Variables

            Skew    SES   Kurtosis   SEK   ZSkew   ZKurt
RcptMeta   -0.05    .19     0.22     .38   -0.26    0.58
TechTerm   -0.34    .19     0.05     .38   -1.79    0.13
RuleExpl   -0.28    .19     1.03     .38   -1.47    2.71
LngApt      0.47    .19     0.12     .38    2.47    0.32
Num         0.13    .19     1.25     .38    0.68    3.29
SndSym      0.29    .19    -0.44     .38    1.52   -1.16
Vocab       0.93    .19     0.90     .38    4.89    2.37
LngAn       0.19    .19     0.28     .38    1.00    0.74
L2Prcd     -0.17    .19    -0.79     .38   -0.89   -2.08
Complex     0.53    .19     0.34     .38    2.79    0.89
Accuracy   -0.38    .19     0.29     .38   -2.00    0.76
Fluency     0.20    .19    -0.08     .38    1.05   -0.21

Note. N = 166. RcptMeta = receptive metalinguistic knowledge; TechTerm = technical terminology; RuleExpl = rule explanation; LngApt = language learning aptitude; Num = number learning; SndSym = sound-symbol association; Vocab = vocabulary learning; LngAn = language analytical ability; L2Prcd = L2 procedural knowledge; Complex = complexity.

All of the absolute values of the skewness and kurtosis statistics were within ±1.00 except for rule explanation kurtosis and number learning kurtosis—these two values were either just around 1.00 or slightly above. Thus, regarding the magnitudes of the statistics, no problematic distributions were observed. Four variables, language learning aptitude (i.e., the test as a whole), vocabulary learning, complexity, and accuracy, exhibited statistically significant skewness—their z-scores exceeded ±1.96. Four variables, rule explanation, number learning, vocabulary learning, and L2 procedural knowledge, exhibited statistically significant kurtosis, with z-scores beyond ±1.96.

Considering the N-size of the study, visual inspections of histograms and P-P plots were undertaken to make a final judgment of the normality of the distributions, and careful consideration was paid to the distributions that had statistically significant skewness or kurtosis. These inspections confirmed that the variables did not show any clear deviations from normality.

Mardia’s normalized estimate of multivariate kurtosis was used to assess multivariate kurtosis in the variables. Multivariate kurtosis adversely affects the modeling of covariance to a higher degree than skewness does, and standardized multivariate kurtosis values > 5 might negatively impact parameter estimation (Bentler, 2006). Normalized estimates of multivariate kurtosis for the combinations of variables used in the models ranged from 2.26 to 3.99, indicating that the assumption regarding multivariate kurtosis was met.
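Mardia's normalized coefficient can be computed directly from the data matrix. A sketch using one standard formulation (random data stand in for the study's variables):

    import numpy as np

    def mardia_kurtosis_z(X):
        """Normalized Mardia multivariate kurtosis (one standard formulation)."""
        n, p = X.shape
        Xc = X - X.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(Xc, rowvar=False, bias=True))
        d2 = np.einsum('ij,jk,ik->i', Xc, S_inv, Xc)  # squared Mahalanobis distances
        b2p = (d2 ** 2).mean()                        # sample multivariate kurtosis
        expected = p * (p + 2)
        return (b2p - expected) / np.sqrt(8 * expected / n)

    X = np.random.default_rng(0).standard_normal((166, 7))
    print(mardia_kurtosis_z(X))   # values > 5 would be cause for concern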

Sixth, the linearity and homoscedasticity of the variables were investigated. As variables are correlated to different degrees, the establishment of a perfect linear relationship is not required. Some variables will be highly correlated, while others will exhibit almost no relationship. What is most important is to establish the absence of curvilinear relationships and the lack of clear bulging in the variances of the variables. Tabachnick and Fidell (2007) recommended that scatter plots of the variables be examined to assess the linearity and homoscedasticity of the relationships. Scatter plots of all combinations of variables were examined, which entailed checking 55 scatter plots for the bivariate combinations of three metalinguistic variables (receptive metalinguistic knowledge, technical terminology, and rule explanation), four language aptitude variables (number learning, sound-symbol association, vocabulary learning, and language analytical ability), and four L2 procedural knowledge variables (L2 procedural knowledge, complexity, accuracy, and fluency). No curvilinear relationships or differences in homoscedasticity were found, confirming that the assumptions of linearity and homoscedasticity were met.

Finally, the presence of multivariate outliers was investigated. Mahalanobis distance was used to identify multivariate outliers. Critical values at the p < .001 level were examined for the various combinations of variables that were tested in the structural models. No multivariate outliers were found in the data.
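A sketch of this multivariate check (simulated data; df equals the number of variables in the combination being tested):

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    X = rng.standard_normal((166, 7))               # hypothetical variable set
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', Xc, S_inv, Xc)    # squared Mahalanobis distances
    critical = chi2.ppf(1 - .001, df=X.shape[1])    # p < .001 criterion
    multivariate_outliers = np.where(d2 > critical)[0]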

In summary, the data were screened for univariate outliers, normality, linearity, homoscedasticity, and multivariate outliers. Univariate outliers were removed from the data set, or their measures were adjusted in relation to the next highest or lowest measure in the distribution. The normality of the variables was checked through skewness and kurtosis statistics, the statistical significance of those values, and visual inspections of histograms and P-P plots. Two of the variables exceeded the criterion of ±1.00 for kurtosis, but the deviations were very slight. Based on the criterion of ±1.96 standard deviations, p < .05, four variables showed statistically significant skewness, and four variables exhibited statistically significant kurtosis. However, visual inspections of histograms and P-P plots revealed that none of the variables clearly deviated from normality—no clear evidence of nonnormality was observed in the distributions. Furthermore, the assumption regarding multivariate kurtosis was met. Thus, no transformations were applied to the variables. The assumptions of linearity and homoscedasticity were confirmed through the inspection of scatter plots. Finally, no multivariate outliers were found in the data set. These data were then analyzed to answer the research questions that guided this study.

Descriptive Statistics

All variables except for the component scores were Rasch calibrated using user-friendly scales. All person measures were scaled from an item difficulty mean of 50 with 10 units per logit. Linacre (2012a) recommended this as a user-friendly scaling system that yields person measures generally ranging from 0 to 100. The scale is also convenient in that measures are easily converted back into logits (e.g., 50, 60, and 70 correspond to 0, 1, and 2 logits, respectively). The descriptive statistics for the sample are shown in Tables 22, 23, and 24. All of the means were close to the Rasch item mean difficulty of 50 except for sound-symbol association and language analytical ability—these two tests were relatively easy for the participants. The participants exhibited high levels of analytical ability, which might influence their metalinguistic knowledge. All of the variables were distributed well in relation to their means. The 95% confidence intervals were tightly grouped around the means.
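The conversion between the user-friendly scale and logits is a fixed linear transformation; a minimal sketch:

    def to_user_scale(logit):
        """Linacre's (2012a) rescaling: item mean at 50, 10 units per logit."""
        return 50 + 10 * logit

    def to_logits(measure):
        return (measure - 50) / 10

    # For example, the sound-symbol association mean of 67.56 in Table 23
    # corresponds to about 1.76 logits above the item mean.
    assert to_user_scale(0) == 50 and to_user_scale(2) == 70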

Table 22. Descriptive Statistics for Declarative Knowledge

            Receptive          Technical        Rule
            metalinguistic     terminology      explanation
            knowledge
M           46.89              48.53            52.60
SE          0.70               0.63             0.40
95% CI      [45.52, 48.27]     [47.29, 49.76]   [51.81, 53.40]
SD          8.96               8.07             5.17
Skewness    -0.05              -0.34            -0.28
SES         0.19               0.19             0.19
Kurtosis    0.22               0.05             1.03
SEK         0.38               0.38             0.38

Note. CI = confidence interval.

Table 23. Descriptive Statistics for Language Aptitude

            Number           Sound-symbol     Vocabulary       Language
            learning         association      learning         analytical ability
M           48.61            67.56            56.73            69.19
SE          0.79             1.01             1.11             1.09
95% CI      [47.05, 50.17]   [65.56, 69.56]   [54.55, 58.92]   [67.05, 71.34]
SD          10.16            13.05            14.24            14.00
Skewness    0.13             0.29             0.93             0.19
SES         0.19             0.19             0.19             0.19
Kurtosis    1.25             -0.44            0.90             0.28
SEK         0.38             0.38             0.38             0.38

Note. CI = confidence interval.

Table 24. Descriptive Statistics for L2 Procedural Knowledge

            L2 procedural
            knowledge        Complexity       Accuracy         Fluency
M           46.16            0.09             -0.02            0.01
SE          1.11             0.08             0.08             0.07
95% CI      [43.98, 48.35]   [-0.06, 0.25]    [-0.18, 0.13]    [-0.13, 0.15]
SD          14.25            1.02             1.00             0.92
Skewness    -0.17            0.53             -0.38            0.20
SES         0.19             0.19             0.19             0.19
Kurtosis    -0.79            0.34             0.29             -0.08
SEK         0.38             0.38             0.38             0.38

Note. CI = confidence interval. Complexity, accuracy, and fluency are component scores.
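Each reported 95% confidence interval is simply M ± 1.96 × SE. A quick check against the first column of Table 22 (small discrepancies in the second decimal place are expected because the tabled M and SE are themselves rounded):

```python
# Verifying a 95% CI from Table 22: receptive metalinguistic knowledge.
m, se = 46.89, 0.70
lower, upper = m - 1.96 * se, m + 1.96 * se
print(round(lower, 2), round(upper, 2))  # 45.52 48.26 (reported: [45.52, 48.27])
```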

Research Question 1: Relationship Between Metalinguistic Knowledge and Language Learning Aptitude

Before using confirmatory factor analysis (CFA) to test the model related to the first research question, exploratory factor analysis (EFA) was conducted to assess the relationships among the variables that served as indicators of metalinguistic knowledge and language learning aptitude. EFA has been used in previous studies of language learning aptitude and metalinguistic knowledge (e.g., Alderson, Clapham, & Steel, 1997; Roehr, 2008b); therefore, the results of EFA are reported here to allow for comparisons between previous studies and the present study. CFA, however, was used to answer the first research question.

The three metalinguistic variables and the four language learning aptitude variables were subjected to a factor analysis. Principal axis factoring was used to extract factors from the shared variance in the data, and an oblique rotation (Direct Oblimin) was selected to create an interpretable factor structure.
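The extraction was carried out in SPSS. For readers working outside SPSS, a comparable extraction can be sketched with the third-party Python package factor_analyzer (an illustration only, not the software used in this study); df is a hypothetical DataFrame whose columns are the seven indicators:

```python
# A rough Python analogue of the SPSS extraction: principal axis
# factoring with an oblique (oblimin) rotation. Illustrative only.
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package

def run_efa(df: pd.DataFrame, n_factors: int = 2) -> pd.DataFrame:
    fa = FactorAnalyzer(n_factors=n_factors, method="principal",
                        rotation="oblimin")
    fa.fit(df)
    eigenvalues, _ = fa.get_eigenvalues()  # Kaiser criterion: retain > 1.0
    print("Eigenvalues:", eigenvalues.round(2))
    return pd.DataFrame(fa.loadings_.round(2), index=df.columns)
```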

Prior to conducting the EFA, the correlations among the metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge variables were examined (see Appendix I). No correlations > .90 were observed, indicating the absence of multicollinearity in the data. Receptive metalinguistic knowledge exhibited fairly strong, statistically significant correlations with productive metalinguistic knowledge (i.e., technical terminology and rule explanation). It was also related to language analytical ability, L2 procedural knowledge, accuracy, and fluency. It was not statistically related to complexity. Technical terminology was strongly related to rule explanation. The correlation coefficient was high, but one pair of strongly correlated variables should not cause excessive multicollinearity. Indeed, the variables need to be related to serve as indicators of a shared latent factor. Technical terminology was also statistically significantly related to L2 procedural knowledge, complexity, accuracy, and fluency. Rule explanation showed a similar pattern of relationships—statistically significant associations with L2 procedural knowledge, complexity, accuracy, and fluency were observed.

The language learning aptitude variables were statistically significantly associated with each other; however, the correlations were weak to moderate. None of the language aptitude variables were related to L2 procedural knowledge, complexity, accuracy, or fluency. Language learning aptitude did exhibit a weak statistically significant relationship with receptive metalinguistic knowledge.

L2 procedural knowledge showed moderate relationships with complexity and accuracy, and it was strongly related to fluency. Complexity, accuracy, and fluency were not statistically related.

A scree plot revealed a two-factor solution. The eigenvalues for these two factors exceeded Kaiser's criterion of 1.0 (factor 1 = 2.44; factor 2 = 1.70). The two factors explained 45.61% of the variance in the data. The KMO measure of sampling adequacy was .66, which exceeded the minimum recommended value of .50 (Field, 2009). Bartlett's test of sphericity was significant, p < .001, which indicated that the correlations among the variables were statistically different from 0. A criterion of .40 was used to determine significant factor loadings (Stevens, 2002). Table 25 shows the rotated factor loadings.

Each variable clearly loaded on a single factor, and no cross-loadings or unexpected loadings were observed. All of the metalinguistic knowledge variables loaded strongly on the first factor, which was labeled metalinguistic knowledge. The second factor, language learning aptitude, had two medium to fairly strong loadings (i.e., number learning and sound-symbol association). Vocabulary learning and language analytical ability just exceeded the criterion of .40.

These results support the use of these variables as indicators of the constructs of metalinguistic knowledge and language learning aptitude. As a stricter test of the dimensionality of the variables, a one-factor solution was specified, and a second EFA was conducted. This analysis provides additional evidence related to the factorial structure of metalinguistic knowledge and language learning aptitude. If these constructs are part of a unidimensional factor, statistically significant loadings on the first factor should be observed, and more variance should be explained by the model.

Table 25. Pattern Matrix of the Rotated Factor Loadings for Metalinguistic Knowledge and Language Learning Aptitude

Variable                 Factor 1    Factor 2
Receptive metaling       .61
TechTerms                .97
RuleExpl                 .90
Number learning                      .58
Sound-symbol                         .61
Vocabulary learning                  .42
Analytical ability                   .43

Note. Only factor loadings > .40 are shown. Receptive metaling = receptive metalinguistic knowledge; TechTerms = technical terminology; RuleExpl = rule explanation; Sound-symbol = sound-symbol association; Analytical ability = language analytical ability.

The parameters for the second EFA were the same as those for the first EFA, but a one-factor solution was specified in SPSS. The scree plot and eigenvalues were identical to those of the first analysis. The single extracted factor accounted for 30.56% of the variance, which was 15.05% less variance than that explained by the two-factor solution. Table 26 shows the unrotated factor loadings. As only one factor was specified, the solution could not be rotated. The factor loadings indicated that the metalinguistic knowledge variables accounted for the variance and that the language learning aptitude variables were not related to that factor—the factor loadings for the language aptitude variables ranged from .06 to .17.

Based on these results, it was determined that a two-factor solution best accounted for the data. To confirm and validate these results, the metalinguistic knowledge and language learning aptitude variables were entered into a CFA to test the factorial structure.

Table 26. Factor Matrix for a One-Factor Solution

Variable                 Factor 1
Receptive metaling       .62
TechTerms                .94
RuleExpl                 .90
Number learning          .08
Sound-symbol             .16
Vocabulary learning      .06
Analytical ability       .17

Note. The solution is unrotated; only the three metalinguistic knowledge variables exceeded the .40 criterion. Receptive metaling = receptive metalinguistic knowledge; TechTerms = technical terminology; RuleExpl = rule explanation; Sound-symbol = sound-symbol association; Analytical ability = language analytical ability.

In the CFA, a two-factor model was hypothesized. Receptive metalinguistic knowledge, technical terminology, and rule explanation were indicators of a latent metalinguistic knowledge variable. Number learning, sound-symbol association, vocabulary learning, and language analytical ability were indicators of an unobserved language learning aptitude factor. In the model diagram (Figure 13), the rectangles represent observed variables, and the ovals or circles represent unobserved, latent variables. In CFA and SEM, the one-headed arrows represent the concept that changes in the latent variables are reflected in differences on the observed variables. The double-headed arrow connecting the two latent factors signifies a nonzero correlation between the two constructs. In this model, metalinguistic knowledge and language learning aptitude are hypothesized to covary. To scale the latent variables, the paths from metalinguistic knowledge to rule explanation and from language learning aptitude to number learning were fixed to 1. These paths were chosen because these two manifest variables exhibited the highest reliability estimates.

Maximum likelihood estimation was used to estimate the parameters of the model. The model had 28 sample moments and 15 distinct parameters that had to be estimated. This resulted in 13 degrees of freedom, which indicated that the model was overidentified. Figure 13 shows the tested model with standardized coefficients. All of the regression weights were statistically significant, p < .05, which indicated that the regression weights were statistically different from zero. However, the correlation between metalinguistic knowledge and language learning aptitude was weak and nonsignificant, r = .11, p = .32. This result indicates that the correlation between these two constructs is not statistically different from zero, which implies there is little to no relationship between metalinguistic knowledge and language learning aptitude.
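The identification arithmetic can be checked directly. The parameter breakdown below is reconstructed from the model description (five free loadings after two are fixed to 1, seven residual variances, two factor variances, and one factor covariance), so it should be read as illustrative:

```python
# Degrees of freedom for the two-factor CFA (reconstructed breakdown).
p = 7                              # 3 metalinguistic + 4 aptitude indicators
sample_moments = p * (p + 1) // 2  # unique variances/covariances = 28
free_parameters = 5 + 7 + 2 + 1    # loadings + residuals + factor
                                   # variances + factor covariance = 15
df = sample_moments - free_parameters
print(sample_moments, free_parameters, df)  # 28 15 13 -> overidentified
```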

The squared multiple correlations indicated that the metalinguistic knowledge factor accounted for 81% of the variance in rule explanation, 93% of the variance in technical terminology, and 37% of the variance in receptive metalinguistic knowledge. The language learning aptitude factor explained 29% of the variance in number learning, 36% of the variance in sound-symbol association, 21% of the variance in vocabulary learning, and 18% of the variance in language analytical ability.

All of the standardized residual covariances were below the recommended absolute value of 2.58 (Byrne, 2010). The standardized total effects indicated that when metalinguistic knowledge increases by one standard deviation, rule explanation, technical terminology, and receptive metalinguistic knowledge increase by .90, .97, and .61 standard deviations, respectively. A one standard deviation increase in language learning aptitude results in number learning, sound-symbol association, vocabulary learning, and language analytical ability increasing by .54, .60, .45, and .43 standard deviations, respectively. These figures represent the relative impact and contribution of the individual variables.

The model showed good fit to the data. The sample covariance matrix did not differ statistically from the population covariance matrix, χ2(13, N = 166) = 8.98, p = .78. Other absolute fit measures showed good model fit: RMR = 5.09, GFI = .99, AGFI = .97, PGFI = .46. Relative fit measures were very good: NFI = .98, RFI = .96, IFI = 1.00, TLI = 1.00, CFI = 1.00. Parsimony-adjusted measures of model fit were acceptable: PRATIO = .62, PNFI = .60, PCFI = .62. Fit measures based on the noncentral chi-square distribution showed excellent model fit: RMSEA = .00 (Low = .00, High = .05, PCLOSE = .94), NCP = .00 (Low = .00, High = 6.05), FMIN = .05, F0 = .00 (Low = .00, High = .04). Information theoretic fit measures were as follows: AIC = 38.98, BCC = 40.51, BIC = 85.66, CAIC = 100.66, ECVI = .24 (Low = .26, High = .30), MECVI = .25. Finally, fit measures based on sample size were as follows: HOELTER .05 = 412, HOELTER .01 = 509.

As the model fit was sufficient, and no changes were suggested by the modification indices, the model was accepted. Based on these results, no statistically significant correlation was observed between metalinguistic knowledge and language learning aptitude, which indicates that the two are distinct constructs.

Figure 13. CFA of metalinguistic knowledge and language learning aptitude. Standardized coefficients are shown. All paths are statistically significant (p < .05) except for the correlation between metalinguistic knowledge and language aptitude. Metaling = metalinguistic knowledge; LangAptitude = language learning aptitude; Rule = rule explanation; Tech = technical terminology; RcpMtLng = receptive metalinguistic knowledge; NumLn = number learning; SndSym = sound-symbol association; Vocab = vocabulary learning; LngAn = language analytical ability.

Research Question 2: Effects of Metalinguistic Knowledge and Language Learning Aptitude on L2 Procedural Knowledge

Before conducting SEM to test the models related to the second research question, EFA was conducted to assess the relationships among the variables that served as indicators of metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge.

The three metalinguistic variables, the four language learning aptitude variables, and L2 complexity, accuracy, and fluency were subjected to a factor analysis. L2 procedural knowledge was not included in the EFA because the purpose of the analysis was to assess the validity of using complexity, accuracy, and fluency as indicators of a shared latent variable. For reference, a separate EFA was conducted, and L2 procedural knowledge loaded at .66 with fluency on the third factor. Principal axis factoring was used to extract factors from the shared variance in the data, and based on the previous results that showed a weak, nonsignificant correlation between metalinguistic knowledge and language aptitude, an orthogonal rotation (Varimax) was selected to create an interpretable factor structure.

A scree plot indicated a two- or three-factor solution. Four eigenvalues exceeded or met Kaiser's criterion of 1.0 (factor 1 = 2.71; factor 2 = 1.76; factor 3 = 1.10; factor 4 = 1.00). These four factors explained 44.34% of the variance in the data. The KMO measure of sampling adequacy was .68, which exceeded the minimum recommended value of .50. Bartlett's test of sphericity was significant, p < .001, which indicated that the correlations among the variables were statistically different from 0. A criterion of .40 was used to determine statistically significant factor loadings. Table 27 shows the rotated factor loadings. Except for L2 accuracy, each variable loaded clearly on a single factor, and no cross-loadings were observed. All of the metalinguistic knowledge variables loaded strongly on the first factor, which was labeled metalinguistic knowledge. The second factor, language learning aptitude, had two medium to fairly strong loadings (i.e., number learning and sound-symbol association). Vocabulary learning loaded at .48, and the loading for language analytical ability just exceeded the criterion of .40.

Accuracy did not load at the .40 level on any of the factors. Its highest loading was .37 on factor 1, with loadings of -.06, -.10, and -.04 on the other three factors. These results are interpretable in that metalinguistic knowledge is considered to be facilitative to controlled, accurate L2 production. At the same time, however, they are problematic in that L2 accuracy did not form a factor with the other L2 developmental measures. L2 complexity loaded at .69 on the fourth factor, and fluency loaded at .69 on the third factor.

These three CAF variables were component scores. To investigate any differential effects of using component scores compared to the original measures, the factor analysis was rerun using the original CAF measures—the results were the same. Furthermore, the analysis was conducted again, requesting a three-factor solution. In this case, L2 complexity and accuracy did not reach the .40 criterion on any of the factors, and fluency was the only variable that loaded on the third factor. Based on these results, L2 complexity, accuracy, and fluency are unique aspects of L2 production.

As the purpose of this EFA was to assess the dimensionality of the putative factors before conducting SEM, variables were not removed from the model to achieve a clear, interpretable factorial structure. Clearly, the constructs of L2 complexity, accuracy, and fluency are substantively unique, and employing them as related indicators of L2 procedural knowledge is problematic. As an alternative to testing a latent L2 procedural knowledge variable with complexity, accuracy, and fluency as observed indicators, I decided to use single, unique variables as observed indicators of L2 procedural knowledge in hybrid structural models. L2 procedural knowledge measures derived from the Rasch-analyzed holistic essay ratings and the complexity, accuracy, and fluency measures were employed independently to test the effects of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge, complexity, accuracy, and fluency.

Table 27. Rotated Factor Loadings for Metalinguistic Knowledge, Language Learning Aptitude, and L2 Developmental Measures

Variable                 Factor 1    Factor 2    Factor 3    Factor 4
Receptive metaling       .62
TechTerms                .88
RuleExpl                 .86
Number learning                      .57
Sound-symbol                         .58
Vocabulary learning                  .48
Analytical ability                   .41
Complexity                                                   .69
Accuracy                 (.37)
Fluency                                          .69

Note. Only factor loadings > .40 are shown. The factor loading for accuracy is shown for reference. n = 166. Receptive metaling = receptive metalinguistic knowledge; TechTerms = technical terminology; RuleExpl = rule explanation; Sound-symbol = sound-symbol association; Analytical ability = language analytical ability.

The first hybrid structural model tested the effects of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge. Rule explanation, technical terminology, and receptive metalinguistic knowledge measures served as manifest variables for the latent metalinguistic knowledge factor. Number learning, sound-symbol association, vocabulary learning, and language analytical ability served as manifest variables for language learning aptitude. Rasch-scaled timed-writing measures served as the manifest variable of L2 procedural knowledge. The two latent variables were independent variables that predicted L2 procedural knowledge.
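The model was estimated in AMOS. For reference, an approximately equivalent specification can be written in lavaan-style syntax, for example with the Python semopy package (an illustration under the assumption that the data columns carry the Figure 14 abbreviations; this is not the software used in the study):

```python
# An approximate re-specification of the first hybrid model (AMOS was
# used in the study); data is a hypothetical DataFrame whose columns
# match the abbreviations in Figure 14.
import pandas as pd
from semopy import Model, calc_stats

HYBRID_MODEL = """
Metaling =~ Rule + Tech + RcpMtLng
LangAptitude =~ NumLn + SndSym + Vocab + LngAn
L2PrcdrlKnw ~ Metaling + LangAptitude
"""

def fit_hybrid_model(data: pd.DataFrame):
    model = Model(HYBRID_MODEL)
    model.fit(data)            # maximum likelihood estimation
    print(calc_stats(model))   # fit indices (e.g., CFI, RMSEA)
    return model.inspect()     # parameter estimates and p values
```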

Maximum likelihood estimation was used to estimate the parameters of the model. The model had 36 sample moments and 17 distinct parameters that had to be estimated. This resulted in 19 degrees of freedom, which indicated that the model was overidentified. Figure 14 shows the tested model with standardized coefficients. All of the regression weights were statistically significant, p < .05, which indicated that the regression weights were statistically different from zero, except for the path from language learning aptitude to L2 procedural knowledge, β = -.07, p = .41. Metalinguistic knowledge showed a strong relationship with L2 procedural knowledge (β = .57, p < .001), but language learning aptitude did not statistically predict L2 procedural knowledge. The squared multiple correlations indicated that metalinguistic knowledge and language learning aptitude accounted for 33% of the variance in L2 procedural knowledge.

The regression weights for the metalinguistic knowledge variables were strong, which provides evidence that the manifest variables were valid operationalizations of the construct. The loadings for the language learning aptitude variables, however, were relatively low. This finding suggests that there is scope for improving the conceptualization and measurement of language learning aptitude.

All of the standardized residual covariances were below the recommended absolute value of 2.58. The standardized total effects indicated that when metalinguistic knowledge increases by one standard deviation, L2 procedural knowledge increases by .57 standard deviations. A one standard deviation increase in language learning aptitude results in a .07 decrease in L2 procedural knowledge. These values represent the relative impact and contribution of the individual variables.

The model showed good fit to the data. The sample covariance matrix did not differ statistically from the population covariance matrix, χ2(19, N = 166) = 19.89, p = .40. Other absolute fit measures showed good model fit: RMR = 8.09, GFI = .97, AGFI = .95, PGFI = .51. Relative fit measures were acceptable or very good: NFI = .95, RFI = .93, IFI = .99, TLI = .99, CFI = .99. Parsimony-adjusted measures of model fit were acceptable: PRATIO = .68, PNFI = .65, PCFI = .68. Fit measures based on the noncentral chi-square distribution showed excellent model fit: RMSEA = .02 (Low = .00, High = .07, PCLOSE = .79), NCP = .89 (Low = .00, High = 15.82), FMIN = .12, F0 = .01 (Low = .00, High = .10). Information theoretic fit measures were as follows: AIC = 53.89, BCC = 55.85, BIC = 106.79, CAIC = 123.79, ECVI = .33 (Low = .32, High = .42), MECVI = .34. Finally, fit measures based on sample size were as follows: HOELTER .05 = 251, HOELTER .01 = 301.

As the model fit was sufficient, and no changes were suggested by the modification indices, the model was accepted. The results indicated that metalinguistic knowledge statistically predicted L2 procedural knowledge (β = .57, p < .001), but language learning aptitude did not statistically predict L2 procedural knowledge (β = -.07, p = .41).

Figure 14. Structural model of the effects of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge. Standardized coefficients are shown. L2PrcdrlKnw = L2 procedural knowledge; Metaling = metalinguistic knowledge; LangAptitude = language learning aptitude; Rule = rule explanation; Tech = technical terminology; RcpMtLng = receptive metalinguistic knowledge; NumLn = number learning; SndSym = sound-symbol association; Vocab = vocabulary learning; LngAn = language analytical ability.

The second hybrid structural model tested the effects of metalinguistic knowledge and language learning aptitude on L2 complexity. Rule explanation, technical terminology, and receptive metalinguistic knowledge measures served as manifest variables for the latent metalinguistic knowledge factor. Number learning, sound-symbol association, vocabulary learning, and language analytical ability served as manifest variables for language learning aptitude. L2 complexity component scores served as the observed variable of complexity. The two latent variables were independent variables that predicted L2 complexity.

Maximum likelihood estimation was used to estimate the parameters of the model. The model had 36 sample moments and 17 distinct parameters that had to be estimated. This resulted in 19 degrees of freedom, which indicated that the model was overidentified. Figure 15 shows the tested model with standardized coefficients. All of the regression weights were statistically significant, p < .05, which indicated that the regression weights were statistically different from zero, except for the path from language learning aptitude to L2 complexity, β = -.08, p = .45. Metalinguistic knowledge showed a weak, statistically significant relationship with L2 complexity (β = .19, p < .05), but language learning aptitude did not statistically predict L2 complexity. The squared multiple correlations indicated that metalinguistic knowledge and language learning aptitude accounted for 4.1% of the variance in L2 complexity.

All of the standardized residual covariances were below the recommended absolute value of 2.58. The standardized total effects indicated that when metalinguistic knowledge increases by one standard deviation, L2 complexity increases by .19, and when language learning aptitude increases by one standard deviation, L2 complexity decreases by .08. These figures represent the relative impact and contribution of the individual variables.

The model showed good fit to the data. The sample covariance matrix did not differ statistically from the population covariance matrix, χ2(19, N = 166) = 13.73, p = .80. Other absolute fit measures showed good model fit: RMR = 5.43, GFI = .98, AGFI = .96, PGFI = .52. Relative fit measures were very good: NFI = .96, RFI = .95, IFI = 1.00, TLI = 1.00, CFI = 1.00. Parsimony-adjusted measures of model fit were good: PRATIO = .68, PNFI = .65, PCFI = .68. Fit measures based on the noncentral chi-square distribution showed excellent model fit: RMSEA = .00 (Low = .00, High = .05, PCLOSE = .97), NCP = .00 (Low = .00, High = 6.30), FMIN = .08, F0 = .00 (Low = .00, High = .04). Information theoretic fit measures were as follows: AIC = 47.73, BCC = 49.69, BIC = 100.63, CAIC = 117.63, ECVI = .29 (Low = .32, High = .36), MECVI = .30. Finally, fit measures based on sample size were as follows: HOELTER .05 = 365, HOELTER .01 = 436.

Figure 15. Structural model of the effects of metalinguistic knowledge and language learning aptitude on L2 complexity. Standardized coefficients are shown. Metaling = metalinguistic knowledge; LangAptitude = language learning aptitude; Rule = rule explanation; Tech = technical terminology; RcpMtLng = receptive metalinguistic knowledge; NumLn = number learning; SndSym = sound-symbol association; Vocab = vocabulary learning; LngAn = language analytical ability.

As the model fit was sufficient, and no changes were suggested by the modification indices, the model was accepted. The results indicated that metalinguistic knowledge statistically predicted L2 complexity (β = .19, p < .05), but language learning aptitude did not statistically predict L2 complexity (β = -.08, p = .45).

The third hybrid structural model tested the effects of metalinguistic knowledge and language learning aptitude on L2 accuracy. Rule explanation, technical terminology, and receptive metalinguistic knowledge measures served as manifest variables for the latent metalinguistic knowledge variable. Number learning, sound-symbol association, vocabulary learning, and language analytical ability served as manifest variables for language learning aptitude. L2 accuracy component scores served as the manifest variable of accuracy. The two latent variables were independent variables that predicted L2 accuracy.

Maximum likelihood estimation was used to estimate the parameters of the model. The model had 36 sample moments and 17 distinct parameters that had to be estimated. This resulted in 19 degrees of freedom, which indicated that the model was overidentified. Figure 16 shows the tested model with standardized coefficients. All of the regression weights were statistically significant, p < .001, which indicated that the regression weights were statistically different from zero, except for the path from language learning aptitude to L2 accuracy, β = -.08, p = .39. Metalinguistic knowledge showed a moderate, statistically significant relationship with L2 accuracy (β = .30, p < .001), but language learning aptitude did not statistically predict L2 accuracy. The squared multiple correlations indicated that metalinguistic knowledge and language learning aptitude accounted for 10% of the variance in L2 accuracy.

Figure 16. Structural model of the effects of metalinguistic knowledge and language learning aptitude on L2 accuracy. Standardized coefficients are shown. Metaling = metalinguistic knowledge; LangAptitude = language learning aptitude; Rule = rule explanation; Tech = technical terminology; RcpMtLng = receptive metalinguistic knowledge; NumLn = number learning; SndSym = sound-symbol association; Vocab = vocabulary learning; LngAn = language analytical ability.

All of the standardized residual covariances were below the recommended absolute value of 2.58. The standardized total effects indicated that when metalinguistic knowledge increases by one standard deviation, L2 accuracy increases by .30, and when language learning aptitude increases by one standard deviation, L2 accuracy decreases by .08. These figures represent the relative impact and contribution of the individual variables.

The model showed good fit to the data. The sample covariance matrix did not differ statistically from the population covariance matrix, χ2(19, N = 166) = 12.38, p = .87. Other absolute fit measures showed good or adequate model fit: RMR = 5.43, GFI = .98, AGFI = .97, PGFI = .52. Relative fit measures were very good: NFI = .97, RFI = .95, IFI = 1.00, TLI = 1.00, CFI = 1.00. Parsimony-adjusted measures of model fit were good: PRATIO = .68, PNFI = .66, PCFI = .68. Fit measures based on the noncentral chi-square distribution showed excellent model fit: RMSEA = .00 (Low = .00, High = .04, PCLOSE = .98), NCP = .00 (Low = .00, High = 4.04), FMIN = .08, F0 = .00 (Low = .00, High = .02). Information theoretic fit measures were as follows: AIC = 46.38, BCC = 48.34, BIC = 99.28, CAIC = 116.28, ECVI = .28 (Low = .32, High = .35), MECVI = .29. Finally, fit measures based on sample size were as follows: HOELTER .05 = 402, HOELTER .01 = 483.

As the model fit was sufficient, and no changes were suggested by the modification indices, the model was accepted. The results indicated that metalinguistic knowledge statistically predicted L2 accuracy (β = .30, p < .001), but language learning aptitude did not statistically predict L2 accuracy (β = -.08, p = .39).

The fourth hybrid structural model tested the effects of metalinguistic knowledge and language learning aptitude on L2 fluency. Rule explanation, technical terminology, and receptive metalinguistic knowledge measures served as manifest variables for the unobserved metalinguistic knowledge factor. Number learning, sound-symbol association, vocabulary learning, and language analytical ability served as manifest variables for language learning aptitude. L2 fluency component scores served as the manifest variable of fluency. The two latent variables were independent variables that predicted L2 fluency.

Maximum likelihood estimation was used to estimate the parameters of the model. The model had 36 sample moments and 17 distinct parameters that had to be estimated. This resulted in 19 degrees of freedom, which indicated that the model was overidentified. Figure 17 shows the tested model with standardized coefficients. All of the regression weights were statistically significant, p < .05, which indicated that the regression weights were statistically different from zero, except for the path from language learning aptitude to L2 fluency, β = .02, p = .80. Metalinguistic knowledge showed a moderate, statistically significant relationship with L2 fluency (β = .35, p < .001), but language learning aptitude did not statistically predict L2 fluency. The squared multiple correlations indicated that metalinguistic knowledge and language learning aptitude accounted for 12% of the variance in L2 fluency.

All of the standardized residual covariances were below the recommended absolute value of 2.58. The standardized total effects indicated that when metalinguistic knowledge increases by one standard deviation, L2 fluency increases by .35, and when language learning aptitude increases by one standard deviation, L2 fluency increases by .02. These figures represent the relative impact and contribution of the individual variables.

The model showed good fit to the data. The sample covariance matrix did not differ statistically from the population covariance matrix, χ2(19, N = 166) = 19.24, p = .44. Other absolute fit measures showed good model fit: RMR = 5.42, GFI = .97, AGFI = .95, PGFI = .51. Relative fit measures were very good: NFI = .95, RFI = .93, IFI = .99, TLI = .99, CFI = .99. Parsimony-adjusted measures of model fit were good: PRATIO = .68, PNFI = .65, PCFI = .68. Fit measures based on the noncentral chi-square distribution showed excellent model fit: RMSEA = .01 (Low = .00, High = .07, PCLOSE = .82), NCP = .24 (Low = .00, High = 14.85), FMIN = .12, F0 = .00 (Low = .00, High = .09). Information theoretic fit measures were as follows: AIC = 53.24, BCC = 55.20, BIC = 106.14, CAIC = 123.14, ECVI = .32 (Low = .32, High = .41), MECVI = .34. Finally, fit measures based on sample size were as follows: HOELTER .05 = 259, HOELTER .01 = 311.

Figure 17. Structural model of the effects of metalinguistic knowledge and language learning aptitude on L2 fluency. Standardized coefficients are shown. Metaling = metalinguistic knowledge; LangAptitude = language learning aptitude; Rule = rule explanation; Tech = technical terminology; RcpMtLng = receptive metalinguistic knowledge; NumLn = number learning; SndSym = sound-symbol association; Vocab = vocabulary learning; LngAn = language analytical ability.

As the model fit was sufficient, and no changes were suggested by the modification indices, the model was accepted. The results indicated that metalinguistic knowledge statistically predicted L2 fluency (β = .35, p < .001), but language learning aptitude did not statistically predict L2 fluency (β = .02, p = .80).

For the second research question, metalinguistic knowledge and language learning aptitude were hypothesized to account for variation in L2 procedural knowledge. SEM analyses revealed that metalinguistic knowledge explained statistically significant portions of the variance in the dependent variables. Because memory and analytical abilities contribute to the L2 learning process, it was hypothesized that language learning aptitude would also account for some degree of L2 proceduralization, complexity, accuracy, and fluency. This hypothesis, however, was not supported.

Research Question 3: Effects of Components of Metalinguistic Knowledge and Language Learning Aptitude on L2 Procedural Knowledge

A path model was used to test the relationships among components of language learning aptitude, receptive metalinguistic knowledge, and productive metalinguistic knowledge (Figure 18). The strength of the regression paths was of interest. All of the variables in the model were observed variables that were used in the analyses reported for the first two research questions. The model hypothesized that L2 procedural knowledge would be predicted by each of the four components of language learning aptitude (i.e., number learning, sound-symbol association, vocabulary learning, and language analytical ability) and by receptive and productive metalinguistic knowledge. Furthermore, although the analyses reported for the first research question indicated that no correlation exists between metalinguistic knowledge and language learning aptitude, memory-related components of language learning aptitude could affect the learning and retention of declarative metalinguistic knowledge. Additionally, language analytical ability could facilitate the analysis and learning of metalinguistic knowledge. Thus, vocabulary learning and language analytical ability were posited to predict receptive and productive metalinguistic knowledge.

Figure 18. Path model of the effects of components of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge. L2PrcdrlKnw = L2 procedural knowledge; Rule = rule explanation; RcpMtLng = receptive metalinguistic knowledge; Vocab = vocabulary learning; LngAn = language analytical ability; NumLn = number learning; SndSym = sound-symbol association.
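In the same lavaan-style syntax used for the earlier sketch, the hypothesized path model can be written as follows. This is a reconstruction from the description above, not the AMOS setup itself, although it is consistent with the parameter count reported below (ten regression paths, three residual variances, and four exogenous variances yield 17 parameters); variable names follow Figure 18:

```python
# The hypothesized path model in lavaan-style syntax (illustrative;
# the study used AMOS). Observed variables only; names follow Figure 18.
PATH_MODEL = """
L2PrcdrlKnw ~ Rule + RcpMtLng + NumLn + SndSym + Vocab + LngAn
RcpMtLng ~ Vocab + LngAn
Rule ~ Vocab + LngAn
"""
# Ten paths + 3 residual variances + 4 exogenous variances = 17 parameters;
# 7 observed variables supply 7 * 8 / 2 = 28 moments, so df = 28 - 17 = 11.
```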

Maximum likelihood estimation was used to estimate the parameters of the model. The model had 28 sample moments and 17 distinct parameters that had to be estimated. This resulted in 11 degrees of freedom, which indicated that the model was overidentified. Four of the regression weights were statistically significant: rule explanation to L2 procedural knowledge, β = .49, p < .001; receptive metalinguistic knowledge to L2 procedural knowledge, β = .14, p < .05; language analytical ability to receptive metalinguistic knowledge, β = .17, p < .05; and number learning to L2 procedural knowledge, β = -.16, p < .05. These results indicated that these four regression weights were statistically different from zero. Receptive and productive metalinguistic knowledge showed weak and strong statistically significant relationships, respectively, with L2 procedural knowledge, but among the components of language learning aptitude, only number learning independently predicted L2 procedural knowledge, and only language analytical ability predicted receptive metalinguistic knowledge. Surprisingly, the regression weight from number learning to L2 procedural knowledge was negative, indicating that as one variable increases the other decreases. The squared multiple correlations indicated that the predictors of L2 procedural knowledge accounted for 30% of the variance in that variable, and vocabulary learning and language analytical ability explained 4% of the variance in receptive metalinguistic knowledge and 1% of the variance in rule explanation.

The standardized residual covariances among the components of language learning aptitude exceeded the recommended absolute value of 2.58, as did the covariance between receptive metalinguistic knowledge and rule explanation and the covariance between receptive metalinguistic knowledge and L2 procedural knowledge.

The model showed poor fit to the data. The sample covariance matrix differed statistically from the population covariance matrix, χ2(11, N = 166) = 113.88, p < .05. Other absolute fit measures indicated poor model fit: RMR = 21.13, GFI = .83, AGFI = .57, PGFI = .33. Relative fit measures were also poor: NFI = .40, RFI = -.15, IFI = .42, TLI = -.17, CFI = .39. Parsimony-adjusted measures of model fit were low: PRATIO = .52, PNFI = .21, PCFI = .20. Fit measures based on the noncentral chi-square distribution showed poor model fit: RMSEA = .24 (Low = .20, High = .28, PCLOSE = .00), NCP = 102.88 (Low = 72.29, High = 140.94), FMIN = .69, F0 = .62 (Low = .44, High = .85). Information theoretic fit measures were as follows: AIC = 147.88, BCC = 149.62, BIC = 200.79, CAIC = 217.79, ECVI = .90 (Low = .71, High = 1.13), MECVI = .91. Finally, fit measures based on sample size were as follows: HOELTER .05 = 29, HOELTER .01 = 36.

As the model fit was poor, the modification indices were examined. AMOS provides information related to the misfit of the specified model in the form of modification indices. These indices represent possible misspecification of the tested model. Specifically, they offer evidence as to which parameters should be freely estimated to improve model fit. These statistics are analogous to a chi-square test with one degree of freedom (Byrne, 2010). The value of a modification index represents the expected decrease in the model chi-square statistic if a given parameter were to be freely estimated. Expected parameter change statistics are also generated. These values indicate the expected change for the fixed parameters in the model if the associated parameters were freely estimated. A prudent method for evaluating the modification indices is to accommodate the largest modification into the model and then examine the resultant model fit. Only theoretically defensible model respecifications should be made. Post hoc model modification changes the status of the modeling from confirmatory to exploratory. The indices indicated that model fit would improve if covariances were added among the language learning aptitude variables and between the metalinguistic error terms. Covariances were added one by one, and changes in model fit were examined, following the decision rule sketched below.
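The decision rule just described can be expressed compactly: a modification index is worth acting on only if it exceeds the 1-df chi-square critical value and the change is theoretically defensible. A sketch of that logic (not AMOS functionality):

```python
# Evaluating a modification index against the 1-df chi-square critical
# value (a sketch of the decision logic described above, not AMOS code).
from scipy.stats import chi2

def worth_freeing(modification_index: float, alpha: float = 0.05) -> bool:
    """True if freeing the parameter is expected to reduce the model
    chi-square by more than the 1-df critical value (3.84 at p = .05)."""
    return modification_index > chi2.ppf(1 - alpha, df=1)

print(worth_freeing(10.2))  # True: an expected drop of ~10 exceeds 3.84
```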

The resulting model fit after the addition of a covariance between sound-symbol association and number learning was poor. Second, a covariance was added between sound-symbol association and language analytical ability—the model fit was still poor. Third, a covariance between number learning and language analytical ability was added—the model fit was still poor.

At this point, the modification indices indicated that adding a covariance between number learning and vocabulary learning would improve model fit. Thus, fourth, a covariance was added between number learning and vocabulary learning—the model fit was still poor. Fifth, the modification indices indicated that adding a covariance between sound-symbol association and vocabulary learning would improve model fit, but after this addition the model fit was still poor. Sixth, a covariance was added between language analytical ability and vocabulary learning—the model fit improved but was still poor. Seventh, a covariance was added between the error terms d1 and d2—the model fit greatly improved.

After seven modifications, the model had 28 sample moments and 24 distinct parameters that had to be estimated. This resulted in 4 degrees of freedom, which indicated that the model was overidentified. Figure 19 shows the modified model. Three of the regression weights were statistically significant: rule explanation to L2 procedural knowledge, β = .47, p < .001; language analytical ability to receptive metalinguistic knowledge, β = .17, p < .05; and number learning to L2 procedural knowledge, β = -.16, p < .05. These three regression weights were statistically different from zero. The path from receptive metalinguistic knowledge to L2 procedural knowledge was not statistically significant after model modification, β = .13, p = .09.

Productive metalinguistic knowledge showed a moderate to strong statistically significant relationship with L2 procedural knowledge, whereas number learning was the only component of language learning aptitude that independently predicted L2 procedural knowledge. The regression weight from number learning to L2 procedural knowledge was negative, indicating that as one variable increases the other decreases. The squared multiple correlations indicated that the predictors of L2 procedural knowledge accounted for 34% of the variance in that variable, and vocabulary learning and language analytical ability explained just 3% of the variance in receptive metalinguistic knowledge and 1% of the variance in rule explanation. The regression weight for vocabulary learning to receptive metalinguistic knowledge was -.08, p > .05, and the regression weight for vocabulary learning to L2 procedural knowledge was .06, p > .05. The standardized residual covariances among the variables were all below the recommended absolute value of 2.58.

Figure 19. Modified path model of the effects of components of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge. Standardized coefficients are shown. L2PrcdrlKnw = L2 procedural knowledge; Rule = rule explanation; RcpMtLng = receptive metalinguistic knowledge; Vocab = vocabulary learning; LngAn = language analytical ability; NumLn = number learning; SndSym = sound-symbol association.

The model showed good fit to the data. The sample covariance matrix did not differ statistically from the population covariance matrix, χ2(4, N = 166) = 2.00, p = .74. Other absolute fit measures indicated good model fit: RMR = 2.84, GFI = .99, AGFI = .98, PGFI = .14. Relative fit measures were also good: NFI = .99, RFI = .95, IFI = 1.00, TLI = 1.00, CFI = 1.00. Parsimony-adjusted measures of model fit were low: PRATIO = .19, PNFI = .19, PCFI = .19. Fit measures based on the noncentral chi-square distribution showed good model fit: RMSEA = .00 (Low = .00, High = .08, PCLOSE = .85), NCP = .00 (Low = .00, High = 4.61), FMIN = .01, F0 = .00 (Low = .00, High = .03). Information theoretic fit measures were as follows: AIC = 49.99, BCC = 52.43, BIC = 124.67, CAIC = 148.67, ECVI = .30 (Low = .32, High = .34), MECVI = .32. Finally, fit measures based on sample size were as follows: HOELTER .05 = 789, HOELTER .01 = 1104.

As adequate model fit was observed, this model was accepted. However, it should be noted that nonsignificant paths were retained in the model. The results indicated that rule explanation statistically significantly predicted L2 procedural knowledge (β = .47, p < .001), and language analytical ability statistically predicted receptive metalinguistic knowledge (β = .17, p < .05). Finally, number learning statistically significantly predicted L2 procedural knowledge, but the regression coefficient was negative (β = -.16, p < .05)—this finding could indicate that aural learning abilities are inversely related to L2 proceduralization and written production. Indeed, cognitive processes such as phonological memory that are engaged during online aural learning might differ from the processes applied longitudinally in the proceduralization of L2 declarative knowledge. However, as the loading was relatively weak, it should be interpreted cautiously.

Summary of the Results

CFA and SEM were used to answer the three research questions. A two-factor model was hypothesized for metalinguistic knowledge and language learning aptitude, and it was tested using CFA. The model showed good fit to the data. The regression coefficients were statistically significant, and the correlation between the factors was weak and nonsignificant. These results support the interpretation that metalinguistic knowledge and language learning aptitude are two distinct factors. Tests of the hybrid structural models revealed that metalinguistic knowledge statistically predicted L2 procedural knowledge, complexity, accuracy, and fluency. Language learning aptitude, however, was not a statistically significant predictor of those variables. Finally, results of a path analysis revealed that (a) metalinguistic rule explanation statistically predicted L2 procedural knowledge, (b) language analytical ability statistically predicted receptive metalinguistic knowledge, and (c) number learning statistically predicted L2 procedural knowledge. However, number learning and L2 procedural knowledge were negatively associated.

In the next chapter, the results of the analyses for each research question are summarized. Following the summary, the results are discussed and the implications are considered.

CHAPTER 6

DISCUSSION

In this chapter, each research question is restated and is followed by a summary of the results. The results for each research question are discussed by situating them within those of previous studies. The findings and implications are interpreted within a skill-theory approach to L2 learning.

Research Question 1: Relationship Between Metalinguistic Knowledge and Language Learning Aptitude

The first research question addressed the relationship between metalinguistic knowledge and language learning aptitude. Previous studies (e.g., Alderson et al., 1997; Roehr, 2008b) reported evidence indicating that metalinguistic knowledge and language learning aptitude were variables belonging to a multifaceted, but unidimensional, factor. That is, a single factor accounted for both metalinguistic knowledge and language learning aptitude. However, metalinguistic knowledge consists of declarative representations of explicit knowledge of an L2. On the other hand, language learning aptitude involves the application of cognitive abilities that differ from L2 declarative knowledge. Thus, in the present study it was hypothesized that a two-factor model would account for the relationships among the indicators of the two constructs. This research question was answered by subjecting the three metalinguistic knowledge variables (i.e., receptive metalinguistic knowledge, technical terminology, and rule explanation) and the four language learning aptitude variables (i.e., number learning, sound-symbol association, vocabulary learning, and language analytical ability) to an EFA. This analysis was then followed by a CFA to test the status of the metalinguistic knowledge and language learning aptitude constructs.

Summary of the Results for Research Question 1

First, an EFA was conducted to assess the relationships among the metalinguistic knowledge and language learning aptitude variables. An examination of the scree plot suggested a two-factor solution, and this interpretation was supported by the first two eigenvalues, which were the only ones greater than 1.0. The two factors accounted for 45.61% of the variance in the data. Each variable loaded cleanly on one of the two factors, and no cross-loadings were observed. The first factor was related to metalinguistic knowledge, and the second factor represented language learning aptitude.

Second, to examine other possible factorial structures exhaustively, a second EFA was conducted, and a single-factor solution was specified. This model was not supported. Less variance was accounted for by the single-factor solution than by the two-factor solution. Furthermore, only the metalinguistic knowledge variables loaded on the factor—these three loadings were strong. The language learning aptitude variables loaded weakly on the factor. Thus, no support for a single-factor solution was found.

Third, a CFA was conducted to confirm the model fit of a two-factor model. The three metalinguistic knowledge variables served as indicators of a metalinguistic knowledge factor, and the four language learning aptitude variables served as indicators of a language learning aptitude factor. The two factors were hypothesized to covary. The results indicated that all of the regression weights from the latent variables to their respective manifest variables were statistically significant, p < .05. The relationship between metalinguistic knowledge and language learning aptitude was weak and nonsignificant (r = .11, p = .32). Good model fit was obtained with no modifications, χ2(13, N = 166) = 8.98, p = .78, CFI = 1.00, RMSEA = .00.

In sum, the results supported the hypothesis that metalinguistic knowledge and language learning aptitude are distinct factors. A two-factor solution adequately accounted for their relationship, and the two latent variables were statistically related to their manifest variables. No evidence was found to support a single-factor model for metalinguistic knowledge and language learning aptitude.

Interpretation of the Results for Research Question 1

The results of the analyses related to the factorial structure of metalinguistic knowledge and language learning aptitude support the hypothesis that these two constructs are unique factors. This finding runs counter to the findings and interpretations of several previous researchers.

Alderson et al. (1997) concluded that metalinguistic knowledge and L2 proficiency are relatively unrelated—strong relationships between the constructs were not found in their data. Furthermore, they argued that language learning aptitude, in the form of language awareness, might account for learners' metalinguistic knowledge—that is, metalinguistic knowledge is closely related to language awareness and not L2 proficiency. These interpretations might have resulted from the lack of an a priori theoretical model of linguistic knowledge, or, at least, from the use of a model that does not account for tenable relationships among metalinguistic knowledge, language learning aptitude, and L2 proficiency. Implicit and explicit knowledge are mentioned in their review, but the concept of proceduralization is absent from the discussion. It might be that some aspects of metalinguistic knowledge are learned outside of language-specific aptitudes and are more related to general cognitive learning mechanisms. Indeed, language learning aptitude is related to but qualitatively different from L2 proficiency and general intelligence (Sasaki, 1993). The proficiency measures used in Alderson et al. were a cloze test, a c-test, a reading comprehension test, a listening test, and a writing test. Depending on the testing conditions, these tests could tap a combination of declarative and procedural linguistic knowledge, which would result in a unitary linguistic factor.

With regard to language learning aptitude, Alderson et al. (1997) employed the words-in-sentences subtest of the MLAT and an inductive language learning test based on Swahili. The words-in-sentences test is closer to a grammatical sensitivity test than a language analytical ability test. This type of test targets general awareness of linguistic functions of native language words and phrases, which is congruent with the concept of language awareness. However, language awareness movements and explicit or declarative knowledge of a language are different concepts (R. Ellis, 2004). Moreover, learners who have taken linguistics courses can develop L1 linguistic awareness, which in turn assists performance on grammatical sensitivity tests. In this sense, grammatical sensitivity in a given language develops as a skill in that it can be studied and its application practiced. The related, yet distinct, ability of language analysis oftentimes requires the comprehension, induction, and application of declarative linguistic rules. The language analytical ability measure used in the present study is an example of the latter type of test.

As a second aspect of language learning aptitude, Alderson et al. (1997) used an inductive language learning test. Test-takers were given a Swahili text that had the first few sentences translated into English. The participants translated the next few sentences from Swahili into English. Performance on this test is determined by learners' ability to induce the rules governing an assumed-to-be unknown language. As reported in their study, scores from these two language learning aptitude tests did not load on a separate factor. Language learning aptitude was statistically indiscernible from metalinguistic knowledge and some proficiency measures.

One explanation for the contradictory findings is that one of the metalinguistic knowledge tests in Alderson et al. (1997) did not cleanly tap the intended construct. In one of the tests, participants were asked to name different parts of speech in L1 English or L2 French sentences. This task shares similarities with tests of grammatical sensitivity. Learners are asked to identify parts of speech in their L1, and performance on this type of test can be attributed to previous experience or training in L1 grammatical functions or to grammatical sensitivity. The former would imply that the sample had taken courses that facilitated test completion, and the latter would imply that the test actually tapped language learning aptitude.

Other studies have also reported that metalinguistic knowledge and language learning aptitude are components of a unitary construct, which is not congruent with the findings of the present study. Roehr (2008b), in her investigation of L2 proficiency, metalinguistic knowledge, and L2 analytical ability, included in the definition of metalinguistic knowledge the ability to correct ungrammatical sentences and the ability to analyze L2 grammatical functions. L2 German proficiency was assessed by a gap-fill and multiple-choice test. Metalinguistic knowledge was operationalized as the ability to correct, describe, and explain errors. The L2 analytic ability test asked learners to identify the grammatical functions of highlighted parts of sentences—this test was similar to the MLAT words-in-sentences subtest but given in the learners' L2. The results indicated that scores on the three tests (i.e., L2 proficiency, metalinguistic knowledge, and L2 analytic ability) were statistically correlated. Most notably, L2 proficiency scores and metalinguistic knowledge scores were strongly correlated, r = .81, p < .01, and metalinguistic knowledge scores were also strongly correlated with language analysis scores, r = .84, p < .01. For the various bivariate combinations of variables, correlation coefficients ranged from .62 to .97. A PCA revealed that all of the test scores loaded on a single factor, which accounted for 82% of the variance in the data. The conclusion reached in the study was that language analytical ability is a part of metalinguistic knowledge.

However, alternative explanations for the findings can be espoused. First, L2 proficiency tests that tap primarily declarative knowledge should correlate with other tests of declarative knowledge. The L2 proficiency test used in Roehr (2008b) required learners to fill gaps (i.e., a cloze-type test) or answer multiple-choice questions. Cloze tests might tap implicit knowledge to some degree, but a multiple-choice test clearly encourages controlled access to declarative knowledge. Employing this type of proficiency test might have resulted in a strong correlation with metalinguistic knowledge due to the nature of the tests. The proficiency test focused on grammatical and lexical knowledge of L2 German. In meaning-focused tasks, procedural memory subserves grammatical processing and encoding. However, this scenario is unlikely on multiple-choice tests. Meaning is rarely paramount, and a comparison of the choices is carried out in declarative and working memory. Similarly, tests of lexical knowledge (i.e., vocabulary) inherently tap declarative knowledge. The content and format of the tests, then, presupposed a reliance on declarative knowledge.

Second, the operationalization of language analytical ability could account for the findings. Language analytical ability was tested in the L2, as were L2 proficiency and metalinguistic knowledge. It follows that instead of assessing three constructs, the tests measured one unitary factor—L2 proficiency. Higher-level learners would be advantaged on a language analysis test conducted in the L2, and proficiency level would, to a certain extent, determine scores on a metalinguistic knowledge test that involved the use of the L2.

Contrary to these previous studies, the present study was situated within a theory of linguistic knowledge. Declarative and procedural determinants of L2 learning were considered a priori, and the design of the testing instruments and tasks was informed by a skill-theory approach to L2 development. This consideration resulted in tests that tapped relatively more declarative or more procedural knowledge. Declarative knowledge was divided into receptive and productive components, and tasks that required each of those abilities were created. Likewise, L2 proficiency was assessed through the use of a task that predisposed learners to rely on L2 procedural knowledge. Assessing linguistic knowledge in these ways results in more valid interpretations derived from the measures.

A similar argument can be put forth for the language learning aptitude measures. In the present study, language learning aptitude was tested in an artificial language that was designed based on linguistic universals—this avoids confounding the ability to learn a language with previous knowledge of a given language (i.e., proficiency). If L2 proficiency is included as a component of a language learning aptitude test, valid interpretation of the test scores is difficult. An examination of the construct validity of the test would be needed before any interpretations or decisions based on test scores could be made. The present study avoided this confound by testing declarative and procedural knowledge independently of language learning aptitude.

In sum, previous studies have examined the relationship between metalinguistic knowledge and language learning aptitude and concluded that the relationship is ambiguous or that they are components of the same factor. The results of the current study suggest that metalinguistic knowledge and language learning aptitude are two distinct factors—one is based on declarative knowledge of language and the other on cognitive abilities implicated in L2 learning. Differences in construct definitions, operationalizations, and methodological rigor explain the conflicting results between previous studies and the present study.


Research Question 2: Effects of Metalinguistic Knowledge and Language Learning Aptitude on L2 Procedural Knowledge

The second research question addressed the extent to which metalinguistic knowledge and language learning aptitude influence the development of L2 procedural knowledge. Metalinguistic knowledge is hypothesized to affect L2 procedural knowledge through the application and proceduralization of declarative knowledge. Language learning aptitude is theorized to impact L2 procedural knowledge by facilitating the rate and ultimate attainment of L2 learning. The three metalinguistic variables and the four language learning aptitude variables were entered into an EFA and subsequently served as manifest variables in hybrid structural models designed to test their effects on L2 procedural knowledge, complexity, accuracy, and fluency.

Summary of the Results for Research Question 2

First, an EFA was conducted. The eigenvalues indicated a four-factor solution, while the scree plot indicated a two- or three-factor solution. Based on eigenvalues greater than 1.0, the four factors accounted for 44.34% of the variance in the data. The first factor was related to metalinguistic knowledge, the second to language learning aptitude, and the remaining two factors were not clearly defined. L2 fluency loaded on the third factor, and L2 complexity loaded on the fourth factor. L2 complexity, accuracy, and fluency were to serve as indicators of L2 procedural knowledge. However, because they lacked any communality, they were used independently as observed dependent variables. Overall L2 procedural knowledge was also assessed; thus, four dependent measures of L2 procedural knowledge were obtained.
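The eigenvalues-greater-than-1.0 decision rule used above is the Kaiser criterion, which can be illustrated with a short Python sketch; the data matrix below is a random placeholder standing in for the observed measures, and the variable count is an assumption for illustration.

```python
import numpy as np

# Kaiser criterion sketch: retain factors whose eigenvalues exceed 1.0.
# `scores` is a hypothetical person-by-variable matrix; replace it with
# the actual interval-level measures in practice.
rng = np.random.default_rng(0)
scores = rng.normal(size=(249, 10))

R = np.corrcoef(scores, rowvar=False)              # correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

retained = eigenvalues[eigenvalues > 1.0]          # Kaiser criterion
explained = retained.sum() / eigenvalues.sum()     # variance accounted for

print(f"{len(retained)} factors retained, {explained:.1%} of variance")
```

The scree plot mentioned above is simply these sorted eigenvalues graphed against factor number, with the "elbow" marking the cutoff, which is why the two rules can disagree.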

Second, the manifest metalinguistic knowledge variables and language learning aptitude variables served as indicators of a metalinguistic knowledge factor and a language learning aptitude factor, respectively. The effects of these factors were assessed by testing their influence on L2 procedural knowledge, complexity, accuracy, and fluency in hybrid structural models. All of the manifest variables loaded significantly on their respective factors. Of interest in these analyses were the regression coefficients from the two factors to the dependent L2 proficiency variables.
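A hybrid model of this kind could be specified in Python with the semopy package, which accepts lavaan-style model syntax. The sketch below is a minimal illustration under assumed variable names and an assumed data file; it is not the software or script used in the study.

```python
import pandas as pd
import semopy

# Hypothetical column names. Measurement part (=~): two latent factors
# with their manifest indicators. Structural part (~): both factors
# regressed onto the L2 procedural knowledge measure.
model_desc = """
MetaKnow =~ receptive + terminology + rule_explanation
Aptitude =~ number_learning + sound_symbol + vocab_learning + lang_analysis
procedural_knowledge ~ MetaKnow + Aptitude
"""

data = pd.read_csv("measures.csv")  # hypothetical file of Rasch measures

model = semopy.Model(model_desc)
model.fit(data)
print(model.inspect())            # estimates, SEs, and p-values per path
print(semopy.calc_stats(model))   # fit indices such as chi-square, CFI, RMSEA
```

Swapping the dependent variable (complexity, accuracy, or fluency) while keeping the measurement part fixed would yield the four models reported below.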

The results of the first structural model indicated that metalinguistic knowledge statistically predicted L2 procedural knowledge (β = .57, p < .001). However, language learning aptitude did not influence L2 procedural knowledge (β = -.07, p = .41). The model showed good fit to the data, χ2(19, N = 166) = 19.89, p = .40, CFI = .99, RMSEA = .02. The factors accounted for 33% of the variance in L2 procedural knowledge.

The results of the second structural model indicated that metalinguistic knowledge statistically predicted L2 complexity (β = .19, p < .05). Language learning aptitude, however, did not influence L2 complexity (β = -.08, p = .45). The model showed good fit to the data, χ2(19, N = 166) = 13.73, p = .80, CFI = 1.00, RMSEA = .00. The factors explained 4.1% of the variance in L2 complexity.

The results of the third structural model indicated that metalinguistic knowledge statistically predicted L2 accuracy (β = .30, p < .001). Conversely, language learning aptitude did not influence L2 accuracy (β = -.08, p = .39). The model showed good fit to the data, χ2(19, N = 166) = 12.38, p = .87, CFI = 1.00, RMSEA = .00. The factors accounted for 10% of the variance in L2 accuracy.

The results of the fourth structural model indicated that metalinguistic knowledge statistically predicted L2 fluency (β = .35, p < .001). On the other hand, language learning aptitude did not influence L2 fluency (β = .02, p = .80). The model showed good fit to the data, χ2(19, N = 166) = 19.24, p = .44, CFI = .99, RMSEA = .01. The factors explained 12% of the variance in L2 fluency.
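The RMSEA values reported for the four models can be recovered from their chi-square statistics with the standard point-estimate formula based on the degrees of freedom and sample size. The small Python check below reproduces them, rounded to two decimals, under that assumed formula.

```python
from math import sqrt

def rmsea(chi2: float, df: int, n: int) -> float:
    """Point estimate of RMSEA from the model chi-square.

    Uses the standard formula sqrt(max((chi2/df - 1) / (n - 1), 0)),
    where n is the sample size; models with chi2 < df yield 0.
    """
    return sqrt(max((chi2 / df - 1) / (n - 1), 0.0))

# The four structural models above, each with df = 19 and N = 166:
for chi2 in (19.89, 13.73, 12.38, 19.24):
    print(round(rmsea(chi2, 19, 166), 2))  # -> 0.02, 0.0, 0.0, 0.01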

In sum, the results indicated that metalinguistic knowledge is a strong predictor of L2 procedural knowledge, a weak predictor of L2 complexity, and a moderate predictor of L2 accuracy and fluency. Conversely, the results did not support the hypothesis that language learning aptitude statistically predicts L2 procedural knowledge, complexity, accuracy, or fluency.

Interpretation of the Results for Research Question 2

The results for the two factors, metalinguistic knowledge and language learning aptitude, were quite different. Metalinguistic knowledge manifested moderate to strong effects on L2 procedural knowledge, while language learning aptitude showed little to no effect. The influence of metalinguistic knowledge on L2 procedural knowledge is considered first, followed by a discussion of language learning aptitude.

Role of metalinguistic knowledge in instructed L2 learning. The results suggest that metalinguistic knowledge facilitates the development of L2 procedural knowledge. This finding is in line with the predictions derived from skill-theory accounts of L2 learning.

Skill-theory accounts of L2 acquisition posit that declarative knowledge becomes proceduralized. Declarative knowledge is proceduralized into chunks (i.e., declaratively held linguistic chunks in the present study), which are activated and manipulated in production and skill use (Anderson, 2002; Anderson, Bothell, Byrne, Douglass, Lebiere, & Qin, 2004). Under these assumptions, without the provision of metalinguistic knowledge, there would be nothing to proceduralize. If L2 learning differed from other cognitive skills, smooth development and direct access to implicit linguistic competence would be expected. Learners would implicitly construct unobservable processing procedures for L2 comprehension and production. However, empirical evidence suggests that much L2 learning happens explicitly, resulting in the construction of declarative knowledge. Indeed, L2 learners who receive declarative rules tend to outperform those who do not (DeKeyser, 1995), and learners proceduralize declarative rules through comprehension and production practice, with the proceduralization following a power function learning curve (DeKeyser, 1997).
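The power function learning curve referenced here has the general form T = a · n^(−b), where T is performance time and n is the number of practice trials. The Python sketch below fits this curve to synthetic data; all numbers are invented for illustration and are not DeKeyser's (1997) results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Power law of practice: performance time T after n practice trials,
# T = a * n**(-b). A larger b means faster proceduralization.
def power_law(n, a, b):
    return a * n ** (-b)

# Synthetic practice data (hypothetical reaction times in seconds).
trials = np.arange(1, 21)
rng = np.random.default_rng(1)
times = power_law(trials, 12.0, 0.5) * rng.normal(1.0, 0.05, trials.size)

# Fit the curve; the estimates should recover a ~ 12.0 and b ~ 0.5.
(a_hat, b_hat), _ = curve_fit(power_law, trials, times, p0=(10.0, 0.4))
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}")
```

The characteristic signature is rapid early gains that flatten with continued practice, which is what proceduralization predicts.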

The current study supports these findings. The learners who participated in this study received explicit instruction that involved metalinguistic terminology, practice, and feedback. The initial metalinguistic instruction created an informational base of declarative knowledge representations of L2 form-meaning relationships. Learners applied, refined, restructured, and reinforced this knowledge as they encountered new grammatical forms and integrated that knowledge into existent schema. The results of this process were evidenced through performance on the declarative and procedural tasks. Technical terminology, rule explanation, and receptive metalinguistic knowledge clearly defined a metalinguistic facet of linguistic knowledge, and these variables predicted performance on a task that tapped L2 procedural knowledge.

As learners use this declarative knowledge, it gradually becomes proceduralized. In this form, it can be accessed more rapidly, which facilitates language production. In the declarative stage, learners can make phrases and sentences based on controlled, explicit reference to metalinguistic knowledge. When sped up, this process facilitates communication, although it can be characterized as hesitant and pause-filled, as learners take time to search for rules and monitor their production, correcting their utterances as they communicate meaning.

The results of this study do not indicate that the learners were relying on implicit linguistic competence as described by Paradis (2009). Implicit linguistic competence involves processing procedures that are not open to explicit inspection. Speech production is characterized as the conceptualization of an idea, the formulation (e.g., grammatical encoding), and the articulation or production (Levelt, 1989). In the case of L1 speech production, these processes are implicitly carried out. Tests of L1 production often result in relatively less variation in the data than those of L2 production. The results of the present study indicate that the learners relied on proceduralized declarative knowledge. No evidence of a restricted performance range was observed. The significant relationship between metalinguistic knowledge and L2 procedural knowledge indicates that learners accessed proceduralized forms and possibly linguistic chunks (N. C. Ellis, 1996), which are represented in declarative knowledge (Anderson, 2002; Paradis, 2009). These memory stores are equivalent to those posited in rule- and exemplar-based models of L2 knowledge (e.g., Skehan, 1998).

The participants completed the writing task in their university courses, which could have encouraged the use of proceduralized knowledge. That is, they might have associated the writing task with classroom language learning, which in turn could have aided in the recall of explicit aspects of the L2. The L2 performances might have differed if they had been conducted outside of the classroom. For instance, if the learners had been communicating with a conversation partner or interacting with interlocutors outside of the university, they might have focused on other aspects of language because they would have been processing input and formulating their responses in a different environment.

The results of the present study also support those of De Jong et al. (2012). They found that speaking proficiency was determined by declarative knowledge, processing knowledge, and pronunciation skills. L2 speaking proficiency was operationalized as the functional adequacy of performance on speaking tasks, and declarative and processing components of proficiency were tested. De Jong et al. suggested that these two components were analogous to declarative and procedural knowledge. A SEM analysis revealed that L2 learners’ processing speed statistically predicted speaking proficiency, and the learners relied on declarative knowledge in meaning-focused tasks. These results lend support to a declarative and procedural model of L2 knowledge representation. The present study also found a significant declarative component in L2 performance. Metalinguistic knowledge significantly predicted L2 performance when judged by the overall adequacy of the output. This finding suggests that learners activate their declarative knowledge when producing language under timed conditions, providing strong evidence for the efficacy of proceduralized declarative knowledge in L2 learning.

Alderson et al. (1997) claimed that metalinguistic knowledge and L2 proficiency are unrelated. Metalinguistic knowledge was not correlated with many of the proficiency measures utilized, and no clear rationale for the lack of correlations was evident, leading the researchers to conclude there is no relationship. The findings of the current study contradict those of Alderson et al. Metalinguistic knowledge was a strong predictor of L2 procedural knowledge. The manifest metalinguistic knowledge variables loaded strongly and uniquely on the metalinguistic knowledge factor, which in turn statistically predicted L2 proficiency. Furthermore, the present study employed an appropriate sample size for the types of analyses. The sample size reported in Alderson et al. leaves open the interpretation that the observed relationships and factors were spurious.

Metalinguistic approaches to L2 education would be limited in scope if learners were to focus only on form. Traditionally, metalinguistic explanations of form have been used to draw learners’ attention to linguistic constructions and grammatical aspects of the L2. These explanations of form can also be used to facilitate the induction of form-meaning mappings. That is, metalinguistic instruction can direct learners’ attention to form during meaning-focused language use.

One facet of instruction through which learners’ attention can be directed to form and meaning is corrective feedback. Corrective feedback can take numerous forms, and it is commonly divided into implicit and explicit, or indirect and direct forms. Direct approaches signal where the errors are, which facilitates learning when learner self-repair is not possible. Indirect approaches indicate problematic words, structures, or phrases, but they require learners to reflect on the specified errors and generate reformulated phrases, thus enhancing deep reflection on form. Metalinguistic feedback provides learners with explicit, metalinguistic descriptions of errors. This type of feedback falls under explicit L2 instruction, which facilitates overall L2 acquisition (Norris & Ortega, 2000).

Studies have shown the efficacy of explicit forms of feedback in L2 learning. Lyster and Ranta (1997) divided pedagogical options for feedback into six categories: explicit correction, recasts, clarification requests, metalinguistic feedback, elicitation, and repetition. Of these, many direct methods were effective, and elicitation and metalinguistic feedback facilitated learning.

Recent reports of investigations into explicit feedback have supported the findings of Lyster and Ranta (1997). Lyster (2004) found that corrective feedback in the form of prompts statistically promoted learning of French grammatical gender compared to recasts on immediate and delayed posttests. Prompting learners encouraged them to focus explicitly on errors and to reformulate their utterances, providing opportunities for proceduralization through practice. Indeed, metalinguistic knowledge in the form of metalinguistic feedback (e.g., use the past tense form of the verb) can immediately benefit learners. Sauro (2009) manipulated feedback conditions, comparing the effects of recasts and metalinguistic feedback on the acquisition of the English zero article. Although the two types of feedback did not differ statistically on immediate or delayed posttests, a significant group-time interaction was found—the metalinguistic feedback group outperformed the control group on immediate and delayed posttests. Sheen (2007) investigated the effects of explicit written corrective feedback on the acquisition of English article usage. A direct correction group, a direct metalinguistic group, and a control group were compared on their acquisition of English articles. Both experimental groups outperformed the control group on the immediate posttests, and the metalinguistic group outperformed the direct group on the delayed posttest. Thus, the results indicate that metalinguistic feedback is more effective in developing learners’ L2 accuracy, and the effects are more durable than those of direct corrective feedback. Similarly, Sheen (2010) examined the relative effects of oral and written corrective feedback on English article learning. Of the four experimental conditions, three were related to explicit or metalinguistic corrective feedback. The results showed that these three groups outperformed the oral recast group (i.e., relatively implicit corrective feedback) and the control group. Furthermore, the explicit and metalinguistic groups developed a statistically higher level of awareness of the target structures than did the oral recast group. Thus, metalinguistic knowledge aids in the noticing and learning of relatively nonsalient L2 forms.

The findings derived from these studies point to the facilitative effects of metalinguistic knowledge. Learners can benefit immediately upon the provision of metalinguistic feedback and instruction, and the effects persist over time. These findings explain the results of this study. In the current study, metalinguistic knowledge statistically predicted L2 procedural knowledge, complexity, accuracy, and fluency. The participants received large amounts of explicit English instruction during junior and senior high school. The effects of this instruction manifested themselves through the observed relations between metalinguistic knowledge and L2 procedural knowledge, accuracy, and fluency, and to a lesser extent, complexity.

The effectiveness of corrective feedback is mediated by individual differences (Lyster, Saito, & Sato, 2013), and so, by extension, metalinguistic knowledge development and use can be influenced by learners’ internal characteristics. The effects of metalinguistic knowledge observed in the present study were robust even after controlling for language learning aptitude. Observing these effects, while holding language aptitude constant, suggests that the results are substantively meaningful. Regardless of whether some learners had unique aptitudes for certain learning or acquisitional processes such as rote learning or language analytical ability, metalinguistic knowledge explained significant portions of the variance in the dependent L2 procedural and developmental measures.

Metalinguistic feedback can facilitate L2 learning and subsequent acquisition by drawing learners’ attention to incorrect forms and supporting acquisitional processing. The learners can then notice the target structures (Schmidt, 1990), and reflect and revise hypotheses. The revised hypotheses can then be tested through further output, including within a feedback routine of prompts (Lyster, 2004). This process enhances syntactic restructuring and, in the written modality, lowers the demands on working memory that can occur when monitoring and testing hypotheses in meaning-focused oral communication. Furthermore, metalinguistic knowledge enhances the salience of target forms that elude learners’ attention and noticing. Grammatical forms vary according to their saliency (Gass, 1997; Goldschneider & DeKeyser, 2001), and metalinguistic feedback and classroom-based language-related episodes bring less salient forms into learners’ focus. Overall, these processes enhance learners’ accuracy and contribute to the formation of procedural knowledge, which enables fluent language production.

The development, use, and proceduralization of metalinguistic knowledge can account for a portion of the variance in accuracy and fluency. However, relatively less variation in complexity was explained by learners’ metalinguistic knowledge. These developmental measures cannot be assumed to develop simultaneously, and attention is needed for their deployment during communication. The concepts of cognitive tradeoffs, avoidance, and proceduralization offer explanations for these relationships.

First, tradeoffs can occur in language production. When learners focus on the accuracy of their utterances, fluency and complexity can suffer. This position is supported by limited-capacity models of cognitive processing. During online processing, certain language features are prioritized, leaving others absent or diminished. The learners in this study might have prioritized fluency by attempting to write as much as possible, or they might have spent cognitive resources on producing accurate phrases. Activating metalinguistic knowledge for accuracy and fluency would leave scarce resources for focusing on complexity.

Second, complexity might have been difficult to measure due to avoidance. Learners were not required to use certain linguistic structures for task completion; therefore, they could have avoided using forms of which they were uncertain. Indeed, as the learners knew that someone would read their essays, they might have attempted to use only structures with which they were relatively confident. Approaching the task in this manner would result in relatively more fluent or accurate production, but complexity measures would be less related to the factors utilized for producing fluent, accurate language.

Third, for intermediate to high-intermediate learners, complex linguistic structures are inherently difficult, and it can be assumed that the learners did not possess complex proceduralized linguistic knowledge. Numerous hours of practice and much corrective feedback would be needed to proceduralize declarative knowledge of complex forms into procedural knowledge that is accessible under timed conditions. Moreover, such a transformation would be moderated by learners’ developmental readiness (Pienemann, 1984, 1989), rendering a limited number of explicit attempts at integrating a new structure insufficient. Thus, even if the learners had some basic declarative knowledge of complex rules, this knowledge was not proceduralized to the extent necessary for effortless, online use. This interpretation explains the relatively lower influence of metalinguistic knowledge on L2 complexity.

In the present study, receptive and productive metalinguistic knowledge were measured. Both tests required the use of Japanese to identify or describe how linguistic rules work in English. Measures from these tests were significantly related to L2 procedural knowledge (i.e., L2 proficiency). At first glance, this result might seem to support a grammar-translation approach to instructed L2 learning. However, such a cursory reading of the results would be misguided. Increases in metalinguistic knowledge were associated with increases in L2 procedural knowledge, complexity, accuracy, and fluency, but the method and amount of practice needed to achieve those levels of proficiency were not explicitly examined. Although the learners possessed declarative knowledge related to English rules, it is doubtful that they explicitly recalled each individual rule during online production. The type of knowledge implicated in the time-pressured writing task is procedural—that is, proceduralized rules or chunked phrases manifested through the practice and proceduralization of declarative knowledge. To proceduralize a rule, recall and application of a given rule under increasingly demanding task conditions is necessary (DeKeyser, 2007). Therefore, the observed performance of the learners in this study reflects the results of practice, not the efficacy of a purely metalinguistic approach to L2 learning and teaching.

In many English classes in Japanese schools, the Japanese language is used to explain English rules, word meanings, phrases, and sentences. This type of instruction can provide opportunities for learners to increase their metalinguistic knowledge, but it cannot supply implicit or procedural knowledge to the learners. The results of the current study should be interpreted cautiously in that they suggest that some amount of declarative knowledge facilitates the proceduralization of that knowledge, but that is only one part of the equation. Production practice that involves the use of declarative rules and chunked phrases in meaning-focused communication is a necessary component of successful L2 learning. This aspect of L2 practice should be incorporated and strengthened in Japanese English classrooms.


Language learning aptitude and L2 procedural knowledge. Language learning aptitude did not statistically predict any of the L2 procedural knowledge variables. The standardized regression weight of the path from language aptitude to L2 procedural knowledge was -.07, which was not statistically significant. Similarly, the paths to complexity, accuracy, and fluency were -.08, -.08, and .02, respectively, all of which were nonsignificant.

The lack of statistically significant relationships implies that language learning aptitude does not impact L2 procedural knowledge or L2 developmental measures (i.e., CAF). This finding is at odds with the numerous studies that reported significant effects of language learning aptitude on L2 proficiency. Harley and Hart (1997) found that young learners relied more on memory-related abilities and late immersion learners used analytical abilities. They also included measures of writing ability and written accuracy, which were statistically correlated with components of language learning aptitude. Regression analyses were also conducted, but there were only 36 participants in the early immersion group and 29 in the late immersion group. Similarly, Harley and Hart (2002) found that language analytical ability predicted cloze and sentence repetition; however, it is not clear if these proficiency measures are relatively more declarative or procedural. Memory for text predicted aspects of accuracy. With a sample size of 31 participants, the appropriateness of subjecting the data to regression analyses is questionable. In the present study, proficiency measures were hypothesized a priori to tap declarative or procedural knowledge, and the linguistic knowledge tasks were qualitatively different from the language learning aptitude tasks. If the proficiency measures in the reviewed studies were derived from tasks that shared characteristics with the aptitude test tasks, task-related communalities could explain the observed relationships. However, the analyses for the second research question examined the effects of language learning aptitude as a whole. Language aptitude comprised memory-related abilities and language analytical abilities. The effects of specific components of aptitude are addressed in the discussion of the third research question.

Negative relationships between language learning aptitude and L2 proficiency have been reported in the literature. In a study of instructional treatments and language learning aptitude, Erlam (2005) found that for the inductive instruction group, language learning aptitude was positively related to listening and written posttest gain scores. Delayed posttest oral production gain scores, however, exhibited a different pattern. These scores were negatively correlated with language learning aptitude. It is possible that the oral production test tapped procedural knowledge to a higher extent than did the written production test. This interpretation would explain why analytical components of language aptitude predict some test scores but not others. Hummel (2009) reported a nonsignificant correlation between the MLAT paired associates subtest and L2 proficiency but did find statistical relationships between overall aptitude and reading scores, and between aptitude and grammar scores.

The way in which L2 proficiency is measured can affect the relationships among variables. If L2 proficiency tests tap declarative knowledge, higher correlations with controlled, analytical components of language aptitude might be observed. One reason for this is the explicit manipulation of linguistic rules that is tapped by tests of language analytical ability. The skills needed to score highly on such tests are reflective of abilities implicated in controlled processing of language. Until declarative knowledge is proceduralized, learners must manipulate explicitly learned rules and items to create phrases and sentences. These types of skills are represented on analytical ability tests, which could explain the connection between language learning aptitude and declarative knowledge.

Conversely, if characteristics of proficiency tests are manipulated to tap procedural knowledge, lower or negative relationships with language learning aptitude might be observed. Carefully designed studies in which tests of L2 knowledge are designed a priori to tap declarative or procedural knowledge are needed to test these hypotheses. Furthermore, language learning aptitude needs to be tested in a language unknown to the participants to avoid linguistic confounds. If language learning aptitude is tested in learners’ L1, then their aptitude could develop as a skill. That is, if learners explicitly study properties of L1 grammar, such as those covered in undergraduate and graduate linguistics courses and in language arts courses in junior and senior high school, their sensitivity to grammatical functions and constructions in their L1 should increase.

Any observed increases in grammatical sensitivity, which is a core component of the MLAT, could be attributed to explicit study of the L1. Conversely, when language learning aptitude is tested in an artificial language that is unknown to the participants, effects of L1 grammatical sensitivity are controlled for. Learners would be tasked with learning an unknown language, which, ideally, would use an orthographic system that is dissimilar to that of their L1. This could result in purer measures of language learning aptitude, which would be relatively unaffected by L1 grammatical sensitivity.

One explanation for the differential findings is that language learning aptitude exerts significant effects in the early stages of L2 learning, but these effects diminish over time. Indeed, when relating these findings to Skehan’s (2002) comparison of L2 acquisition processing stages and language learning aptitude components, the aptitude test used in the present study focused on only the first four processing stages (i.e., noticing, pattern identification, extending, and complexifying). As each aptitude subtest targeted some aspects of these stages, the resultant language learning aptitude factor comprised varying degrees of these aptitude components. These components are theorized to be facilitative at the beginning stages of language learning or during the initial stages of learning unknown forms. While these aptitudes are important for rapid and successful L2 learning in the early stages of acquisition, they do not capture the processing procedures implicated in procedural or automatized L2 production. After learners integrate new forms and restructure their interlanguage, accuracy must be increased through proceduralization, which in turn enables rapid retrieval processes to operate on the newly integrated forms. Rule-based structures are then automatized, and lexicalized chunks are stored and retrieved. The L2 procedural knowledge test induced learners to engage and apply the latter processes for meaning-focused language production. The language learning aptitude measures employed in the present study might not have sufficiently captured these processes. Thus, the processing procedures implicated in L2 procedural knowledge use were not adequately represented on the language learning aptitude test, which accounts for the relatively nonexistent influence of the aptitude construct.

A second explanation for the findings related to language learning aptitude is that L2 procedural knowledge was measured by learners’ performance on a timed writing task. Characteristics of the writing task could have attenuated the impact of language aptitude. For instance, a 25-minute time limit was imposed for the completion of the writing task. This time limit could have been too long or too short for some participants. Giving learners too much time for task completion could result in an increased reliance on metalinguistic knowledge. Conversely, providing too little time could tax learners beyond their abilities, which could result in unsystematic variance in the data—asking learners to complete tasks that are beyond their present abilities could produce unpredictable and unrepresentative performances. Moreover, if a timed speaking task had been used, different associations might have been observed in the data. The structure of the test, however, would most likely determine those associations. For example, if learners were asked to respond at length to a given prompt, they would still have time to reflect on their metalinguistic knowledge and apply linguistic rules when speaking. If a sentence-level task were used, such as an elicited imitation task, the learners would have less time to reference their declarative knowledge. However, using such a task would reduce the learners’ focus on meaning, creating a situation in which learners’ knowledge of certain grammatical structures would be the object of inquiry. Thus, a wider variety of tasks designed to tap L2 proceduralized knowledge could offer a different view of learners’ abilities, which in turn could be affected by their aptitudes for L2 learning.


A third explanation for the lack of statistically significant relationships among language learning aptitude and the L2 proficiency measures is the distribution of aptitude measures. Four aptitude scales served as manifest variables for the language aptitude factor. Ceiling effects in the distribution of measures on the manifest variables could have affected the observed relationships between language learning aptitude and the L2 proficiency measures. Sound-symbol association, vocabulary learning, and language analysis subscales were relatively easy for the participants. Many scores were tightly grouped at the upper end of the distributions. A lack of variance could have attenuated the correlations among language aptitude and L2 proficiency measures. If this is the case, it indicates that current measures of language learning aptitude cannot sufficiently measure a wide range of aptitude. Subscales with more difficult items would be needed to separate further the participants in the upper ranges of the constructs. As a 100-item aptitude test was used in this study, simply adding additional items is not a practical or efficient method for achieving a useful language aptitude test. However, the aptitude test consisted of four subscales. This composition resulted in three of the four scales having 15 or 20 items each. If the components of aptitude that were least relevant to a certain study were identified, those scales could be removed and the remaining scales could be lengthened. This adjustment would result in improved reliability and separation coefficients, while maintaining a reasonable test length so that language aptitude could be measured in a judicious amount of time.
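A ceiling effect of the kind described above can be screened for with simple descriptive checks, as in the Python sketch below. The subscale scores are simulated placeholders, not the study's data, and the 20-point maximum is an assumption for illustration.

```python
import numpy as np
from scipy.stats import skew

# Simulated raw scores on a hypothetical 20-item aptitude subscale,
# generated so that many scores pile up near the maximum.
rng = np.random.default_rng(2)
scores = np.clip(rng.normal(18, 2.5, size=249).round(), 0, 20)

# Three quick indicators of a ceiling effect:
print(f"skewness: {skew(scores):.2f}")                  # strongly negative
print(f"at/near ceiling: {np.mean(scores >= 19):.1%}")  # large share at top
print(f"SD: {scores.std():.2f}")                        # restricted variance
```

Restricted variance at the top of a scale mechanically attenuates correlations with other measures, which is the concern raised here.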

One candidate for removal is the sound-symbol association scale. Many studies have reported low reliability coefficients for this variable—rarely do they reach or exceed .70. Furthermore, this ability is implicated in the beginning stages of L2 learning as learners must learn a new orthographic system and construct a network of L2 sound-symbol relationships. However, after learners create a functional base of associations, the effects of this aptitude component decrease over time. Declarative memory functions are implicated in L2 learning as the number of vocabulary items, forms, and rules increases. This declarative knowledge must be manipulated in meaning-focused activities and tasks, and eventually proceduralized. Thus, aspects of working memory might serve as better indicators of aptitude and learning than the construct of sound-symbol association. Indeed, researchers have turned their attention to this possibility. Working memory has been shown to influence the acquisition of morphosyntax (Sagarra & Herschensohn, 2010), and phonological working memory has been shown to predict L2 proficiency (Hummel, 2009; O’Brien, Segalowitz, Collentine, & Freed, 2006; O’Brien, Segalowitz, Freed, & Collentine, 2007; see Baddeley, 2012, for a review of current working memory theory).

Finally, it should be noted that the models used to test the relationships among the constructs were relatively simple—this was by design. Simple models tend to generalize better, with the core structure of the model capturing the overall trends in the data. In this study, however, metalinguistic knowledge and L2 procedural knowledge were normally distributed. This could imply that an unmeasured variable influences both constructs. Indeed, numerous variables could be hypothesized as influencing either metalinguistic knowledge or L2 procedural knowledge. One such candidate is academic ability. Learners’ ability to focus on explicit instruction in class and to retain declarative knowledge related to L2 instruction could influence the amount and accuracy of their metalinguistic knowledge. These individual differences could then affect L2 procedural knowledge, primarily through their influence on metalinguistic knowledge. In the tested models, metalinguistic knowledge is hypothesized to become proceduralized. If learners start with differing amounts and qualities of metalinguistic knowledge, the resulting procedural knowledge could be affected. However, the purpose of the current study was to investigate the relationships among metalinguistic knowledge, language learning aptitude, and L2 procedural knowledge. The inclusion of additional variables in the models is a prime avenue of inquiry for future research.

In sum, the findings for the second research question suggest that metalinguistic knowledge exerts a moderate to strong effect on L2 procedural knowledge. Among the developmental measures, metalinguistic knowledge weakly affects the development of complexity and moderately influences accuracy and fluency. These findings offer new insights into the role of metalinguistic knowledge in L2 learning. As a pedagogical tool, metalinguistic knowledge tends to go in and out of favor. However, the results of the present study suggest that metalinguistic knowledge is proceduralized and applied in meaning-focused L2 tasks.

Research Question 3: Effects of Components of Metalinguistic Knowledge and Language Learning Aptitude on L2 Procedural Knowledge

The third research question addressed the relative contributions of components of language learning aptitude and metalinguistic knowledge to the development of L2 procedural knowledge. Two metalinguistic variables, receptive metalinguistic knowledge and rule explanation, and the four language learning aptitude variables, number learning, sound-symbol association, vocabulary learning, and language analytical ability, were entered into a path model to assess their relative contributions to L2 procedural knowledge. Each of the four components of language learning aptitude was hypothesized to predict L2 procedural knowledge, as were receptive metalinguistic knowledge and rule explanation. Additionally, as memory-related components of language learning aptitude could influence the development and retention of declarative metalinguistic knowledge, language analytical ability and vocabulary learning were hypothesized to influence receptive metalinguistic knowledge and rule explanation (i.e., productive metalinguistic knowledge).

Summary of the Results for Research Question 3

The initial results of the path model analysis indicated poor model fit, χ2(11, N = 166) = 113.88, p < .05, CFI = .39, RMSEA = .24. Thus, the regression paths were not interpreted—instead, model modification indices were consulted. It should be noted that at this point, the modeling was exploratory. Based on the model modification indices, covariances were added to the model. In total, seven modifications were made to the initial model. The results need to be interpreted cautiously because even though many of the initial regression paths were not statistically significant, they were retained in the model. In lieu of removing direct paths of influence, only covariances were added. The rationale for this modification method was that only regression paths were of interest in the model—covariances among variables were not predicted, nor were they highly relevant to the research question.

After the addition of seven covariances, acceptable model fit was achieved, χ2(4, N = 166) = 2.00, p = .74, CFI = 1.00, RMSEA = .00. Three statistically significant paths remained in the model: rule explanation to L2 procedural knowledge, β = .47, p < .001; language analytical ability to receptive metalinguistic knowledge, β = .17, p < .05; and number learning to L2 procedural knowledge, β = -.16, p < .05. The independent variables accounted for 34% of the variance in L2 procedural knowledge. Vocabulary learning and language analytical ability accounted for 3% of the variance in receptive metalinguistic knowledge.
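For readers who want to see the shape of such a specification, the sketch below continues the earlier hypothetical semopy example: the regression paths are kept, while residual covariances (the ~~ operator in lavaan-style syntax) are added during the exploratory modification step. The variable names and the particular covariances shown are placeholders for illustration, not the seven modifications actually made in the study.

```python
import semopy

# Hypothetical RQ3 path model in lavaan-style syntax. Regressions use ~;
# residual covariances suggested by modification indices use ~~. The two
# covariances below are illustrative placeholders only.
path_desc = """
procedural ~ receptive + rule_expl + number + soundsym + vocab + analysis
receptive ~ vocab + analysis
rule_expl ~ vocab + analysis
soundsym ~~ vocab
number ~~ analysis
"""

path_model = semopy.Model(path_desc)
# After path_model.fit(data), semopy.calc_stats(path_model) would report
# the chi-square, CFI, and RMSEA used to judge fit after each modification.
```

The design choice described above, adding covariances rather than pruning nonsignificant paths, keeps the substantive regression structure intact so that the coefficients of interest remain comparable across modifications.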

Interpretation of the Results for Research Question 3

The finding that rule explanation statistically predicted L2 procedural knowledge is significant in that a single declarative variable exhibited a moderate to strong effect (β = .47, p < .001). This implies that learners’ declarative knowledge of rules facilitates L2 development. This finding supports skill-theory accounts of L2 learning. Learners received explicit instruction on rules governing the operation of English forms. Cognitive processes are applied to the declarative representations during instruction and practice, which promotes the proceduralization of declarative forms for use as procedures in meaning-focused tasks (Anderson, 2002; DeKeyser, 1995, 1997, 1998; Johnson, 1996).

In the present study, productive metalinguistic knowledge was a stronger predictor of L2 procedural knowledge than was receptive metalinguistic knowledge. As learners were asked to identify errors and verbalize, in written form, rules governing the formation of grammatical L2 utterances, the ability to express this knowledge might reflect a higher degree of knowledge integration and restructuring. The receptive metalinguistic knowledge test only required learners to select a metalinguistic term or a sentence that serves as an example of a metalinguistic term. Even if this knowledge was inefficiently organized in various schemas, the untimed condition of the test enabled explicit reflection on declarative knowledge stores. Conversely, while still held in declarative memory, the productive metalinguistic knowledge test required the activation of multiple declarative sources to reflect on, judge, and match the errors to explicit grammatical rules. This declarative rule knowledge might have been proceduralized to some degree, which would explain the stronger influence of productive metalinguistic knowledge on L2 procedural knowledge.

The finding that language analytical ability predicted receptive metalinguistic knowledge but not L2 procedural knowledge was somewhat unexpected. It was hypothesized that analytical ability would be statistically related to L2 procedural knowledge. Analytical ability has been implicated in late immersion learners’ language learning (Harley & Hart, 1997), which characterizes adult L2 learning—early immersion learners relied on memory-related abilities. Analytical ability has been associated with performance on certain proficiency measures, namely cloze and sentence repetition scores (Harley & Hart, 2002). Language analytical ability has also been implicated in ultimate attainment for immigrants who achieved high grammatical ability (DeKeyser, 2000). In DeKeyser’s study, grammatical sensitivity might have been assessed because a Hungarian translation of the MLAT words-in-sentences test was used.

An alternative explanation for the previous findings related to language analytical ability, however, is that the proficiency tests used in those studies assessed declarative aspects of L2 proficiency. In some studies, productive measures of L2 proficiency were not used, or if production was required, it took the form of repetition of stimuli. In these cases, the findings of the current study would support the interpretations drawn from previous studies. In this study, language analytical ability predicted receptive metalinguistic knowledge; therefore, analytical abilities and grammatical sensitivity are most facilitative to task performance when explicit reflection on declarative knowledge is allowed. Indeed, learners’ language analytical ability, as it is operationalized on many language aptitude tests, is measured by performance on tests of declarative rule induction, comprehension, and application, which are carried out in a controlled, monitored manner. These skills might indirectly affect L2 procedural knowledge through the use of receptive metalinguistic knowledge. In the present study, the effects of receptive metalinguistic knowledge approached statistical significance—the p-value of the regression path from receptive metalinguistic knowledge to L2 procedural knowledge was slightly greater than .05.

Language aptitude number learning statistically predicted L2 procedural knowledge. The strength of the effect was weak (β = -.16, p < .05), but the direction of the relationship was unexpected. That is, number learning and L2 procedural knowledge were negatively related, which means that higher number learning measures were associated with lower L2 procedural knowledge measures.

One interpretation of this finding is that the cognitive processes implicated in language aptitude number learning are different from or in competition with those associated with L2 procedural knowledge. The number-learning test used in this study tapped primarily phonological memory. Participants spent approximately five minutes learning the pronunciations of the numbers 1, 2, 3, 4, 10, 20, 30, 40, 100, 200, 300, and 400 in an artificial language. They were not permitted to take notes; thus, all learning was based on aural input processed in phonological memory. After the training period, the participants took dictation of two- and three-digit numbers. The training and dictation were conducted with recorded materials, which served as a timekeeper. In the dictation, the numbers were read once with only a few seconds between the 15 test items.

Phonological working memory has been associated with the learning of restricted sets of L2 grammatical rules in laboratory studies (Williams & Lovatt, 2005) and with gains in L2 oral production (O’Brien et al., 2006). However, the L2 procedural knowledge test used in the current study required learners to apply knowledge proceduralized over a number of years (i.e., L2 proficiency) in written production. It can be assumed, then, that phonological memory was not implicated in the completion of the timed writing task. The participants might have relied on proceduralized knowledge of rules, which would lower the demands on working memory. Higher degrees of proceduralization result in the creation of processing procedures; these procedures create less cognitive burden than the exclusive use of nonproceduralized declarative knowledge or working memory. Processing procedures are developed through practice (Anderson, 2002; DeKeyser, 2007). These differences might explain the negative relationship between language aptitude number learning and L2 procedural knowledge.

In sum, the findings from the third research question suggest that rule explanation is a strong predictor of L2 procedural knowledge. This type of declarative knowledge is beneficial to learners and can be applied, in proceduralized form, to meaning-focused tasks. Second, language analytical ability predicts receptive metalinguistic knowledge. This finding suggests that analytical ability facilitates the learning and use of declarative knowledge. Finally, language aptitude number learning and L2 procedural knowledge are negatively related, which can be explained by the types of memory implicated in those tasks—number learning relies on phonological working memory, while meaning-focused writing tasks are informed by procedural processes developed through practice.

In the next chapter, I summarize the findings and consider the limitations of the present study. Suggestions for future research are provided, followed by the final conclusions.

Theoretical Implications

The theoretical implications of the results of the present study are organized according to the three research questions that guided this study. The results of the first research question indicated that metalinguistic knowledge and language learning aptitude are distinct constructs. This finding sheds new light on the relationship between these constructs. Previous studies have viewed metalinguistic knowledge as part of language learning aptitude (e.g., Alderson et al., 1997; Ranta, 2002; Roehr, 2008b), situating metalinguistic knowledge as a component of language aptitude. The findings of the current study indicate that metalinguistic knowledge can develop independently of language learning aptitude. There exists a need for researchers to distinguish between declarative knowledge held by L2 learners and their cognitive ability to encode, recall, and apply linguistic knowledge in different stages of L2 learning (e.g., Skehan, 2002). Distinguishing between these two constructs could inform SLA theory development and descriptions of L2 language representation.

The results of the second research question indicated that metalinguistic knowledge is implicated in the development of L2 procedural knowledge, complexity, accuracy, and fluency. This finding is significant for SLA theory development in that it suggests differential effects of metalinguistic knowledge on varying developmental L2 measures. First, it suggests that metalinguistic knowledge can be proceduralized, which provides support for skill-theory accounts of L2 learning. Numerous accounts of L2 learning have been proposed (e.g., VanPatten & Williams, 2007), and empirical evidence is needed to revise theoretical accounts of L2 development. The findings of the current study serve as evidential support for the proceduralization of L2 declarative knowledge. This proceduralization manifests differential effects on complexity, accuracy, and fluency.

The results of the third research question indicated that declarative knowledge of L2 rules accounted for a significant portion of the variance in L2 procedural knowledge. This finding also supports skill acquisition accounts of L2 learning and offers specificity as to the type of knowledge most associated with L2 procedural knowledge. Productive rule explanation exhibited stronger loadings on L2 proceduralized knowledge than did receptive metalinguistic knowledge and the components of language learning aptitude. This result can serve as a starting point for future research when examining the impact of acquisition-related variables on the development of L2 procedural knowledge and implicit linguistic competence.

Pedagogical Implications

The finding that metalinguistic knowledge influences L2 procedural knowledge, complexity, accuracy, and fluency has implications for L2 teaching and learning. First, L2 teachers could apply these findings in the instructional balance and materials that they use. Many English lessons in Japanese junior and senior high schools are conducted in Japanese and consist of translation activities (Gorsuch, 2000). The use of Japanese could be legitimized if it were limited to the provision of L1-mediated explanations of L2 grammatical rules. These rules serve as the declarative content that is required for subsequent proceduralization.

Nation (2007) suggested that language-focused learning should comprise approximately 25% of total instructional time. This type of instruction provides explicit focus on declarative aspects of language, allowing learners to reflect deeply on form. Through explicit instruction and reflection, learners can develop metalinguistic knowledge related to various aspects of the L2, including morphology, pronunciation, vocabulary, pragmatics, and discourse.


After the provision of metalinguistic explanation, the key pedagogical intervention that is needed is proceduralization practice. Nation (2007) divides this practice into three components: meaning-focused input, meaning-focused output, and fluency development. Germane to the current study is fluency development. This concept is functionally equivalent to proceduralization in that both processes make proceduralized forms available for use in online communication.

To proceduralize linguistic rules, learners should practice the application of a given rule while holding it in working memory (DeKeyser, 2007). The beginning stages of practice should be defined by controlled use of the structure. After initial familiarity emerges, task characteristics should be manipulated to gradually increase task difficulty. Most importantly, time pressure should be applied to the practice. Requiring learners to produce output while decreasing the amount of time allotted to communicate a message induces restructuring of linguistic knowledge. Declarative knowledge is realigned so that it can be accessed effortlessly, and knowledge is restructured to allow for rapid processing of stimuli and linguistic output. An example of this type of practice is the retelling of information in a 4/3/2 activity (Nation, 2007). In this activity, learners are given four minutes to explain a given scenario, and after changing partners, they must retell the information in three minutes, and then within two minutes. Time pressure is increased at each stage of the activity, pushing learners to proceduralize their declarative knowledge.

This type of practice seems to be absent from the curricula implemented in some Japanese junior and senior high schools. Of course, there are many schools that implement curricula that focus on the development of communicative L2 skills. However, it is fair to say that in many cases, language-focused instruction commands much more than 25% of the instructional time, which leaves little time for proceduralization practice. Stated another way, at least 25% of instruction is most likely not focused on proceduralization practice. One reason that teachers might shy away from realizing this type of practice in classrooms is that responsibility can shift from the teacher to the learner when implementing fluency-building activities. Students take on more responsibility for their learning in that they must put forth effort in the application of their declarative knowledge. Students can be passive learners in teacher-led, language-focused lessons. In fluency-building tasks, students must take an active role. When these roles shift, teachers can facilitate learners’ L2 development through the provision of feedback, including the use of prompts to push learners to reformulate their output.

In the present study, metalinguistic knowledge positively impacted three developmental measures of L2 acquisition: complexity, accuracy, and fluency. The largest effects were on fluency and accuracy. As proceduralization begins with the easier, more frequent structures in the L2, teachers should ensure that students receive sufficient amounts of production practice at the beginning stages of L2 learning. This could lead to significant gains in fluency as the cognitive burden of meaning-focused language use is lessened through proceduralization, and to gains in accuracy as the metalinguistic knowledge guides learners’ monitoring and self-correction.

Effects of metalinguistic knowledge should be less observable on L2 complexity due to the cognitive burden of using complex language. Complex language that is outside learners’ current level of competence is by definition unproceduralized. Only after explicit learning and sufficient proceduralization practice would complex forms be expected to appear in time-pressured output. If there is time pressure to communicate meaning, employing complex or newly learned structures accurately and fluently would create a significant cognitive burden. Thus, L2 instructors should focus on the balance between language-focused instruction and proceduralization practice through all stages of learners’ development, which in turn should assist L2 learners in developing the knowledge and skills necessary to communicate effectively in the L2.


CHAPTER 7

CONCLUSION

In this chapter, the findings of this study are summarized. Then, the limitations of the study are considered, followed by suggestions for future research. Finally, a few concluding considerations on the topic of this study are offered.

Summary of the Findings

Three research questions pertaining to the relationship between metalinguistic knowledge and language learning aptitude and their relative effects on L2 procedural knowledge guided this study. The first research question inquired about the factorial structure of metalinguistic knowledge and language learning aptitude, with a focus on reconciling the findings of previous studies. The second research question was concerned with the relative effects of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge. Four operationalizations of procedural knowledge were employed: (a) overall L2 procedural knowledge, (b) L2 complexity, (c) L2 accuracy, and (d) L2 fluency. These variables constitute measures of overall L2 procedural knowledge and L2 developmental measures. The third research question focused on the relative contributions of the components of metalinguistic knowledge (i.e., receptive and productive) and language learning aptitude (i.e., number learning, sound-symbol association, vocabulary learning, and language analytical ability) to L2 procedural knowledge.


In this study, metalinguistic knowledge was measured in the participants’ L1 (i.e., Japanese). As previous studies tended to use metalinguistic knowledge tests given in learners’ L2, new instruments were needed to assess L1 metalinguistic knowledge. To this end, two tests were created to measure L1 Japanese metalinguistic knowledge. As many learners of English as a foreign language receive L1-mediated instruction, mental representations of metalinguistic knowledge are primarily held in the L1 and are connected to L2 structures and forms. The first instrument was a test of receptive metalinguistic knowledge, which consisted of 22 multiple-choice items. The second instrument was a test of productive metalinguistic knowledge, composed of 17 items that asked learners to identify errors in English sentences and to explain them using metalinguistic terminology. Scores for technical terminology and rule explanation were derived from this test.

Language learning aptitude was tested using the Lunic Language Marathon, which is a language aptitude test designed for use with L1 Japanese participants. The 100-item test consisted of four subtests: number learning, sound-symbol association, vocabulary learning, and language analytical ability. This instrument assesses the learning of an artificial language, thus controlling for L1 and L2 influences in the measurement of language aptitude.

L2 procedural knowledge was assessed through the use of a timed writing task. Participants wrote a meaning-focused, timed essay in response to a given prompt. The essays were scored for overall quality by two raters—these ratings served as indicators of L2 procedural knowledge. The essays were also scored for indicators of L2 complexity, accuracy, and fluency (CAF). These developmental measures represent unique aspects of L2 learning that can be differentially proceduralized.

Scores from each of these instruments were subjected to Rasch analyses to examine the fit of each item to the Rasch model, the targeting of the items to the sample, the construct coverage of the items, the reliability and separation of the measures, and the dimensionality of the test items. For the receptive metalinguistic knowledge test, the Rasch reliability (separation) was slightly low at .67 (1.43); however, the items showed good fit to the Rasch model, and no indications of multidimensionality were observed. The Rasch reliability (separation) estimates for the technical terminology and rule explanation scales were sufficient at .78 (1.91) and .81 (2.05), respectively. With regard to the language learning aptitude scales, the Rasch reliability (separation) for the number learning scale was high at .87 (2.59). The other aptitude scales exhibited lower Rasch reliability (separation) estimates: sound-symbol association = .65 (1.36), vocabulary learning = .79 (1.97), and language analytical ability = .68 (1.45). As measured by Cronbach’s alpha, all of the reliability estimates for the aptitude scale scores were .79 or higher (min = .79, max = .89). With regard to the procedural knowledge test, the ratings from the two judges were highly correlated (r = .94), and the multifaceted Rasch model was used to control for rater severity. The ratings for L2 complexity, accuracy, and fluency (CAF) were also strongly correlated (all rs = .99 or 1.00). A principal components analysis was conducted to examine the functioning of each CAF measure and to consolidate the measures into component scores. The Rasch analyses provided interval-level measures that were used in subsequent analyses.


Before conducting the analyses related to each research question, the data were screened to assess the accuracy of the data entry, the suitability of the participants, the presence of univariate outliers, the normality of the distributions, the linearity and homoscedasticity of the variables, and the presence of multivariate outliers. Nine participants were removed based on background variables, and a small number of univariate outliers were removed from the data set. No clear deviations from normality were detected in the distributions of the variables, and the linearity and homoscedasticity of the variables were confirmed. No multivariate outliers were detected in the data.

With regard to the first research question, exploratory factor analysis revealed that the metalinguistic knowledge and language learning aptitude measures loaded on two factors, which accounted for 45.61% of the variance. This result suggests that metalinguistic knowledge and language learning aptitude are distinct constructs. Confirmatory factor analysis was then used to validate this finding. The regression weights from the two latent variables to their respective indicators were all statistically significant, and metalinguistic knowledge and language learning aptitude exhibited a weak, nonsignificant relationship (r = .11, p = .32). This finding is significant in that it results from a more rigorous test of the relationship between these two constructs than has been reported in the literature, and it offers a different perspective on the association between metalinguistic knowledge and the ability to learn a new language.
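The two-factor pattern is visible in the raw correlations themselves (Appendix I): the three metalinguistic measures correlate substantially with one another, the four aptitude subtests correlate modestly with one another, and the cross-block correlations are near zero. A minimal sketch in Python follows. Note one assumption: this is a components-style eigendecomposition of the correlation matrix, so the variance-explained figure it prints will be somewhat larger than the 45.61% reported above, because the exploratory factor analysis modeled only the variables' shared variance.

import numpy as np

# Correlations among the seven measured variables, from Appendix I:
# RM, TT, RE (metalinguistic knowledge); NM, SS, VL, LA (aptitude).
R = np.array([
    [ 1.00,  0.59,  0.55,  0.01,  0.10, -0.05,  0.16],
    [ 0.59,  1.00,  0.87, -0.01,  0.10, -0.00,  0.10],
    [ 0.55,  0.87,  1.00,  0.02,  0.12, -0.02,  0.10],
    [ 0.01, -0.01,  0.02,  1.00,  0.33,  0.29,  0.19],
    [ 0.10,  0.10,  0.12,  0.33,  1.00,  0.24,  0.28],
    [-0.05, -0.00, -0.02,  0.29,  0.24,  1.00,  0.21],
    [ 0.16,  0.10,  0.10,  0.19,  0.28,  0.21,  1.00],
])

eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]  # largest first
print("Eigenvalues:", np.round(eigenvalues, 2))
share = 100 * eigenvalues[:2].sum() / R.shape[0]
print(f"First two components account for {share:.1f}% of the total variance")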

As for the second research question, the analysis of structural models indicated that metalinguistic knowledge significantly affects the development of L2 procedural knowledge. Metalinguistic knowledge statistically predicted L2 procedural knowledge (β = .57, p < .001), complexity (β = .19, p < .05), accuracy (β = .30, p < .001), and fluency (β = .35, p < .001). However, language learning aptitude did not statistically predict L2 procedural knowledge (β = -.07, p = .41), complexity (β = -.08, p = .45), accuracy (β = -.08, p = .39), or fluency (β = .02, p = .80). This finding is significant in that it provides evidence for the facilitative effects of metalinguistic knowledge in L2 learning. Metalinguistic knowledge plays an important role in the proceduralization of L2 forms, and it differentially affects L2 complexity, accuracy, and fluency.
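Written out as standardized regression equations (a schematic restatement of the coefficients reported above rather than the full model specification; MK = metalinguistic knowledge, APT = language learning aptitude, and the ζ terms denote disturbances), the structural relations are:

\[
\begin{aligned}
\text{L2 procedural knowledge} &= .57\,\text{MK} - .07\,\text{APT} + \zeta_{1}\\
\text{Complexity} &= .19\,\text{MK} - .08\,\text{APT} + \zeta_{2}\\
\text{Accuracy} &= .30\,\text{MK} - .08\,\text{APT} + \zeta_{3}\\
\text{Fluency} &= .35\,\text{MK} + .02\,\text{APT} + \zeta_{4}
\end{aligned}
\]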

Regarding the third research question, analyses of path models indicated that rule explanation significantly influences L2 procedural knowledge. The regression weight from productive metalinguistic knowledge (i.e., rule explanation) to L2 procedural knowledge demonstrated a moderate effect (β = .47, p < .001); the magnitude of this coefficient clearly distinguished rule explanation from the other predictors in the model. Language analytical ability statistically predicted receptive metalinguistic knowledge (β = .17, p < .05), and number learning was negatively associated with L2 procedural knowledge (β = -.16, p < .05). The predictive power of productive metalinguistic knowledge is a significant finding. It implies that the cognitive processes operating on declarative knowledge facilitate the organization of linguistic knowledge so that it is available for production and proceduralization.
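The statistically significant paths in this model can be condensed in the same way (an illustrative summary only; nonsignificant paths retained in the model are omitted here, and RE = rule explanation, NM = number learning, LA = language analytical ability):

\[
\begin{aligned}
\text{L2 procedural knowledge} &= .47\,\text{RE} - .16\,\text{NM} + \zeta_{1}\\
\text{Receptive metalinguistic knowledge} &= .17\,\text{LA} + \zeta_{2}
\end{aligned}
\]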


Limitations

At least six limitations exist in the present study. These are related to the research design, the characteristics of the participants, the sample size, and the testing instruments.

First, a cross-sectional research design was adopted for this study, which involved testing all of the participants at one point in time. Observations of proceduralized declarative knowledge are based on the assumption that this proceduralization took place over the previous years in which the participants had been studying L2 English. The results of these processes were observable through the examination of ability measures. However, because learner development is assumed rather than measured over time, the internal validity of the study is weakened. A longitudinal study in which learners were assessed at different points in their L2 development would have produced telling data pertaining to the rate of L2 knowledge proceduralization.

Second, characteristics of the participants might have influenced the relationships manifested in the data. All of the participants were Japanese speakers who had studied English in a somewhat unique environment. They all had attended Japanese junior and senior high schools and studied English through predominantly explicit instruction. The participants were recruited from two Japanese universities, one private and one national, and both universities admit students based on entrance exam scores. This process could have resulted in groups of students with similar characteristics and abilities. Sampling from a wider range of educational institutions and grade levels would have resulted in a more diverse sample.


Third, the sample size employed in this study could have been larger. The literature on structural equation modeling recommends samples of several hundred participants for models of even moderate complexity. The models tested in this study were not highly complex; therefore, the sample size used most likely produced stable estimates of the effects. However, a larger sample would have allowed for more confidence in the interpretations derived from the analyses.

Fourth, not all of the testing instruments performed as expected. The Rasch reliability of the receptive metalinguistic knowledge test was slightly below .70, as were the estimates for sound-symbol association and language analytical ability. The Rasch analyses also revealed possible ceiling effects on some of the aptitude scales. More items should be added to the receptive metalinguistic knowledge test to provide increased construct coverage and higher reliability estimates. Likewise, the language learning aptitude scales could be lengthened to further disperse the measures and to increase the reliability estimates.

Fifth, the instrument and conditions used to test L2 procedural knowledge could have allowed learners to reflect on declarative knowledge. As DeKeyser (2003) pointed out, tests of implicit or explicit knowledge can only be relative measures in that it is difficult to prevent learners from applying one type of knowledge or the other. However, if learners are able to access and apply declarative knowledge under timed conditions, such use can be an indicator of procedural knowledge. The precise time constraints to apply to a given condition are difficult to determine. Allowing too much time for task completion permits access to a range of declarative knowledge, while imposing too strict a time limit might adversely affect performance, resulting in unsystematic variance in the measures.

Sixth, in the present study, procedural knowledge was measured in only one linguistic skill area (i.e., writing). A timed writing task was used to assess L2 learners’ procedural knowledge. This type of task could have favored analytical learners over memory-based learners. Furthermore, learners of varying levels of aptitude could have practiced essay writing enough to overcome any differences in language learning aptitude, which would lessen the impact of language aptitude on the procedural knowledge variables. The addition of a timed, meaning-focused speaking task could have shed more light on the effects of metalinguistic knowledge and language learning aptitude on L2 procedural knowledge.

Suggestions for Future Research

The present study investigated the factorial structure of metalinguistic knowledge and language learning aptitude, and assessed the extent to which metalinguistic knowledge and language learning aptitude influence L2 procedural knowledge, complexity, accuracy, and fluency. Here I offer suggestions as to how these lines of inquiry can be continued, how other aspects of L2 proceduralization can be examined, and how avenues of research into the effects of language learning aptitude at later stages of L2 acquisition can inform theories of L2 development.

First, the construct of metalinguistic knowledge could be assessed and manipulated within an experimental study. Metalinguistic knowledge could be used as a factorial variable in laboratory and classroom-based studies. The manipulation of this variable could shed light on the effects of differing degrees of metalinguistic knowledge. How much metalinguistic knowledge of a given rule do learners need to begin proceduralizing and applying that knowledge? What is the optimal timing of the provision of metalinguistic knowledge? In what ways should it be sequenced in L2 syllabuses? Setting experimental metalinguistic knowledge conditions could illuminate these aspects of L2 knowledge development and use.

Second, methods could be developed and strengthened to assess the proceduralization of L2 knowledge. Tests are needed to determine when L2 knowledge has become procedural or automatic. Jiang’s (2007) study of implicit linguistic competence is a good example of this approach: a self-paced reading task was used to elicit reaction times to grammatical and ungrammatical syntactic and morphological forms. Using statistical tests to judge whether learners demonstrate automaticity of grammatical forms creates a clear, interpretable distinction between different forms of L2 knowledge.
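The inferential logic of such a test is compact. In the sketch below (hypothetical reading times and a conventional alpha of .05, both my own choices for illustration; Jiang's actual design compared region-by-region reading times with item-level controls), a learner whose knowledge of a form has been automatized should be reliably slowed by ungrammatical tokens of that form:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical self-paced reading times (ms) on the critical region
rt_grammatical = rng.normal(loc=420, scale=60, size=40)
rt_ungrammatical = rng.normal(loc=470, scale=60, size=40)

# Welch's t-test: is the learner slowed by the grammatical violation?
t, p = ttest_ind(rt_ungrammatical, rt_grammatical, equal_var=False)
print(f"Welch t = {t:.2f}, p = {p:.4f}")
if t > 0 and p < .05:
    print("Reliable sensitivity to the violation: consistent with automatized knowledge")
else:
    print("No reliable sensitivity: the form may be represented only declaratively")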

Third, the components of language learning aptitude that influence L2 learning outcomes in later stages of acquisition need to be identified. The language learning aptitude measures used in the present study might be associated with initial L2 learning; however, these measures were not related to L2 procedural knowledge, which is developed over time. Skehan (2002) has laid the theoretical groundwork for putative processing procedures implicated in later stages of L2 acquisition. Similarly, Robinson (2001a, 2002, 2005a, 2007) has proposed a research agenda through which aptitude complexes that impact various stages of L2 acquisition might be identified. Research into these cognitive abilities could enhance the application of language learning aptitude measures in L2 education. Much theoretical and empirical work is left to be done in these promising areas of inquiry.

Final Conclusions

Reflecting on the findings of this study, it seems that initial aptitudes for L2 learning do not persist or greatly affect later learning outcomes. In this study, language learning aptitude did not account for variance in L2 learners’ procedural knowledge. Metalinguistic knowledge, however, appeared to significantly influence the development of L2 procedural knowledge. These findings point to the need to integrate language-focused activities and subsequent proceduralization practice into the L2 learning process.

I am not, however, suggesting that only metalinguistic knowledge be taught. In fact, language-focused learning should constitute only one part of a language program. For learners to develop well-rounded L2 skills, instructional opportunities for meaning-focused input, output, and fluency development need to be balanced with language-focused learning. Nation (2007) argued for a 25% allotment of instructional time to each of these strands, and based on my experience, balancing L2 instruction in this way would result in significant L2 gains.

In Japanese schools, language-focused instruction already commands much of the instructional time in L2 classes. Many English classes are conducted in Japanese, and learners’ focus is drawn almost exclusively to declarative aspects of the language. Meaning-focused output needs to command more of the instructional time. Although procedural gains might not be evident in the short term, developing procedural knowledge that can be used in meaning-focused communication should benefit learners in the long run. Therefore, teachers in Japanese schools at all levels need to reflect on the learners’ needs and balance instructional activities to facilitate the development of learners’ declarative and procedural linguistic knowledge.


REFERENCES

Abrahamsson, N., & Hyltenstam, K. (2008). The robustness of aptitude effects in near-native second language acquisition. Studies in Second Language Acquisition, 30, 481-509. doi: 10.1017/S027226310808073X

Alderson, J. C., Clapham, C., & Steel, D. (1997). Metalinguistic knowledge, language aptitude and language proficiency. Language Teaching Research, 1, 93-121.

Anderson, J. R. (2002). Spanning seven orders of magnitude: A challenge for cognitive modeling. Cognitive Science, 26, 85-112.

Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. L. (2004). An integrated theory of the mind. Psychological Review, 111(4), 1036-1060. doi: 10.1037/0033-295x.111.4.1036

Baddeley, A. (2012). Working memory: Theories, models, and controversies. Annual Review of Psychology, 63(1), 1-29. doi: 10.1146/annurev-psych-120710-100422

Bentler, P. M. (2006). EQS 6 structural equations program manual. Encino, CA: Multivariate Software, Inc.

Berry, R. (2009). EFL majors' knowledge of metalinguistic terminology: A comparative study. Language Awareness, 18(2), 113-128. doi: 10.1080/09658410802513751

Bialystok, E. (1979). Explicit and implicit judgements of L2 grammaticality. Language Learning, 29, 81-103.

Bialystok, E. (1982). On the relationship between knowing and using linguistic forms. Applied Linguistics, 3, 181-206.

Bialystok, E. (2001). Metalinguistic aspects of bilingual processing. Annual Review of Applied Linguistics, 21, 169-181.

Bialystok, E., Craik, F. I. M., & Luk, G. (2008). Lexical access in bilinguals: Effects of vocabulary size and executive control. Journal of Neurolinguistics, 21(6), 522-538. doi: 10.1016/j.jneuroling.2007.07.001

Bialystok, E., & Luk, G. (2007). The universality of symbolic representation for reading in Asian and alphabetic languages. Bilingualism: Language and Cognition, 10(2), 121-129. doi: 10.1017/S136672890700288x


Bialystok, E., & Sharwood Smith, M. (1985). Interlanguage is not a state of mind: An evaluation of the construct for second-language acquisition. Applied Linguistics, 6, 101-117.

Blunch, N. J. (2008). Introduction to structural equation modeling using SPSS and AMOS. London: Sage.

Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Erlbaum.

Browne, M. W., & Cudeck, R. (1989). Single sample cross-validation indices for covariance structures. Multivariate Behavioral Research, 24(4), 445-455. doi: 10.1207/s15327906mbr2404_4

Byrne, B. M. (2010). Structural equation modeling with AMOS: Basic concepts, applications, and programming (2nd ed.). New York, NY: Routledge.

Carroll, J. B. (1962). The prediction of success in intensive foreign language training. In R. Glaser (Ed.), Training research and education (pp. 87-136). Pittsburgh, PA: University of Pittsburgh Press. Retrieved from ERIC database. (ED038051)

Carroll, J. B., & Sapon, S. M. (1959). Modern Language Aptitude Test Manual. New York, NY: The Psychological Corporation.

De Jong, N. (2005). Can second language grammar be learned through listening? An experimental study. Studies in Second Language Acquisition, 27, 205-234.

De Jong, N. H., Steinel, M. P., Florijn, A. F., Schoonen, R., & Hulstijn, J. H. (2012). Facets of speaking proficiency. Studies in Second Language Acquisition, 34, 5-34.

DeKeyser, R. M. (1995). Learning second language grammar rules: An experiment with a miniature linguistic system. Studies in Second Language Acquisition, 17, 379-410.

DeKeyser, R. M. (1997). Beyond explicit rule learning: Automatizing second language morphosyntax. Studies in Second Language Acquisition, 19, 195-221.

DeKeyser, R. M. (1998). Beyond focus on form: Cognitive perspectives on learning and practicing second language grammar. In C. Doughty & J. Williams (Eds.), Focus on form in classroom second language acquisition (pp. 42-63). Cambridge: Cambridge University Press.

DeKeyser, R. M. (2000). The robustness of critical period effects in second language acquisition. Studies in Second Language Acquisition, 22, 499-533.


DeKeyser, R. M. (2003). Implicit and explicit learning. In C. Doughty & M. H. Long (Eds.), The handbook of second language acquisition (pp. 313-348). Oxford, UK: Blackwell.

DeKeyser, R. M. (2007). Conclusion: The future of practice. In R. M. DeKeyser (Ed.), Practice in a second language: Perspectives from applied linguistics and cognitive psychology (pp. 287-304). Cambridge: Cambridge University Press.

DeKeyser, R. M. (2009). Cognitive-psychological processes in second language learning. In M. H. Long & C. Doughty (Eds.), The handbook of language teaching (pp. 119-138). Oxford, UK: Blackwell.

DeKeyser, R., Alfi-Shabtay, I., & Ravid, D. (2010). Cross-linguistic evidence for the nature of age effects in second language acquisition. Applied Psycholinguistics, 31, 413-438. doi: 10.1017/s0142716410000056

Dörnyei, Z. (2005). The psychology of the language learner: Individual differences in second language acquisition. Mahwah, NJ: Erlbaum.

Dufva, M., & Voeten, M. J. M. (1999). Native language literacy and phonological memory as prerequisites for learning English as a foreign language. Applied Psycholinguistics, 20(3), 329-348.

Educational Testing Service. (2008). TOEFL iBT tips: How to prepare for the TOEFL iBT. Retrieved from http://www.ets.org/toefl

Elder, C., Warren, J., Hajek, J., Manwaring, D., & Davies, A. (1999). Metalinguistic knowledge: How important is it in studying a language at university? Australian Review of Applied Linguistics, 22(1), 81-95.

Ellis, N. C. (1996). Sequencing in SLA: Phonological memory, chunking, and points of order. Studies in Second Language Acquisition, 18, 91-126.

Ellis, N. C. (2005). At the interface: Dynamic interactions of explicit and implicit language knowledge. Studies in Second Language Acquisition, 27, 305-352.

Ellis, R. (1993). The structural syllabus and second language acquisition. TESOL Quarterly, 27, 93-113.

Ellis, R. (1994a). A theory of instructed second language acquisition. In N. Ellis (Ed.), Implicit and explicit learning of languages (pp. 79-114). London: Academic Press.

Ellis, R. (1994b). The study of second language acquisition. Oxford: Oxford University Press.


Ellis, R. (2002). Does form-focused instruction affect the acquisition of implicit knowledge? Studies in Second Language Acquisition, 24, 223-236.

Ellis, R. (2003). Task-based language learning and teaching. Oxford: Oxford University Press.

Ellis, R. (2004). The definition and measurement of L2 explicit knowledge. Language Learning, 54, 227-275.

Ellis, R. (2005). Measuring implicit and explicit knowledge of a second language: A psychometric study. Studies in Second Language Acquisition, 27, 141-172.

Ellis, R. (2006). Modelling learning difficulty and second language proficiency: The differential contributions of implicit and explicit knowledge. Applied Linguistics, 27, 431-463.

Ellis, R. (2008a). Investigating grammatical difficulty in second language learning: Implications for second language acquisition research and language pedagogy. International Journal of Applied Linguistics, 18, 4-22.

Ellis, R. (2008b). The study of second language acquisition (2nd ed.). Oxford: Oxford University Press.

Ellis, R. (2012). Language teaching research & language pedagogy. Oxford: Wiley-Blackwell.

Ellis, R., & Barkhuizen, G. (2005). Analysing learner language. Oxford: Oxford University Press.

Ellis, R., & Loewen, S. (2007). Confirming the operational definitions of explicit and implicit knowledge in Ellis (2005): Responding to Isemonger. Studies in Second Language Acquisition, 29, 119-126.

Ellis, R., Loewen, S., & Erlam, R. (2006). Implicit and explicit corrective feedback and the acquisition of L2 grammar. Studies in Second Language Acquisition, 28, 339-368.

Erlam, R. (2005). Language aptitude and its relationship to instructional effectiveness in second language acquisition. Language Teaching Research, 9(2), 147-171.

Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.


Ganschow, L., Sparks, R. L., & Javorsky, J. (1998). Foreign language learning difficulties: An historical perspective. Journal of Learning Disabilities, 31(3), 248-258.

Gass, S. (1983). The development of L2 intuitions. TESOL Quarterly, 17, 273-291.

Gass, S. M. (1997). Input, interaction, and the second language learner. Mahwah, NJ: Erlbaum.

Gass, S. M., & Selinker, L. (2001). Second language acquisition: An introductory course (2nd ed.). New York, NY: Erlbaum.

Gass, S. M., & Selinker, L. (2008). Second language acquisition: An introductory course (3rd ed.). New York, NY: Routledge.

Goldschneider, J. M., & DeKeyser, R. M. (2001). Explaining the “natural order of L2 morpheme acquisition” in English: A meta-analysis of multiple determinants. Language Learning, 51, 1-50.

Golonka, E. M. (2006). Predictors revised: Linguistic knowledge and metacognitive strategies in second language gain in Russian. Modern Language Journal, 90(4), 496-505.

Gorsuch, G. J. (2000). EFL educational policies and educational cultures: Influences on teachers’ approval of communicative activities. TESOL Quarterly, 34, 675-710.

Green, P. S., & Hecht, K. (1992). Implicit and explicit grammar: An empirical study. Applied Linguistics, 13, 168-184.

Grigorenko, E. L., Sternberg, R. J., & Ehrman, M. E. (2000). A theory-based approach to the measurement of foreign language learning ability: The CANAL-F theory and test. Modern Language Journal, 84(3), 390-405.

Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate applications (2nd ed.). New York, NY: Routledge.

Han, Y., & Ellis, R. (1998). Implicit knowledge, explicit knowledge and general language proficiency. Language Teaching Research, 2, 1-23.

Harley, B., & Hart, D. (1997). Language aptitude and second language proficiency in classroom learners of different starting ages. Studies in Second Language Acquisition, 19(3), 379-400.


Harley, B., & Hart, D. (2002). Age, aptitude, and second language learning on a bilingual exchange. In P. Robinson (Ed.), Individual differences and instructed language learning (pp. 301-330). Amsterdam: Benjamins.

Hu, G. (2002). Psychological constraints on the utility of metalinguistic knowledge in second language production. Studies in Second Language Acquisition, 24, 347-386.

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.

Hulstijn, J. H. (2005). Theoretical and empirical issues in the study of implicit and explicit second-language learning: Introduction. Studies in Second Language Acquisition, 27, 129-140.

Hulstijn, J. H., & Hulstijn, W. (1984). Grammatical errors as a function of processing constraints and explicit knowledge. Language Learning, 34, 23-43.

Hummel, K. M. (2009). Aptitude, phonological memory, and second language proficiency in nonnovice adult learners. Applied Psycholinguistics, 30(2), 225-249.

Isemonger, I. M. (2007). Operational definitions of explicit and implicit knowledge: Response to R. Ellis (2005) and some recommendations for future research in this area. Studies in Second Language Acquisition, 29, 101-118.

Jiang, N. (2007). Selective integration of linguistic knowledge in adult second language learning. Language Learning, 57, 1-33.

Johnson, K. (1996). Language teaching & skill learning. Oxford: Blackwell.

Kiss, C., & Nikolov, M. (2005). Developing, piloting, and validating an instrument to measure young learners’ aptitude. Language Learning, 55(1), 99-150. doi: 10.1111/j.0023-8333.2005.00291.x

Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New York, NY: Guilford Press.

Koda, K. (2000). Crosslinguistic interactions in the development of L2 intraword awareness: Effects of logographic processing experience. Psychologia, 43(1), 27-46.

Kormos, J. (2006). Speech production and second language acquisition. Mahwah, NJ: Erlbaum.


Krashen, S. (1981). Second language acquisition and second language learning. Oxford: Pergamon.

Larsen-Freeman, D., & Long, M. H. (1991). An introduction to second language acquisition research. Harlow, England: Longman.

Larson-Hall, J. (2008). Weighing the benefits of studying a foreign language at a younger starting age in a minimal input situation. Second Language Research, 24(1), 35-63.

Levelt, W. J. M. (1989). Speaking: From intention to articulation. Cambridge, MA: MIT Press.

Linacre, J. M. (2011a). A user's guide to WINSTEPS MINISTEP: Rasch-model computer programs. Beaverton, OR: Winsteps.com.

Linacre, J. M. (2011b). Winsteps (Version 3.72.3) [Computer software]. Beaverton, OR: Winsteps.com.

Linacre, J. M. (2012a). A user’s guide to FACETS: Rasch-model computer programs. Beaverton, OR: Winsteps.com.

Linacre, J. M. (2012b). Facets (Version 3.70.0) [Computer software]. Beaverton, OR: Winsteps.com.

Long, M. H. (1996). The role of the linguistic environment in second language acquisition. In W. C. Ritchie & T. K. Bhatia (Eds.), Handbook of second language acquisition (pp. 413-468). San Diego, CA: Academic Press.

Lyster, R. (2004). Differential effects of prompts and recasts in form-focused instruction. Studies in Second Language Acquisition, 26, 399-432.

Lyster, R., & Ranta, L. (1997). Corrective feedback and learner uptake: Negotiation of form in communicative classrooms. Studies in Second Language Acquisition, 19, 37-66.

Lyster, R., Saito, K., & Sato, M. (2013). Oral corrective feedback in second language classrooms. Language Teaching, 46(1), 1-40. doi: 10.1017/S0261444812000365

Macaro, E., & Masterman, L. (2006). Does intensive explicit grammar instruction make all the difference? Language Teaching Research, 10, 297-327.

McNamara, T. F. (1996). Measuring second language performance. London: Longman.


Meyers, L. S., Gamst, G. C., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. London: Sage.

Nation, P. (2007). The four strands. Innovation in Language Learning and Teaching, 1, 2-13. doi: 10.2167/illt039.0

Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50(3), 417-528.

Norris, J. M., & Ortega, L. (2009). Towards an organic approach to investigating CAF in instructed SLA: The case of complexity. Applied Linguistics, 30(4), 555-578. doi: 10.1093/applin/amp044

O'Brien, I., Segalowitz, N., Collentine, J., & Freed, B. (2006). Phonological memory and lexical, narrative, and grammatical skills in second language oral production by adult learners. Applied Psycholinguistics, 27(3), 377-402. doi: 10.1017/S0142716406060322

O'Brien, I., Segalowitz, N., Freed, B., & Collentine, J. (2007). Phonological memory predicts second language oral fluency gains in adults. Studies in Second Language Acquisition, 29(4), 557-581. doi: 10.1017/S027226310707043x

Ortega, L. (2009). Understanding second language acquisition. London: Hodder Education.

Pallotti, G. (2009). CAF: Defining, refining and differentiating constructs. Applied Linguistics, 30(4), 590-601. doi: 10.1093/applin/amp045

Paradis, M. (1994). Neurolinguistic aspects of implicit and explicit memory: Implications for bilingualism and SLA. In N. Ellis (Ed.), Implicit and explicit learning of languages (pp. 393-419). London: Academic Press.

Paradis, M. (2004). A neurolinguistic theory of bilingualism. Amsterdam: Benjamins.

Paradis, M. (2009). Declarative and procedural determinants of second languages. Amsterdam: Benjamins.

Pica, T. (1994). Research on negotiation: What does it reveal about second-language learning conditions, processes, and outcomes? Language Learning, 44, 493-527.

Pienemann, M. (1984). Psychological constraints on the teachability of languages. Studies in Second Language Acquisition, 6, 186-214.


Pienemann, M. (1989). Is language teachable? Psycholinguistic experiments and hypotheses. Applied Linguistics, 10, 52-79.

Pienemann, M. (1998). Language processing and second language development: Processability theory. Amsterdam: Benjamins.

Pimsleur, P. (1966). The Pimsleur language aptitude battery. New York, NY: Harcourt Brace Jovanovich.

Polio, C. G. (1997). Measures of linguistic accuracy in second language writing research. Language Learning, 47, 101-143.

Ranta, L. (2002). The role of learners’ language analytic ability in the communicative classroom. In P. Robinson (Ed.), Individual differences and instructed language learning (pp. 159-180). Amsterdam: Benjamins.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.

Robinson, P. (2001a). Individual differences, cognitive abilities, aptitude complexes and learning conditions in second language acquisition. Second Language Research, 17(4), 368-392.

Robinson, P. (2001b). Task complexity, cognitive resources, and syllabus design: A triadic framework for examining task influences on SLA. In P. Robinson (Ed.), Cognition and second language instruction (pp. 287-318). Cambridge: Cambridge University Press.

Robinson, P. (2002). Learning conditions, aptitude complexes and SLA: A framework for research and pedagogy. In P. Robinson (Ed.), Individual differences and instructed language learning (pp. 113-133). Amsterdam: Benjamins.

Robinson, P. (2005a). Aptitude and second language acquisition. Annual Review of Applied Linguistics, 25, 46-73.

Robinson, P. (2005b). Cognitive abilities, chunk-strength, and frequency effects in implicit artificial grammar and incidental L2 learning: Replications of Reber, Walkenfeld, and Hernstadt (1991) and Knowlton and Squire (1996) and their relevance for SLA. Studies in Second Language Acquisition, 27, 235-268.

Robinson, P. (2007). Aptitudes, abilities, contexts, and practice. In R. M. DeKeyser (Ed.), Practice in a second language: Perspectives from applied linguistics and cognitive psychology (pp. 256-286). Cambridge: Cambridge University Press.


Roehr, K. (2006). Metalinguistic knowledge in L2 task performance: A verbal protocol analysis. Language Awareness, 15, 180-198.

Roehr, K. (2008a). Linguistic and metalinguistic categories in second language learning. Cognitive Linguistics, 19, 67-106.

Roehr, K. (2008b). Metalinguistic knowledge and language ability in university-level L2 learners. Applied Linguistics, 29, 173-199.

Roehr, K., & Gánem-Gutiérrez, G. A. (2009). The status of metalinguistic knowledge in instructed adult L2 learning. Language Awareness, 18(2), 165-181. doi: 10.1080/09658410902855854

Ross, S., Yoshinaga, N., & Sasaki, M. (2002). Aptitude-exposure interaction effects on Wh-movement violation detection by pre-and-post critical period Japanese bilinguals. In P. Robinson (Ed.), Individual differences and instructed language learning (pp. 267-299). Amsterdam: Benjamins.

Rubin, J. (1981). Study of cognitive processes in second language learning. Applied Linguistics, 2, 117-131.

Sagarra, N., & Herschensohn, J. (2010). The role of proficiency and working memory in gender and number agreement processing in L1 and L2 Spanish. Lingua, 120, 2022-2039.

Sasaki, M. (1991). Relationships among second language proficiency, foreign language aptitude, and intelligence: A structural equation modeling approach (Doctoral dissertation). Retrieved from ProQuest Dissertations & Theses A&I database. (Accession No. 303922251)

Sasaki, M. (1993). Relationships among second language proficiency, foreign language aptitude, and intelligence: A structural equation modeling approach. Language Learning, 43, 313-344.

Sasaki, M. (1996). Second language proficiency, foreign language aptitude, and intelligence: Quantitative and qualitative analyses. New York, NY: Peter Lang.

Sauro, S. (2009). Computer-mediated corrective feedback and the development of L2 grammar. Language Learning & Technology, 13(1), 96-120.

Sawyer, M., & Ranta, L. (2001). Aptitude, individual differences, and instructional design. In P. Robinson (Ed.), Cognition and second language acquisition (pp. 319-353). Cambridge: Cambridge University Press.


Schmidt, R. (1990). The role of consciousness in second language learning. Applied Linguistics, 11, 129-158.

Sheen, Y. (2007). The effect of focused written corrective feedback and language aptitude on ESL learners' acquisition of articles. TESOL Quarterly, 41(2), 255-283.

Sheen, Y. (2010). Differential effects of oral and written corrective feedback in the ESL classroom. Studies in Second Language Acquisition, 32(2), 203-234. doi: 10.1017/S0272263109990507

Sick, J. (2007). The learner's contribution: Individual differences in language learning in a Japanese high school (Doctoral dissertation). Retrieved from ProQuest Dissertations & Theses A&I database. (Accession No. 304811262)

Sick, J., & Irie, K. (1998). The Lunic Language Marathon: Version 3. Retrieved from http://idisk.mac.com/jimsick

Sick, J., & Irie, K. (2001). Investigating the role of aptitude in an EFL course in Japan. In P. Robinson, M. Sawyer, & S. Ross (Eds.), Second language acquisition research in Japan (pp. 129-141). Tokyo: Japan Association for Language Teaching.

Skehan, P. (1989). Individual differences in second language learning. London: Edward Arnold.

Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.

Skehan, P. (2002). Theorising and updating aptitude. In P. Robinson (Ed.), Individual differences and instructed language learning (pp. 69-93). Amsterdam: Benjamins.

Skehan, P. (2009). Modelling second language performance: Integrating complexity, accuracy, fluency, and lexis. Applied Linguistics, 30(4), 510-532. doi: 10.1093/applin/amp047

Sharwood Smith, M. (1981). Consciousness-raising and the second language learner. Applied Linguistics, 2, 159-168.

Snow, R. E. (1992). Aptitude theory: Yesterday, today, and tomorrow. Educational Psychologist, 27(1), 5-32.

Sorace, A. (1985). Metalinguistic knowledge and language use in acquisition-poor environments. Applied Linguistics, 6(3), 239-254.


Sparks, R. L., & Ganschow, L. (1991). Foreign language learning differences: Affective or native language aptitude differences? Modern Language Journal, 75, 3-16.

Sparks, R., & Ganschow, L. (2001). Aptitude for learning a foreign language. Annual Review of Applied Linguistics, 21, 90-111.

Sparks, R. L., Patton, J., Ganschow, L., & Humbach, N. (2009). Long-term relationships among early first language skills, second language aptitude, second language affect, and later second language proficiency. Applied Psycholinguistics, 30(4), 725-755. doi: 10.1017/s0142716409990099

Squire, L. R., & Zola, S. M. (1996). Structure and function of declarative and nondeclarative memory systems. Proceedings of the National Academy of Sciences of the United States of America, 93(24), 13515-13522.

Stevens, J. P. (2002). Applied multivariate statistics for the social sciences (4th ed.). Mahwah, NJ: Erlbaum.

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston, MA: Pearson.

Takada, T. (2003). Learner characteristics of early starters and late starters of English language learning: Anxiety, aptitude, and motivation. JALT Journal, 25(1), 5-30.

Tokowicz, N., & MacWhinney, B. (2005). Implicit and explicit measures of sensitivity to violations in second language grammar: An event-related potential investigation. Studies in Second Language Acquisition, 27, 173-204.

Ullman, M. T. (2005). A cognitive neuroscience perspective on second language acquisition: The declarative/procedural model. In C. Sanz (Ed.), Mind & context in adult second language acquisition (pp. 141-178). Washington, DC: Georgetown University Press.

VanPatten, B. (1990). Attending to form in the input: An experiment in consciousness. Studies in Second Language Acquisition, 12, 287-301.

VanPatten, B. (2007). Input processing in adult second language acquisition. In B. VanPatten & J. Williams (Eds.), Theories in second language acquisition: An introduction (pp. 115-135). Mahwah, NJ: Erlbaum.

VanPatten, B., & Williams, J. (Eds.). (2007). Theories in second language acquisition: An introduction. Mahwah, NJ: Erlbaum.


White, J., Muñoz, C., & Collins, L. (2007). The his/her challenge: Making progress in a ‘regular’ L2 programme. Language Awareness, 16, 278-299.

Williams, J. N., & Lovatt, P. (2005). Phonological memory and rule learning. Language Learning, 55, 177-233. doi: 10.1111/j.0023-8333.2005.00298.x

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Erlbaum.

Wolfe-Quintero, K., Inagaki, S., & Kim, H. (1998). Second language development in writing: Measures of fluency, accuracy & complexity. Honolulu: University of Hawaii Press.

Wright, B. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.

Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.

Zobl, H. (1995). Converging evidence for the ‘acquisition-learning’ distinction. Applied Linguistics, 16, 35-56.


APPENDICES


APPENDIX A RECEPTIVE METALINGUISTIC KNOWLEDGE TEST (JAPANESE VERSION)

次の1~22の問いを読んで、一番適切な答えを選び、その答えにマルしてくだ さい。

1. When does the show begin? この When を文法的に説明すると何ですか。 1. 疑問副詞 (*) 2. 疑問代名詞 3. 疑問形容詞 4. 関係副詞

2. What is that thing over there next to the building? この What は文法的に説明すると何ですか。 1. 関係代名詞 2. 疑問代名詞 (*) 3. 疑問副詞 4. 疑問形容詞

3. I lent him what money I had. この what は文法的に説明すると何ですか。 1. 疑問代名詞 2. 関係代名詞 3. 関係形容詞 (*) 4. 関係副詞

4. You gave him 10,000 yen? この英文は上昇調で読まれます。この構造は次のうちどれにあたりますか。 1. SVOC の疑問文 2. 平叙文のままの疑問文 (*) 3. 疑問文の慣用表現 4. 聞き返し疑問文

5. 名詞の前に複数の形容詞を並べるとき、正しい順序はどれですか。 1. 数量、色、材料・出所 (*) 2. 色、年齢・新旧、大小 3. 大小、年齢・新旧、数量 4. 材料・出所、年齢・新旧、色


6. 「名詞節」が使われている英文はどれですか。 1. If we hadn’t spent all of our money, we could go to the concert. 2. She will finish her homework if she studies all day. 3. Cooking dinner together has been important to us. 4. I don’t know if the train has arrived. (*)

7. SVC という文型にしないといけない動詞のことを何といいますか。 1. 他動詞 2. 完全他動詞 3. 不完全自動詞 (*) 4. 完全自動詞

8. 次のうち、「分詞構文」の例文はどれですか。 1. Seen from the top of the building, the cars look small. (*) 2. Every time I buy something, I have to use money. 3. When I go to work, I always ride a train. 4. Even though I have read this book many times, I still enjoy it.

9. It being Friday, the restaurant was crowded with happy customers. この英文の文型は何といいますか。 1. 分詞形容詞構文 2. 現在進行形の分詞構文 3. 独立分詞構文 (*) 4. 慣用的な分詞構文

10. 「過去完了形で用いたとき、文の意味が実現されなかった期待」を表さない動詞はどれ ですか。 1. desire 2. examine (*) 3. expect 4. hope

11. They made the room warm. この英文は、次のうち、どの文法用語を説明するための例文ですか。 1. 使役動詞 2. 自動詞句補文 3. 主格補語 4. 目的格補語 (*)

12. 次のうち、「動名詞」の例文はどれですか。 1. I feel like having another drink. (*) 2. Ken came into the room smiling. 3. We saw a man building a house. 4. Megumi’s flight will be arriving on time.


13. Did you follow the teacher’s advice at all? この at all を説明するものとして適切なものを次から選びなさい。 1. 否定表現を強調する形容詞句 2. 動詞を強調する形容詞句 3. 疑問詞のない疑問文の強調 (*) 4. 付加疑問の副詞句

14. 次の単語の中で、「限定用法でしか使われない形容詞」はどれですか。 1. aware 2. beautiful 3. enormous 4. main (*)

15. I’m not interested in basketball in the least. この in the least を説明するものとして適切なものを次から選びなさい。 1. 否定表現の強調 (*) 2. 否定を表す副詞 3. 頻度を表す副詞 4. 否定語を含まない目的語

16. 次の単語の中で、「叙述用法でしか使われない形容詞」はどれですか。 1. elder 2. glad (*) 3. responsible 4. special

17. 次の単語の中で、「人を主語にできない形容詞」はどれですか。 1. bored 2. convenient (*) 3. delightful 4. shocked

18. Can you help me cook dinner? この英文の構造を説明するものとして、最も適切なものを選びなさい。 1. 主語+動詞+目的語+動詞の原形 (*) 2. 主語+知覚動詞+動詞の原形 3. 主語+助動詞+不完全自動詞+目的語 4. 主語+自動詞+間接目的語+直接目的語

19. 次のうち、「修辞疑問文」の例文はどれですか。 1. Why didn’t you come to the party? 2. What is the weather in Hawaii like? 3. Who doesn’t love chocolate cake? (*) 4. How about going to see a movie?

20. 次のうち、「不定冠詞」の例文はどれですか。 1. She turned off the TV. 2. He bought a car. (*) 3. I went to bed. 4. Akiko wrote this story.

21. 次のうち、「従属接続詞を用いた副詞節」の例文はどれですか。 1. He not only plays the piano, but also the guitar. 2. The question is whether the player will fully recover. 3. I’m sure that she will come on time tomorrow. 4. I learned English after I moved to Australia. (*)

22. 次のうち、「集合名詞」の例文はどれですか。 1. A child needs love. 2. My sister and I traveled to Canada. 3. He ate two eggs for breakfast. 4. A crowd gathered outside the school. (*)


APPENDIX B RECEPTIVE METALINGUISTIC KNOWLEDGE TEST (ENGLISH TRANSLATION)

Instructions: Circle the letter of the most suitable answer for each question.

1. When does the show begin? What grammatical terminology can be used to explain this when? a. Interrogative adverb (*) b. Interrogative pronoun c. Interrogative adjective d. Relative adverb

2. What is that thing over there next to the building? Which of the following describes this What? a. Relative pronoun b. Interrogative pronoun (*) c. Interrogative adverb d. Interrogative adjective

3. I lent him what money I had. Which of the following describes this what? a. Interrogative adverb b. Relative pronoun c. Relative adjective (*) d. Relative adverb

4. You gave him 10,000 yen? What type of sentence structure is this? a. SVOC interrogative b. Declarative interrogative (*) c. Idiomatic interrogative d. Responsive interrogative


5. Which of the following is the correct order of adjectives when modifying a noun? a. Number, color, material・place (*) b. Color, age・old-new, size c. Size, age・old-new, quantity d. Material・place, age・old-new, color

6. Which of the following sentences contains a nominal clause? a. If we hadn’t spent all of our money, we could go to the concert. b. She will finish her homework if she studies all day. c. Cooking dinner together has been important to us. d. I don’t know if the train has arrived. (*)

7. What is a verb called that requires an SVC sentence structure? a. Transitive verb b. Regular intransitive verb c. Incomplete intransitive verb (*) d. Complete intransitive verb

8. Which of the following is an example of a participle construction? a. Seen from the top of the building, the cars look small. (*) b. Every time I buy something, I have to use money. c. When I go to work, I always ride a train. d. Even though I have read this book many times, I still enjoy it.

9. It being Friday, the restaurant was crowded with happy customers. What is this sentence structure called? a. Participial adjective construction b. Present progressive participial construction c. Absolute participial construction (*) d. Idiomatic participial construction

10. When the past perfect form is used, which of the following verbs cannot be used to reflect an unrealized desire? a. desire b. examine (*) c. expect d. hope


11. They made the room warm. Which of the following would this sentence serve as an example of? a. Causative verb b. Intransitive verb phrase complement c. Subjective complement d. Objective complement (*)

12. Which of the following sentences contains a gerund? a. I feel like having another drink. (*) b. Ken came into the room smiling. c. We saw a man building a house. d. Megumi’s flight will be arriving on time.

13. Did you follow the teacher’s advice at all? Choose the most suitable description for this at all. a. An adjectival that emphasizes a negative expression b. An adjectival that emphasizes a verb c. Emphasis of a phrase that does not function as a question (*) d. A tag question adverbial

14. Which of the following adjectives can only be used attributively? a. aware b. beautiful c. enormous d. main (*)

15. I’m not interested in basketball in the least. Choose the most suitable description for this in the least. a. Emphasis of a negative expression (*) b. An adverb that expresses negation c. An adverb that expresses frequency d. An object that does not include a negative

16. Which of the following adjectives can only be used predicatively? a. elder b. glad (*) c. responsible d. special


17. Which of the following adjectives cannot be used to modify a person in the subject position? a. bored b. convenient (*) c. delightful d. shocked

18. Can you help me cook dinner? Choose the most suitable construction that explains this sentence. a. Subject + verb + object + base form of verb (*) b. Subject + perception verb + base form of verb c. Subject + auxiliary verb + incomplete intransitive verb d. Subject + transitive verb + indirect object + direct object

19. Which of the following is an example of a rhetorical question? a. Why didn’t you come to the party? b. What is the weather in Hawaii like? c. Who doesn’t love chocolate cake? (*) d. How about going to see a movie?

20. Which of the following is an example of an indefinite article? a. She turned off the TV. b. He bought a car. (*) c. I went to bed. d. Akiko wrote this story.

21. Which of the following is an example of a dependent adverbial clause? a. He not only plays the piano, but also the guitar. b. The question is whether the player will fully recover. c. I’m sure that she will come on time tomorrow. d. I learned English after I moved to Australia. (*)

22. Which of the following contains a collective noun? a. A child needs love. b. My sister and I traveled to Canada. c. He ate two eggs for breakfast. d. A crowd gathered outside the school. (*)


APPENDIX C PRODUCTIVE METALINGUISTIC KNOWLEDGE TEST

各英文に文法的な間違いや誤りが一箇所あります。訂正が必要な箇所に下線を 引き、英文の右側に正しい英語を書きなさい。また、その部分が「なぜ間違いな のか、なぜ訂正が必要なのか」を適切な文法規則等を用いて説明しなさい。

番号 英文 (各文の後に 説明 欄)
1. The building is more taller than the shop. 説明:
2. Mika wanted to know what had I seen. 説明:
3. Last month the ice was melted. 説明:
4. Seiko bought a few notebook. 説明:
5. The computer that my uncle received it is old. 説明:
6. Kazuo study Chinese every day. 説明:
7. He writes very well Japanese. 説明:
8. Anna ate sandwich for lunch. 説明:
9. Mika can to play the piano. 説明:
10. Hiroyuki bicycle was expensive. 説明:
11. Hiroshi play baseball at school yesterday. 説明:
12. Shohei says he wants buying a new computer. 説明:
13. Junko described Ayumi her vacation. 説明:
14. You have enough money, won’t you? 説明:
15. She has been learning French since nine years. 説明:
16. If he had been faster, he will win the race. 説明:
17. Did Shoko ate lunch? 説明:


APPENDIX D PROCEDURAL KNOWLEDGE TEST

次の問題に、英語で答えなさい。できるだけたくさん書いてください。裏面を使 用しても構いません。文章の構成、段落の数、改行は自由です。 (回答時間:25分)

Do you agree or disagree with the following statement? “Children should begin learning a foreign language as soon as they start elementary school.” Use specific reasons and examples to support your position.


APPENDIX E PRODUCTIVE METALINGUISTIC KNOWLEDGE TEST SCORING RUBRIC (JAPANESE VERSION)

Use of Technical Language (Judged on a 3-point scale)

Item (score 2 = full credit, 1 = partial, 0 = none): Description
1. 比較級 or 音節 / taller がすでに比較級だから、more はいらない
2. 間接疑問文、S, V, 倒置 / 疑問文ではないので[間接疑問文なので]主語動詞を倒置させる必要はない
3. ergative or 自動詞 or 受動態 / ergative verb は明示的な主語が必要、melt の主語が必要 (by 以下)、melt は自動詞として使われるから
4. 複数形(複数)、名詞(可算名詞) / a few が複数+名詞が複数形
5. 関係代名詞 (先行詞)
6. 三人称、単数、現在形 (主語) / [主語が三人称、現在形なので]動詞の形を三人称単数現在形にする。
7. 動詞、目的語
8. 複数形、あるいは、不定冠詞 (a)
9. 助動詞、動詞の原形 (or to 不定詞) / can (or 助動詞) + 動詞の原形、Can + 動詞
10. 所有、's
11. 過去形、時制
12. to 不定詞 (to + 動詞の原形)
13. SVOO もしくは第四文型、二重目的語
14. 付加疑問文、一般動詞
15. 前置詞 [for/since] 起点、期間
16. 仮定法過去完了、条件節・主節、時制・相、完了
17. 疑問文、過去形、(動詞の原形)


Rule Explanation (Judged on a 5-point scale)

Score 4. Criterion: The participant states a completely correct rule. Explanation: The stated rule is correct and exhaustively complete.

Score 3. Criterion: The participant states a correct rule, but the explanation might lack some detail or might not be exhaustively complete or fully elaborated. Explanation: The rule is basically correct, but it might be stated too simply or might be lacking some detail compared to a 4.

Score 2. Criterion: The participant states a partially correct rule. Explanation: Part of the stated rule is correct, but other phrases in the explanation might be incorrect or off target.

Score 1. Criterion: The participant states an incorrect rule. Explanation: The participant attempted to explain the rule, but the explanation is not correct.

Score 0. Criterion: The participant does not demonstrate awareness of the target rule. Explanation: The participant did not notice the rule violation or did not attempt to explain a rule.


APPENDIX F PRODUCTIVE METALINGUISTIC KNOWLEDGE TEST SCORING RUBRIC (ENGLISH TRANSLATION)

Use of Technical Language (Judged on a 3-point scale)

Item (score 2 = full credit, 1 = partial, 0 = none): Description
1. Comparative or syllable / “taller” is already in comparative form, “more” is not needed
2. Indirect question, S, V, inversion / because it is not an interrogative [because it is an indirect question], there is no need for subject-verb inversion
3. Ergative or intransitive or passive (voice) / ergative verbs require explicit subjects, “melt” needs a subject (after “by”), because “melt” is used as an intransitive verb
4. Plural form (plural), noun (countable noun) / “a few” implies plural + the noun is in plural form
5. Relative pronoun (antecedent)
6. Third person, singular, present tense (subject) / [The subject is third person singular, because the verb is in the present tense], Make the verb into the third person singular present tense form.
7. Verb, object
8. Plural form or indefinite article (a)
9. Auxiliary verb, base form of the verb (or infinitive) / can (or auxiliary verb) + base form of the verb / Can + verb
10. Possessive, apostrophe -s
11. Past tense, tense
12. Infinitive, bare infinitive (to + base form of verb)
13. SVOO or fourth sentence pattern or double object
14. Tag question, ordinary verb
15. Preposition [for/since] starting, (time) period
16. Conditional type 3, conditional clause, main clause, tense, aspect, perfect
17. Interrogative, past tense, (base form of the verb)


Rule Explanation (Judged on a 5-point scale)

Score 4. Criterion: The participant states a completely correct rule. Explanation: The stated rule is correct and exhaustively complete.

Score 3. Criterion: The participant states a correct rule, but the explanation might lack some detail or might not be exhaustively complete or fully elaborated. Explanation: The rule is basically correct, but it might be stated too simply or might be lacking some detail compared to a 4.

Score 2. Criterion: The participant states a partially correct rule. Explanation: Part of the stated rule is correct, but other phrases in the explanation might be incorrect or off target.

Score 1. Criterion: The participant states an incorrect rule. Explanation: The participant attempted to explain the rule, but the explanation is not correct.

Score 0. Criterion: The participant does not demonstrate awareness of the target rule. Explanation: The participant did not notice the rule violation or did not attempt to explain a rule.


APPENDIX G BACKGROUND QUESTIONNAIRE (JAPANESE VERSION)

学籍番号 : 年齢 歳 男性 / 女性

1. 海外滞在経験について ◈ これまでに、外国に 1 ヶ月以上滞在していた経験がある方にお聞きします。 どこの国にどれくらいの期間滞在していましたか。また、可能ならば、滞在していた時の年齢 と、滞在の理由・目的(例:語学留学、交換留学、親の仕事の都合、旅行、インターンなど) も記入して下さい。 国名【 】 期間【 年 ヶ 月】 年 齢【 - 才】 理由・ 目的 【 】 国名【 】 期間【 年 ヶ 月】 年 齢【 - 才】 理由・ 目的 【 】 国名【 】 期間【 年 ヶ 月】 年 齢【 - 才】 理由・ 目的 【 】

2. 英語能力について ◈ 下記のテストのスコアがある方は、テストの種類を○で囲み、スコアやレベルを記入して下さ い。 TOEFL ITP スコア【 点】 取得時期【 年】 TOEFL iBT スコア【 点】 取得時期【 年】 TOEIC スコア【 点】 取得時期【 年】 英検 レベル【 級】 取得時期【 年】 その他:

アンケートは以上で終了です。ご協力ありがとうございました。


APPENDIX H BACKGROUND QUESTIONNAIRE (ENGLISH TRANSLATION)

Faculty____ Department______ Year in school___ Student Number:_____ Age____ Male / Female

1. Overseas Experience Please answer the following question if you have spent more than one month in a foreign country. In what country did you live, and for how long? If possible, please indicate your age at that time and the reason for living overseas (e.g., foreign language study, exchange program participation, travel, internship, parent’s work, etc.). Country ( ) Period of stay ( Year(s) Month(s)) Age ( ) Purpose ( ) Country ( ) Period of stay ( Year(s) Month(s)) Age ( ) Purpose ( ) Country ( ) Period of stay ( Year(s) Month(s)) Age ( ) Purpose ( )

2. English Proficiency If you have taken an English proficiency test, please write the highest score that you obtained and the year in which you received the score.

TOEFL ITP Score ( ) Year ( ) TOEFL iBT Score ( ) Year ( ) TOEIC Score ( ) Year ( ) STEP Test Level ( ) Year ( ) Other:

Thank you for your cooperation.


APPENDIX I CORRELATION MATRIX OF DECLARATIVE KNOWLEDGE, LANGUAGE APTITUDE, AND L2 PROCEDURAL KNOWLEDGE

        2      3      4      5      6      7      8      9      10     11
 1. RM  .59**  .55**  .01    .10   -.05   .16*   .39**  .11    .21**  .19*
 2. TT  1      .87** -.01    .10   -.00   .10    .51**  .17*   .27**  .30**
 3. RE         1      .02    .12   -.02   .10    .54**  .16*   .29**  .37**
 4. NM                1      .33** .29**  .19*  -.12   -.13   -.05   -.03
 5. SS                       1     .24**  .28**  .02    .04   -.02    .03
 6. VL                             1      .21**  .01    .01   -.07    .10
 7. LA                                    1      .12   -.02    .05    .05
 8. L2P                                          1      .24**  .21**  .64**
 9. CM                                                  1      .02   -.09
10. AC                                                         1     -.01
11. FL                                                                1

Note. RM = receptive metalinguistic knowledge; TT = technical terminology; RE = rule explanation; NM = number learning; SS = sound-symbol association; VL = vocabulary learning; LA = language analytical ability; L2P = L2 procedural knowledge; CM = complexity; AC = accuracy; FL = fluency. *p < .05. **p < .01.
