Assessing Writing 37 (2018) 39–56

Modeling quality: A structural equation investigation of lexical, syntactic, and cohesive features in source-based and independent writing

Minkyung Kim⁎, Scott A. Crossley

Department of Applied Linguistics and ESL, Georgia State University, Atlanta, GA 30302, USA

ARTICLE INFO

Keywords: Second language writing; Structural equation modeling; Lexical sophistication; Syntactic complexity; Cohesion; Writing quality

ABSTRACT

This study develops a model of second language (L2) writing quality in the context of a standardized writing test (TOEFL iBT) using a structural equation modeling (SEM) approach. A corpus of 480 test-takers’ responses to source-based and independent writing tasks was the basis for the model. Four latent variables were constructed: an L2 writing quality variable informed by scores of source-based and independent writing tasks, and lexical sophistication, syntactic complexity, and cohesion variables informed by lexical, syntactic, and cohesive features within the essays. The SEM analysis showed that an L2 writing quality model had a good fit and was generalizable across writing prompts (with the exception of lexical features), gender, and learning contexts. The structural regression analysis indicated that 81.7% of the variance in L2 writing quality was explained by lexical decision reaction time scores (β = 0.932), lexical overlap between paragraphs (β = 0.434), and mean length of clauses via lexical decision reaction time scores (β = 0.607). These findings indicate that higher-rated essays tend to contain more sophisticated words that elicited longer response times in lexical decision tasks, greater lexical overlap between paragraphs, and longer clauses accompanying more sophisticated words. Implications for evaluating lexical, syntactic, and cohesive features in L2 writing are discussed.

1. Introduction

Writing is a process of creating meaning (Murray, 1980). To participate in writing as a meaning-making process, writers need to develop ideas linguistically and express ideas coherently (Grabe & Kaplan, 1996; Halliday & Hasan, 1976; Hayes, 1996). In second language (L2) writing, the role of language knowledge and how to use this knowledge in a linguistically and discoursally appropriate manner has a long history of research (Leki, Cumming, & Silva, 2008; Schoonen, van Gelderen, Stoel, Hulstijn, & de Glopper, 2011; Silva, 1993). Importantly, L2 writers’ linguistic and discoursal production can impact judgments of writing quality (Crossley & McNamara, 2012; Grabe & Kaplan, 1996; Weigle, 2002), such that higher-rated L2 essays generally contain more sophisticated lexical items (Kyle & Crossley, 2016), more complex grammatical structures (Ortega, 2015), and greater text cohesion (Crossley, Kyle, & McNamara, 2016a).

In addition to language, another factor that is important for understanding and assessing L2 writing quality is task type (i.e., genres or discourse modes such as narration, description, exposition, and argumentation). Previous research indicates that task types impact L2 writers’ performance, including composing processes (e.g., planning, drafting, and revising) and language use (e.g., use of lexicon and syntax; Grabe & Kaplan, 1996; Leki et al., 2008; Plakans, 2008). For instance, Lu (2011) found that argumentative

writing tasks elicit more complex syntactic structures than narrative writing tasks. Furthermore, in academic writing, L2 writers tend to put more effort into online planning during the drafting of source-based writing (i.e., writing that requires the integration of source materials), while they tend to put more effort into initial planning (before drafting) but less into online planning for independent writing (i.e., writing that draws on writers’ personal experience and knowledge; Plakans, 2008). Thus, previous research indicates that both language production and task type are important in investigating L2 writing quality. However, to our knowledge, no study has developed an L2 writing model that simultaneously considers the relationships among L2 writing quality; lexical, syntactic, and cohesive features; and writing task types. In addition, many studies assume, without evidence, that testing measures equally assess the same construct (e.g., writing quality) across different groups (e.g., female vs. male).

To address these gaps, the purpose of the current study is two-fold. First, the study will investigate how structural equation modeling (SEM) can be used to develop a latent model of L2 writing quality based on holistic scores from two different writing tasks (i.e., source-based and independent writing tasks) as well as the lexical, syntactic, and cohesive features produced in the writing tasks. Second, the study will examine whether assessment of writing quality and language features can be operationalized equally across gender, learning contexts, and writing prompts. Exploring these questions can shed new light on the relationships between language use, task type, and L2 writing quality, which may inform applications in L2 writing assessment in general and computational L2 writing evaluation in particular.

2. Literature review

2.1. L2 writing quality

Writing quality has been defined as “the fit of a particular text to its context, which includes such factors as the writer’s purpose, the discourse medium, and the audience’s knowledge of and interest in the subject” (Witte & Faigley, 1981, p. 199). In assessment contexts, writing quality generally aligns with the fit of test-takers’ essays to their assessment context, which is usually reflected by scoring rubrics. Scores are provided by expert raters and are often numeric and/or quantifiable. For example, scoring rubrics for TOEFL iBT independent writing tests indicate that a higher quality writing sample (scored on a 5-point scale) is characterized by strong text organization, the appropriate use of explanations, coherence, and proficient language use.

Considerable attention has been drawn to assessing L2 writing quality (Crossley & McNamara, 2012; Crossley et al., 2016a; Guo, Crossley, & McNamara, 2013; Kyle, 2016; Yang, Lu, & Weigle, 2015). In general, higher proficiency L2 writers produce higher quality writing samples. They are also likely to be higher proficiency L2 language users and to have greater writing expertise in their first languages (L1s; Cumming, 1989; Weigle, 2002). L2 knowledge is particularly important in proficient L2 writing because it enables L2 writers to transform propositional ideas into verbal forms in the L2 (Grabe & Kaplan, 1996; Schoonen et al., 2011). According to Bachman and Palmer (1996), language knowledge (as found in both an L1 and an L2) consists of organizational knowledge (about how sentences and texts are organized) and pragmatic knowledge (about how sentences and texts are related to the communicative goals of the language user; pp. 67–70). Organizational knowledge, which is the focus of this study, includes two areas: grammatical knowledge (i.e., knowledge of vocabulary, syntax, and phonology) and textual knowledge (i.e., knowledge of cohesion and knowledge of rhetorical organization; Bachman & Palmer, 1996). Previous studies have reported that L2 writers with more grammatical knowledge (e.g., a large vocabulary and mastery of various grammar structures) express their ideas more clearly and accurately (Schoonen et al., 2011), while L2 writers with more textual knowledge are more likely to connect longer stretches of ideas in a coherent manner (Crossley & McNamara, 2012).

To measure language knowledge in L2 writing, some scholars use independent measures of knowledge (e.g., vocabulary tests and phonological awareness tests; Schoonen et al., 2011), while others measure language features produced in written texts (e.g., frequency measures and syntactic complexity measures). While early studies measured language features manually (Wolfe-Quintero, Inagaki, & Kim, 1998), more recently, with the advent of natural language processing (NLP) tools, computational indices of language features can be computed quickly, flexibly, and reliably (Crossley, Kyle, & McNamara, 2016b; Kyle & Crossley, 2015; Lu, 2010). The availability of NLP tools has led to an increase in studies exploring links between language features and measures of L2 writing quality. In the next section, we provide an overview of previous research on the relations of L2 writing quality with the three types of language features commonly measured by NLP tools: lexical, syntactic, and cohesive features.

2.2. Lexical, syntactic, and cohesive features and L2 writing quality

2.2.1. Lexical sophistication and L2 writing quality

Lexical sophistication refers to the use of advanced, sophisticated, and difficult words in written or spoken language output (Laufer & Nation, 1995). A traditional measure of lexical sophistication is word frequency based on a large-scale corpus (i.e., reference-corpus frequency of words in a text; Kyle & Crossley, 2015). Low-frequency words (e.g., consolidation) are generally considered more advanced and sophisticated than high-frequency words (e.g., together). Beyond word frequency, different proxies for measuring ‘sophisticated’ vocabulary have been proposed: word range, age-of-acquisition (AoA) ratings, and word response times. Word range (i.e., the number of texts in which a word appears across a reference corpus; Kyle & Crossley, 2015) has been proposed because it can reflect a word’s distributional patterns across a reference corpus. AoA ratings (i.e., L1 speakers’ mean estimates of the age at which they learned words; Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012) have been suggested because word frequency and range are mainly based on materials for adult readers, and thus may not fully reflect cumulative frequency (i.e., the degree to which people have encountered words through exposure from childhood). Word response times (i.e., the

mean response times for a given word when it is presented in lexical decision and word naming tasks) can also serve as proxy measures for sophisticated words because they may reflect how difficult a word is for L1 speakers, based on online processing information for the given word (Balota et al., 2007). Sophisticated vocabulary includes words that occur in fewer contexts (i.e., low range values, as found in the word expatriate), words with higher AoA ratings (e.g., 14.53 years old for the word expatriate), and words that elicit longer response times for L1 speakers (e.g., 1009.94 ms for the word expatriate in lexical decision tasks). In contrast, less sophisticated vocabulary includes words that occur in more contexts (i.e., high range values, as found in the word people), words with lower AoA ratings (e.g., 3.52 years old for the word people), and words that elicit shorter response times (e.g., 540.55 ms for the word people in lexical decision tasks).

Much research has investigated the relations between lexical sophistication and L2 writing quality. One robust finding is that more proficient L2 writers use low-frequency words more than less proficient L2 writers (Crossley & McNamara, 2012; Guo et al., 2013; Laufer & Nation, 1995). Word range values are predictive of L2 writing quality such that more proficient L2 learners tend to use words with low range values in independent writing (Kyle & Crossley, 2016). AoA ratings are also predictive of holistic scores of L2 writing quality such that higher-rated essays tend to contain words with higher AoA ratings than lower-rated essays (Jung, Crossley, & McNamara, 2015). While word response times are predictive of L2 lexical proficiency (Berger, Crossley, & Kyle, in press), such that more proficient L2 learners use words that elicit longer response times than less proficient L2 learners, the relationship between word response times and L2 writing quality has not been examined.

2.2.2. Syntactic complexity and L2 writing quality

In L2 research, syntactic complexity is broadly defined as the variation and sophistication of grammatical structures (Lu, 2011; Ortega, 2015; Wolfe-Quintero et al., 1998). A traditional measure of syntactic complexity is the length of a T-unit (i.e., one main clause plus its embedded subordinate clauses; Hunt, 1965). In a review article, Norris and Ortega (2009) conceptualize syntactic complexity as composed of four sub-constructs: overall complexity (i.e., length-based measures, such as mean length of sentence and mean length of T-unit), subordination complexity (e.g., dependent clauses per clause), sub-clausal complexity via phrasal elaboration (e.g., mean length of clauses), and coordination. Recently, some researchers have suggested more fine-grained syntactic complexity measures that take into account syntactic complexity at the phrase level (e.g., noun phrases and verb phrases; Biber, Gray, & Poonpon, 2011; Crossley & McNamara, 2014) and verb-argument constructions (e.g., ditransitive constructions; Kyle, 2016).

Considerable attention has been drawn to the relations between syntactic complexity and L2 writing quality. Research has found that more proficient writers generally use longer clauses, T-units, and sentences (Lu, 2011; Ortega, 2003). For instance, Yang et al. (2015) found that global syntactic complexity (i.e., mean length of sentences and mean length of T-units) is predictive of L2 writing quality across two different topics in independent writing, such that more proficient L2 writers produce longer sentences and T-units across the different topics. Crossley and McNamara (2014) indicated that higher-rated L2 essays for independent expository writing are predicted by clausal complexity (e.g., a greater incidence of clauses and “that” verb complements). With reference to phrase-level syntactic complexity, Kyle (2016) found that higher-rated L2 essays in independent writing contain more complex phrases (e.g., a greater incidence of prepositional phrases), supporting the importance of phrasal elaboration in academic writing (Biber et al., 2011).

2.2.3. Cohesion and L2 writing quality

Coherence and cohesion are related but different constructs, both important in writing: Coherence refers to a quality of the mental representation of a text in the mind of the reader, while cohesion refers to surface-level linguistic linkages in a text that hold the text together (Halliday & Hasan, 1976). Generally, cohesion tends to be closely related to coherence since greater cohesion can help readers create a coherent mental representation (McNamara, Crossley, & McCarthy, 2010). The most commonly cited types of cohesive devices include Halliday and Hasan’s (1976) five types of cohesive ties, i.e., reference, conjunction, substitution, ellipsis, and lexical cohesion. Recently, using various automated indices, Crossley et al. (2016b) proposed three different types of cohesion: local cohesion at the sentence level (e.g., connectives and lexical overlap between sentences), global cohesion at the paragraph level (e.g., lexical overlap between paragraphs), and text cohesion at the entire-text level (e.g., word repetition and the incidence of function words in a text). Importantly, different types of discourse (e.g., narration, exposition, and argumentation) may have different expectations for coherence, which in turn likely lead to different uses of cohesive features (Richards, 1990).

Research on the relations between cohesion and L2 writing quality has found mixed results: Earlier studies report positive correlations between the incidence of cohesive features and L2 writing quality (Chiang, 2003), while more recent studies using computational indices indicate that cohesive features of L2 writing quality may vary depending on discourse types (Crossley & McNamara, 2012; Crossley et al., 2016a; Guo et al., 2013). For instance, in investigations of independent argumentative writing, research found that L2 essay quality is negatively correlated with local cohesion (e.g., connectives, word overlap between sentences, semantic similarity between sentences) and text cohesion (e.g., aspect repetition; Crossley & McNamara, 2012; Guo et al., 2013), but positively correlated with global cohesion (Crossley & McNamara, 2014; Jung et al., 2015). On the other hand, Guo et al. (2013) found that local cohesion (e.g., semantic similarity between sentences) is a positive predictor of L2 writing quality for source-based writing, arguably because the integration of source materials may require greater local cohesion. Similarly, in an investigation of expository writing, Crossley et al. (2016a) reported positive correlations between local cohesion and L2 essay quality.

In sum, higher quality writing samples tend to include more sophisticated words as captured by various lexical measures, such as word frequency, word range, and AoA ratings. Higher quality writing samples also tend to include longer clauses and sentences with phrasal elaboration. The relationships between text cohesion and L2 writing quality are nuanced and may differ depending on writing tasks. Beyond these language features, another important aspect of assessing L2 writing quality is whether and how different task types (e.g., source-based and independent writing) lead to different use of language features. This question is addressed in the next section.


2.3. Source-based and independent writing

The majority of the above studies focused on various task types, such as exposition, narration, and argumentation, while investigating relations between language use and L2 writing quality. Among various task types, academic writing is extensively researched because it is widely practiced by L2 learners (Elder & O’Loughlin, 2003) and because it is an essential skill for success in L2 academic contexts in higher education (Cumming, 2006). In exploring academic writing in L2s, previous research has focused on the effects of types of academic writing on L2 writers’ use of cognitive and linguistic resources (Guo et al., 2013; Kyle & Crossley, 2016; Cumming et al., 2005). Perhaps the two most widely used types of academic writing are source-based writing and independent writing (Weigle, 2002).

Source-based and independent writing tasks may differ in at least three aspects. First, in terms of cognitive demands, source-based writing requires summarizing and reorganizing information based on source materials, while independent writing requires generating ideas based on writers’ background knowledge and experience (Purves, Söter, Takala, & Vähäpassi, 1984). Second, in terms of writing processes, independent writing likely involves planning the text in preparation for a message (i.e., conceptual processes in which the writer self-selects the content), while source-based writing may not (Cumming et al., 2005; Plakans, 2008). Finally, with reference to assessment, source-based tasks concern both comprehension (i.e., reading and/or listening) and production (i.e., writing), whereas independent tasks concern production only (Weigle, 2002).

In light of these differences, it is not surprising that source-based and independent writing tasks elicit different language use from L2 writers as well as different foci from raters on linguistic and cohesive elements in judging essay quality (Cumming et al., 2005; Guo et al., 2013; Kyle & Crossley, 2016). For instance, Biber, Gray, and Staples (2016) reported that source-based writing tasks elicit more grammatically complex phrases (e.g., more complex noun phrases) than independent writing tasks. Guo et al. (2013) found that lexical sophistication is predictive of essay quality for both tasks, whereas cohesive features are predictive of essay quality for source-based tasks only. Kyle and Crossley (2016) found that source-based writing elicits more sophisticated lexical items than independent writing, and that word range and bigram frequency are predictive of essay quality for independent writing but not for source-based writing.

While previous studies have reported different use of language features across different writing tasks, it also seems reasonable to assume that L2 writers may produce similar language features when completing different writing tasks. This is because L2 writers’ grammatical and textual knowledge is unlikely to change when sequentially completing different writing tasks. Therefore, some language features are likely to be consistently employed by L2 writers across different tasks (e.g., Guo et al., 2013). However, to our knowledge, there is a dearth of research on the underlying lexical, syntactic, and cohesive features that may be consistently employed by L2 writers and predictive of human ratings of essay quality across different task types (i.e., source-based and independent writing tasks).

2.4. Generalizing across gender, writing prompts, and learning contexts in assessing writing

When investigating test-takers’ performances, researchers often assume that testing measures equally assess the same construct (e.g., writing quality) in all groups. However, this assumption has not been well tested. In testing contexts, examining the equivalence of measures (i.e., measurement invariance) is crucial to ensure that membership in different groups (e.g., male vs. female) or exposure to different test prompts does not confound the relationships among measured variables. In this respect, when investigating the relationships between writing quality and language features found in writing samples, it is important to examine whether these relationships generalize across different groups or different writing prompts.

In standardized tests, invariance is sought among a number of factors, including gender, learning contexts, and writing prompts. First, with respect to gender, while previous research has reported different language use between female and male learners in L2 writing (Bermudez & Prater, 1994; Johnson, 1992), it is assumed that standardized assessments do not show differences in measurement properties based on gender. Second, in terms of different learning contexts—English as a second language (ESL) learners vs. English as a foreign language (EFL) learners—previous L2 writing research has highlighted differences between second language contexts (i.e., where a target language is the primary national language) and foreign language contexts (i.e., where a target language is not the primary national language) based on differing contextual affordances (e.g., the degree of exposure to L2s) and L2 learners’ motivations for participating in L2 writing (Manchón, 2009). Despite these differences in learning contexts, in testing contexts, writing quality and language features should be operationalized similarly within scoring scales. For example, a high quality writing sample should be independent of the writer’s status as either an EFL or ESL learner. Finally, effects of writing prompts on assessing writing quality should be minimized within a test because multiple prompts for the same writing task should measure the same construct of L2 writing quality (Cho, Rijmen, & Novák, 2013). Thus, the relationships between language features and writing quality should not function differently across different prompts.

In summary, measurement properties in assessing writing quality and language features should be invariant across gender, learning contexts, and writing prompts, so that measurement properties for a writing test contribute equally across groups.

2.5. The current study

While many studies have demonstrated differences in language features in written responses to source-based and independent writing tasks, little attention has been paid to the similarities in language features that are consistently used by L2 writers and predictive of human judgments of essay quality across different task types (e.g., source-based and independent writing tasks). In addition, many previous studies have relied on one writing task to examine relationships between writing quality and language

features, but writing quality is difficult to measure, and developing models using multiple writing tasks is important for obtaining generalizable findings about the relationships between writing quality and language features (Schoonen, 2005). Lastly, when assessing links between writing scores and linguistic features, little research has addressed whether assessments offer measurement invariance across different group memberships. That is, few studies have examined whether the relationships among measured variables are the same for different classifications of people (e.g., gender) and for different conditions (e.g., writing prompts), which are important components of assessing whether results hold for test-takers regardless of gender, writing prompts, and learning contexts.

To fill this gap, the current study develops an L2 writing quality model based on two different writing tasks (i.e., source-based and independent writing tasks) and the lexical, syntactic, and cohesive features produced during those tasks, using an SEM approach. To do so, we use confirmatory factor analysis to create a latent variable model of L2 writing quality. For the L2 writing model, we construct latent variables using observable measures of writing scores as well as lexical, syntactic, and cohesive features as found in writing samples in both source-based and independent writing. We then add a structural model to examine how the latent linguistic variables can predict the latent variable of L2 writing quality. Finally, we conduct measurement invariance analysis to test whether the developed L2 writing model is generalizable across three different group criteria: writing prompts, gender (i.e., male and female), and learning contexts (i.e., ESL and EFL). This study is guided by two key research questions:

1. How can latent variables of L2 lexical, syntactic, and cohesive features predict a latent variable of L2 writing quality?
2. Is the developed L2 writing quality model generalizable across writing prompts, gender, and learning contexts?

To answer these research questions, we used SEM, a broad class of data-analytic techniques that examine the relationships among observable variables (i.e., measurable variables or indicators) and their underlying latent variables (i.e., factors; Kline, 2011). A full SEM includes both a latent variable model (i.e., analyzing the relationships between observed variables and latent variables without a structural model) and a structural model (i.e., investigating the predictive relationships among latent variables). Confirmatory factor analysis is used to test whether a latent variable model adequately fits the covariances between the indicator variables (Kline, 2011). While SEM can be considered confirmatory, there are three main applications for SEM (Jöreskog, 1993; Kline, 2011): a strictly confirmatory application (i.e., a single model is available and tested), testing alternative models (i.e., one model among other alternative models is selected based on correspondence to the data), and model generation (i.e., when an initial model does not correspond to the data, it is modified by the researcher). The current study adopted the third approach (i.e., model generation), whose goal is to discover and develop a model that makes sense theoretically, is parsimonious, and fits the data (Jöreskog, 1993; Kline, 2011).

In line with the model generation approach, the main purpose of this study was to develop a model of writing quality in relation to lexical, syntactic, and cohesive features. Each lexical, syntactic, and cohesive feature and each writing score were treated as indicator variables. Four latent variables comprising the indicator variables were constructed: an L2 writing quality variable informed by the writing scores for both independent and integrated tasks, and lexical sophistication, syntactic complexity, and cohesion variables informed by lexical, syntactic, and cohesive features, respectively, as found in both independent and integrated writing samples. The hypothesized latent variable model is shown in Fig. 1, in which latent variables are represented by ovals and observable variables are represented by squares. In Fig. 1, using structural regression analysis, the latent variables of lexical sophistication, syntactic complexity, and cohesion were used as predictors of the latent variable of L2 writing quality.

Importantly, this study used SEM to address three research gaps. First, previous research assessed lexical, syntactic, and cohesive features within a single writing task (Crossley & McNamara, 2012, 2014; Crossley et al., 2016a; Lu, 2010) or examined different writing task types using separate statistical analyses (Guo et al., 2013; Kyle & Crossley, 2016).

Fig. 1. A Hypothesized Latent Variable Model of L2 Writing Quality.

Unlike these previous studies, this study used a single statistical analysis (i.e., SEM) to investigate the relationship between L2 writing quality and lexical, syntactic, and cohesive features across two different writing tasks. In this respect, this study provides a more comprehensive picture of the relationships between language features and writing quality. Second, many previous studies used multiple regressions to predict writing scores using computational indices (e.g., Crossley & McNamara, 2012, 2014; Crossley et al., 2016a). However, regression models analyze observed variables only. An advantage of using SEM is that it allows latent variables to have scores with measurement error extracted from observed scores. That is, latent variables can reflect “true” scores in a psychometric sense—scores that are freer of measurement error as compared to observed variables, which include both true scores and a degree of measurement error (Kieffer & Lesaux, 2012; Kline, 2011). Thus, in our model, each latent variable, with measurement error corrected, represented “true” language features and writing quality of both source-based and independent writing. Third, unlike previous studies that did not take into account measurement invariance across different groups (e.g., gender) and different conditions (e.g., prompts), this study examined measurement invariance through multi-group confirmatory factor analysis to test whether a latent variable model works equivalently (i.e., invariantly) for different groups (Dimitrov, 2010).

In creating a model, it should be mentioned that the rationales for choosing language features were both confirmatory and exploratory. Our approach was partly confirmatory because we were informed by previous studies that reported language features as predictive of writing scores. At the same time, our approach was also exploratory for three main reasons. First, there is no agreement on a single theoretical account of which specific language features should be included in generating a model of writing quality. Second, given that many previous studies indicate that lexical, syntactic, and cohesive features are multi-dimensional (Crossley et al., 2016b; Biber et al., 2016; Kim, Crossley, & Kyle, 2018), it would be almost impossible to consider all of the lexical, syntactic, and cohesive features in a single model. Third, many lexical, syntactic, and cohesive features show interdependence (Norris & Ortega, 2009) and are strongly correlated with each other (i.e., showing multicollinearity), which means that some features may measure the same construct (e.g., word frequency and word familiarity similarly measure the degree of exposure to a language). Thus, to create a parsimonious model (Kline, 2011) among various language features, we controlled for multicollinearity.
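To make the hypothesized structure in Fig. 1 concrete, the following is a minimal lavaan sketch of that model. This is not the authors' published code; the data frame `essays` and its column names are hypothetical placeholders (one indicator from the source-based task and one from the independent task for each latent variable).

```r
library(lavaan)

# Hypothesized model in Fig. 1: four latent variables, each informed by one
# indicator from the source-based (sb_) and one from the independent (ind_) task
model_fig1 <- '
  writing_quality =~ sb_score + ind_score
  lexical         =~ sb_lex   + ind_lex
  syntactic       =~ sb_syn   + ind_syn
  cohesion        =~ sb_coh   + ind_coh

  # Structural regression: linguistic latent variables predict writing quality
  writing_quality ~ lexical + syntactic + cohesion
'

# Assuming a data frame `essays` containing the eight indicator columns:
# fit <- sem(model_fig1, data = essays, std.lv = TRUE)  # latent variances fixed at 1.0
```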

3. Method

3.1. Corpus

We used a corpus of publicly available Test of English as a Foreign Language (TOEFL iBT) writing samples collected from 480 test-takers (i.e., the TOEFL iBT Public Use Dataset). The TOEFL iBT writing tasks represent the types of written tasks and texts that L2 students are likely to encounter in postsecondary settings and are considered appropriate for assessing test-takers’ academic language proficiency (Enright & Tyson, 2008). In the TOEFL iBT corpus, each test-taker completed two writing tasks: source-based (i.e., integrated) and independent writing. The 480 test-takers included 185 ESL learners who took the TOEFL in the U.S.A. or Canada, and 295 EFL learners who took the TOEFL in countries other than the U.S.A. or Canada. The native languages of the test-takers included 51 different languages (see Table 1 for detailed information). Among the 480 test-takers, 212 were male and 204 were female. Gender was not reported for the remaining 64 test-takers.

The source-based writing tasks asked test-takers to first read a passage, listen to a related lecture, and then synthesize the information provided in the passage and the lecture in a 20-min writing assessment. The reading passage was shown on the screen during the composing stage. The independent writing tasks asked test-takers to write an argumentative essay on a given topic by providing specific reasons and examples in a 30-min timeframe.

The corpus included two test forms with different prompts. Each form contained 240 test-takers. In the source-based tasks, Form 1 was about the effects of fish farming, while Form 2 was about bird migration. In the independent tasks, the prompt of Form 1 was about choosing to study subjects of interest or choosing subjects to prepare for a job, while the prompt of Form 2 was about the importance of cooperation in today’s world as compared to cooperation in the past.

Table 1
Participants by Native Languages.

Native language     Number   Percentage
Chinese             83       17.292
Korean              56       11.667
Spanish             52       10.833
Japanese            50       10.417
Arabic              30       6.250
German              26       5.417
French              23       4.792
Hindi               13       2.708
Portuguese          10       2.083
Russian             10       2.083
Other languagesᵃ    127      26.458
Total               480      100

ᵃ Other languages include 41 languages with fewer than 10 test-takers.


Table 2
Mean (Standard Deviation) of Essay Scores and Word Counts.

                  Source-based tasks                                       Independent tasks
                  Form 1 (n = 240)   Form 2 (n = 240)   Total (n = 480)    Form 1 (n = 240)   Form 2 (n = 240)   Total (n = 480)
Score             3.254 (1.179)      3.148 (1.308)      3.151 (1.244)      3.383 (0.864)      3.471 (0.910)      3.427 (0.887)
Number of words   204.840 (52.437)   196.040 (50.662)   200.440 (51.692)   321.830 (79.720)   309.380 (77.117)   315.600 (78.596)

Essays were evaluated by two trained TOEFL iBT raters using a five-point holistic scale that ranged from 1.0 to 5.0.¹ If the two ratings differed by one point or less, they were averaged. If the two scores for an essay differed by more than one point, a third rater scored the essay. The rubric for the source-based tasks focused on the coherent and accurate presentation of the important information given in the lecture and the reading, along with appropriate language use and grammar. The rubric for the independent tasks focused on the effective completion of the writing task, text organization, the appropriate use of explanations and exemplifications, coherence, and the demonstration of syntactic variety along with appropriate word and phrase choices. Table 2 shows descriptive statistics of essay scores and word counts for the essays used in this study.
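As a small illustration, the adjudication rule just described can be written out as a function. This is a hypothetical sketch, not ETS code; the text does not specify exactly how a third rating is combined, so the sketch simply returns it.

```r
# Hypothetical sketch of the score-combination rule (not ETS code)
combine_ratings <- function(r1, r2, r3 = NULL) {
  if (abs(r1 - r2) <= 1) {
    mean(c(r1, r2))          # ratings within one point are averaged
  } else {
    stopifnot(!is.null(r3))  # otherwise a third rater scores the essay
    r3                       # assumption: the third rating is used as-is
  }
}

combine_ratings(3, 4)          # 3.5
combine_ratings(2, 4, r3 = 3)  # 3
```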

3.2. Natural language processing tools

To measure the lexical, syntactic, and cohesive features in the writing samples, we used three NLP tools: the Tool for the Automatic Analysis of Lexical Sophistication (TAALES; Kyle & Crossley, 2015), the Syntactic Complexity Analyzer (SCA; Lu, 2010), and the Tool for the Automatic Analysis of Cohesion (TAACO; Crossley et al., 2016b). Below, we discuss the computational indices used in this study.

3.2.1. TAALES indices

A total of five lexical indices that have been reported as predictive of L2 writing quality and L2 lexical proficiency were selected from TAALES (cf. Crossley & McNamara, 2012; Kyle & Crossley, 2016; see Kyle & Crossley, 2015 for the reliability of TAALES): word frequency, word range, AoA ratings, and two word response time norms. Word frequency refers to the incidence of words within a large corpus, while range refers to the number of texts in a corpus in which a word occurs. The word range and frequency norms were derived from the Corpus of Contemporary American English (COCA; Davies, 2008) academic sub-corpus in TAALES. For this study, we used log-transformed COCA academic frequency scores and raw COCA academic range scores. Higher AoA scores reflect words that are learned at a later age. Our AoA scores were derived from the 30,121 lemmas obtained from Kuperman et al. (2012). Our word response times included lexical decision and word naming behavioral norms taken from the English Lexicon Project (ELP; Balota et al., 2007). The ELP includes native English speakers’ mean response times for 40,481 words.
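To illustrate how such word-level norms become text-level indices, here is a conceptual sketch (not TAALES itself): a word norm is looked up for each token in an essay and averaged. The norm values shown are illustrative, in the style of the ELP.

```r
# Conceptual sketch of a text-level lexical sophistication index (not TAALES)
mean_norm <- function(text, norms) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[words != ""]
  mean(norms[words], na.rm = TRUE)  # tokens missing from the norms are skipped
}

# Illustrative ELP-style lexical decision RTs in ms (Balota et al., 2007)
rt_norms <- c(people = 540.55, expatriate = 1009.94)
mean_norm("People rarely meet an expatriate.", rt_norms)  # (540.55 + 1009.94) / 2
```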

3.2.2. SCA indices

Syntactic complexity was measured using the Syntactic Complexity Analyzer (SCA) developed by Lu (2010). SCA contains 14 indices of syntactic complexity based on nine linguistic units: word, sentence, clause,² dependent clause, T-unit, complex T-unit, coordinate phrase, complex nominal, and verb phrase (see Lu, 2010 for the reliability of SCA). Lu (2010) developed these 14 indices based on second language writing development studies (Wolfe-Quintero et al., 1998). Among the 14 indices, we chose seven measures that significantly differentiated developmental levels in Lu’s (2011) study: three length-related measures (i.e., mean length of clause, mean length of sentence, and mean length of T-unit), two coordination-related measures (i.e., coordinate phrases per clause, and coordinate phrases per T-unit), and two nominal-related measures (i.e., complex nominals per clause, and complex nominals per T-unit).³ Table 3 shows how these seven measures are calculated.

3.2.3. TAACO indices

A total of four TAACO indices were chosen (see Crossley et al., 2016b for the reliability of TAACO). These included two measures related to connectives and two measures related to lexical overlap. Connectives were selected because they can help readers explicitly establish cohesive linkages between sentences within a text (Chiang, 2003; Crossley & McNamara, 2012; Halliday & Hasan, 1976). Lexical overlap indices were chosen because repeating the same lemmas helps readers find the continuity of lexical ties across the text (Crossley et al., 2016b; Halliday & Hasan, 1976). Our two connective indices were positive logical connectives (e.g., actually and furthermore) and negative logical connectives (e.g., but and although). Our lexical overlap indices examined word overlap between two adjacent sentences (i.e., local cohesion) and between two adjacent paragraphs (i.e., global cohesion).
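The paragraph-overlap idea can be sketched as follows. This is a conceptual illustration rather than TAACO's implementation (TAACO operates on lemmas and offers several overlap variants), and the example paragraphs are invented.

```r
# Conceptual sketch of global cohesion as lexical overlap between adjacent
# paragraphs (not TAACO's implementation, which works with lemmas)
tokenize <- function(x) {
  w <- unlist(strsplit(tolower(x), "[^a-z']+"))
  unique(w[w != ""])          # word types in the paragraph
}

paragraph_overlap <- function(paragraphs) {
  types <- lapply(paragraphs, tokenize)
  pairs <- seq_len(length(types) - 1)
  mean(vapply(pairs, function(i)
    length(intersect(types[[i]], types[[i + 1]])), numeric(1)))
}

paragraph_overlap(c("Fish farming has grown quickly.",
                    "Farming fish, however, carries risks."))  # 2 shared types
```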

¹ The scores were those provided by ETS along with the iBT Public Use Dataset. The independent and integrated scoring rubrics are available on the ETS website at http://www.ets.org/Media/Tests/TOEFL/pdf/Writing_Rubrics.pdf.
² In SCA, a clause is defined as a structure with a subject and a finite verb.
³ While SCA has been found useful for measuring syntactic complexity in L2 writing in previous studies (e.g., Lu, 2011; Yang et al., 2015; Yoon & Polio, 2017), one caveat in using SCA is that length-based measures (e.g., length of T-units) do not provide detailed information about grammatical complexity because they collapse many grammatical distinctions (Biber et al., 2016). Despite this disadvantage, we chose to use SCA because we were more interested in test-takers’ use of syntactic structures in a broad manner than in their use of specific grammatical features (e.g., the incidence of relative clauses per nominal).


Table 3
Seven Syntactic Complexity Measures (Lu, 2010).

Measure                                  Definition
Mean length of clause (MLC)              Number of words / Number of clauses
Mean length of sentence (MLS)            Number of words / Number of sentences
Mean length of T-unit (MLT)              Number of words / Number of T-units
Coordinate phrases per clause (CP/C)     Number of coordinate phrases / Number of clauses
Coordinate phrases per T-unit (CP/T)     Number of coordinate phrases / Number of T-units
Complex nominals per clause (CN/C)       Number of complex nominals / Number of clauses
Complex nominals per T-unit (CN/T)       Number of complex nominals / Number of T-units
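Once SCA has counted the units (which it does automatically from parse trees), the Table 3 ratios are simple divisions. The counts below are invented for illustration.

```r
# Table 3 ratios computed from hypothetical unit counts for one essay
counts <- list(words = 315, sentences = 14, clauses = 33, t_units = 18,
               coordinate_phrases = 6, complex_nominals = 31)

with(counts, c(
  MLC  = words / clauses,                # mean length of clause
  MLS  = words / sentences,              # mean length of sentence
  MLT  = words / t_units,                # mean length of T-unit
  CP_C = coordinate_phrases / clauses,   # coordinate phrases per clause
  CP_T = coordinate_phrases / t_units,   # coordinate phrases per T-unit
  CN_C = complex_nominals / clauses,     # complex nominals per clause
  CN_T = complex_nominals / t_units      # complex nominals per T-unit
))
```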

3.3. Statistical analysis

3.3.1. Normality and multicollinearity

Normal distribution for all of the selected indices was checked through skewness and kurtosis levels. Values for skewness and kurtosis between −2 and +2 were considered to indicate univariate normality (George & Mallery, 2010). Multicollinearity between indices (defined as r > 0.70) was then controlled for so as not to include indices that measure similar features. If two or more indices showed multicollinearity with each other, the index that was most strongly correlated with writing scores was kept.
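A sketch of this screening in R follows. The simulated data frame stands in for the study's 16 real indices, and skewness/kurtosis are taken from the e1071 package (one assumption among several possible implementations).

```r
library(e1071)  # for skewness() and kurtosis()

set.seed(1)
indices <- data.frame(freq = rnorm(480), mlc = rnorm(480), aoa = rexp(480))
score   <- rnorm(480)  # stand-in for essay scores

# Univariate normality screen: |skewness| and |kurtosis| < 2
normal_ok <- vapply(indices, function(x)
  abs(skewness(x)) < 2 && abs(kurtosis(x)) < 2, logical(1))
indices <- indices[, normal_ok, drop = FALSE]  # the skewed exponential column fails

# Multicollinearity screen: among pairs with |r| > 0.70, keep the index
# more strongly correlated with the writing scores
r_mat <- cor(indices)
high  <- which(abs(r_mat) > 0.70 & upper.tri(r_mat), arr.ind = TRUE)
# ... for each row of `high`, drop the member with the weaker cor(., score)
```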

3.3.2. Confirmatory factor analysis

To generate a latent variable model, we conducted confirmatory factor analysis using R (R Core Team, 2015) and the lavaan package (Rosseel, 2012). In creating an L2 writing model, we used lexical, syntactic, and cohesive indices that were significantly correlated with writing scores for both source-based and independent tasks. Multivariate normality was checked using Mardia’s normalized estimate, with values below five considered to indicate multivariate normality (Byrne, 2006). The covariance matrix among the indicators was computed as input for the SEM. When evaluating the model, the variances of the latent variables were fixed at 1.0 so that factor loadings (i.e., measures of the influence a latent variable has on indicator variables) for each indicator variable were comparable.

To evaluate overall model fit, four goodness-of-fit measures were used: the χ² (chi-square), the comparative fit index (CFI), the root mean square error of approximation (RMSEA), and the standardized root mean square residual (SRMR). The χ² measures the absolute fit of the model to the data, but it is prone to rejecting the null hypothesis with large sample sizes. Thus, the normed χ² (i.e., χ²/degrees of freedom (df)) was used, where values below three indicate good fit (Hair, Black, Babin, Anderson, & Tatham, 2006). Indicators of good model fit included CFI statistics greater than 0.95, RMSEA less than 0.06, and SRMR less than 0.08, while indicators of acceptable model fit included CFI statistics greater than 0.90 and RMSEA less than 0.08 (Hu & Bentler, 1999).
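The CFA step could look like the lavaan sketch below. lavaan is the package the study reports using, but the data frame `essays` and its column names are hypothetical stand-ins for the two essay scores and the six selected indices (one per task for each feature type).

```r
library(lavaan)

# Measurement model: each latent variable informed by a source-based (sb_)
# and an independent (ind_) indicator
cfa_model <- '
  writing_quality =~ sb_score + ind_score
  lexical         =~ sb_lexdec_rt + ind_lexdec_rt
  syntactic       =~ sb_mlc + ind_mlc
  cohesion        =~ sb_para_overlap + ind_para_overlap
'

# fit <- cfa(cfa_model, data = essays, std.lv = TRUE,  # latent variances = 1.0
#            estimator = "MLM")                        # Satorra-Bentler scaled chi-square
# fitMeasures(fit, c("chisq.scaled", "df", "cfi.scaled", "rmsea.scaled", "srmr"))
# Cutoffs used: normed chi-square < 3, CFI > 0.95, RMSEA < 0.06, SRMR < 0.08
```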

3.3.3. Measurement invariance

Using the latent variable model that was constructed through the confirmatory factor analysis, the invariance of measurement across groups was tested. Measurement invariance tests the comparability of scores between group variables and investigates whether measurement properties contribute equally (mainly through constraining factor loadings and intercepts) across these group variables (Beaujean, 2014; Meredith, 1993). That is, measurement invariance tests whether the structural equations used to create latent factor scores are equal across groups. The establishment of measurement invariance indicates that the constructed model and latent factors function equivalently across different groups, or in other words, that the model is fair to different groups. We included three different group variables: two writing prompts (i.e., one group who took Form 1 and the other group who took Form 2), gender (i.e., male and female), and two learning contexts (i.e., ESL and EFL). When evaluating models for measurement invariance, the latent variable variances were fixed at 1.0.

Measurement invariance is assessed in four sequential stages (Beaujean, 2014; Dimitrov, 2010): configural invariance and three stages of measurement invariance (metric, scalar, and strict). Configural invariance (Model 0) concerns whether a latent variable model holds for both groups simultaneously. Configural invariance indicates that the groups have the same number of latent variables, formed by the same number of indicator variables. Metric/weak measurement invariance (Model 1) adds that factor loadings on indicator variables are equal across groups. Scalar/strong measurement invariance (Model 2) adds that, for a given indicator, intercepts (i.e., means of indicator variables) are equal across groups. Scalar invariance signals that individuals at the same level of a given latent variable have the same value on the indicator variables regardless of group membership. Strict measurement invariance (Model 3) adds that, for a given indicator, residual variances and covariances are equal across groups. Strict invariance signals that the indicator variables were measured with the same precision in each group.

When configural invariance across groups was met, measurement invariance tests were conducted for pairs of nested models (Dimitrov, 2010): Model 1 vs. Model 0, Model 2 vs. Model 1, and Model 3 vs. Model 2. Invariance was interpreted in terms of the chi-square difference (Δχ²) and the CFI difference (ΔCFI = CFI(constrained) − CFI(unconstrained)). When Δχ² is significant and the ΔCFI value is lower than −0.01, measurement invariance is not warranted (Dimitrov, 2010).
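In lavaan, the four stages correspond to progressively tighter group.equal constraints. The sketch below reuses the hypothetical `cfa_model` and `essays` objects from the previous sketch, with "form" as a hypothetical grouping column.

```r
# Invariance sequence via multi-group CFA (constraints accumulate)
configural <- cfa(cfa_model, data = essays, group = "form", estimator = "MLM")
metric     <- cfa(cfa_model, data = essays, group = "form", estimator = "MLM",
                  group.equal = "loadings")
scalar     <- cfa(cfa_model, data = essays, group = "form", estimator = "MLM",
                  group.equal = c("loadings", "intercepts"))
strict     <- cfa(cfa_model, data = essays, group = "form", estimator = "MLM",
                  group.equal = c("loadings", "intercepts", "residuals"))

# Nested comparisons with the Satorra-Bentler scaled chi-square difference;
# reject invariance when the scaled difference is significant and dCFI < -0.01
anova(configural, metric, scalar, strict)

# Partial invariance frees offending parameters, e.g. one intercept:
# cfa(cfa_model, data = essays, group = "form", estimator = "MLM",
#     group.equal = c("loadings", "intercepts"),
#     group.partial = "sb_lexdec_rt ~ 1")
```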


4. Results

4.1. Descriptive analyses and correlations

Descriptive statistics for all of the 16 indices used in this study along with correlations between the indices and writing scores for both tasks are shown in Table 4 (see Appendix A for correlations among all of the 16 indices for both source-based and independent writing). Among the five lexical sophistication indices, three measures (i.e., lexical decision mean reaction time, word naming mean reaction time, and AoA rating) were significantly correlated with writing scores for both tasks, but they showed multicollinearity in independent writing. Thus, the lexical decision mean response time norm, which most strongly correlated with writing scores for both tasks, was chosen for SEM analysis. Among the seven syntactic complexity measures, mean length of clause was the only index that showed significant correlations with writing scores for both tasks and was therefore selected for inclusion in the SEM. Among the four cohesion indices, lexical overlap between paragraphs was chosen for the SEM because it showed significant correlations with writing scores for both tasks. These selections resulted in two indicators for each latent variable. While it is generally recommended that latent variables have at least four indicators for model identification, latent variables can also have two indicators when (a) the two indicators’ errors are uncorrelated and (b) both of the two indicators correlate with a third indicator of another latent variable but neither of the two indicators’ errors correlates with the error of that third indicator (Beaujean, 2014; Kenny, Kashy, & Bolger, 1998), which is the case with our data.

4.2. Multivariate normality

The results of the multivariate normality test indicated non-normality of the multivariate distribution. To address non-normality, the MLM estimator (i.e., using standard maximum likelihood to estimate the model parameters with robust standard errors and a Satorra-Bentler scaled test statistic) was used. The MLM χ² (i.e., SBχ²) incorporates a scaling correction to estimate the chi-square under non-normality (Satorra & Bentler, 1994). For invariance model fit, the Satorra-Bentler scaled chi-square difference (SBΔχ²) was also used (Satorra & Bentler, 2001).

4.3. Confirmatory factor analysis

Correlations among the indicator variables (i.e., the indices chosen for the SEM analysis) are shown in Table 5. The initial latent variable model as hypothesized in Fig. 1 fit the data well: SBχ²(14) = 24.219, SBχ²/df = 1.730, CFI = 0.987, RMSEA = 0.039, and SRMR = 0.025. Fig. 2 represents the latent model for L2 writing quality. The latent variables were renamed after the selected indices: Lexical decision reaction times, Mean length of clauses, and Lexical overlap between paragraphs.

Table 4
Mean Scores and Correlations Between Indices and Writing Scores (n = 480).

Index | Category | Source-based writing: Mean (SD) | Correlationᵃ | Independent writing: Mean (SD) | Correlationᵃ

(a) Indices significantly correlated with scores for both tasks
Lexical decision mean reaction time | Lexical sophistication | 621.785 (6.223) | r = 0.341ᵇ | 619.967 (6.950) | r = 0.514ᵇ
Word naming mean reaction time | Lexical sophistication | 609.878 (5.047) | r = 0.306ᵇ | 604.812 (5.622) | r = 0.390ᵇ
Age-of-acquisition (AoA) | Lexical sophistication | 5.455 (0.239) | r = 0.284ᵇ | 5.268 (0.255) | r = 0.400ᵇ
Mean length of clause | Syntactic complexity | 9.818 (1.728) | r = 0.138ᵇ | 9.663 (1.949) | r = 0.254ᵇ
Lexical overlap between paragraphs | Cohesion | 10.091 (5.226) | r = 0.239ᵇ | 14.038 (6.016) | r = 0.292ᵇ

(b) Indices significantly correlated with source-based writing scores only
Mean length of sentence | Syntactic complexity | 26.134 (13.349) | rho = 0.136ᵇ | 26.622 (16.400) | rho = 0.050
Mean length of T-unit | Syntactic complexity | 22.897 (11.022) | rho = 0.159ᵇ | 22.903 (14.666) | rho = 0.079
Lexical overlap between sentences | Cohesion | 4.769 (2.483) | rho = 0.138ᵇ | 4.456 (2.834) | rho = 0.026

(c) Indices significantly correlated with independent writing scores only
Word range (COCA academic) | Lexical sophistication | 0.594 (0.040) | r = −0.076 | 0.656 (0.030) | r = −0.256ᵇ
Word frequency (log-transformed, COCA academic) | Lexical sophistication | 2.914 (0.116) | r = −0.094 | 3.058 (0.077) | r = −0.213ᵇ
Complex nominals per clause | Syntactic complexity | 1.206 (0.335) | r = 0.114 | 0.972 (0.317) | rho = 0.219ᵇ
Coordinate phrases per clause | Syntactic complexity | 0.149 (0.092) | rho = 0.032 | 0.194 (0.115) | r = 0.194ᵇ
Positive logical connectives (normed) | Cohesion | 0.033 (0.016) | r = −0.008 | 0.038 (0.013) | r = −0.155ᵇ

(d) Indices not correlated with writing scores for either independent or source-based writing
Complex nominals per T-unit | Syntactic complexity | 2.837 (1.630) | rho = 0.094 | 2.343 (1.796) | rho = 0.097
Coordinate phrases per T-unit | Syntactic complexity | 0.347 (0.276) | rho = −0.032 | 0.456 (0.414) | rho = 0.117
Negative logical connectives (normed) | Cohesion | 0.010 (0.008) | r = 0.111 | 0.008 (0.006) | r = 0.050

ᵃ If indices were normally distributed, Pearson’s correlations (r) are reported; if indices were not normally distributed, Spearman’s rank-order correlations (rho) are reported.
ᵇ p < 0.003125 (because 16 correlations were conducted between writing scores and 16 indices, the alpha value was adjusted to 0.003125, or 0.05/16, by a Bonferroni correction).


Table 5
Correlations Among Indicator Variables (n = 480).

                                                                  1         2         3         4         5         6         7
1 Score for source-based writing                                  1
2 Lexical decision mean response time for source-based writing    0.341**   1
3 Mean length of clause for source-based writing                  0.138**   0.229**   1
4 Lexical overlap between paragraphs for source-based writing     0.239**   0.094*    −0.020    1
5 Score for independent writing                                   0.697**   0.397**   0.149**   0.214**   1
6 Lexical decision mean response time for independent writing     0.472**   0.407**   0.277**   0.116*    0.514**   1
7 Mean length of clause for independent writing                   0.194**   0.251**   0.433**   0.010     0.254**   0.374**   1
8 Lexical overlap between paragraphs for independent writing      0.218**   0.074     0.040     0.302**   0.292**   0.002     −0.061

* Indicates p < 0.05. ** Indicates p < 0.01.

Fig. 2. The Latent Variable Model for L2 Writing Quality with Parameter Estimates. Note: Estimates are standardized; all estimates are significant (p < 0.001) except those with p values marked, and the dashed line indicates a non-significant path.

The correlations among latent variables are shown in Table 6. L2 writing quality showed a strong correlation with Lexical decision reaction times (r = 0.792), while showing moderate correlations with Mean length of clauses (r = 0.345) and Lexical overlap between paragraphs (r = 0.499). Lexical decision reaction times and Mean length of clauses showed a strong correlation (r = 0.650). Lexical overlap between paragraphs showed no correlation with either Lexical decision reaction times (r = 0.129, p = 0.101) or Mean length of clauses (r = −0.047, p = 0.537). Inspection of the regression weights showed that 81.1% of the variance in L2 writing quality was explained by Lexical decision reaction times (standardized coefficient (β) = 0.879, p < 0.01) and Lexical overlap between paragraphs (β = 0.376, p < 0.01). However, Mean length of clauses did not predict L2 writing quality despite a moderate correlation between the two (r = 0.345), arguably because of the stronger correlation between Mean length of clauses and Lexical decision reaction times (r = 0.650), which indicated shared variance.

Because previous research supports the importance of syntactic complexity in L2 writing (Crossley & McNamara, 2014; Lu, 2011; Yang et al., 2015) and the latent variable of Mean length of clauses showed a strong correlation with the latent variable of Lexical decision reaction times, a post-hoc mediation analysis was conducted to examine whether Mean length of clauses had an indirect effect

Table 6
Correlations Among Latent Variables (n = 480).

                                        1          2          3
1 L2 writing quality                    1
2 Lexical decision reaction times       0.792***   1
3 Mean length of clauses                0.345***   0.650***   1
4 Lexical overlap between paragraphs    0.499***   0.129      −0.047

*** Indicates p < 0.001.


Fig. 3. The Final Latent Model of L2 Writing Quality with a Mediation Analysis. Note: Estimates are standardized; all estimates are significant (p < 0.001) except those with p values marked, and the dashed line indicates a non-significant path.

on L2 writing quality via Lexical decision reaction times. Following Baron and Kenny (1986), three mediational assumptions were verified prior to the mediation analysis: (a) the independent variable (i.e., Mean length of clauses) alone predicted the dependent variable (i.e., L2 writing quality; R² = 0.100, β = 0.317, p < 0.001); (b) the mediator (i.e., Lexical decision reaction times) alone predicted the dependent variable (R² = 0.642, β = 0.801, p < 0.001); and (c) the independent variable alone predicted the mediator (R² = 0.439, β = 0.655, p < 0.001).

The model with the mediation analysis (see Fig. 3) had a good fit: SBχ²(15) = 28.376, SBχ²/df = 1.891, CFI = 0.983, RMSEA = 0.043, and SRMR = 0.035. In this model, 81.7% of the variance in L2 writing quality was explained by Lexical decision reaction times (β = 0.932, p < 0.01) and Lexical overlap between paragraphs (β = 0.434, p < 0.01). The indirect effect (i.e., the effect of Mean length of clauses on L2 writing quality via Lexical decision reaction times) was significant (β = 0.607, p < 0.05). The total effect was also significant (β = 0.363, p < 0.01). However, after the mediator (i.e., Lexical decision reaction times) was considered, the direct effect of Mean length of clauses on L2 writing quality (which was significant when Mean length of clauses alone was included as a predictor of L2 writing quality as one of the mediational assumptions) was no longer significant (β = −0.244, p = 0.143). That is, the results of the mediation analysis indicated that Mean length of clauses did not directly impact L2 writing quality, but had an indirect effect on L2 writing quality via Lexical decision reaction times. These findings suggest that complex syntactic structures (i.e., longer clauses) might have positively impacted human judgments of L2 writing quality when these structures also contained sophisticated lexical items (i.e., words that elicited longer response times).

Comparing the L2 writing model without the mediation analysis (Fig. 2) and the one with the mediation analysis (Fig. 3), the goodness-of-fit statistics indicated that the two models fit the data equally well. However, we chose the one with the mediation analysis as the final model because this model corroborates theoretical accounts of the importance of syntactic complexity in assessing L2 writing (Ortega, 2015; Norris & Ortega, 2009) as well as previous empirical studies which found that syntactic complexity is predictive of L2 writing quality (Crossley & McNamara, 2014; Kyle, 2016; Lu, 2011).
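In lavaan, this kind of latent mediation structure can be expressed with labeled paths and a defined indirect effect. As before, the model reuses the hypothetical variable names from the earlier sketches and is not the authors' published code.

```r
mediation_model <- '
  writing_quality =~ sb_score + ind_score
  lexical         =~ sb_lexdec_rt + ind_lexdec_rt
  syntactic       =~ sb_mlc + ind_mlc
  cohesion        =~ sb_para_overlap + ind_para_overlap

  # a: syntactic complexity -> lexical sophistication (the mediator)
  lexical ~ a * syntactic
  # b: mediator -> writing quality; c: direct effect; cohesion retained as predictor
  writing_quality ~ b * lexical + c * syntactic + cohesion

  indirect := a * b       # effect of clause length via lexical sophistication
  total    := c + a * b
'
# fit_med <- sem(mediation_model, data = essays, std.lv = TRUE, estimator = "MLM")
# parameterEstimates(fit_med, standardized = TRUE)  # inspect indirect / total
```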

4.4. Measurement invariance

Using the final latent variable model with the mediation analysis shown in Fig. 3, measurement invariance tests were conducted for three grouping criteria: writing prompts, gender, and learning contexts (ESL vs. EFL). First, measurement invariance was tested across the two writing prompts (i.e., Form 1 and Form 2), with the final model shown in Fig. 3 as the baseline model. Table 7 shows the fit statistics for the invariance assessment by writing prompt.

Table 7
Fit Statistics for Invariance Assessment Across Different Writing Prompts.

Model            SBχ2     df   ΔSBχ2    Δdf   Δp        CFI     ΔCFI     RMSEA   SRMR
Model 0          40.788   30   –        –     –         0.986   –        0.039   0.035
Model 1          51.632   41   10.844   11    0.447     0.987   0.001    0.033   0.058
Model 2          70.030   45   18.398   4     < 0.001   0.968   −0.019   0.048   0.062
Model 2partial   57.076   43   5.444    2     0.092     0.982   −0.005   0.037   0.059
Model 3          67.634   51   10.558   8     0.227     0.979   −0.003   0.037   0.069


Table 8
Fit Statistics for Invariance Assessment Across Gender.

Model     SBχ2     df   ΔSBχ2   Δdf   Δp      CFI     ΔCFI     RMSEA   SRMR
Model 0   52.161   30   –       –     –       0.971   –        0.060   0.045
Model 1   58.397   41   6.236   11    0.525   0.975   0.004    0.045   0.061
Model 2   58.979   45   0.582   4     0.992   0.980   0.005    0.039   0.061
Model 3   67.599   53   8.620   8     0.342   0.979   −0.001   0.036   0.069

The goodness-of-fit indices showed that configural invariance across Form 1 and Form 2 was supported (see fit statistics for Model 0 in Table 7). Second, given the evidence of configural invariance, metric measurement invariance was supported through the inspection of the goodness-of-fit statistics and the change-in-model-fit statistics (see fit statistics for Model 1 in Table 7). Third, scalar measurement invariance was not supported: Model 2 did not fit the data, and the change-in-model-fit statistics did not meet the cutoff criteria either (see fit statistics for Model 2 in Table 7). Examination of the modification indices revealed that the intercepts of the lexical features (i.e., lexical decision mean response times) differed substantially across the two groups, likely because the two prompts elicited different word use. Thus, the constraints on the intercepts of lexical decision mean response times for both the source-based and independent tasks were freed (i.e., treated as non-invariant). The resulting model supported partial scalar measurement invariance (see fit statistics for Model 2partial in Table 7). Finally, given the evidence of (partial) scalar measurement invariance, strict measurement invariance was supported (see fit statistics for Model 3 in Table 7). These results suggest that, except for the intercepts of lexical decision mean response times for both tasks, the observed variables and the latent variables in the L2 writing quality model were measured with the same level of precision across the two groups that took tests with different prompts.

The same measurement invariance procedure was conducted across male and female test-takers (see Table 8) and across ESL and EFL test-takers (see Table 9). Configural invariance (Model 0) across gender and across learning contexts was supported. The fit indices for the subsequent invariance models (Models 1–3) across gender and learning contexts were also met. Moreover, for the pairwise comparisons of nested models (i.e., Model 1 vs. Model 0, Model 2 vs. Model 1, and Model 3 vs. Model 2) across gender and learning contexts, all of the ΔSBχ2 values were non-significant, and all of the ΔCFI values were greater than −0.01. Thus, measurement invariance across gender and across learning contexts was supported.
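The invariance ladder in Tables 7–9 corresponds to a sequence of increasingly constrained multiple-group models. As a minimal sketch, assuming the lavaan specification above and a hypothetical grouping column prompt in the data (the original analysis may have used different settings):

library(lavaan)

# Model 0 (configural): same structure in both groups, all parameters free
configural <- sem(model, data = writing_data, estimator = "MLM", group = "prompt")

# Model 1 (metric): factor loadings constrained to equality across groups
metric <- sem(model, data = writing_data, estimator = "MLM", group = "prompt",
              group.equal = "loadings")

# Model 2 (scalar): loadings and intercepts constrained
scalar <- sem(model, data = writing_data, estimator = "MLM", group = "prompt",
              group.equal = c("loadings", "intercepts"))

# Model 3 (strict): loadings, intercepts, and residual variances constrained
strict <- sem(model, data = writing_data, estimator = "MLM", group = "prompt",
              group.equal = c("loadings", "intercepts", "residuals"))

# Scaled chi-square difference tests for the nested models
# (Satorra & Bentler, 2001)
anova(configural, metric, scalar, strict)

# Change in CFI between adjacent models; values no worse than -0.01
# are conventionally taken to support invariance
fitMeasures(metric, "cfi") - fitMeasures(configural, "cfi")

# Partial scalar invariance: free the intercepts flagged by modification
# indices (here, the hypothetical lexical decision RT indicators)
scalar_partial <- sem(model, data = writing_data, estimator = "MLM",
                      group = "prompt",
                      group.equal = c("loadings", "intercepts"),
                      group.partial = c("sb_ld_rt ~ 1", "ind_ld_rt ~ 1"))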

5. Discussion

The main purpose of this study was to establish an L2 writing model using holistic scores of source-based and independent writing tasks as well as lexical, syntactic, and cohesive features as found in writing samples, and then to examine whether the established L2 writing model was generalizable across writing prompts, gender, and learning contexts.

5.1. Research question 1: L2 writing quality and lexical, syntactic, and cohesive features

5.1.1. Predicting L2 writing quality

The first research question asked how latent variables of L2 lexical, syntactic, and cohesive features predicted a latent variable of L2 writing quality. Using an SEM analysis, we reported an L2 writing quality model that consisted of a latent variable of L2 writing quality informed by source-based and independent writing scores; a latent variable of lexical sophistication informed by lexical decision mean response times; a latent variable of syntactic complexity informed by mean length of clauses; and a latent variable of cohesion informed by lexical overlap between adjacent paragraphs. In the final model with the mediation analysis, the structural regression analysis indicated that 81.7% of the variance in L2 writing quality was explained by Lexical decision reaction times (β = 0.932, p < 0.01), Lexical overlap between paragraphs (β = 0.434, p < 0.01), and Mean length of clauses via Lexical decision reaction times (β = 0.607, p < 0.05).

Table 9
Fit Statistics for Invariance Assessment Across Learning Contexts.

Model     SBχ2     df   ΔSBχ2    Δdf   Δp      CFI     ΔCFI     RMSEA   SRMR
Model 0   43.456   30   –        –     –       0.983   –        0.043   0.042
Model 1   57.453   41   13.997   11    0.224   0.979   −0.004   0.041   0.066
Model 2   64.475   45   7.022    4     0.127   0.975   −0.004   0.042   0.068
Model 3   64.264   53   −0.211   8     0.909   0.986   0.011    0.030   0.072

These results expand previous research on assessing L2 writing. Unlike previous research that assessed lexical, syntactic, and cohesive features within a single writing task (Crossley & McNamara, 2012, 2014; Crossley et al., 2016a; Lu, 2010) or that examined different writing task types using separate statistical analyses (Guo et al., 2013; Kyle & Crossley, 2016), this study investigated the relationship between L2 writing quality and lexical, syntactic, and cohesive features across two different writing tasks using a single statistical analysis (i.e., SEM). The findings suggest that lexical sophistication (when captured as lexical decision mean response times) is more important than syntactic complexity (when captured as mean length of clauses) and cohesion (when captured as lexical overlap between paragraphs) in assessing L2 writing quality. The findings also indicate that higher-rated essays in both tasks tend to contain more lexically sophisticated words (i.e., words that elicit longer response times), greater global cohesion (i.e., greater lexical overlap between paragraphs; Crossley & McNamara, 2014), and more syntactically complex structures in the form of longer (finite) clauses with more sophisticated words (Lu, 2010; Ortega, 2015). Importantly, these lexical, syntactic, and cohesive latent variables, with measurement error corrected, likely reflect features that L2 writers employed consistently across the two writing tasks and that were consistently indicative of higher-rated L2 writing across the two task types.

In addition, the finding that the latent variable of cohesion (defined here as referential cohesion across paragraphs) was not correlated with the latent variables of lexical sophistication and syntactic complexity suggests that cohesion likely functions as a predictor of essay quality that is independent of lexical and syntactic use. That is, higher-rated essays are related both to L2 writers' advanced language command (i.e., sophisticated words and complex syntactic structures) and to L2 writers' greater use of referential cohesion. From a broader, theoretical perspective of L2 language competence, this finding supports the notion that successful L2 writing across different writing tasks depends on two different competences: grammatical competence (e.g., vocabulary, morphology, and syntax) and textual competence (e.g., cohesion and rhetorical organization; Bachman & Palmer, 1996; Grabe & Kaplan, 1996).

5.1.2. Lexical sophistication and L2 writing quality

In light of the relations between lexical, syntactic, and cohesive features and L2 writing quality evidenced in this study, the findings hold some important implications for assessing these features in L2 writing. First, with respect to lexical sophistication, the lexical decision mean response time norm, which is derived from native speakers' online processing of words in psycholinguistic tasks, was the lexical sophistication measure most strongly correlated with writing scores for both source-based and independent writing. For instance, AoA ratings were less strongly correlated with writing scores than lexical decision mean response times, while word frequency measures were correlated with independent writing scores but not with source-based writing scores. This finding may suggest that L2 writing quality (as reflected in holistic scores in both source-based and independent writing tasks) is more aligned with L1-based psycholinguistic lexical norms than with word frequency and range (i.e., text-based measures of lexical sophistication) or AoA ratings (i.e., subjective measures of lexical sophistication). Given that the TOEFL writing samples used in this study were rated by expert raters (i.e., native or near-native speakers of English; Enright & Tyson, 2008), this finding may indicate similarities between online processing of words in isolation (i.e., lexical decision tasks) and online processing of words at the textual level (i.e., evaluating lexical sophistication in writing samples). That is, words that impose higher cognitive demands (i.e., elicit longer response times) in online lexical processing tasks are also likely to be considered sophisticated words in writing samples regardless of writing task, which in turn may contribute to higher writing scores.
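As an illustration of what a text-level lexical decision reaction time score involves, the sketch below averages word-level reaction time norms of the kind distributed with the English Lexicon Project (Balota et al., 2007) over the words of an essay. The file and column names are hypothetical, and the indices in this study were computed with established tools (see Kyle & Crossley, 2015), which handle tokenization and norm matching in their own ways.

# Hypothetical norms file with columns: word (lowercase), mean_rt (ms)
norms <- read.csv("lexical_decision_norms.csv")

mean_ld_rt <- function(text, norms) {
  # Crude tokenization: split on anything that is not a letter or apostrophe
  tokens <- tolower(unlist(strsplit(text, "[^A-Za-z']+")))
  tokens <- tokens[tokens != ""]
  # Look up each token's mean RT; tokens without norms become NA
  rts <- norms$mean_rt[match(tokens, norms$word)]
  mean(rts, na.rm = TRUE)  # average RT over words with available norms
}

mean_ld_rt("Such affirmation is confirmed by the emerging importance
            of the human resources in enterprises.", norms)

Under this scheme, a text built from words that take longer to recognize in isolation receives a higher score, which is the property the latent lexical sophistication variable exploits.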

5.1.3. Syntactic complexity and L2 writing quality

Next, with reference to syntactic complexity, our finding that mean length of clauses was the most important measure of syntactic complexity in modeling L2 writing quality supports previous studies (Lu, 2011; Yang et al., 2015). For instance, Yang et al. (2015) found that mean length of clauses was the most important predictor of L2 independent writing scores across two different topics. Expanding on previous studies, we also found that longer clauses were predictive of higher-rated essays only when those clauses contained sophisticated lexical items. This finding, together with the strong correlation between the latent variables of lexical sophistication and syntactic complexity, seems to blur a clear distinction between lexical and syntactic features, supporting an interactive relationship between syntax and lexis (Römer, 2009).

To illustrate the relationship between lexical sophistication and syntactic complexity, two sentences found in source-based writing tasks are shown below. Sentence 1 is from a writing sample with the highest score, 5, while Sentence 2 is from a writing sample with a lower score, 2. The two sentences have similar mean lengths of clauses, but they differ in their lexical features: Sentence 1 includes more sophisticated vocabulary, as indicated by a higher lexical decision mean reaction time score, than Sentence 2.

1 Lastly, the third theory demonstrates that birds uses an internal compass to align themselves with the magnetic field of the Earth.

Mean length of clauses: 10.5 (21 words across two finite clauses)
Score of lexical decision mean reaction times: 630.891

2 However, it would be hard for them to see the sun and the stars, if the weather is very cloudy.


Mean length of clauses: 10 (20 words across two finite clauses)
Score of lexical decision mean reaction times: 597.927

Similar examples are found in independent writing tasks. Sentence 3 is from a writing sample with the highest score, 5, while Sentence 4 is from a writing sample with a lower score, 1.5. While the clause in Sentence 3 is similar in length to that in Sentence 4, the former includes more sophisticated words.

3 Such affirmation is confirmed by the emerging importance of the human resources in enterprises.

Length of a clause: 14
Score of lexical decision mean reaction times: 655.676

4 It didn't take so long for her to drop off the school and start over.

Length of a clause: 16
Score of lexical decision mean reaction times: 589.305

Although Sentences 1 and 2, both from source-based writing samples, have clause lengths similar to those of Sentences 3 and 4, both from independent writing samples, the sentences contain substantially different words. Thus, these examples illustrate that clause length by itself may not be a determining factor of higher-quality L2 writing. Rather, as the SEM results showed, longer clauses accompanied by sophisticated lexical items are more closely linked to higher-rated essays for both source-based and independent writing tasks. Importantly, mean length of (finite) clauses is considered a measure of phrasal elaboration because finite clauses become longer through the use of phrases and non-finite clauses (Biber et al., 2016). Thus, our finding is in line with previous studies that emphasized phrasal elaboration in L2 writing (Biber et al., 2011, 2016; Kyle, 2016) and expands on them in that longer clauses containing sophisticated words are more closely linked to higher-quality L2 essays than clause length itself. Furthermore, given that a clause is the smallest grammatical unit that can express a complete proposition (Kroeger, 2005), our findings indicate that the more L2 writers challenge themselves to produce longer clauses with more sophisticated words in an attempt to express elaborated meanings within a single proposition, the more highly their writing is likely to be rated across different tasks.

In addition, we found that measures related to T-units (i.e., complex nominals per T-unit and coordinate phrases per T-unit) were not significantly correlated with either source-based or independent writing scores. This finding supports the idea that T-unit measures (which do not consider specific grammatical features with distinct functions) may not be important predictors of L2 writing (Biber et al., 2011, 2016). We also found that two measures of longer production units (i.e., mean length of sentences and mean length of T-units) were significantly correlated with source-based writing scores but not with independent writing scores, indicating that producing longer sentences and T-units is linked to higher-rated source-based writing samples but not to higher-rated independent writing samples. In contrast, two measures of phrasal elaboration (i.e., complex nominals per clause and coordinate phrases per clause) were significantly correlated with independent writing scores but not with source-based writing scores, supporting links between phrasal elaboration and higher-rated independent writing samples.

The finding that phrasal elaboration is linked to higher ratings of independent writing samples but not to source-based ratings appears to be related to Biber et al. (2016). Biber et al. used the same corpus as this study but used finer-grained grammatical complexity features (e.g., passive voice verbs and finite adverbial clauses) along with a multi-dimensional analysis. Despite differences in linguistic features and statistical analyses between Biber et al. and our study, two main similarities emerged. First, Biber et al. (2016) reported that source-based writing samples were more grammatically complex at the phrasal level than independent writing samples, which can be linked to our finding that source-based writing elicited more complex nominals per clause than independent writing (see Table 4).
Second, in terms of the predictive relationship between syntactic features and writing quality, Biber et al. (2016) reported that grammatical complexity related to 'Literate' versus 'Oral' responses4 was more strongly correlated with independent writing scores (r = 0.39) than with source-based writing scores (r = 0.13). Similarly, our findings indicated that phrasal elaboration measured by complex nominals per clause and coordinate phrases per clause was more strongly correlated with independent writing scores (r = 0.22 and r = 0.19, respectively) than with source-based writing scores (r = 0.11 and r = 0.03, respectively). Thus, the findings of both Biber et al. and this study indicate that while source-based writing tasks may lead test-takers to produce more grammatically complex phrases than independent writing tasks, phrasal elaboration is more closely linked to writing quality for independent essays than for source-based essays. Arguably, in source-based tasks, elements beyond language use itself (e.g., phrasal elaboration), such as the integration of reading and listening passages, may also be crucial in raters' evaluations, resulting in weaker links between phrasal elaboration and source-based writing scores.

4 This oral-literate dimension is one of the main findings from Biber’s multi-dimensional analyses, which distinguish the ‘oral’ features associated with clausal elaboration from the ‘literate’ features associated with phrasal elaboration (Biber et al., 2016). Overall, more complex grammatical features are found in the ‘literate’ dimension.
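As a minimal illustration of the arithmetic behind mean length of clause in Sentences 1–4 above (total words divided by the number of finite clauses), consider the sketch below. Clause counts are supplied by hand here; automated analyzers such as Lu's (2010) identify clauses by syntactic parsing.

mlc <- function(sentence, n_finite_clauses) {
  # Count whitespace-delimited words and divide by the finite clause count
  n_words <- length(unlist(strsplit(sentence, "\\s+")))
  n_words / n_finite_clauses
}

s1 <- "Lastly, the third theory demonstrates that birds uses an internal
       compass to align themselves with the magnetic field of the Earth."
mlc(s1, 2)  # 21 words / 2 finite clauses = 10.5, matching Sentence 1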


5.1.4. Cohesion and L2 writing quality

Finally, with respect to the relations between cohesive features and L2 writing quality, our finding that lexical overlap between paragraphs was an important predictor for both source-based and independent writing tasks corroborates previous studies which found that greater global cohesion at the paragraph level was positively correlated with independent writing scores (Crossley & McNamara, 2014). We expand upon these studies by showing that greater lexical cohesion at the paragraph level was important across different writing tasks. That is, our findings indicate that L2 writers who are able to connect ideas across paragraphs through greater referential cohesion are likely to receive higher scores on both writing tasks.

In addition to the importance of global cohesion across different writing tasks, we found that the relationships between cohesion and L2 writing quality may be nuanced depending on the type of writing task (Richards, 1990). For instance, our findings showed that greater lexical cohesion at the sentence level (i.e., greater local cohesion) was positively correlated with source-based writing scores, but not with independent writing scores, supporting links between local cohesion and higher-rated source-based writing samples (Guo et al., 2013). In terms of the use of connectives, while the use of negative connectives was not correlated with source-based or independent writing scores, the use of positive connectives was negatively correlated with independent writing scores, supporting the notion that local cohesion is not linked to higher-rated independent writing samples (Crossley & McNamara, 2012; Guo et al., 2013).
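To make the cohesion indices concrete, one way a lexical-overlap-between-paragraphs score can be operationalized is sketched below: for each pair of adjacent paragraphs, the proportion of word types in the second paragraph that also appear in the first, averaged over pairs. This is a deliberate simplification under assumed definitions; TAACO (Crossley et al., 2016b), which automates such indices, uses its own operationalizations (e.g., lemma-based or content-word-based overlap).

paragraph_overlap <- function(paragraphs) {
  # Reduce each paragraph to its set of lowercase word types
  types <- lapply(paragraphs, function(p) {
    w <- tolower(unlist(strsplit(p, "[^A-Za-z']+")))
    unique(w[w != ""])
  })
  # Proportion of types in the later paragraph shared with the earlier one
  pair_overlap <- function(a, b) length(intersect(a, b)) / length(b)
  mean(mapply(pair_overlap, types[-length(types)], types[-1]))
}

essay <- c("Birds migrate using several navigation strategies.",
           "One strategy relies on the magnetic field; migrating birds
            align themselves with it.")
paragraph_overlap(essay)  # share of second-paragraph types seen in the first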

5.2. Research question 2: generalizability of the L2 writing quality model

The second research question asked whether the established L2 writing quality model was generalizable across writing prompts, gender, and learning contexts. With respect to writing prompts, the results of the measurement invariance tests supported strict measurement invariance for the model, except for the intercepts of the lexical features (i.e., lexical decision mean response times) for both writing tasks. The finding that the intercepts of the lexical features were not equally measured across the two prompts indicates an effect of writing prompt on lexical choices. In contrast, the intercepts of the syntactic and cohesive features functioned equally across the two prompts, suggesting a lack of prompt effects on the syntactic and cohesive features found in L2 writing. With reference to groups based on gender and learning contexts, full strict measurement invariance was established, such that the L2 writing quality model invariantly measured the indicator variables and the latent variables across male and female test-takers and across ESL and EFL test-takers. In sum, the L2 writing quality model established in this study was partially generalizable across writing prompts and fully generalizable across gender and learning contexts.

6. Conclusion

We created an L2 writing quality model that encompasses source-based and independent writing quality in relation to lexical, syntactic, and cohesive features found in both source-based and independent tasks. Within the model, the latent variables of lexical, syntactic, and cohesive features together explained 81.7% of the variance in the latent variable of L2 writing quality. The model appeared to be generalizable across different writing prompts (with the exception of lexical features), gender, and learning contexts.

Although we found prompt effects on lexical use in L2 writing for both source-based and independent tasks, how these prompt effects specifically operated was not investigated. For instance, the prompt effects on lexical use may be larger in independent writing tasks than in source-based writing tasks. In addition, we included a limited set of test-taker background variables (i.e., gender and learning contexts). Future studies that include a wider range of test-taker backgrounds, such as L2 learning experience, age, and L1, would merit consideration in generating a more robust L2 writing quality model that generalizes across group memberships. Finally, although we found that, in a latent structure, cohesive features were unrelated to lexical and syntactic features, it is not known whether producing a cohesive text is influenced by L1 writing proficiency, L2 language proficiency, or both. Future studies may therefore also consider L1 writing proficiency and general L2 language proficiency in constructing an L2 writing quality model.


Appendix A. Correlations among language features for both source-based and independent writing

            Score    LD       WN       AoA      Range    Freq     MLC      MLS      MLT      CP/T     CP/C     CN/T     CN/C     Ovl(P)   Ovl(S)   Conn(P)  Conn(N)
Score       1        0.346    0.302    0.273    −0.102   −0.137   0.138    0.136    0.159    0.049    0.032    0.129    0.094    0.267    0.138    0.011    0.117
LD          0.504    1        0.671    0.653    −0.309   −0.355   0.208    0.036    0.082    0.031    0.056    0.086    0.170    0.087    0.012    −0.006   0.007
WN          0.380    0.767    1        0.680    −0.260   −0.350   0.196    0.036    0.081    0.045    0.075    0.072    0.127    0.072    −0.001   −0.119   −0.077
AoA         0.395    0.704    0.774    1        −0.376   −0.443   0.192    0.019    0.126    0.086    0.109    0.151    0.197    0.043    −0.038   −0.116   −0.039
Range       −0.230   −0.384   −0.374   −0.369   1        0.837    −0.208   0.216    0.121    0.038    −0.084   −0.051   −0.321   0.142    0.138    −0.043   0.035
Freq        −0.196   −0.318   −0.271   −0.304   0.846    1        −0.070   0.206    0.128    −0.045   −0.132   0.055    −0.097   0.111    0.202    −0.036   0.028
MLC         0.260    0.369    0.437    0.509    −0.139   −0.073   1        0.008    0.158    0.184    0.419    0.196    0.711    −0.052   −0.030   0.067    −0.061
MLS         0.050    0.056    −0.015   −0.005   0.002    0.074    0.007    1        0.854    0.377    0.015    0.691    0.065    0.244    0.798    −0.089   −0.169
MLT         0.079    0.091    0.043    0.087    −0.041   0.049    0.180    0.888    1        0.472    0.097    0.845    0.214    0.187    0.669    −0.116   −0.236
CP/T        0.117    0.179    0.097    0.173    −0.127   −0.034   0.313    0.505    0.552    1        0.857    0.302    0.033    0.098    0.263    −0.155   −0.093
CP/C        0.195    0.294    0.246    0.313    −0.180   −0.100   0.573    0.130    0.194    0.845    1        −0.006   0.157    0.013    −0.030   −0.097   0.007
CN/T        0.097    0.181    0.203    0.220    −0.040   0.059    0.246    0.723    0.858    0.447    0.164    1        0.560    0.104    0.581    −0.109   −0.241
CN/C        0.219    0.378    0.490    0.528    −0.108   −0.022   0.765    0.133    0.309    0.255    0.383    0.614    1        −0.058   0.077    0.041    −0.106
Ovl(P)      0.280    −0.017   −0.013   −0.027   0.042    0.029    −0.066   0.044    0.049    −0.017   −0.069   −0.021   −0.111   1        0.331    −0.007   −0.036
Ovl(S)      0.026    −0.058   −0.124   −0.118   0.120    0.204    −0.076   0.835    0.750    0.378    0.041    0.582    0.014    0.176    1        −0.102   −0.167
Conn(P)     −0.132   −0.204   −0.191   −0.203   0.033    0.025    −0.205   0.054    0.043    −0.100   −0.186   −0.063   −0.249   0.065    0.138    1        −0.014
Conn(N)     0.082    0.013    −0.046   −0.024   −0.090   −0.132   −0.132   −0.085   −0.114   −0.091   −0.090   −0.173   −0.179   0.124    −0.074   0.053    1

Notes: n = 480. Correlations for source-based writing are presented above the diagonal; correlations for independent writing are presented below the diagonal. Because some indices were not normally distributed, Spearman's rank-order correlations (rho) are reported; for this reason, the correlations differ slightly from those reported in Table 4. Correlations greater than 0.7 in absolute value indicate multicollinearity. Abbreviations: LD = Lexical decision mean reaction time; WN = Word naming mean reaction time; AoA = Age of acquisition; Freq = Word frequency; MLC = Mean length of clause; MLS = Mean length of sentence; MLT = Mean length of T-unit; CP/T = Coordinate phrases per T-unit; CP/C = Coordinate phrases per clause; CN/T = Complex nominals per T-unit; CN/C = Complex nominals per clause; Ovl(P) = Lexical overlap between paragraphs; Ovl(S) = Lexical overlap between sentences; Conn(P) = Positive logical connectives; Conn(N) = Negative logical connectives.
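The correlations in Appendix A can be reproduced in principle with base R (R Core Team, 2015). A minimal sketch, assuming a hypothetical data frame features with one column per index:

# Spearman's rho is used because some indices were not normally distributed
rho <- cor(features, method = "spearman", use = "pairwise.complete.obs")

# Flag pairs whose correlation exceeds the multicollinearity threshold
which(abs(rho) > 0.7 & upper.tri(rho), arr.ind = TRUE)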

References

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.
Balota, D., Yap, M. J., Cortese, M. J., Hutchison, K., Kessler, B., Loftis, B., ... Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445–459.
Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173–1182.
Beaujean, A. A. (2014). Latent variable modeling using R: A step-by-step guide. New York: Routledge.
Berger, C. M., Crossley, S., & Kyle, K. (in press). Using native-speaker psycholinguistic norms to predict lexical proficiency and development in second language production. Applied Linguistics. doi: 10.1093/applin/amx005
Bermudez, A. B., & Prater, D. L. (1994). Examining the effects of gender and second language proficiency on Hispanic writers' persuasive discourse. Bilingual Research Journal, 18, 47–62.
Biber, D., Gray, B., & Poonpon, K. (2011). Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Quarterly, 45, 5–35.
Biber, D., Gray, B., & Staples, S. (2016). Predicting patterns of grammatical complexity across language exam task types and proficiency levels. Applied Linguistics, 37(5), 639–668.
Byrne, B. M. (2006). Structural equation modeling with EQS: Basic concepts, applications, and programming (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Chiang, S. (2003). The importance of cohesive conditions to perceptions of writing quality at the early stages of foreign language learning. System, 31, 471–484.
Cho, Y., Rijmen, F., & Novák, J. (2013). Investigating the effects of prompt characteristics on the comparability of TOEFL iBT™ integrated writing tasks. Language Testing, 30(4), 513–534.
Crossley, S. A., & McNamara, D. S. (2012). Predicting second language writing proficiency: The roles of cohesion and linguistic sophistication. Journal of Research in Reading, 35(2), 115–135.
Crossley, S. A., & McNamara, D. S. (2014). Does writing development equal writing quality? A computational investigation of syntactic complexity in L2 learners. Journal of Second Language Writing, 26, 66–79.
Crossley, S. A., Kyle, K., & McNamara, D. S. (2016a). The development and use of cohesive devices in L2 writing and their relations to judgments of essay quality. Journal of Second Language Writing, 32, 1–16.
Crossley, S. A., Kyle, K., & McNamara, D. S. (2016b). The tool for the automatic analysis of text cohesion (TAACO): Automatic assessment of local, global, and text cohesion. Behavior Research Methods, 48(4), 1227–1237.
Cumming, A. (1989). Writing expertise and second language proficiency. Language Learning, 39, 81–141.
Cumming, A. H. (Ed.). (2006). Goals for academic writing: ESL students and their instructors. John Benjamins Publishing.
Cumming, A., Kantor, R., Baba, K., Erdosy, U., Eouanzoui, K., & James, M. (2005). Differences in written discourse in independent and integrated prototype tasks for next generation TOEFL. Assessing Writing, 10(1), 5–43.
Davies, M. (2008). The Corpus of Contemporary American English: 520 million words, 1990–present. Available online at http://corpus.byu.edu/coca/
Dimitrov, D. M. (2010). Testing for factorial invariance in the context of construct validation. Measurement and Evaluation in Counseling and Development, 43, 121–149.
Elder, C., & O'Loughlin, K. (2003). Investigating the relationship between intensive English language study and band score gains on IELTS. In IELTS research reports (Vol. 4, pp. 207–254). Canberra: IELTS Australia.
Enright, M., & Tyson, E. (2008). Validity evidence supporting the interpretation and use of TOEFL iBT scores. TOEFL iBT Research Insight.
George, D., & Mallery, M. (2010). SPSS for Windows step by step: A simple guide and reference, 17.0 update (10th ed.). Boston: Pearson.
Grabe, W., & Kaplan, R. B. (1996). Theory and practice of writing: An applied linguistic perspective. London: Longman.
Guo, L., Crossley, S. A., & McNamara, D. S. (2013). Predicting human judgments of essay quality in both integrated and independent second language writing samples: A comparison study. Assessing Writing, 18(3), 218–238.
Hair, J. F., Black, W. C., Jr., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis. Englewood Cliffs, NJ: Pearson Prentice-Hall.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.
Hayes, J. R. (1996). A new framework for understanding cognition and affect in writing. In C. M. Levy & S. Ransdell (Eds.), The science of writing (pp. 1–27). Mahwah, NJ: Lawrence Erlbaum Associates.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1–55.
Hunt, K. W. (1965). Grammatical structures written at three grade levels. Champaign, IL: National Council of Teachers of English.
Johnson, D. M. (1992). Interpersonal involvement in discourse: Gender variation in L2 writers' complimenting strategies. Journal of Second Language Writing, 1(3), 195–215.
Jöreskog, K. G. (1993). Testing structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 294–316). Newbury Park, CA: Sage.
Jung, Y., Crossley, S. A., & McNamara, D. S. (2015). Linguistic features in MELAB writing performances (Working Paper No. 2015-05). Retrieved from the Cambridge Michigan Language Assessments website: http://www.cambridgemichigan.org/wp-content/uploads/2015/04/CWP-2015-05.pdf
Kenny, D. A., Kashy, D., & Bolger, N. (1998). Data analysis in social psychology. In D. Gilbert, S. Fiske, & G. Lindzey (Eds.), Handbook of social psychology (pp. 233–265). New York: McGraw-Hill.
Kieffer, M. J., & Lesaux, N. K. (2012). Development of morphological awareness and vocabulary knowledge in Spanish-speaking language minority learners: A parallel process latent growth curve model. Applied Psycholinguistics, 33, 23–54.
Kim, M., Crossley, S. A., & Kyle, K. (2018). Lexical sophistication as a multidimensional phenomenon: Relations to second language lexical proficiency, development, and writing quality. The Modern Language Journal, 102(1), 120–141.
Kline, R. B. (2011). Principles and practice of structural equation modeling (3rd ed.). New York: Guilford.
Kroeger, P. R. (2005). Analyzing grammar: An introduction. Cambridge: Cambridge University Press.
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978–990.
Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, 49(4), 757–786.
Kyle, K., & Crossley, S. (2016). The relationship between lexical sophistication and independent and source-based writing. Journal of Second Language Writing, 34, 12–24.
Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication (Doctoral dissertation). Georgia State University. http://scholarworks.gsu.edu/alesl_diss/35
Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics, 16, 307–322.
Leki, I., Cumming, A., & Silva, T. (2008). A synthesis of research on second language writing in English. London: Routledge.
Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474–496.
Lu, X. (2011). A corpus-based evaluation of syntactic complexity measures as indices of college-level ESL writers' language development. TESOL Quarterly, 45, 36–62.
Manchón, R. (Ed.). (2009). Writing in foreign language contexts: Learning, teaching, and research (Vol. 43). Multilingual Matters.
McNamara, D. S., Crossley, S. A., & McCarthy, P. M. (2010). Linguistic features of writing quality. Written Communication, 27(1), 57–86.
Meredith, W. (1993). Measurement invariance: Factor analysis and factorial invariance. Psychometrika, 58, 525–543.
Murray, D. M. (1980). Writing as process: How writing finds its own meaning. In Eight approaches to teaching composition (pp. 3–20).
Norris, J. M., & Ortega, L. (2009). Towards an organic approach to investigating CAF in instructed SLA: The case of complexity. Applied Linguistics, 30, 555–578.
Ortega, L. (2003). Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. Applied Linguistics, 24, 492–518. http://dx.doi.org/10.1093/applin/24.4.492


Ortega, L. (2015). Syntactic complexity in L2 writing: Progress and expansion. Journal of Second Language Writing, 29, 82–94.
Plakans, L. (2008). Comparing composing processes in writing-only and reading-to-write test tasks. Assessing Writing, 13(2), 111–129.
Purves, A. C., Söter, A., Takala, S., & Vähäpassi, A. (1984). Towards a domain-referenced system for classifying composition assignments. Research in the Teaching of English, 18(4), 385–416.
R Core Team (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/
Richards, J. C. (1990). The language teaching matrix. New York: Cambridge University Press.
Römer, U. (2009). The inseparability of lexis and grammar: Corpus linguistic perspectives. Annual Review of Cognitive Linguistics, 7, 141–163.
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis: Applications for developmental research (pp. 399–419). Thousand Oaks, CA: Sage.
Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for moment structure analysis. Psychometrika, 66, 507–514.
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22(1), 1–30.
Schoonen, R., van Gelderen, A., Stoel, R. D., Hulstijn, J., & de Glopper, K. (2011). Modeling the development of L1 and EFL writing proficiency of secondary school students. Language Learning, 61(1), 31–79.
Silva, T. (1993). Toward an understanding of the distinct nature of L2 writing: The ESL research and its implications. TESOL Quarterly, 27(4), 657–677.
Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.
Witte, S. P., & Faigley, L. (1981). Coherence, cohesion, and writing quality. College Composition and Communication, 32(2), 189–204.
Wolfe-Quintero, K., Inagaki, S., & Kim, H.-Y. (1998). Second language development in writing: Measures of fluency, accuracy, and complexity. Honolulu, HI: University of Hawaii Press.
Yang, W., Lu, X., & Weigle, S. C. (2015). Different topics, different discourse: Relationships among writing topic, measures of syntactic complexity, and judgments of writing quality. Journal of Second Language Writing, 28, 53–67.
Yoon, H.-J., & Polio, C. (2017). The linguistic development of students of English as a second language in two written genres. TESOL Quarterly, 51, 275–301.

Minkyung Kim is a PhD candidate in the Department of Applied Linguistics at Georgia State University. Her primary research interests are second language writing and lexical development.

Dr. Scott Crossley is an Associate Professor of Applied Linguistics at Georgia State University. Professor Crossley’s primary research focus is on natural language processing and the application of computational tools and machine learning algorithms in language learning, writing, and text comprehensibility.
