Corpora and Language Pedagogy

How can corpora help in language pedagogy?

Richard Xiao

Abstract

Corpus linguistics as a methodology of linguistic research has gained such prominence over time that corpora have been used extensively in nearly all branches of linguistics. This chapter explores the potential uses of corpus data in one of these areas – language teaching and learning. We will first discuss a wide range of issues related to using corpora in language pedagogy, including reference publishing, syllabus design and materials development, language testing, teacher development, data-driven learning (DDL), teaching language for specific purposes, as well as learner corpus and interlanguage analysis. We will then demonstrate, via a case study of passive constructions in Chinese learner English, how contrastive corpus linguistics can inform second language acquisition research. The chapter concludes by discussing the debate over the relevance of authenticity and frequency of corpora in language education as well as the future of corpus-based language pedagogy.

Key words: corpora, language pedagogy, data-driven learning, learner corpus, contrastive corpus linguistics, interlanguage, second language acquisition

1. Introduction

The corpus-based approach to linguistics and language education has gained prominence over the past four decades, particularly since the mid-1980s. This is because corpus analysis can be illuminating ‘in virtually all branches of linguistics or language learning’ (Leech 1997: 9; cf. also Biber, Conrad and Reppen 1998: 11). One of the strengths of corpus data lies in its empirical nature, which pools together the intuitions of a great number of speakers and makes linguistic analysis more objective (McEnery and Wilson 2001: 103). Unsurprisingly, corpora have been used extensively in nearly all branches of linguistics including, for example, lexicographic and lexical studies, grammatical studies, language variation studies, contrastive and translation studies, diachronic studies, semantics, pragmatics, stylistics, sociolinguistics, discourse analysis, forensic linguistics, and language pedagogy.

Corpora have won widespread popularity over time in spite of the fact that they still occasionally attract hostile criticism (e.g. Widdowson 1990, 2000).

In this chapter, we will not be concerned with the debate over the use of corpus data in linguistic analysis and language education. In our view, such a debate is over a non-issue. Readers interested in the pros and cons of using corpus data should refer to Sinclair (1991), Widdowson (1991, 2000), de Beaugrande (2001) and Stubbs (2001). Robert de Beaugrande’s unpublished paper, ‘Large corpora and applied linguistics: H. G. Widdowson versus J. McH. Sinclair’ (available online at http://www.beaugrande.com/WiddowSincS.htm), provides an excellent summary of the debate between Sinclair and Widdowson, at the Georgetown University Round Table on Languages and Linguistics in 1991, over the use of corpora in language teaching. While Widdowson, Sinclair and de Beaugrande characterize two extreme attitudes towards corpora, there are many milder (positive or negative) reactions to corpus data between the two extremes. Readers can refer to Nelson (2000: section 5.3.3) for a good review. Nor will we discuss the use of corpora in a wide range of language studies. Readers can refer to Hunston (2002) and McEnery, Xiao and Tono (2006) for a further discussion of using corpora in applied linguistics. Instead, this chapter focuses only on using corpora in language pedagogy.

The early 1990s saw an increasing interest in applying the findings of corpus-based research to language pedagogy. The upsurge of interest is evidenced by the eight well-received biennial international conferences on Teaching and Language Corpora (TaLC) held in Lancaster (1994, 1996), Oxford (1998), Graz (2000), Bertinoro (2002), Granada (2004), Paris (2006), and Lisbon (2008). This is also apparent when one looks at the published literature. In addition to a large number of journal articles, well over twenty authored or edited volumes have recently been produced on the topic of teaching and language corpora: Wichmann et al (1997), Partington (1998), Bernardini (2000), Burnard and McEnery (2000), Kettemann and Marko (2002, 2006), Aston (2001), Ghadessy, Henry, and Roseberry (2001), Hunston (2002), Granger et al (2002), Connor and Upton (2002), Tan (2002), Sinclair (2003, 2004), Aston et al (2004), Mishan (2005), Nesselhauf (2005), Römer (2005), Braun, Kohn and Mukherjee (2006), Gavioli (2006), Scott and Tribble (2006), Hidalgo, Quereda and Santana (2007), O’Keeffe, McCarthy and Carter (2007), Aijmer (2009), and Campoy, Gea-Valor and Belles-Fortuno (2010). These works cover a wide range of issues related to using corpora in language pedagogy, e.g. corpus-based language description, corpus analysis in the classroom, and learner corpora (cf. Keck 2004).

In the opening chapter of Teaching and Language Corpora (Wichmann et al 1997), Geoffrey Leech observed that a convergence between teaching and language corpora was apparent. That convergence has three focuses, as noted by Leech (1997): the direct use of corpora in teaching (teaching about, teaching to exploit, and exploiting to teach), the indirect use of corpora in teaching (reference publishing, materials development, and language testing), and further teaching-oriented corpus development (LSP corpora, L1 developmental corpora and L2 learner corpora).

In the remainder of this chapter, we will explore the potential uses of corpora in language pedagogy in line with Leech’s three focuses of convergence (sections 2-4), which is followed by a case study demonstrating how contrastive corpus linguistics can inform second language acquisition research (section 5). The chapter concludes by discussing the debate over the relevance of authenticity and frequency of corpora in language education as well as the future of corpus-based language pedagogy.

2. Indirect use of corpora

The use of corpora in language teaching and learning has been more indirect than direct. This is perhaps because direct use of corpora in language pedagogy is restricted by a number of factors including, for example, the level and experience of learners, time constraints, curricular requirements, the knowledge and skills required of teachers for corpus analysis and result interpretation, and access to resources such as computers and appropriate software tools and corpora, or a combination of these (see section 6 for further discussion). This section explores how corpora have impacted on language pedagogy indirectly.

2.1. Reference publishing

Corpora have revolutionized reference publishing (at least for English), be it a dictionary or reference grammar, in such a way that it is now nearly unheard of for new dictionaries and new editions of old dictionaries published from the 1990s onwards not to be based on corpus data, and ‘even people who have never heard of a corpus are using the product of corpus-based investigation’ (Hunston 2002: 96).

Corpora are useful in several ways for lexicographers. The greatest advantage of using corpora in lexicography lies in their machine-readable nature, which allows dictionary makers to extract all authentic, typical examples of the usage of a lexical item from a large body of text in a few seconds. The second advantage of the corpus-based approach, which is not available when using citation slips, is the frequency information and quantification of collocation which a corpus can readily provide. Some dictionaries, e.g. COBUILD 1995 and Longman 1995, include such frequency information. Frequency data plays an even more important role in the so-called frequency dictionaries, which define core vocabulary to help learners of different modern languages, e.g. Davies (2005) for Spanish, Jones and Tschirner (2005) for German, Davies and de Oliveira Preto-Bay (2007) for Portuguese, Lonsdale and Bras (2009) for French, and Xiao, Rayson and McEnery (2009) for Chinese. Information of this sort is particularly useful for materials writers and language learners alike. A further benefit of using corpora is related to corpus markup and annotation. Many available corpora (e.g. the BNC) are encoded with textual (e.g. register, genre and domain) and sociolinguistic (e.g. user gender and age) metadata which allows lexicographers to give a more accurate description of the usage of a lexical item. Corpus annotations such as part-of-speech tagging and word sense disambiguation also enable a more sensible grouping of polysemous words and homographs. Furthermore, a monitor corpus allows lexicographers to track subtle change in the meaning and usage of a lexical item so as to keep their dictionaries up-to-date. Last but not least, corpus evidence can complement or refute the intuitions of individual lexicographers, which are not always reliable (cf. Sinclair 1991a: 112; Atkins and Levin 1995; Meijs 1996; Murison-Bowie 1996: 184) so that dictionary entries are more accurate. The observations above are in line with Hunston (2002: 96), who summarizes the changes brought about by corpora to dictionaries and other reference books in terms of five ‘emphases’: an emphasis on frequency, an emphasis on collocation and phraseology, an emphasis on variation, an emphasis on lexis in grammar and an emphasis on authenticity.
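The frequency information and collocation counts discussed above can be illustrated with a minimal sketch in Python. It is not the procedure used by any of the dictionaries cited; the function names, the toy text and the simple tokenizer are assumptions made purely for illustration, and the collocation window deliberately ignores sentence boundaries.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (a crude, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs ordered from most to least frequent."""
    return Counter(tokens).most_common()

def collocates(tokens, node, span=2):
    """Count the words found within `span` tokens to the left or right of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Toy text standing in for a corpus (hypothetical data).
sample = "She made strong tea. He likes strong tea. They drink strong coffee and weak tea."
tokens = tokenize(sample)
print(frequency_list(tokens)[:3])
print(collocates(tokens, "strong", span=1).most_common(3))
```

Raw co-occurrence counts of this kind are only a starting point; work on real corpora would normally normalize frequencies by corpus size and apply an association measure before treating a word pair as a collocation.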

It has been noted that non-corpus-based grammars can contain biases while corpora can help to improve grammatical descriptions (McEnery and Xiao 2005). The Longman Grammar of Spoken and Written English (LGSWE, Biber et al 1999) can be considered as a milestone in reference publishing. Based entirely on the 40-million-word Longman Spoken and Written English Corpus, the grammar gives ‘a thorough description of English grammar, which is illustrated throughout with real corpus examples, and which gives equal attention to the ways speakers and writers actually use these linguistic resources’ (Biber et al 1999: 45). The new corpus-based grammar is unique in many different ways, for example, by taking register variation into account and exploring the differences between written and spoken grammars. While lexical information forms, to some extent, an integral part of the grammatical description in Biber et al (1999), it is the Collins COBUILD series (Sinclair 1990, 1992; Francis et al 1996, 1997, 1998) that focuses on lexis in grammatical descriptions (the so-called ‘pattern grammar’, Hunston and Francis 2002). In fact, Sinclair et al (1990) flatly reject the distinction between lexis and grammar. While pattern grammars focusing on the connection between pattern and meaning challenge the traditional distinction between lexis and grammar, they are undoubtedly useful in language learning as they provide ‘a resource for vocabulary building in which the word is treated as part of a phrase rather than in isolation’ (Hunston 2002: 106).

In the dictionary family, perhaps the most important member as far as language pedagogy is concerned is the learner dictionary. Yet corpus-based learner dictionaries have quite a short history. It was only in 1987 that the Collins COBUILD English Dictionary was published as the first ‘fully corpus-based’ dictionary. Yet the impact of this corpus-based dictionary was such that most other publishers in the ELT market followed Collins’ lead. By 1995, the new editions of major learner’s dictionaries such as the Longman Dictionary of Contemporary English (LDOCE, 3rd edition), the Oxford Advanced Learner’s Dictionary (OALD, 5th edition), and a newcomer, the Cambridge International Dictionary of English (CIDE, 1st edition) all claimed to be based on corpus evidence in one way or another.

One of the important features of corpus-based learner dictionaries is their inclusion of quantitative data extracted from a corpus. Another important feature, which is also related to frequency information, is that such dictionaries typically select the vocabulary used from a controlled set when defining the entry for a word. Producing definitions in an L2 that language learners can understand is a problem; language learners may not have a very well developed L2 vocabulary. This makes it necessary and desirable for dictionary makers to limit the vocabulary they use when defining words in a dictionary. Nowadays, most learner dictionary makers prepare a list of defining words, usually ranging from 2,000 to 2,500 words, based on the frequency information extracted from corpora as well as on the lexicographers’ experience of defining words.

As noted earlier, an important use of corpus data for lexicography is in the area of example selection so that nowadays most dictionaries of English use corpora as the source of their examples. In the case of learner dictionaries, however, there was a tradition of using examples invented by lexicographers, rather than authentic materials, in dictionary production, because they believed that foreign language learners have difficulty understanding authentic materials and therefore have to be presented with simple, rewritten examples in which the use of a given word is highlighted to show its syntactic and semantic properties. It was corpus-based learner dictionary work which challenged this received wisdom. The COBUILD project broke with tradition and used authentic data extracted from corpora to produce illustrative examples for a learner dictionary. The use of authentic examples in learner dictionaries is an area where corpus-based learner dictionaries have innovated.

2.2. Syllabus design and materials development

While corpora have been used extensively to provide more accurate descriptions of language use, a number of scholars have also used corpus data directly to look critically at existing TEFL (Teaching English as a Foreign Language) syllabuses and teaching materials. Mindt (1996), for example, finds that the use of grammatical structures in textbooks for teaching English differs considerably from the use of these structures in L1 English. He observes that one common failure of English textbooks is that they teach ‘a kind of school English which does not seem to exist outside the foreign language classroom’ (Mindt 1996: 232). As such, learners often find it difficult to communicate successfully with native speakers. A simple yet important role of corpora in language education is to provide more realistic examples of language usage. In addition, however, corpora may provide data, especially frequency data, which may further alter what is taught. For example, on the basis of a comparison of the frequencies of modal verbs, future time expressions and conditional clauses in corpora and their grading in textbooks used widely in Germany, Mindt (ibid) concludes that one problem with non-corpus-based syllabuses is that the order in which those items are taught in syllabuses ‘very often does not correspond to what one might reasonably expect from corpus data of spoken and written English’, arguing that teaching syllabuses should be based on empirical evidence rather than tradition and intuition with frequency of usage as a guide to priority for teaching (Mindt 1996: 245-246). While frequency is certainly not the only determinant of what to teach and in what order (see section 6 for further discussion), it can indeed help to make learning more effective. For example, McCarthy, McCarten and Sandiford’s (2005-2006) innovative Touchstone book series, which is based on the Cambridge International Corpus, aims to present the vocabulary, grammar, and functions students encounter most often in real life.

Hunston (2002: 189) echoes Mindt, suggesting that ‘the experience of using corpora should lead to rather different views of syllabus design.’ The type of syllabus she discusses extensively is a ‘lexical syllabus’, originally proposed by Sinclair and Renouf (1988), outlined fully by Willis (1990) and embodied in Willis, Willis and Davids’ (1988-1989) three-part Collins COBUILD English Course. According to Sinclair and Renouf (1988: 148), a lexical syllabus would focus on ‘(a) the commonest word forms in a language; (b) the central patterns of usage; (c) the combinations which they usually form.’ While the term may occasionally be misinterpreted to indicate a syllabus consisting solely of vocabulary items, a lexical syllabus actually covers ‘all aspects of language, differing from a conventional syllabus only in that the central concept of organization is lexis’ (Hunston 2002: 189). Sinclair (2000: 191) would say that the grammar covered in a lexical syllabus is ‘lexical grammar’, not ‘lexico-grammar’, which attempts to ‘build a grammar and lexis on an equal basis.’ Indeed, as Murison-Bowie (1996: 185) observes, ‘in using corpora in a teaching context, it is frequently difficult to distinguish what is a lexical investigation and what is a syntactic one. One leads to the other, and this can be used to advantage in a teaching/learning context.’ Sinclair and his colleagues’ proposal for a lexical syllabus is echoed by Lewis (1993, 1997a, 1997b, 2000) who provides strong support for the lexical approach to language teaching.

A focus of the lexical approach to language pedagogy is teaching collocations and the related concept of prefabricated units. There is a consensus that collocational knowledge is important for developing L1/L2 language skills (e.g. Bahns 1993; Zhang 1993; Cowie 1994; Herbst 1996: 389-391; Kita and Ogata 1997: 230-231; Partington 1998: 23-25; Hoey 2000, 2004; Shei and Pain 2000: 167-170; Sripicharn 2000: 169-170; Altenberg and Granger 2001; McEnery and Wilson 2001; McAlpine and Myles 2003: 71-75; Nesselhauf 2003). Hoey (2004), for example, posits that ‘learning a lexical item entails learning what it occurs with and what grammar it tends to have.’ Cowie (1994: 3168) observes that ‘native-like proficiency of a language depends crucially on knowledge of a stock of prefabricated units.’ Aston (1995) also notes that the use of prefabs can speed language processing in both comprehension and production, thus creating native-like fluency. A powerful reason for the employment of collocations, as Partington (1998: 20) suggests, ‘lies in the way it facilitates communication processing on the part of hearer’, because ‘language consisting of a relatively high number of fixed phrases is generally more predictable than that which is not’ while ‘in real time language decoding, hearers need all the help they can get.’ As such, competence in a language undoubtedly seems to involve collocational knowledge (cf. Herbst 1996: 389). Collocational knowledge indicates which lexical items co-occur frequently with others and how they combine within a sentence. Such knowledge is evidently more important than individual words themselves (cf. Kita and Ogata 1997: 230) and is needed for effective sentence generation (cf. Smadja and McKeown 1990). Zhang (1993), for example, finds that more proficient L2 writers use significantly more collocations, more accurately and in more variety than less proficient learners. Collocational error is a common type of error for learners (cf. McAlpine and Myles 2003: 75). Gui and Yang (2002: 48) observe, on the basis of the Chinese Learner English Corpus, that collocation error is one of the major error types for Chinese learners of English. Altenberg and Granger (2001) and Nesselhauf (2003) find that even advanced learners of English have considerable difficulties with collocation. One possible explanation is that learners are deficient in ‘automation of collocations’ (Kjellmer 1991). ‘As a result, learners need detailed information about common collocational patterns and idioms; fixed and semi-fixed lexical expressions and different degrees of variability; relative frequency and currency of particular patterns; and formality level’ (McAlpine and Myles 2003: 75). Corpora are useful in this respect, not only because collocations can only reliably be measured quantitatively, but also because the KWIC (key word in context) view of corpus data exposes learners to a great deal of authentic data in a structured way. Our view is in line with Kennedy (2003), who discusses the relationship between corpus data and the nature of language learning, focusing on the teaching of collocations. The author argues that second or foreign language learning is a process of learning ‘explicit knowledge’ with awareness, which requires a great deal of exposure to language data.
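To make the KWIC view mentioned above more concrete, the following is a minimal sketch of a concordancer in Python. It is an illustrative assumption rather than the behaviour of any particular concordancing package: the function name `kwic`, the `width` parameter and the sample text are all hypothetical, and the text is assumed to be a single string without line breaks.

```python
import re

def kwic(text, keyword, width=30):
    """Return concordance lines with each whole-word match of `keyword` aligned in one column."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]   # context to the left of the node word
        right = text[m.end():m.end() + width]               # context to the right of the node word
        # Right-align the left context so that every keyword starts at the same column.
        lines.append(f"{left:>{width}} {m.group(0)} {right:<{width}}")
    return lines

# Hypothetical learner-oriented sample text.
sample = ("Learners often make mistakes. Teachers make an effort to help, "
          "and students make progress over time.")
for line in kwic(sample, "make", width=20):
    print(line)
```

Aligning the node word in a single column is what lets learners scan its left and right contexts vertically and notice recurrent partners such as ‘make mistakes’ or ‘make progress’.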

In addition to the lexical focus, corpus-based teaching materials try to demonstrate how the target language is actually used in different contexts, as exemplified in Biber et al’s (2002) Longman Student Grammar of Spoken and Written English, which pays special attention to how English is used differently in various spoken and written registers.

2.3. Language testing

Another emerging area of language pedagogy which has started to use the corpus-based approach is language testing. Alderson (1996) envisaged the possible uses of corpora in this area: test construction, compilation and selection, test presentation, response capture, test scoring, and calculation and delivery of results. He concludes that ‘[t]he potential advantages of basing our tests on real language data, of making data-based judgments about candidates’ abilities, knowledge and performance are clear enough. A crucial question is whether the possible advantages are born out in practice’ (Alderson 1996: 258-259). The concern raised in Alderson’s conclusion appears to have been addressed satisfactorily. Choi, Kim and Boo (2003), for example, find that computer-based tests are comparable to paper-based tests. A number of corpus-based studies of language testing have been reported. For example, Coniam (1997) demonstrated how to use word frequency data extracted from corpora to generate cloze tests automatically. Kaszubski and Wojnowska (2003) presented a corpus-driven program for building sentence-based ELT exercises – TestBuilder. The program can process part-of-speech tagged corpora as well as raw corpora, which are tagged on the fly by a built-in part-of-speech tagger, and uses them as input for test material selection. Indeed, corpora have recently been used by major providers of test services for a number of purposes: 1) as an archive of examination scripts; 2) to develop test materials; 3) to optimize test procedures; 4) to improve the quality of test marking; 5) to validate tests; and 6) to standardize tests (cf. Ball 2001; Hunston 2002: 205). For example, the University of Cambridge Local Examinations Syndicate (UCLES) is active in both corpus development (e.g. Cambridge Learner Corpus, Cambridge Corpus of Spoken English, Business English Text Corpus and Corpus YLE Speaking Tests) and the analysis of native English corpora and learner corpora. At UCLES, native English corpora such as the British National Corpus (BNC) are used ‘to investigate collocations, authentic stems and appropriate distractors which enable item writers to base their examination tasks on real texts’ (Ball 2001: 7); the corpus-based approach is used to explore ‘the distinguishing features in the writing performance of EFL/ESL learners or users taking the Cambridge English examinations’ and how to incorporate these into ‘a single scale of bands, that is, a common scale, describing different levels of L2 writing proficiency’ (Hawkey 2001: 9); corpora are also used for the purpose of speaking assessment (Ball and Wilson 2002; Taylor 2003) and to develop domain-specific (e.g. business English) wordlists for use in test materials (Ball 2002; Horner and Strutt 2004).

2.4. Teacher development

For learners to benefit from the use of corpora, language teachers must first of all be equipped with a sound knowledge of the corpus-based approach. It is unsurprising to discover then that corpora have been used in training language teachers (e.g. Allan 1999, 2002; Conrad 1999; Seidlhofer 2000, 2002; O’Keeffe and Farr 2003). Allan (1999), for example, demonstrates how to use corpus data to raise the language awareness of English teachers in Hong Kong secondary schools. Conrad (1999) presents a corpus-based study of linking adverbials (e.g. therefore and in other words), on the basis of which she suggests that it is important that a language teacher do more than use classroom concordancing and lexical or lexico-grammatical analyses if language teaching is to take full advantage of the corpus-based approach.

Conrad’s concern with teacher education is echoed by O’Keeffe and Farr (2003), who argue that corpus linguistics should be included in initial language teacher education so as to enhance teachers’ research skills and language awareness.

3. Direct use of corpora

While indirect uses such as syllabus design and materials development are closely associated with what to teach, corpora have also provided valuable insights into how to teach. Of Leech’s (1997) three focuses, direct uses of corpora include ‘teaching about’, ‘teaching to exploit’, and ‘exploiting to teach’, with the latter two relating to how to use corpora. Given a number of restricting factors as noted in section 2, direct uses have so far been confined largely to learning at more advanced levels, for example, in tertiary education, whereas in general English language teaching (not to mention other foreign languages), especially in secondary education (see Braun 2007 for a rare example of an empirical study of using corpora in secondary education), the direct use of corpora is ‘still conspicuously absent’ (Kaltenböck and Mehlmauer-Larcher 2005).

‘Teaching about’ means teaching corpus linguistics as an academic subject like other sub-disciplines of linguistics such as syntax and pragmatics. Corpus linguistics has now found its way into the curricula for linguistic and language related degree programmes at both postgraduate and undergraduate levels. ‘Teaching to exploit’ means providing students with ‘hands-on’ know-how, as emphasized in McEnery, Xiao and Tono (2006), so that they can exploit corpora for their own purposes. Once the student has acquired the necessary knowledge and techniques of corpus-based language study, learning activity may become student centred. ‘Exploiting to teach’ means using a corpus-based approach to teaching language and linguistics courses (e.g. sociolinguistics and discourse analysis), which would otherwise be taught using non-corpus-based methods.

If the focuses of ‘teaching about’ and ‘exploiting to teach’ are viewed as being associated typically with students of linguistics and language programmes, ‘teaching to exploit’ relates to students of all subjects which involve language study and learning, who are expected to benefit from the so-called data-driven learning (DDL) or ‘discovery learning’.

The issue of how to use corpora in the language classroom has been discussed extensively in the literature. With the corpus-based approach to language pedagogy, the traditional ‘three P’s’ (Presentation – Practice – Production) approach to teaching may not be entirely suitable. Instead, the more exploratory approach of ‘three I’s’ (Illustration – Interaction – Induction) may be more appropriate, where ‘illustration’ means looking at real data, ‘interaction’ means discussing and sharing opinions and observations, and ‘induction’ means making one’s own rule for a particular feature, which ‘will be refined and honed as more and more data is encountered’ (see Carter and McCarthy 1995: 155). This progressive induction approach is what Murison-Bowie (1996: 191) would call the interlanguage approach: namely, partial and incomplete generalizations are drawn from limited data as a stage on the way towards a fully satisfactory rule. While the ‘three I’s’ approach was originally proposed by Carter and McCarthy (1995) to teach spoken grammar, it may also apply to language education as a whole, in our view.

It is clear that the teaching approach focusing on ‘three I’s’ is in line with Johns’ (1991) concept of ‘data-driven learning (DDL)’. Johns was perhaps among the first to realize the potential of corpora for language learners (e.g. Higgins and Johns 1984). In his opinion, ‘research is too serious to be left to the researchers’ (Johns 1991: 2). As such, he argues that the language learner should be encouraged to become ‘a research worker whose learning needs to be driven by access to linguistic data’ (ibid). Johns’ web-based Kibbitzer (www.eisu2.bham.ac.uk/johnstf/timeap3.htm) gives some very good examples of data-driven learning.

Data-driven learning can be either teacher-directed or learner-led (i.e. discovery learning) to suit the needs of learners at different levels, but it is basically learner-centred. This autonomous learning process ‘gives the student the realistic expectation of breaking new ground as a “researcher”, doing something which is a unique and individual contribution’ (Leech 1997: 10). It is important to note, however, that the key to successful data-driven learning, even if it is student-centred, is the appropriate level of teacher guidance or mediation depending on the learners’ age, experience, and proficiency level, because ‘a corpus is not a simple object, and it is just as easy to derive nonsensical conclusions from the evidence as insightful ones’ (Sinclair 2004: 2). In this sense, it is even more important for language teachers to be equipped with the necessary training in corpus analysis (cf. section 6).

Johns (1991) identifies three stages of inductive reasoning with corpora in the DDL approach: observation (of concordanced evidence), classification (of salient features) and generalization (of rules). The three stages roughly correspond to Carter and McCarthy’s (1995) ‘three I’s’. The DDL approach is fundamentally different from the ‘three P’s’ approach in that the former is bottom-up induction whereas the latter is top-down deduction. The direct use of corpora and concordancing in the language classroom has been discussed extensively in the literature (e.g. Tribble 1991, 1997a, 1997b, 2000, 2003; Tribble and Jones 1990, 1997; Flowerdew 1993; Karpati 1995; Kettemann 1995, 1996; Wichmann 1995; Woolls 1998; Aston 2001; Osborne 2001; Braun 2007), covering a wide range of issues including, for example, underlying theories, methods and techniques, and problems and solutions.

4. Teaching-oriented corpora

Teaching-oriented corpora are particularly useful in teaching languages for specific purposes (LSP corpora) and in research on L1 (developmental corpora) and L2 (learner corpora) language acquisition. Such corpora can be used directly or indirectly in language pedagogy as discussed in previous sections.

4.1. Languages for specific purposes and professional communication

In addition to teaching English as a second or foreign language in general, a great deal of attention has been paid to domain-specific language use and professional communication (e.g. English for specific purposes and English for academic purposes). For example, Thurstun and Candlin (1997, 1998) explore the use of concordancing in teaching writing and vocabulary in academic English. Hyland (1999) compares the features of the specific genres of metadiscourse in introductory course books and research articles on the basis of a corpus consisting of extracts from 21 university textbooks for different disciplines and a similar corpus of research articles. Upton and Connor (2001) undertake a ‘moves analysis’ in business English using a business learner corpus. The authors approach the cultural aspect of professional communication by comparing the ‘politeness strategies’ used by learners from different cultural backgrounds. Thompson and Tribble (2001) examine citation practices in academic text. Koester (2002) argues, on the basis of an analysis of the performance of speech acts in workshop conversations, for a discourse approach to teaching communicative functions in spoken English. Yang and Allison (2003) study the organizational structure in research articles in applied linguistics. Carter and McCarthy (2004) explore, on the basis of the CANCODE corpus, a range of social contexts in which creative uses of language are manifested. Hinkel (2004) compares the use of tense, aspect and the passive in L1 and L2 academic texts. Xiao (2003) reviews a number of case studies using domain-specialized multilingual corpora to teach domain-specific translation. Studies such as these demonstrate that LSP corpora are particularly useful in teaching language for specific purposes and professional communication.

4.2. Learner corpora and interlanguage analysis

Two kinds of corpora that emerged in the 1990s have not only greatly contributed to the vitality of corpus linguistics but have also revived contrastive analysis and

interlanguage research. They are learner corpora and multilingual corpora. This section discusses learner corpora while the topic of multilingual corpora will be taken up for further discussion in section 5.1.

The creation and use of learner corpora in language pedagogy and interlanguage research has been welcomed as one of the most exciting recent developments in corpus-based language studies. If native speaker corpora of the target language provide a top-down approach to using corpora in language pedagogy, learner corpora provide a bottom-up approach to language teaching (Osborne 2002).

A learner corpus, as opposed to a “developmental corpus” composed of data produced by children acquiring their mother tongue (L1), comprises written or spoken data produced by language learners who are acquiring a second or foreign language. Data of this type has been particularly useful in language pedagogy and second language acquisition (SLA) research, as demonstrated by the fruitful learner corpus studies published over the past decade (see Pravec 2002; Keck 2004; and Myles 2005 for recent reviews). SLA research is primarily concerned with ‘the mental representations and developmental processes which shape and constrain second language (L2) productions’ (Myles 2005: 374). Language acquisition occurs in the mind of the learner, which cannot be observed directly and must be studied from a psychological perspective. Nevertheless, if learner performance data is shaped and constrained by such a mental process, it at least provides indirect, observable, and empirical evidence for the language acquisition process. Note that using product as evidence for process is not necessarily less reliable; sometimes this is the only practical way of finding out about process. Stubbs (2001) draws a parallel between corpora in corpus linguistics and

rocks in geology, ‘which both assume a relation between process and product. By and large, the processes are invisible, and must be inferred from the products.’ Like geologists who study rocks because they are interested in geological processes to which they do not have direct access, SLA researchers can analyze learner performance data to infer the inaccessible mental process of second language acquisition. Learner corpora can also serve as an empirical basis for testing hypotheses generated using the psycholinguistic approach, and for generalising findings previously made on the basis of limited data from a small number of informants. Additionally, learner corpora have widened the scope of SLA research so that, for example, interlanguage research nowadays treats learner performance data in its own right rather than as decontextualised errors in traditional error analysis (cf.

Granger 1998: 6).

At the pre-conference workshop on learner corpora affiliated to the International

Symposium of Corpus Linguistics 2003 held at the University of Lancaster, the workshop organizers Yukio Tono and Fanny Meunier observed that learner corpora are no longer in their infancy but are going through their nominal teenage years – they are full of promise but not yet fully developed. In language pedagogy, the implications of learner corpora have been explored for curriculum design, materials development and teaching methodology (cf. Keck 2004: 99). The interface between

L1 and L2 materials has been explored. Meunier (2002), for example, argues that frequency information obtained from native speaker corpora alone is not sufficient to inform curriculum and materials design. Rather, ‘it is important to strike a balance between frequency, difficulty and pedagogical relevance. That is exactly where learner corpus research comes into play to help weigh the importance of each of

these’ (Meunier 2002: 123). Meunier also advocates the use of learner data in the classroom, suggesting that exercises such as comparing learner and native speaker data and analyzing errors in learner language will help students to notice gaps between their interlanguage and the language they are learning. Interlanguage studies based on learner corpora which have been undertaken so far focus on what Granger

(2002) calls ‘Contrastive Interlanguage Analysis (CIA)’, which compares learner data and native speaker data, or language produced by learners from different L1 backgrounds. The first type of comparison typically aims to identify underuse or overuse of particular linguistic features in learner language while the second type aims to uncover L1 interference or transfer. In addition to CIA, learner corpora have also been used to investigate the order of acquisition of particular morphemes. Readers can refer to Granger et al. (2002) for recent work in the use of learner corpora, and read Granger

(2003) for a more general discussion of the applications of learner corpora such as the

International Corpus of Learner English (ICLE).

In addition to SLA research, learner corpora can also be used directly in classroom teaching. For example, Seidlhofer (2002) and Mukherjee and Rohrbach (2006) demonstrate how a ‘local learner corpus’ containing students’ own writings can be used directly for learning by coping with students’ questions about their own or classmates’ writings, or analyzing and correcting errors in such familiar writings.

We have so far discussed how corpora, including teaching-oriented corpora such as

LSP corpora and learner corpora, can be used directly or indirectly in language pedagogy. The section that follows seeks to demonstrate the predictive and diagnostic power of the integrated approach that combines contrastive corpus linguistics with

interlanguage analysis in second language acquisition research as advocated in Römer

(2008), via a case study of passive constructions in Chinese learner English.

5. Using contrastive corpus linguistics to inform SLA research

In this section, we will first clarify the type of corpora used in contrastive corpus linguistics, which will be followed by a summary of the findings from a published contrastive study of passive constructions in English and Chinese based on comparable corpora of the two languages (Xiao, McEnery and Qian 2006). These findings will in turn be used to predict and diagnose the performance of Chinese learners of English in their use of English passives as mirrored in a sizeable Chinese learner English corpus in comparison with a comparable native English corpus.

5.1. Contrastive corpus linguistics

As noted in section 4.2, multilingual corpora have been an important development in corpus research since the 1990s. A multilingual corpus involves two or more languages. Data contained in this kind of corpora can be either source texts in one language plus their translations in another language or other languages, or texts collected from different native languages using comparable sampling techniques to achieve similar coverage and balance. The two types of multilingual corpora are usually referred to as parallel corpora and comparable corpora respectively and used in translation and contrastive studies.

Contrastive studies can be theoretically oriented or geared towards applied research.

Theoretical contrastive studies are language independent and primarily concerned with how a universal category is realised in two or more different languages, whilst applied

contrastive studies are preoccupied with how a common category in one language is realised in another language. In its early stage, contrastive linguistics was predominantly theoretical, though the applied aspect was not totally neglected.

Theoretically oriented contrastive studies were continued from the late 1920s all the way into the 1960s by the Prague School. On the other hand, WWII aroused great interest in foreign language teaching in the United States, and contrastive studies were recognised as an important part of foreign language teaching methodology (cf. Fries

1945; Lado 1957). As a means of ‘predicting and/or explaining difficulties of second language learners with a particular mother tongue in learning a particular target language’ (Johansson 2003), applied contrastive studies were dominant throughout the 1960s. However, it was soon realised that language learning could not be accounted for by cross-linguistic contrast alone (see Sajavaara 1996 for a discussion of some problems with contrastive linguistics), and as a result contrastive studies lost ground to more learner-oriented approaches such as error analysis, performance analysis and interlanguage analysis (cf. Johansson 2003). The revival of contrastive studies in the 1990s has largely been attributed to the corpus methodology and the availability of multilingual corpora (cf. Granger 1996: 37; Salkie 1999; Johansson

2003).

What kind of corpora can be used in contrastive analysis? To answer this question, we first need a general idea of the purposes of multilingual corpora of various kinds.

While multilingual corpora, and especially comparable corpora, are designed and created with the explicit aim of cross-linguistic contrast, all corpora have ‘always

been pre-eminently suited for comparative studies’ (Aarts 1998: i). For example, the four English corpora of the Brown family (i.e. Brown, LOB, Frown and FLOB; see Xiao

2008: 395-297 for a comparison of these corpora) were created for synchronic and diachronic comparisons of English as used in Britain and the US in the early 1960s and the early 1990s, while the Lancaster Corpus of Mandarin Chinese (LCMC) was designed as a Chinese match for FLOB and Frown to facilitate cross-linguistic contrasts of English and Chinese (McEnery, Xiao and Mo 2003). The International

Corpus of English (ICE) project has used a common corpus design and the same sampling criteria for each of its components to ensure their comparability (Nelson

1996); similarly, the International Corpus of Learner English (ICLE) is designed in such a way that the subcorpora for learners of different L1 backgrounds are comparable (Granger 1998). Even a corpus like the British National Corpus (BNC), which was designed to be representative of modern British English (Aston and

Burnard 1998), also provides a useful basis for various intra-lingual comparisons (e.g. genre-based variations and variations caused by sociolinguistic variables). Clearly, corpora are intrinsically comparative, and so is the corpus linguistics methodology.

For example, collocations are extracted using statistical measures that compare the probabilities of co-occurring words within a specified window span of the node word; keywords are identified by comparing the target corpus with a reference corpus; what

Granger (1998: 12) referred to as Contrastive Interlanguage Analysis (CIA) is also mainly concerned with comparison, e.g. comparing interlanguage with target native language, and comparing different interlanguages (in terms of L1 background, age, proficiency level, task type, learning setting, and medium etc). In short, it can be said that the whole corpus research enterprise is based on comparison, for example, by comparing the same linguistic feature in different corpora, comparing different

linguistic features in the same corpus, and comparing what is observed and what is expected.

While corpus linguistics is clearly comparative in nature, the technical terms for corpora used in linguistic comparison are somewhat confusing, with the controversy revolving around the issue of whether a parallel corpus should be a corpus composed of source texts plus translations, or a corpus containing native language data collected using comparable sampling criteria. As we have argued elsewhere (McEnery et al

2006: 47), a parallel corpus is composed of source texts and their translations, whilst a comparable corpus contains L1 texts sampled from different languages which are comparable in sampling criteria. A translation corpus, instead of referring to what is actually a parallel corpus as suggested in the literature, comprises translated texts for use in studies of translational language (e.g. the Translational English Corpus).

Corpora which are designed primarily for intra-lingual comparison or for comparing different varieties of the same language (e.g. the ICE) are comparative corpora.

Having clarified the terminology, it is appropriate to discuss what types of corpora are to be used in cross-linguistic contrasts. This is in fact an issue which is as debatable as the terminological issue. It has been argued that parallel corpora provide a sound basis for contrastive analysis, as demonstrated in the claims that ‘translation equivalence is the best available basis of comparison’ (James 1980: 178), and that

‘studies based on real translations are the only sound method for contrastive analysis’

(Santos 1996: i). However, as has been widely observed (Baker 1993: 243-5;

Hartmann 1995; Gellerstam 1996; Teubert 1996: 247; Laviosa 1997: 315; McEnery and Wilson 2001: 71-72; McEnery and Xiao 2002, Xiao and Yue 2009; Xiao, He and

Yue forthcoming), translational language is ‘an unrepresentative special variant of the target language’ which is perceptibly influenced by the source language (McEnery,

Xiao and Tono 2006: 93). The source texts and translations in a parallel corpus are certainly comparable in terms of sampling criteria such as genre (in fact, sampling applies only to the selection of source texts and is not applied again to translations), but this comparability is immediately undermined by so-called ‘translationese’ in translated texts. For example, Laviosa (1998) finds that translational language has four core patterns of lexical use: a relatively lower proportion of lexical words over function words, a relatively higher proportion of high-frequency words over low-frequency words, a relatively greater repetition of the most frequent words, and less variety in the words that are most frequently used. Beyond the lexical level, translational language is characterised by normalization, simplification (Baker 1993), explicitation

(i.e. increased cohesion, Øverås 1998), and sanitization (i.e. reduced connotational meanings, Kenny 1998). In addition to these common features of translational language, Granger (1996) has noted some similarity between translationese and what she calls ‘learnerese’: ‘Both are situated somewhere between L1 and L2 and are likely to contain examples of transfer’, and both ‘give evidence of what Gellerstam (1986:

94) calls “syntactic fingerprints”’ (Granger 1996: 48).

As observations resulting from parallel corpus analysis usually invite ‘further research with monolingual corpora in both languages’ (Mauranen 2002: 182), parallel corpora can be a useful starting point of contrastive analysis. Nevertheless, it is also clear from the discussion above that while they are ideal resources for translation studies (see

McEnery and Xiao 2007 for further discussion), parallel corpora provide a poor basis for cross-linguistic contrasts if relied upon alone.

In the section that follows, we will present the findings of a contrastive study of passive constructions in English and Chinese on the basis of comparable written and spoken corpora of the two languages, which will be used to predict and diagnose what is observed in Chinese learner English.

5.2. Passive constructions in English and Chinese

This section summarises the results of a contrastive corpus analysis of passive constructions on the basis of comparable corpora of English and Chinese, which was published in Xiao, McEnery and Qian (2006). The primary corpus resources used in that study included FLOB for written English and LCMC for written Chinese, together with spoken corpora composed of transcripts for casual conversations in the two languages. In addition, two spoken corpora with sampling periods similar to those of FLOB and LCMC were used to compare speech and writing. For English we used the demographically sampled component of the British National Corpus, amounting to approximately four million words of conversational data sampled during 1985-1994.

For Chinese we used the Callhome Mandarin Chinese Transcript corpus, which contains 120 transcripts of telephone conversations amounting to roughly 300,000 words (see McEnery and Xiao 2008).

Our corpus-based contrastive study yields a number of interesting findings. Below we will only give a summary of the results that are most relevant to our discussion of the performance of Chinese learners of English in the following section.

Firstly, passive constructions are nearly ten times as frequent in English as in Chinese, with normalised frequencies of 1,026 and 110 instances per million words for the two languages respectively. There are a number of reasons for this contrast. First, be-passives can be used for both stative and dynamic situations whereas Chinese passives can only occur in dynamic events; second, Chinese passives usually have a negative pragmatic meaning while English passives (especially be-passives) do not; third,

English has a tendency to overuse passives, especially in formal writing whereas

Chinese tends to avoid syntactic passives wherever possible; fourth, Chinese has a number of linguistic devices other than the syntactically marked passive constructions to express a passive meaning, e.g. notional passives, lexical passives, topic sentences, subjectless sentences, sentences with vague subjects (e.g. youren ‘someone’, renmen ‘people’, dajia ‘all’), and special structures such as the disposal ba construction and the predicative shi…de structure. Finally, syntactically unmarked notional passives are more common in Chinese than in English because English is a subject-oriented language whereas Chinese is topic-oriented. Given that Chinese passives are much more restricted in scope of use, their low frequency in relation to their English counterparts is unsurprising. It can be predicted from this sharp contrast in frequency of use that Chinese learners of English are very likely to underuse passives in their interlanguage.

Secondly, passives are formed by an auxiliary (be, get) followed by a past participial verb in English whilst in Chinese they can be marked syntactically by passive markers such as bei, indicated lexically by verbs with an inherent passive meaning (e.g. zao

‘suffer’), or simply expressed by unmarked notional passives or special sentence structures. Unlike English, which inflects the passivised verb morphologically,

Chinese is non-inflectional, which means that the same verb form is used for both active and passive voices in Chinese. Also because of the non-inflectional Chinese morphology, the concept of auxiliary is less salient or useful in Chinese. These cross-linguistic differences seem to suggest that the choice of correct auxiliaries as well as proper inflectional forms for passivised verbs can constitute a difficult area for

Chinese learners to acquire English passives.

Thirdly, short passives (i.e. passives without a by-phrase introducing an agent) are typical of English, accounting for over 90% of total occurrences in both speech and writing. Short passives are predominant in English simply because passives are often used in English as a strategy that allows one to avoid mentioning the agent when it cannot or must not be mentioned, while they are also used for stylistic and coherence purposes (see Granger 1976 and 1983 for further discussion of uses of passives). In contrast, three out of five syntactic passive markers in Chinese (wei…suo, jiao and rang) only occur in long passives (i.e. passives with an explicit agent). For the two remaining passive markers bei and gei, which allow both long and short passives, the proportions of short passives (60.7% and 57.5% respectively) are significantly lower than that for English passives. Early Chinese grammarians (e.g. Wang 1984; Lü and

Zhu 1979) noted that an agent must normally be spelt out in passive constructions, though this constraint has become more relaxed nowadays. When it is difficult to spell out the agent, passives are used in English, but an alternative device mentioned in the preceding paragraph is often used in Chinese instead of using passives. This finding can lead one to expect more long passives in the interlanguage of Chinese learners of

English.

[Figure 1: stacked bar chart (y-axis: Percent; x-axis: Language). English be passives: positive 4.7%, neutral 80.3%, negative 15.0%. Chinese bei passives: positive 10.7%, neutral 37.8%, negative 51.5%.]

Figure 1. Pragmatic meanings of be and bei passives

Finally, a major distinction between passives in English and Chinese is that Chinese passives are more frequently used with an inflictive meaning than their English counterparts. With the exception of the archaic passive form wei…suo, over half of syntactically marked passives in Chinese occur in adversative situations, a proportion considerably higher than that for English passives (see Figure 1). As the prototypical passive marker bei was derived from a verb with an inflictive meaning (i.e. ‘suffer’),

Chinese passives were used at early stages primarily for unpleasant or undesirable events. While this semantic constraint on the use of passives has become more relaxed, especially in written Chinese, under the influence of western languages, disyllabic words made up of bei and a single character verb as used in modern

Chinese typically refer to something undesirable, as in beibu ‘be arrested’, beifu ‘be captured’, beigao ‘the accused’, beihai ‘be a victim’ and beipo ‘be forced’. In contrast, marking negative pragmatic meanings is not a basic feature of English passives, though get-passives often refer to undesirable events. An essential difference between English and Chinese passives lies in how much negativity is coded in them,

which predicts that Chinese learners of English will use passives more frequently for

undesirable situations.

In the next section, we will analyze the use of passives in a Chinese learner English

corpus to ascertain how reliably the findings of our contrastive study as summarized

in this section can predict and diagnose learner behaviour in interlanguage.

5.3. Passive constructions in Chinese learner English

This section examines be passives in Chinese learner English. The corpus used is the

Chinese Learner English Corpus (CLEC), which contains one million words of essays

written by Chinese learners at five proficiency levels: high school students (ST2),

junior and senior non-English majors (ST3 and ST4), and junior and senior English

majors (ST5 and ST6). The five types of learners are equally represented in the

corpus. The corpus is fully annotated with learner errors using an error tagset that

consists of 61 error types clustered in 11 categories (see Gui and Yang 2002). In order

to compare Chinese learners’ interlanguage with native English, the Louvain Corpus

of Native English Essays (LOCNESS) is used as the control data, which is composed

of argumentative essays written by native British and American students on a great

variety of topics, totalling approximately 300,000 words (cf. Granger and Tyson

1996).

Table 1. Passives in CLEC and LOCNESS

Corpus     Words       Passives   Per million words   LL score
CLEC       1,070,602   9,711      907                 1235.6 (p<0.001)
LOCNESS    324,304     5,465      1,685

A comparison of CLEC and LOCNESS shows that in relation to native English writing, Chinese learners of English significantly underuse passives in their interlanguage. Table 1 gives the raw frequencies of passive constructions in the two corpora as well as the frequencies normalised to a common base of one million words.

As can be seen, passives are nearly twice as frequent in native English as in Chinese learner English. The log-likelihood test (LL) indicates that this difference is statistically significant (LL=1235.6 for 1 degree of freedom, p<0.001). The significant underuse of passives in Chinese learner English is hardly surprising in light of the marked contrast in frequencies for passives in English and Chinese as noted in section

5.2. Granger (1996: 46) also expected French learners of English to underuse passives in their writing as it was noted that passives were twice as frequent in English as in

French (see Granger 1976), but she did not verify this prediction against French learner English data. While Chinese learners’ underuse of passives as mirrored in the

CLEC corpus is very likely to be caused by the influence of their native language, more cross-linguistic contrasts and interlanguage studies involving learners from other L1 backgrounds are required before we can be more confident that underuse of passives is the result of L1 transfer rather than a common feature of interlanguages, irrespective of the learner’s mother tongue, which would mean that learners underuse passives for developmental reasons. As Granger (2007) observes, while native

English speakers mainly use the verb discuss in the passive, ‘learners show a predilection for active structures with first person subjects.’
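As a rough illustration of the normalisation and the log-likelihood comparison reported in Table 1, the following Python sketch recomputes both from the raw counts. It assumes the widely used two-corpus log-likelihood formula based on expected frequencies; the chapter does not say which calculator was used, so the function names and the choice of formula are assumptions, and the resulting score need not match the reported 1235.6 exactly. Note also that the raw counts reproduce the tabulated figures of 907 and 1,685 on a base of 100,000 words, so the base is left as a parameter.

```python
import math


def normalised_frequency(count, corpus_size, base=100_000):
    """Return the number of occurrences per `base` words."""
    return count / corpus_size * base


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood for one feature (assumed formula).

    Expected counts are obtained by sharing the pooled count between the
    two corpora in proportion to their sizes.
    """
    pooled = count_a + count_b
    expected_a = size_a * pooled / (size_a + size_b)
    expected_b = size_b * pooled / (size_a + size_b)
    total = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total


# Raw counts and corpus sizes as given in Table 1: (passives, words).
clec = (9711, 1_070_602)
locness = (5465, 324_304)

rate_clec = normalised_frequency(*clec)
rate_locness = normalised_frequency(*locness)
ratio = rate_locness / rate_clec
ll = log_likelihood(*clec, *locness)

print(f"CLEC: {rate_clec:.0f}  LOCNESS: {rate_locness:.0f}  ratio: {ratio:.2f}")
print(f"log-likelihood: {ll:.1f}")
```

On the Table 1 counts the ratio of relative frequencies comes out a little under two, in line with ‘nearly twice as frequent’, and the log-likelihood value lies far above 10.83, the critical value for p<0.001 at 1 degree of freedom.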

[Figure 2: stacked bar chart of the proportions of short and long passives (y-axis: Percent) in CLEC and LOCNESS (x-axis: Corpus).]

Figure 2. Long and short passives in CLEC and LOCNESS

The results of the contrastive analysis in section 5.2 predicted that Chinese learners would use long passives more frequently than native English speakers. Figure 2 shows the proportions of long and short passives in CLEC and LOCNESS. It can be seen that in comparison with native English writings, long passives are indeed slightly more frequent in Chinese learner English (9.14% and 8.44% for CLEC and

LOCNESS respectively), though this difference is marginal and not statistically significant (LL=2.18 for 1 degree of freedom, p=0.139).

[Figure 3: stacked bar chart (y-axis: Percent; x-axis: Corpus). CLEC: positive 5.9%, neutral 68.4%, negative 25.7%. LOCNESS: positive 4.4%, neutral 78.8%, negative 16.8%.]

Figure 3. Pragmatic meanings of passives in CLEC and LOCNESS

It was noted earlier that over 50% of passives in Chinese express an inflictive meaning whereas the corresponding figure for be passives in English is merely 15%.

Such a contrast would reasonably lead one to expect more negative cases in Chinese learner English than in native English. This expectation is in fact supported by evidence from CLEC and LOCNESS. Figure 3 shows that 25.7% of passives in the

Chinese learner English data are negative whilst negative cases account for 16.8% in native English writings. The log-likelihood test indicates that the differences between

CLEC and LOCNESS in the three meaning categories are statistically significant

(LL=7.4 for 2 degrees of freedom, p=0.025). A comparison of Figures 1 and 3 suggests that the proportions for the three meaning categories for the two types of native English data (i.e. general English and students’ essays) are very close to each other. In contrast, the proportions in Chinese learner English shift away from those for

L1 Chinese and move closer to the proportions for L2 English. Given that interlanguage is ‘situated somewhere between L1 and L2’ (Granger 1996: 48), this movement is only reasonable and as expected.
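The claim that the learner proportions sit between those for L1 Chinese and native English can be checked directly against the percentages shown in Figures 1 and 3. The Python sketch below is only an illustration: the `relative_position` measure (0 at the L1 Chinese value, 1 at the native English value) is a hypothetical device introduced here, not part of the original study, and LOCNESS is assumed to stand for the L2 target.

```python
# Percentages read from Figure 1 (Chinese bei passives) and Figure 3
# (CLEC and LOCNESS).
l1_chinese = {"positive": 10.7, "neutral": 37.8, "negative": 51.5}
learner_clec = {"positive": 5.9, "neutral": 68.4, "negative": 25.7}
native_locness = {"positive": 4.4, "neutral": 78.8, "negative": 16.8}


def lies_between(value, end_a, end_b):
    """True if `value` falls within the closed interval spanned by the ends."""
    return min(end_a, end_b) <= value <= max(end_a, end_b)


def relative_position(value, l1_value, l2_value):
    """Hypothetical measure: 0.0 at the L1 value, 1.0 at the native L2 value."""
    return (value - l1_value) / (l2_value - l1_value)


positions = {}
for category in ("positive", "neutral", "negative"):
    l1 = l1_chinese[category]
    l2 = native_locness[category]
    learner = learner_clec[category]
    positions[category] = relative_position(learner, l1, l2)
    print(category, lies_between(learner, l1, l2), round(positions[category], 2))
```

For all three meaning categories the learner value falls between the two endpoints and lies nearer the native English end, which is what the description of interlanguage as ‘situated somewhere between L1 and L2’ would lead one to expect.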

An inspection of the specific errors related to the use of passive constructions in

CLEC also demonstrates the value of contrastive corpus linguistics in SLA research.

There are mainly four types of passive-related learner errors: underuse, misuse, misformation, and auxiliary errors. It can be considered as an advantage of the corpus-based approach to be able to view underuse or overuse of a linguistic feature in interlanguage as a type of learner error, as this was not possible in traditional error analysis without corpus data. Misuse of passives means that learners use passive constructions where they are not supposed to use them. Misformation errors are

associated with morphological inflections, while auxiliary errors relate to omission and misuse of auxiliaries in passive constructions.

[Figure 4: chart of error frequency per 200,000 words (y-axis) by learner level ST2, ST3, ST4, ST5, ST6 (x-axis), with series for auxiliary errors, misformation, misuse, underuse, and all error types.]

Figure 4. Passive-related errors in Chinese learner English

Figure 4 charts the distribution of four types of errors, as well as all error types as a whole, across learner proficiency levels. Unsurprisingly, when all error types are taken together, learners at higher levels generally make fewer errors related to passives. Of the four types of learner errors, underuse is the most important type, followed by misuse and misformation errors. Auxiliary errors are uncommon for learner groups other than the lowest level ST2 (i.e. high school students). It is also clear from the figure that learning curves are not straight lines. There can be relapses in the language acquisition process, especially for difficult items.

It is of interest to note that while error types are associated with learner levels when the dataset is taken as a whole (LL=51.77 for 12 degrees of freedom, p<0.001), similar learner groups show similar error types. This means that the differences between the two non-English-major learner groups (i.e. ST3 and ST4), and between the two English-major learner groups (i.e. ST5 and ST6) are not statistically

significant, as indicated in Table 2. The table gives the log-likelihood test scores and

probability values (3 degrees of freedom for all pairs of data), with significant

differences highlighted. Hence, Chinese learners can be divided into three broad

groups in terms of their acquisition of English passives: ST2 – ST3/ST4 – ST5/ST6.

Table 2. Association between error types and learner levels

From   To    LL score (3 d.f.)   P value
ST2    ST3   27.303              <0.001
ST3    ST4   6.955               0.073
ST4    ST5   18.563              <0.001
ST5    ST6   6.987               0.072
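The probability values in Table 2 follow from referring each log-likelihood score to a chi-square distribution with the stated degrees of freedom. The sketch below shows one way to do this using only the Python standard library; the function name `chi2_sf` and the use of the closed-form series for integer degrees of freedom are choices made here for illustration, not a description of the software used in the study.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2))
    #         + exp(-x/2) * sum_{k=1}^{(df-1)/2} (x/2)^(k - 1/2) / Gamma(k + 1/2)
    series = sum(
        half ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * series


# Adjacent learner levels and their LL scores from Table 2 (3 d.f. each).
table2 = [("ST2", "ST3", 27.303), ("ST3", "ST4", 6.955),
          ("ST4", "ST5", 18.563), ("ST5", "ST6", 6.987)]

for level_from, level_to, ll_score in table2:
    p = chi2_sf(ll_score, 3)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{level_from}-{level_to}: p = {p:.3f} ({verdict})")
```

With a 0.05 threshold, only the ST2-ST3 and ST4-ST5 transitions come out as significant, which is the basis for the three-way grouping ST2, ST3/ST4 and ST5/ST6.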

While we cannot be conclusive about whether the underuse of passives by Chinese

learners of English is a result of L1 transfer or a stage of the developmental path,

errors of this type in our learner data typically occur with verbs whose Chinese

equivalents are not normally used in passive constructions, as shown in (1).

(1) a. A birthday party will hold in Lily’s house. (ST2)

b. The woman in white called Anne Catherick. (ST5)

(2) a. The supper had done. (ST2)

b. wanfan zuo-hao le

supper cook-ready ASP

The supper is ready.

Underuse errors also occur under the influence of topic sentences in Chinese, as

exemplified in (2a), which is expressed in Chinese as (2b). The Chinese example in

(2b) is an instance of topic sentence, which is very common in this language. Here

wanfan ‘supper’ in the subject position is the topic and zuo-hao le ‘cook-ready ASP’

is the comment. Sentences like this cannot be used in the passive felicitously (e.g.

*wanfan bei zuo-hao le).

Misuse errors are mostly found in three contexts. Firstly, they occur when intransitive verbs are passivised (e.g. 3); secondly, errors of this type are related to the misuse of ergative verbs (e.g. 4); and finally, misuse errors can be a result of training transfer, i.e. excessive passive training in classroom instruction, as shown in (5). In sentences like these, the passivised verb is followed by an object, yet Chinese learners have been taught that passive transformation involves moving the object to the subject position. This can be taken as a symptom of the overdone passive training in English classrooms in China.

(3) a. A very unhappy thing was happened in this week. (ST2)

b. I was graduated from Zhongshan University. (ST5)

(4) a. the secince is developed quickly (ST4)

b. infant mortality was declined (ST4)

(5) a. Because they have been mastered everything of this job (ST4)

b. many machine and appliance are used electricity as power (ST5)

Misformation errors are a result of L1 interference. As noted in section 5.2, passivised verbs do not inflect in Chinese. Consequently, Chinese learners of English tend to use uninflected verbs or misspelled past participles in passive constructions, as exemplified in (6).

(6) a. His relatives can not stop him, because his choice is protect by the

laws. (ST6)

b. Since the People’s Republic of china was found on

October 1949, great changes <…> (ST2)

(7) a. In China, since the new China established, people’s life has goten <gotten>

better and better. (ST3)

b. I am not a smoker, but why do we forced to be a second-hand smoker?

(ST5)

Auxiliary errors, the final type of passive errors in our annotation scheme, are also the result of L1 interference. We noted earlier that while passives in Chinese can be marked syntactically, lexical passives, unmarked notional passives and topic sentences that express a passive meaning are abundant. As such, it is hardly surprising that Chinese learners of English tend to omit or misuse auxiliaries, as shown in (7).

The discussion in this section suggests that the performance of Chinese learners of

English in their use of English passives is closely linked to their native language, and most of the passive-related errors in their interlanguage can be accounted for from the perspective of contrastive corpus linguistics. In the following section, we will discuss the implications of this study for SLA research.

5.4. Modelling contrastive interlanguage analysis

We hope that the case study has demonstrated the predictive and explanatory power of contrastive corpus linguistics in SLA research. Combining contrastive analysis

(CA) and contrastive interlanguage analysis (CIA) is undoubtedly a fruitful direction to pursue in SLA research. This is not a new idea. As early as a decade ago, Granger

(1996: 46) proposed an ‘integrated contrastive model’:

The model involves constant to-ing and fro-ing between CA and CIA. CA

data helps analysts to formulate predictions about interlanguage which can

be checked against CIA data. […] Conversely, CIA results can only be

reliably interpreted as being evidence of transfer if supported by clear CA

descriptions.

Just as CIA has contributed significantly to SLA research by enabling and foregrounding many areas of investigation which have traditionally been impossible or marginalized (e.g. quantitatively distinctive features of interlanguage such as overuse and underuse, the potential effects of learner parameters on interlanguage), the integrated approach that combines CA and CIA will be an indispensable tool in

SLA research, because ‘if we want to be able to make firm pronouncements about transfer-related phenomena, it is essential to combine CA and CIA approaches’

(Granger 1998: 14).

This emerging and promising area of research has recently become popular. For example, Gilquin (2001) demonstrates, on the basis of a case study of causative constructions in English and French, how the integrated contrastive model can help explain some of the characteristics of learners’ interlanguage and thus throw new light on the key notion of transfer, which turns out to be a more complex phenomenon than has traditionally been assumed. Similarly, Borin and Prütz (2004) use the integrated contrastive approach to explore L1 syntactic interference in advanced Swedish learner

English by investigating part-of-speech sequences. The increasing interest in the integrated approach is also demonstrated by the specialised workshop ‘Linking up

Contrastive and Learner Corpus Research’, which was affiliated to the 4th

International Contrastive Linguistics Conference.

We entirely agree with Granger (1996, 1998) that a combination of corpus-based contrastive study and interlanguage analysis can provide insights into language acquisition research, but we take a different view of the role of parallel corpora (or

‘translation corpora’ in her words) in cross-linguistic contrasts, for the reasons outlined earlier in section 5.1. While Granger (1996: 38, 48) is fully aware of the drawback of using translated texts in contrastive analysis, her examples are largely based on data of this kind. In our revised CIA model, therefore, contrastive corpus linguistics interacts with interlanguage analysis on the basis of comparable native language corpora as illustrated in Figure 5.

Figure 5. A revised model of contrastive interlanguage analysis

It is true that using a bidirectional parallel corpus can average out, to some extent at least, the undesirable effects of translationese on contrastive analysis. To achieve this aim, however, the same sampling criteria must apply to the selection of source texts in both languages, because any mismatch of proportion, genre, or domain, for example,

may invalidate the findings derived from such a corpus (McEnery, Xiao and Tono

2006: 93). A well-matched bidirectional parallel corpus is in fact a mixture of parallel corpus and comparable corpus, which can become a bridge that brings translation and contrastive studies together. Yet the ideal bidirectional parallel-comparable corpus will often not be easy, or even possible, to build because of the heterogeneous pattern of translation between languages and genres. This is especially true if the corpus aims to achieve sufficient coverage and balance to produce convincing findings (McEnery and Xiao 2007). Hence, in our approach, comparable native language data is preferred in contrastive corpus linguistics. Other kinds of corpora for comparative studies such as parallel corpora, translational corpora, and comparative corpora are best suited for their own different purposes. Nevertheless, in spite of some difference in data type used, there has been increasing consensus that contrastive corpus linguistics has something to deliver in second language acquisition research.

6. Conclusions

Before we close the discussion of using corpora in language pedagogy, it is appropriate to address some objections to the use of corpora in language learning and teaching. While frequency and authenticity are often considered two of the most important advantages of using corpora, they are also the locus of criticism from language pedagogy researchers. For example, Cook (1998: 61) argues that corpus data impoverishes language learning by giving undue prominence to what is simply frequent at the expense of rarer but more effective or salient expressions. Widdowson

(1990, 2000) argues that corpus data is authentic only in a very limited sense in that it is de-contextualized (i.e. traces of texts rather than discourse) and must be re-contextualized in language teaching. It can also be argued that:

on the contrary, using corpus data not only increases the chances of learners

being confronted with relatively infrequent instances of language use, but

also of their being able to see in what way such uses are atypical, in what

contexts they do appear, and how they fit in with the pattern of more

prototypical uses. (Osborne 2001: 486)

This view is echoed by Goethals (2003: 424), who argues that ‘frequency ranking will be a parameter for sequencing and grading learning materials’ because ‘frequency is a measure of probability of usefulness’ and ‘high-frequency words constitute a core vocabulary that is useful above the incidental choice of text of one teacher or textbook author.’ Hunston (2002: 194-195) observes that ‘items which are important though infrequent seem to be those that echo texts which have a high cultural value’, though in many cases ‘cultural salience is not clearly at odds with frequency.’ While frequency information is readily available from corpora, no corpus linguist has ever argued that the most frequent is most important. On the contrary, Kennedy (1998:

290) argues that frequency ‘should be only one of the criteria used to influence instruction’ and that ‘the facts about language and language use which emerge from corpus analyses should never be allowed to become a burden for pedagogy’. As such, raw frequency data is often adjusted for use in a syllabus, as reported in Renouf

(1987: 168). It would be inappropriate, therefore, for language teachers, syllabus designers, and materials writers to ignore ‘compelling frequency evidence already available’, as pointed out by Leech (1997: 16), who argues that:

Whatever the imperfections of the simple equation ‘most frequent’ = ‘most

important to learn’, it is difficult to deny that frequency information

becoming available from corpora has an important empirical input to

language learning materials.

Kaltenböck and Mehlmauer-Larcher (2005: 78) downplay the role of frequency in language learning, arguing that ‘what is frequent in language will be picked up by learners automatically, precisely because it is frequent, and therefore does not have to be consciously learned.’ This is not true, however. Determiners such as a and the are certainly very frequent in English, yet they are difficult for Chinese learners of

English because their mother tongue does not have such grammatical morphemes and does not maintain a count-mass noun distinction.

Clearly, frequency is not ‘automatically pedagogically useful’ (Kaltenböck and

Mehlmauer-Larcher 2005: 78); decisions relating to teaching must also take account of overall teaching objectives, learners’ concrete situations, cognitive salience, learnability, generative value and of course teachers’ intuitions (cf. Kaltenböck and

Mehlmauer-Larcher 2005: 78). However, frequency can at least help syllabus designers, materials writers and teachers alike to make better-informed and more carefully motivated decisions (cf. Gavioli and Aston 2001: 239).

If we leave objections to frequency data to one side, Widdowson (1990, 2000) also questions the use of authentic texts in language teaching. In his opinion, authenticity of language in the classroom is ‘an illusion’ (1990: 44) because even though corpus data may be authentic in one sense, its authenticity of purpose is destroyed by its use with an unintended audience of language learners (see Murison-Bowie 1996: 189).

Widdowson (2003: 93) makes a distinction between ‘genuineness’ and ‘authenticity’, which are claimed to be the features of text as a product and discourse as a process respectively: corpora are genuine in that they comprise attested language use, but they

are not authentic for language teaching because they have been deprived of their contexts (as opposed to co-texts). We will not engage in the debate here, but would like to draw readers’ attention to Stubbs’ (2001) metaphor of product versus process as cited in section 4.2. The implication of Widdowson’s argument is that only language produced for imaginary situations in the classroom is ‘authentic’. Even if we do follow Widdowson’s genuineness-authenticity distinction, it is not clear why such imaginary situations are authentic, because authenticity, as opposed to genuineness, would mean a real communicative context. Situations conjured up for classroom teaching obviously do not take place in real communicative contexts; how, then, can they be authentic if we choose to keep this distinction? When students learn and practise a shopping ‘discourse’, they are by no means actually shopping! Furthermore, as argued by Fox (1987), invented examples often do not reflect nuances of usage. That is perhaps why, as Mindt (1996: 232) observes, students who have been taught

‘school English’ cannot readily cope with English used by native speakers in real life.

As such, Wichmann (1997: xvi) argues that in language teaching, ‘the preference for

“authentic” texts requires both learners and teachers to cope with language which the textbooks do not predict.’

The discussions in sections 2-4 suggest that corpora appear to have played a more important role in helping to decide what to teach (indirect uses) than how to teach

(direct uses). While indirect uses of corpora seem to be well established, direct uses of corpora in teaching are largely confined to advanced levels like higher education.

Corpus-based learning activities are nearly absent from general TEFL classes at lower levels like secondary education. Of the various causes for this absence mentioned earlier, perhaps the most important are access to appropriate corpus resources and

the necessary training of teachers, which we view as priorities for corpus linguists if corpora are to be popularised in general language teaching contexts.

While there are a wide range of existing corpora that are publicly available (see Xiao

2008 for a recent survey), the majority of those resources have been developed ‘as tools for linguistic research and not with pedagogical goals in mind’ (Braun 2007). As

Cook (1998: 57) suggests, ‘the leap from linguistics to pedagogy is […] far from straightforward.’ To bridge the gap between corpora and language pedagogy, the first step would involve creating corpora that are pedagogically motivated, in both design and contents, to meet pedagogical needs and curricular requirements so that corpus-based learning activities become an integral part, rather than an additional option, of the overall language curriculum. Such pedagogically motivated corpora ‘should not only be more coherent than traditional corpora; they should, as far as possible, also be complementary to school curricula, to facilitate both the contextualisation process and the practical problems of integration’ (Braun 2007: 310). The design of such corpus-based learning activities must also take account of learners’ age, experience and level as well as their integration into the overall curriculum.

Given the situation of learners (e.g. their age, level of language competence, level of expert knowledge, and attitude towards learning autonomy) in general language education in relation to advanced learners in tertiary education, even such pedagogically motivated corpus study activities must be mediated by teachers. This in turn raises the issue of the current state of teachers’ knowledge and skills of corpus analysis and data interpretation, which is another practical problem that has prevented direct use of corpora in language pedagogy. As Kaltenböck and Mehlmauer-Larcher

(2005: 81) argue, ‘mediation by the teacher is a necessary prerequisite for successful application of computer corpora in language teaching and should therefore be given sufficient attention in teacher education courses’ (cf. also O’Keeffe and Farr 2003).

However, as the integration of corpus studies into language teacher training is only a quite recent phenomenon (cf. Chambers 2007), ‘it will therefore at least take more time, and perhaps a new generation of teachers, for corpora to find their way into the language classroom’ (Braun 2007: 308).

In conclusion, it is our view that corpora will not only revolutionize the teaching of subjects such as grammar in the 21st century (see Conrad 2000) but will also fundamentally change the ways we approach language education, including both what is taught and how it is taught. As Gavioli and Aston (2001) argue, corpora should be viewed not only as resources which help teachers to decide what to teach but also as resources from which learners may learn directly.

References:

Aarts, J. (1998) ‘Introduction’. In S. Johansson and S. Oksefjell (eds.) Corpora and

Cross-linguistic Research. Amsterdam: Rodopi. ix-xiv.

Aijmer, K. (2009) Corpora and Language Teaching. Amsterdam: John Benjamins.

Alderson, C. (1996) ‘Do corpora have a role in language assessment?’ in J. Thomas

and M. Short (eds.) Using Corpora for Language Research, pp. 248-259. London:

Longman.

Allan, Q. (1999) ‘Enhancing the language awareness of Hong Kong teachers through

corpus data’. Journal of Technology and Teacher Education 7/1: 57-74.

Allan, Q. (2002) ‘The TELEC secondary learner corpus: a resource for teacher

development’ in S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer Learner

Corpora, Second Language Acquisition and Foreign Language Teaching, pp.

195–212. Philadelphia: John Benjamins.

Altenberg, B. and Granger, S. (2001) ‘The grammatical and lexical patterning of

MAKE in native and non-native student writing.’ Applied Linguistics 22/2: 173-

95.

Aston, G. (1995) ‘Corpora in language pedagogy: matching theory and practice’ in G.

Cook and B. Seidlhofer (eds.) Principle and Practice in Applied Linguistics:

Studies in Honour of H. G. Widdowson. Oxford: Oxford University Press.

Aston, G. (ed.) (2001) Learning with Corpora. Houston, TX: Athelstan.

Aston, G., Bernardini, S. and Stewart, D. (eds.) (2004) Corpora and Language

Learners. Amsterdam: John Benjamins.

Aston, G. and Burnard, L. (1998) The BNC Handbook: Exploring the British National

Corpus with SARA. Edinburgh: Edinburgh University Press.

Bahns, J. (1993) ‘Lexical collocations: a contrastive view’. ELT Journal 47/1: 56-63.

Baker, M. (1993) ‘Corpus linguistics and translation studies: implications and

applications’. In M. Baker, G. Francis and E. Tognini-Bonelli (eds.) Text and

Technology: in Honour of John Sinclair. Amsterdam: Benjamins. 233-352.

Ball, F. (2001) ‘Using corpora in language testing’. Research Notes 6: 6-8.

Ball, F. (2002) ‘Developing wordlists for BEC’. Research Notes 8: 10-13.

Ball, F. and Wilson, J. (2002) ‘Research projects relating to YLE Speaking Tests’.

Research Notes 7: 8-10.

Bernardini, S. (2000) Competence, Capacity, Corpora: A Study in Corpus-aided

Language Learning. Bologna: CLUEB.

Biber, D., Conrad, S. and Reppen, R. (1998) Corpus Linguistics: Investigating

Language Structure and Use. Cambridge: Cambridge University Press.

Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) Longman

Grammar of Spoken and Written English. London: Longman.

Biber, D., Leech, G. and Conrad, S. (2002) Longman Student Grammar of Spoken

and Written English. London: Longman.

Borin, L. and Prütz, K. (2004) ‘New wine in old skins? A corpus investigation of L1

syntactic transfer in learner language’. In G. Aston, S. Bernardini and D. Stewart

(eds.) Corpora and Language Learners. Amsterdam: John Benjamins. 67–87.

Braun, S. (2007) ‘Integrating corpus work into secondary education: From data-driven

learning to needs-driven corpora’. ReCALL 19(3): 307-328.

Braun, S., Kohn, K. and Mukherjee, J. (eds.) (2006) Corpus Technology and

Language Pedagogy. Frankfurt: Peter Lang.

Burnard, L. and McEnery, A. (eds.) (2000) Rethinking Language Pedagogy from a

Corpus Perspective. New York: Peter Lang.

Campoy, M., Gea-Valor, M. and Belles-Fortuno, B. (2010) Corpus-based

Approaches to English Language Teaching. London: Continuum.

Carter, R. and McCarthy, M. (1995) ‘Grammar and the spoken language’. Applied

Linguistics 16/2: 141-158.

Carter, R. and McCarthy, M. (2004) ‘Talking, creating: interactional language,

creativity, and context’. Applied Linguistics 25/1: 62-88.

Chambers, A. (2007) ‘Popularising corpus consultation by language learners and

teachers’. In E. Hidalgo, L. Quereda, and J. Santana (eds) Corpora in the Foreign

Language Classroom: Selected Papers from the Sixth International Conference

on Teaching and Language Corpora (TaLC 6), pp. 3–16. Amsterdam: Rodopi.

Choi, I., Kim, K. and Boo, J. (2003) ‘Comparability of a paper-based language test

and a computer-based language test’. Language Testing 20/3: 295–320.

Coniam, D. (1997) ‘A preliminary inquiry into using corpus word frequency data in

the automatic generation of English language cloze tests’. CALICO Journal 16/2-

4: 15-33.

Connor, U. and Upton, T. (eds) (2002) Applied Corpus Linguistics: A

Multidimensional Perspective. Amsterdam: Rodopi.

Conrad, S. (1999) ‘The importance of corpus-based research for language teachers’.

System 27: 1-18.

Conrad, S. (2000) ‘Will corpus linguistics revolutionize grammar teaching in the 21st

century?’. TESOL Quarterly 34: 548–60.

Cook, G. (1998) ‘The uses of reality: a reply to Ronald Carter.’ ELT Journal 52/1: 57-64.

Cowie, A. (1994) ‘Phraseology’ in R. Asher (ed.) The Encyclopaedia of Language

and Linguistics Vol. 6, pp. 3168-3171. Oxford: Pergamon Press Ltd.

de Beaugrande, R. (2001) ‘Interpreting the discourse of H. G. Widdowson: a

corpus-based critical discourse analysis’. Applied Linguistics 22/1: 104-121.

Flowerdew, J. (1993) ‘Concordancing as a tool in course design’. System 21/3: 231-

243.

Fox, G. (1987) ‘The case for examples’ in J. Sinclair (ed.) Looking Up: An Account of

the COBUILD Project, pp. 137-149. London: HarperCollins.

Francis, G., Hunston, S. and Manning, E. (1996) Collins COBUILD Grammar Patterns

1: Verbs. London: HarperCollins.

Francis, G., Hunston, S. and Manning, E. (1998) Collins COBUILD Grammar Patterns

2: Nouns and Adjectives. London: HarperCollins.

Fries, C. (1945) Teaching and Learning English as a Foreign Language. Ann Arbor:

University of Michigan Press.

Gavioli, L. (2006) Exploring Corpora for ESP Learning. Amsterdam: John

Benjamins.

Gavioli, L. and Aston, G. (2001) ‘Enriching reality: language corpora in language

pedagogy’. ELT Journal 55/3: 238-246.

Gellerstam, M. (1986) ‘Translationese in Swedish novels translated from English’. In

L. Wollin and H. Lindquist (eds.) Translation Studies in Scandinavia. Lund:

CWK Gleerup. 88-95.

Gellerstam, M. (1996) ‘Translations as a source for cross-linguistic studies’. In K.

Aijmer, B. Altenberg and M. Johansson (eds.) Language in Contrast. Lund: Lund

University Press. 53-62.

Ghadessy, M., Henry, A. and Roseberry, R. (eds.) (2001) Small Corpus Studies and

ELT: Theory and Practice. Amsterdam: John Benjamins.

Gilquin, G. (2001) ‘The integrated contrastive model. Spicing up your data’.

Languages in Contrast 3(1): 95–123.

Goethals, M. (2003) ‘E.E.T.: the European English Teaching vocabulary-list’ in B.

Lewandowska-Tomaszczyk (ed.) Practical Applications in Language and

Computers, pp. 417-427. Frankfurt: Peter Lang.

Granger, S. (1976) ‘Why the passive?’. In J. Van Roey (ed.) English-French

Contrastive Analyses. Leuven: Acco. 23-57.

Granger, S. (1983) The Be + Past Participle Construction in Spoken English with

Special Emphasis on the Passive. Amsterdam: North-Holland.

Granger, S. (1996) ‘From CA to CIA and back: An integrated approach to

computerised bilingual and learner corpora’. In K. Aijmer, B. Altenberg and M.

Johansson (eds.) Language in Contrast. Lund: Lund University Press. 37-51.

Granger, S. (1998) ‘The computer learner corpus: a versatile new source of data for

SLA research’. In S. Granger (ed.) Learner English on Computer. London:

Longman. 3-18.

Granger, S. (2002) ‘A bird’s-eye view of learner corpus research’ in S. Granger, J.

Hung and S. Petch-Tyson (eds.) Computer Learner Corpora, Second Language

Acquisition and Foreign Language Teaching, pp. 3–33. Philadelphia: John

Benjamins.

Granger, S. (2003) ‘Practical applications of learner corpora’ in B. Lewandowska-

Tomaszczyk (ed.) Practical Applications in Language and Computers, pp. 291-

302. Frankfurt: Peter Lang.

Granger, S. (2007) ‘Sylviane Granger: Interview’. Mindbite 1.

Granger, S., Hung, J. and Petch-Tyson, S. (eds.) (2002) Computer Learner Corpora,

Second Language Acquisition, and Foreign Language Teaching. Philadelphia:

John Benjamins.

Granger, S. and Tyson, S. (1996) ‘Connector usage in the English essay writing of

native and non-native speakers of English’. World Englishes 15: 19-29.

Gui, S. and Yang, H. (2002) Zhongguo Xuexizhe Yingyu Yuliaoku (Chinese Learner

English Corpus). Shanghai: Shanghai Foreign Language Education Press.

Hartmann, R. (1995) ‘Contrastive textology’. Language and Communication 5: 25-

37.

Herbst, T. (1996) ‘What are collocations: sandy beaches or false teeth?’. English

Studies 04/1996: 379-393.

Hidalgo, E., Quereda, L. and Santana, J. (2007) Corpora in the Foreign Language

Classroom: Selected Papers from the Sixth International Conference on Teaching

and Language Corpora (TaLC 6). Amsterdam: Rodopi.

Higgins, J. and Johns, T. (1984) Computers in Language Learning. Oxford: Oxford

University Press.

Hinkel, E. (2004) ‘Tense, aspect and the passive voice in L1 and L2 academic texts’.

Language Teaching Research 8/1: 5-29.

Hoey, M. (2000) ‘A world beyond collocation: new perspectives on vocabulary

teaching’ in M. Lewis (ed.) Teaching Collocations, pp. 224-245. Hove: Language

Teaching Publications.

Hoey, M. (2004) ‘Lexical priming and the properties of text’. In A. Partington, J.

Morley and L. Haarman (eds.) Corpora and Discourse, pp. 385-412. Bern: Peter

Lang.

Horner, D. and Strutt, P. (2004) ‘Analyzing domain-specific lexical categories:

evidence from the BEC written corpus’. Research Notes 15: 6-8.

Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge

University Press.

Hyland, K. (1999) ‘Talking to students: metadiscourse in introductory coursebooks’.

English for Specific Purposes 18/1: 3-26.

James, C. (1980) Contrastive Analysis. London: Longman.

Johansson, S. (2003) ‘Contrastive linguistics and corpora’. In S. Granger, J. Lerot and

S. Petch-Tyson (eds.) Corpus-Based Approaches to Contrastive Linguistics and

Translation Studies. Amsterdam: Rodopi. 31-44.

Johns, T. (1991) ‘“Should you be persuaded”: two samples of data-driven learning

materials’ in T. Johns and P. King (eds.) Classroom Concordancing ELR Journal

4. University of Birmingham.

Kaltenböck, G. and Mehlmauer-Larcher, B. (2005) ‘Computer corpora and the

language classroom: On the potential and limitations of computer corpora in

language teaching’. ReCALL 17: 65-84.

Karpati, I. (1995) Concordance in Language Learning and Teaching. Pecs:

University of Pecs.

Kaszubski, P. and Wojnowska, A. (2003) ‘Corpus-informed exercises for learners of

English: the TestBuilder program’ in E. Oleksy and B. Lewandowska-

Tomaszczyk (eds.) Research and Scholarship in Integration Processes: Poland -

USA – EU, pp. 337-354. Łódź: Łódź University Press.

Keck, C. (2004) ‘Corpus linguistics and language teaching research: bridging the

gap’. Language Teaching Research 8(1): 83-109.

Kennedy, G. (1998) An Introduction to Corpus Linguistics. London: Longman.

Kennedy, G. (2003) ‘Amplifier collocations in the British National Corpus:

implications for English language teaching’. TESOL Quarterly 37/3: 467-487.

Kenny, D. (1998) ‘Creatures of habit? What translators usually do with words?’.

Meta 43(4).

Kettemann, B. (1995) ‘On the use of concordancing in ELT’. TELL&CALL 4: 4-15.

Kettemann, B. (1996) ‘Concordancing in English Language Teaching’ in S. Botley, J.

Glass, A. McEnery and A. Wilson (eds.) Proceedings of Teaching and Language

Corpora, pp. 4-16. Lancaster University.

Kettemann, B. and Marko, G. (2002) Teaching and Learning by Doing Corpus

Analysis. Amsterdam: Rodopi.

Kettemann, B. and Marko, G. (eds) (2006) Planning, Gluing and Painting Corpora:

Inside the Applied Corpus Linguist’s Workshop. Frankfurt: Peter Lang.

Kita, K. and Ogata, H. (1997) ‘Collocations in language learning: corpus-based

automatic compilation of collocations and bilingual collocation concordancer’.

Computer Assisted Language Learning 10/3: 229-238.

Kjellmer, G. (1991) ‘A mint of phrases’ in K. Aijmer and B. Altenberg (eds.) English

Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman.

Koester, A. (2002) ‘The performance of speech acts in workplace conversations and

the teaching of communicative functions’. System 30: 167-184.

Lado, R. (1957) Linguistics across Cultures: Applied Linguistics for Language

Teachers. Ann Arbor: University of Michigan Press.

Laviosa, S. (1997) ‘How comparable can “comparable corpora” be?’. Target 9: 289-

319.

Laviosa, S. (1998) ‘Core patterns of lexical use in a comparable corpus of English

narrative prose’. Meta 43(4).

Leech, G. (1997) ‘Teaching and language corpora: a convergence’ in A. Wichmann,

S. Fligelstone, A. McEnery and G. Knowles (eds.) Teaching and Language

Corpora, pp. 1-23. London: Longman.

Lewis, M. (1993) The Lexical Approach: The State of ELT and the Way Forward.

Hove: Language Teaching Publications.

Lewis, M. (1997a) Implementing the Lexical Approach: Putting Theory into Practice.

Hove: Language Teaching Publications.

Lewis, M. (1997b) ‘Pedagogical implications of the lexical approach’ in J. Coady and

T. Huckin (eds.) Second Language Vocabulary Acquisition: A Rationale for

Pedagogy, pp. 255-270. Cambridge: Cambridge University Press.

Lewis, M. (ed.) (2000) Teaching Collocation: Further Developments in the Lexical

Approach. Hove: Language Teaching Publications.

Lü, S. and Zhu, D. (1979) Yufa Xiuci Jianghua (Talks on Grammar and Rhetoric).

Beijing: Chinese Youth Press.

Mauranen, A. (2002) ‘Will “translationese” ruin a contrastive study?’. Languages in

Contrast 2(2): 161-186.

McAlpine, J. and Myles, J. (2003) ‘Capturing phraseology in an online dictionary for

advanced users of English as a second language: a response to user needs’. System

31: 71-84.

McCarthy, M., McCarten, J. and Sandiford, H. (2005-2006) Touchstone (Books 1-4).

Cambridge: Cambridge University Press.

McEnery, A. and Wilson, A. (2001) Corpus Linguistics (1st ed. 1996). Edinburgh:

Edinburgh University Press.

McEnery, A. and Xiao, R. (2002) ‘Domains, text types, aspect marking and English-

Chinese translation’. Languages in Contrast 2(2): 211-229.

McEnery, A. and Xiao, R. (2005) ‘Help or help to: What do corpora have to say?’

English Studies 86(2): 161-187.

McEnery, A. and Xiao, R. (2007) ‘Parallel and comparable corpora: What is

happening?’. In M. Rogers and G. Anderman (eds.) Incorporating Corpora: The

Linguist and the Translator, pp. 18-31. Clevedon: Multilingual Matters.

McEnery, A. and Xiao, R. (2008) CALLHOME Mandarin Chinese Transcripts - XML

version. Pennsylvania: Linguistic Data Consortium.

McEnery, A., Xiao, R. and Mo, L. (2003) ‘Aspect marking in English and Chinese:

using the Lancaster Corpus of Mandarin Chinese for contrastive language study’.

Literary and Linguistic Computing 18(4): 361-378.

McEnery, T., Xiao, R. and Tono, Y. (2006) Corpus-Based Language Studies: An

Advanced Resource Book. London: Routledge.

Meunier, F. (2002) ‘The pedagogical value of native and learner corpora in EFL

grammar teaching’ in S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer

Learner Corpora, Second Language Acquisition and Foreign Language Teaching,

pp. 119–142. Philadelphia: John Benjamins.

Mindt, D. (1996) ‘English corpus linguistics and the foreign language teaching

syllabus’ in J. Thomas and M. Short (eds.) Using Corpora for Language

Research, pp. 232-247. London: Longman.

Mishan, F. (2005) Designing Authenticity into Language Learning Materials.

Chicago: Chicago University Press.

Mukherjee, J. and Rohrbach, J. (2006) ‘Rethinking applied corpus linguistics from a

language-pedagogical perspective: New departures in learner corpus research’. In

B. Kettemann and G. Marko (eds) Planning, Gluing and Painting Corpora: Inside

the Applied Corpus Linguist’s Workshop, pp. 205-232. Frankfurt: Peter Lang.

Murison-Bowie, S. (1996) ‘Linguistic corpora and language teaching’. Annual

Review of Applied Linguistics 16: 182-199.

Myles, F. (2005) ‘Interlanguage corpora and second language acquisition research’.

Second Language Research 21(4): 373-391.

Nelson, G. (1996) ‘The design of the corpus’. In S. Greenbaum (ed.) Comparing

English Worldwide: The International Corpus of English, pp. 27-35. Oxford:

Clarendon Press.

Nelson, M. (2000) A Corpus-Based Study of Business English and Business English

Teaching Materials. PhD thesis, the University of Manchester, Manchester.

Available at http://users.utu.fi/micnel/thesis.html.

Nesselhauf, N. (2003) ‘The use of collocations by advanced learners of English and

some implications for teaching’. Applied Linguistics 24/2: 223-242.

Nesselhauf, N. (2005) Collocations in a Learner Corpus. Amsterdam: John

Benjamins.

O’Keeffe, A. and Farr, F. (2003) ‘Using language corpora in initial teacher education:

pedagogic issues and practical applications’. TESOL Quarterly 37/3: 389-418.

O’Keeffe, A., McCarthy, M. and Carter, R. (2007) From Corpus to Classroom:

Language Use and Language Teaching. Cambridge: Cambridge University Press.

Osborne, J. (2001) ‘Integrating corpora into a language-learning syllabus’ in B.

Lewandowska-Tomaszczyk (ed.) PALC 2001: Practical Applications in

Language Corpora, pp. 479-492. Frankfurt: Peter Lang.

Osborne, J. (2002) ‘Top-down and bottom-up approaches to corpora in language

teaching’. In U. Connor and T. Upton (eds) Applied Corpus Linguistics: A

Multidimensional Perspective, pp. 251-265. Amsterdam: Rodopi.

Øverås, S. (1998) ‘In search of the third code: an investigation of norms in literary

translation’. Meta 43(4).

Partington, A. (1998) Patterns and Meanings: Using Corpora for English Language

Research and Teaching. Amsterdam: John Benjamins.

Pravec, N. (2002) ‘Survey of learner corpora’. ICAME Journal 26: 81-114.

Renouf, A. (1987) ‘Moving on’ in J. Sinclair (ed.) Looking Up: An Account of the

COBUILD Project. London: HarperCollins.

Römer, U. (2005) Progressives, Patterns, Pedagogy: A Corpus-driven Approach to

English Progressive Forms, Functions, Contexts and Didactics. Amsterdam: John

Benjamins.

Römer, U. (2008) ‘Corpora and language teaching’. In A. Lüdeling and M. Kytö

(eds.) Corpus Linguistics: An International Handbook, pp. 112-131. Berlin:

Mouton de Gruyter.

Sajavaara, K. (1996) ‘New challenges for contrastive linguistics’. In K. Aijmer, B.

Altenberg and M. Johansson (eds.) Language in Contrast, pp. 17-36. Lund: Lund

University Press.

Salkie, R. (1999) ‘How can linguists profit from parallel corpora?’. Paper given at the

Symposium on Parallel Corpora. 22-23 April 1999, University of Uppsala.

Santos, D. (1996) Tense and Aspect in English and Portuguese: A Contrastive

Semantical Study. PhD thesis. Universidade Técnica de Lisboa.

Scott, M. and Tribble, C. (2006) Textual Patterns: Key Words and Corpus Analysis

in Language Education. Amsterdam: John Benjamins.

Seidlhofer, B. (2000) ‘Operationalizing intertextuality: using learner corpora for

learning’ in L. Burnard and A. McEnery (eds.) Rethinking Language Pedagogy

from a Corpus Perspective, pp. 207–24. New York: Peter Lang.

Seidlhofer, B. (2002) ‘Pedagogy and local learner corpora: working with learning

driven data’ in S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer Learner

Corpora, Second Language Acquisition and Foreign Language Teaching, pp.

213–234. Philadelphia: John Benjamins.

Shei, C. and Pain, H. (2000) ‘An ESL writer’s collocational aid’. Computer Assisted

Language Learning 13/2: 167-182.

Sinclair, J. (1990) Collins COBUILD English Grammar. London: HarperCollins.

Sinclair, J. (1992) Collins COBUILD English Usage. London: HarperCollins.

Sinclair, J. (2000) ‘Lexical grammar’. Naujoji Metodologija 24: 191-203.

Sinclair, J. (2003) Reading Concordances. London: Longman.

Sinclair, J. (ed.) (2004) How to Use Corpora in Language Teaching. Amsterdam:

John Benjamins.

Sinclair, J., Bullon, S., Krishnamurthy, R., Manning, E. and Todd, J. (1990) Collins

COBUILD English Grammar. London: HarperCollins.

Sinclair, J. and Renouf, A. (1988) ‘A lexical syllabus for language learning’ in R.

Carter and M. McCarthy (eds.) Vocabulary and Language Teaching. London:

Longman.

Smadja, F. and McKeown, K. (1990) ‘Automatically extracting and representing

collocations for language generation’ in Proceedings of the 28th Annual Meeting

of the Association for Computational Linguistics, pp. 252-259.

Sripicharn, P. (2000) ‘Data-driven learning materials as a way to teach lexis in

context’ in C. Heffer, H. Sauntson and G. Fox (eds.) Words in Context: A tribute

to John Sinclair on his Retirement. Birmingham: University of Birmingham.

Stubbs, M. (2001) ‘Texts, corpora, and problems of interpretation: a response to

Widdowson’. Applied Linguistics 22(2): 149-172.

Tan, M. (2002) Corpus Studies in Language Education. Bangkok: IELE Press.

Taylor, L. (2003) ‘The Cambridge approach to speaking assessment’. Research Notes

13: 2-4.

Teubert, W. (1996) ‘Comparable or parallel corpora?’. International Journal of

Lexicography 9(3): 238-264.

Thompson, P. and Tribble, C. (2001) ‘Looking at citations: using corpora in English

for academic purposes’. Language Learning & Technology 5/3: 91-105.

Thurstun, J. and Candlin, C. (1997) Exploring Academic English: A Workbook for

Student Essay Writing. Sydney: NCELTR.

Thurstun, J. and Candlin, C. (1998) ‘Concordancing and the teaching of the

vocabulary of academic English’. English for Specific Purposes 17: 267-280.

Tribble, C. (1991) ‘Concordancing and an EAP writing program’. CAELL Journal

1/2: 10-15.

Tribble, C. (1997a) ‘Corpora, concordances and ELT’ in T. Boswood (ed.) New Ways

of Using Computers in Language Teaching. Alexandria VA: TESOL.

Tribble, C. (1997b) ‘Improving corpora for ELT: quick and dirty ways of developing

corpora for language teaching’ in B. Lewandowska-Tomaszczyk and P. Melia (eds.)

Practical Applications in Language Corpora – Proceedings of PALC ’97, pp.

107-117. Łódź: Łódź University Press.

Tribble, C. (2000) ‘Practical uses for language corpora in ELT’ in P. Brett and G.

Motteram (eds.) A Special Interest in Computers: Learning and Teaching with

Information and Communications Technologies, pp. 31-41. Kent: IATEFL.

Tribble, C. (2003) ‘The text, the whole text…or why large published corpora aren’t

much use to language learners and teachers’ in B. Lewandowska-Tomaszczyk

(ed.) Practical Applications in Language and Computers, pp. 303-318. Frankfurt:

Peter Lang.

Tribble, C. and Jones, G. (1990) Concordances in the Classroom: A Resource Book

for Teachers. London: Longman.

Tribble, C. and Jones, G. (1997) Concordances in the Classroom: Using Corpora in

Language Education. Houston TX: Athelstan.

Upton, T. and Connor, U. (2001) ‘Using computerized corpus analysis to investigate

the textlinguistic discourse move of a genre’. English for Specific Purposes 20:

313-329.

Wang, L. (1984) Zhongguo Jufa Lilun (Syntactic Theories in China). Qingdao:

Shandong Education Press.

Wichmann, A. (1995) ‘Using concordances for the teaching of modern languages in

higher education’. Language Learning Journal 11: 61-63.

Wichmann, A. (1997) ‘General introduction’ in A. Wichmann, S. Fligelstone, A.

McEnery and G. Knowles (eds.) Teaching and Language Corpora, pp. xvi-xvii.

London: Longman.

Wichmann, A., Fligelstone, S., McEnery, A. and Knowles, G. (eds.) (1997) Teaching

and Language Corpora. London: Longman.

Widdowson, H. (1990) Aspects of Language Teaching. Oxford: Oxford University

Press.

Widdowson, H. (1991) ‘The description and prescription of language’ in J. Alatis

(ed.) Georgetown University Round Table on Languages and Linguistics 1991,

pp. 11-24. Washington, D.C.: Georgetown University Press.

Widdowson, H. (2000) ‘The limitations of linguistics applied’. Applied Linguistics

21/1: 3-25.

Widdowson, H. (2003) Defining Issues in English Language Teaching. Oxford:

Oxford University Press.

Willis, D. (1990) The Lexical Syllabus: A New Approach to Language Teaching.

London: HarperCollins.

Willis, J., Willis, D. and Davids, J. (1988-1989) Collins COBUILD English Course

(Parts 1-3). London: HarperCollins.

Woolls, D. (1998) ‘Multilingual parallel concordancing for pedagogical use’ in

Teaching and Language Corpora, pp. 222-227. Keble College, Oxford, 24-27

July 1998.

Xiao, R. (2003) ‘Use of parallel and comparable corpora in language study’. English

Education in China 2003(1).

Xiao, R. (2008) ‘Well-known and influential corpora’. In A. Lüdeling and M. Kytö

(eds) Corpus Linguistics: An International Handbook, pp. 383-457. Berlin:

Mouton de Gruyter.

Xiao, R., He, L. and Yue, M. (forthcoming) ‘In pursuit of the third code: Using the

ZJU Corpus of Translational Chinese in Translation Studies’. In R. Xiao (ed.)

Using Corpora in Contrastive and Translation Studies. Newcastle upon Tyne:

Cambridge Scholars Publishing.

Xiao, R., McEnery, T. and Qian, Y. (2006) ‘Passive constructions in English and

Chinese: A corpus-based contrastive study’. Languages in Contrast 6(1): 109-149.

Xiao, R. and Yue, M. (2009) ‘Using corpora in Translation Studies: The state of the

art’. In P. Baker (ed.) Contemporary Corpus Linguistics. London: Continuum.

Yang, Y. and Allison, D. (2003) ‘Research articles in applied linguistics: moving

from results to conclusions’. English for Specific Purposes 22: 365-385.

Zhang, X. (1993) English Collocations and Their Effect on the Writing of Native and

Non-native College Freshmen. PhD thesis. Indiana University of Pennsylvania.
