Opportunities in opportunism: a critical evaluation of data collection methods in the Spoken BNC2014
Robbie Love
Cambridge Assessment English
lovermob

http://cass.lancs.ac.uk

Today’s talk

1. The Spoken BNC2014
2. Opportunistic vs. principled design
3. The Spoken BNC1994 & other corpora
4. Opportunism in the Spoken BNC2014
5. Discussion: was it worth it?
6. Conclusions

The Spoken BNC2014

• Conversational, L1 British English
• 2012-2016
• 672 speakers
• 1,251 texts
• 11,422,617 words
• Freely available to the public:
  – (1) on Lancaster’s CQPweb, then
  – (2) file download

Why?

Who built the Spoken BNC2014?

Lancaster University:
• Robbie Love, Andrew Hardie, Vaclav Brezina, Tony McEnery

Cambridge University Press:
• Claire Dembry
• Olivia Goodman, Imogen Dickens, Sarah Grieves, Laura Grimes, Samantha Owen, 20 transcribers

Opportunistic vs. principled design

Principled corpus compilation
• pre-determined sampling frame
• proportions of types of data assigned before collection
• concerned with representativeness and/or balance

Well-known examples:
• The Written BNC1994
• The Brown family

Opportunistic vs. principled design

Opportunistic corpus compilation
• gather everything, or as much as you can, that is relevant to the research question (RQ)
• resulting proportions reflect availability of data given constraints of time/money
• supplement corpus with targeted data collection
AND/OR
• discover some ‘data’ and construct RQ(s) around it

Opportunistic vs. principled design

“converting spoken recordings into machine-readable transcriptions is a very time-consuming task” (McEnery & Hardie 2012: 12)

“the opportunistic (or cannibalistic) corpus…is based on the assumption that each and every corpus is unbalanced” (Teubert & Cermáková 2004: 120)

“only very few corpus linguists but many computational linguists use large opportunistic corpora” (Lüdeling & Zeldes 2008)

“opportunistic corpus building has often been dismissed, and in some cases with justification, as being unscientific and unrepresentative” (Douglas 2003: 34)

How did the BNC do it?

BNC1994 = intended to be “non-opportunistic” (Burnard 2002: 2)

BNC1994 ‘spoken demographic’ = certainly not fully principled

124 contributors = balanced core
1,284 speakers = not controlled
Result: not balanced, nor representative
(Crowdy 1993: 260)

How did the BNC do it?

[Figure: two bar charts comparing % of words collected by BNC respondents with % of the UK population in 1991: one by gender (Males, Females), one by age band (0-14, 15-24, 25-34, 35-44, 45-59, 60+).]
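The comparison on this slide, share of words collected against share of the population, can be sketched in a few lines of Python. This is a minimal illustration only: the function name `representation_gap` and the example figures are hypothetical placeholders, not BNC1994 or census values.

```python
def representation_gap(word_counts, population_pct):
    """Per category: corpus share of words minus population share,
    in percentage points (positive = over-represented in the corpus)."""
    total_words = sum(word_counts.values())
    gaps = {}
    for category, words in word_counts.items():
        corpus_pct = 100 * words / total_words
        gaps[category] = round(corpus_pct - population_pct[category], 1)
    return gaps


# Hypothetical placeholder figures, NOT BNC1994 or 1991 census data.
example_words = {"group_a": 400, "group_b": 600}
example_population = {"group_a": 49.0, "group_b": 51.0}
example_gaps = representation_gap(example_words, example_population)
```

A positive gap means a group contributes a larger share of the words than its share of the population; a negative gap means it is under-represented.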

Other corpora

“opportunism, although often a necessary evil, is not necessarily indefensible so long as the collection method itself is transparent” (Douglas 2003: 34)

PRINCIPLED
• Research and Teaching Corpus of Spoken German (FOLK)
• (Wordbanks Online)
• Corpus of Academic Spoken English (CASE)
• International Corpus of English – Great Britain (ICE-GB)
• Scottish Corpus of Texts and Speech (SCOTS)
• National Corpus of Contemporary Welsh (CorCenCC)
• BBC Voices project
• Santa Barbara Corpus of Spoken American English (SBCSAE)
OPPORTUNISTIC

The Spoken BNC2014: aims

• to compile a corpus of informal British English conversation from the 2010s which is comparable to the Spoken BNC1994’s demographic component;

• to compile the corpus in a manner which reflects, as much as possible, the state of the art with regard to methodological approach; and, in achieving this,

• to provide a fresh data source for a new series of wide-ranging studies in linguistics and the social sciences, and for the teaching of English

The Spoken BNC2014

• More opportunistic than the Spoken BNC1994
• Open call for contributors = PPSR (Shirk et al. 2012)
• “contributory [public participation in scientific research] projects generally result in large-scale data sets”
• Smartphones, payment, press campaigns

The Spoken BNC2014

• No balanced core of contributors, i.e. no sampling frame
• Collect any and all relevant data
• Monitor during collection & attempt to address ‘holes’
  – Social media, student recruitment, press campaigns
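One way to picture monitoring for ‘holes’ during collection is the sketch below. It is an assumption-laden illustration, not the project's actual procedure: the function `find_holes`, the `min_share` threshold and the category names are all hypothetical, and the slides do not say that any numerical threshold was used.

```python
def find_holes(word_counts, min_share):
    """Return the categories whose share of the words collected so far
    falls below min_share (a proportion between 0 and 1)."""
    total = sum(word_counts.values())
    if total == 0:
        # Nothing collected yet: every category counts as a hole.
        return sorted(word_counts)
    return sorted(
        category
        for category, words in word_counts.items()
        if words / total < min_share
    )


# Hypothetical running totals, NOT Spoken BNC2014 figures.
running_totals = {"category_x": 50, "category_y": 10, "category_z": 140}
holes = find_holes(running_totals, min_share=0.10)
```

Categories returned by such a check would be candidates for targeted recruitment of the kind listed above.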

The result

[Figure: bar charts of word counts by speaker category. Recoverable axis labels: Male, Female; A, B, C1, C2, D, E, unknown; north, midlands, south.]

The result

[Figure: bar chart of word counts with an enlarged view of the categories scotland, wales, n_ireland and r_ireland; one bar is labelled 861.]

Entering the unknown

[Figure: bar chart, vertical axis 0 to 45, with bars for Age (1990s), Age (2010s), Gender (1990s), Gender (2010s), SES (1990s) and SES (2010s).]

Reflections

• Less concerned about balance, but more about size of components
• English regions – an ‘English English’ corpus
• Speed of construction – last recording only a year before release
• Rich record of metadata

• Still…could be bigger!
• Need not be used on its own – e.g. Listening Project

• Opportunism paid off, for the most part
• But some interventions better than others, given constraints
• We are at the whim of participants – remember that these are intimate conversations

“while the notions of monitor and snapshot corpora provide us with relatively idealised models of corpus construction, it should be noted, and accepted, that the corpora that we use and construct must sometimes be determined by pragmatic considerations”

(McEnery & Hardie 2012: 13)

For more info

See:
• Project website – corpora.lancs.ac.uk/bnc2014/
• Love, Dembry, Hardie, Brezina & McEnery (2017 in press) – corpus citation paper
• Love, Hawtin & Hardie (2017 forthcoming) – user guide
• Love (submitted) – entire PhD thesis

References

• Burnard, L. (2002). Where did we go wrong? A retrospective look at the British National Corpus. In B. Kettemann, & G. Markus (Eds.), Teaching and learning by doing corpus analysis (pp. 51-71). Amsterdam: Rodopi.
• Crowdy, S. (1993). Spoken Corpus Design. Literary and Linguistic Computing, 8(4), 259-265.
• Douglas, F. (2003). The Scottish Corpus of Texts and Speech: problems of corpus design. Literary and Linguistic Computing, 18(1), 23-37.
• Love, R., Dembry, C., Hardie, A., Brezina, V., & McEnery, T. (2017). The Spoken BNC2014: designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3).
• Love, R., Hawtin, A., & Hardie, A. (2017). The British National Corpus 2014: User Manual and Reference Guide (version 1.0). Lancaster: ESRC Centre for Corpus Approaches to Social Science.
• Lüdeling, A., & Zeldes, A. (2008). Three Views on Corpora: Corpus Linguistics, Literary Computing, and Computational Linguistics. Available at: http://computerphilologie.tu-darmstadt.de/jg07/luedzeldes.html (last accessed October 2017).
• McEnery, T., & Hardie, A. (2012). Corpus linguistics: method, theory and practice. Cambridge: Cambridge University Press.
• Shirk, J. L., Ballard, H. L., Wilderman, C. C., Phillips, T., Wiggins, A., Jordan, R., McCallie, E., Minarchek, M., Lewenstein, B. V., Krasny, M. E., & Bonney, R. (2012). Public participation in scientific research: A framework for deliberate design. Ecology and Society, 17(2), 29.
• Teubert, W. & Cermáková, A. (2004). Directions in Corpus Linguistics. In M. A. K. Halliday, W. Teubert, C. Yallop, & A. Cermáková. (Eds.), Lexicology and Corpus Linguistics (pp. 113-166). London: Continuum.
