Automating Second Language Acquisition Research: Integrating Information Visualisation and Machine Learning

Automating Second Language Acquisition Research: Integrating Information Visualisation and Machine Learning Helen Yannakoudakis Ted Briscoe Theodora Alexopoulou Computer Laboratory Computer Laboratory DTAL University of Cambridge University of Cambridge University of Cambridge United Kingdom United Kingdom United Kingdom [email protected] [email protected] [email protected] Abstract work, we modify and combine techniques developed for information visualisation with method- We demonstrate how data-driven ap- ologies from computational linguistics to support proaches to learner corpora can support a novel and more empirical perspective on CEFR Second Language Acquisition research levels. In particular, we build a visual user in- when integrated with visualisation tools. terface (hereafter UI) which aids the develop- We present a visual user interface supporting the investigation of a set of linguistic ment of hypotheses about learner grammars us- features discriminating between pass and ing graphs of linguistic features discriminating fail ‘English as a Second or Other Lan- pass/fail exam scripts for intermediate English. guage’ exam scripts. The system displays Briscoe et al. (2010) use supervised discrimi- directed graphs to model interactions native machine learning methods to automate the between features and supports exploratory assessment of ‘English as a Second or Other Lan- search over a set of learner scripts. We guage’ (ESOL) exam scripts, and in particular, the illustrate how the interface can support the investigation of the co-occurrence First Certificate in English (FCE) exam, which of many individual features, and discuss assesses English at an upper-intermediate level how such investigations can shed light on (CEFR level B2). They use a binary discrimina- understanding the linguistic abilities that tive classifier to learn a linear threshold function characterise different levels of attainment that best discriminates passing from failing FCE and, more generally, developmental aspects scripts, and predict whether a script can be clas- of learner grammars. sified as such. To facilitate learning of the clas- sification function, the data should be represented 1 Introduction appropriately with the most relevant set of (linguistic) features. They found a discriminative fea- The Common European Framework of Reference ture set includes, among other feature types, lexi- for Languages (CEFR)1 is an international bench- cal and part-of-speech (POS) ngrams. We extract mark of language attainment at different stages of the discriminative instances of these two feature learning. The English Profile (EP)2 research pro- types and focus on their linguistic analysis3. Ta- gramme aims to enhance the learning, teaching ble 1 presents a small subset ordered by discrimi- and assessment of English as an additional lan- native weight. guage by creating detailed reference level descrip- The investigation of discriminative features can tions of the language abilities expected at each offer insights into assessment and into the linguis- level. As part of our research within that frame- 3Briscoe et al. (2010) POS tagged and parsed the data 1http://www.coe.int/t/dg4/linguistic/cadre en.asp using the RASP toolkit (Briscoe et al., 2006). POS tags are 2http://www.englishprofile.org/ based on the CLAWS tagset. tic properties characterising the relevant CEFR Feature Example level. However, the amount and variety of data VM RR (POS bigram: +) could clearly potentially made available by the classifier is con- , because (word bigram: −) , because of siderable, as it typically finds hundreds of thou- necessary (word unigram: +) it is necessary that sands of discriminative feature instances. Even the people (word bigram: −) *the people are clever if investigation is restricted to the most discrim- VV; VV; (POS bigram: −) *we go see film inative ones, calculations of relationships be- NN2 VVG (POS bigram: +) children smiling tween features can rapidly grow and become over- whelming. Discriminative features typically cap- Table 1: Subset of features ordered by discriminative weight; + and − show their association with either ture relatively low-level, specific and local prop- passing or failing scripts. erties of texts, so features need to be linked to the scripts they appear in to allow investigation of the contexts in which they occur. The scripts, in turn, of their co-occurrence in the corpus and directed need to be searched for further linguistic prop- graphs are used in order to illustrate their rela- erties in order to formulate and evaluate higher- tionships. Selection of different feature combi- level, more general and comprehensible hypothe- nations automatically generates queries over the ses which can inform reference level descriptions data and returns the relevant scripts as well as as- and understanding of learner grammars. sociations with meta-data and different types of The appeal of information visualisation is to errors committed by the learners4. In the next sec- gain a deeper understanding of important phe- tions we describe in detail the visualiser, illustrate nomena that are represented in a database (Card et how it can support the investigation of individual al., 1999) by making it possible to navigate large features, and discuss how such investigations can amounts of data for formulating and testing hy- shed light on the relationships between features potheses faster, intuitively, and with relative ease. and developmental aspects of learner grammars. An important challenge is to identify and assess To the best of our knowledge, this is the first the usefulness of the enormous number of pro- attempt to visually analyse as well as perform jections that can potentially be visualised. Explo- a linguistic interpretation of discriminative fea- ration of (large) databases can lead quickly to nu- tures that characterise learner English. We also merous possible research directions; lack of good apply our visualiser to a set of 1,244 publically- tools often slows down the process of identifying available FCE ESOL texts (Yannakoudakis et al., the most productive paths to pursue. 2011) and make it available as a web service to In our context, we require a tool that visu- other researchers5. alises features flexibly, supports interactive investigation of scripts instantiating them, and allows 2 Dataset statistics about scripts, such as the co-occurrence We use texts produced by candidates taking the of features or presence of other linguistic proper- FCE exam, which assesses English at an upper- ties, to be derived quickly. One of the advantages intermediate level. The FCE texts, which are of using visualisation techniques over command- part of the Cambridge Learner Corpus6, are pro- line database search tools, is that Second Lan- duced by English language learners from around guage Acquisition (SLA) researchers and related the world, sitting Cambridge Assessment’s ESOL users, such as assessors and teachers, can access examinations7. The texts are manually tagged scripts, associated features and annotation intuitively without the need to learn query language 4Our interface integrates a command-line Lucene search syntax. tool developed by Gram and Buttery (2009). 5 We modify previously-developed visualisation Available by request: http://ilexir.co.uk/applications/ep- visualiser/ techniques (Di Battista et al., 1999) and build a 6http://www.cup.cam.ac.uk/gb/elt/catalogue/subject/custom/ visual UI supporting hypothesis formation about item3646603/ learner grammars. Features are grouped in terms 7http://www.cambridgeesol.org/ with information about linguistic errors (Nicholls, using score(fi; fj). Feature relations are shown 2003) and linked to meta-data about the learners via highlighting of features when the user hovers (e.g., age and native language) and the exam (e.g., the cursor over them, while the strength of the re- grade). lations is visually encoded in the edge width. For example, one of the highest-weighted pos- 3 The English Profile visualiser itive discriminative features is VM RR (see Ta- 3.1 Basic structure and front-end ble 1), which captures sequences of a modal auxiliary followed by an adverb as in will al- The English Profile (EP) visualiser is developed ways (avoid) or could clearly (see). Investigat- in Java and uses the Prefuse library (Heer et ing its relative co-occurrence with other features al., 2005) for the visual components. Figure 1 using a score range of 0.8–1 and regardless of shows its front-end. Features are represented directionality, we find that VM RR is related to by a labelled node and displayed in the central the following: (i) POS ngrams: RR VB; AT1, panel; positive features (i.e., those associated with VM RR VB;, VM RR VH;, PPH1 VM RR, passing the exam) are shaded in a light green 8 VM RR VV;, PPIS1 VM RR, PPIS2 VM RR, color while negative ones are light red . A field RR VB;; (ii) word ngrams: will also, can only, at the bottom right supports searching for fea- can also, can just. These relations show us the tures/nodes that start with specified characters and syntactic environments of the feature (i) or its highlighting them in blue. characteristic lexicalisations (ii). 3.2 Feature relations 3.3 Dynamic creation of graphs via selection Crucial to understanding discriminative features criteria is finding the relationships that hold between them. We calculate co-occurrences of features at Questions relating to a graph display may include the sentence-level in order to extract ‘meaningful’ information about the most connected nodes, sep- relations and possible patterns of use. Combi- arate components of the graph, types of intercon- nations of features that may be ‘useful’ are kept nected features, etc. However, the functionality, while the rest are discarded. ‘Usefulness’ is mea- usability and tractability of graphs is severely lim- sured as follows: consider the set of all the sen- ited when the number of nodes and edges grows tences in the corpus S = fs ; s ; :::; s g and the 1 2 N by more than a few dozen (Fry, 2007). In order set of all the features F = ff1; f2; :::; fM g.

Load more