R-Based Strategies for DH in English Linguistics: a Case Study
Nicolas Ballier
Université Paris Diderot
UFR Études Anglophones
CLILLAC-ARP (EA 3967)
nicolas.ballier@univ-paris-diderot.fr

Paula Lissón
Université Paris Diderot
UFR Études Anglophones
CLILLAC-ARP (EA 3967)
[email protected]

Abstract

This paper is a position statement advocating the implementation of the programming language R in a curriculum of English Linguistics. This is an illustration of a possible strategy for the requirements of Natural Language Processing (NLP) for Digital Humanities (DH) studies in an established curriculum. R plays the role of a Trojan Horse for NLP and statistics, while promoting the acquisition of a programming language. We report an overview of existing practices implemented in an MA and PhD programme at the University of Paris Diderot in recent years. We emphasize completed aspects of the curriculum and detail existing teaching strategies rather than work in progress, but our last section alludes to work still under way, such as getting PhD students to write their own R packages.

We describe our strategy, discuss better practices and teaching concepts, and present experiments in a curriculum. We express the needs of an initially limited NLP environment and provide directions for future DH curricular developments. We detail the challenges in teaching a non-CL audience, showing that some software suites can be integrated into a curriculum, and outlining how some specific R packages contribute to the acquisition of NLP-based techniques and foster awareness of the need for statistical modelling.

1 Introduction

This paper deals with the development of an R-based culture of DH for students of English Linguistics at the University of Paris Diderot. We describe some aspects of a curriculum (MA and PhD) that aims at taking advantage of the flexibility and the adaptability of this programming language for research in linguistics, in both quantitative and qualitative approaches.

Developing a culture based on the programming language R (R Core Team 2016) for NLP among MA and PhD students doing English Linguistics is no easy task. Broadly speaking, most of the students have little background in mathematics, statistics, or programming, and usually feel reluctant to study any of these disciplines. While most PhD students in English linguistics are former students with a Baccalauréat in Sciences, some MA students pride themselves on having radically opted out of Maths. However, we believe that students should be made aware of the growing use of statistical and NLP methods in linguistics and should be able to interpret and implement these techniques.

We need to show our students how important the DH are for their research, enabling them to see that the use of NLP techniques provides them with a whole range of new possibilities for the treatment and the analysis of their data. In addition, with the growth of corpus linguistics, students often need to work with large corpora or huge databases of images, and standard tutorials and introductory books do not cover these needs (Arnold and Tilton 2015).

Preparing students to work with NLP methods and with command lines also means asking them to work with particular data formats they need to get used to (e.g. tabular format, UTF-8, limited use of special characters unless necessary). However, this facilitates the interoperability of their data, as well as the replicability of their research.

The rest of the paper is organized as follows. Section 2 describes the context of an MA in English Linguistics and explains why the culture is traditionally limited in NLP and DH in this kind of curriculum. It also details the strategy used to develop an 'R-based culture' among MA and PhD students, taking advantage of its flexibility and adaptability to the various requirements of linguistic data. Section 3 explains how collaboration with the Maths department has enabled the emergence of an R-based common culture for statistics to be taught to mostly 'mathless' students. Section 4 discusses the various strategies to teach R in recent textbooks of quantitative linguistics (from Baayen, 2008 to Levshina, 2015). Several teaching styles and aims are discussed. Section 5 discusses DH in a wider perspective of Machine Learning (ML) based analyses and shows the benefits of R, reporting on the possibility of an interdisciplinary bridge for programs in data science. Section 6 proposes an epistemological interpretation of R as a possible medium for the 3rd revolution of grammatisation, expanding on Sylvain Auroux's notion. The conclusion reflects on the limitations of our strategy, when compared with other programming languages. We try to assess the relevance of this Esperanto-like, high-level programming language for digital humanities.

2 Developing an R-based culture for NLP, DH and beyond

2.1 Why is it necessary?

In France, the study of English Linguistics in English departments has traditionally been linked to the competitive exams to become teachers of English (agrégation), so that linguistics is only a sub-domain in relation to other domains of English studies such as translation and literature. As a consequence, the core of this curriculum can hardly be dedicated to corpus linguistics. There is also another structural (devastating) side effect for linguistic research: since there is not a single trace of NLP-driven questions for the agrégation, there is nothing in the curriculum of English studies about these issues (contrary to what the introduction of English phonology somehow triggered after 2000 in the agrégation and in the undergraduate programmes that are meant to prepare for this competitive exam). Since corpus linguistics and quantitative methods have become essential in Linguistics curricula in most European universities, the gap in France between the agrégation and linguistic research is widening.

In corpus linguistics, the treatment of large corpora and complex datasets with many variables is getting increasingly frequent. However, the application of statistical models, as well as the use of NLP techniques such as parsers, Part-of-Speech (POS) taggers or classifiers, among many other possibilities, requires some familiarity with at least one programming language. Most programs designed for corpus linguistics, such as AntConc (Anthony 2011), Cesax (Komen 2011), Sketch Engine (Kilgarriff et al. 2004), Wordsmith (Scott 2016) or, within the French lexicometric tradition, Le Trameur (Fleury and Zimina 2014) or Le Gromoteur (Gerdes 2014), are presented in a graphical and more or less user-friendly interface (GUI) with built-in functions. The command line nevertheless offers much more flexibility in terms of exploration and modelling of the data. For instance, R can first be used as a concordancer (see Gries 2009) in order to explore a given corpus; then, once the desired structures have been extracted, they can also be treated in R in terms of statistical analysis and/or modelling. Finally, a visual representation of the results of the analysis can be easily plotted.

Apart from all the statistical packages that can be used for a quantitative analysis of data, there are currently more than 50 packages for NLP available in the CRAN repository¹, some of them being particularly useful for linguists. Because our students have research questions on spoken or written data, they can find several R packages to suit their needs if they are taught how to use them. These are some of the specific packages our students are working with:

• Phonetics/Phonology. Students are presented with normalization issues and analyses of vowel systems drawing from packages such as phonTools (Barreda 2015), vowels (Kendall and Thomas 2010), emuR (Winkelmann, Jaensch, and Harrington 2017) and phonR (McCloy 2016), which allow for the treatment of data extracted with Praat (Boersma and Weenink, 2017), one of the most well-known pieces of software used in phonetics/phonology.
• Data mining/text processing: cleanNLP (Arnold 2017), koRpus (Michalke 2017), languageR (Baayen, 2011), tidytext (Silge and Robinson 2016), openNLP (Baldridge 2005), qdap (Rinker 2013). For the treatment and the exploration of written corpora, some of the functions that these packages offer are: automatic POS tagging, implementations of parsers (e.g. Stanford coreNLP and the SpaCy library implemented in cleanNLP), text trimming, mathematical modelling of vocabulary with LNRE models, automatic syllabification, measures of lexical diversity and readability…

¹ https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

2.2 Aspects of the current curriculum

The implementation of R-based modules in the BA and the MA has been taking place progressively. Here we detail the various courses and activities that are currently implemented in our curriculum.

Undergraduate module. We present R among other software used by linguists in a module centred on students' projects, following Language and Computers. This introductory module also aims at getting undergraduate students to be in-

… ability distribution, linear regression models, ANOVAs, linear discriminant analysis, principal component analysis…

Data sessions. As a follow-up to the Statistics seminar, PhD students can discuss their data during data sessions. In these sessions, our students have the opportunity to present their linguistic dataset to a statistician. Ideally, the student has already made some progress on a basic level and knows how to present the structure of the data and detail the kind of variables investigated. Together with a statistician, they explore the different methods that can be applied according to what the student's project requires.

Individual work. Because we are conscious that our curriculum cannot cover all the training in R that our students would need, some individual extra work is needed.
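
To make the workflow described in section 2.1 more concrete (concordance first, then counts, then a plot), the following is a minimal sketch in base R. The toy corpus, the function name kwic and its window parameter are assumptions introduced here for illustration only; they are not taken from the curriculum or from any of the packages cited above, which offer far richer functionality.

```r
# Minimal keyword-in-context (KWIC) sketch in base R.
# ASSUMPTIONS: the toy corpus, the name `kwic` and the `window` argument
# are hypothetical and serve only to illustrate the workflow.
corpus <- c(
  "The students use R to explore the corpus .",
  "R can be used as a concordancer .",
  "Once the data are extracted , the students plot the results ."
)

kwic <- function(texts, node, window = 3) {
  hits <- list()
  for (i in seq_along(texts)) {
    # Whitespace tokenisation: the texts are assumed to be pre-tokenised
    tokens <- strsplit(texts[i], " +")[[1]]
    # Case-insensitive match of the node word
    for (p in which(tolower(tokens) == tolower(node))) {
      left  <- tail(tokens[seq_len(p - 1)], window)   # up to `window` tokens before
      right <- head(tokens[-seq_len(p)], window)      # up to `window` tokens after
      hits[[length(hits) + 1]] <- data.frame(
        text  = i,
        left  = paste(left, collapse = " "),
        node  = tokens[p],
        right = paste(right, collapse = " "),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, hits)  # NULL when the node word never occurs
}

# Step 1: explore the corpus with a concordance
conc <- kwic(corpus, "students")
print(conc)

# Step 2: a simple quantitative summary (token frequencies)
freq <- sort(table(tolower(unlist(strsplit(corpus, " +")))), decreasing = TRUE)

# Step 3: plot the most frequent tokens (only in an interactive session)
if (interactive()) barplot(head(freq, 5), las = 2)
```

Keeping the concordance in a data frame means the same object can be passed on to whichever statistical or plotting functions a project requires, which is the flexibility argued for above.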