R-Based Strategies for DH in English Linguistics: a Case Study Nicolas Ballier, Paula Lissón

To cite this version:

Nicolas Ballier, Paula Lissón. R-based strategies for DH in English Linguistics: a case study. Teaching NLP for Digital Humanities (Teach4DH), co-located with GSCL 2017, Sep 2017, Berlin, Germany. pp. 1-10. hal-01587126

HAL Id: hal-01587126
https://hal.archives-ouvertes.fr/hal-01587126
Submitted on 13 Sep 2017

Nicolas Ballier
Université Paris Diderot, UFR Études Anglophones, CLILLAC-ARP (EA 3967)
nicolas.ballier@univ-paris-diderot.fr

Paula Lissón
Université Paris Diderot, UFR Études Anglophones, CLILLAC-ARP (EA 3967)

Abstract

This paper is a position statement advocating the implementation of the programming language R in a curriculum of English Linguistics. This is an illustration of a possible strategy for the requirements of Natural Language Processing (NLP) for Digital Humanities (DH) studies in an established curriculum. R plays the role of a Trojan Horse for NLP and statistics, while promoting the acquisition of a programming language. We report an overview of existing practices implemented in an MA and PhD programme at the University of Paris Diderot in recent years. We emphasize completed aspects of the curriculum and detail existing teaching strategies rather than work in progress, but our last section alludes to work still under way, such as getting PhD students to write their own R packages.

We describe our strategy, discuss better practices and teaching concepts, and present experiments in a curriculum. We express the needs of an initially limited NLP environment and provide directions for future DH curricular developments. We detail the challenges in teaching a non-CL audience, showing that some software suites can be integrated into a curriculum, and outlining how some specific R packages contribute to the acquisition of NLP-based techniques and foster awareness of the need for statistical modelling.

1 Introduction

This paper deals with the development of an R-based culture of DH for students of English Linguistics at the University of Paris Diderot. We describe some aspects of a curriculum (MA and PhD) that aims at taking advantage of the flexibility and the adaptability of this programming language for research in linguistics, in both quantitative and qualitative approaches.

Developing a culture based on the programming language R (R Core Team 2016) for NLP among MA and PhD students doing English Linguistics is no easy task. Broadly speaking, most of the students have little background in mathematics, statistics, or programming, and usually feel reluctant to study any of these disciplines. While most PhD students in English linguistics are former students with a Baccalauréat in Sciences, some MA students pride themselves on having radically opted out of Maths. However, we believe that students should be made aware of the growing use of statistical and NLP methods in linguistics and be able to interpret and implement these techniques.

We need to show our students how important the DH are for their research, enabling them to see that the use of NLP techniques provides them with a whole range of new possibilities for the treatment and the analysis of their data. In addition, with the growth of corpus linguistics, students often need to work with large corpora or huge databases of images, and standard tutorials and introductory books do not cover these needs (Arnold and Tilton 2015).

Preparing students to work with NLP methods and with command lines also means asking them to work with particular formats of data they need to get used to (e.g. tabular format, UTF-8, limited use of special characters unless necessary). However, this facilitates the interoperability of their data, as well as the replicability of their research.

The rest of the paper is organized as follows: section 2 describes the context of an MA in English Linguistics and explains why the culture is traditionally limited in NLP and DH in this kind of curriculum. It also details the strategy used to develop an ‘R-based culture’ among MA
and PhD students, taking advantage of its flexibility and adaptability to the various requirements of linguistic data. Section 3 explains how collaboration with the Maths department has enabled the emergence of an R-based common culture for statistics to be taught to mostly ‘mathless’ students. Section 4 discusses the various strategies to teach R in recent textbooks of quantitative linguistics (from Baayen, 2008 to Levshina, 2015). Several teaching styles and aims are discussed. Section 5 discusses DH in a wider perspective of Machine Learning (ML) based analyses and shows the benefits of R, reporting on the possibility of an interdisciplinary bridge for programs in data science. Section 6 proposes an epistemological interpretation of R as a possible medium for the 3rd revolution of grammatisation, expanding on Sylvain Auroux’s notion. The conclusion reflects on the limitations of our strategy, when compared with other programming languages. We try to assess the relevance of this Esperanto-like, high-level programming language for digital humanities.

2 Developing an R-based culture for NLP, DH and beyond

2.1 Why is it necessary?

In France, the study of English Linguistics in English departments has traditionally been linked to the competitive exams to become teachers of English (agrégation), so that linguistics is only a sub-domain in relation to other domains of English studies such as translation and literature. As a consequence, the core of this curriculum can hardly be dedicated to corpus linguistics. There is also another structural (devastating) side effect for linguistic research: since there is not a single trace of NLP-driven questions for the agrégation, there is nothing in the curriculum of English studies about these issues (contrary to what the introduction of English phonology somehow triggered after 2000 in the agrégation and in the undergraduate programmes that are meant to prepare for this competitive exam). Since corpus linguistics and quantitative methods have become essential in Linguistics curricula in most European universities, the gap in France between the agrégation and linguistic research is widening.

In corpus linguistics, the treatment of large corpora and complex datasets with many variables is getting increasingly frequent. However, the application of statistical models, as well as the use of NLP techniques such as parsers, Part-of-Speech (POS) taggers or classifiers, among many other possibilities, requires some familiarity with at least one programming language. Although most programs designed for corpus linguistics, such as AntConc (Anthony 2011), Cesax (Komen 2011), Sketch Engine (Kilgarriff et al. 2004), Wordsmith (Scott 2016), or, within the French lexicometric tradition, Le Trameur (Fleury and Zimina 2014) or Le Gromoteur (Gerdes 2014), are presented in a graphical and more or less user-friendly interface (GUI) with built-in functions, the command line offers much more flexibility in terms of exploration and modelling of the data. For instance, R can first be used as a concordancer (see S. Gries 2009) in order to explore a given corpus, and then, once the desired structures have been extracted, they can also be treated in R in terms of statistical analysis and/or modelling. Finally, a visual representation of the results of the analysis can be easily plotted.

Apart from all the statistical packages that can be used for a quantitative analysis of data, there are currently more than 50 packages for NLP available in the CRAN repository (https://cran.r-project.org/web/views/NaturalLanguageProcessing.html), some of them being particularly useful for linguists. Because our students have research questions on spoken or written data, they can find several R packages to suit their needs if they are taught how to use them. These are some of the specific packages our students are working with:

• Phonetics/Phonology. Students are presented with normalization issues and analyses of vowel systems drawing from packages such as phonTools (Barreda 2015), vowels (Kendall and Thomas 2010), emuR (Winkelmann, Jaensch, and Harrington 2017) and phonR (McCloy 2016), which allows for the treatment of data extracted with Praat (Boersma and Weenink, 2017), one of the best-known pieces of software used in phonetics/phonology.
• Data mining/text processing: cleanNLP (Arnold 2017), koRpus (Michalke 2017), languageR (Baayen, 2011), tidytext (Silge and Robinson 2016), openNLP (Baldridge 2005), qdap (Rinker 2013). For the treatment and the exploration of written corpora, some of the functions that these packages offer are: automatic POS tagging, implementations of parsers (e.g. Stanford coreNLP and the SpaCy library implemented in cleanNLP),
text trimming, mathematical modelling of vocabulary with LNRE models, automatic syllabification, measures of lexical diversity and readability…

2.2. Aspects of the current curriculum

The implementation of R-based modules in the BA and the MA has been taking place progressively. Here we detail the various courses and activities that are currently implemented in our curriculum.

Undergraduate module. We present R among other software used by linguists in a module centred on students’ projects, following Language and Computers. This introductory module also aims at getting undergraduate students interested in pursuing our MA research program.

MA seminars. Although during the first year of the MA efforts are made to encourage students to start using R as a way to process their corpus-based data, it is during the second year of the MA that a seminar on R is offered. This seminar consists of 6 sessions of 120 minutes covering the use of R for phonetics and phonology, and 6 sessions of 120 minutes on the use of R for textual data. Over the two years of the MA, the introduction to R-based packages in the curriculum takes the following form, in the different relevant modules (cf. Table 1):

Table 1: R in the MA modules

Year  First semester                                   Second semester
M1    Corpus methodology: descriptive statistics       Phonetic Analysis 1 (normalisation, plots, visualisation)
M2    Language and Variation (inferential statistics)  Computational phonology (classifiers) / Phonetic Analysis 2

PhD seminar. Currently, our graduate school offers 12 sessions of 90 min covering advanced statistical techniques for linguists. A basic command of R is presupposed, especially since MA seminars already cover the use of R for beginners. This seminar mainly focuses on the applicability of statistical methods to the different types of data linguists deal with, but also on the mathematical formulae behind all these methods. Empirical cases (already published by linguists) are used as examples. This seminar covers probability distribution, linear regression models, ANOVAs, linear discriminant analysis, principal component analysis…

Data sessions. As a follow-up to the Statistics seminar, PhD students can discuss their data during data sessions. In these sessions, our students have the opportunity to present their linguistic dataset to a statistician. Ideally, the student has already made some progress on a basic level and knows how to present the structure of the data and detail the kind of variables investigated. Together with a statistician, they explore all the different methods that can be applied according to what the student’s project requires.

Individual work. Because we are conscious that our curriculum cannot cover all the training in R that our students would need, some individual extra work is needed. In that sense, specific manuals for linguists such as Analyzing Linguistic Data: a practical introduction to statistics using R (Baayen 2008), Quantitative Corpus Linguistics with R (S. Gries 2009), Quantitative Methods in Linguistics (Johnson 2011), Statistics for Linguistics with R (S. T. Gries 2013), How to do Linguistics with R (Levshina 2015) and Humanities Data in R (Arnold and Tilton 2015) are recommended, knowing that they have different pre-requisites in mathematics. Although not all these manuals have the same approach with regard to technicality and the progression path, all of them offer a comprehensive view of what linguists can do with R. For a more detailed summary and comparison between some of these manuals, see Ballier (forthcoming).

Bootcamps. Bootcamps are regularly organized, both for complete beginners and for intermediate users who seek to improve their abilities in R and to discover new techniques. Generally, one basic bootcamp is proposed at the beginning of the year for all students who want to participate, and a second bootcamp is later offered for more advanced learners. Bootcamps are conceived as an intensive way to approach particular issues (e.g. regression models, visualisation, classifiers…) and consist of a week of 25-30 hours of instruction. Bootcamps are normally taught with examples of datasets taken from published papers or on-going research, but students may also explore their own data if applicable. These official bootcamps are normally taught by visiting scholars, such as Stefan Gries or Taylor Arnold.

3 R-based strategies for English studies

In the previous section, we presented the R courses and activities that are currently taking place in our programme, but we think that this is insufficient, especially during the MA years. In this section, we present all the aspects that should be taken into account when considering an R-based curriculum for English Linguistics studies. We aim at establishing course modules that present statistical and NLP techniques currently used in linguistic research. We know that getting to use command lines and to work with a programming language takes a lot of time for non-specialists, but it also offers a lot of new possibilities for linguists. This section details some strategies to ease R’s steep learning curve.

3.1 Mathematics for linguistics and interdisciplinarity

One key feature of our strategy is the collaboration between statisticians, mathematicians, programmers, and humanists. Although we want our students to be independent users of R and to understand what they do when they use and manipulate data with NLP techniques, we do not expect them to become professional programmers. However, we do expect linguists to become familiar enough with NLP techniques, programming and statistics, so that they can efficiently identify what they need to advance in their research in terms of technical treatment of their data. This interaction between experts from different fields promotes cross-fertilization between the domains, and it also allows mathematicians and programmers involved in DH and in NLP to understand linguists’ needs.

However, one of the issues we face is the creation of a reasonable time schedule for a progressive acquisition of all these techniques without scaring students, knowing that most of them are rather averse to learning these methods. Another issue related to the pedagogy of these methods is the amount of maths that should be taught so that students get to know what they are doing with their data. On the one hand, if all the underlying mathematical methods are explained to students at the same time as they are introduced to the technique, there is a risk that they do not understand the procedure and refuse to use it. On the other hand, if students do not receive the mathematical explanations of the method, they will never fully understand what they are doing with their data. The risk is that students may run their scripts too blindly.

3.2 Pedagogy of datasets and scripts

Datasets as such can be understood as a common aspect of the methodology that enables students to understand the logic behind all the possible statistical tests and NLP techniques. It is, in a way, a dialectic form of the transferability of knowledge: understanding the mathematics through something other than one’s specialty. Students then need to learn to adapt the code to their own needs (dataset and research questions), as well as become familiar with importing and exporting methods from the treatment of other datasets.

Part of the benefits of the R strategy can be cashed in with the on-line forums and scientific blogs. These rely heavily on a few datasets when explaining complex methods, so it makes sense to teach these recurrent datasets (mtcars, Titanic, iris) to our students so that they become more autonomous in understanding on-line explanations. The classical iris dataset, for example, is notably used for classification problems or dimensionality reduction. Conversely, typical analyses based on linguistic data such as the Bresnan and Nikitina (2003) dataset (Johnson, 2008 [2011], Baayen, 2008) or even Jane Austen’s novels (Arnold & Tilton, 2017) may also help by showing students the multiple applications of R from a linguistic perspective.

Graphs always help to motivate students. Making datasets more attractive with visualisation techniques that catch students’ attention, not only for the statistical procedure itself, but also for how its results can later be visualized and presented in talks and vivas, is definitely something to take into account.

Finally, students may also rely on the ease and practicality of scripts in order to reproduce, compare and share results. With RStudio, students can actually send their whole project (including the environment, datasets, graphs, objects created by them, and code) to their supervisors, for example.

The scripts actually represent a new model of exercise and teaching methodology: scripts replace the classical textbook, showing detailed and commented functions. Examples and datasets need to be previously adapted to this format. It is an example-based methodology that promotes autonomy: in many cases, the script is not only the example, but also includes the exercises that need to be executed. The difficulty of these exercises increases progressively. This is, in a way, the creation of a new didactic method: most textbooks rely on a specific R package, R
4 scripts and companion website. The companion their personality best. So far, we have identified website to Levshina’s textbook is a good case in four main teaching approaches: point2. a) Scaring literary/linguist students for their 3.3 Package-driven pedagogy greatest benefit. Teaching the general benefit An alternative method for students who seek to of learning R as an interface in itself. This is discover the usefulness of some specific R pack- mostly relevant for future PhD students. ages to linguistic research is to present them an Among the MA students, the strategy con- MA research project related to the functionalities sists in insisting on the advantages of the ex- of one particular package. For instance, the isting libraries, and the developer’s commu- {koRpus} package implements 14 metrics of nity at large. More specific packages lexical diversity and 35 metrics of readability {ZipfR} are presented and functions are de- that can be automatically computed. Computing tailed for their interest for linguists but also each one of these metrics individually for a cor- in relation to a research community repre- pus made up of hundreds of texts results time- senting a (professional) lifetime investment consuming and unnecessary. Therefore, the stu- with possible job prospects outside the lin- dent had to learn how to use R and to explore the guistic community proper, provided time is functionalities of the package: starting from the spent on learning enough statistics and other POS tagging with TreeTagger (Schmid 1995), interdisciplinary skills. In other words, to the interpretation of the (huge) numerical result- make the most of their potential role as inter- ing dataset with all the results of the formulae, face in the DH, linguists of English are ad- the correlations of the various metrics, ANOVAs vised to invest at the same in statistics, not to to compare results between different groups… say Data Science, for longer term benefits. 
Getting students to work with a specific interest b) R for statistics in disguise: this approach for an R package is another way to foster acqui- consists in presenting techniques from a sition of the package itself but also of R. mathematical point of view; showing, detailing and explaining formulae. Although it of- 3.4 We love you just the way you ‘R’: A fers the possibility to really understand what typology of teaching styles is going on with the data, students normally R is known for its steep learning curve, counter- get scared and lose interest on quantitative intuitive to real programmers. How is it taught in methods because they consider them to be the reference textbooks? In general, most text- too difficult. This methodology might be par- books for linguistics with R do not assume any ticularly useful for intermediate and ad- prior experience with any programming lan- vanced students, but not so much for begin- guage, or any knowledge in statistics. They start ners. by giving basic notions on descriptive statistics, c) Motivating students. Contrary to the previ- and complexity increases progressively. Howev- ous methodology, here students are intro- er, not all books present the same degree of duced to the various techniques without con- mathematical complexity, and not all books de- sidering the on-going mathematical process- tail the mathematical processes ongoing behind es. Although it motivates students because the various commands and functions used in R. they can see ‘quick results’ and the applica- Therefore, the choice of one book or another bility to their data, it does not cover a deep may depend on the interest of the students, their analysis of the processes the data is going background in maths/programming, and their through. The emphasis is on learning simple own attitude towards NLP and statistics. 
When it code, for instance, heavily relying on the re- comes to courses, the instructor faces the same current functions of the tidyverse collection. type of issues. Not all students follow the same Displaying fancy visualisations with simple progression curve, and this is mainly determined code also helps catching students’ attention. by their own background and profile. These ap- d) Intermediate point: teaching how statistics proaches are not mutually exclusive. In an ideal are reported in the journals. This second ap- curriculum, students should be confronted to proach gives students the basic maths to un- several approaches and follow whatever suits derstand descriptive and inferential statistics, as well as reported results from linguistic journals. It prepares students to understand 2 https://benjamins.com/sites/z.195/ what they will be reading in quantitative lin-

5 guistics. However, it does not get into much approach, here is a bottom-up approach that tries detail when it comes to more complex tech- to address what should be the real order of the niques. day: what does it take to turn a motivated student from an English department into a DH specialist Lastly, one should not underestimate the im- (if not a data scientist)? portance of online forums. They are complemen- DH is an emerging field where curricula still tary to any teaching methodology. R has a big need to be designed. Felicitous encounters foster community of users that answer quickly on fo- cross-disciplinary achievements, but how do we rums, mailing lists of R packages for doubts and enforce these developments in our curricula? bugs are usually very active, pages such as There is a historical responsibility in curriculum Stackoverflow 3 , r-bloggers 4 , and many other design to face the challenges of DH. More than a websites also offer a comprehensive amount of political agenda, there is an epistemological tran- help and code that is relatively easy to under- sition going on in terms of required skills (and stand. Being part of the R community is being knowledge) to process data and knowledge. Ad- part of a community of experts that speak the vocating a programming language with a strong same language, in spite of their fields of exper- background and tradition in maths rather than tise, and students may benefit from being part of mere NLP modules is also a unique possibility to this wider research community. bootstrap Data Science in Humanities curriculum. Modules in Maths would be required, but 4 The bigger picture: NLP, DH and the having acquired an R culture may ease this tran- quantitative turn. sition to Maths. 
What follows builds on existing modules to outline possible developments: here A set of analytical practices is being established we sum up putative modules around R that may (while being at the same time constantly revised) help students of English to make a transition to- in the industry and in the research community, wards a postgraduate programme in data mining: but the field is evolving so quickly that this knowledge has not made it to the modules taught • First semester: mathematical bases of data in Humanities yet. At best, competing textbooks mining. Pedagogical packages related to a about How to do are being produced, but they particular manual (e.g. {languageR, car, car- reflect on individual solutions more than on aca- et}). Functions and Datasets to accompany demic curricula. NLP teachers have introduced An R Companion to Applied Regression (Fox modules about Machine Learning, but the curric- et al. 2014) and Implementing reproducible ulum in most NLP departments is still being re- research (Stodden, Leisch, and Peng 2014). vised to take into account the constant changes in • Second semester: For data mining, Data the perimeter of existing technologies. mining and analysis: fundamental concepts One should distinguish between sets of tools and algorithms (Zaki, Meira Jr, and Meira and methods, text mining, data mining, and data 2014). For statistical modelling, Applied science. These labels interact with NLP as such, predictive modelling (Kuhn and Johnson but need to be nuanced. One of the reasons for 2013). the complexity of contemporary curriculum design for Humanities is that NLP in this frame- 5 Philosophical Implications for R as a work is no more a means in itself, or an autono- medium for the 3rd revolution of mous curriculum in the MA degree we describe. 
grammatisation For linguists from English faculties, the NLP language game can be caricatured as getting the This section draws the bigger picture that best F-score you can for a given task/dataset in a explains why we promote the teaching of R for challenge. NLP as a field may well become NLP DH. The main revolution in this kind of faculties as set of tools in DH (or a step in processing consists in convincing students to use the com- chains). mand line. RStudio is an interface that reassures While one may argue that academic political students because it has windows and a GUI, it decisions are mostly designed by means of con- enables them to run scripts and learn how to flicting calls for projects in an indirect top-down comment script. It is a good compromise with a console and loading functions are actionable with 3 https://stackoverflow.com the mouse as with any GUI. R commander (and 4 similar plugins as the one proposed with the https://www.r-bloggers.com

{Rattle} package) were previous attempts at simplifying R (not to mention RExcel and other Excel-based interfaces) with the same click and play approach. We would like to suggest that as a programming language and as a programme giving access to thousands of packages, R has a special status for the DH turn under way (not to mention the fact that any epistemological angle on NLP tools should consider the programming side).

The DH turn is closely related to the quantitative turn and we consider that this piece of software takes part in what we call, after Sylvain Auroux, the third revolution of ‘grammatisation’. According to Auroux (1994), the first revolution of ‘grammatisation’ was related to writing. With Gutenberg, speech enjoys some specific codification. Textual structures (paragraphs, books) play the role of technological innovation that preconditions linguistic analysis.

The second technological invention that revolutionized linguistics according to Auroux was the invention of dictionaries and grammars, especially in 18th century Europe. Again, codification of the language and standardisation of spelling triggered some reflection about lemmatisation. Linguistic data was processed according to a certain format (e.g. dictionary entries).

Auroux (1994) suggests that the emergence of corpora and NLP techniques boils down to the emergence of a third revolution of grammatisation, which is still under way. We see R as a possible interface between corpus linguistics and the quantitative turn, our free access to statistical libraries and NLP tools. One of the benefits is that R can increasingly be used as a concordancer, mining corpora, especially with the tidyverse collection of packages. Adopting R facilitates a roadmap to data science, since corpus extractions end up as a dataset. The current ‘tidy data’ (Wickham 2014) philosophy favours the structure of the data frame where ‘each type of observational unit is a table’.

We believe that the DH are a crossroads for the cross-fertilization of several disciplines: linguistics (where NLP is crucial), statistics, and the emerging domain of data science. In this respect, R is an excellent candidate as a tool to promote real pluridisciplinarity. Conceptions and boundaries vary as to the content of machine learning, data mining, and data science, but the recent evolutions within the tidyverse collection of R packages facilitate text mining. The common denominator to this emerging field is text mining: ‘Text mining is an interdisciplinary field which involves modelling unstructured data to extract information and knowledge, leveraging numerous statistical, machine learning, and computational linguistic techniques.’ R packages such as {koRpus} (Michalke, 2017) and {tm} (Feinerer and Hornik 2015) make text mining with R much easier. We could even say that they favour hybrid uses between concordancers and the standard NLP blind processing of data.

Again, the uses of R are becoming increasingly user-friendly. Recent publications have heralded the emergence of accessible approaches to text mining, where NLP is in disguise. For example, Jockers (2014) distinguishes between micro-, meso- and macro-analysis. Microanalysis deals with frequency analysis and correlation between the presence or absence of a word in a text with randomization techniques. Mesoanalysis consists in detailing lexical complexity, hapax analysis and teaching how to build KWIC-type concordancing with R. Macroanalysis presents unsupervised clustering and some initiation to supervised learning models such as support vector machines.

This more recent approach allows for more flexible analyses where individual texts in a corpus may be taken into account more flexibly than with other command line or corpus-based methods. The interaction with visualisation techniques and the flexibility offered by the tidyverse collection facilitate the focus on the mining of texts. As the 2017 rOpenSci Text Workshop puts it (http://textworkshop17.ropensci.org/), working with these recent R packages encompasses ‘text analysis, natural language processing, and other aspects of text mining and text data handling’. Because it can be used as an interface for these practical and theoretical issues, R is a good candidate for the development of hybrid approaches to text mining, mixing concordance-based approaches and more ambitious analyses of metadata and quantitative (textometric) aspects of texts in a single environment.

Data visualisation and all of the flexibility of the R environment place R in a very favourable situation as a specific tool for the third revolution of grammatisation that is taking place around text mining. This can be seen with the flowcharts describing the interaction of R packages and tidyverse functions for topic modelling in Silge and Robinson (2016), or the R open science efforts for the interoperability of formats in some
packages (the text interchange format package, {tif}⁶).

⁶ https://github.com/ropensci/tif

6 Conclusion: limits and difficulties

The conclusion reflects on the limitations of our curriculum and achievements as well as the drawbacks of our choice of R. The first part discusses actions that are still under way. The second part summarises some of the arguments against R.

Alternative strategies still to be tested include reverse teaching for R packages: sessions where students would have to present an R package and what can be done with it, which is a way to get them acquainted both with the code and with the bigger picture (the why and the what for).

Another strategy to be tested consists in supervising an MA in which every step involves specific coding in R, to the point of designing the requirements of an R package along the MA, each milestone of the MA corresponding to an elaborate R function which needs to be implemented to process the data. Designing your own R package centred around your research question is an option, more likely to be realised at the end of a PhD. We are encouraging PhD students reaching the end of their thesis to design their own packages, encapsulating their code and partial datasets, and to join the GitHub community in order to bypass the heavier requirements of the CRAN repository and propose their packages as prototypes or proofs of concept through the devtools package.

What would make sense in a pluridisciplinary university would be to teach a general introductory course resembling a seminar centred on the versatility of R, while teaching core linguistic concepts to non-specialists and presenting useful R packages for linguistically related research questions. The challenge is to teach basic coding, linguistic notions and R packages in lecture halls. Nevertheless, a module like “Language, Texts and Data mining: An R-based introduction to Digital Humanities” should be offered to undergraduates to promote DH.

Suggesting that R should be essential to students contributes to developing a coding culture and fosters on-line learning (as stackoverflow is your friend), and it also makes sense in terms of lifelong learning. Our MA students get the extra benefit of an initial training in data mining, however basic it may sound to a professional data scientist. The essential roadmap to emancipation still has to be designed. We are not trying to turn our students of English linguistics into programmers, but we want our PhD laureates to be as proficient as can be. Replicating and adapting a script to the needs of your data is one thing; being an expert is quite another.

As to the limitations of R, some of them are inherent to the language, while others can be related to misuses of R. The first limitation is that a programming language is not NLP as such. Becoming acquainted with some logic of packages gives access to specific resources, but with more limitations as to the languages under scrutiny than with Python-based tools.

Regarding the internal limitations of the programming language, known problems with R include issues related to floating-point calculations (which seem to play a role in the R word2vec implementation of the word embedding algorithms), a quirky syntax, issues with loops and the way everything is stored in memory. For specific data processing (very often, phonetic data), Matlab libraries and scripts have been developed, so that R is superseded in these areas, not to mention the fact that the average documentation is usually more complete; but this service comes at a price, whereas R is free.

Last, we would like to report on some issues that have been encountered by students. On top of the classic mismanagement of code for models or serious issues with data coercion in R, we would like to report a typology of attitudes which may reflect poorly on some infelicitous uses of R. These attitudes are mainly related to what one could call ‘the encyclopaedic ignorance of the self-taught linguists’: becoming experts at secondary details but missing the basic maths. This means being able to execute relatively complex methods without knowing exactly what the model does to the data. Another variant is linked to wanting to know everything about the R code without considering the big picture, i.e. the underlying mathematical model required to address the data, practising some sort of code-induced short-sightedness. It may well be the case that this package-based approach to R distorts the representation of a programming language. Full empowerment of students, leading to the possibility of writing a programme, should be the order of the day, whereas we probably endorse some empowerment limited to existing packages and scripts in a kind of problem-solving-based philosophy. Clearly, learning another programming

language would help for this kind of learning profile.

Promoting the teaching of R is an important aspect of pluridisciplinarity (and an easy way to start building common background with the maths department) but should be seen as the first step. Beginning with R, students may move to Python (for example using Jupyter or Anaconda) and then, for example, use the scikit-learn machine learning library in Python (Pedregosa et al. 2011).

The challenges posed by this new configuration of knowledge are mostly unheard of. Most learning paths for the different intersecting sub-disciplines are still uncharted territories. For this complex language game ahead of us, linguists may lack computational linguistics, but other scientific partners will need more linguistics for true interdisciplinarity. Students of

References

Laurence Anthony. 2011. AntConc (Version 3.2.2) [Computer Software]. Tokyo, Japan: Waseda University.
Taylor Arnold. 2017. ‘A Tidy Data Model for Natural Language Processing Using cleanNLP’. arXiv Preprint arXiv:1703.09570.
Taylor Arnold, and Lauren Tilton. 2015. Humanities Data in R. Springer.
Sylvain Auroux. 1994. La Révolution Technologique de La Grammatisation. Introduction À L’histoire Des Sciences Du Langage. Liège: Mardaga.
Harald R. Baayen. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.
Jason Baldridge. 2005. ‘The Opennlp Project’. URL: http://opennlp.apache.org/index
Nicolas Ballier. forthcoming. ‘R, Pour Un Écosystème Du Traitement Des Données? L’exemple de La Linguistique.’ In Données, Métadonnées Des Corpus et Catalogage Des Objets En Sciences Humaines et Sociales, edited by Ph Caron. Presses universitaires de Rennes.
Santiago Barreda. 2015. ‘phonTools: Functions for Phonetics in R’. R Package Version 0.2-2.1.
Paul Boersma, and David Weenink. 2017. Praat (Version 6.0.29) [Computer Software].
Joan Bresnan, and Tatiana Nikitina. 2003. ‘The Gradience of the Dative Alternation.’ Stanford University. Retrieved from http://www-lfg.stanford.edu/bresnan/download.
Ingo Feinerer, and Kurt Hornik. 2015. Tm: Text Mining Package (version 0.6-6). https://CRAN.R-project.org/package=tm.
Serge Fleury, and Maria Zimina. 2014. ‘Trameur: A Framework for Annotated Text Corpora Exploration.’ In COLING (demos), 57–61.
John Fox, Sanford Weisberg, Daniel Adler, Douglas Bates, Gabriel Baud-Bovy, Steve Ellison, David Firth, Michael Friendly, Gregor Gorjanc, and Spencer Graves. 2014. ‘Companion to Applied Regression’. R Package Version 2.0-20.
Kim Gerdes. 2014. ‘Corpus Collection and Analysis for the Linguistic Layman: The Gromoteur’. In Proceedings of the JADT.
Stefan T. Gries. 2009. Quantitative Corpus Linguistics with R: A Practical Introduction. New York-London: Routledge.
Stefan T. Gries. 2013. Statistics for Linguistics with R: A Practical Introduction. Berlin: Walter de Gruyter.
Matthew L. Jockers. 2014. Text Analysis with R for Students of Literature. Springer.
Keith Johnson. 2011. Quantitative Methods in Linguistics. Chichester, UK: John Wiley & Sons.
Tyler Kendall, and Erik R Thomas. 2010. ‘Vowels: Vowel Manipulation, Normalization, and Plotting in R. R Package, Version 1.1’. Software Resource: http://ncslaap.lib.ncsu.edu/tools/norm/.
Adam Kilgarriff, Pavel Rychly, Pavel Smrz, and David Tugwell. 2004. ‘The Sketch Engine’. In Proceedings of Euralex, 105–16.
Erwin R. Komen. 2011. ‘Cesax: Coreference Editor for Syntactically Annotated XML Corpora’. Reference Manual. Nijmegen, Netherlands: Radboud University Nijmegen.
Max Kuhn, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 810. Springer.
Natalia Levshina. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. John Benjamins Publishing Company.
Daniel R. McCloy. 2016. phonR: Tools for Phoneticians and Phonologists (version 1.0-7). https://CRAN.R-project.org/package=phonR.
Meik Michalke. 2017. Package koRpus: An R Package for Text Analysis (version 0.10-2). http://reaktanz.de/?c=hacking&s=koRpus.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, and Vincent Dubourg. 2011. ‘Scikit-Learn: Machine Learning in Python’. Journal of Machine Learning Research 12 (Oct): 2825–30.
R Core Team. 2016. R: A Language and Environment for Statistical Computing (version 3.3.1 (2016-06-21)). Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Tyler W. Rinker. 2013. Qdap: Quantitative Discourse Analysis Package (version 2.1.0). University at Buffalo/SUNY, Buffalo, New York.

Helmut Schmid. 1995. ‘Treetagger: A Language Independent Part-of-Speech Tagger’. Institut Für Maschinelle Sprachverarbeitung, Universität Stuttgart.
Mike Scott. 2016. WordSmith Tools (version 7). Stroud: Lexical Analysis Software.
Julia Silge, and David Robinson. 2016. Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.
Julia Silge, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media, Inc.
Victoria Stodden, Friedrich Leisch, and Roger D. Peng. 2014. Implementing Reproducible Research. CRC Press.
Hadley Wickham. 2014. ‘Tidy Data’. Journal of Statistical Software 59 (10): 1–23.
Raphael Winkelmann, Klaus Jaensch, and Jonathan Harrington. 2017. emuR: Main Package of the EMU Speech Database Management System. R Package Version 0.2.3. https://CRAN.R-project.org/package=emuR.
Mohammed J. Zaki, and Wagner Meira. 2014. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press.