Building Representative Corpora from Illiterate Communities: A Review of Challenges and Mitigation Strategies for Developing Countries

Stephanie Hirmer¹, Alycia Leonard¹, Josephine Tumwesige², Costanza Conforti²,³

¹Energy and Power Group, University of Oxford
²Rural Senses Ltd.
³Language Technology Lab, University of Cambridge
Abstract

Most well-established data collection methods currently adopted in NLP depend on the assumption of speaker literacy. Consequently, the collected corpora largely fail to represent swathes of the global population, which tend to be some of the most vulnerable and marginalised people in society, and often live in rural developing areas. Such underrepresented groups are thus not only ignored when making modeling and system design decisions, but also prevented from benefiting from development outcomes achieved through data-driven NLP. This paper aims to address the under-representation of illiterate communities in NLP corpora: we identify potential biases and ethical issues that might arise when col-

areas where the bulk of the population lives (Roser and Ortiz-Ospina (2016), Figure 1). As a consequence, common data collection techniques, designed for use in HICs, fail to capture data from a vast portion of the population when applied to LICs. Such techniques include, for example, crowdsourcing (Packham, 2016), scraping social media (Le et al., 2016) or other websites (Roy et al., 2020), collecting articles from local newspapers (Marivate et al., 2020), or interviewing experts from international organizations (Friedman et al., 2017). While these techniques are important to easily build large corpora, they implicitly rely on the above-mentioned assumptions (i.e. internet access and literacy), and might result in demographic misrepresentation (Hovy and Spruit, 2016).