Language Exclusion in Computational Linguistics and Natural Language Processing
Total Page:16
File Type:pdf, Size:1020Kb
Hard Numbers: Language Exclusion in Computational Linguistics and Natural Language Processing Martin Benjamin Kamusi Project International Place de la Gare 12 C 1020 Renens, Switzerland [email protected] Abstract The intersection between computer science and human language occurs largely for English and a few dozen other languages with strong economic or political support. The supermajority of the world’s languages have extremely little digital presence, and little activity that can be forecast to change that status. However, such an assertion has remained impressionistic in the absence of data comparing the attention lavished on elite languages with that given to the rest of the world. This study seeks to give some numbers to the extent to which non-lucrative languages sit at the margins of language technology and computational research. Three datasets are explored that reveal current hiring and research activity at universities and corporations concerned with computational linguistics and natural language processing. The data supports the conclusion that most research activity and career opportunities focus on a few languages, while most languages have little or no current research and little possibility for the professional pursuit of their development. Keywords: under-resourced languages, computational linguistics, NLP, language technology 1.0. Introduction reveal almost nil coverage for the supermajority of languages. One could hunt for resources per This paper looks at technology for “under- language, and find a corpus here, a spell-checker resourced” languages by examining the amount of there, and a bunch of Wikipedia stub pages about career opportunities and research projects in the asteroids somewhere else1. In the end, though, a field. Two data sets were evaluated to provide hard spreadsheet with 7000 languages and all known numbers regarding the proportion of high level technologies would show a smattering of ticks for a research in areas related to computational long tail of languages, for example where a linguistics. The hypothesis for the research was that passionate developer created an Android app2 or high level pursuit of technological development for where a field linguist shared a dictionary on most of the world’s languages is not a widely Webonary,3 and a huge clustering of resources for a available career option. This hypothesis was fully small assortment of languages that could be guessed supported by the data. Without a significant number without looking at the data. Kornai’s impressive of people working in the field, resources for under- effort (2013) to quantify existing resources for resourced languages cannot be developed. The languages of the world found that 6,541 had no numbers in this study, which indicate where the field detectable live online presence. Following Scannell will be casting its gaze for years to come, give no (2013) and Gibson (2014), it is possible to find cause for optimism that the situation will improve. instances of usage of nearly 2000 languages within communication technologies such as Twitter, but People who work on excluded languages know from these are examples of technology as a vessel rather experience that most of the world’s languages than an avenue for development. Because data about remain outside of the technological sphere, but do the topic of research activity is not obviously not have numerical ways to demonstrate the extent available, we have been left to make impressionistic of the marginalization. In principle, one could look assertions about the paucity of work in the field. to the amount of software that is localized in each language, but that would involve getting access to This study examined three sources of data that thousands of programs, installing them, and provide numerical indications of the extent to which enumerating the available user interface languages – under-resourced languages are active within the an impossible task that one already knows would overall profession of language technology. The first 1 The Yoruba Wikipedia, https://yo.wikipedia.org, has Clicking the link labelled “Ojúewé àrìnàkò” from any more than 31,000 articles listed. However, most of those Yoruba Wikipedia page gives a random page from the contain bogus content, including thousands of pages like project, with a high probability of landing on an asteroid. https://yo.wikipedia.org/wiki/23006_Pazden that are bot- 2 https://mothertongues.org/ generated stubs containing the names of asteroids. 3 https://www.webonary.org/ dataset consists of all the jobs posted on Linguist Category B, and nearly 7000 units for Category C. List (LL) in 2017 that specify applied, Figure 2 was a speculation drawn about two years computational, or text/ corpus linguistics.4 The prior to the present study that posited the ratio of second dataset consists of all the papers and posters research invested in each category as a rough inverse presented at COLING 2016,5 the 26th International of the number of languages affected. This paper Conference on Computational Linguistics, examines the hypothesis implicit in Figure 2. The organized by the Association for Natural Language hard numbers in this study show that the scale shown Processing, in Osaka, Japan. The third consists of all for research activity between categories A and C is the papers and posters presented at ICLDC 2017, 5th about right. The representation underestimates the International Conference on Language level of activity for Category B languages, however Documentation & Conservation, in Honolulu, – mid-resourced languages, where Kilgarriff and Hawaii. None of these are perfect representations of Grefenstette’s 2003 observations about trends the state of the field, for reasons discussed below, toward increasing digital multilingualism hold true, but they give an overall up-to-the-moment should have a bigger box, with a gradient toward indication of the state of attention that under- “under-resourced” that is certainly reached around resourced languages receive among those active in 50. On the other hand, the data bears out that several the profession. languages that were cast as Category B – notably Chinese, Japanese, and Arabic – are benefiting from Spoiler alert: under-resourced languages receive significant professional attention, and could now be almost no attention in work related to computational on the borderline of Category A, which would linguistics or natural language processing (NLP). maintain the ratio between A and B closer to the Type Number/ Examples Characteristics A 4 languages (English, Massive investment, many existing digital resources, large monolingual and French, German, Spanish) aligned corpora, somewhat functional machine translation with other languages in the group, primary focus of language technology research and development.6 B About 25 languages (many Moderate to large investment and research, increasing digital resources, large official languages of the monolingual corpora with bilingual alignment to A languages (especially EU, Chinese, Japanese, English), rough machine translation to A languages (usually English), focus Russian, Arabic) of interest for EU and national funders C All the rest. Almost 7000 Zero to mediocre investment and research. Some languages like Swahili and languages, spoken by the major languages of India, with more than 100 million speakers, have active majority of the world's 7 research communities and rough machine translation, usually to English. A billion people. couple of thousand have some form of print dictionary, ranging from lists of a few hundred words to massive volumes with hundreds of pages. Most are 'embattled' – either close to extinction, or disfavored by policy or practice. Funding is usually sparse. Table 1: Language Categories initial depiction. While a discussion of the state of 2. How much less resourced are “less- Category B languages is outside of the scope of this resourced” languages? paper, as they enjoy a host of advantages that elevate them above any notion of “under-resourced”, it is Table 1 proposes a typology of languages, wherein worth noting that many are making strides that will languages in Category A are the ones that receive redound increasingly to their benefit in the years to high attention in NLP research, Category B come. languages receive moderate attention, and Category C languages receive little or no attention. Figure 1 shows those languages at exact scale, proportional to the total number of languages in each category: a square with 4 units for Category A, 25 units for 4 Records were laboriously reviewed by setting search 5 http://coling2016.anlp.jp/ criteria on the Linguist List jobs page, 6 The list of “Any language” treated by DeepL (English, https://linguistlist.org/jobs/search-job1.cfm. Linguist List German, French, Spanish, Italian, Dutch, and Polish) at keeps records dating back many more years, but http://www.deepl.com/translator could be a better procuring those in a practical format would require estimate, but discussion of levels of inclusiveness among imposing on their staff. Whether a single year’s data is better-resourced languages would be the subject of a completely representative is therefore an open question. different paper. Figure 1: Languages of each type as a proportion of world Figure 2: Languages of each type as a proportion of total (actual numbers) investment and research attention (hypothesized estimate) people. The inclusion of these languages may 2.1 Linguist List indicate a glimmer of recognition that