Deep Internationalization for Gboard, the Google Keyboard

Writing Across the World’s Languages: Deep Internationalization for Gboard, the Google Keyboard

Daan van Esch,∗ Elnaz Sarbar, Tamar Lucassen, Jeremy O’Brien, Theresa Breiner, Manasa Prasad, Evan Crew, Chieu Nguyen, Françoise Beaufays
Google
Mountain View, CA, USA
November 2019

∗ Correspondence about this technical report can be sent to [email protected]. For product feedback or feature requests, please visit the Play Store page for Gboard at https://goo.gle/gboard.

Abstract

This technical report describes our deep internationalization program for Gboard, the Google Keyboard. Today, Gboard supports 900+ language varieties across 70+ writing systems, and this report describes how and why we have been adding support for hundreds of language varieties from around the globe. Many languages of the world are increasingly used in writing on an everyday basis, and we describe the trends we see. We cover technological and logistical challenges in scaling up a language technology product like Gboard to hundreds of language varieties, and describe how we built systems and processes to operate at scale. Finally, we summarize the key take-aways from user studies we ran with speakers of hundreds of languages from around the world.

1 Introduction

Our world has a tremendous wealth of linguistic diversity, with thousands of languages spoken around the globe every day. Historically, most of the world’s languages have by and large been used only in spoken face-to-face conversations, with very little writing taking place in the majority of languages. But as more of the world comes online, many language varieties that have mostly been limited to spoken usage in the past are now being used increasingly in writing to communicate using online message boards, chat apps, and social media, as described by e.g. Kral (2010), Osborn (2010), Kral (2012), Jones and Uribe-Jongbloed (2012), Pischloeger (2014), Nguyen et al. (2015), Keegan et al. (2015), Cru (2015), Lillehaugen (2016), Lackaff and Moner (2016), Jongbloed-Faber et al. (2017), Jany (2018), Soria et al. (2018), McMonagle et al. (2019), Eberhard and Mangulamas (2019), and McMonagle (2019).

In general, the trend seems to be that people want to communicate in informal environments like chat apps and social media in the same kind of language they would normally speak in face-to-face conversations. And since chat apps and social media are usually (but not always) used to communicate through text, many more of the world’s languages are now being written regularly by their users; perhaps only in informal contexts for now, but the trend is unmistakably moving in the direction of more languages being written on an increasingly regular basis. Regular use of smartphone applications for informal communications appears to be amplifying the grassroots-literacy trends observed in works such as Blommaert (2008).

In this technical report, we describe how we have been bringing support for hundreds of such language varieties to Gboard, Google’s smartphone keyboard for the Android operating system¹, in order to help smartphone users around the world communicate and share knowledge in their preferred language(s). Gboard supports more than 900 language varieties today. It is installed out-of-the-box on many Android smartphones, and for most other Android smartphones, the application can be downloaded from the Google Play Store. Overall, it has more than 1 billion installs worldwide. As this report will show, the hugely diverse pool of people using smartphone apps like Gboard means that language technology now needs to support many more language varieties than has historically been the case.

The focus of this report is not on the technical implementation of our keyboard application, which is described in Ouyang et al. (2017) and other technical papers.
Rather, this report focuses on the implications of the complex interactions between today’s global linguistic usage trends and language technology products such as keyboards. In short, the confluence of widely accessible technology and informal written communication platforms has led to demand for many more language varieties to be supported in applications such as smartphone keyboards. This report describes in some detail technical and non-technical challenges that language technology product development teams face when scaling up to hundreds of language varieties, and the solutions we invented along the way.²

First, we’ll describe in some more detail the sociolinguistic background against which many of the world’s language varieties are now increasingly written, as well as some of the technological challenges encountered by language communities when doing so. Users have generally responded to these challenges by inventing various work-arounds, which we’ll catalog briefly. Then, we’ll describe how we have been going about building support for many more language varieties in our smartphone keyboard application. Finally, we will present an analysis of some of the usage trends we are seeing, and provide an overview of future challenges to be solved.

¹ Gboard is also available on the iOS operating system, supporting a subset of the language varieties available on Android.
² Throughout this report, ISO 639 language codes are formatted like en or nan, where two-letter language codes are used whenever available. Four-letter codes like Latn are ISO 15924 script codes. Character names from the Unicode standard are formatted like latin small letter a.

2 Background

2.1 Languages, Writing, and Technology

Historically, writing was commonplace only for language varieties that were used in formal written publications, like books, newspapers, and religious materials.
Today, however, many language communities have access to chat apps and social media, which are generally informal environments. Most communication in chat apps and on social media happens in writing, but language usage is more similar to the patterns that would historically have been restricted to spoken usage (McCulloch, 2019). Along these lines, to communicate in natural ways within this informal yet mostly written context, many speakers across the world have picked up writing in their own language varieties, even if these language varieties were rarely written historically.

In general, however, support for these language varieties without a long-standing widespread written tradition remains rare within language technology products, despite early efforts to address this problem, as described in e.g. Paterson (2015). This means there are significant opportunities for language technology products to make a positive difference for their users by adding support for many more language varieties around the world. To cite just one example, the Digital Language Diversity Project, which studied a number of European regional languages, calls out “technological barriers, such as the unavailability of a specific keyboard or spell checkers that would ease the writing” as one of the main problems facing the language communities they studied in the use of their languages online (Soria et al., 2018).

Typically, chat apps and social media are accessed using smartphones. Text input on these devices is generally facilitated by a virtual keyboard application, displaying a keyboard layout (such as QWERTY or AZERTY) on-screen and using the touchscreen capability to detect taps or gestures. Because these screens are small, machine-learning language technologies like predictive text and auto-correction can help make input faster and more accurate (Fowler et al., 2015; Ouyang et al., 2017).
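To make the terms predictive text and auto-correction concrete, here is a minimal Python sketch. It is not a description of Gboard, whose implementation this report leaves to Ouyang et al. (2017); it is a toy under stated assumptions: a word-frequency lexicon built from a small invented corpus, prefix matching for prediction, and Levenshtein edit distance for correction. All names (TOY_CORPUS, build_lexicon, predict, autocorrect) are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical toy corpus, invented for this illustration only.
TOY_CORPUS = "the keyboard helps the user type the word the keyboard predicts"


def build_lexicon(corpus: str) -> Counter:
    """Count how often each whitespace-separated word occurs."""
    return Counter(corpus.lower().split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def predict(prefix: str, lexicon: Counter, k: int = 3) -> list:
    """Return up to k known words starting with prefix, most frequent first."""
    prefix = prefix.lower()
    matches = [(-count, word) for word, count in lexicon.items()
               if word.startswith(prefix)]
    return [word for _, word in sorted(matches)[:k]]


def autocorrect(word: str, lexicon: Counter, max_distance: int = 2) -> str:
    """Replace an unknown word with the closest known word, if one is near.

    Ties on distance are broken by higher frequency, then alphabetically.
    Words with no candidate within max_distance are returned unchanged.
    """
    word = word.lower()
    if word in lexicon:
        return word
    candidates = [(edit_distance(word, known), -count, known)
                  for known, count in lexicon.items()]
    candidates = [c for c in candidates if c[0] <= max_distance]
    return min(candidates)[2] if candidates else word


LEXICON = build_lexicon(TOY_CORPUS)

if __name__ == "__main__":
    print(predict("t", LEXICON))           # completions for the prefix "t"
    print(autocorrect("keybord", LEXICON)) # one missing letter
    print(autocorrect("teh", LEXICON))     # two swapped letters
```

Even this toy shows why language coverage matters: both functions depend entirely on a lexicon for the language being typed, so a language variety without such data gets neither predictions nor corrections.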

Until about 2016, these technologies were available in only about 100 language varieties. In the last few years, however, it has become very clear that, with the rise in informal written communication globally, along with the increasing ubiquity of smartphones (ITU/UNESCO Broadband Commission, 2017), language technology needs to scale beyond the traditional set of about 100 language varieties in order to support communities across the world. Wikipedia, for example, is already available in about 300 language varieties (Wikipedia, 2019). Beyond Wikipedia, Scannell (2007) and our previous research (Prasad et al., 2018) found textual data in more than 2,000 language varieties online.

We propose calling efforts to bring technology to many more languages at scale ‘deep internationalization’, after the industry-standard term ‘internationalization’³, which is typically used to mean extending support to languages and communities beyond American English, the language and locale that most software products are designed to support first. However, the term ‘internationalization’ has not typically been thought of as extending to hundreds or even thousands of language varieties, so we added the adjective ‘deep’ to distinguish our effort from more limited internationalization efforts.

2.2 Linguistic Diversity Online

Over the last few years, we have been working on a deep internationalization project to add support for many more languages to Gboard, our smartphone keyboard application. Before we decided to start working on deep internationalization for Gboard, we conducted a number of literature surveys, data analyses, and observational studies to understand user expectations around the use of language in the smartphone age. Earlier in this report, we already referenced a number of papers analyzing the general usage trends for these language varieties, e.g. in analyses of Twitter posts. It stands to reason to assume
