Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook

Yudhanjaya Wijeratne†, Nisansa de Silva‡

† LIRNEasia, 12 Balcombe Place, Colombo, Sri Lanka ([email protected])
‡ University of Oregon, 1585 E 13th Ave, Eugene, OR 97403, United States ([email protected])

LIRNEasia is a pro-poor, pro-market think tank whose mission is catalyzing policy change through research to improve people’s lives in the emerging Asia Pacific by facilitating their use of hard and soft infrastructures through the use of knowledge, information and technology. This work was carried out with the aid of a grant from the International Development Research Centre (IDRC), Ottawa, Canada.

Abstract

This paper presents two colloquial Sinhala language corpora from the language efforts of the Data, Analysis and Policy team of LIRNEasia, as well as a list of algorithmically derived stopwords. The larger of the two corpora spans 2010 to 2020 and contains 28,825,820 to 29,549,672 words of multilingual text posted by 533 Sri Lankan Facebook pages covering politics, media, celebrities, and other categories; the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from the larger. Both corpora have markers for their date of creation, page of origin, and content type.

Introduction

‘The limits of my language mean the limits of my world.’ – Ludwig Wittgenstein

Sinhala, as with many other languages in the Global South, currently suffers from a phenomenon known as resource poverty [1]. To wit, many of the fundamental tools required for easy and efficient natural language analysis are unavailable; many of the computational components taken for granted in languages like English are either as yet unbuilt, in a nascent stage, or, in other cases, lost or retained among select institutions [2].