A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes
Total Page:16
File Type:pdf, Size:1020Kb
A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes 1,2Samiul Alam , 1Tahsin Reasat , 1Asif Shahriyar Sushmit , 1Sadi Mohammad Siddique 3Fuad Rahman , 4Mahady Hasan , 1,*Ahmed Imtiaz Humayun 1Bengali.AI, 2Bangladesh University of Engineering and Technology, 3Apurba Technologies Inc., 4Independent University, Bangladesh Abstract Vowel Grapheme Roots Diacritics Latin has historically led the state-of-the-art in handwritten optical character recognition (OCR) re- search. Adapting existing systems from Latin to alpha- a l syllabary languages is particularly challenging due to a g n a sharp contrast between their orthographies. The seg- B mentation of graphical constituents corresponding to characters becomes significantly hard due to a cur- sive writing system and frequent use of diacritics in Consonant h s Diacritic i l the alpha-syllabary family of languages. We propose g n a labeling scheme based on graphemes (linguistic seg- E i ments of word formation) that makes segmentation in- r a g side alpha-syllabary words linear and present the first a n v e dataset of Bengali handwritten graphemes that are com- D monly used in everyday context. The dataset contains 411k curated samples of 1295 unique commonly used Alpha-syllabaryAbugida Gra pGraphemeheme Seg mSegmentsents Bengali graphemes. Additionally, the test set contains 900 uncommon Bengali graphemes for out of dictionary Figure 1: Orthographic components in a Bangla (Ben- performance evaluation. The dataset is open-sourced gali) word compared to English and Devnagari (Hindi). as a part of a public Handwritten Grapheme Classi- The word ‘Proton’ in both Bengali and Hindi is equiv- fication Challenge on Kaggle to benchmark vision al- alent to its transliteration. Characters are color- gorithms for multi-target grapheme classification. The coded according to phonemic correspondence. Alpha- unique graphemes present in this dataset are selected syllabary grapheme segments and corresponding char- based on commonality in the Google Bengali ASR cor- acters from the three languages are segregated with the pus. From competition proceedings, we see that deep markers. While characters in English are arranged hor- learning methods can generalize to a large span of out izontally according to phonemic sequence, the order is of dictionary graphemes which are absent during train- not maintained in the other two languages. ing. text for such languages with numerous applications in 1. Introduction e-commerce, security, digitization, and e-learning. In the alpha-syllabary writing system, each word is com- Speakers of languages from the alpha-syllabary or prised of segments made of character units that are in Abugida family comprise of up to 1:3 billion people phonemic sequence. These segments act as the small- across India, Bangladesh, and Thailand alone. There is est written unit in alpha-syllabary languages and are significant academic and commercial interest in devel- termed as Graphemes [14]; the term alpha-syllabary oping systems that can optically recognize handwritten itself originates from the alphabet and syllabary qual- 1 ities of graphemes [7]. Each grapheme comprises of a along with solutions developed by the top-ranking par- grapheme root, which can be one character or several ticipants. Finally section 6 presents our conclusions. characters combined as a conjunct. The term char- For ease of reading, we have included the IPA stan- acter is used interchangeably with unicode character dard English transliteration of Bengali characters in throughout this text. Root characters may be accom- {.} throughout the text. panied by vowel or consonant diacritics- demarcations which correspond to phonemic extensions. To better 2. Related Work understand the orthography, we can compare the En- Handwritten OCR Datasets. Although several glish word Proton to its Bengali transliteration েপৰ্াটন (Fig. 1). While in English the characters are hori- datasets [2, 6, 20, 22] have been made for Bengali hand- zontally arranged according to phonemic sequence, the written characters, their effectiveness has been limited. first grapheme for both Bengali and Devanagari scripts This could be attributed to the fact that they were have a sequence of glyphs that do not correspond to the formulated following contemporary datasets of English linear arrangement of unicode characters or phonemes. characters. All these datasets label individual charac- As most OCR systems make a linear pass through a ters or words. These datasets work well for English written line, we believe this non-linear positioning is and can even be adapted for other document recogni- important to consider when designing such systems for tion tasks, setting the standard. Most character recog- Bengali as well as other alpha-syllabary languages. nition datasets for other languages like the Devanagari Character Dataset [1, 19, 12] and the Arabic Printed We propose a labeling scheme based on grapheme Text Image Database [13, 3] were created following segments of Abugida languages as a proxy for char- their design. However, they do not boast the same acter based OCR systems; grapheme recognition in- effectiveness and adaptability of their English counter- stead of character recognition bypasses the complex- parts. Languages with different writing systems there- ities of character segmentation inside handwritten fore require language specific design of the recognition alpha-syllabary words. We have curated the first Hand- pipeline and need more understanding of how it affects written Grapheme Dataset of Bengali as a candidate performance. To the best of our knowledge, this is the alpha-syllabary language, containing 411882 images of first work that proposes grapheme level recognition for 1295 unique commonly used handwritten graphemes alpha-syllabary OCR. and ∼ 900 uncommon graphemes (exact numbers are excluded for the integrity of the test set). While the 3. Challenges of Bengali Orthography cardinality of graphemes is significantly larger than that of characters, through competition proceedings we As mentioned before in section 1, each Bengali word show that the classification task is tractable even with a is comprised of segmental units called graphemes. Ben- small number of graphemes- deep learning models can gali has 48 characters in its alphabet- 11 vowels and generalize to a large span of unique graphemes even if 38 consonants (including special characters ‘ৎ’{ṯ},‘◌ং’ they are trained with a smaller set. Furthermore, the {ṁ},‘◌ঃ’{ḥ}). Out of the 11 vowels, 10 vowels have scope of this dataset is not limited to the domain of diacritic forms. There are also four consonant diacrit- OCR, it also creates an opportunity to evaluate multi- ics, ‘◌뷍’ (from consonant য {gha}), ‘◌쇍’ (from consonant target classification algorithms based on root and dia- র{ra}), ‘৴’ (also from consonant র {ra}) and ‘◌ঁ’. We critic annotations of graphemes. Compared to Multi- follow the convention of considering ‘◌ং’{ṁ},‘◌ঃ’ {ḥ} Mnist [21] which is a frequently used synthetic dataset as standalone consonants since they are always present used for multi-target benchmarking, our dataset pro- at the end of a grapheme and can be considered a sep- vides natural data of multi-target task comprising of arate root character. three target variables. 3.1. Grapheme Roots and Diacritics The rest of the paper is organized as follows. Section 2 discuses previous works. Section 3 shows the differ- Graphemes in Bengali consist of a root character ent challenges that arise due to the orthography of the which may be a vowel or a consonant or a conso- Bengali language, which is analogous to other Abugida nant conjunct along with vowel and consonant diacrit- languages. Section 4 formally defines the dataset objec- ics whose occurrence is optional. These three symbols tive, goes into the motivation behind a grapheme based together make a grapheme in Bengali. The consonant labeling scheme, and discusses briefly the methodol- and vowel diacritics can occur horizontally, vertically ogy followed to gather, extract, and standardize the adjacent to the root or even surrounding the root (Fig. data. Section 5 discusses some insights gained from 2). These roots and diacritics cannot be identified in the Bengali AI Grapheme Classification Competition written text by parsing horizontally and detecting each 2 the consonant conjunct = ঙ + গ as in Fig. 3b, a sim- plified more explicit form is written inFig. 3a. The same can be seen for diacritics in Fig. 3c and Fig 3d. It can be argued that allographs portray the linguistic Figure 2: Different vowel diacritics (green) and conso- plasticity of handwritten Bengali. nant diacritics (red) used in Bengali orthography. The 3.4. Unique Grapheme Combinations placement of the diacritics are not dependent on the grapheme root. One challenge posed by grapheme recognition is the huge number of unique graphemes possible. Taking into account the 38 consonants (nc) including three 3 2 glyph separately. Instead, one must look at the whole special characters, 11 vowels (nv) and (nc +nc ) possible grapheme and identify them as separate targets. In consonant conjuncts (considering 2nd and 3rd order), 3 2 light of this, our dataset labels individual graphemes there can be ((nc −3) +(nc −3) +(nc −3))+3 different with root characters, consonant and vowel diacritics as grapheme roots possible in Bengali. Grapheme roots separate targets. can have any of the 10+1 vowel diacritics (nvd) and 7+1 consonant diacritics (ncd). So the approximate 3.2. Consonant Conjuncts or Ligatures number of possible graphemes will be nv + 3 + ((nc − 3 − 2 − · · Consonant conjuncts in Bengali are analogous to lig- 3) + (nc 3) + (nc 3)) nvd ncd or 3883894 unique atures in Latin where multiple consonants combine to- graphemes. While this is a big number, not all of these gether to form glyphs which may or may not contain combinations are viable or are used in practice. characteristics from the standalone consonant glyphs. In Bengali, up to three consonants can combine to form 4. The Dataset consonant conjuncts. Consonant conjuncts may have Of all the possible graphemes combinations, only two (second order conjuncts, eg. ষ্ট = শ +ট {sta = śa a small amount is prevalent in modern Bengali.