BNLP: Natural Language Processing Toolkit for Bengali Language
Sagor Sarker
Begum Rokeya University
Rangpur, Bangladesh

ABSTRACT

BNLP is an open source language processing toolkit for the Bengali language that provides tokenization, word embedding, POS tagging, and NER tagging facilities. BNLP provides pre-trained models for model-based tokenization, embedding, POS tagging, and NER tagging of Bengali text; its pre-trained CRF taggers reach F1 scores of 80.75 on POS tagging and 66.88 on NER tagging. BNLP is widely used in the Bengali research community, with 16K downloads, 119 stars, and 31 forks. BNLP is available at https://github.com/sagorbrur/bnlp.

1 Introduction

Natural language processing is one of the most important fields in computational linguistics. Tokenization, embedding, POS tagging, NER tagging, text classification, and language modeling are some of the subtasks of NLP. Any computational linguistics researcher or developer needs hands-on tools to perform these subtasks efficiently. Thanks to recent advances in NLP, there are many tools and methods for word tokenization, word embedding, POS tagging, and NER tagging in English. NLTK[1], coreNLP[2], spacy[3], AllenNLP[4], Flair[5], and stanza[6] are a few of these tools. They provide a variety of methods for tokenization, embedding, POS tagging, NER tagging, and language modeling for English. Support for low-resource languages like Bengali is limited or absent. A recent tool, iNLTK^1 [7], is an initial approach for different Indic languages. But because it groups Bengali with other Indic languages, dedicated monolingual support for Bengali is missing.

BNLP, an open source language processing toolkit for the Bengali language, is built to address this problem and lowers the barrier to different Bengali NLP tasks by:

• Providing different tokenization methods to tokenize Bengali text efficiently
• Providing different embedding methods to embed Bengali words using pretrained models, with an option to train an embedding model from scratch
• Providing a quick-start option for POS tagging or NER tagging of Bengali sentences, with an option to train a CRF-based POS tagger or NER tagger from scratch

BNLP also provides utility methods, for example to remove stopwords from Bengali text or to get the list of Bengali letters or punctuation marks. The BNLP GitHub repository^2 hosts the source code of the package and the pretrained models, and documentation^3 is available online. BNLP is released under the permissive MIT license. It is easy to install via pip or by cloning the repository, and easy to plug into any Python project.

1 https://github.com/goru001/inltk
2 https://github.com/sagorbrur/bnlp
3 https://bnlp.readthedocs.io/

arXiv:2102.00405v1 [cs.CL] 31 Jan 2021

Figure 1: Overview of BNLP's pipeline. BNLP takes raw text as input and produces trained sentencepiece, word2vec, and fasttext models. Using those trained models, the BNLP prediction API performs different prediction tasks.

Figure 2: BNLP Basic Tokenization API

2 BNLP API

The BNLP tool is simple to use. Researchers and developers can integrate it by installing a single Python package. In this section we describe how to perform different NLP tasks on Bengali text using the BNLP toolkit.

2.1 Tokenization

BNLP provides three different tokenization options for Bengali text. Under rule-based tokenization, BNLP provides the Basic Tokenizer, which is a punctuation-splitting tokenizer, and an NLTK^4 tokenizer. Because the NLTK tokenizer is designed for English, we modified the NLTK tokenizer output for use with Bengali, keeping in mind the differences between English and Bengali punctuation.
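
The punctuation-splitting idea behind a rule-based tokenizer can be illustrated with a minimal Python sketch. This is not BNLP's implementation: the function name basic_tokenize and the punctuation set (ASCII punctuation plus the Bengali danda and double danda) are assumptions made only for this example.

```python
import string

# Assumed punctuation set for illustration: ASCII punctuation plus the
# Bengali danda (।) and double danda (॥). BNLP's own list may differ.
PUNCTUATION = set(string.punctuation) | {"।", "॥"}


def basic_tokenize(text):
    """Split text on whitespace and emit each punctuation mark as its own token."""
    tokens = []
    current = []  # characters of the word currently being built

    def flush():
        # Close the current word, if any, and start a new one.
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if ch.isspace():
            flush()
        elif ch in PUNCTUATION:
            flush()
            tokens.append(ch)
        else:
            current.append(ch)
    flush()
    return tokens


if __name__ == "__main__":
    print(basic_tokenize("আমি ভাত খাই। তুমি কি খাও?"))
```

Treating the danda as a separate token is one example of the punctuation differences that a tokenizer written for English would not handle by default.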

Under model-based tokenization, BNLP provides a sentencepiece^5 tokenizer for Bengali text called Bengali Sentencepiece. The Bengali Sentencepiece API provides two options: using a pretrained sentencepiece model and training a sentencepiece model. Anyone can tokenize Bengali text with the pretrained model or train their own Bengali sentencepiece model by calling the train API.

2.2 Embedding

BNLP provides two embedding options for Bengali words: Bengali word2vec and Bengali fasttext. Each has two modes: embedding Bengali words with a pretrained model, or training a word2vec/fasttext model from scratch. For both embedding models we used the gensim^6 embedding API and trained on Bengali corpora.

4 https://github.com/nltk/nltk
5 https://github.com/google/sentencepiece
6 https://github.com/RaRe-Technologies/gensim

Figure 3: BNLP Word2Vec API

Figure 4: BNLP NER API

2.3 POS Tagging

BNLP provides a quick-start option for Bengali POS tagging: a method that tags the parts of speech of a given sentence using a pretrained CRF model, and a method that trains a CRF model on custom data.

2.4 NER Tagging

Similar to POS tagging, BNLP provides a quick-start option for NER tagging of Bengali sentences and an option to train a CRF-based NER model on custom data.

Apart from this, BNLP provides some extra utility methods, such as getting Bengali stopwords, letters, and punctuation from the Corpus class.

3 BNLP Training and Evaluation

In this section we describe the training datasets, training procedure, and evaluation procedure for the BNLP models.

3.1 Datasets

For training sentencepiece, word2vec, and fasttext we used raw Bengali text from two sources: a Wikipedia^7 dump and news articles crawled from different news portal sites. As shown in Table 1, our raw data contains a total of 99139 Bengali Wikipedia articles and 127867 news articles.
The Wikipedia corpus contains a total of 1818523 sentences with 32908419 tokens. The news articles corpus contains a total of 4017940 sentences with 60526710 tokens.

7 https://dumps.wikimedia.org/bnwiki/latest/

Corpus          Articles   Sentences   Tokens
Wikipedia       99139      1818523     32908419
News Articles   127867     4017940     60526710
Total           227006     5836463     93435129

Table 1: Statistics of datasets used for training sentencepiece, word2vec, and fasttext models

       Sentences   Train   Test
POS    2997        2247    750
NER    67719       64155   3564

Table 2: Statistics of POS and NER datasets

       Precision   Recall   F1
POS    81.74       79.78    80.75
NER    74.15       60.91    66.88

Table 3: Evaluation results of POS and NER models

For POS tagging we used the nltr^8 dataset, which contains a total of 2997 sentences. We split it into 2247 training and 750 test sentences and trained our POS tagging model. For NER we used NER-Bangla-Datasets [8], which contains a total of 67719 sentences, with 64155 for training and 3564 for testing. Table 2 provides detailed statistics of the POS and NER datasets.

3.2 Training and Evaluation

We trained the sentencepiece model on our raw text data with a vocabulary size of 50000. Because sentencepiece provides an end-to-end system for training and tokenizing, we did not preprocess the raw text. We trained our word2vec model with embedding dimension 300, window size 5, minimum word occurrence count 1, and 8 workers, for a total of 50000 iterations. For fasttext we set embedding dimension 300, window size 5, minimum word occurrence count 1, model type skipgram, and learning rate 0.05. We trained for a total of 50 epochs, with a final loss of 0.318668. The training approach for our CRF-based POS tagging and NER tagging models is similar. The POS data is split into 75% train and 25% test; the NER data uses the split shown in Table 2 (64155 train, 3564 test). The POS tagging model achieves an F1 score of 80.75 and the NER model an F1 score of 66.88. Table 3 gives the detailed evaluation results.

4 Related Works

There is a significant number of open-source NLP tools for the English language.
Tools like NLTK[1], coreNLP[2], spacy[3], AllenNLP[4], Flair[5], and stanza[6] are a few of them. These tools are mostly built for English and have limited or no support for other languages. For a low-resource language like Bengali in particular, there is a great scarcity of processing tools. iNLTK[7] is an initial approach to processing Bengali, with tokenization and language model support. But because it groups Bengali with other Indic languages, dedicated monolingual support for Bengali is missing. With that concern in mind, we built BNLP specifically for the Bengali language, providing tokenization, embedding, POS tagging, and NER tagging support.

5 Conclusion and Future Work

The BNLP language processing toolkit provides tokenization, embedding, POS tagging, and NER tagging facilities for the Bengali language. BNLP's pretrained models support Bengali text tokenization and word embedding, and its CRF taggers reach F1 scores of 80.75 on POS tagging and 66.88 on NER tagging. BNLP is widely used in the Bengali language research community and has been well received by it. We are working on extending BNLP with support tools such as stemming, lemmatization, and corpus support. We are also working on adding language-model-based support, such as a BERT-based LM, so that researchers can use it efficiently for different downstream tasks. While these tasks are under development, we hope that BNLP will accelerate Bengali NLP research and development.

8 https://github.com/abhishekgupta92/bangla_pos_tagger

References

[1] Edward Loper and Steven Bird. NLTK: The natural language toolkit. CoRR, cs.CL/0205028, July 2002.

[2] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

[3] Matthew Honnibal and Ines Montani.
spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.

[4] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640, 2018.

[5] Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota, June 2019.