Bnlp: Natural Language Processing Toolkit for Bengali Language

Total Page:16

File Type:pdf, Size:1020Kb

Bnlp: Natural Language Processing Toolkit for Bengali Language BNLP: NATURAL LANGUAGE PROCESSING TOOLKIT FOR BENGALI LANGUAGE Sagor Sarker Begum Rokeya University Rangpur, Bangladesh [email protected] ABSTRACT BNLP is an open source language processing toolkit for Bengali language consisting with tokenization, word embedding, pos tagging, ner tagging facilities. BNLP provides pre-trained model with high accuracy to do model based tokenization, embedding, pos tagging, ner tagging task for Bengali language. BNLP pretrained model achieves significant results in Bengali text tokenization, word embeddding, POS tagging and NER tagging task. BNLP is using widely in the Bengali research communities with 16K downloads, 119 stars and 31 forks. BNLP is available at https://github. com/sagorbrur/bnlp. 1 Introduction Natural language processing is one of the most important field in computation linguistics. Tokenization, embedding, pos tagging, ner tagging, text classification, language modeling are some of the sub task of NLP. Any computational linguistics researcher or developer need hands on tools to do these sub task efficiently. Due to the recent advancement of NLP there are so many tools and method to do word tokenization, word embedding, pos tagging, ner tagging in English language. NLTK[1], coreNLP[2], spacy[3], AllenNLP[4], Flair[5], stanza[6] are few of the tools. These tools provide a variety of method to do tokenization, embedding, pos tagging, ner tagging, language modeling for English language. Support for other low resource language like Bengali is limited or no support at all. Recent tool like iNLTK1[7] is an initial approach for different indic language. But as it groups with other indic language special monolingual support for Bengali language is missing. BNLP is an open source language processing toolkit for Bengali language is build to address this problem and breaks the barrier to do different Bengali NLP task by: arXiv:2102.00405v1 [cs.CL] 31 Jan 2021 • Providing different tokenization method to tokenize Bengali text efficiently • Providing different embedding method to embed Bengali word using pretrained model and also provides an option to train an embedding model from scratch • Providing hands on start option for pos tagging or ner tagging of Bengali sentences and also provides an option for training CRF based pos tagger or ner tagger model from scratch. BNLP also provides some utility methods like to remove stopwords from Bengali text, to get Bengali letters list or punctuation list. BNLP github repositories2 for source code of the package, pretrained model and documentation3. BNLP libraries has a permissive MIT license. BNLP is easy to install via pip or by cloning repository, easy to plugin with any python projects. 1https://github.com/goru001/inltk 2https://github.com/sagorbrur/bnlp 3https://bnlp.readthedocs.io/ Figure 1: Overview of BNLP’s pipeline. BNLP takes raw text as input, and produces trained model of sentencepiece, word2vec and fasttext. Using that trained model BNLP prediction API do different prediction task. Figure 2: BNLP Basic Tokenization API 2 BNLP API BNLP tool is too simple to use. Researcher or developer can integrate this tool with installing simple python package. In this section we are describing how to do different NLP task for Bengali text using BNLP toolkit. 2.1 Tokenization BNLP provides three different tokenization option to tokenize Bengali text. Under rule based tokenizer BNLP provides Basic Tokenizer a punctuation splitting tokenizer and NLTK4 tokenizer. As NLTK tokenizer is for English language, we modified nltk tokenize output to use it for Bengali language keeping in mind the difference between punctuation of English and Bengali. Under model based tokenization BNLP provides sentencepice5 tokenizer for Bengali text called Bengali Sentencepiece. Bengali sentencepiece api provide two option, pretrained sentencepiece model and training sentencepiece model. Anyone can tokenize Bengali text using pretrained sentencepiece model or can train their own Bengali sentencepiece model by calling train api. 2.2 Embedding BNLP provides two different embedding option to embed Bengali words, one is Bengali word2vec and Bengali fasttext. Both Bengali word2vec and fasttext has two option, one is embed Bengali word using pretrained model and another is train Bengali word2vec/fasttext model from scratch. For both embedding model we used gensim6 embedding api and trained with Bengali corpora. 4https://github.com/nltk/nltk 5https://github.com/google/sentencepiece 6https://github.com/RaRe-Technologies/gensim 2 Figure 3: BNLP Word2Vec API Figure 4: BNLP NER API 2.3 POS Tagging BNLP provides a hands on starting option for pos tagging to Bengali by giving a method to tag pos from given sentence using pretrained CRF model and also train a CRF model by giving custom data. 2.4 NER Tagging Similar to pos tagging BNLP provides a hands on starting option to NER tagging for Bengali sentences and also provides an option to train a CRF based NER model using custom data. Apart from this BNLP provides some extra utilities methods like getting Bengali stopwords, letters, punctuation from Corpus class. 3 BNLP Training and Evaluation In this section we describe about different BNLP model training datasets, training procedure, evaluation procedures. 3.1 Datasets For training sentencepiece, word2vec, fasttext we used Bengali raw text data from two sources. One is wikipedia7 dump dataset and another is crawl news articles from different news portal sites. As shown in Table 1 our raw data contains total of 99139 wikipedia Bengali articles and 127867 news articles. Wikipedia corpus contains total of 1818523 sentenes with 32908419 tokens. News articles corpus contains total of 4017940 sentences with 60526710 tokens. 7https://dumps.wikimedia.org/bnwiki/latest/ Corpus Articles Sentences Tokens Wikipedia 99139 1818523 32908419 News Articles 127867 4017940 60526710 Total 227006 5836463 93435129 Table 1: Statistics of Datasets used for training sentencepiece, word2vec, fasttext Models 3 Sentences Train Test POS 2997 2247 750 NER 67719 64155 3564 Table 2: Statistics of POS and NER datasets Precision Recall F1 POS 81.74 79.78 80.75 NER 74.15 60.91 66.88 Table 3: Evaluation results of POS and NER model For POS tagging we used nltr 8 datasets which contains total of 2997 sentences. We split that datasets into 2247 train and 750 test set and train our POS tagging model. For NER we used NER-Bangla-Datasets [8] which contains total of 67719 data with 64155 train and 3564 test. Table 2 provides details statistics of POS and NER datasets. 3.2 Training and Evaluation We train sentencepiece model with our raw text data with vocab size 50000. As sentencepiece provide us end-to-end system for training and tokenizing we did not do any preprocessing task in our raw text datasets. We train our word2vec model with embedding dimention 300, window size 5, minimum number of word occurrences 1, and total workers number 8. We train it total of 50000 iterations. For training fasttext we set embedding dimension 300, windows size 5, number of minimum word occurrences 1, model type skipgram, learning rate 0.05. We trained total of 50 epochs and our loss is 0.318668. Our CRF based POS tagging model and NER tagging model training approach is similar. We splited data into 75% train and 25% test. Our evaluation result for POS tagging model is 80.75 F1 score and NER model is 66.88 F1 score. Table 3 describe details about evaluation results. 4 Related Works There are significant number of open-source NLP tools for English language. Tools like NLTK[1], coreNLP[2], spacy[3], AllenNLP[4], Flair[5], stanza[6] are few of them. These tools mostly build for English language and has limited or no support for other languages. Specially a low resource language like Bengali, there is huge scarcity of tool to process it. iNLTK[7] is an initial approach to help process Bengali language with tokenization, language model support. But as it’s group with different indic language, specail monolingual concern for Bengali language is missing. Keeping that concern in mind we build BNLP to support specially for Bengali language and provides tokenization, embedding, pos, ner supports. 5 Conclusion and Future Work BNLP language processing toolkit provides tokenization, embedding, pos tagging, ner tagging, language modeling facilities for Bengali language. BNLP pertrained model achieves significant results in Bengali text tokenizing, word embeddding, POS tagging and NER tagging task. BNLP is using widely in Bengali language research communities and appreciated by the communities. We are working on extending the support tools like stemming, lemmatizing, corpus support for BNLP in future. We are working on to add language model based support like BERT based LM in BNLP so that researcher can use it for different downstream task efficiently. While these task under development, we are hopping that BNLP will accelerate Bengali NLP research and development. References [1] Edward Loper and Steven Bird. Nltk: the natural language toolkit. CoRR, cs.CL/0205028, 07 2002. 8https://github.com/abhishekgupta92/bangla_pos_tagger 4 [2] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland, June 2014. Association for Computational Linguistics. [3] Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolu- tional neural networks and incremental parsing. To appear, 2017. [4] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. Allennlp: A deep semantic natural language processing platform. CoRR, abs/1803.07640, 2018. [5] Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota, June 2019.
Recommended publications
  • 15 Things You Should Know About Spacy
    15 THINGS YOU SHOULD KNOW ABOUT SPACY ALEXANDER CS HENDORF @ EUROPYTHON 2020 0 -preface- Natural Language Processing Natural Language Processing (NLP): A Unstructured Data Avalanche - Est. 20% of all data - Structured Data - Data in databases, format-xyz Different Kinds of Data - Est. 80% of all data, requires extensive pre-processing and different methods for analysis - Unstructured Data - Verbal, text, sign- communication - Pictures, movies, x-rays - Pink noise, singularities Some Applications of NLP - Dialogue systems (Chatbots) - Machine Translation - Sentiment Analysis - Speech-to-text and vice versa - Spelling / grammar checking - Text completion Many Smart Things in Text Data! Rule-Based Exploratory / Statistical NLP Use Cases � Alexander C. S. Hendorf [email protected] -Partner & Principal Consultant Data Science & AI @hendorf Consulting and building AI & Data Science for enterprises. -Python Software Foundation Fellow, Python Softwareverband chair, Emeritus EuroPython organizer, Currently Program Chair of EuroSciPy, PyConDE & PyData Berlin, PyData community organizer -Speaker Europe & USA MongoDB World New York / San José, PyCons, CEBIT Developer World, BI Forum, IT-Tage FFM, PyData London, Berlin, PyParis,… here! NLP Alchemy Toolset 1 - What is spaCy? - Stable Open-source library for NLP: - Supports over 55+ languages - Comes with many pretrained language models - Designed for production usage - Advantages: - Fast and efficient - Relatively intuitive - Simple deep learning implementation - Out-of-the-box support for: - Named entity recognition - Part-of-speech (POS) tagging - Labelled dependency parsing - … 2 – Building Blocks of spaCy 2 – Building Blocks I. - Tokenization Segmenting text into words, punctuations marks - POS (Part-of-speech tagging) cat -> noun, scratched -> verb - Lemmatization cats -> cat, scratched -> scratch - Sentence Boundary Detection Hello, Mrs. Poppins! - NER (Named Entity Recognition) Marry Poppins -> person, Apple -> company - Serialization Saving NLP documents 2 – Building Blocks II.
    [Show full text]
  • Automatic Correction of Real-Word Errors in Spanish Clinical Texts
    sensors Article Automatic Correction of Real-Word Errors in Spanish Clinical Texts Daniel Bravo-Candel 1,Jésica López-Hernández 1, José Antonio García-Díaz 1 , Fernando Molina-Molina 2 and Francisco García-Sánchez 1,* 1 Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain; [email protected] (D.B.-C.); [email protected] (J.L.-H.); [email protected] (J.A.G.-D.) 2 VÓCALI Sistemas Inteligentes S.L., 30100 Murcia, Spain; [email protected] * Correspondence: [email protected]; Tel.: +34-86888-8107 Abstract: Real-word errors are characterized by being actual terms in the dictionary. By providing context, real-word errors are detected. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus. Then, the probability of a word being a real-word error is computed. On the other hand, state-of-the-art approaches make use of deep learning models to learn context by extracting semantic features from text. In this work, a deep learning model were implemented for correcting real-word errors in clinical text. Specifically, a Seq2seq Neural Machine Translation Model mapped erroneous sentences to correct them. For that, different types of error were generated in correct sentences by using rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus due to privacy issues when dealing Citation: Bravo-Candel, D.; López-Hernández, J.; García-Díaz, with patient information.
    [Show full text]
  • ACL 2019 Social Media Mining for Health Applications (#SMM4H)
    ACL 2019 Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task Proceedings of the Fourth Workshop August 2, 2019 Florence, Italy c 2019 The Association for Computational Linguistics Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860 [email protected] ISBN 978-1-950737-46-8 ii Preface Welcome to the 4th Social Media Mining for Health Applications Workshop and Shared Task - #SMM4H 2019. The total number of users of social media continues to grow worldwide, resulting in the generation of vast amounts of data. Popular social networking sites such as Facebook, Twitter and Instagram dominate this sphere. According to estimates, 500 million tweets and 4.3 billion Facebook messages are posted every day 1. The latest Pew Research Report 2, nearly half of adults worldwide and two- thirds of all American adults (65%) use social networking. The report states that of the total users, 26% have discussed health information, and, of those, 30% changed behavior based on this information and 42% discussed current medical conditions. Advances in automated data processing, machine learning and NLP present the possibility of utilizing this massive data source for biomedical and public health applications, if researchers address the methodological challenges unique to this media. In its fourth iteration, the #SMM4H workshop takes place in Florence, Italy, on August 2, 2019, and is co-located with the
    [Show full text]
  • Mapping the Hyponymy Relation of Wordnet Onto
    Under review as a conference paper at ICLR 2019 MAPPING THE HYPONYMY RELATION OF WORDNET ONTO VECTOR SPACES Anonymous authors Paper under double-blind review ABSTRACT In this paper, we investigate mapping the hyponymy relation of WORDNET to feature vectors. We aim to model lexical knowledge in such a way that it can be used as input in generic machine-learning models, such as phrase entailment predictors. We propose two models. The first one leverages an existing mapping of words to feature vectors (fastText), and attempts to classify such vectors as within or outside of each class. The second model is fully supervised, using solely WORDNET as a ground truth. It maps each concept to an interval or a disjunction thereof. On the first model, we approach, but not quite attain state of the art performance. The second model can achieve near-perfect accuracy. 1 INTRODUCTION Distributional encoding of word meanings from the large corpora (Mikolov et al., 2013; 2018; Pen- nington et al., 2014) have been found to be useful for a number of NLP tasks. These approaches are based on a probabilistic language model by Bengio et al. (2003) of word sequences, where each word w is represented as a feature vector f(w) (a compact representation of a word, as a vector of floating point values). This means that one learns word representations (vectors) and probabilities of word sequences at the same time. While the major goal of distributional approaches is to identify distributional patterns of words and word sequences, they have even found use in tasks that require modeling more fine-grained relations between words than co-occurrence in word sequences.
    [Show full text]
  • Pairwise Fasttext Classifier for Entity Disambiguation
    Pairwise FastText Classifier for Entity Disambiguation a,b b b Cheng Yu , Bing Chu , Rohit Ram , James Aichingerb, Lizhen Qub,c, Hanna Suominenb,c a Project Cleopatra, Canberra, Australia b The Australian National University c DATA 61, Australia [email protected] {u5470909,u5568718,u5016706, Hanna.Suominen}@anu.edu.au [email protected] deterministically. Then we will evaluate PFC’s Abstract performance against a few baseline methods, in- cluding SVC3 with hand-crafted text features. Fi- For the Australasian Language Technology nally, we will discuss ways to improve disambig- Association (ALTA) 2016 Shared Task, we uation performance using PFC. devised Pairwise FastText Classifier (PFC), an efficient embedding-based text classifier, 2 Pairwise Fast-Text Classifier (PFC) and used it for entity disambiguation. Com- pared with a few baseline algorithms, PFC Our Pairwise FastText Classifier is inspired by achieved a higher F1 score at 0.72 (under the the FastText. Thus this section starts with a brief team name BCJR). To generalise the model, description of FastText, and proceeds to demon- we also created a method to bootstrap the strate PFC. training set deterministically without human labelling and at no financial cost. By releasing 2.1 FastText PFC and the dataset augmentation software to the public1, we hope to invite more collabora- FastText maps each vocabulary to a real-valued tion. vector, with unknown words having a special vo- cabulary ID. A document can be represented as 1 Introduction the average of all these vectors. Then FastText will train a maximum entropy multi-class classi- The goal of the ALTA 2016 Shared Task was to fier on the vectors and the output labels.
    [Show full text]
  • Fasttext.Zip: Compressing Text Classification Models
    Under review as a conference paper at ICLR 2017 FASTTEXT.ZIP: COMPRESSING TEXT CLASSIFICATION MODELS Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve´ Jegou´ & Tomas Mikolov Facebook AI Research fajoulin,egrave,bojanowski,matthijs,rvj,[email protected] ABSTRACT We consider the problem of producing compact architectures for text classifica- tion, such that the full model fits in a limited amount of memory. After consid- ering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quan- tization artefacts. Combined with simple approaches specifically adapted to text classification, our approach derived from fastText requires, at test time, only a fraction of the memory compared to the original FastText, without noticeably sacrificing quality in terms of classification accuracy. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy. 1 INTRODUCTION Text classification is an important problem in Natural Language Processing (NLP). Real world use- cases include spam filtering or e-mail categorization. It is a core component in more complex sys- tems such as search and ranking. Recently, deep learning techniques based on neural networks have achieved state of the art results in various NLP applications. One of the main successes of deep learning is due to the effectiveness of recurrent networks for language modeling and their application to speech recognition and machine translation (Mikolov, 2012).
    [Show full text]
  • Natural Language Toolkit Based Morphology Linguistics
    Published by : International Journal of Engineering Research & Technology (IJERT) http://www.ijert.org ISSN: 2278-0181 Vol. 9 Issue 01, January-2020 Natural Language Toolkit based Morphology Linguistics Alifya Khan Pratyusha Trivedi Information Technology Information Technology Vidyalankar Institute of Technology Vidyalankar Institute of Technology, Mumbai, India Mumbai, India Karthik Ashok Prof. Kanchan Dhuri Information Technology, Information Technology, Vidyalankar Institute of Technology, Vidyalankar Institute of Technology, Mumbai, India Mumbai, India Abstract— In the current scenario, there are different apps, of text to different languages, grammar check, etc. websites, etc. to carry out different functionalities with respect Generally, all these functionalities are used in tandem with to text such as grammar correction, translation of text, each other. For e.g., to read a sign board in a different extraction of text from image or videos, etc. There is no app or language, the first step is to extract text from that image and a website where a user can get all these functions/features at translate it to any respective language as required. Hence, to one place and hence the user is forced to install different apps do this one has to switch from application to application or visit different websites to carry out those functions. The which can be time consuming. To overcome this problem, proposed system identifies this problem and tries to overcome an integrated environment is built where all these it by providing various text-based features at one place, so that functionalities are available. the user will not have to hop from app to app or website to website to carry out various functions.
    [Show full text]
  • Question Embeddings for Semantic Answer Type Prediction
    Question Embeddings for Semantic Answer Type Prediction Eleanor Bill1 and Ernesto Jiménez-Ruiz2 1, 2 City, University of London, London, EC1V 0HB Abstract. This paper considers an answer type and category prediction challenge for a set of natural language questions, and proposes a question answering clas- sification system based on word and DBpedia knowledge graph embeddings. The questions are parsed for keywords, nouns and noun phrases before word and knowledge graph embeddings are applied to the parts of the question. The vectors produced are used to train multiple multi-layer perceptron models, one for each answer type in a multiclass one-vs-all classification system for both answer cat- egory prediction and answer type prediction. Different combinations of vectors and the effect of creating additional positive and negative training samples are evaluated in order to find the best classification system. The classification system that predict the answer category with highest accuracy are the classifiers trained on knowledge graph embedded noun phrases vectors from the original training data, with an accuracy of 0.793. The vector combination that produces the highest NDCG values for answer category accuracy is the word embeddings from the parsed question keyword and nouns parsed from the original training data, with NDCG@5 and NDCG@10 values of 0.471 and 0.440 respectively for the top five and ten predicted answer types. Keywords: Semantic Web, knowledge graph embedding, answer type predic- tion, question answering. 1 Introduction 1.1 The SMART challenge The challenge [1] provided a training set of natural language questions alongside a sin- gle given answer category (Boolean, literal or resource) and 1-6 given answer types.
    [Show full text]
  • Design and Implementation of an Open Source Greek Pos Tagger and Entity Recognizer Using Spacy
    DESIGN AND IMPLEMENTATION OF AN OPEN SOURCE GREEK POS TAGGER AND ENTITY RECOGNIZER USING SPACY A PREPRINT Eleni Partalidou1, Eleftherios Spyromitros-Xioufis1, Stavros Doropoulos1, Stavros Vologiannidis2, and Konstantinos I. Diamantaras3 1DataScouting, 30 Vakchou Street, Thessaloniki, 54629, Greece 2Dpt of Computer, Informatics and Telecommunications Engineering, International Hellenic University, Terma Magnisias, Serres, 62124, Greece 3Dpt of Informatics and Electronic Engineering, International Hellenic University, Sindos, Thessaloniki, 57400, Greece ABSTRACT This paper proposes a machine learning approach to part-of-speech tagging and named entity recog- nition for Greek, focusing on the extraction of morphological features and classification of tokens into a small set of classes for named entities. The architecture model that was used is introduced. The greek version of the spaCy platform was added into the source code, a feature that did not exist before our contribution, and was used for building the models. Additionally, a part of speech tagger was trained that can detect the morphologyof the tokens and performs higher than the state-of-the-art results when classifying only the part of speech. For named entity recognition using spaCy, a model that extends the standard ENAMEX type (organization, location, person) was built. Certain experi- ments that were conducted indicate the need for flexibility in out-of-vocabulary words and there is an effort for resolving this issue. Finally, the evaluation results are discussed. Keywords Part-of-Speech Tagging, Named Entity Recognition, spaCy, Greek text. 1. Introduction In the research field of Natural Language Processing (NLP) there are several tasks that contribute to understanding arXiv:1912.10162v1 [cs.CL] 5 Dec 2019 natural text.
    [Show full text]
  • Translation, Sentiment and Voices: a Computational Model to Translate and Analyze Voices from Real-Time Video Calling
    Translation, Sentiment and Voices: A Computational Model to Translate and Analyze Voices from Real-Time Video Calling Aneek Barman Roy School of Computer Science and Statistics Department of Computer Science Thesis submitted for the degree of Master in Science Trinity College Dublin 2019 List of Tables 1 Table 2.1: The table presents a count of studies 22 2 Table 2.2: The table present a count of studies 23 3 Table 3.1: Europarl Training Corpus 31 4 Table 3.2: VidALL Neural Machine Translation Corpus 32 5 Table 3.3: Language Translation Model Specifications 33 6 Table 4.1: Speech Analysis Corpus Statistics 42 7 Table 4.2: Sentences Given to Speakers 43 8 Table 4.3: Hacki’s Vocal Intensity Comparators (Hacki, 1996) 45 9 Table 4.4: Model Vocal Intensity Comparators 45 10 Table 4.5: Sound Intensities for Sentence 1 and Sentence 2 46 11 Table 4.6: Classification of Sentiment Scores 47 12 Table 4.7: Sentiment Scores for Five Sentences 47 13 Table 4.8: Simple RNN-based Model Summary 52 14 Table 4.9: Simple RNN-based Model Results 52 15 Table 4.10: Embedded-based RNN Model Summary 53 16 Table 4.11: Embedded-RNN based Model Results 54 17 Table 5.1: Sentence A Evaluation of RNN and Embedded-RNN Predictions from 20 Epochs 63 18 Table 5.2: Sentence B Evaluation of RNN and Embedded-RNN Predictions from 20 Epochs 63 19 Table 5.3: Evaluation on VidALL Speech Transcriptions 64 v List of Figures 1 Figure 2.1: Google Neural Machine Translation Architecture (Wu et al.,2016) 10 2 Figure 2.2: Side-by-side Evaluation of Google Neural Translation Model (Wu
    [Show full text]
  • A Natural Language Processing System for Extracting Evidence of Drug Repurposing from Scientific Publications
    The Thirty-Second Innovative Applications of Artificial Intelligence Conference (IAAI-20) A Natural Language Processing System for Extracting Evidence of Drug Repurposing from Scientific Publications Shivashankar Subramanian,1,2 Ioana Baldini,1 Sushma Ravichandran,1 Dmitriy A. Katz-Rogozhnikov,1 Karthikeyan Natesan Ramamurthy,1 Prasanna Sattigeri,1 Kush R. Varshney,1 Annmarie Wang,3,4 Pradeep Mangalath,3,5 Laura B. Kleiman3 1IBM Research, Yorktown Heights, NY, USA, 2University of Melbourne, VIC, Australia, 3Cures Within Reach for Cancer, Cambridge, MA, USA, 4Massachusetts Institute of Technology, Cambridge, MA, USA, 5Harvard Medical School, Boston, MA, USA Abstract (Pantziarka et al. 2017; Bouche, Pantziarka, and Meheus 2017; Verbaanderd et al. 2017). However, manual review to More than 200 generic drugs approved by the U.S. Food and Drug Administration for non-cancer indications have shown identify and analyze potential evidence is time-consuming promise for treating cancer. Due to their long history of safe and intractable to scale, as PubMed indexes millions of ar- patient use, low cost, and widespread availability, repurpos- ticles and the collection is continuously updated. For exam- ing of these drugs represents a major opportunity to rapidly ple, the PubMed query Neoplasms [mh] yields more than improve outcomes for cancer patients and reduce healthcare 3.2 million articles.1 Hence there is a need for a (semi-) au- costs. In many cases, there is already evidence of efficacy for tomatic approach to identify relevant scientific articles and cancer, but trying to manually extract such evidence from the provide an easily consumable summary of the evidence. scientific literature is intractable.
    [Show full text]
  • Practice with Python
    CSI4108-01 ARTIFICIAL INTELLIGENCE 1 Word Embedding / Text Processing Practice with Python 2018. 5. 11. Lee, Gyeongbok Practice with Python 2 Contents • Word Embedding – Libraries: gensim, fastText – Embedding alignment (with two languages) • Text/Language Processing – POS Tagging with NLTK/koNLPy – Text similarity (jellyfish) Practice with Python 3 Gensim • Open-source vector space modeling and topic modeling toolkit implemented in Python – designed to handle large text collections, using data streaming and efficient incremental algorithms – Usually used to make word vector from corpus • Tutorial is available here: – https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#tutorials – https://rare-technologies.com/word2vec-tutorial/ • Install – pip install gensim Practice with Python 4 Gensim for Word Embedding • Logging • Input Data: list of word’s list – Example: I have a car , I like the cat → – For list of the sentences, you can make this by: Practice with Python 5 Gensim for Word Embedding • If your data is already preprocessed… – One sentence per line, separated by whitespace → LineSentence (just load the file) – Try with this: • http://an.yonsei.ac.kr/corpus/example_corpus.txt From https://radimrehurek.com/gensim/models/word2vec.html Practice with Python 6 Gensim for Word Embedding • If the input is in multiple files or file size is large: – Use custom iterator and yield From https://rare-technologies.com/word2vec-tutorial/ Practice with Python 7 Gensim for Word Embedding • gensim.models.Word2Vec Parameters – min_count:
    [Show full text]