malaya Documentation

huseinzol05

Sep 16, 2021

GETTING STARTED

1 Documentation
2 Installing from the PyPI
3 Features
4 Pretrained Models
5 References
6 Acknowledgement
7 Contributing
8 License
9 Contents:
    9.1 Speech Toolkit
    9.2 Dataset
    9.3 Installation
    9.4 Malaya Cache
    9.5 Running on Windows
    9.6 API
    9.7 Contributing
    9.8 GPU Environment
    9.9 Devices
    9.10 Precision Mode
    9.11 Quantization
    9.12 Deployment
    9.13 Transformer
    9.14 Vector
    9.15 Word and sentence tokenizer
    9.16 Spelling Correction
    9.17 Coreference Resolution
    9.18 Normalizer
    9.19 Stemmer and Lemmatization
    9.20 True Case
    9.21 Segmentation
    9.22 Preprocessing
    9.23 Kesalahan Tatabahasa
    9.24 Num2Word
    9.25 Word2Num
    9.26 Knowledge Graph Triples
    9.27 Knowledge Graph from Dependency
    9.28 Text Augmentation
    9.29 Prefix Generator
    9.30 Isi Penting Generator
    9.31 Lexicon Generator
    9.32 Paraphrase
    9.33 Emotion Analysis
    9.34 Language Detection
    9.35 NSFW Detection
    9.36 Relevancy Analysis
    9.37 Sentiment Analysis
    9.38 Subjectivity Analysis
    9.39 Toxicity Analysis
    9.40 Doc2Vec
    9.41 ......
    9.42 Unsupervised Keyword Extraction
    9.43 Keyphrase similarity
    9.44 Entities Recognition
    9.45 Part-of-Speech Recognition
    9.46 Dependency Parsing
    9.47 Constituency Parsing
    9.48 Abstractive
    9.49 Long Text Abstractive Summarization
    9.50 Extractive
    9.51 MS to EN
    9.52 EN to MS
    9.53 Long Text Translation
    9.54 SQUAD
    9.55 Classification
    9.56 Topic Modeling
    9.57 Clustering
    9.58 Stacking
    9.59 Finetune ALXLNET-Bahasa
    9.60 Finetune BERT-Bahasa
    9.61 Finetune XLNET-Bahasa
    9.62 Crawler
    9.63 Donation
    9.64 How Malaya gathered corpus?
    9.65 References

Python Module Index

Index


Malaya is a Natural-Language-Toolkit library for bahasa Malaysia, powered by Tensorflow Deep Learning.


CHAPTER ONE

DOCUMENTATION

Proper documentation is available at https://malaya.readthedocs.io/


CHAPTER TWO

INSTALLING FROM THE PYPI

CPU version

$ pip install malaya

GPU version

$ pip install malaya[gpu]

Only Python 3.6.0 and above and Tensorflow 1.15.0 and above are supported. We recommend using virtualenv for development. All examples were tested on Tensorflow versions 1.15.4, 2.4.1 and 2.5.
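To verify the installation, a minimal sanity check (the printed path depends on your machine; __home__ is the default cache folder described in the Malaya Cache section):

import malaya
print(malaya.__home__)  # location of the default cache folder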


CHAPTER THREE

FEATURES

• Augmentation, augment any text using a dictionary of synonyms, Wordvector or Transformer-Bahasa.
• Constituency Parsing, breaking a text into sub-phrases using finetuned Transformer-Bahasa.
• Coreference Resolution, finding all expressions that refer to the same entity in a text using Dependency Parsing models.
• Dependency Parsing, extracting a dependency parse of a sentence using finetuned Transformer-Bahasa.
• Emotion Analysis, detect and recognize 6 different emotions in texts using finetuned Transformer-Bahasa.
• Entities Recognition, seeks to locate and classify named entities mentioned in text using finetuned Transformer-Bahasa.
• Generator, generate any texts given a context using T5-Bahasa, GPT2-Bahasa or Transformer-Bahasa.
• Keyword Extraction, provide RAKE, TextRank and Attention Mechanism hybrid with Transformer-Bahasa.
• Knowledge Graph, generate Knowledge Graph using T5-Bahasa or parse from Dependency Parsing models.
• Language Detection, using Fast-text and a Sparse Deep learning Model to classify Malay (formal and social media), Indonesian (formal and social media), Rojak language and Manglish.
• Normalizer, using local Malaysian NLP research hybrid with Transformer-Bahasa to normalize any bahasa texts.
• Num2Word, convert from numbers to cardinal or ordinal representation.
• Paraphrase, provide Abstractive Paraphrase using T5-Bahasa and Transformer-Bahasa.
• Part-of-Speech Recognition, grammatical tagging is the process of marking up a word in a text using finetuned Transformer-Bahasa.
• Question Answering, reading comprehension using finetuned Transformer-Bahasa.
• Relevancy Analysis, detect and recognize relevancy of texts using finetuned Transformer-Bahasa.
• Sentiment Analysis, detect and recognize polarity of texts using finetuned Transformer-Bahasa.
• Text Similarity, provide interface for lexical similarity and deep semantic similarity using finetuned Transformer-Bahasa.
• Spell Correction, using local Malaysian NLP research hybrid with Transformer-Bahasa to auto-correct any bahasa words, and NeuSpell using T5-Bahasa.
• Stemmer, using BPE LSTM Seq2Seq with attention, state-of-the-art for Bahasa stemming.
• Subjectivity Analysis, detect and recognize self-opinion polarity of texts using finetuned Transformer-Bahasa.
• Kesalahan Tatabahasa, fix grammatical errors (kesalahan tatabahasa) using TransformerTag-Bahasa.


• Summarization, provide Abstractive using T5-Bahasa, and an Extractive interface using Transformer-Bahasa, skip-thought and Doc2Vec.
• Topic Modelling, provide Transformer-Bahasa, LDA2Vec, LDA, NMF and LSA interfaces for easy topic modelling with topic visualization.
• Toxicity Analysis, detect and recognize 27 different toxicity patterns of texts using finetuned Transformer-Bahasa.
• Transformer, provide easy interface to load Malaya Pretrained Language models.
• Translation, provide Neural Machine Translation using Transformer for EN to MS and MS to EN.
• Word2Num, convert from cardinal or ordinal representation to numbers.
• WordVector, provide pretrained bahasa wikipedia and bahasa news Word2Vec, with easy interface and visualization.
• Zero-shot classification, provide Zero-shot classification interface using Transformer-Bahasa to recognize texts without any labeled training data.
• Hybrid 8-bit Quantization, provide hybrid 8-bit quantization for all models to reduce inference time up to 2x and model size up to 4x.
• Longer Sequences Transformer, provide BigBird + Pegasus for longer Abstractive Summarization, Neural Machine Translation and Relevancy Analysis sequences.

CHAPTER FOUR

PRETRAINED MODELS

Malaya also released Bahasa pretrained models, simply check at Malaya/pretrained-model
• ALBERT, A Lite BERT for Self-supervised Learning of Language Representations, https://arxiv.org/abs/1909.11942
• ALXLNET, a Lite XLNET, no paper produced.
• BERT, Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805
• BigBird, Transformers for Longer Sequences, https://arxiv.org/abs/2007.14062
• ELECTRA, Pre-training Text Encoders as Discriminators Rather Than Generators, https://arxiv.org/abs/2003.10555
• GPT2, Language Models are Unsupervised Multitask Learners, https://github.com/openai/gpt-2
• LM-Transformer, exactly like T5, but uses Tensor2Tensor instead of Mesh Tensorflow with a little tweak, no paper produced.
• PEGASUS, Pre-training with Extracted Gap-sentences for Abstractive Summarization, https://arxiv.org/abs/1912.08777
• T5, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, https://arxiv.org/abs/1910.10683
• TinyBERT, Distilling BERT for Natural Language Understanding, https://arxiv.org/abs/1909.10351
• Word2Vec, Efficient Estimation of Word Representations in Vector Space, https://arxiv.org/abs/1301.3781
• XLNET, Generalized Autoregressive Pretraining for Language Understanding, https://arxiv.org/abs/1906.08237
• FNet, FNet: Mixing Tokens with Fourier Transforms, https://arxiv.org/abs/2105.03824


CHAPTER FIVE

REFERENCES

If you use our software for research, please cite:

@misc{Malaya, Natural-Language-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow,
  author = {Husein, Zolkepli},
  title = {Malaya},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huseinzol05/malaya}}
}


CHAPTER SIX

ACKNOWLEDGEMENT

Thanks to KeyReply for sponsoring the private cloud to train Malaya models; without it, this library would collapse entirely. Also, thanks to Tensorflow Research Cloud for free TPU access.


CHAPTER SEVEN

CONTRIBUTING

Thank you for contributing to this library, it really helps a lot. Feel free to contact me with suggestions, or contribute in other forms. We accept everything, not just code!


CHAPTER EIGHT

LICENSE


CHAPTER NINE

CONTENTS:

9.1 Speech Toolkit

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Tensorflow Deep Learning. We maintain it in a separate repository, https://github.com/huseinzol05/malaya-speech

9.1.1 Documentation

Proper documentation is available at https://malaya-speech.readthedocs.io/

9.1.2 Installing from the PyPI

CPU version

$ pip install malaya-speech

GPU version

$ pip install malaya-speech[gpu]

Only Python 3.6.0 and above and Tensorflow 1.15.0 and above are supported. We recommend using virtualenv for development. All examples were tested on Tensorflow versions 1.15.4 and 2.4.1.

9.1.3 Features

• Age Detection, detect age in speech using Finetuned Speaker Vector.
• Speaker Diarization, diarizing speakers using Pretrained Speaker Vector.
• Emotion Detection, detect emotions in speech using Finetuned Speaker Vector.
• Gender Detection, detect genders in speech using Finetuned Speaker Vector.
• Language Detection, detect hyperlocal languages in speech using Finetuned Speaker Vector.
• Multispeaker Separation, Multispeaker separation using FastSep on 8k Wav.
• Noise Reduction, reduce multilevel noises using STFT UNET.
• Speaker Change, detect changing speakers using Finetuned Speaker Vector.


• Speaker Overlap, detect overlapping speakers using Finetuned Speaker Vector.
• Speaker Vector, calculate similarity between speakers using Pretrained Speaker Vector.
• Speech Enhancement, enhance voice activities using Waveform UNET.
• SpeechSplit Conversion, detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch using PyWorld and PySPTK.
• Speech-to-Text, End-to-End Speech to Text for Malay and Mixed (Malay and Singlish) using RNN-Transducer and Wav2Vec2 CTC.
• Super Resolution, Super Resolution 4x for Waveform.
• Text-to-Speech, Text to Speech for Malay and Singlish using Tacotron2 and FastSpeech2.
• Vocoder, convert Mel to Waveform using MelGAN, Multiband MelGAN and Universal MelGAN Vocoder.
• Voice Activity Detection, detect voice activities using Finetuned Speaker Vector.
• Voice Conversion, Many-to-One, One-to-Many, Many-to-Many, and Zero-shot Voice Conversion.
• Hybrid 8-bit Quantization, provide hybrid 8-bit quantization for all models to reduce inference time up to 2x and model size up to 4x.

9.1.4 Pretrained Models

Malaya-Speech also released pretrained models, simply check at malaya-speech/pretrained-model
• Wave UNET, Multi-Scale Neural Network for End-to-End Audio Source Separation, https://arxiv.org/abs/1806.03185
• Wave ResNet UNET, added ResNet style into Wave UNET, no paper produced.
• Wave ResNext UNET, added ResNext style into Wave UNET, no paper produced.
• Deep Speaker, An End-to-End Neural Speaker Embedding System, https://arxiv.org/pdf/1705.02304.pdf
• SpeakerNet, 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification, https://arxiv.org/abs/2010.12653
• VGGVox, a large-scale speaker identification dataset, https://arxiv.org/pdf/1706.08612.pdf
• GhostVLAD, Utterance-level Aggregation For Speaker Recognition In The Wild, https://arxiv.org/abs/1902.10107
• Conformer, Convolution-augmented Transformer for Speech Recognition, https://arxiv.org/abs/2005.08100
• ALConformer, A lite Conformer, no paper produced.
• Jasper, An End-to-End Convolutional Neural Acoustic Model, https://arxiv.org/abs/1904.03288
• Tacotron2, Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, https://arxiv.org/abs/1712.05884
• FastSpeech2, Fast and High-Quality End-to-End Text to Speech, https://arxiv.org/abs/2006.04558
• MelGAN, Generative Adversarial Networks for Conditional Waveform Synthesis, https://arxiv.org/abs/1910.06711
• Multi-band MelGAN, Faster Waveform Generation for High-Quality Text-to-Speech, https://arxiv.org/abs/2005.05106
• SRGAN, Modified version of SRGAN to do 1D Convolution, Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, https://arxiv.org/abs/1609.04802


• Speech Enhancement UNET, https://github.com/haoxiangsnr/Wave-U-Net-for-Speech-Enhancement
• Speech Enhancement ResNet UNET, added ResNet style into Speech Enhancement UNET, no paper produced.
• Speech Enhancement ResNext UNET, added ResNext style into Speech Enhancement UNET, no paper produced.
• Universal MelGAN, Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains, https://arxiv.org/abs/2011.09631
• FastVC, Faster and Accurate Voice Conversion using Transformer, no paper produced.
• FastSep, Faster and Accurate Speech Separation using Transformer, no paper produced.
• wav2vec 2.0, A Framework for Self-Supervised Learning of Speech Representations, https://arxiv.org/abs/2006.11477
• FastSpeechSplit, Unsupervised Speech Decomposition Via Triple Information Bottleneck using Transformer, no paper produced.
• Sepformer, Attention is All You Need in Speech Separation, https://arxiv.org/abs/2010.13154
• FastSpeechSplit, Faster and Accurate Speech Split Conversion using Transformer, no paper produced.
• HuBERT, Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, https://arxiv.org/pdf/2106.07447v1.pdf

9.1.5 References

If you use our software for research, please cite:

@misc{Malaya, Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow,
  author = {Husein, Zolkepli},
  title = {Malaya-Speech},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huseinzol05/malaya-speech}}
}

9.1.6 Acknowledgement

Thanks to KeyReply for sponsoring the private cloud to train Malaya-Speech models; without it, this library would collapse entirely.

9.2 Dataset

We want to open-source not just the code, but also the datasets, so everyone can validate them. You can check Malay-Dataset for our open datasets.


9.2.1 Citation

1. Please cite the repository if you use any of these corpora.
2. Please at least email us first before distributing the data. Remember, we give all this hard work away for free.
3. You only see the data, but nobody sees how much it cost us to make it public.

9.2.2 Donation

1. We want to make sure downloaders get the best bandwidth and top speed, so we host everything on S3. Please consider a donation to prevent top-speed shutdowns or broken links!
2. Husein really needs money to survive; he is still a human. 7053174643, CIMB Click, Husein Zolkepli

9.3 Installation

9.3.1 Installing/Upgrading From the PyPI

CPU version

$ pip install malaya

GPU version

$ pip install malaya[gpu]

We recommend using virtualenv for development.

9.3.2 From Source

Malaya is actively developed on Github. You can clone the public repo:

git clone https://github.com/huseinzol05/malaya

Once you have the source, you can install it into your site-packages with:

python setup.py install

9.3.3 Python

Malaya only supports Python 3.6 and above.


9.3.4 Tensorflow

Malaya only supports Tensorflow 1.15 and above. All examples were tested on Tensorflow versions 1.15.4, 2.4.1 and 2.5.

9.4 Malaya Cache

This tutorial is available as an IPython notebook at Malaya/example/caching.

9.4.1 Default Cache location

You can check where Malaya's default caching folder is. The caching folder stores any models, vocabs, and rules downloaded for specific modules.

[1]: import malaya

[2]: malaya.__home__
[2]: '/Users/huseinzolkepli/Malaya'

9.4.2 Change default Cache location

To change the default cache location, you need to set the MALAYA_CACHE OS environment variable before importing Malaya,

export MALAYA_CACHE=/Users/huseinzolkepli/Documents/Malaya

Or you can set it in your bash environment to make it permanent.

[1]: import os

os.environ['MALAYA_CACHE']='/Users/huseinzolkepli/Documents/malaya-cache'

[2]: import malaya

malaya.__home__
[2]: '/Users/huseinzolkepli/Documents/malaya-cache'

9.4.3 Cache subdirectories

Starting from version 1.0, Malaya puts models in subdirectories. You can print them by simply,

[3]: malaya.utils.print_cache()
Malaya/
    keyword-extraction/
        alxlnet/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        tiny-bert/
            model.pb
            sp10m.cased.bert.model
            sp10m.cased.bert.vocab
            version
    qa-squad/
        albert/
            model.pb
            sp10m.cased.v10.model
            sp10m.cased.v10.vocab
            version
        albert-quantized/
            model.pb
            sp10m.cased.v10.model
            sp10m.cased.v10.vocab
            version
        alxlnet/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        bert/
        tiny-bert/
            model.pb
            sp10m.cased.bert.model
            sp10m.cased.bert.vocab
            version
        xlnet/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        xlnet-quantized/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
    sentiment/
        albert/
            model.pb
            sp10m.cased.v10.model
            sp10m.cased.v10.vocab
            version
        alxlnet/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        bert/
            model.pb
            sp10m.cased.bert.model
            sp10m.cased.bert.vocab
            version
        xlnet/
            model.pb
        xlnet-quantized/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
    similarity/
        albert/
            model.pb
            sp10m.cased.v10.model
            sp10m.cased.v10.vocab
            version
        alxlnet/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        alxlnet-quantized/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        tiny-bert/
            model.pb
            sp10m.cased.bert.model
            sp10m.cased.bert.vocab
            version
    stem/
        lstm-bahdanau/
            model.pb
            stemmer.yttm
            version
    translation-en-ms/
        version
    wordvector/
        news/
            version
            wordvector.json
            wordvector.npy

9.4.4 Deleting specific model

Let's say you want to clear some space. Starting from version 1.0, you can specifically choose which model you want to delete.

[4]: malaya.utils.delete_cache('wordvector/news')
[4]: True

What happens if a directory does not exist?

[7]: malaya.utils.delete_cache('wordvector/news2')
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
in <module>
----> 1 malaya.utils.delete_cache('wordvector/news2')

~/Documents/tf-1.15/env/lib/python3.7/site-packages/malaya_boilerplate-0.0.1-py3.7.egg/malaya_boilerplate/utils.py in delete_cache(location)
    188     if not os.path.exists(location):
    189         raise Exception(
--> 190             f'folder not exist, please check path from `{__package__}.utils.print_cache()`'
    191         )
    192     if not os.path.isdir(location):

Exception: folder not exist, please check path from `malaya.utils.print_cache()`
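If you script cache cleanups, a small defensive wrapper avoids this exception; this is a sketch, not part of the Malaya API:

import malaya

def safe_delete(path):
    # delete a cached model only if it exists; swallow the
    # 'folder not exist' exception raised by malaya.utils.delete_cache
    try:
        return malaya.utils.delete_cache(path)
    except Exception as e:
        print(f'skipped {path}: {e}')
        return False

safe_delete('wordvector/news')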

9.4.5 Purge cache

You can also delete all models, purging the cache entirely, by simply,

[8]: malaya.utils.delete_all_cache
[8]:

I am not going to run it, because I prefer to keep my cache for now.

9.5 Running on Windows

9.5.1 UnicodeDecodeError: ‘charmap’ codec can’t decode byte

To solve this, go to Windows Settings > Administrative language settings > Change system locale, and check Beta: Use Unicode UTF-8 for worldwide language support. After restarting, everything works well. For the full discussion, check issue 25.

9.5.2 youtokentome failed to build

YouTokenToMe requires Cython and Microsoft Visual C++ 14.0 to compile, and Windows users usually break on this part, so we need to install Malaya without YouTokenToMe.

pip install malaya --no-deps
pip install tensorflow>=1.15

If we skip YouTokenToMe, we are not able to use,
1. language-detection module, https://malaya.readthedocs.io/en/latest/load-language-detection.html
2. True Case module, https://malaya.readthedocs.io/en/latest/load-true-case.html
3. Multinomial model in emotion analysis, https://malaya.readthedocs.io/en/latest/load-emotion.html#Load-multinomial-model


4. Multinomial model in sentiment analysis, https://malaya.readthedocs.io/en/latest/load-sentiment.html#Load-multinomial-model
5. Multinomial model in subjectivity analysis, https://malaya.readthedocs.io/en/latest/load-subjectivity.html#Load-multinomial-model
6. Multinomial model in toxicity analysis, https://malaya.readthedocs.io/en/latest/load-toxic.html#Load-multinomial-model

If you still need these models, you need to install Cython,

pip install cython

Then install Visual Studio from https://docs.microsoft.com/en-us/visualstudio/install/create-an-offline-installation-of-visual-studio?view=vs-2019, choose Visual Studio 2019 Build Tools (vs_buildtools.exe), and follow https://stackoverflow.com/questions/43847542/rc-exe-no-longer-found-in-vs-2015-command-prompt

9.5.3 Unable to use any T5 models

T5 depends on tensorflow-text, and currently there is no official tensorflow-text binary released for Windows, so there are no T5 models for Windows users. The T5 models are,
1. malaya.summarization.abstractive.transformer
2. malaya.generator.transformer
3. malaya.paraphrase.transformer

9.6 API

9.6.1 malaya

9.6.2 malaya.augmentation

malaya.augmentation.synonym(string: str, threshold: float = 0.5, top_n=5, cleaning=<function augmentation_textcleaning>, **kwargs)
    Augment a string using synonyms, https://github.com/huseinzol05/Malaya-Dataset#90k-synonym
    Parameters
        • string (str) –
        • threshold (float, optional (default=0.5)) – random selection for a word.
        • top_n (int, (default=5)) – number of nearest neighbors returned. Length of returned result should be top_n.
        • cleaning (function, (default=malaya.text.function.augmentation_textcleaning)) – function to clean text.
    Returns result
    Return type List[str]

malaya.augmentation.wordvector(string: str, wordvector, threshold: float = 0.5, top_n: int = 5, soft: bool = False, cleaning=<function augmentation_textcleaning>)
    Augment a string using a wordvector.
    Parameters
        • string (str) –
        • wordvector (object) – wordvector interface object.
        • threshold (float, optional (default=0.5)) – random selection for a word.
        • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest jarowinkler ratio. if False, it will throw an exception if a word is not in the dictionary.
        • top_n (int, (default=5)) – number of nearest neighbors returned. Length of returned result should be top_n.
        • cleaning (function, (default=malaya.text.function.augmentation_textcleaning)) – function to clean text.
    Returns result
    Return type List[str]

malaya.augmentation.transformer(string: str, model, threshold: float = 0.5, top_p: float = 0.9, top_k: int = 100, temperature: float = 1.0, top_n: int = 5, cleaning=None)
    Augment a string using transformer + nucleus sampling / top-k sampling.
    Parameters
        • string (str) –
        • model (object) – transformer interface object. Right now only supports BERT, ALBERT and ELECTRA.
        • threshold (float, optional (default=0.5)) – random selection for a word.
        • top_p (float, optional (default=0.8)) – cumulative sum of probabilities to sample a word. If top_p bigger than 0, the model will use nucleus sampling, else top-k sampling.
        • top_k (int, optional (default=100)) – k for top-k sampling.
        • temperature (float, optional (default=0.8)) – logits * temperature.
        • top_n (int, (default=5)) – number of nearest neighbors returned. Length of returned result should be top_n.
        • cleaning (function, (default=None)) – function to clean text.
    Returns result
    Return type List[str]

malaya.augmentation.replace_similar_consonants(word: str, threshold: float = 0.8)
    Naively replace consonants with similar consonants in a word.
    Parameters
        • word (str) –
        • threshold (float, optional (default=0.8)) –


    Returns result
    Return type List[str]

malaya.augmentation.replace_similar_vowels(word: str, threshold: float = 0.8)
    Naively replace vowels with similar vowels in a word.
    Parameters
        • word (str) –
        • threshold (float, optional (default=0.8)) –
    Returns result
    Return type List[str]

malaya.augmentation.socialmedia_form(word: str)
    Augment a word into social media form.
    Parameters word (str) –
    Returns result
    Return type List[str]

malaya.augmentation.vowel_alternate(word: str, threshold: float = 0.5)
    Augment a word into a vowel-alternate form. vowel_alternate('singapore') -> sngpore, vowel_alternate('kampung') -> kmpng, vowel_alternate('ayam') -> aym
    Parameters
        • word (str) –
        • threshold (float, optional (default=0.5)) –
    Returns result
    Return type str
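A minimal usage sketch for this module; the input sentence is arbitrary and the returned candidates are illustrative only:

import malaya

# synonym augmentation, returns a List[str] of augmented candidates
malaya.augmentation.synonym('saya suka makan ayam', top_n=3)

# naive character-level augmentations on a single word
malaya.augmentation.replace_similar_consonants('ayam')
malaya.augmentation.vowel_alternate('kampung')  # e.g. 'kmpng' per the docstring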

9.6.3 malaya.cluster

malaya.cluster.cluster_words(list_words: List[str], lowercase: bool = False)
    Cluster similar words based on structure, eg, ['mahathir mohamad', 'mahathir'] = ['mahathir mohamad']. big O = n^2
    Parameters
        • list_words (List[str]) –
        • lowercase (bool, optional (default=True)) – if True, will group using lowercase but maintain the original form.
    Returns string
    Return type List[str]

malaya.cluster.cluster_pos(result: List[Tuple[str, str]])
    Cluster similar POS.
    Parameters result (List[Tuple[str, str]]) –
    Returns result
    Return type Dict[str, List[str]]


malaya.cluster.cluster_entities(result: List[Tuple[str, str]])
    Cluster similar Entities.
    Parameters result (List[Tuple[str, str]]) –
    Returns result
    Return type Dict[str, List[str]]

malaya.cluster.cluster_tagging(result: List[Tuple[str, str]])
    Cluster any tagging results, as long as the data passed is [(string, label), (string, label)].
    Parameters result (List[Tuple[str, str]]) –
    Returns result
    Return type Dict[str, List[str]]

malaya.cluster.cluster_scatter(corpus: List[str], vectorizer, num_clusters: int = 5, titles: List[str] = None, colors: List[str] = None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, clustering=..., decomposition=..., ngram: Tuple[int, int] = (1, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)
    Plot a scatter plot of similar text clusters.
    Parameters
        • corpus (List[str]) –
        • vectorizer (class) – vectorizer class.
        • num_clusters (int, (default=5)) – size of unsupervised clusters.
        • titles (List[str], (default=None)) – list of titles, length must be the same as corpus.
        • colors (List[str], (default=None)) – list of colors, length must be the same as num_clusters.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]
        • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.
        • cleaning (function, (default=malaya.texts.function.simple_textcleaning)) – function to clean the corpus.
        • batch_size (int, (default=10)) – size of strings for each vectorization and attention. Only useful if using a transformer vectorizer.
    Returns dictionary
    Return type {'X': X, 'Y': Y, 'labels': clusters, 'vector': transformed_text_clean, 'titles': titles}

malaya.cluster.cluster_dendogram(corpus: List[str], vectorizer, titles: List[str] = None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, random_samples: float = 0.3, ngram: Tuple[int, int] = (1, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)
    Plot a hierarchical dendogram of similar texts.
    Parameters
        • corpus (List[str]) –


        • vectorizer (class) – vectorizer class.
        • num_clusters (int, (default=5)) – size of unsupervised clusters.
        • titles (List[str], (default=None)) – list of titles, length must be the same as corpus.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]
        • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
        • random_samples (float, (default=0.3)) – random samples from the corpus, 0.3 means 30%.
        • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.
        • batch_size (int, (default=20)) – size of strings for each vectorization and attention. Only useful if using a transformer vectorizer.
    Returns dictionary
    Return type {'linkage_matrix': linkage_matrix, 'titles': titles}

malaya.cluster.cluster_graph(corpus: List[str], vectorizer, threshold: float = 0.9, num_clusters: int = 5, titles: List[str] = None, colors: List[str] = None, stopwords=<function get_stopwords>, ngram: Tuple[int, int] = (1, 3), cleaning=<function simple_textcleaning>, clustering=..., figsize: Tuple[int, int] = (17, 9), with_labels: bool = True, batch_size: int = 20)
    Plot an undirected graph of similar texts.
    Parameters
        • corpus (List[str]) –
        • vectorizer (class) – vectorizer class.
        • threshold (float, (default=0.9)) – 0.9 means, 90% above absolute pearson correlation.
        • num_clusters (int, (default=5)) – size of unsupervised clusters.
        • titles (List[str], (default=True)) – list of titles, length must be the same as corpus.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str] or List[str] or Tuple[str].
        • cleaning (function, (default=malaya.texts.function.simple_textcleaning)) – function to clean the corpus.
        • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.
        • batch_size (int, (default=20)) – size of strings for each vectorization and attention. Only useful if using a transformer vectorizer.
    Returns dictionary
    Return type {'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}

malaya.cluster.cluster_entity_linking(corpus: List[str], vectorizer, entity_model, topic_modeling_model, threshold: float = 0.3, topic_decomposition: int = 2, topic_length: int = 10, fuzzy_ratio: int = 70, accepted_entities: List[str] = ['law', 'location', 'organization', 'person', 'event'], cleaning=<function simple_textcleaning>, colors: List[str] = None, stopwords=<function get_stopwords>, max_df: float = 1.0, min_df: int = 1, ngram: Tuple[int, int] = (2, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)
    Plot an undirected graph for Entities and topics relationship.
    Parameters
        • corpus (list or str) –
        • vectorizer (class) –
        • titles (list) – list of titles, length must be the same as corpus.
        • colors (list) – list of colors, length must be the same as num_clusters.
        • threshold (float, (default=0.3)) – 0.3 means, 30% above absolute pearson correlation.
        • topic_decomposition (int, (default=2)) – size of decomposition.
        • topic_length (int, (default=10)) – size of topic models.
        • fuzzy_ratio (int, (default=70)) – size of ratio for fuzzywuzzy.
        • max_df (float, (default=0.95)) – maximum of a word selected based on document frequency.
        • min_df (int, (default=2)) – minimum of a word selected based on document frequency.
        • ngram (tuple, (default=(1,3))) – n-grams size to train a corpus.
        • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str] or List[str] or Tuple[str]
    Returns dictionary
    Return type {'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}
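A small sketch of the pure-text helper, which needs no model; the grouping in the comment is illustrative:

import malaya

words = ['mahathir mohamad', 'mahathir', 'najib razak', 'najib']
# structurally-similar words collapse into their longest form,
# e.g. ['mahathir mohamad', 'najib razak']
malaya.cluster.cluster_words(words)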

9.6.4 malaya.constituency

malaya.constituency.available_transformer()
    List available transformer models.

malaya.constituency.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
    Load Transformer Constituency Parsing model, transfer learning Transformer + self-attentive parsing.
    Parameters
        • model (str, optional (default='bert')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.


            – 'tiny-bert' - Google BERT TINY parameters.
            – 'albert' - Google ALBERT BASE parameters.
            – 'tiny-albert' - Google ALBERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result
    Return type malaya.model.tf.Constituency class
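A loading sketch using only the loader calls documented above; the parse methods on the returned class are covered in the Constituency Parsing section:

import malaya

malaya.constituency.available_transformer()
model = malaya.constituency.transformer(model='xlnet', quantized=False)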

9.6.5 malaya.coref

malaya.coref.parse_from_dependency(models, string: str, references: List[str] = ['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'], rejected_references: List[str] = ['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka', 'nya'], acceptable_subjects: List[str] = ['flat', 'subj', 'nsubj', 'csubj', 'obj'], acceptable_nested_subjects: List[str] = ['compound', 'flat'], split_nya: bool = True, aggregate: Callable = <function mean>, top_k: int = 20)
    Apply Coreference Resolution using stacks of dependency models.
    Parameters
        • models (list) – list of dependency models, must have a vectorize method.
        • string (str) –
        • references (List[str], optional (default=['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'])) – list of references.
        • rejected_references (List[str], optional (default=['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'])) – list of rejected references during populating subjects.
        • acceptable_subjects (List[str], optional) – List of dependency labels for subjects.
        • acceptable_nested_subjects (List[str], optional) – List of dependency labels for nested subjects, eg, syarikat (obl) facebook (compound).
        • split_nya (bool, optional (default=True)) – split nya, eg, disifatkannya -> disifatkan, nya.
        • aggregate (Callable, optional (default=numpy.mean)) – Aggregate function to aggregate list of vectors from model.vectorize.
        • top_k (int, optional (default=20)) – only accept near top_k to assume a coherence.
    Returns result – {'text': ['Husein','Zolkepli','suka','makan','ayam','.','Dia','pun','suka','makan','daging','.'], 'coref': {6: {'index': [0, 1], 'text': ['Husein', 'Zolkepli']}}}
    Return type Dict[text, coref]
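A usage sketch built from the docstring's own example; it stacks a single dependency model, loaded via malaya.dependency.transformer (documented below):

import malaya

model = malaya.dependency.transformer(model='alxlnet')
string = 'Husein Zolkepli suka makan ayam. Dia pun suka makan daging.'
# 'Dia' (token index 6) should resolve to ['Husein', 'Zolkepli']
malaya.coref.parse_from_dependency([model], string)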


9.6.6 malaya.dependency

malaya.dependency.describe()
    Describe Dependency supported.

malaya.dependency.dependency_graph(tagging, indexing)
    Return a helper object for dependency parser results. Only accepts tagging and indexing outputs from dependency models.

malaya.dependency.available_transformer(version: str = 'v2')
    List available transformer dependency parsing models.
    Parameters version (str, optional (default='v2')) – Version supported. Allowed values:
        • 'v1' - version 1, maintained for knowledge graph.
        • 'v2' - Trained on a bigger dataset, better version.

malaya.dependency.transformer(version: str = 'v2', model: str = 'xlnet', quantized: bool = False, **kwargs)
    Load Transformer Dependency Parsing model, transfer learning Transformer + biaffine attention.
    Parameters
        • version (str, optional (default='v2')) – Version supported. Allowed values:
            – 'v1' - version 1, maintained for knowledge graph.
            – 'v2' - Trained on a bigger dataset, better version.
        • model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.
            – 'tiny-bert' - Google BERT TINY parameters.
            – 'albert' - Google ALBERT BASE parameters.
            – 'tiny-albert' - Google ALBERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
            – 'alxlnet' - Malaya ALXLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result – List of model classes:
        • if bert in model, will return malaya.model.bert.DependencyBERT.
        • if xlnet in model, will return malaya.model.xlnet.DependencyXLNET.
    Return type model
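A loading sketch; the loader and the dependency_graph helper are documented above, while the predict call and its return shape are an assumption based on the Dependency Parsing tutorial:

import malaya

model = malaya.dependency.transformer(version='v2', model='xlnet')
# assumption: predict(string) returns (d_object, tagging, indexing),
# whose tagging/indexing feed the documented helper below
d_object, tagging, indexing = model.predict('Dr Mahathir menasihati mereka')
graph = malaya.dependency.dependency_graph(tagging, indexing)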


9.6.7 malaya.emotion

malaya.emotion.available_transformer()
    List available transformer emotion analysis models.

malaya.emotion.multinomial(**kwargs)
    Load multinomial emotion model.
    Returns result
    Return type malaya.model.ml.MulticlassBayes class

malaya.emotion.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
    Load Transformer emotion model.
    Parameters
        • model (str, optional (default='bert')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.
            – 'tiny-bert' - Google BERT TINY parameters.
            – 'albert' - Google ALBERT BASE parameters.
            – 'tiny-albert' - Google ALBERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
            – 'alxlnet' - Malaya ALXLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result – List of model classes:
        • if bert in model, will return malaya.model.bert.MulticlassBERT.
        • if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.
    Return type model
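A loading sketch showing the two documented entry points; prediction methods on the returned classes are covered in the Emotion Analysis section:

import malaya

fast_model = malaya.emotion.multinomial()  # lightweight baseline
deep_model = malaya.emotion.transformer(model='albert', quantized=True)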

9.6.8 malaya.entity

malaya.entity.describe()
    Describe Entities supported.

malaya.entity.describe_ontonotes5()
    Describe OntoNotes5 Entities supported. https://spacy.io/api/annotation#named-entities

malaya.entity.available_transformer()
    List available transformer Entity Tagging models.

malaya.entity.available_transformer_ontonotes5()
    List available transformer Entity Tagging models trained on OntoNotes 5 Bahasa.

malaya.entity.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
    Load Transformer Entity Tagging model, transfer learning Transformer + CRF.
    Parameters


        • model (str, optional (default='bert')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.
            – 'tiny-bert' - Google BERT TINY parameters.
            – 'albert' - Google ALBERT BASE parameters.
            – 'tiny-albert' - Google ALBERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
            – 'alxlnet' - Malaya ALXLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result – List of model classes:
        • if bert in model, will return malaya.model.bert.TaggingBERT.
        • if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.
    Return type model

malaya.entity.transformer_ontonotes5(model: str = 'xlnet', quantized: bool = False, **kwargs)
    Load Transformer Entity Tagging model trained on OntoNotes 5 Bahasa, transfer learning Transformer + CRF.
    Parameters
        • model (str, optional (default='bert')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.
            – 'tiny-bert' - Google BERT TINY parameters.
            – 'albert' - Google ALBERT BASE parameters.
            – 'tiny-albert' - Google ALBERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
            – 'alxlnet' - Malaya ALXLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result – List of model classes:
        • if bert in model, will return malaya.model.bert.TaggingBERT.
        • if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.
    Return type model

malaya.entity.general_entity(model=None)
    Load Regex-based general entities tagging along with another supervised entity tagging model.
    Parameters model (object) – model must have a predict method. Make sure the predict method returns [(string, label), (string, label)].
    Returns result
    Return type malaya.text.entity.EntityRegex class
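A sketch combining the two documented loaders; general_entity wraps any model with a compatible predict method:

import malaya

model = malaya.entity.transformer(model='alxlnet')
# regex-based general entities (dates, money, ...) stacked on the model
combined = malaya.entity.general_entity(model=model)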

9.6.9 malaya.generator

malaya.generator.ngrams(sequence, n: int, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)
    Generate ngrams.
    Parameters
        • sequence (List[str]) – list of tokenized words.
        • n (int) – ngram size
    Returns result
    Return type List[Tuple[str, str]]

malaya.generator.pos_entities_ngram(result_pos: List[Tuple[str, str]], result_entities: List[Tuple[str, str]], ngram: Tuple[int, int] = (1, 3), accept_pos: List[str] = ['NOUN', 'PROPN', 'VERB'], accept_entities: List[str] = ['law', 'location', 'organization', 'person', 'time'])
    Generate ngrams.
    Parameters
        • result_pos (List[Tuple[str, str]]) – result from POS recognition.
        • result_entities (List[Tuple[str, str]]) – result of Entities recognition.
        • ngram (Tuple[int, int]) – ngram sizes.
        • accept_pos (List[str]) – accepted POS elements.
        • accept_entities (List[str]) – accepted entities elements.
    Returns result
    Return type list

malaya.generator.sentence_ngram(sentence: str, ngram: Tuple[int, int] = (1, 3))
    Generate ngrams for a text.
    Parameters
        • sentence (str) –
        • ngram (tuple) – ngram sizes.
    Returns result
    Return type list

malaya.generator.babble(string: str, model, generate_length: int = 30, leed_out_len: int = 1, temperature: float = 1.0, top_k: int = 100, burnin: int = 15, batch_size: int = 5)
    Use pretrained transformer models to generate a string given a prefix string. https://github.com/nyu-dl/bert-gen, https://arxiv.org/abs/1902.04094
    Parameters
        • string (str) –
        • model (object) – transformer interface object. Right now only supports BERT, ALBERT and ELECTRA.


        • generate_length (int, optional (default=256)) – length of sentence to generate.
        • leed_out_len (int, optional (default=1)) – length of extra masks for each iteration.
        • temperature (float, optional (default=1.0)) – logits * temperature.
        • top_k (int, optional (default=100)) – k for top-k sampling.
        • burnin (int, optional (default=15)) – for the first burnin steps, sample from the entire next word distribution, instead of top_k.
        • batch_size (int, optional (default=5)) – generate batch_size sentences.
    Returns result
    Return type List[str]

malaya.generator.available_gpt2()
    List available gpt2 generator models.

malaya.generator.gpt2(model: str = '345M', generate_length: int = 256, temperature: float = 1.0, top_k: int = 40, **kwargs)
    Load GPT2 model to generate a string given a prefix string.
    Parameters
        • model (str, optional (default='345M')) – Model architecture supported. Allowed values:
            – '117M' - GPT2 117M parameters.
            – '345M' - GPT2 345M parameters.
        • generate_length (int, optional (default=256)) – length of sentence to generate.
        • temperature (float, optional (default=1.0)) – temperature value, value should be between 0 and 1.
        • top_k (int, optional (default=40)) – top-k in nucleus sampling selection.
    Returns result
    Return type malaya.transformers.gpt2.Model class

malaya.generator.available_transformer()
    List available transformer models.

malaya.generator.transformer(model: str = 't5', quantized: bool = False, **kwargs)
    Load Transformer model to generate a string given an isi penting (important points).
    Parameters
        • model (str, optional (default='base')) – Model architecture supported. Allowed values:
            – 't5' - T5 BASE parameters.
            – 'small-t5' - T5 SMALL parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.


    Returns result – List of model classes:
        • if t5 in model, will return malaya.model.t5.Generator.
    Return type model
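The ngram helpers are pure functions, so their outputs are deterministic; a quick sketch:

import malaya

tokens = ['saya', 'suka', 'makan', 'ayam']
# bigrams: [('saya', 'suka'), ('suka', 'makan'), ('makan', 'ayam')]
malaya.generator.ngrams(tokens, n=2)

# all 1- to 3-grams of a raw sentence
malaya.generator.sentence_ngram('saya suka makan ayam', ngram=(1, 3))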

9.6.10 malaya.keyword_extraction

malaya.keyword_extraction.rake(string: str, model=None, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)
    Extract keywords using the Rake algorithm.
    Parameters
        • string (str) –
        • model (Object, optional (default=None)) – Transformer model or any model that has an attention method.
        • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.
        • top_k (int, optional (default=5)) – return top-k results.
        • atleast (int, optional (default=1)) – minimum count appearing in the string to accept as a candidate.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]. For the automatic ngram generator.
    Returns result
    Return type Tuple[float, str]

malaya.keyword_extraction.textrank(string: str, model=None, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)
    Extract keywords using the Textrank algorithm.
    Parameters
        • string (str) –
        • model (Object, optional (default=None)) – model that has a fit_transform or vectorize method.
        • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.
        • top_k (int, optional (default=5)) – return top-k results.
        • atleast (int, optional (default=1)) – minimum count appearing in the string to accept as a candidate.


        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]
    Returns result
    Return type Tuple[float, str]

malaya.keyword_extraction.attention(string: str, model, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)
    Extract keywords using an Attention mechanism.
    Parameters
        • string (str) –
        • model (Object) – Transformer model or any model that has an attention method.
        • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.
        • top_k (int, optional (default=5)) – return top-k results.
        • atleast (int, optional (default=1)) – minimum count appearing in the string to accept as a candidate.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]
    Returns result
    Return type Tuple[float, str]

malaya.keyword_extraction.similarity(string: str, model, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)
    Extract keywords using sentence embedding vs keyword embedding similarity.
    Parameters
        • string (str) –
        • model (Object) – Transformer model or any model that has a vectorize method.
        • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.
        • top_k (int, optional (default=5)) – return top-k results.
        • atleast (int, optional (default=1)) – minimum count appearing in the string to accept as a candidate.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]
    Returns result
    Return type Tuple[float, str]

malaya.keyword_extraction.available_transformer()
    List available transformer keyword similarity models.


malaya.keyword_extraction.transformer(model: str = 'bert', quantized: bool = False, **kwargs)
    Load Transformer keyword similarity model.
    Parameters
        • model (str, optional (default='bert')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.
            – 'tiny-bert' - Google BERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
            – 'alxlnet' - Malaya ALXLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result – List of model classes:
        • if bert in model, will return malaya.model.bert.KeyphraseBERT.
        • if xlnet in model, will return malaya.model.xlnet.KeyphraseXLNET.
    Return type model
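rake works without any model when the ngram generator is built automatically from stopwords; a minimal sketch (the sklearn vectorizer for textrank is an assumption based on the parameter docs, which only require a fit_transform or vectorize method):

import malaya
from sklearn.feature_extraction.text import TfidfVectorizer

string = 'Kerajaan akan membentangkan bajet baharu di Parlimen minggu depan'
# returns top-k (score, keyword) tuples using auto-generated ngrams
malaya.keyword_extraction.rake(string, top_k=5)

# textrank expects a model with fit_transform or vectorize
malaya.keyword_extraction.textrank(string, model=TfidfVectorizer(), top_k=5)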

9.6.11 malaya.knowledge_graph

malaya.knowledge_graph.parse_from_dependency(tagging: List[Tuple[str, str]], indexing: List[Tuple[str, str]], subjects: List[List[str]] = [['flat', 'subj', 'nsubj', 'csubj']], relations: List[List[str]] = [['acl', 'xcomp', 'ccomp', 'obj', 'conj', 'advcl'], ['obj']], objects: List[List[str]] = [['obj', 'compound', 'flat', 'nmod', 'obl']], get_networkx: bool = True)
    Generate knowledge graphs from dependency parsing; we suggest using dependency parsing v1.
    Parameters
        • tagging (List[Tuple(str, str)]) – tagging result from a dependency model.
        • indexing (List[Tuple(str, str)]) – indexing result from a dependency model.
        • subjects (List[List[str]], optional) – List of dependency labels for subjects.
        • relations (List[List[str]], optional) – List of dependency labels for relations.
        • objects (List[List[str]], optional) – List of dependency labels for objects.
        • get_networkx (bool, optional (default=True)) – If True, will generate networkx.MultiDiGraph.
    Returns result
    Return type Dict[result, G]

malaya.knowledge_graph.available_transformer()
    List available transformer models.

malaya.knowledge_graph.transformer(model: str = 'small-t5', quantized: bool = False, **kwargs)
    Load transformer to generate knowledge graphs in triples format from texts, MS text -> EN triples format.
    Parameters
        • model (str, optional (default='small-t5')) – Model architecture supported. Allowed values:
            – 't5' - T5 BASE parameters.
            – 'small-t5' - T5 SMALL parameters.
            – 'tiny-t5' - T5 TINY parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result
    Return type malaya.model.t5.KnowledgeGraph class
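A sketch of the dependency route (v1, as the docstring suggests); the predict call and its return shape are an assumption based on the Dependency Parsing tutorial:

import malaya

model = malaya.dependency.transformer(version='v1', model='xlnet')
# assumption: predict returns (d_object, tagging, indexing)
d_object, tagging, indexing = model.predict('Dr Mahathir menasihati mereka')
result = malaya.knowledge_graph.parse_from_dependency(tagging, indexing)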

9.6.12 malaya.language_detection

malaya.language_detection.fasttext(quantized: bool = True, **kwargs)
    Load Fasttext language detection model. Original size is 353MB, quantized size 31.1MB.
    Parameters quantized (bool, optional (default=True)) – if True, load quantized fasttext model. Else, load original fasttext model.
    Returns result
    Return type malaya.model.ml.LanguageDetection class

malaya.language_detection.deep_model(quantized: bool = False, **kwargs)
    Load deep learning language detection model. Original size is 51.2MB, quantized size 12.8MB.
    Parameters quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.

    Returns result
    Return type malaya.model.tf.DeepLang class
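A loading sketch; prediction methods on the returned classes are covered in the Language Detection section:

import malaya

fast = malaya.language_detection.fasttext(quantized=True)
deep = malaya.language_detection.deep_model()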

9.6.13 malaya.lexicon

malaya.lexicon.random_walk(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, beta: float = 0.9, arccos: bool = True, normalization: bool = True, soft: bool = False, silent: bool = False)
    Induce a lexicon using the random walk technique, used in the paper https://arxiv.org/pdf/1606.02820.pdf
    Parameters
        • lexicon (Dict[str, List[str]]) – curated lexicon from an expert domain, {'label1': [str], 'label2': [str]}.
        • wordvector (object) – wordvector interface object.


        • pool_size (int, optional (default=10)) – pick top-pool_size from each lexicon.
        • top_n (int, optional (default=20)) – top_n for each vector will be multiplied with similarity_power.
        • similarity_power (float, optional (default=10.0)) – extra score for top_n; less will generate less induced bias but a high chance of an unbalanced outcome.
        • beta (float, optional (default=0.9)) – penalty score, towards 1.0 means less penalty. 0 < beta < 1.
        • arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T). If False, covariance + 1.
        • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
        • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest jarowinkler ratio. if False, it will throw an exception if a word is not in the dictionary.
        • silent (bool, optional (default=False)) – if True, will not print any logs.
    Returns result
    Return type tuple(labels[argmax(scores), axis = 1], scores, labels)

malaya.lexicon.propagate_probabilistic(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, arccos: bool = True, normalization: bool = True, soft: bool = False, silent: bool = False)
    Learns polarity scores via standard label propagation from lexicon sets.
    Parameters
        • lexicon (Dict[str, List[str]]) – curated lexicon from an expert domain, {'label1': [str], 'label2': [str]}.
        • wordvector (object) – wordvector interface object.
        • pool_size (int, optional (default=10)) – pick top-pool_size from each lexicon.
        • top_n (int, optional (default=20)) – top_n for each vector will be multiplied with similarity_power.
        • similarity_power (float, optional (default=10.0)) – extra score for top_n; less will generate less induced bias but a high chance of an unbalanced outcome.
        • arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T). If False, covariance + 1.
        • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
        • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest jarowinkler ratio. if False, it will throw an exception if a word is not in the dictionary.
        • silent (bool, optional (default=False)) – if True, will not print any logs.
    Returns result
    Return type tuple(labels[argmax(scores), axis = 1], scores, labels)


malaya.lexicon.propagate_graph(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, normalization: bool = True, soft: bool = False, silent: bool = False)
    Graph propagation method adapted from Velikovich, Leonid, et al. "The viability of web-derived polarity lexicons." http://www.aclweb.org/anthology/N10-1119
    Parameters
        • lexicon (Dict[str, List[str]]) – curated lexicon from an expert domain, {'label1': [str], 'label2': [str]}.
        • wordvector (object) – wordvector interface object.
        • pool_size (int, optional (default=10)) – pick top-pool_size from each lexicon.
        • top_n (int, optional (default=20)) – top_n for each vector will be multiplied with similarity_power.
        • similarity_power (float, optional (default=10.0)) – extra score for top_n; less will generate less induced bias but a high chance of an unbalanced outcome.
        • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
        • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest jarowinkler ratio. if False, it will throw an exception if a word is not in the dictionary.
        • silent (bool, optional (default=False)) – if True, will not print any logs.
    Returns result
    Return type tuple(labels[argmax(scores), axis = 1], scores, labels)
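A usage sketch; the seed lexicon is an arbitrary toy example, and the wordvector object is a placeholder for any wordvector interface (see the Vector section):

import malaya

seed = {'positive': ['baik', 'gembira'], 'negative': ['teruk', 'sedih']}
wv = ...  # placeholder: any wordvector interface object
results, scores, labels = malaya.lexicon.random_walk(seed, wv, pool_size=5)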

9.6.14 malaya.normalize

malaya.normalize.normalizer(speller=None, **kwargs)
    Load a Normalizer using any spelling correction model.
    Parameters speller (spelling correction object, optional (default=None)) –
    Returns result
    Return type malaya.normalize.Normalizer class

class malaya.normalize.Normalizer

    normalize(string: str, check_english: bool = True, normalize_text: bool = True, normalize_entity: bool = True, normalize_url: bool = False, normalize_email: bool = False, normalize_year: bool = True, normalize_telephone: bool = True)
        Normalize a string.
        Parameters
            • string (str) –
            • check_english (bool, (default=True)) – check whether a word is in the english dictionary.
            • normalize_text (bool, (default=True)) – if True, will try to replace shortforms with the internal corpus.


            • normalize_entity (bool, (default=True)) – normalize entities; only affects date, datetime, time and money pattern strings.
            • normalize_url (bool, (default=False)) – if True, replace :// with empty and . with dot. https://huseinhouse.com -> https huseinhouse dot com.
            • normalize_email (bool, (default=False)) – if True, replace @ with di, . with dot. [email protected] -> husein dot zol kosong lima di gmail dot com.
            • normalize_year (bool, (default=True)) – if True, tahun 1987 -> tahun sembilan belas lapan puluh tujuh. if True, 1970-an -> sembilan belas tujuh puluh an. if False, tahun 1987 -> tahun seribu sembilan ratus lapan puluh tujuh.
            • normalize_telephone (bool, (default=True)) – if True, no 012-1234567 -> no kosong satu dua, satu dua tiga empat lima enam tujuh
        Returns string
        Return type normalized string
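A usage sketch, reusing the docstring's own year example; the expected output in the comment comes straight from the parameter docs:

import malaya

normalizer = malaya.normalize.normalizer()
# 'tahun 1987' -> 'tahun sembilan belas lapan puluh tujuh' when normalize_year=True
normalizer.normalize('tahun 1987', normalize_year=True)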

9.6.15 malaya.nsfw

malaya.nsfw.lexicon(**kwargs)
    Load Lexicon NSFW model.
    Returns result
    Return type malaya.text.lexicon.nsfw.Lexicon class

malaya.nsfw.multinomial(**kwargs)
    Load multinomial NSFW model.
    Returns result
    Return type malaya.model.ml.BAYES class
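A loading sketch for both documented NSFW detectors:

import malaya

lexicon_model = malaya.nsfw.lexicon()    # lexicon/rule based
bayes_model = malaya.nsfw.multinomial()  # multinomial baseline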

9.6.16 malaya.num2word

malaya.num2word.to_cardinal(number)
    Translate from a number to its cardinal text representation.
    Parameters number (real number) –
    Returns result – cardinal representation
    Return type str

malaya.num2word.to_ordinal(number)
    Translate from a number to its ordinal text representation.
    Parameters number (real number) –
    Returns result – ordinal representation
    Return type str

malaya.num2word.to_ordinal_num(number)
    Translate from a number to its ordinal numbering text representation.
    Parameters number (int) –
    Returns result – ordinal numbering representation


Return type str

malaya.num2word.to_currency(value)
Translate from number input to cardinal currency text representation.
Parameters value (int) –
Returns result – cardinal currency representation
Return type str

malaya.num2word.to_year(value)
Translate from number input to cardinal year text representation.
Parameters value (int) –
Returns result – cardinal year representation
Return type str
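A short sketch; the outputs in the comments are illustrative expectations, not verified:

    import malaya

    malaya.num2word.to_cardinal(15)    # e.g. 'lima belas'
    malaya.num2word.to_ordinal(15)     # e.g. 'kelima belas'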

9.6.17 malaya.paraphrase

malaya.paraphrase.available_transformer()
List available transformer models.

malaya.paraphrase.transformer(model: str = 'small-t5', quantized: bool = False, **kwargs)
Load Malaya transformer encoder-decoder model to generate a paraphrase given a string.
Parameters
• model (str, optional (default='small-t5')) – Model architecture supported. Allowed values:
– 't5' - T5 BASE parameters.
– 'small-t5' - T5 SMALL parameters.
– 'tiny-t5' - T5 TINY parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if t5 in model, will return malaya.model.t5.Paraphrase.
Return type model
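A minimal sketch; the sentence is hypothetical and greedy_decoder comes from malaya.model.t5.Paraphrase (see 9.6.46):

    import malaya

    model = malaya.paraphrase.transformer(model='small-t5')
    model.greedy_decoder(['Malaya ialah perpustakaan NLP untuk bahasa Melayu.'])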

9.6.18 malaya.pos

malaya.pos.describe()
Describe Part-Of-Speech supported.

malaya.pos.available_transformer()
List available transformer Part-Of-Speech Tagging models.

malaya.pos.naive(string: str)
Recognize POS in a string using Regex.
Parameters string (str) –
Returns string


Return type List[Tuple[str, str]]

malaya.pos.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
Load Transformer POS Tagging model, transfer learning Transformer + CRF.
Parameters
• model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.TaggingBERT.
• if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.
Return type model
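A minimal sketch; the sentence is hypothetical and predict comes from malaya.model.bert.TaggingBERT (see 9.6.41):

    import malaya

    model = malaya.pos.transformer(model='tiny-bert')
    model.predict('Kuala Lumpur ialah ibu negara Malaysia.')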

9.6.19 malaya.preprocessing

malaya.preprocessing.unpack_english_contractions(text)
Replace English contractions in text str with their unshortened forms. N.B. The "'d" and "'s" forms are ambiguous (had/would, is/has/possessive), so are left as-is.
Important Note: The function is taken from textacy (https://github.com/chartbeat-labs/textacy).

malaya.preprocessing.preprocessing(normalize: List[str] = ['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'], annotate: List[str] = ['allcaps', 'elongated', 'repeated', 'emphasis', 'censored', 'hashtag'], lowercase: bool = True, fix_unidecode: bool = True, expand_english_contractions: bool = True, translate_english_to_bm: bool = True, speller=None, segmenter=None, stemmer=None, **kwargs)
Load Preprocessing class.
Parameters
• normalize (list) – normalizing tokens, can check all supported normalizing at malaya.preprocessing.get_normalize().
• annotate (list) – annotate tokens, only accepts ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].
• lowercase (bool) –
• fix_unidecode (bool) –
• expand_english_contractions (bool) – expand english contractions.


• translate_english_to_bm (bool) – translate english words to bahasa malaysia words.
• speller (object) – spelling correction object, needs to have a correct method.
• segmenter (object) – segmentation object, needs to have a segment method. If provided, it will expand hashtags, #mondayblues == monday blues.
• stemmer (object) – stemmer object, needs to have a stem method. If provided, it will stem or lemmatize the string.
Returns result
Return type malaya.preprocessing.Preprocessing class

class malaya.preprocessing.Tokenizer

tokenize(text)
Tokenize string.
Parameters text (str) –
Returns result
Return type List[str]

class malaya.preprocessing.Preprocessing
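A minimal tokenizer sketch; a no-argument constructor is assumed and the input string is hypothetical:

    import malaya

    tokenizer = malaya.preprocessing.Tokenizer()  # assumed no required arguments
    tokenizer.tokenize('Husein comel, kan?')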

9.6.20 malaya.qa

malaya.qa.available_transformer_squad()
List available Transformer Span models.

malaya.qa.transformer_squad(model: str = 'xlnet', quantized: bool = False, **kwargs)
Load Transformer Span model trained on SQUAD V2 dataset.
Parameters
• model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.model.tf.SQUAD class
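A minimal loading sketch; the question-answering interface lives on the returned malaya.model.tf.SQUAD object, which is not documented in this section:

    import malaya

    model = malaya.qa.transformer_squad(model='tiny-bert')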


9.6.21 malaya.relevancy

malaya.relevancy.available_transformer()
List available transformer relevancy analysis models.

malaya.relevancy.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
Load Transformer relevancy model.
Parameters
• model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
– 'bigbird' - Google BigBird BASE parameters.
– 'tiny-bigbird' - Malaya BigBird BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.MulticlassBERT.
• if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.
• if bigbird in model, will return malaya.model.bigbird.MulticlassBigBird.
Return type model
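A minimal sketch; the input string is hypothetical and predict_proba comes from malaya.model.bert.MulticlassBERT (see 9.6.41):

    import malaya

    model = malaya.relevancy.transformer(model='tiny-bert')
    model.predict_proba(['Kerajaan mengumumkan bantuan baharu untuk semua rakyat.'])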

9.6.22 malaya.segmentation

malaya.segmentation.viterbi(max_split_length: int = 20, **kwargs)
Load Segmenter class using Viterbi algorithm.
Parameters
• max_split_length (int, (default=20)) – maximum length of words in a sentence to segment.
• validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns result
Return type malaya.segmentation.Segmenter class

malaya.segmentation.available_transformer()
List available transformer models.


malaya.segmentation.transformer(model: str = 'small', quantized: bool = False, **kwargs)
Load transformer encoder-decoder model for segmentation.
Parameters
• model (str, optional (default='small')) – Model architecture supported. Allowed values:
– 'small' - Transformer SMALL parameters.
– 'base' - Transformer BASE parameters.
– 'super-tiny-t5' - T5 SUPER TINY parameters.
– 'super-super-tiny-t5' - T5 SUPER SUPER TINY parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.model.tf.Segmentation class

class malaya.segmentation.Segmenter

segment(strings: List[str])
Segment strings. Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"
Parameters strings (List[str]) –
Returns result
Return type List[str]
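A minimal sketch using the Viterbi segmenter and the example input above:

    import malaya

    segmenter = malaya.segmentation.viterbi()
    segmenter.segment(['sayasygkan negarasaya'])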

9.6.23 malaya.sentiment

malaya.sentiment.available_transformer()
List available transformer sentiment analysis models.

malaya.sentiment.multinomial(**kwargs)
Load multinomial sentiment model.
Returns result
Return type malaya.model.ml.Bayes class

malaya.sentiment.transformer(model: str = 'bert', quantized: bool = False, **kwargs)
Load Transformer sentiment model.
Parameters
• model (str, optional (default='bert')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.


• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.BinaryBERT.
• if xlnet in model, will return malaya.model.xlnet.BinaryXLNET.
Return type model
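A minimal sketch; the input string is hypothetical and predict_proba comes from malaya.model.bert.BinaryBERT (see 9.6.41):

    import malaya

    model = malaya.sentiment.transformer(model='tiny-albert')
    model.predict_proba(['kerajaan sebenarnya sangat prihatin dengan rakyat'])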

9.6.24 malaya.spell

class malaya.spell.Probability(corpus, sp_tokenizer=None)
The SpellCorrector extends the functionality of Peter Norvig's spell-corrector in http://norvig.com/spell-correct.html and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews
Also added custom vowels augmentation.

P(word)
Probability of word.

correct(word: str, **kwargs)
Most probable spelling correction for word.
Parameters word (str) –
Returns result
Return type str

class malaya.spell.Symspell(model, verbosity, corpus, k=10)
The SymspellCorrector extends the functionality of symspeller, https://github.com/mammothb/symspellpy and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews
Also added custom vowels augmentation.

edit_step(word)
Generate candidates given a word.
Parameters word (str) –
Returns result
Return type {candidate1, candidate2}

edit_candidates(word)
Generate candidates given a word.
Parameters word (str) –
Returns result
Return type List[str]

correct(word: str, **kwargs)
Most probable spelling correction for word.
Parameters word (str) –
Returns result
Return type str


correct_text(text: str)
Correct all the words within a text, returning the corrected text.
Parameters text (str) –
Returns result
Return type str

correct_match(match)
Spell-correct word in match, and preserve proper upper/lower/title case.

malaya.spell.probability(sentence_piece: bool = False, **kwargs)
Train a Probability Spell Corrector.
Parameters sentence_piece (bool, optional (default=False)) – if True, reduce possible augmentation states using sentence piece.
Returns result
Return type malaya.spell.Probability class

malaya.spell.symspell(max_edit_distance_dictionary: int = 2, prefix_length: int = 7, term_index: int = 0, count_index: int = 1, top_k: int = 10, **kwargs)
Load a symspell Spell Corrector for Malay.
Returns result
Return type malaya.spell.Symspell class

malaya.spell.jamspell(model: str = 'wiki', **kwargs)
Load a jamspell Spell Corrector for Malay.
Parameters model (str, optional (default='wiki')) – Supported models. Allowed values:
• 'wiki+news' - Wikipedia + News, 337MB.
• 'wiki' - Wikipedia, 148MB.
• 'news' - local news, 215MB.
Returns result
Return type malaya.spell.JamSpell class

malaya.spell.spylls(model: str = 'libreoffice-pejam', **kwargs)
Load a spylls Spell Corrector for Malay.
Parameters model (str, optional (default='libreoffice-pejam')) – Model spelling correction supported. Allowed values:
• 'libreoffice-pejam' - from LibreOffice pEJAm, https://extensions.libreoffice.org/en/extensions/show/3868
Returns result
Return type malaya.spell.Spylls class

malaya.spell.available_transformer()
List available transformer models.

malaya.spell.transformer(model: str = 'small-t5', quantized: bool = False, **kwargs)
Load a Transformer Spell Corrector.
Parameters


• model (str, optional (default='small-t5')) – Model architecture supported. Allowed values:
– 'small-t5' - T5 SMALL parameters.
– 'tiny-t5' - T5 TINY parameters.
– 'super-tiny-t5' - T5 SUPER TINY parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.model.t5.Spell class

malaya.spell.transformer_encoder(model, sentence_piece: bool = False, **kwargs)
Load a Transformer Encoder Spell Corrector. Currently only BERT and ALBERT are supported.
Parameters sentence_piece (bool, optional (default=False)) – if True, reduce possible augmentation states using sentence piece.
Returns result
Return type malaya.spell.Transformer class

class malaya.spell.Transformer

correct(word: str, string: str, index: int = -1, batch_size: int = 20)
Correct a word within a text, returning the corrected word.
Parameters
• word (str) –
• string (str) – entire string; word must be a word inside string.
• index (int, optional (default=-1)) – index of word in the string; if -1, will try to use string.index(word).
• batch_size (int, optional (default=20)) – batch size to insert into model.
Returns result
Return type str

correct_text(text: str, batch_size: int = 20)
Correct all the words within a text, returning the corrected text.
Parameters
• text (str) –
• batch_size (int, optional (default=20)) – batch size to insert into model.
Returns result
Return type str

correct_word(word: str, string: str, batch_size: int = 20)
Spell-correct word in match, and preserve proper upper/lower/title case.
Parameters
• word (str) –
• string (str) – entire string; word must be a word inside string.


• batch_size (int, optional (default=20)) – batch size to insert into model.
Returns result
Return type str
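A minimal sketch of the probability and symspell correctors; the misspelled inputs are hypothetical:

    import malaya

    corrector = malaya.spell.probability()
    corrector.correct('suke')       # most probable correction for a single word

    sym = malaya.spell.symspell()
    sym.correct_text('kerajaan patut bagi pencen awal kpd warga emas')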

54 Chapter 9. Contents: malaya Documentation


9.6.25 malaya.stack

malaya.stack.voting_stack(models, text: str)
Stacking for POS, Entities and Dependency models.
Parameters
• models (list) – list of models.
• text (str) – string to predict.
Returns result
Return type list

malaya.stack.predict_stack(models, strings: List[str], aggregate: Callable = scipy.stats.mstats.gmean, **kwargs)
Stacking for predictive models.
Parameters
• models (List[Callable]) – list of models.
• strings (List[str]) –
• aggregate (Callable, optional (default=scipy.stats.mstats.gmean)) – aggregate function.
Returns result
Return type dict
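A minimal stacking sketch; the choice of sentiment models and the input string are illustrative:

    import malaya

    albert = malaya.sentiment.transformer(model='tiny-albert')
    bert = malaya.sentiment.transformer(model='tiny-bert')

    # geometric mean (the default aggregate) over both models' probabilities
    malaya.stack.predict_stack([albert, bert], ['kerajaan sangat prihatin dengan rakyat'])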

9.6.26 malaya.stem

malaya.stem.naive()
Load stemming model that naively uses startswith and endswith regex patterns.
Returns result
Return type malaya.stem.Naive class

malaya.stem.sastrawi()
Load stemming model using Sastrawi; this also includes lemmatization.
Returns result
Return type malaya.stem.Sastrawi class

malaya.stem.deep_model(quantized: bool = False, **kwargs)
Load LSTM + Bahdanau Attention stemming model; this also includes lemmatization. Original size 41.6MB, quantized size 10.6MB.
Parameters quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.stem.DeepStemmer class


class malaya.stem.DeepStemmer

stem(string: str, beam_search: bool = False)
Stem a string; this also includes lemmatization.
Parameters
• string (str) –
• beam_search (bool, (default=False)) – if True, use beam search decoder, else use greedy decoder.
Returns result
Return type str

class malaya.stem.Sastrawi

class malaya.stem.Naive
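A minimal sketch; the input string is hypothetical:

    import malaya

    model = malaya.stem.deep_model()
    model.stem('saya sangat sukakan makanan itu')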

9.6.27 malaya.subjectivity

malaya.subjectivity.available_transformer()
List available transformer subjectivity analysis models.

malaya.subjectivity.multinomial(**kwargs)
Load multinomial subjectivity model.
Parameters validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns result
Return type malaya.model.ml.Bayes class

malaya.subjectivity.transformer(model: str = 'bert', quantized: bool = False, **kwargs)
Load Transformer subjectivity model.
Parameters
• model (str, optional (default='bert')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.BinaryBERT.
• if xlnet in model, will return malaya.model.xlnet.BinaryXLNET.


Return type model
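A minimal sketch; the input string is hypothetical:

    import malaya

    model = malaya.subjectivity.multinomial()
    model.predict(['saya rasa keputusan itu tidak adil'])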

9.6.28 malaya.tatabahasa

malaya.tatabahasa.describe()
Describe kesalahan tatabahasa (grammatical errors) supported. Full description at https://tatabahasabm.tripod.com/tata/salahtata.htm

malaya.tatabahasa.available_transformer()
List available transformer models.

malaya.tatabahasa.transformer(model: str = 'base', quantized: bool = False, **kwargs)
Load Malaya transformer encoder-decoder + tagging model to correct kesalahan tatabahasa in a text.
Parameters
• model (str, optional (default='base')) – Model architecture supported. Allowed values:
– 'small' - Malaya Transformer Tag SMALL parameters.
– 'base' - Malaya Transformer Tag BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.model.tf.Tatabahasa class
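A minimal loading sketch; the correction interface lives on the returned malaya.model.tf.Tatabahasa object, which is not documented in this section:

    import malaya

    model = malaya.tatabahasa.transformer(model='base')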

9.6.29 malaya.summarization.abstractive

malaya.summarization.abstractive.available_transformer()
List available transformer models.

malaya.summarization.abstractive.transformer(model: str = 'small-t5', quantized: bool = False, **kwargs)
Load Malaya transformer encoder-decoder model to generate a summary given a string.
Parameters
• model (str, optional (default='small-t5')) – Model architecture supported. Allowed values:
– 't5' - T5 BASE parameters.
– 'small-t5' - T5 SMALL parameters.
– 'tiny-t5' - T5 TINY parameters.
– 'pegasus' - Pegasus BASE parameters.
– 'small-pegasus' - Pegasus SMALL parameters.
– 'bigbird' - BigBird + Pegasus BASE parameters.
– 'small-bigbird' - BigBird + Pegasus SMALL parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:


• if t5 in model, will return malaya.model.t5.Summarization.
• if bigbird in model, will return malaya.model.bigbird.Summarization.
• if pegasus in model, will return malaya.model.pegasus.Summarization.
Return type model
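A minimal sketch; the article string is a hypothetical placeholder and greedy_decoder comes from malaya.model.t5.Summarization (see 9.6.46):

    import malaya

    model = malaya.summarization.abstractive.transformer(model='small-t5')
    article = 'KUALA LUMPUR: Kerajaan hari ini mengumumkan beberapa inisiatif baharu untuk membantu golongan B40.'  # hypothetical
    model.greedy_decoder([article])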

9.6.30 malaya.summarization.extractive

malaya.summarization.extractive.encoder(vectorizer)
Encoder interface for summarization.
Parameters vectorizer (object) – encoder interface object, eg, BERT, XLNET, ALBERT, ALXLNET. Should have a vectorize method.
Returns result
Return type malaya.model.extractive_summarization.Encoder

malaya.summarization.extractive.doc2vec(wordvector)
Doc2Vec interface for summarization.
Parameters wordvector (object) – malaya.wordvector.WordVector object. Should have a get_vector_by_name method.
Returns result
Return type malaya.model.extractive_summarization.Doc2Vec

malaya.summarization.extractive.sklearn(model, vectorizer)
sklearn interface for summarization.
Parameters
• model (object) – Should have a fit_transform method. Commonly:
– sklearn.decomposition.TruncatedSVD - LSA algorithm.
– sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.
• vectorizer (object) – Should have a fit_transform method. Commonly:
– sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
– sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.
– malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.
– malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.
Returns result
Return type malaya.model.extractive_summarization.SKLearn
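A minimal sketch; passing sklearn instances is an assumption based on the fit_transform note above, the corpus is hypothetical, and sentence_level comes from malaya.model.extractive_summarization.SKLearn (see 9.6.43):

    import malaya
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    model = malaya.summarization.extractive.sklearn(TruncatedSVD(), TfidfVectorizer())
    corpus = 'Kerajaan mengumumkan bantuan baharu. Harga minyak dijangka stabil. Rakyat menyambut baik langkah itu.'  # hypothetical
    result = model.sentence_level(corpus, top_k=2)
    result['summary']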


9.6.31 malaya.similarity

malaya.similarity.doc2vec_wordvector(wordvector)
Doc2vec interface for text similarity using Word Vector.
Parameters wordvector (object) – malaya.wordvector.WordVector object. Should have a get_vector_by_name method.
Returns result
Return type malaya.similarity.Doc2VecSimilarity

malaya.similarity.doc2vec_vectorizer(vectorizer)
Doc2vec interface for text similarity using Encoder model.
Parameters vectorizer (object) – encoder interface object, BERT, XLNET. Should have a vectorize method.
Returns result
Return type malaya.similarity.VectorizerSimilarity

malaya.similarity.available_transformer()
List available transformer similarity models.

malaya.similarity.transformer(model: str = 'bert', quantized: bool = False, **kwargs)
Load Transformer similarity model.
Parameters
• model (str, optional (default='bert')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.SiameseBERT.
• if xlnet in model, will return malaya.model.xlnet.SiameseXLNET.
Return type model

class malaya.similarity.VectorizerSimilarity

predict_proba(left_strings: List[str], right_strings: List[str], similarity: str = 'cosine')
Calculate similarity for two different batches of texts.
Parameters
• left_strings (list of str) –


• right_strings (list of str) –
• similarity (str, optional (default='cosine')) – similarity supported. Allowed values:
– 'cosine' - cosine similarity.
– 'euclidean' - euclidean similarity.
– 'manhattan' - manhattan similarity.
Returns result
Return type List[float]

heatmap(strings: List[str], similarity: str = 'cosine', visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))
Plot a heatmap based on output from similarity.
Parameters
• strings (list of str) – list of strings.
• similarity (str, optional (default='cosine')) – similarity supported. Allowed values:
– 'cosine' - cosine similarity.
– 'euclidean' - euclidean similarity.
– 'manhattan' - manhattan similarity.
• visualize (bool) – if True, it will render plt.show, else return data.
• figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns result – list of results
Return type list

class malaya.similarity.Doc2VecSimilarity

predict_proba(left_strings: List[str], right_strings: List[str], aggregation: Callable = numpy.mean, similarity: str = 'cosine', soft: bool = False)
Calculate similarity for two different batches of texts.
Parameters
• left_strings (list of str) –
• right_strings (list of str) –
• aggregation (Callable, optional (default=numpy.mean)) –
• similarity (str, optional (default='cosine')) – similarity supported. Allowed values:
– 'cosine' - cosine similarity.
– 'euclidean' - euclidean similarity.
– 'manhattan' - manhattan similarity.
• soft (bool, optional (default=False)) – if True, a word not inside the word vector will be replaced with the nearest word, else it will be skipped.
Returns result


Return type List[float]

heatmap(strings: List[str], aggregation: Callable = numpy.mean, similarity: str = 'cosine', soft: bool = False, visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))
Plot a heatmap based on output from similarity.
Parameters
• strings (list of str) – list of strings.
• aggregation (Callable, optional (default=numpy.mean)) –
• similarity (str, optional (default='cosine')) – similarity supported. Allowed values:
– 'cosine' - cosine similarity.
– 'euclidean' - euclidean similarity.
– 'manhattan' - manhattan similarity.
• soft (bool, optional (default=False)) – if True, a word not inside the word vector will be replaced with the nearest word, else it will be skipped.
• visualize (bool) – if True, it will render plt.show, else return data.
• figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns result – list of results.
Return type list
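A minimal sketch; the two strings are hypothetical paraphrases and predict_proba comes from malaya.model.bert.SiameseBERT (see 9.6.41):

    import malaya

    model = malaya.similarity.transformer(model='tiny-bert')
    model.predict_proba(['tolong order makanan'], ['pesan makanan untuk saya'])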

9.6.32 malaya.topic_model

malaya.topic_model.available_vectorizer()
List available vectorizers for topic modeling.

malaya.topic_model.sklearn(corpus: List[str], model, vectorizer, n_topics: int, cleaning=malaya.text.function.simple_textcleaning, stopwords=malaya.texts.function.get_stopwords, **kwargs)
Train a SKlearn model to do topic modelling based on corpus / list of strings given.
Parameters
• corpus (list) –
• model (object) – Should have a fit_transform method. Commonly:
– sklearn.decomposition.TruncatedSVD - LSA algorithm.
– sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.
– sklearn.decomposition.NMF - NMF algorithm.
• vectorizer (object) – Should have a fit_transform method. Commonly:
– sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
– sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.
– malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.


– malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.
• n_topics (int, (default=10)) – size of decomposition column.
• cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
• stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
Returns result
Return type malaya.topic_modelling.Topic class

malaya.topic_model.lda2vec(corpus: List[str], vectorizer, n_topics: int = 10, cleaning=malaya.text.function.simple_textcleaning, stopwords=malaya.texts.function.get_stopwords, window_size: int = 2, embedding_size: int = 128, epoch: int = 10, switch_loss: int = 1000, **kwargs)
Train a LDA2Vec model to do topic modelling based on corpus / list of strings given.
Parameters
• corpus (list) –
• vectorizer (object) – Should have a fit_transform method. Commonly:
– sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
– sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.
– malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.
– malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.
• n_topics (int, (default=10)) – size of decomposition column.
• cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
• stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
• embedding_size (int, (default=128)) – embedding size of lda2vec tensors.
• epoch (int, (default=10)) – training iterations, how many loops needed to train.
• switch_loss (int, (default=1000)) – baseline to switch from document-based loss to document + word based loss.
Returns result
Return type malaya.topic_modelling.DeepTopic class

malaya.topic_model.attention(corpus: List[str], n_topics: int, vectorizer, cleaning=malaya.text.function.simple_textcleaning, stopwords=malaya.texts.function.get_stopwords, ngram: Tuple[int, int] = (1, 3), batch_size: int = 10)
Use attention from a transformer model to do topic modelling based on corpus / list of strings given.
Parameters
• corpus (list) –


• n_topics (int, (default=10)) – size of decomposition column.
• vectorizer (object) –
• cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
• stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
• ngram (tuple, (default=(1,3))) – n-gram size to train the corpus.
• batch_size (int, (default=10)) – size of strings for each vectorization and attention.
Returns result
Return type malaya.topic_modelling.AttentionTopic class

class malaya.topic_model.AttentionTopic

top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)
Print important topics based on decomposition.
Parameters
• len_topic (int) – size of topics.
• top_n (int, optional (default=10)) – top n of each topic.
• return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic: int)
Return important topics based on decomposition.
Parameters len_topic (int) – size of topics.
Returns result
Return type List[str]

class malaya.topic_model.DeepTopic

visualize_topics(notebook_mode: bool = False, mds: str = 'pcoa')
Print important topics based on decomposition.
Parameters mds (str, optional (default='pcoa')) – 2D Decomposition. Allowed values:
• 'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
• 'mmds' - Dimension reduction via Multidimensional scaling
• 'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding

top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)
Print important topics based on decomposition.
Parameters
• len_topic (int) – size of topics.
• top_n (int, optional (default=10)) – top n of each topic.


• return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic: int)
Return important topics based on decomposition.
Parameters len_topic (int) – size of topics.
Returns result
Return type List[str]

get_sentences(len_sentence: int, k: int = 0)
Return important sentences related to selected column based on decomposition.
Parameters
• len_sentence (int) –
• k (int, (default=0)) – index of decomposition matrix.
Returns result
Return type List[str]

class malaya.topic_model.Topic

visualize_topics(notebook_mode: bool = False, mds: str = 'pcoa')
Print important topics based on decomposition.
Parameters mds (str, optional (default='pcoa')) – 2D Decomposition. Allowed values:
• 'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
• 'mmds' - Dimension reduction via Multidimensional scaling
• 'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding

top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)
Print important topics based on decomposition.
Parameters
• len_topic (int) – size of topics.
• top_n (int, optional (default=10)) – top n of each topic.
• return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic: int)
Return important topics based on decomposition.
Parameters len_topic (int) –
Returns result
Return type List[str]

get_sentences(len_sentence: int, k: int = 0)
Return important sentences related to selected column based on decomposition.
Parameters
• len_sentence (int) –


• k (int, (default=0)) – index of decomposition matrix.
Returns result
Return type List[str]
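A minimal sketch; the tiny corpus is hypothetical and passing sklearn instances for model and vectorizer is an assumption based on the fit_transform note above:

    import malaya
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        'kerajaan mengumumkan bantuan baharu untuk rakyat',
        'harga minyak dunia dijangka kekal stabil',
        'pasukan bola sepak negara menang malam tadi',
    ]  # hypothetical corpus
    lda = malaya.topic_model.sklearn(corpus, LatentDirichletAllocation(n_components=2), CountVectorizer(), n_topics=2)
    lda.top_topics(2, top_n=5)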

9.6.33 malaya.toxicity

malaya.toxicity.available_transformer()
List available transformer toxicity analysis models.

malaya.toxicity.multinomial(**kwargs)
Load multinomial toxicity model.
Returns result
Return type malaya.model.ml.MultilabelBayes class

malaya.toxicity.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
Load Transformer toxicity model.
Parameters
• model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.SigmoidBERT.
• if xlnet in model, will return malaya.model.xlnet.SigmoidXLNET.
Return type model
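A minimal sketch; the input string is hypothetical and predict comes from malaya.model.ml.MultilabelBayes (see 9.6.44):

    import malaya

    model = malaya.toxicity.multinomial()
    model.predict(['bodoh betul komen macam ni'])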

9.6.34 malaya.transformer

malaya.transformer.available_transformer()
List available transformer models.

malaya.transformer.load(model: str = 'electra', pool_mode: str = 'last', **kwargs)
Load transformer model.
Parameters
• model (str, optional (default='electra')) – Model architecture supported. Allowed values:


– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
– 'electra' - Google ELECTRA BASE parameters.
– 'small-electra' - Google ELECTRA SMALL parameters.
• pool_mode (str, optional (default='last')) – Model logits architecture supported. Only usable if model in ['xlnet', 'alxlnet']. Allowed values:
– 'last' - last of the sequence.
– 'first' - first of the sequence.
– 'mean' - mean of the sequence.
– 'attn' - attention of the sequence.
Returns result – one of the following model classes:
• if bert in model, will return malaya.transformers.bert.Model.
• if xlnet in model, will return malaya.transformers.xlnet.Model.
• if albert in model, will return malaya.transformers.albert.Model.
• if electra in model, will return malaya.transformers.electra.Model.
Return type model

9.6.35 malaya.translation.en_ms

malaya.translation.en_ms.available_transformer()
List available transformer models.

malaya.translation.en_ms.transformer(model: str = 'base', quantized: bool = False, **kwargs)
Load transformer encoder-decoder model to translate EN-to-MS.
Parameters
• model (str, optional (default='base')) – Model architecture supported. Allowed values:
– 'small' - Transformer SMALL parameters.
– 'base' - Transformer BASE parameters.
– 'large' - Transformer LARGE parameters.
– 'bigbird' - BigBird BASE parameters.
– 'small-bigbird' - BigBird SMALL parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.


Returns result – one of the following model classes:
• if bigbird in model, return malaya.model.bigbird.Translation.
• else, return malaya.model.tf.Translation.
Return type model
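A minimal sketch; the sentence is hypothetical and greedy_decoder comes from malaya.model.tf.Translation (see 9.6.47):

    import malaya

    model = malaya.translation.en_ms.transformer(model='base')
    model.greedy_decoder(['I really love eating nasi lemak.'])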

9.6.36 malaya.translation.ms_en

malaya.translation.ms_en.available_transformer()
List available transformer models.

malaya.translation.ms_en.transformer(model: str = 'base', quantized: bool = False, **kwargs)
Load Transformer encoder-decoder model to translate MS-to-EN.
Parameters
• model (str, optional (default='base')) – Model architecture supported. Allowed values:
– 'small' - Transformer SMALL parameters.
– 'base' - Transformer BASE parameters.
– 'large' - Transformer LARGE parameters.
– 'bigbird' - BigBird BASE parameters.
– 'small-bigbird' - BigBird SMALL parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bigbird in model, return malaya.model.bigbird.Translation.
• else, return malaya.model.tf.Translation.
Return type model

9.6.37 malaya.true_case

malaya.true_case.available_transformer()
List available transformer models.

malaya.true_case.transformer(model: str = 'base', quantized: bool = False, **kwargs)
Load transformer encoder-decoder model to True Case.
Parameters
• model (str, optional (default='base')) – Model architecture supported. Allowed values:
– 'small' - Transformer SMALL parameters.
– 'base' - Transformer BASE parameters.


• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.model.tf.TrueCase class
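A minimal sketch, reusing the example sentence from malaya.model.tf.TrueCase (see 9.6.47):

    import malaya

    model = malaya.true_case.transformer(model='base')
    model.greedy_decoder(['saya nak makan di us makanan di sana sedap'])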

9.6.38 malaya.word2num

malaya.word2num.word2num(string)
Translate from string to number, eg 'kesepuluh' -> 10.
Parameters string (str) –
Returns result
Return type int / float
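A short sketch; the first output comes from the example above, the second is an illustrative expectation:

    import malaya

    malaya.word2num.word2num('kesepuluh')       # -> 10
    malaya.word2num.word2num('dua puluh lima')  # expected 25 (illustrative)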

9.6.39 malaya.wordvector

malaya.wordvector.available_wordvector()
List available pretrained word vectors.

malaya.wordvector.load(model: str = 'wikipedia', **kwargs)
Return malaya.wordvector.WordVector object.
Parameters model (str, optional (default='wikipedia')) – Model architecture supported. Allowed values:
• 'wikipedia' - pretrained on Malay wikipedia word2vec size 256.
• 'socialmedia' - pretrained on cleaned Malay twitter and Malay instagram size 256.
• 'news' - pretrained on cleaned Malay news size 256.
• 'combine' - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.
Returns
• vocabulary (indices dictionary for vector.)
• vector (np.array, 2D.)

class malaya.wordvector.WordVector

get_vector_by_name(word: str, soft: bool = False, topn_soft: int = 5)
Get vector based on string.
Parameters
• word (str) –
• soft (bool, (default=False)) – if True, a word not in the dictionary will be replaced with the word that has the nearest Jaro-Winkler ratio. If False, an exception is thrown when a word is not in the dictionary.
• topn_soft (int, (default=5)) – if word is not found in the dictionary, will return topn_soft similar words using Jaro-Winkler.
Returns vector


Return type np.array, 1D

tree_plot(labels, figsize: Tuple[int, int] = (7, 7), annotate: bool = True)
Plot a tree plot based on output from calculator / n_closest / analogy.
Parameters
• labels (list) – output from calculator / n_closest / analogy.
• visualize (bool) – if True, it will render plt.show, else return data.
• figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns
• embed (np.array, 2D.)
• labelled (labels for X / Y axis.)

scatter_plot(labels, centre: str = None, figsize: Tuple[int, int] = (7, 7), plus_minus: int = 25, handoff: float = 5e-05)
Plot a scatter plot based on output from calculator / n_closest / analogy.
Parameters
• labels (list) – output from calculator / n_closest / analogy.
• centre (str, (default=None)) – centre label; if a str, it will annotate in a red color.
• figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns tsne
Return type np.array, 2D.

batch_calculator(equations: List[str], num_closest: int = 5, return_similarity: bool = False)
Batch calculator parser for word2vec using Tensorflow.
Parameters
• equations (list of str) – Eg, '[(mahathir + najib) - rosmah]'
• num_closest (int, (default=5)) – number of words closest to the result.
Returns word_list
Return type list of nearest words

calculator(equation: str, num_closest: int = 5, metric: str = 'cosine', return_similarity: bool = True)
Calculator parser for word2vec.
Parameters
• equation (str) – Eg, '(mahathir + najib) - rosmah'
• num_closest (int, (default=5)) – number of words closest to the result.
• metric (str, (default='cosine')) – vector distance algorithm.
• return_similarity (bool, (default=True)) – if True, will return a value between 0-1 representing the distance.
Returns word_list
Return type list of nearest words


batch_n_closest(words: List[str], num_closest: int = 5, return_similarity: bool = False, soft: bool = True)
Find nearest words based on a batch of words using Tensorflow.
Parameters
• words (list) – Eg, ['najib', 'anwar']
• num_closest (int, (default=5)) – number of words closest to the result.
• return_similarity (bool, (default=False)) – if True, will return a value between 0-1 representing the distance.
• soft (bool, (default=True)) – if True, a word not in the dictionary will be replaced with the word that has the nearest Jaro-Winkler ratio. If False, an exception is thrown when a word is not in the dictionary.
Returns word_list
Return type list of nearest words

n_closest(word: str, num_closest: int = 5, metric: str = 'cosine', return_similarity: bool = True)
Find nearest words based on a word.
Parameters
• word (str) – Eg, 'najib'
• num_closest (int, (default=5)) – number of words closest to the result.
• metric (str, (default='cosine')) – vector distance algorithm.
• return_similarity (bool, (default=True)) – if True, will return a value between 0-1 representing the distance.
Returns word_list
Return type list of nearest words

analogy(a: str, b: str, c: str, num: int = 1, metric: str = 'cosine')
Analogy calculation, vb - va + vc.
Parameters
• a (str) –
• b (str) –
• c (str) –
• num (int, (default=1)) –
• metric (str, (default='cosine')) – vector distance algorithm.
Returns word_list
Return type list of nearest words.

project_2d(start: int, end: int)
Project word2vec into 2d dimension.
Parameters
• start (int) –
• end (int) –
Returns


• embed_2d (TSNE decomposition)
• word_list (words in between start and end.)

network(word: str, num_closest: int = 8, depth: int = 4, min_distance: float = 0.5, iteration: int = 300, figsize: Tuple[int, int] = (15, 15), node_color: str = '#72bbd0', node_factor: int = 50)
Plot a social network based on the word given.
Parameters
• word (str) – centre of social network.
• num_closest (int, (default=8)) – number of words closest to the node.
• depth (int, (default=4)) – depth of social network. Deeper is more expensive to calculate, big^O(num_closest ** depth).
• min_distance (float, (default=0.5)) – minimum distance among nodes. Increase the value to increase the distance among nodes.
• iteration (int, (default=300)) – number of loops to train the social network to fit min_distance.
• figsize (tuple, (default=(15, 15))) – figure size for plot.
• node_color (str, (default='#72bbd0')) – color for nodes.
• node_factor (int, (default=50)) – size factor for depth nodes. Increasing this value will increase node sizes based on depth.
Returns G
Return type networkx graph object
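A minimal sketch; the WordVector constructor argument order is an assumption and the query words are illustrative:

    import malaya

    vocab, vector = malaya.wordvector.load(model='wikipedia')
    wv = malaya.wordvector.WordVector(vector, vocab)  # assumed constructor order

    wv.n_closest('najib', num_closest=5)
    wv.analogy('lelaki', 'raja', 'perempuan', num=3)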

9.6.40 malaya.zero_shot.classification

malaya.zero_shot.classification.available_transformer()
List available transformer zero-shot models.

malaya.zero_shot.classification.transformer(model: str = 'bert', quantized: bool = False, **kwargs)
Load Transformer zero-shot model.
Parameters
• model (str, optional (default='bert')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:


• if bert in model, will return malaya.model.bert.ZeroshotBERT.
• if xlnet in model, will return malaya.model.xlnet.ZeroshotXLNET.
Return type model
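A minimal sketch; the input string and candidate labels are hypothetical and predict_proba comes from malaya.model.bert.ZeroshotBERT (see 9.6.41):

    import malaya

    model = malaya.zero_shot.classification.transformer(model='tiny-bert')
    model.predict_proba(['kerajaan umumkan bantuan RM100'], labels=['politik', 'sukan', 'kewangan'])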

9.6.41 malaya.model.bert

class malaya.model.bert.BinaryBERT

vectorize(strings: List[str], method: str = 'first')
Vectorize list of strings.
Parameters
• strings (List[str]) –
• method (str, optional (default='first')) – Vectorization layer supported. Allowed values:
– 'last' - vector from last sequence.
– 'first' - vector from first sequence.
– 'mean' - average vectors from all sequences.
– 'word' - average vectors based on tokens.
Returns result
Return type np.array

predict(strings: List[str], add_neutral: bool = True)
Classify list of strings.
Parameters
• strings (List[str]) –
• add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns result
Return type List[str]

predict_proba(strings: List[str], add_neutral: bool = True)
Classify list of strings and return probability.
Parameters
• strings (List[str]) –
• add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns result
Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
Classify words.
Parameters
• string (str) –


• method (str, optional (default='last')) – Attention layer supported. Allowed values:
– 'last' - attention from last layer.
– 'first' - attention from first layer.
– 'mean' - average attentions from all layers.
• visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.
Returns result
Return type dict

class malaya.model.bert.MulticlassBERT

vectorize(strings: List[str], method: str = 'first')
Vectorize list of strings.
Parameters
• strings (List[str]) –
• method (str, optional (default='first')) – Vectorization layer supported. Allowed values:
– 'last' - vector from last sequence.
– 'first' - vector from first sequence.
– 'mean' - average vectors from all sequences.
– 'word' - average vectors based on tokens.
Returns result
Return type np.array

predict(strings: List[str])
Classify list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[str]

predict_proba(strings: List[str])
Classify list of strings and return probability.
Parameters strings (List[str]) –
Returns result
Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
Classify words.
Parameters
• string (str) –
• method (str, optional (default='last')) – Attention layer supported. Allowed values:


– 'last' - attention from last layer.
– 'first' - attention from first layer.
– 'mean' - average attentions from all layers.
• visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.
Returns result
Return type dict

class malaya.model.bert.SigmoidBERT

vectorize(strings: List[str], method: str = 'first')
Vectorize list of strings.
Parameters
• strings (List[str]) –
• method (str, optional (default='first')) – Vectorization layer supported. Allowed values:
– 'last' - vector from last sequence.
– 'first' - vector from first sequence.
– 'mean' - average vectors from all sequences.
– 'word' - average vectors based on tokens.
Returns result
Return type np.array

predict(strings: List[str])
Classify list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[List[str]]

predict_proba(strings: List[str])
Classify list of strings and return probability.
Parameters strings (List[str]) –
Returns result
Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
Classify words.
Parameters
• string (str) –
• method (str, optional (default='last')) – Attention layer supported. Allowed values:
– 'last' - attention from last layer.
– 'first' - attention from first layer.


– 'mean' - average attentions from all layers.
• visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.
Returns result
Return type dict

class malaya.model.bert.SiameseBERT

vectorize(strings: List[str])
Vectorize list of strings.
Parameters strings (List[str]) –
Returns result
Return type np.array

predict_proba(strings_left: List[str], strings_right: List[str])
Calculate similarity for two different batches of texts.
Parameters
• strings_left (List[str]) –
• strings_right (List[str]) –
Returns list
Return type list of float

heatmap(strings: List[str], visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))
Plot a heatmap based on output from similarity.
Parameters
• strings (list of str) – list of strings.
• visualize (bool) – if True, it will render plt.show, else return data.
• figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns result – list of results
Return type list

class malaya.model.bert.TaggingBERT

vectorize(string: str)
Vectorize a string.
Parameters string (str) –
Returns result
Return type np.array

analyze(string: str)
Analyze a string.
Parameters string (str) –
Returns result


Return type {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}

predict(string: str)
Tag a string.
Parameters string (str) –
Returns result
Return type Tuple[str, str]

class malaya.model.bert.DependencyBERT

vectorize(string: str)
Vectorize a string.
Parameters string (str) –
Returns result
Return type np.array

predict(string: str)
Tag a string.
Parameters string (str) –
Returns result
Return type Tuple

class malaya.model.bert.ZeroshotBERT

vectorize(strings: List[str], labels: List[str], method: str = 'first')
Vectorize a string.
Parameters
• strings (List[str]) –
• labels (List[str]) –
• method (str, optional (default='first')) – Vectorization layer supported. Allowed values:
– 'last' - vector from last sequence.
– 'first' - vector from first sequence.
– 'mean' - average vectors from all sequences.
– 'word' - average vectors based on tokens.
Returns result
Return type np.array

predict_proba(strings: List[str], labels: List[str])
Classify list of strings and return probability.
Parameters
• strings (List[str]) –
• labels (List[str]) –


Returns list
Return type list of float

9.6.42 malaya.model.bigbird

class malaya.model.bigbird.MulticlassBigBird

vectorize(strings: List[str], method: str = 'first')
Vectorize list of strings.
Parameters
• strings (List[str]) –
• method (str, optional (default='first')) – Vectorization layer supported. Allowed values:
– 'last' - vector from last sequence.
– 'first' - vector from first sequence.
– 'mean' - average vectors from all sequences.
– 'word' - average vectors based on tokens.
Returns result
Return type np.array

predict(strings: List[str])
Classify list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[str]

predict_proba(strings: List[str])
Classify list of strings and return probability.
Parameters strings (List[str]) –
Returns result
Return type List[dict[str, float]]

class malaya.model.bigbird.Translation

greedy_decoder(strings: List[str])
Translate list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[str]

class malaya.model.bigbird.Summarization

greedy_decoder(strings: List[str], temperature: float = 0.0, postprocess: bool = False, **kwargs)
Summarize strings using greedy decoder.


Parameters
• strings (List[str]) –
• temperature (float, (default=0.0)) – logits * -log(random.uniform) * temperature.
• postprocess (bool, optional (default=False)) – if True, will filter sentences generated using ROUGE score and remove international news publishers.
Returns result
Return type List[str]

nucleus_decoder(strings: List[str], top_p: float = 0.7, temperature: float = 0.1, postprocess: bool = False, **kwargs)
Summarize strings using nucleus decoder.
Parameters
• strings (List[str]) –
• top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.
• temperature (float, (default=0.1)) – logits * -log(random.uniform) * temperature.
• postprocess (bool, optional (default=False)) – if True, will filter sentences generated using ROUGE score and remove international news publishers.
Returns result
Return type List[str]

9.6.43 malaya.model.extractive_summarization

class malaya.model.extractive_summarization.SKLearn

word_level(corpus, isi_penting: str = None, window_size: int = 10, important_words: int = 10, **kwargs)
Summarize list of strings / string on word level.
Parameters
• corpus (str / List[str]) –
• isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
• window_size (int, (default=10)) – window size for each word.
• important_words (int, (default=10)) – number of important words.
Returns dict
Return type {'top-words', 'cluster-top-words', 'score'}

sentence_level(corpus, isi_penting: str = None, top_k: int = 3, important_words: int = 10, **kwargs)
Summarize list of strings / string on sentence level.
Parameters
• corpus (str / List[str]) –


• isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
• top_k (int, (default=3)) – number of summarized strings.
• important_words (int, (default=10)) – number of important words.
Returns dict
Return type {'summary', 'top-words', 'cluster-top-words', 'score'}

class malaya.model.extractive_summarization.Doc2Vec

word_level(corpus, isi_penting: str = None, window_size: int = 10, aggregation=numpy.mean, soft: bool = False, **kwargs)
Summarize list of strings / string on word level.
Parameters
• corpus (str / List[str]) –
• isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
• window_size (int, (default=10)) – window size for each word.
• aggregation (Callable, optional (default=numpy.mean)) – aggregation method for Doc2Vec.
• soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the word that has the nearest Jaro-Winkler ratio; if False, it will return an embedding full of zeros.
Returns dict
Return type {'score'}

sentence_level(corpus, isi_penting: str = None, top_k: int = 3, aggregation=numpy.mean, soft: bool = False, **kwargs)
Summarize list of strings / string on sentence level.
Parameters
• corpus (str / List[str]) –
• isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
• top_k (int, (default=3)) – number of summarized strings.
• aggregation (Callable, optional (default=numpy.mean)) – aggregation method for Doc2Vec.
• soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the word that has the nearest Jaro-Winkler ratio; if False, it will return an embedding full of zeros.
Returns dict
Return type {'summary', 'score'}

class malaya.model.extractive_summarization.Encoder


9.6.44 malaya.model.ml

class malaya.model.ml.MulticlassBayes

predict(strings: List[str])
Classify list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[str]

predict_proba(strings: List[str])
Classify list of strings and return probability.
Parameters strings (List[str]) –
Returns result
Return type List[dict[str, float]]

class malaya.model.ml.BinaryBayes

predict(strings: List[str], add_neutral: bool = True)
Classify list of strings.
Parameters
• strings (List[str]) –
• add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns result
Return type List[str]

predict_proba(strings: List[str], add_neutral: bool = True)
Classify list of strings and return probability.
Parameters
• strings (List[str]) –
• add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns result
Return type List[dict[str, float]]

class malaya.model.ml.MultilabelBayes

predict(strings: List[str])
Classify list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[List[str]]


predict_proba(strings: List[str])
Classify list of strings and return probability.
Parameters strings (List[str]) –
Returns result
Return type List[dict[str, float]]

9.6.45 malaya.model.pegasus

class malaya.model.pegasus.Summarization

greedy_decoder(strings: List[str], temperature: float = 0.0, postprocess: bool = False, **kwargs)
Summarize strings using greedy decoder.
Parameters
• strings (List[str]) –
• temperature (float, (default=0.0)) – logits * -log(random.uniform) * temperature.
• postprocess (bool, optional (default=False)) – if True, will filter sentences generated using ROUGE score and remove international news publishers.
Returns result
Return type List[str]

nucleus_decoder(strings: List[str], top_p: float = 0.7, temperature: float = 0.2, postprocess: bool = False, **kwargs)
Summarize strings using nucleus decoder.
Parameters
• strings (List[str]) –
• top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.
• temperature (float, (default=0.2)) – logits * -log(random.uniform) * temperature.
• postprocess (bool, optional (default=False)) – if True, will filter sentences generated using ROUGE score and remove international news publishers.
Returns result
Return type List[str]


9.6.46 malaya.model.t5

class malaya.model.t5.Summarization

greedy_decoder(strings: List[str], postprocess: bool = False, **kwargs)
    Summarize strings.

    Parameters
        • strings (List[str]) –
        • postprocess (bool, optional (default=False)) – if True, will filter sentences generated using ROUGE score and remove international news publishers.

    Returns result
    Return type List[str]

class malaya.model.t5.Generator

greedy_decoder(strings: List[str])
    Generate a long text given an isi penting (list of important points). Decoder is greedy with beam width 1, alpha 0.5.

    Parameters strings (List[str]) –
    Returns result
    Return type str

class malaya.model.t5.Paraphrase

greedy_decoder(strings: List[str])
    Paraphrase strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.t5.KnowledgeGraph

greedy_decoder(strings: List[str], get_networkx: bool = True)
    Generate triples knowledge graph using greedy decoder. Example, “Joseph Enanga juga bermain untuk Union Douala.” -> “Joseph Enanga member of sports team Union Douala”.

    Parameters
        • strings (List[str]) –
        • get_networkx (bool, optional (default=True)) – if True, will generate networkx.MultiDiGraph.

    Returns result
    Return type List[Dict]

class malaya.model.t5.Spell

greedy_decoder(strings: List[str])
    Spelling correction for strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.t5.Segmentation

greedy_decoder(strings: List[str])
    Segment strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]
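All of these T5 classes share the same greedy_decoder entry point. A hedged sketch, assuming loader names that follow the module layout (they may differ by version):

import malaya

# Hypothetical loaders; verify the exact names in your Malaya version.
segmenter = malaya.segmentation.transformer(model='t5')
speller = malaya.spell.transformer(model='t5')

# Word segmentation, e.g. -> ['saya sygkan negara saya']
print(segmenter.greedy_decoder(['sayasygkan negarasaya']))

# Spelling correction, one corrected string per input string.
print(speller.greedy_decoder(['kerajaan patut bagi pencen awal skt kpd warga emas']))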

9.6.47 malaya.model.tf

class malaya.model.tf.DeepLang

predict(strings: List[str])
    Classify list of strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

predict_proba(strings: List[str])
    Classify list of strings and return probability.

    Parameters strings (List[str]) –
    Returns result
    Return type List[dict[str, float]]

class malaya.model.tf.Translation

greedy_decoder(strings: List[str])
    Translate list of strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

beam_decoder(strings: List[str])
    Translate list of strings using beam decoder, beam width 3, alpha 0.5.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.tf.Constituency


vectorize(string: str)
    Vectorize a string.

    Parameters string (str) –
    Returns result
    Return type np.array

parse_nltk_tree(string: str)
    Parse a string into an NLTK Tree; to make it useful, make sure you have tkinter installed.

    Parameters string (str) –
    Returns result
    Return type nltk.Tree object

parse_tree(string)
    Parse a string into tree format.

    Parameters string (str) –
    Returns result
    Return type malaya.text.trees.InternalTreebankNode class

class malaya.model.tf.TrueCase

greedy_decoder(strings: List[str])
    True case strings using greedy decoder. Example, “saya nak makan di us makanan di sana sedap” -> “Saya nak makan di US, makanan di sana sedap.”

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

beam_decoder(strings: List[str])
    True case strings using beam decoder, beam width 3, alpha 0.5. Example, “saya nak makan di us makanan di sana sedap” -> “Saya nak makan di US, makanan di sana sedap.”

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.tf.Segmentation

greedy_decoder(strings: List[str])
    Segment strings using greedy decoder. Example, “sayasygkan negarasaya” -> “saya sygkan negara saya”.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

beam_decoder(strings: List[str])
    Segment strings using beam decoder, beam width 3, alpha 0.5. Example, “sayasygkan negarasaya” -> “saya sygkan negara saya”.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.tf.Paraphrase

greedy_decoder(strings: List[str], **kwargs)
    Paraphrase strings using greedy decoder.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

beam_decoder(strings: List[str], **kwargs)
    Paraphrase strings using beam decoder, beam width 3, alpha 0.5.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

nucleus_decoder(strings: List[str], top_p: float = 0.7, **kwargs)
    Paraphrase strings using nucleus sampling.

    Parameters
        • strings (List[str]) –
        • top_p (float, (default=0.7)) – sort the cumulative distribution and cut off as soon as the CDF exceeds top_p.

    Returns result
    Return type List[str]

class malaya.model.tf.Tatabahasa

greedy_decoder(strings: List[str])
    Fix kesalahan tatabahasa (grammatical errors).

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.tf.SQUAD

predict(paragraph_text: str, question_texts: List[str], doc_stride: int = 128, max_query_length: int = 64, max_answer_length: int = 64, n_best_size: int = 20)
    Predict spans from questions given a paragraph.

    Parameters
        • paragraph_text (str) –
        • question_texts (List[str]) – list of questions; results really depend on case-sensitive questions.
        • doc_stride (int, optional (default=128)) – striding size to split a paragraph into multiple texts.
        • max_query_length (int, optional (default=64)) – maximum length of question tokens.
        • max_answer_length (int, optional (default=64)) – maximum length of answer tokens.

    Returns result
    Return type List[{'text': 'text', 'start': 0, 'end': 1}]

vectorize(strings: List[str], method: str = 'first')
    Vectorize list of strings.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='first')) – vectorization layer supported. Allowed values:
            – 'last' - vector from last sequence.
            – 'first' - vector from first sequence.
            – 'mean' - average vectors from all sequences.
            – 'word' - average vectors based on tokens.

    Returns result
    Return type np.array

class malaya.model.tf.KnowledgeGraph

greedy_decoder(strings: List[str], get_networkx: bool = True)
    Generate triples knowledge graph using greedy decoder. Example, “Joseph Enanga juga bermain untuk Union Douala.” -> “Joseph Enanga member of sports team Union Douala”.

    Parameters
        • strings (List[str]) –
        • get_networkx (bool, optional (default=True)) – if True, will generate networkx.MultiDiGraph.

    Returns result
    Return type List[Dict]

beam_decoder(strings: List[str], get_networkx: bool = True)
    Generate triples knowledge graph using beam decoder. Example, “Joseph Enanga juga bermain untuk Union Douala.” -> “Joseph Enanga member of sports team Union Douala”.

    Parameters
        • strings (List[str]) –
        • get_networkx (bool, optional (default=True)) – if True, will generate networkx.MultiDiGraph.

    Returns result
    Return type List[Dict]
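A hedged usage sketch for the SQUAD class above; the loader name (malaya.qa.transformer_squad) and model key are assumptions based on section 9.54.

import malaya

# Assumption: this loader returns the malaya.model.tf.SQUAD class documented above.
model = malaya.qa.transformer_squad(model='tiny-bert')  # hypothetical loader / key

paragraph = 'Tun Dr Mahathir bin Mohamad ialah Perdana Menteri Malaysia yang ketujuh.'
questions = ['Siapakah Perdana Menteri Malaysia yang ketujuh?']

# Each answer comes back with its text plus character offsets in the paragraph.
print(model.predict(paragraph, questions, doc_stride=128, max_answer_length=64))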


9.6.48 malaya.model.xlnet

class malaya.model.xlnet.BinaryXLNET

vectorize(strings: List[str], method: str = 'first')
    Vectorize list of strings.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='first')) – vectorization layer supported. Allowed values:
            – 'last' - vector from last sequence.
            – 'first' - vector from first sequence.
            – 'mean' - average vectors from all sequences.
            – 'word' - average vectors based on tokens.

    Returns result
    Return type np.array

predict(strings: List[str], add_neutral: bool = True)
    Classify list of strings.

    Parameters
        • strings (List[str]) –
        • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

    Returns result
    Return type List[str]

predict_proba(strings: List[str], add_neutral: bool = True)
    Classify list of strings and return probability.

    Parameters
        • strings (List[str]) –
        • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

    Returns result
    Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
    Classify words.

    Parameters
        • string (str) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.
        • visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.

    Returns result
    Return type dict

class malaya.model.xlnet.MulticlassXLNET

vectorize(strings: List[str], method: str = 'first')
    Vectorize list of strings.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='first')) – vectorization layer supported. Allowed values:
            – 'last' - vector from last sequence.
            – 'first' - vector from first sequence.
            – 'mean' - average vectors from all sequences.
            – 'word' - average vectors based on tokens.

    Returns result
    Return type np.array

predict(strings: List[str])
    Classify list of strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

predict_proba(strings: List[str])
    Classify list of strings and return probability.

    Parameters strings (List[str]) –
    Returns result
    Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
    Classify words.

    Parameters
        • string (str) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.
        • visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.

    Returns result
    Return type dict

class malaya.model.xlnet.SigmoidXLNET

vectorize(strings: List[str], method: str = 'first')
    Vectorize list of strings.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='first')) – vectorization layer supported. Allowed values:
            – 'last' - vector from last sequence.
            – 'first' - vector from first sequence.
            – 'mean' - average vectors from all sequences.
            – 'word' - average vectors based on tokens.

    Returns result
    Return type np.array

predict(strings: List[str])
    Classify list of strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[List[str]]

predict_proba(strings: List[str])
    Classify list of strings and return probability.

    Parameters strings (List[str]) –
    Returns result
    Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
    Classify words.

    Parameters
        • string (str) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.
        • visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.

    Returns result
    Return type dict

class malaya.model.xlnet.SiameseXLNET

vectorize(strings: List[str])
    Vectorize list of strings.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array

predict_proba(strings_left: List[str], strings_right: List[str])
    Calculate similarity for two different batches of texts.

    Parameters
        • strings_left (List[str]) –
        • strings_right (List[str]) –

    Returns result
    Return type List[float]

heatmap(strings: List[str], visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))
    Plot a heatmap based on output from similarity.

    Parameters
        • strings (list of str) – list of strings.
        • visualize (bool) – if True, it will render plt.show, else return data.
        • figsize (tuple, (default=(7, 7))) – figure size for plot.

    Returns result – list of results
    Return type list

class malaya.model.xlnet.TaggingXLNET

vectorize(string: str)
    Vectorize a string.

    Parameters string (str) –
    Returns result
    Return type np.array

analyze(string: str)
    Analyze a string.

    Parameters string (str) –
    Returns result
    Return type {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}

predict(string: str)
    Tag a string.

    Parameters string (str) –
    Returns result
    Return type Tuple[str, str]

class malaya.model.xlnet.DependencyXLNET

vectorize(string: str)
    Vectorize a string.

    Parameters string (str) –
    Returns result
    Return type np.array

predict(string: str)
    Tag a string.

    Parameters string (str) –
    Returns result
    Return type Tuple

class malaya.model.xlnet.ZeroshotXLNET

vectorize(strings: List[str], labels: List[str], method: str = 'first')
    Vectorize a string.

    Parameters
        • strings (List[str]) –
        • labels (List[str]) –
        • method (str, optional (default='first')) – vectorization layer supported. Allowed values:
            – 'last' - vector from last sequence.
            – 'first' - vector from first sequence.
            – 'mean' - average vectors from all sequences.
            – 'word' - average vectors based on tokens.

    Returns result
    Return type np.array

predict_proba(strings: List[str], labels: List[str])
    Classify list of strings and return probability.

    Parameters
        • strings (List[str]) –
        • labels (List[str]) –

    Returns list
    Return type list of float
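In practice you rarely construct these XLNET classes directly; the task loaders return them. A hedged sketch using the sentiment loader shown elsewhere in these docs; that model='alxlnet' returns a BinaryXLNET-style object is an assumption.

import malaya

model = malaya.sentiment.transformer(model='alxlnet')

strings = ['kerajaan ni teruk betul', 'comelnya kucing ni']
print(model.predict_proba(strings))

# Per-word attribution without opening the visualization dashboard.
print(model.predict_words(strings[0], method='last', visualization=False))

# One fixed-size vector per string, useful for downstream clustering.
print(model.vectorize(strings, method='first').shape)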


9.6.49 malaya.transformers.albert

malaya.transformers.albert.load(model: str = 'albert', **kwargs)
    Load albert model.

    Parameters model (str, optional (default='albert')) – Model architecture supported. Allowed values:
        • 'albert' - base albert-bahasa released by Malaya.
        • 'tiny-albert' - tiny albert-bahasa released by Malaya.

    Returns result
    Return type malaya.transformers.albert.Model class

class malaya.transformers.albert.Model

vectorize(strings: List[str])
    Vectorize string inputs.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array

attention(strings: List[str], method: str = 'last', **kwargs)
    Get attention for string inputs.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.

    Returns result
    Return type List[List[Tuple[str, float]]]

visualize_attention(string: str)
    Visualize attention.

    Parameters string (str) –
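A short usage sketch built directly from the interface above (output shapes are indicative only):

import malaya

albert = malaya.transformers.albert.load(model='albert')

strings = ['tolong hantar laporan itu sebelum esok']

# One row per input string.
print(albert.vectorize(strings).shape)

# Per-token attention weights from the last layer; each inner list holds
# (token, weight) pairs for one input string.
print(albert.attention(strings, method='last'))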

9.6.50 malaya.transformers.alxlnet

malaya.transformers.alxlnet.load(model: str = 'alxlnet', pool_mode: str = 'last', **kwargs)
    Load alxlnet model.

    Parameters
        • model (str, optional (default='alxlnet')) – Model architecture supported. Allowed values:
            – 'alxlnet' - XLNET architecture from Google + Malaya.
        • pool_mode (str, optional (default='last')) – Model logits architecture supported. Allowed values:
            – 'last' - last of the sequence.
            – 'first' - first of the sequence.
            – 'mean' - mean of the sequence.
            – 'attn' - attention of the sequence.

    Returns result
    Return type malaya.transformers.alxlnet.Model class

class malaya.transformers.alxlnet.Model

vectorize(strings: List[str])
    Vectorize string inputs.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array

attention(strings: List[str], method: str = 'last', **kwargs)
    Get attention for string inputs.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.

    Returns result
    Return type List[List[Tuple[str, float]]]

visualize_attention(string: str)
    Visualize attention.

    Parameters string (str) –

9.6.51 malaya.transformers.bert

malaya.transformers.bert.load(model: str = 'base', **kwargs)
    Load bert model.

    Parameters model (str, optional (default='base')) – Model architecture supported. Allowed values:
        • 'bert' - base bert-bahasa released by Malaya.
        • 'tiny-bert' - tiny bert-bahasa released by Malaya.

    Returns result
    Return type malaya.transformers.bert.Model class


class malaya.transformers.bert.Model

vectorize(strings: List[str])
    Vectorize string inputs.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array

attention(strings: List[str], method: str = 'last', **kwargs)
    Get attention for string inputs.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.

    Returns result
    Return type List[List[Tuple[str, float]]]

visualize_attention(string: str)
    Visualize attention.

    Parameters string (str) –

9.6.52 malaya.transformers.electra

malaya.transformers.electra.load(model: str = 'electra', **kwargs)
    Load electra model.

    Parameters model (str, optional (default='electra')) – Model architecture supported. Allowed values:
        • 'electra' - base electra-bahasa released by Malaya.
        • 'small-electra' - small electra-bahasa released by Malaya.

    Returns result
    Return type malaya.transformers.electra.Model class

class malaya.transformers.electra.Model

vectorize(strings: List[str])
    Vectorize string inputs.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array


attention(strings: List[str], method: str = 'last', **kwargs)
    Get attention for string inputs.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.

    Returns result
    Return type List[List[Tuple[str, float]]]

visualize_attention(string: str)
    Visualize attention.

    Parameters string (str) –

9.6.53 malaya.transformers.gpt2

malaya.transformers.gpt2.load(model='345M', generate_length=100, temperature=1.0, top_k=40, **kwargs)
    Load gpt2 model.

    Parameters
        • model (str, optional (default='345M')) – Model architecture supported. Allowed values:
            – '117M' - GPT2 117M parameters.
            – '345M' - GPT2 345M parameters.
        • generate_length (int, optional (default=100)) – length of sentence to generate.
        • temperature (float, optional (default=1.0)) – temperature value; should be between 0 and 1.
        • top_k (int, optional (default=40)) – top-k sampling selection.

    Returns result
    Return type malaya.transformers.gpt2.Model class

class malaya.transformers.gpt2.Model

generate(string: str)
    Generate a text given an initial string.

    Parameters string (str) –
    Returns result
    Return type str


9.6.54 malaya.transformers.xlnet

malaya.transformers.xlnet.load(model: str = 'xlnet', pool_mode: str = 'last', **kwargs)
    Load xlnet model.

    Parameters
        • model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
            – 'xlnet' - XLNET architecture from Google.
        • pool_mode (str, optional (default='last')) – Model logits architecture supported. Allowed values:
            – 'last' - last of the sequence.
            – 'first' - first of the sequence.
            – 'mean' - mean of the sequence.
            – 'attn' - attention of the sequence.

    Returns result
    Return type malaya.transformers.xlnet.Model class

class malaya.transformers.xlnet.Model

vectorize(strings: List[str])
    Vectorize string inputs.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array

attention(strings: List[str], method: str = 'last', **kwargs)
    Get attention for string inputs.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.

    Returns result
    Return type List[List[Tuple[str, float]]]

visualize_attention(string: str)
    Visualize attention.

    Parameters string (str) –


9.7 Contributing

Contributions are welcome and are greatly appreciated! Every little bit helps, and credit will always be given.

9.7.1 Code Formatting

We use AutoPEP8 for code formatting and standards; check pyproject.toml at the root directory.

9.7.2 Report Bugs

Report bugs through GitHub issues. Please include relevant information and, preferably, code that reproduces the problem. Do not email us about issues; we will not respond to emails, so submit a proper GitHub issue.

9.7.3 Fix Bugs

Look through the GitHub issues for bugs. Any bug is open to whoever wants to fix it.

9.7.4 Implement Features

Look through the GitHub issues or the Malaya project board for features. Any unassigned improvement issue is open to whoever wants to implement it. We use frozen-graph Tensorflow, so you should be able to freeze any Tensorflow (version 1.15 and above) or Keras model.

9.7.5 Dataset

Create a new GitHub issue about your data, including the data link, or attach the data there. If you want to improve a dataset we already have, check Malaya-Dataset. Or, you can simply email your data if you do not want to expose it to the public. Malaya will not expose your data, but we will expose the models trained on your data. Thanks to,

1. Fake news, contributed by syazanihussin
2. Speech voice, contributed by Khalil Nooh
3. Speech voice, contributed by Mas Aisyah Ahmad
4. Singlish text dump, contributed by brytjy
5. Singapore news, contributed by brytjy


9.7.6 Improve Documentation

Malaya could always use better documentation; there might be typos or incorrect object names.

9.7.7 Submit Feedback

The best way to send feedback is to open a GitHub issue.

9.7.8 Unit test

Test every possible program flow! You can check the existing unit tests here. Feel free to help Malaya write unit tests, fork it!
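If you want to contribute a test, a minimal pytest-style sketch might look like this (the loader used and the test layout are assumptions, not necessarily Malaya's actual suite):

# test_sentiment.py - hypothetical unit-test sketch.
import malaya

def test_sentiment_returns_one_label_per_string():
    # Assumption: a lightweight classical model keeps the test fast.
    model = malaya.sentiment.multinomial()
    strings = ['bestnya hari ini', 'teruknya perkhidmatan ni']
    labels = model.predict(strings)
    assert len(labels) == len(strings)
    assert all(isinstance(label, str) for label in labels)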

9.8 GPU Environment

This tutorial is available as an IPython notebook at Malaya/example/gpu-environment.

[1]: %%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)

CPU times: user 6.33 s, sys: 2.5 s, total: 8.83 s
Wall time: 4.65 s

9.8.1 List available GPU

[2]: malaya.utils.available_gpu()
[2]: [('GPU:0', '29.16 GB'),
     ('GPU:1', '29.163 GB'),
     ('GPU:2', '29.163 GB'),
     ('GPU:3', '29.163 GB')]

9.8.2 Limit GPU memory

By default Malaya does not cap GPU memory. To set a cap, override the gpu_limit parameter in any load model API; gpu_limit must satisfy 0 < gpu_limit < 1. If gpu_limit = 0.3, the model will not use more than 30% of GPU memory.

malaya.sentiment.transformer(gpu_limit=0.3)


9.8.3 Not all operations supported by GPU

Yes, some models might be faster on CPU because of the overhead of transitioning between CPU and GPU too frequently, for example, the transformer model from the T2T library.

9.8.4 N Models to N gpus

To allocate a model to another GPU, set device to a different GPU, e.g. GPU:1; the default is GPU:0.

model_sentiment = malaya.sentiment.transformer(model='bert', gpu_limit=0.5, device='GPU:0')
model_subjectivity = malaya.subjectivity.transformer(model='bert', gpu_limit=0.5, device='GPU:1')
model_emotion = malaya.emotion.transformer(model='bert', gpu_limit=0.5, device='GPU:2')
model_translation = malaya.translation.ms_en.transformer(gpu_limit=0.5, device='GPU:3')

9.8.5 GPU Rules

1. Malaya will not consume all available GPU memory, but will slowly grow based on batch size. This growth is only positive (uses more GPU memory) and dynamic; feeding a small batch size will not reduce GPU memory already claimed. 2. Use malaya.utils.close_session to clear the session of unused models, but this will not free GPU memory.
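For example, a hedged sketch of rule 2 (the exact call signature of malaya.utils.close_session is an assumption; check the API reference):

import malaya

model = malaya.sentiment.transformer(model='bert', gpu_limit=0.3, device='GPU:0')
print(model.predict_proba(['comel sangat kucing ni']))

# Clear the session once the model is no longer needed. This releases the
# graph/session, but GPU memory the process already grew into stays claimed.
malaya.utils.close_session(model)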

[5]: anger_text = 'babi la company ni, aku dah la penat datang dari jauh'
     fear_text = 'takut doh tengok cerita hantu tadi'
     happy_text = 'bestnya dapat tidur harini, tak payah pergi kerja'
     love_text = 'aku sayang sgt dia dah doh'
     sadness_text = 'kecewa tengok kerajaan baru ni, janji ape pun tak dapat'
     surprise_text = 'sakit jantung aku, terkejut dengan cerita hantu tadi'

[6]: model_sentiment = malaya.sentiment.transformer(model='bert', gpu_limit=0.5, device='GPU:0')
     model_subjectivity = malaya.subjectivity.transformer(model='bert', gpu_limit=0.5, device='GPU:1')
     model_emotion = malaya.emotion.transformer(model='bert', gpu_limit=0.5, device='GPU:2')
     model_translation = malaya.translation.ms_en.transformer(gpu_limit=0.5, device='GPU:3')

WARNING:tensorflow:From /home/husein/malaya/Malaya/malaya/function/__init__.py:73: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /home/husein/malaya/Malaya/malaya/function/__init__.py:75: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /home/husein/malaya/Malaya/malaya/function/__init__.py:50: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /home/husein/malaya/Malaya/malaya/function/__init__.py:65: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

[7]: %%time

model_sentiment.predict_proba([anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text])
model_subjectivity.predict_proba([anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text])
model_emotion.predict_proba([anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text])
model_translation.translate(['Mahathir buat keputusan terburu-buru'])

CPU times: user 8.61 s, sys: 2.71 s, total: 11.3 s
Wall time: 10.8 s
[7]: ['Mahathir made a hasty decision']

[8]: !nvidia-smi
Sun Jul 12 19:26:18 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.129      Driver Version: 410.129      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0  On |                    0 |
| N/A   45C    P0    54W / 300W |   1101MiB / 32475MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   46C    P0    52W / 300W |   1100MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   45C    P0    52W / 300W |   1100MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   45C    P0    53W / 300W |   1100MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12786      C   /usr/bin/python3                            1089MiB |
|    1     12786      C   /usr/bin/python3                            1089MiB |
|    2     12786      C   /usr/bin/python3                            1089MiB |
|    3     12786      C   /usr/bin/python3                            1089MiB |
+-----------------------------------------------------------------------------+


9.9 Devices

This tutorial is available as an IPython notebook at Malaya/example/devices.

9.9.1 List available devices supported to run Malaya model

[1]: import malaya
     import logging

     logging.basicConfig(level=logging.INFO)

[2]: malaya.utils.available_device()
[2]: [('CPU:0', '0.268 GB'),
     ('XLA_CPU:0', '17.18 GB'),
     ('XLA_GPU:0', '17.18 GB'),
     ('XLA_GPU:1', '17.18 GB'),
     ('XLA_GPU:2', '17.18 GB'),
     ('XLA_GPU:3', '17.18 GB'),
     ('GPU:0', '29.466 GB'),
     ('GPU:1', '29.469 GB'),
     ('GPU:2', '29.469 GB'),
     ('GPU:3', '29.469 GB')]

9.9.2 Use specific device for specific model

To do that, pass the device parameter to any load model function in Malaya; the default is CPU:0.

malaya.sentiment.transformer(model='alxlnet', device='CPU:0')

Or if you want to use XLA,

malaya.sentiment.transformer(model='alxlnet', device='XLA_CPU:0')

By default, device will automatically be set to the GPU with the most free memory if any GPUs are detected.

[3]: alxlnet_cpu = malaya.sentiment.transformer(model='alxlnet', device='CPU:0')
INFO:root:running sentiment/alxlnet using device /device:GPU:1

[4]: alxlnet_cpu = malaya.sentiment.transformer(model='alxlnet', device='GPU:1')
INFO:root:running sentiment/alxlnet using device /device:GPU:1


9.9.3 Disable auto GPU

Let's say you do not want auto-allocation to a GPU; simply set auto_gpu to False, or set,

export CUDA_VISIBLE_DEVICES=''

[5]: alxlnet_cpu = malaya.sentiment.transformer(model='alxlnet', device='CPU:0', auto_gpu=False)
INFO:root:running sentiment/alxlnet using device /device:CPU:0

[6]: alxlnet_xla_cpu = malaya.sentiment.transformer(model='alxlnet', device='XLA_CPU:0', auto_gpu=False)
INFO:root:running sentiment/alxlnet using device /device:XLA_CPU:0

[7]: string = 'saya kentut busuk tapi muka comel'

[8]: %%time

alxlnet_cpu.predict_proba([string])
CPU times: user 4.95 s, sys: 636 ms, total: 5.59 s
Wall time: 5.14 s
[8]: [{'negative': 0.99993134, 'positive': 6.920824e-07, 'neutral': 6.7949295e-05}]

[9]: %%time

alxlnet_xla_cpu.predict_proba([string])
CPU times: user 49.9 s, sys: 818 ms, total: 50.7 s
Wall time: 50.3 s
[9]: [{'negative': 0.99997425, 'positive': 2.5436142e-07, 'neutral': 2.5510788e-05}]

Again, not all Tensorflow operations are supported by XLA.

9.10 Precision Mode

This tutorial is available as an IPython notebook at Malaya/example/precision-mode.

Let's say you want to run the model in FP16 or FP64.

[1]: import malaya
     import logging

     logging.basicConfig(level=logging.INFO)


9.10.1 Use specific precision for specific model

To do that, pass the precision_mode parameter to any load model function in Malaya,

malaya.sentiment.transformer(model='albert', precision_mode='FP16')

Supported precision modes are {'BFLOAT16', 'FP16', 'FP32', 'FP64'}; the default is FP32. Check the code at https://github.com/huseinzol05/malaya-boilerplate/blob/main/malaya_boilerplate/frozen_graph.py

[2]: albert = malaya.sentiment.transformer(model='albert')
     albert_fp16 = malaya.sentiment.transformer(model='albert', precision_mode='FP16')
INFO:root:running sentiment/albert using device /device:CPU:0
INFO:root:running sentiment/albert using device /device:CPU:0
Converting sentiment/albert to FP16.

[3]: string = 'ketiak saya masam tapi saya comel'

[5]: %%time

albert.predict_proba([string])
CPU times: user 166 ms, sys: 15.9 ms, total: 182 ms
Wall time: 47.1 ms
[5]: [{'negative': 0.8387252, 'positive': 0.0016127465, 'neutral': 0.15966207}]

[7]: %%time

albert_fp16.predict_proba([string])
CPU times: user 14.6 s, sys: 53.3 ms, total: 14.6 s
Wall time: 2.21 s
[7]: [{'negative': 0.839, 'positive': 0.001611, 'neutral': 0.1597}]

Running FP16 is not necessarily faster; most CPUs are not optimized for FP16. You might want to look into RTX-class GPUs and above.

9.11 Quantization

This tutorial is available as an IPython notebook at Malaya/example/quantization.

We provide quantized models for all Malaya models, for example, the sentiment transformer models,

[1]: import malaya

     malaya.sentiment.available_transformer()
INFO:root:tested on 20% test set.
[1]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.6               111.00          0.99330       0.99330         0.99329
     tiny-bert         57.4                15.40          0.98774       0.98774         0.98774
     albert            48.6                12.80          0.99227       0.99226         0.99226
     tiny-albert       22.4                 5.98          0.98554       0.98550         0.98551
     xlnet            446.6               118.00          0.99353       0.99353         0.99353
     alxlnet           46.8                13.30          0.99188       0.99188         0.99188

Usually a quantized model compresses to about 4x smaller than the original size. The quantized model converts all possible floating constants to quantized constants, and stores only the mean and standard deviation of the floating constants plus the quantized constants. Again, a quantized model is not necessarily faster, because Tensorflow casts back to FP32 during feed-forward for certain operations.

9.11.1 Use quantized model

Simply set the quantized parameter to True; the default is False.

[2]: albert_quantized = malaya.sentiment.transformer(model='albert', quantized=True)
     albert = malaya.sentiment.transformer(model='albert')
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running sentiment/albert-quantized using device /device:CPU:0
INFO:root:running sentiment/albert using device /device:CPU:0

[3]: string = 'saya masam awak pun masam'

[5]: %%time

albert.predict([string])
CPU times: user 171 ms, sys: 15.9 ms, total: 187 ms
Wall time: 47.2 ms
[5]: ['negative']

[9]: %%time

albert_quantized.predict([string])
CPU times: user 181 ms, sys: 41.1 ms, total: 223 ms
Wall time: 53.8 ms
[9]: ['negative']


9.12 Deployment

This tutorial is available as an IPython notebook at Malaya/example/deployment.

[1]: import malaya

9.12.1 Disable file validation

If you deploy Malaya models on persistent, short-lived (auto-restart to reduce memory consumption) or async / multiprocess workers, you might get errors related to file checking. You can skip this check as long as you are able to persist the Malaya models.

download model

First, you need to download the model into your local machine / environment; run this in a separate script,

[5]: model = malaya.zero_shot.classification.transformer(model='tiny-albert')
INFO:tensorflow:loading sentence piece model

load model

Load the model without file checking; run this on top of FastAPI / Flask / Gunicorn.

[6]: model = malaya.zero_shot.classification.transformer(model='tiny-albert', validate=False)
INFO:tensorflow:loading sentence piece model

This loaded model can be shared among multiple workers / threads.

9.12.2 Disable type checking

Make sure you have installed the latest version of herpetologist,

pip install herpetologist -U

If you check the Malaya source code, you can see we check parameters on function / method definitions, https://github.com/huseinzol05/Malaya/blob/master/malaya/model/bert.py#L232. We use herpetologist to check the passed variables, https://github.com/huseinzol05/herpetologist

@check_type
def predict(self, strings: List[str], add_neutral: bool = True):
    """
    classify a string.

    Parameters
    ----------
    strings: List[str]
    add_neutral: bool, optional (default=True)
        if True, it will add neutral probability.

    Returns
    -------
    result: List[str]
    """

@check_type will check whether strings is a List[str]; if not, it will throw an error. But @check_type becomes expensive if you have a massive list of strings, so you can disable this type checking by setting a bash environment variable. Some of our environments we want to enable it, some we want to disable, without herpetologist constantly checking the variables. To disable it, simply set,

export ENABLE_HERPETOLOGIST=false

Or, using python,

import os
os.environ['ENABLE_HERPETOLOGIST'] = 'false'

You can see the impact on execution time in this example.

9.12.3 Use smaller model

Stacking multiple smaller models is much faster than a single big model, but this does not guarantee the same accuracy as the big model.
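A hedged sketch of stacking two small models; malaya.stack.predict_stack is the stacking helper covered in section 9.58, though the combining behaviour described in the comment is an assumption:

import malaya

tiny_bert = malaya.sentiment.transformer(model='tiny-bert')
tiny_albert = malaya.sentiment.transformer(model='tiny-albert')

strings = ['husein sangat comel dan handsome']

# Assumption: predict_stack combines per-label probabilities from each
# model (e.g. by geometric mean) into a single prediction.
print(malaya.stack.predict_stack([tiny_bert, tiny_albert], strings))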

9.12.4 docker example

You can check some docker examples and benchmarks here, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment. The purpose of these benchmarks is to measure how fast and how many requests a model can serve in perfect minibatch realtime, say live-streaming data from social media to detect sentiment, whether a text is negative or positive. Tested on the ALBERT-BASE sentiment model. These are my machine specifications,

1. Intel(R) Core(TM) i7-8557U CPU @ 1.70GHz
2. 16 GB 2133 MHz LPDDR3

And I use the same wrk command,

wrk -t15 -c600 -d1m --timeout=15s http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi

Some constraints,

1. ALBERT BASE is around 43MB.
2. Memory is limited to 2GB, set by Docker itself.
3. Batch size of 50 strings, duplicating husein sangat comel dan handsome tambahan lagi ketiak wangi 50 times; you can check every deployment in app.py or main.py.
4. No limit on CPU usage.
5. No caching.

fast-api

Workers automatically calculated by fast-api, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/fast-api

Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
  15 threads and 600 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    nan%
    Req/Sec      0.24      1.16     9.00    95.52%
  68 requests in 1.00m, 8.96KB read
  Socket errors: connect 364, read 293, write 0, timeout 68
Requests/sec:      1.13
Transfer/sec:    152.75B

Gunicorn Flask

5 sync workers, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/gunicorn-flask

Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
  15 threads and 600 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.98s     3.25s    12.71s    41.67%
    Req/Sec      0.49      1.51     9.00    90.91%
  59 requests in 1.00m, 9.10KB read
  Socket errors: connect 364, read 39, write 0, timeout 47
Requests/sec:      0.98
Transfer/sec:    155.12B

UWSGI Flask + Auto scaling

Min 2 workers, max 10 workers, spare2 algorithm, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/uwsgi-flask-cheaper

Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
  15 threads and 600 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.80s     4.16s    14.73s    62.50%
    Req/Sec      0.75      2.60     9.00    91.67%
  12 requests in 1.00m, 0.90KB read
  Socket errors: connect 364, read 105, write 0, timeout 4
Requests/sec:      0.20
Transfer/sec:     15.37B


UWSGI Flask

4 Workers, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/uwsgi-flask-fork

Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
  15 threads and 600 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.79s     4.13s    14.87s    53.33%
    Req/Sec      1.06      3.16    20.00    92.59%
  56 requests in 1.00m, 4.21KB read
  Socket errors: connect 364, read 345, write 0, timeout 41
Requests/sec:      0.93
Transfer/sec:     71.74B

9.12.5 Learn different deployment techniques

E.g., change concurrent requests into mini-batch realtime processing to speed up text classification; see this repository. This can reduce the time taken by up to 95%!
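The core idea, in a hedged sketch: buffer concurrent requests for a few milliseconds, run one batched predict_proba call, then fan the results back out. Everything here except predict_proba is illustrative plumbing, not a Malaya API.

import queue
import threading

import malaya

model = malaya.sentiment.transformer(model='tiny-albert')
requests_q = queue.Queue()  # items: (string, result_container, done_event)

def batcher(max_batch=50, wait_seconds=0.01):
    # Collect whatever arrives within a short window, then run one batched call.
    while True:
        batch = [requests_q.get()]  # block until the first request arrives
        try:
            while len(batch) < max_batch:
                batch.append(requests_q.get(timeout=wait_seconds))
        except queue.Empty:
            pass
        results = model.predict_proba([string for string, _, _ in batch])
        for (_, container, done), result in zip(batch, results):
            container.append(result)
            done.set()

threading.Thread(target=batcher, daemon=True).start()

def handle_request(string):
    # What a web handler would do: enqueue, wait, return.
    container, done = [], threading.Event()
    requests_q.put((string, container, done))
    done.wait(timeout=15)
    return container[0]

print(handle_request('husein sangat comel'))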


9.13 Transformer

This tutorial is available as an IPython notebook at Malaya/example/transformer.

Malaya provides a basic interface for pretrained Transformer encoder models, specific to Malay, local social media slang and the Manglish language; we call it Transformer-Bahasa. Below are the datasets we pretrained on,

Standard Bahasa dataset,
1. Malay-dataset/dumping.
2. Malay-dataset/pure-text.

Bahasa social media,
1. Malay-dataset/dumping/instagram.
2. Malay-dataset/dumping/twitter.

Singlish / Manglish,
1. Malay-dataset/dumping/singlish.
2. Malay-dataset/dumping/singapore-news.

This interface does not let you do custom training. If you want to download a pretrained Transformer-Bahasa model and use it for custom transfer learning, you can download it here, https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/, with some notebooks to help you get started. Or you can simply use hugging-face transformers to try transformer models from Malaya; check the available models here, https://huggingface.co/models?filter=ms


[5]: from IPython.core.display import Image, display

     display(Image('huggingface.png', width=500))

[1]: %%time
     import malaya
CPU times: user 4.88 s, sys: 641 ms, total: 5.52 s
Wall time: 4.5 s

9.13.1 List Transformer-Bahasa available

[2]: malaya.transformer.available_transformer()
[2]:                Size (MB)                      Description
     bert               425.6     Google BERT BASE parameters
     tiny-bert           57.4     Google BERT TINY parameters
     albert              48.6     Google ALBERT BASE parameters
     tiny-albert         22.4     Google ALBERT TINY parameters
     xlnet              446.6     Google XLNET BASE parameters
     alxlnet             46.8     Malaya ALXLNET BASE parameters
     electra            443       Google ELECTRA BASE parameters
     small-electra       55       Google ELECTRA SMALL parameters

[3]: strings = ['Kerajaan galakkan rakyat naik public transport tapi parking kat lrt ada 15. Reserved utk staff rapid je dah berpuluh. Park kereta tepi jalan kang kene saman dgn majlis perbandaran. Kereta pulak senang kene curi. Cctv pun tak ada. Naik grab dah 5-10 ringgit tiap hari. Gampang juga',
                'Alaa Tun lek ahhh npe muka masam cmni kn agong kata usaha kerajaan terdahulu sejak selepas merdeka',
                "Orang ramai cakap nurse kerajaan garang. So i tell u this. Most of our local ppl will treat us as hamba abdi and they don't respect us as a nurse"]

9.13.2 Load XLNET-Bahasa

[4]: xlnet = malaya.transformer.load(model='xlnet')
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/xlnet.py:70: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:81: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/xlnet.py:253: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/xlnet.py:253: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/modeling.py:686: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:memory input None
INFO:tensorflow:Use float type
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/modeling.py:693: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/modeling.py:797: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.dropout instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:271: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.__call__` method instead.
WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see: https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md, https://github.com/tensorflow/addons, https://github.com/tensorflow/io (for I/O related ops). If you depend on functionality not listed there, please file an issue.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/modeling.py:99: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.Dense instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:94: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:95: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:96: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:100: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:103: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/xlnet-model/base/xlnet-base/model.ckpt

I have random sentences copied from Twitter, searched using the keyword kerajaan.

Vectorization

Change a string or batch of strings into its latent space / vector representation.

def vectorize(self, strings: List[str]):
    """
    Vectorize string inputs.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: np.array
    """

[5]: v = xlnet.vectorize(strings)
     v.shape
[5]: (3, 768)


Attention

def attention(self, strings: List[str], method: str = 'last', **kwargs):
    """
    Get attention string inputs from bert attention.

    Parameters
    ----------
    strings : List[str]
    method : str, optional (default='last')
        Attention layer supported. Allowed values:

        * 'last' - attention from last layer.
        * 'first' - attention from first layer.
        * 'mean' - average attentions from all layers.

    Returns
    -------
    result : List[List[Tuple[str, float]]]
    """

You can give a list of strings or a single string to get the attention; in this documentation, I just want to use one string.

[6]: xlnet.attention([strings[1]], method='last')
[6]: [[('Alaa', 0.062061824), ('Tun', 0.051056776), ('lek', 0.13115405), ('ahhh', 0.08195943),
      ('npe', 0.06210695), ('muka', 0.04706182), ('masam', 0.058289353), ('cmni', 0.026094284),
      ('kn', 0.056146827), ('agong', 0.033949938), ('kata', 0.052644122), ('usaha', 0.07063393),
      ('kerajaan', 0.046773836), ('terdahulu', 0.057166394), ('sejak', 0.045712817),
      ('selepas', 0.047048207), ('merdeka', 0.07013944)]]

[7]: xlnet.attention([strings[1]], method='first')
[7]: [[('Alaa', 0.045956098), ('Tun', 0.040094823), ('lek', 0.0611072), ('ahhh', 0.07029096),
      ('npe', 0.048513662), ('muka', 0.056670234), ('masam', 0.04088071), ('cmni', 0.08728454),
      ('kn', 0.047778472), ('agong', 0.081243224), ('kata', 0.03866041), ('usaha', 0.058326427),
      ('kerajaan', 0.055446573), ('terdahulu', 0.077162124), ('sejak', 0.05951431),
      ('selepas', 0.05385498), ('merdeka', 0.07721528)]]

[8]: xlnet.attention([strings[1]], method='mean')
[8]: [[('Alaa', 0.06978634), ('Tun', 0.0517442), ('lek', 0.059642658), ('ahhh', 0.055883657),
      ('npe', 0.05339206), ('muka', 0.06806306), ('masam', 0.0489921), ('cmni', 0.0698193),
      ('kn', 0.057752036), ('agong', 0.065566674), ('kata', 0.059152905), ('usaha', 0.063305095),
      ('kerajaan', 0.050608452), ('terdahulu', 0.05888331), ('sejak', 0.057429556),
      ('selepas', 0.042058233), ('merdeka', 0.067920305)]]

Visualize Attention

Before using attention visualization, we need to load D3 into our jupyter notebook first. This visualization is borrowed from https://github.com/jessevig/bertviz.

def visualize_attention(self, string: str):
    """
    Visualize attention.

    Parameters
    ----------
    string : str
    """

[9]: %%javascript
     require.config({
         paths: {
             d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
             jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
         }
     });

[10]: xlnet.visualize_attention('nak makan ayam dgn husein')


I attached a screenshot; readthedocs cannot render the JavaScript.

[13]: from IPython.core.display import Image, display

      display(Image('xlnet-attention.png', width=300))

All attention models can use these interfaces.

9.13.3 Load ELECTRA-Bahasa

Feel free to use other models.

[4]: electra = malaya.transformer.load(model='electra')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:56: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:240: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.Dense instead.
WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:114: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:117: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:120: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:127: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:129: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt

[14]: electra.attention([strings[1]], method='last')
[14]: [[('Alaa', 0.059817147), ('Tun', 0.075028375), ('lek', 0.057848394), ('ahhh', 0.046973262),
       ('npe', 0.05160833), ('muka', 0.06221234), ('masam', 0.058585588), ('cmni', 0.054711323),
       ('kn', 0.06741887), ('agong', 0.056326747), ('kata', 0.054182768), ('usaha', 0.07986903),
       ('kerajaan', 0.05559596), ('terdahulu', 0.052879248), ('sejak', 0.049992196),
       ('selepas', 0.053916205), ('merdeka', 0.06303418)]]


9.14 Word Vector

This tutorial is available as an IPython notebook at Malaya/example/wordvector.

9.14.1 Pretrained word2vec

You can download Malaya pretrained word2vec without needing to import malaya.

word2vec from local news, size-256

word2vec from wikipedia, size-256

word2vec from local social media, size-256

But if you don't know what to do with Malaya word2vec, Malaya provides some useful functions for you!

[1]: %%time
     import malaya
     %matplotlib inline
CPU times: user 5.15 s, sys: 907 ms, total: 6.05 s
Wall time: 6.25 s

9.14.2 List available pretrained word2vec

[2]: malaya.wordvector.available_wordvector()
[2]:              Size (MB)  Vocab size  lowercase                                        Description
     wikipedia        781.7      763350       True    pretrained on Malay wikipedia word2vec size 256
     socialmedia       1300     1294638       True  pretrained on cleaned Malay twitter and Malay ...
     news             200.2      195466       True          pretrained on cleaned Malay news size 256
     combine           1900     1903143       True  pretrained on cleaned Malay news + Malay socia...


9.14.3 Load pretrained word2vec

def load(model: str = 'wikipedia', **kwargs):
    """
    Return malaya.wordvector.WordVector object.

    Parameters
    ----------
    model : str, optional (default='wikipedia')
        Model architecture supported. Allowed values:

        * 'wikipedia' - pretrained on Malay wikipedia word2vec size 256.
        * 'socialmedia' - pretrained on cleaned Malay twitter and Malay instagram size 256.
        * 'news' - pretrained on cleaned Malay news size 256.
        * 'combine' - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.

    Returns
    -------
    vocabulary: indices dictionary for `vector`.
    vector: np.array, 2D.
    """

[3]: vocab_news, embedded_news = malaya.wordvector.load(model='news')
     vocab_wiki, embedded_wiki = malaya.wordvector.load(model='wikipedia')

9.14.4 Load word vector interface

class WordVector:
    @check_type
    def __init__(self, embed_matrix, dictionary: dict, **kwargs):
        """
        Parameters
        ----------
        embed_matrix: numpy array
        dictionary: dictionary
        """

1. embed_matrix must be a 2D array,

array([[ 0.25      , -0.10816103, -0.19881412, ...,  0.40432587,  0.19388093, -0.07062137],
       [ 0.3231817 , -0.01318745, -0.17950962, ...,  0.25      ,  0.08444146, -0.11705721],
       [ 0.29103908, -0.16274083, -0.20255531, ...,  0.25      ,  0.06253044, -0.16404966],
       ...,
       [ 0.21346697,  0.12686132, -0.4029543 , ...,  0.43466234,  0.20910986, -0.32219803],
       [ 0.2372157 ,  0.32420087, -0.28036436, ...,  0.2894639 ,  0.20745888, -0.30600077],
       [ 0.27907744,  0.35755727, -0.34932107, ...,  0.37472805,  0.42045262, -0.21725406]], dtype=float32)

2. dictionary, a dictionary mapping {'word': index},

{'mengembanfkan': 394623,
 'dipujanya': 234554,
 'comicolor': 182282,
 'immaz': 538660,
 'qabar': 585119,
 'phidippus': 180802,
}

Load custom word vector

Like fast-text; for example, I downloaded from here, https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ms.vec. We need to parse the data to get embed_matrix and dictionary.

[ ]: import io
     import numpy as np

     fin = io.open('wiki.ms.vec', 'r', encoding='utf-8', newline='\n', errors='ignore')
     n, d = map(int, fin.readline().split())

     data, vectors = {}, []
     for no, line in enumerate(fin):
         tokens = line.rstrip().split(' ')
         data[tokens[0]] = no
         vectors.append(list(map(float, tokens[1:])))

     vectors = np.array(vectors)
     fast_text = malaya.wordvector.WordVector(vectors, data)

[5]: word_vector_news = malaya.wordvector.WordVector(embedded_news, vocab_news)
     word_vector_wiki = malaya.wordvector.WordVector(embedded_wiki, vocab_wiki)

9.14.5 Check top-k similar semantics based on a word

def n_closest(
    self,
    word: str,
    num_closest: int = 5,
    metric: str = 'cosine',
    return_similarity: bool = True,
):
    """
    find nearest words based on a word.

    Parameters
    ----------
    word: str
        Eg, 'najib'
    num_closest: int, (default=5)
        number of words closest to the result.
    metric: str, (default='cosine')
        vector distance algorithm.
    return_similarity: bool, (default=True)
        if True, will return a value between 0-1 representing the distance.

    Returns
    -------
    word_list: list of nearest words
    """

[6]: word = 'anwar'
     print("Embedding layer: 8 closest words to: '%s' using malaya news word2vec" % (word))
     print(word_vector_news.n_closest(word=word, num_closest=8, metric='cosine'))
Embedding layer: 8 closest words to: 'anwar' using malaya news word2vec
[['najib', 0.6967672109603882], ['mukhriz', 0.675892174243927], ['azmin', 0.6686884164810181],
 ['rafizi', 0.6465028524398804], ['muhyiddin', 0.6413404941558838], ['daim', 0.6334482431411743],
 ['khairuddin', 0.6300410032272339], ['shahidan', 0.6269811391830444]]

[12]: word = 'anwar'
      print("Embedding layer: 8 closest words to: '%s' using malaya wiki word2vec" % (word))
      print(word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine'))
Embedding layer: 8 closest words to: 'anwar' using malaya wiki word2vec
[['rasulullah', 0.6918460130691528], ['jamal', 0.6604709029197693], ['noraniza', 0.65153968334198],
 ['khalid', 0.6450133323669434], ['mahathir', 0.6447468400001526], ['sukarno', 0.641593337059021],
 ['wahid', 0.6359774470329285], ['pekin', 0.6262176036834717]]

9.14.6 Check batch top-k similar semantics based on a word

def batch_n_closest(
    self,
    words: List[str],
    num_closest: int = 5,
    return_similarity: bool = False,
    soft: bool = True,
):
    """
    find nearest words based on a batch of words using Tensorflow.

    Parameters
    ----------
    words: list
        Eg, ['najib', 'anwar']
    num_closest: int, (default=5)
        number of words closest to the result.
    return_similarity: bool, (default=True)
        if True, will return between 0-1 represents the distance.
    soft: bool, (default=True)
        if True, a word not in the dictionary will be replaced with nearest JaroWinkler ratio.
        if False, it will throw an exception if a word not in the dictionary.

    Returns
    -------
    word_list: list of nearest words
    """

[13]: words = ['anwar', 'mahathir']
      word_vector_news.batch_n_closest(words, num_closest=8, return_similarity=False)
[13]: [['anwar', 'najib', 'mukhriz', 'azmin', 'rafizi', 'muhyiddin', 'daim', 'khairuddin'],
      ['mahathir', 'daim', 'sahruddin', 'streram', 'morsi', 'anifah', 'jokowi', 'ramasamy']]

What happens if a word is not in the dictionary? You can set the parameter soft to True or False; the default is True. If True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio. If False, it will throw an exception if a word is not in the dictionary.
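To make the fallback concrete, below is a rough sketch of the idea using stdlib difflib as a stand-in similarity measure; Malaya actually uses JaroWinkler over its own vocabulary, so treat this as an approximation only.

import difflib

vocabulary = ['husein', 'khairi', 'mahathir', 'anwar']

def soft_lookup(word, vocab):
    # out-of-vocabulary words fall back to the closest in-vocabulary match
    if word in vocab:
        return word
    return difflib.get_close_matches(word, vocab, n=1, cutoff=0.0)[0]

print(soft_lookup('husein-comel', vocabulary))  # 'husein'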

[14]: words = ['anwar', 'mahathir', 'husein-comel']
      word_vector_wiki.batch_n_closest(words, num_closest=8,
                                       return_similarity=False, soft=False)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
      1 words = ['anwar', 'mahathir', 'husein-comel']
      2 word_vector_wiki.batch_n_closest(words, num_closest=8,
----> 3                                  return_similarity=False, soft=False)

~/Documents/Malaya/malaya/wordvector.py in batch_n_closest(self, words, num_closest, return_similarity, soft)
    484                 raise Exception(
    485                     '%s not in dictionary, please use another word or set `soft` = True'
--> 486                     % (words[i])
    487                 )
    488         batches = np.array([self.get_vector_by_name(w) for w in words])

Exception: husein-comel not in dictionary, please use another word or set `soft` = True


[15]: words = ['anwar', 'mahathir', 'husein-comel']
      word_vector_wiki.batch_n_closest(words, num_closest=8,
                                       return_similarity=False, soft=True)
[15]: [['anwar', 'rasulullah', 'jamal', 'noraniza', 'khalid', 'mahathir', 'sukarno', 'wahid'],
      ['mahathir', 'anwar', 'wahid', 'najib', 'khalid', 'sukarno', 'suharto', 'salahuddin'],
      ['husein', 'khairi', 'gccsa', 'jkrte', 'montagny', 'pejudo', 'badriyyin', 'naginatajutsu']]

9.14.7 Word2vec calculator

You can put any equation you want; words are replaced by their vectors before the arithmetic is evaluated.

def calculator(
    self,
    equation: str,
    num_closest: int = 5,
    metric: str = 'cosine',
    return_similarity: bool = True,
):
    """
    calculator parser for word2vec.

    Parameters
    ----------
    equation: str
        Eg, '(mahathir + najib) - rosmah'
    num_closest: int, (default=5)
        number of words closest to the result.
    metric: str, (default='cosine')
        vector distance algorithm.
    return_similarity: bool, (default=True)
        if True, will return between 0-1 represents the distance.

    Returns
    -------
    word_list: list of nearest words
    """
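Under the hood this is plain vector arithmetic followed by a nearest-neighbour search. A minimal numpy sketch of the same idea, reusing vocab_news / embedded_news loaded earlier; Malaya additionally parses full bracketed equations.

import numpy as np

def closest_words(vector, embedded, vocab, k=8):
    # cosine similarity of `vector` against every row of the embedding matrix
    sims = embedded @ vector / (
        np.linalg.norm(embedded, axis=1) * np.linalg.norm(vector)
    )
    reverse = {index: word for word, index in vocab.items()}
    return [reverse[i] for i in np.argsort(sims)[::-1][:k]]

v = (embedded_news[vocab_news['anwar']]
     + embedded_news[vocab_news['amerika']]
     + embedded_news[vocab_news['mahathir']])
print(closest_words(v, embedded_news, vocab_news))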

[18]: word_vector_news.calculator('anwar + amerika + mahathir', num_closest=8,
                                  metric='cosine', return_similarity=False)
[18]: ['mahathir', 'anwar', 'trump', 'duterte', 'netanyahu', 'jokowi', 'rusia', 'kj', 'obama']

[19]: word_vector_wiki.calculator('anwar + amerika + mahathir', num_closest=8,
                                  metric='cosine', return_similarity=False)
[19]: ['mahathir', 'anwar', 'sukarno', 'suharto', 'hamas', 'sparta', 'amerika', 'iraq', 'lubnan']

9.14.8 Visualize scatter-plot

def scatter_plot(
    self,
    labels,
    centre: str = None,
    figsize: Tuple[int, int] = (7, 7),
    plus_minus: int = 25,
    handoff: float = 5e-5,
):
    """
    plot a scatter plot based on output from calculator / n_closest / analogy.

    Parameters
    ----------
    labels : list
        output from calculator / n_closest / analogy.
    centre : str, (default=None)
        centre label, if a str, it will annotate in a red color.
    figsize : tuple, (default=(7, 7))
        figure size for plot.

    Returns
    -------
    tsne: np.array, 2D.
    """

[20]: word = 'anwar'
      result = word_vector_news.n_closest(word=word, num_closest=8, metric='cosine')
      data = word_vector_news.scatter_plot(result, centre=word)

[21]: word = 'anwar'
      result = word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine')
      data = word_vector_wiki.scatter_plot(result, centre=word)


9.14.9 Visualize tree-plot

def tree_plot(
    self, labels, figsize: Tuple[int, int] = (7, 7), annotate: bool = True
):
    """
    plot a tree plot based on output from calculator / n_closest / analogy.

    Parameters
    ----------
    labels : list
        output from calculator / n_closest / analogy.
    annotate : bool, (default=True)
        if True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        figure size for plot.

    Returns
    -------
    embed: np.array, 2D.
    labelled: labels for X / Y axis.
    """

[22]: word = 'anwar'
      result = word_vector_news.n_closest(word=word, num_closest=8, metric='cosine')
      data = word_vector_news.tree_plot(result)


[23]: word = 'anwar'
      result = word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine')
      data = word_vector_wiki.tree_plot(result)


9.14.10 Visualize social-network

def network(
    self,
    word,
    num_closest=8,
    depth=4,
    min_distance=0.5,
    iteration=300,
    figsize=(15, 15),
    node_color='#72bbd0',
    node_factor=50,
):
    """
    plot a social network based on word given

    Parameters
    ----------
    word : str
        centre of social network.
    num_closest: int, (default=8)
        number of words closest to the node.
    depth: int, (default=4)
        depth of social network. Deeper is more expensive to calculate, O(num_closest ** depth).
    min_distance: float, (default=0.5)
        minimum distance among nodes. Increase the value to increase the distance among nodes.
    iteration: int, (default=300)
        number of loops to train the social network to fit min_distance.
    figsize: tuple, (default=(15, 15))
        figure size for plot.
    node_color: str, (default='#72bbd0')
        color for nodes.
    node_factor: int, (default=10)
        size factor for depth nodes. Increasing this value will increase node sizes based on depth.
    """

[24]: g = word_vector_news.network('mahathir', figsize=(10, 10), node_factor=50, depth=3)


[25]: g = word_vector_wiki.network('mahathir', figsize=(10, 10), node_factor=50, depth=3)


9.14.11 Get embedding from a word

def get_vector_by_name(
    self, word: str, soft: bool = False, topn_soft: int = 5
):
    """
    get vector based on string.

    Parameters
    ----------
    word: str
    soft: bool, (default=True)
        if True, a word not in the dictionary will be replaced with nearest JaroWinkler ratio.
        if False, it will throw an exception if a word not in the dictionary.
    topn_soft: int, (default=5)
        if word not found in dictionary, will returned `topn_soft` size of similar size using jarowinkler.

    Returns
    -------
    vector: np.array, 1D
    """

[28]: word_vector_wiki.get_vector_by_name('najib').shape
[28]: (256,)

If a word is not found in the vocabulary, it will throw an exception with the top-5 nearest words,

[26]: word_vector_wiki.get_vector_by_name('husein-comel')
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
----> 1 word_vector_wiki.get_vector_by_name('husein-comel')

~/Documents/Malaya/malaya/wordvector.py in get_vector_by_name(self, word)
    127             raise Exception(
    128                 'input not found in dictionary, here top-5 nearest words [%s]'
--> 129                 % (strings)
    130             )
    131         return self._embed_matrix[self._dictionary[word]]

Exception: input not found in dictionary, here top-5 nearest words [husein, husei, husenil, husen, secomel]

9.15 Word and sentence tokenizer

This tutorial is available as an IPython notebook at Malaya/example/tokenizer.

[1]: %%time
     import malaya
CPU times: user 5.91 s, sys: 1.12 s, total: 7.03 s
Wall time: 7.62 s
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

[2]: string1 = 'xjdi ke, y u xsuke makan HUSEIN kt situ tmpt, i hate it. pelikle, pada'
     string2 = 'i mmg2 xske mknn HUSEIN kampng tmpat, i love them. pelikle saye'
     string3 = 'perdana menteri ke11 sgt suka makn ayam, harganya cuma rm15.50'
     string4 = 'pada 10/4, kementerian mengumumkan, 1/100'
     string5 = 'Husein Zolkepli dapat tempat ke-12 lumba lari hari ni'
     string6 = 'Husein Zolkepli (2011 - 2019) adalah ketua kampng di kedah sekolah King Edward ke-IV'
     string7 = '2jam 30 minit aku tunggu kau, 60.1 kg kau ni, suhu harini 31.2c, aku dahaga minum 600ml'


9.15.1 Load word tokenizer

class Tokenizer:
    def __init__(self, lowercase=False, **kwargs):
        """
        Load Tokenizer object.
        Check supported regex pattern at https://github.com/huseinzol05/Malaya/blob/master/malaya/text/regex.py#L85

        Parameters
        ----------
        lowercase: bool, optional (default=False)
            lowercase tokens.
        emojis: bool, optional (default=True)
            True to keep emojis.
        urls: bool, optional (default=True)
            True to keep urls.
        tags: bool, optional (default=True)
            True to keep tags: <husein>.
        emails: bool, optional (default=True)
            True to keep emails.
        users: bool, optional (default=True)
            True to keep users handles: @cbaziotis.
        hashtags: bool, optional (default=True)
            True to keep hashtags.
        phones: bool, optional (default=True)
            True to keep phones.
        percents: bool, optional (default=True)
            True to keep percents.
        money: bool, optional (default=True)
            True to keep money expressions.
        date: bool, optional (default=True)
            True to keep date expressions.
        time: bool, optional (default=True)
            True to keep time expressions.
        acronyms: bool, optional (default=True)
            True to keep acronyms.
        emoticons: bool, optional (default=True)
            True to keep emoticons.
        censored: bool, optional (default=True)
            True to keep censored words: f**k.
        emphasis: bool, optional (default=True)
            True to keep words with emphasis: *very* good.
        numbers: bool, optional (default=True)
            True to keep numbers.
        temperature: bool, optional (default=True)
            True to keep temperatures.
        distance: bool, optional (default=True)
            True to keep distances.
        volume: bool, optional (default=True)
            True to keep volumes.
        duration: bool, optional (default=True)
            True to keep durations.
        weight: bool, optional (default=True)
            True to keep weights.
        hypen: bool, optional (default=True)
            True to keep hyphens.
        """


[3]: tokenizer = malaya.preprocessing.Tokenizer()

[4]: tokenizer.tokenize(string1)
[4]: ['xjdi', 'ke', ',', 'y', 'u', 'xsuke', 'makan', 'HUSEIN', 'kt', 'situ', 'tmpt', ',', 'i', 'hate', 'it', '.', 'pelikle', ',', 'pada']

[5]: tokenizer.tokenize(string2)
[5]: ['i', 'mmg2', 'xske', 'mknn', 'HUSEIN', 'kampng', 'tmpat', ',', 'i', 'love', 'them', '.', 'pelikle', 'saye']

[6]: tokenizer.tokenize(string3)
[6]: ['perdana', 'menteri', 'ke11', 'sgt', 'suka', 'makn', 'ayam', ',', 'harganya', 'cuma', 'rm15.50']

[7]: tokenizer.tokenize(string4)
[7]: ['pada', '10', '/', '4', ',', 'kementerian', 'mengumumkan', ',', '1', '/', '100']

[8]: tokenizer.tokenize(string6)
[8]: ['Husein', 'Zolkepli', '(', '2011', '-', '2019', ')', 'adalah', 'ketua', 'kampng', 'di', 'kedah', 'sekolah', 'King', 'Edward', 'ke-IV']

[9]: tokenizer.tokenize(string7)
[9]: ['2jam', '30 minit', 'aku', 'tunggu', 'kau', ',', '60.1 kg', 'kau', 'ni', ',', 'suhu', 'harini', '31.2c', ',', 'aku', 'dahaga', 'minum', '600ml']


url

[10]: tokenizer.tokenize('website saya http://huseinhouse.com')
[10]: ['website', 'saya', 'http://huseinhouse.com']

tags

[12]: tokenizer.tokenize('panggil saya <husein>')
[12]: ['panggil', 'saya', '<husein>']

[13]: tokenizer.tokenize('panggil saya < husein >')
[13]: ['panggil', 'saya', '<', 'husein', '>']

emails

[14]: tokenizer.tokenize('email saya husein.zol05@gmail.com')
[14]: ['email', 'saya', 'husein.zol05@gmail.com']

[15]: tokenizer.tokenize('email saya husein.zol05@gmail.com')
[15]: ['email', 'saya', 'husein.zol05@gmail.com']

users

[16]: tokenizer.tokenize('twitter saya @husein123zolkepli')
[16]: ['twitter', 'saya', '@husein123zolkepli']

[17]: tokenizer.tokenize('twitter saya @ husein123zolkepli')
[17]: ['twitter', 'saya', '@', 'husein123zolkepli']

hashtags

[18]: tokenizer.tokenize('panggil saya #huseincomel')
[18]: ['panggil', 'saya', '#huseincomel']

[19]: tokenizer.tokenize('panggil saya # huseincomel')
[19]: ['panggil', 'saya', '#', 'huseincomel']


phones

[20]: tokenizer.tokenize('call sye di 013-1234567')
[20]: ['call', 'sye', 'di', '013-1234567']

[27]: tokenizer.tokenize('call sye di 013- 1234567')
[27]: ['call', 'sye', 'di', '013', '-', '1234567']

percents

[28]: tokenizer.tokenize('saya sokong 100%')
[28]: ['saya', 'sokong', '100%']

[29]: tokenizer.tokenize('saya sokong 100 %')
[29]: ['saya', 'sokong', '100', '%']

money

[30]: tokenizer.tokenize('saya tinggal rm100')
[30]: ['saya', 'tinggal', 'rm100']

[31]: tokenizer.tokenize('saya tinggal rm100k')
[31]: ['saya', 'tinggal', 'rm100k']

[32]: tokenizer.tokenize('saya tinggal rm100M')
[32]: ['saya', 'tinggal', 'rm100M']

[33]: tokenizer.tokenize('saya tinggal rm100.123M')
[33]: ['saya', 'tinggal', 'rm100.123M']

[34]: tokenizer.tokenize('saya tinggal 40 sen')
[34]: ['saya', 'tinggal', '40 sen']

[35]: tokenizer.tokenize('saya tinggal 21 ringgit 50 sen')
[35]: ['saya', 'tinggal', '21 ringgit', '50 sen']


date

[36]: tokenizer.tokenize('tarikh perjumpaan 10/11/2011')
[36]: ['tarikh', 'perjumpaan', '10/11/2011']

[37]: tokenizer.tokenize('tarikh perjumpaan 10-11-2011')
[37]: ['tarikh', 'perjumpaan', '10-11-2011']

[38]: tokenizer.tokenize('tarikh perjumpaan 12 mei 2011')
[38]: ['tarikh', 'perjumpaan', '12 mei 2011']

[39]: tokenizer.tokenize('tarikh perjumpaan mei 12 2011')
[39]: ['tarikh', 'perjumpaan', 'mei 12 2011']

time

[40]: tokenizer.tokenize('jumpa 3 am')
[40]: ['jumpa', '3 am']

[41]: tokenizer.tokenize('jumpa 22:00')
[41]: ['jumpa', '22:00']

censored

[42]: tokenizer.tokenize('f**k lah')
[42]: ['f**k', 'lah']

emphasis

[43]: tokenizer.tokenize('*damn* good weih')
[43]: ['*damn*', 'good', 'weih']

numbers

[44]: tokenizer.tokenize('no saya 123')
[44]: ['no', 'saya', '123']


temperature

[45]: tokenizer.tokenize('sejuk harini, 31.1c')
[45]: ['sejuk', 'harini', ',', '31.1c']

[46]: tokenizer.tokenize('sejuk harini, 31.1C')
[46]: ['sejuk', 'harini', ',', '31.1C']

distance

[47]: tokenizer.tokenize('nak sampai lagi 31km')
[47]: ['nak', 'sampai', 'lagi', '31km']

[48]: tokenizer.tokenize('nak sampai lagi 31 km')
[48]: ['nak', 'sampai', 'lagi', '31 km']

volume

[49]: tokenizer.tokenize('botol ni 400ml')
[49]: ['botol', 'ni', '400ml']

[50]: tokenizer.tokenize('botol ni 400 l')
[50]: ['botol', 'ni', '400 l']

duration

[51]: tokenizer.tokenize('aku dah tunggu kau 2jam kut')
[51]: ['aku', 'dah', 'tunggu', 'kau', '2jam', 'kut']

[52]: tokenizer.tokenize('aku dah tunggu kau 2 jam kut')
[52]: ['aku', 'dah', 'tunggu', 'kau', '2 jam', 'kut']

[53]: tokenizer.tokenize('lagi 10 minit 3 jam')
[53]: ['lagi', '10 minit', '3 jam']


weight

[54]: tokenizer.tokenize('berat kau 60 kg')
[54]: ['berat', 'kau', '60 kg']

[55]: tokenizer.tokenize('berat kau 60kg')
[55]: ['berat', 'kau', '60kg']

hypen

[56]: tokenizer.tokenize('sememang-memangnya kau sakai')
[56]: ['sememang-memangnya', 'kau', 'sakai']

[57]: tokenizer.tokenize('sememang- memangnya kau sakai')
[57]: ['sememang', '-', 'memangnya', 'kau', 'sakai']

9.15.2 Sentence tokenizer

We consider prefixes, suffixes, sentence starters, acronyms, websites, emails, digits and the text before them, time and month patterns when splitting a text into multiple sentences.

def split_into_sentences(text, minimum_length=5):
    """
    Sentence tokenizer.

    Parameters
    ----------
    text: str
    minimum_length: int, optional (default=5)
        minimum length to assume a string is a sentence, default 5 characters.

    Returns
    -------
    result: List[str]
    """
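For intuition, here is a much-simplified splitter in the same spirit, guarding only a couple of the patterns (like no.1) that the real function protects; this is a sketch, not Malaya's implementation.

import re

def naive_split(text, minimum_length=5):
    # protect periods that do not end a sentence, e.g. 'no. 1' or 'ke. 2'
    text = re.sub(r'\b(no|ke)\.\s*(\d)', r'\1<dot>\2', text, flags=re.I)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s.replace('<dot>', '.') for s in sentences]
    return [s for s in sentences if len(s) >= minimum_length]

print(naive_split('no. 1 polis bertemu dengan suspek di ladang getah. polis tembak.'))
# ['no.1 polis bertemu dengan suspek di ladang getah.', 'polis tembak.']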

[58]: s = """
      no.1 polis bertemu dengan suspek di ladang getah. polis tembak pui pui pui bertubi tubi
      """

[59]: malaya.text.function.split_into_sentences(s)
[59]: ['no.1 polis bertemu dengan suspek di ladang getah.',
      'polis tembak pui pui pui bertubi tubi.']

[60]: s = """
      email saya di husein.zol05@gmail.com, nanti jom berkopi
      """

[61]: malaya.text.function.split_into_sentences(s)
[61]: ['email saya di husein.zol05@gmail.com, nanti jom berkopi.']

[62]: s = """
      ke. 2 cerita nya begini. saya berjalan jalan ditepi muara jumpa anak dara.
      """

[63]: malaya.text.function.split_into_sentences(s)
[63]: ['ke.2 cerita nya begini.', 'saya berjalan jalan ditepi muara jumpa anak dara.']


9.16 Spelling Correction

This tutorial is available as an IPython notebook at Malaya/example/spell-correction.

[1]: %%time
     import malaya
CPU times: user 5.37 s, sys: 1.03 s, total: 6.4 s
Wall time: 7.18 s

[2]: # some text examples copied from Twitter

     string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'
     string2 = 'Husein ska mkn aym dkat kampng Jawa'
     string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'
     string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'
     string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'
     string6 = 'blh bntg dlm kls nlp sy, nnti intch'

9.16.1 Load probability speller

The probability speller extends the functionality of Peter Norvig's corrector, http://norvig.com/spell-correct.html, and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews. We also added custom vowel and consonant augmentation to adapt to our local shortforms / typos.

def probability(sentence_piece: bool = False, **kwargs):
    """
    Train a Probability Spell Corrector.

    Parameters
    ----------
    sentence_piece: bool, optional (default=False)
        if True, reduce possible augmentation states using sentence piece.

    Returns
    -------
    result: malaya.spell.Probability class
    """

[3]: prob_corrector = malaya.spell.probability()

To correct a word

def correct(self, word: str, **kwargs):
    """
    Most probable spelling correction for word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: str
    """

[4]: prob_corrector.correct('sy')
[4]: 'saya'

[5]: prob_corrector.correct('mhthir')
[5]: 'mahathir'

[6]: prob_corrector.correct('mknn')
[6]: 'makanan'

List possible generated pool of words

def edit_candidates(self, word):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: List[str]
    """

[7]: prob_corrector.edit_candidates('mhthir')
[7]: ['mahathir']

[8]: prob_corrector.edit_candidates('smbng')
[8]: ['sembang', 'smbg', 'sambung', 'simbang', 'sembung', 'sumbang', 'sambong', 'sambang', 'sumbing', 'sombong', 'sembong']

As you can see, edit_candidates suggested quite a lot of candidates, and some of them are not actual words, like sambang. To reduce that, we can use sentencepiece to check whether a candidate is a legitimate word in the Malaysian context or not.

[10]: prob_corrector_sp = malaya.spell.probability(sentence_piece=True)
      prob_corrector_sp.edit_candidates('smbng')
[10]: ['sumbing', 'sambung', 'smbg', 'sembung', 'sombong', 'sembong', 'sembang', 'sumbang', 'sambong']

So how does the model know which word to pick? The candidate with the highest count in the corpus wins.
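That selection rule can be sketched in a few lines, assuming WORDS is a frequency Counter built over a Malay corpus (the counts here are made up); this mirrors Peter Norvig's corrector, not Malaya's exact internals.

from collections import Counter

# toy corpus counts, purely illustrative
WORDS = Counter({'sembang': 900, 'sambung': 700, 'sombong': 650, 'sambang': 3})

def P(word, N=sum(WORDS.values())):
    # relative frequency of `word` in the corpus
    return WORDS[word] / N

def pick(candidates):
    # the candidate with the highest corpus count wins
    return max(candidates, key=P)

print(pick(['sembang', 'sambang', 'sombong']))  # 'sembang'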

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """

[9]: prob_corrector.correct_text(string1)
[9]: 'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'

[10]: prob_corrector.correct_text(string2)
[10]: 'Husein suka makan ayam dekat kampung Jawa'

[11]: prob_corrector.correct_text(string3)
[11]: 'Melayu malas ini narration dia sama sahaja macam men are trash. True to some, false to some.'

[12]: prob_corrector.correct_text(string4)
[12]: 'Tapi tak fikir ke bahaya perpetuate myths macam itu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah. Your kids will be victims of that too.'

[13]: prob_corrector.correct_text(string5)
[13]: 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as saya am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'

[14]: prob_corrector.correct_text(string6)
[14]: 'boleh bintang dalam kelas nlp saya, nanti intch'

9.16.2 Load JamSpell speller

JamSpell uses Norvig's approach combined with N-gram language models. Before you can use this spelling correction, you need to install jamspell. For Mac,

wget http://prdownloads.sourceforge.net/swig/swig-3.0.12.tar.gz
tar -zxf swig-3.0.12.tar.gz
./swig-3.0.12/configure && make && make install
pip3 install jamspell

For Debian / Ubuntu,

apt install swig3
pip3 install jamspell

def jamspell(model: str = 'wiki+news', **kwargs):
    """
    Load a jamspell Spell Corrector for Malay.

    Parameters
    ----------
    model: str, optional (default='wiki+news')
        Supported models. Allowed values:

        * ``'wiki+news'`` - Wikipedia + News, 337MB.
        * ``'wiki'`` - Wikipedia, 148MB.
        * ``'news'`` - News, 215MB.

    Returns
    -------
    result: malaya.spell.JamSpell class
    """

[4]: model = malaya.spell.jamspell(model='wiki')

To correct a word

def correct(self, word: str, string: str, index: int = -1):
    """
    Correct a word within a text, returning the corrected word.

    Parameters
    ----------
    word: str
    string: str
        Entire string, `word` must be a word inside `string`.
    index: int, optional (default=-1)
        index of word in the string, if -1, will try to use `string.index(word)`.

    Returns
    -------
    result: str
    """

[5]: model.correct('suke', 'saya suke makan iyom')
[5]: 'suka'

List possible generated pool of words

def edit_candidates(self, word: str, string: str, index: int = -1):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str
    string: str
        Entire string, `word` must be a word inside `string`.
    index: int, optional (default=-1)
        index of word in the string, if -1, will try to use `string.index(word)`.

    Returns
    -------
    result: List[str]
    """

[15]: model.edit_candidates('ayem', 'saya suke makan ayem')
[15]: ('ayem', 'ayam', 'ayer', 'aye', 'asem', 'yem', 'adem', 'alem', 'aem', 'ayim', 'oyem', 'ayew', 'azem', 'ajem', 'ayiem')

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """

[17]: model.correct_text('saya suke makan ayom')
[17]: 'saya suka makan ayam'

9.16.3 Load Spylls speller

Spylls is Hunspell ported to Python. Before you can use this spelling correction, you need to install spylls,

pip3 install Spylls

def spylls(model: str = 'libreoffice-pejam', **kwargs):
    """
    Load a spylls Spell Corrector for Malay.

    Parameters
    ----------
    model: str, optional (default='libreoffice-pejam')
        Model spelling correction supported. Allowed values:

        * ``'libreoffice-pejam'`` - from LibreOffice pEJAm, https://extensions.libreoffice.org/en/extensions/show/3868

    Returns
    -------
    result: malaya.spell.Spylls class
    """

[17]: model = malaya.spell.spylls()

To correct a word

def correct(self, word: str):
    """
    Correct a word within a text, returning the corrected word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: str
    """

[18]: model.correct('sy')
[18]: 'st'

[19]: model.correct('mhthir')
[19]: 'Mahathir'

[20]: model.correct('mknn')
[20]: 'knn'

List possible generated pool of words

def edit_candidates(self, word: str):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: List[str]
    """
    return list(self._dictionary.suggest(word))

[21]: model.edit_candidates('mhthir')
[21]: ['Mahathir']

[22]: model.edit_candidates('smbng')
[22]: ['sbng', 'smbang', 'jmbng', 'cmbng']

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """

[23]: model.correct_text(string1)
[23]: 'kerajaan putat baji pencen awal tks dpk warga enas supaya emisi'

[24]: model.correct_text(string2)
[24]: 'Husein sak mkn aum tkad kampang Jawa'

[25]: model.correct_text(string3)
[25]: 'Melayu malas in paration ida asma ja macam man ara tras. True tu som, falsafah tu som.'

[26]: model.correct_text(string4)
[26]: 'Tapi kat fikir ka bahaya terperbuat smythea catu. Nanti kalau ada giring diskriminatif desiliter our food identification becus of our reca tua pukal ramah. Your kias wila ba victoria of rhat oto.'

9.16.4 Load Encoder transformer speller

This spelling correction is a transformer-based, improved version of malaya.spell.probability. The problem with malaya.spell.probability is that it naively picks the highest-probability word based on public sentences (wiki, news and social media) without understanding the actual context, for example,

string = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'
prob_corrector = malaya.spell.probability()
prob_corrector.correct_text(string)
-> 'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'

It should have replaced skt with sikit, a common word people use on social media, to give a little bit of attention to pencen. So, to fix that, we can use a Transformer model! Right now the transformer speller supports ``BERT``, ``ALBERT`` and ``ELECTRA`` only.
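The gist of the encoder approach, as a hedged sketch: substitute each candidate into the sentence and keep the one the language model scores highest in context. mlm_score below is a hypothetical stand-in hook, not Malaya's API.

from typing import Callable, List

def pick_best(tokens: List[str], index: int, candidates: List[str],
              mlm_score: Callable[[List[str], int], float]) -> str:
    # try each candidate in place, keep the contextually most probable one
    best, best_score = candidates[0], float('-inf')
    for candidate in candidates:
        trial = tokens[:index] + [candidate] + tokens[index + 1:]
        score = mlm_score(trial, index)
        if score > best_score:
            best, best_score = candidate, score
    return best

# toy scorer that mimics a model preferring 'sikit' in this context
def toy_score(tokens: List[str], index: int) -> float:
    return 1.0 if tokens[index] == 'sikit' else 0.0

tokens = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'.split()
print(pick_best(tokens, 5, ['sakit', 'sikit', 'sekat'], toy_score))  # 'sikit'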


def transformer_encoder(model, sentence_piece: bool = False, **kwargs):
    """
    Load a Transformer Encoder Spell Corrector. Right now only supported BERT, ALBERT and ELECTRA.

    Parameters
    ----------
    sentence_piece: bool, optional (default=False)
        if True, reduce possible augmentation states using sentence piece.

    Returns
    -------
    result: malaya.spell.Transformer class
    """

[3]: model = malaya.transformer.load(model='electra')
     transformer_corrector = malaya.spell.transformer_encoder(model, sentence_piece=True)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:56: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:240: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.

WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:115: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:119: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:122: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:128: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:130: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt

To correct a sentence

def correct_text(self, text: str, batch_size: int = 20):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str
    batch_size: int, optional (default=20)
        batch size to insert into model.

    Returns
    -------
    result: str
    """

[4]: transformer_corrector.correct_text(string1)
[4]: 'kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi'

Perfect! But again, the transformer model is very expensive. You can compare the wall time with the probability-based corrector.

[6]: %%time
     transformer_corrector.correct_text(string1)
CPU times: user 21.8 s, sys: 1.19 s, total: 23 s
Wall time: 5.15 s
[6]: 'kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi'

[7]: %%time
     prob_corrector.correct_text(string1)
CPU times: user 108 ms, sys: 3.34 ms, total: 112 ms
Wall time: 112 ms
[7]: 'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'

9.16.5 Load symspeller speller

This spelling correction is an improved version of symspeller, adapted to our local shortforms / typos. Before you can use this spelling correction, you need to install symspellpy,

pip install symspellpy

def symspell(
    max_edit_distance_dictionary: int = 2,
    prefix_length: int = 7,
    term_index: int = 0,
    count_index: int = 1,
    top_k: int = 10,
    **kwargs
):
    """
    Train a symspell Spell Corrector.

    Returns
    -------
    result: malaya.spell.Symspell class
    """

[11]: symspell_corrector = malaya.spell.symspell()

To correct a word

def correct(self, word: str, **kwargs):
    """
    Most probable spelling correction for word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: str
    """

[12]: symspell_corrector.correct('bntng')
[12]: 'bintang'

[13]: symspell_corrector.correct('kerajaan')
[13]: 'kerajaan'

[14]: symspell_corrector.correct('mknn')
[14]: 'makanan'

List possible generated words

def edit_step(self, word):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: List[str]
    """

[15]: symspell_corrector.edit_step('mrh')
[15]: {'marah': 12684.0,
      'merah': 21448.5,
      'arah': 15066.5,
      'darah': 10003.0,
      'mara': 7504.5,
      'malah': 7450.0,
      'zarah': 3753.5,
      'murah': 3575.5,
      'barah': 2707.5,
      'march': 2540.5,
      'martha': 390.0,
      'marsha': 389.0,
      'maratha': 88.5,
      'marcha': 22.5,
      'karaha': 13.5,
      'maraba': 13.5,
      'varaha': 11.5,
      'marana': 4.5,
      'marama': 4.5}

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """


[16]: symspell_corrector.correct_text(string1)
[16]: 'kerajaan patut bagi pencen awal saat kepada warga emas supaya emosi'

[17]: symspell_corrector.correct_text(string2)
[17]: 'Husein suka makan ayam dapat kampung Jawa'

[18]: symspell_corrector.correct_text(string3)
[18]: 'Melayu malas ni narration dia sama sahaja macam men are trash. True to some, false to some.'

[19]: symspell_corrector.correct_text(string4)
[19]: 'Tapi tak fikir ke bahaya perpetuate maathai macam itu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah. Your kids will be victims of that too.'

[20]: symspell_corrector.correct_text(string5)
[20]: 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as saya am edging towards retirement in 4-5 aras time after a career of being an Engineer, Project Manager, General Manager'

[21]: symspell_corrector.correct_text(string6)
[21]: 'boleh bintang dalam kelas malaya saya, nanti mintalah'

9.16.6 List available Transformer models

We use custom spelling augmentation (a rough sketch of the idea follows the list),

1. replace_similar_consonants
   • mereka -> nereka
2. replace_similar_vowels
   • suka -> sika
3. socialmedia_form
   • suka -> ska
4. vowel_alternate
   • singapore -> sngpore
   • kampung -> kmpng
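Roughly, the last two families can be sketched as below; these are illustrative approximations of the idea, not Malaya's exact rules.

VOWELS = set('aeiou')

def vowel_alternate(word):
    # drop vowels to mimic shortforms: 'kampung' -> 'kmpng'
    return ''.join(c for c in word if c not in VOWELS)

def socialmedia_form(word):
    # drop vowels except the last character: 'suka' -> 'ska'
    return ''.join(c for c in word[:-1] if c not in VOWELS) + word[-1]

print(vowel_alternate('kampung'))  # kmpng
print(socialmedia_form('suka'))    # ska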

[4]: malaya.spell.available_transformer()
INFO:root:tested on 10k test set.
[4]:                Size (MB)  Quantized Size (MB)       WER  Suggested length
     small-t5           355.6                195.0  0.015625             256.0
     tiny-t5            208.0                103.0  0.023712             256.0
     super-tiny-t5       81.8                 27.1  0.038001             256.0


9.16.7 Load Transformer model

def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
    """
    Load a Transformer Spell Corrector.

    Parameters
    ----------
    model: str, optional (default='small-t5')
        Model architecture supported. Allowed values:

        * ``'small-t5'`` - T5 SMALL parameters.
        * ``'tiny-t5'`` - T5 TINY parameters.
        * ``'super-tiny-t5'`` - T5 SUPER TINY parameters.

    quantized: bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: malaya.model.t5.Spell class
    """

[3]: t5 = malaya.spell.transformer(model='tiny-t5')

Predict using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    spelling correction for strings.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[str]
    """

[6]: t5.greedy_decoder([string1])
[6]: ['kerajaan patut bagi pencen awal skt kpd warga emas supaya emosi']

[8]: t5.greedy_decoder([string2])
[8]: ['Husein suka makan ayam dekat kampung Jawa']

[7]: t5.greedy_decoder([string3])
[7]: ['Melayu malas ni narration dia sama je macam men are trash . True to some , false to some .']


9.17 Coreference Resolution

This tutorial is available as an IPython notebook at Malaya/example/coref.

This module was only trained on standard language structure, so it is not safe to use it on local (non-standard) language structure.

9.17.1 What is Coreference Resolution?

[1]: from IPython.core.display import Image, display

     display(Image('https://nlp.stanford.edu/projects/corefexample.png', width=500))

Kakak mempunyai kucing. Dia menyayanginya. Dia -> Kakak, nya -> kucing
Husein Zolkepli suka makan ayam. Dia pun suka makan daging. Dia -> Husein Zolkepli

[2]: %%time
     import malaya
CPU times: user 5.25 s, sys: 936 ms, total: 6.19 s
Wall time: 6.79 s

9.17.2 Load dependency models

[3]: model = malaya.dependency.transformer(model='albert')
     alxlnet = malaya.dependency.transformer(model='tiny-albert')

9.17.3 Resolve coreference clusters using dependency parsing

def parse_from_dependency(
    models,
    string: str,
    references: List[str] = ['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'],
    rejected_references: List[str] = ['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'],
    acceptable_subjects: List[str] = ['flat', 'subj', 'nsubj', 'csubj', 'obl', 'obj'],
    acceptable_nested_subjects: List[str] = ['compound', 'flat'],
    split_nya: bool = True,
    aggregate: Callable = np.mean,
    top_k: int = 20,
):
    """
    Apply Coreference Resolution using stacks of dependency models.

    Parameters
    ----------
    models: list
        list of dependency models, must has `vectorize` method.
    string: str
    references: List[str], optional (default=['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'])
        list of references.
    rejected_references: List[str], optional (default=['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'])
        list of rejected references during populating subjects.
    acceptable_subjects: List[str], optional
        List of dependency labels for subjects.
    acceptable_nested_subjects: List[str], optional
        List of dependency labels for nested subjects, eg, syarikat (obl) facebook (compound).
    split_nya: bool, optional (default=True)
        split `nya`, eg, `disifatkannya` -> `disifatkan`, `nya`.
    aggregate: Callable, optional (default=numpy.mean)
        Aggregate function to aggregate list of vectors from `model.vectorize`.
    top_k: int, optional (default=20)
        only accept near top_k to assume a coherence.

    Returns
    -------
    result: Dict[text, coref]
        {'text': ['Husein', 'Zolkepli', 'suka', 'makan', 'ayam', '.', 'Dia', 'pun', 'suka', 'makan', 'daging', '.'],
         'coref': {6: {'index': [0, 1], 'text': ['Husein', 'Zolkepli']}}}
    """

[15]: string = 'Husein Zolkepli suka makan ayam. Dia pun suka makan daging.'
      string1 = 'Kakak mempunyai kucing. Dia menyayanginya.'

      # https://www.malaysiakini.com/news/580044
      string2 = 'Pengerusi PKR Terengganu Azan Ismail menyelar pemimpin PAS yang disifatkannya sebagai membisu mengenai gesaan mengadakan sidang Dewan Undangan Negeri (DUN) di negeri yang dipimpin parti mereka.'

      # https://www.sinarharian.com.my/article/146270/EDISI/Tiada-isu-penjualan-vaksin-Covid-19-di-
      string3 = 'Kota Bharu - Polis Kelantan mengesahkan masih belum menerima sebarang laporan berkaitan isu penjualan vaksin tidak sah berlaku di negeri ini. Timbalan Ketua Polis Kelantan, Senior Asisten Komisioner Abdullah Mohammad Piah berkata, bagaimanapun pihaknya sedia menjalankan siasatan lanjut jika menerima laporan berkaitan perkara itu.'

[8]: %%time
     malaya.coref.parse_from_dependency([model], string)
CPU times: user 434 ms, sys: 87.3 ms, total: 521 ms
Wall time: 126 ms
[8]: {'text': ['Husein', 'Zolkepli', 'suka', 'makan', 'ayam', '.', 'Dia', 'pun', 'suka', 'makan', 'daging', '.'],
     'coref': {6: {'index': [0, 1], 'text': ['Husein', 'Zolkepli']}}}

[9]: %%time
     malaya.coref.parse_from_dependency([model], string1)
CPU times: user 373 ms, sys: 76 ms, total: 449 ms
Wall time: 108 ms
[9]: {'text': ['Kakak', 'mempunyai', 'kucing', '.', 'Dia', 'menyayangi', 'nya', '.'],
     'coref': {4: {'index': [0], 'text': ['Kakak']},
     6: {'index': [2], 'text': ['kucing']}}}

[10]: %%time
      malaya.coref.parse_from_dependency([model], string2)
CPU times: user 673 ms, sys: 196 ms, total: 869 ms
Wall time: 197 ms
[10]: {'text': ['Pengerusi', 'PKR', 'Terengganu', 'Azan', 'Ismail', 'menyelar', 'pemimpin', 'PAS', 'yang', 'disifatkan', 'nya', 'sebagai', 'membisu', 'mengenai', 'gesaan', 'mengadakan', 'sidang', 'Dewan', 'Undangan', 'Negeri', '(', 'DUN', ')', 'di', 'negeri', 'yang', 'dipimpin', 'parti', 'mereka', '.'],
      'coref': {10: {'index': [6, 7], 'text': ['pemimpin', 'PAS']},
      28: {'index': [16, 17, 18, 19], 'text': ['sidang', 'Dewan', 'Undangan', 'Negeri']}}}

[16]: %%time
      malaya.coref.parse_from_dependency([model], string3)
CPU times: user 738 ms, sys: 183 ms, total: 922 ms
Wall time: 207 ms
[16]: {'text': ['Kota', 'Bharu', '-', 'Polis', 'Kelantan', 'mengesahkan', 'masih', 'belum', 'menerima', 'sebarang', 'laporan', 'berkaitan', 'isu', 'penjualan', 'vaksin', 'tidak', 'sah', 'berlaku', 'di', 'negeri', 'ini', '.', 'Timbalan', 'Ketua', 'Polis', 'Kelantan', ',', 'Senior', 'Asisten', 'Komisioner', 'Abdullah', 'Mohammad', 'Piah', 'berkata', ',', 'bagaimanapun', 'pihak', 'nya', 'sedia', 'menjalankan', 'siasatan', 'lanjut', 'jika', 'menerima', 'laporan', 'berkaitan', 'perkara', 'itu', '.'],
      'coref': {20: {'index': [12], 'text': ['isu']},
      37: {'index': [28, 29, 30, 31, 32], 'text': ['Asisten', 'Komisioner', 'Abdullah', 'Mohammad', 'Piah']},
      47: {'index': [40], 'text': ['siasatan']}}}


9.18 Normalizer

This tutorial is available as an IPython notebook at Malaya/example/normalizer.

[1]: %%time
     import malaya
CPU times: user 4.85 s, sys: 667 ms, total: 5.51 s
Wall time: 4.51 s

[2]: string1 = 'xjdi ke, y u xsuke makan HUSEIN kt situ tmpt, i hate it. pelikle, pada'
     string2 = 'i mmg2 xske mknn HUSEIN kampng tmpat, i love them. pelikle saye'
     string3 = 'perdana menteri ke11 sgt suka makn ayam, harganya cuma rm15.50'
     string4 = 'pada 10/4, kementerian mengumumkan, 1/100'
     string5 = 'Husein Zolkepli dapat tempat ke-12 lumba lari hari ni'
     string6 = 'Husein Zolkepli (2011 - 2019) adalah ketua kampng di kedah sekolah King Edward ke-IV'
     string7 = '2jam 30 minit aku tunggu kau, 60.1 kg kau ni, suhu harini 31.2c, aku dahaga minum 600ml'


9.18.1 Load normalizer

This normalizer can load any spelling correction model, eg, malaya.spell.probability or malaya.spell.transformer.

def normalizer(speller=None, **kwargs):
    """
    Load a Normalizer using any spelling correction model.

    Parameters
    ----------
    speller: spelling correction object, optional (default=None)

    Returns
    -------
    result: malaya.normalize.Normalizer class
    """

[3]: corrector = malaya.spell.probability()
     normalizer = malaya.normalize.normalizer(corrector)

normalize

def normalize(
    self,
    string: str,
    check_english: bool = True,
    normalize_text: bool = True,
    normalize_entity: bool = True,
    normalize_url: bool = False,
    normalize_email: bool = False,
    normalize_year: bool = True,
    normalize_telephone: bool = True,
):
    """
    Normalize a string.

    Parameters
    ----------
    string: str
    check_english: bool, (default=True)
        check a word in english dictionary.
    normalize_text: bool, (default=True)
        if True, will try to replace shortforms with internal corpus.
    normalize_entity: bool, (default=True)
        normalize entities, only effect `date`, `datetime`, `time` and `money` patterns string only.
    normalize_url: bool, (default=False)
        if True, replace `://` with empty and `.` with `dot`.
        `https://huseinhouse.com` -> `https huseinhouse dot com`.
    normalize_email: bool, (default=False)
        if True, replace `@` with `di`, `.` with `dot`.
        `husein.zol05@gmail.com` -> `husein dot zol kosong lima di gmail dot com`.
    normalize_year: bool, (default=True)
        if True, `tahun 1987` -> `tahun sembilan belas lapan puluh tujuh`.
        if True, `1970-an` -> `sembilan belas tujuh puluh an`.
        if False, `tahun 1987` -> `tahun seribu sembilan ratus lapan puluh tujuh`.
    normalize_telephone: bool, (default=True)
        if True, `no 012-1234567` -> `no kosong satu dua, satu dua tiga empat lima enam tujuh`.

    Returns
    -------
    string: normalized string
    """

[4]: string = 'boleh dtg 8pagi esok tak atau minggu depan? 2 oktober 2019 2pm, tlong bayar rm 3.2k sekali tau'

[5]: normalizer.normalize(string)
[5]: {'normalize': 'boleh datang lapan pagi esok tidak atau minggu depan ? 02/10/2019 14:00:00 , tolong bayar tiga ribu dua ratus ringgit sekali tahu',
     'date': {'8 AM esok': datetime.datetime(2021, 1, 1, 8, 0),
      '2 oktober 2019 2pm': datetime.datetime(2019, 10, 2, 14, 0),
      'minggu depan': datetime.datetime(2021, 1, 7, 19, 33, 47, 65094)},
     'money': {'rm 3.2k': 'RM3200.0'}}

[6]: normalizer.normalize(string, normalize_entity=False)
[6]: {'normalize': 'boleh datang lapan pagi esok tidak atau minggu depan ? 02/10/2019 14:00:00 , tolong bayar tiga ribu dua ratus ringgit sekali tahu',
     'date': {},
     'money': {}}

Here you can see, the Malaya normalizer normalized minggu depan to a datetime object, and 3.2k ringgit to RM3200.
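The rm 3.2k -> RM3200 step boils down to a suffix multiplier; here is a tiny sketch of just that piece, while Malaya's parser covers far more patterns and currencies.

MULTIPLIERS = {'k': 1e3, 'm': 1e6, 'b': 1e9}

def parse_rm(text):
    # strip the currency marker, then apply the shortform multiplier
    value = text.lower().replace('rm', '').strip()
    if value[-1] in MULTIPLIERS:
        return float(value[:-1]) * MULTIPLIERS[value[-1]]
    return float(value)

print(parse_rm('rm 3.2k'))  # 3200.0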

[7]: print(normalizer.normalize(string1))
     print(normalizer.normalize(string2))
     print(normalizer.normalize(string3))
     print(normalizer.normalize(string4))
     print(normalizer.normalize(string5))
     print(normalizer.normalize(string6))
     print(normalizer.normalize(string7))
{'normalize': 'tak jadi ke , kenapa awak tak suka makan HUSEIN kat situ tempat , saya hate it . pelik lah , pada', 'date': {}, 'money': {}}
{'normalize': 'saya memang-memang tak suka makan HUSEIN kampung tempat , saya love them . pelik lah saya', 'date': {}, 'money': {}}
{'normalize': 'perdana menteri kesebelas sangat suka makan ayam , harganya cuma lima belas ringgit lima puluh sen', 'date': {}, 'money': {'rm15.50': 'RM15.50'}}
{'normalize': 'pada sepuluh hari bulan empat , kementerian mengumumkan , satu per seratus', 'date': {}, 'money': {}}
{'normalize': 'Husein Zolkepli dapat tempat kedua belas lumba lari hari ini', 'date': {}, 'money': {}}
{'normalize': 'Husein Zolkepli ( dua ribu sebelas hingga dua ribu sembilan belas ) adalah ketua kampung di kedah sekolah King Edward keempat', 'date': {}, 'money': {}}
{'normalize': 'dua jam tiga puluh minit aku tunggu kamu , enam puluh perpuluhan satu kilogram kamu ini , suhu hari ini tiga puluh satu perpuluhan dua celsius , aku dahaga minum enam ratus milliliter', 'date': {}, 'money': {}}


9.18.2 Skip spelling correction

Simply pass None as the speller to malaya.normalize.normalizer; the default is None.

[8]: normalizer = malaya.normalize.normalizer(corrector)
     without_corrector_normalizer = malaya.normalize.normalizer(None)

[9]: normalizer.normalize(string2)
[9]: {'normalize': 'saya memang-memang tak suka makan HUSEIN kampung tempat , saya love them . pelik lah saya', 'date': {}, 'money': {}}

[10]: without_corrector_normalizer.normalize(string2)
[10]: {'normalize': 'saya memang-memang tak suka mknn HUSEIN kampng tmpat , saya love them . pelik lah saya', 'date': {}, 'money': {}}

9.18.3 Pass kwargs preprocessing

Let's say you want to skip normalizing the date pattern; you can pass kwargs to the normalizer. Check the original tokenizer implementation at https://github.com/huseinzol05/Malaya/blob/master/malaya/preprocessing.py#L103

[11]: normalizer = malaya.normalize.normalizer(corrector)
      skip_date_normalizer = malaya.normalize.normalizer(corrector, date=False)

[12]: normalizer.normalize('tarikh program tersebut 14 mei')
[12]: {'normalize': 'tarikh program tersebut 14/05/2020',
      'date': {'14 mei': datetime.datetime(2020, 5, 14, 0, 0)},
      'money': {}}

[13]: skip_date_normalizer.normalize('tarikh program tersebut 14 mei')
[13]: {'normalize': 'tarikh program tersebut empat belas mei',
      'date': {'14 mei': datetime.datetime(2020, 5, 14, 0, 0)},
      'money': {}}

9.18.4 Normalize url

Let's say you have a URL token, for example https://huseinhouse.com; this parameter is going to,

1. replace :// with an empty string.
2. replace . with dot.
3. replace digits with their string representation.

Simply call normalizer.normalize(string, normalize_url = True); the default is False. A minimal sketch of these steps follows below.
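A minimal sketch of those three steps; the digit lookup here is a hypothetical stand-in for Malaya's Num2Word module.

DIGITS = {'0': 'kosong', '1': 'satu', '2': 'dua', '3': 'tiga', '4': 'empat',
          '5': 'lima', '6': 'enam', '7': 'tujuh', '8': 'lapan', '9': 'sembilan'}

def normalize_url(url):
    url = url.replace('://', ' ').replace('.', ' dot ')
    # spell out each digit, then squeeze repeated spaces
    spelled = ''.join(' %s ' % DIGITS[c] if c in DIGITS else c for c in url)
    return ' '.join(spelled.split())

print(normalize_url('https://huseinhouse02934.com'))
# https huseinhouse kosong dua sembilan tiga empat dot com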

[14]: normalizer = malaya.normalize.normalizer()
      normalizer.normalize('web saya ialah https://huseinhouse.com')
[14]: {'normalize': 'web saya ialah https://huseinhouse.com', 'date': {}, 'money': {}}

[15]: normalizer.normalize('web saya ialah https://huseinhouse.com', normalize_url=True)
[15]: {'normalize': 'web saya ialah https huseinhouse dot com', 'date': {}, 'money': {}}

[16]: normalizer.normalize('web saya ialah https://huseinhouse02934.com', normalize_url=True)
[16]: {'normalize': 'web saya ialah https huseinhouse kosong dua sembilan tiga empat dot com', 'date': {}, 'money': {}}

9.18.5 Normalize email

Let's say you have an email token, for example husein.zol05@gmail.com; this parameter is going to,

1. replace :// with an empty string.
2. replace . with dot.
3. replace @ with di.
4. replace digits with their string representation.

Simply call normalizer.normalize(string, normalize_email = True); the default is False.

[17]: normalizer = malaya.normalize.normalizer()
      normalizer.normalize('email saya ialah husein.zol05@gmail.com')
[17]: {'normalize': 'email saya ialah husein.zol05@gmail.com', 'date': {}, 'money': {}}

[18]: normalizer = malaya.normalize.normalizer()
      normalizer.normalize('email saya ialah husein.zol05@gmail.com', normalize_email=True)
[18]: {'normalize': 'email saya ialah husein dot zol kosong lima di gmail dot com', 'date': {}, 'money': {}}

9.18.6 Normalize year

1. if True, tahun 1987 -> tahun sembilan belas lapan puluh tujuh.
2. if True, 1970-an -> sembilan belas tujuh puluh an.
3. if False, tahun 1987 -> tahun seribu sembilan ratus lapan puluh tujuh.

Simply call normalizer.normalize(string, normalize_year = True); the default is True.


[19]: normalizer = malaya.normalize.normalizer()

[20]: normalizer.normalize('$400 pada tahun 1998 berbanding lebih $1000')
[20]: {'normalize': 'empat ratus dollar pada tahun sembilan belas sembilan puluh lapan berbanding lebih seribu dollar',
      'date': {},
      'money': {'$400': '$400', '$1000': '$1000'}}

[21]: normalizer.normalize('$400 pada 1970-an berbanding lebih $1000')
[21]: {'normalize': 'empat ratus dollar pada sembilan belas tujuh puluhan berbanding lebih seribu dollar',
      'date': {},
      'money': {'$400': '$400', '$1000': '$1000'}}

[22]: normalizer.normalize('$400 pada tahun 1970-an berbanding lebih $1000')
[22]: {'normalize': 'empat ratus dollar pada tahun sembilan belas tujuh puluhan berbanding lebih seribu dollar',
      'date': {},
      'money': {'$400': '$400', '$1000': '$1000'}}

[23]: normalizer.normalize('$400 pada tahun 1998 berbanding lebih $1000', normalize_year=False)
[23]: {'normalize': 'empat ratus dollar pada tahun seribu sembilan ratus sembilan puluh lapan berbanding lebih seribu dollar',
      'date': {},
      'money': {'$400': '$400', '$1000': '$1000'}}

9.18.7 Normalize telephone

1. if True, no 012-1234567 -> no kosong satu dua, satu dua tiga empat lima enam tujuh.

Simply normalizer.normalize(string, normalize_telephone = True), default is True.

[24]: normalizer = malaya.normalize.normalizer()

[25]: normalizer.normalize('no saya 012-1234567')
[25]: {'normalize': 'no saya kosong satu dua, satu dua tiga empat lima enam tujuh',
      'date': {},
      'money': {}}

[26]: normalizer.normalize('no saya 012-1234567', normalize_telephone=False)
[26]: {'normalize': 'no saya 012-1234567', 'date': {}, 'money': {}}


9.18.8 Ignore normalize money

Let's say I have a text that contains RM 77 juta and I wish to keep it like that.

[27]: text = 'Suatu ketika rakyat Malaysia dikejutkan dengan kontrak pelantikan sebanyak hampir RM 77 juta setahun yang hanya terdedah apabila diasak oleh Datuk Seri Anwar Ibrahim.'

[28]: normalizer = malaya.normalize.normalizer()

[29]: normalizer.normalize(text)
[29]: {'normalize': 'Suatu ketika rakyat Malaysia dikejutkan dengan kontrak pelantikan sebanyak hampir tujuh puluh tujuh ringgit juta setahun yang hanya terdedah apabila diasak oleh Datuk Seri .',
      'date': {},
      'money': {'rm 77': 'RM77'}}

[30]: normalizer = malaya.normalize.normalizer(money=False)
      normalizer.normalize(text, normalize_text=False, check_english=False)
[30]: {'normalize': 'Suatu ketika rakyat Malaysia dikejutkan dengan kontrak pelantikan sebanyak hampir RM tujuh puluh tujuh juta setahun yang hanya terdedah apabila diasak oleh Datuk Seri Anwar Ibrahim .',
      'date': {},
      'money': {}}

[31]: normalizer.normalize(text, normalize_text=False, check_english=False)
[31]: {'normalize': 'Suatu ketika rakyat Malaysia dikejutkan dengan kontrak pelantikan sebanyak hampir RM tujuh puluh tujuh juta setahun yang hanya terdedah apabila diasak oleh Datuk Seri Anwar Ibrahim .',
      'date': {},
      'money': {}}

9.18.9 Normalizing rules

All these rules are ignored if the first letter is capitalized, except for normalizing titles.

1. Normalize title,

{
    'dr': 'Doktor',
    'yb': 'Yang Berhormat',
    'hj': 'Haji',
    'ybm': 'Yang Berhormat Mulia',
    'tyt': 'Tuan Yang Terutama',
    'yab': 'Yang Berhormat',
    'ybm': 'Yang Berhormat Mulia',
    'yabhg': 'Yang Amat Berbahagia',
    'ybhg': 'Yang Berbahagia',
    'miss': 'Cik',
}


[32]: normalizer = malaya.normalize.normalizer()

[33]: normalizer.normalize('Dr yahaya')
[33]: {'normalize': 'Doktor yahaya', 'date': {}, 'money': {}}

2. expand x

[34]: normalizer.normalize('xtahu')
[34]: {'normalize': 'tak tahu', 'date': {}, 'money': {}}

3. normalize ke -

[35]: normalizer.normalize('ke-12')
[35]: {'normalize': 'kedua belas', 'date': {}, 'money': {}}

[36]: normalizer.normalize('ke - 12')
[36]: {'normalize': 'kedua belas', 'date': {}, 'money': {}}

4. normalize ke - roman

[37]: normalizer.normalize('ke-XXI')
[37]: {'normalize': 'kedua puluh satu', 'date': {}, 'money': {}}

[38]: normalizer.normalize('ke - XXI')
[38]: {'normalize': 'kedua puluh satu', 'date': {}, 'money': {}}

5. normalize NUM - NUM

[39]: normalizer.normalize('2011 - 2019')
[39]: {'normalize': 'dua ribu sebelas hingga dua ribu sembilan belas',
      'date': {},
      'money': {}}

[40]: normalizer.normalize('2011.01-2019')
[40]: {'normalize': 'dua ribu sebelas perpuluhan kosong satu hingga dua ribu sembilan belas',
      'date': {},
      'money': {}}


6. normalize pada NUM (/ | -) NUM

[41]: normalizer.normalize('pada 10/4')
[41]: {'normalize': 'pada sepuluh hari bulan empat', 'date': {}, 'money': {}}

[42]: normalizer.normalize('PADA 10 -4')
[42]: {'normalize': 'pada sepuluh hari bulan empat', 'date': {}, 'money': {}}

7. normalize NUM / NUM

[43]: normalizer.normalize('10 /4')
[43]: {'normalize': 'sepuluh per empat', 'date': {}, 'money': {}}

8. normalize rm NUM

[44]: normalizer.normalize('RM10.5')
[44]: {'normalize': 'sepuluh ringgit lima puluh sen', 'date': {}, 'money': {'rm10.5': 'RM10.5'}}

9. normalize rm NUM sen

[45]: normalizer.normalize('rm 10.5 sen')
[45]: {'normalize': 'sepuluh ringgit lima puluh sen', 'date': {}, 'money': {'rm 10.5': 'RM10.5'}}

10. normalize NUM sen

[46]: normalizer.normalize('1015 sen')
[46]: {'normalize': 'sepuluh ringgit lima belas sen', 'date': {}, 'money': {'1015 sen': 'RM10.15'}}

11. normalize money

[47]: normalizer.normalize('rm10.4m')
[47]: {'normalize': 'sepuluh juta empat ratus ribu ringgit', 'date': {}, 'money': {'rm10.4m': 'RM10400000.0'}}

[48]: normalizer.normalize('$10.4K')
[48]: {'normalize': 'sepuluh ribu empat ratus dollar', 'date': {}, 'money': {'$10.4k': '$10400.0'}}

12. normalize cardinal

[49]: normalizer.normalize('123')
[49]: {'normalize': 'seratus dua puluh tiga', 'date': {}, 'money': {}}

13. normalize ordinal

[50]: normalizer.normalize('ke123')
[50]: {'normalize': 'keseratus dua puluh tiga', 'date': {}, 'money': {}}

14. normalize date / time / datetime string to datetime.datetime

[51]: normalizer.normalize('2 hari lepas')
[51]: {'normalize': 'dua hari lepas',
      'date': {'2 hari lalu': datetime.datetime(2020, 12, 29, 19, 33, 47, 507552)},
      'money': {}}

[52]: normalizer.normalize('esok')
[52]: {'normalize': 'esok',
      'date': {'esok': datetime.datetime(2021, 1, 1, 19, 33, 47, 514754)},
      'money': {}}

[53]: normalizer.normalize('okt 2019')
[53]: {'normalize': '31/10/2019',
      'date': {'okt 2019': datetime.datetime(2019, 10, 31, 0, 0)},
      'money': {}}

[54]: normalizer.normalize('2pgi')
[54]: {'normalize': 'dua pagi',
      'date': {'2 AM': datetime.datetime(2020, 12, 31, 2, 0)},
      'money': {}}

[55]: normalizer.normalize('pukul 8 malam')
[55]: {'normalize': 'pukul lapan malam',
      'date': {'pukul 8': datetime.datetime(2020, 12, 8, 0, 0)},
      'money': {}}

[56]: normalizer.normalize('jan 2 2019 12:01pm')
[56]: {'normalize': '02/01/2019 12:01:00',
      'date': {'jan 2 2019 12:01pm': datetime.datetime(2019, 1, 2, 12, 1)},
      'money': {}}


[57]: normalizer.normalize('2 ptg jan 2 2019') [57]: {'normalize': 'dua ptg 02/01/2019', 'date': {'2 PM jan 2 2019': datetime.datetime(2019, 1, 2, 14, 0)}, 'money': {}}

15. normalize money string to string number representation

[58]: normalizer.normalize('50 sen') [58]: {'normalize': 'lima puluh sen', 'date': {}, 'money': {'50 sen': 'RM0.5'}}

[59]: normalizer.normalize('20.5 ringgit') [59]: {'normalize': 'dua puluh ringgit lima puluh sen', 'date': {}, 'money': {'20.5 ringgit': 'RM20.5'}}

[60]: normalizer.normalize('20m ringgit') [60]: {'normalize': 'dua puluh juta ringgit', 'date': {}, 'money': {'20m ringgit': 'RM20000000.0'}}

[61]: normalizer.normalize('22.5123334k ringgit') [61]: {'normalize': 'dua puluh dua ribu lima ratus dua belas ringgit tiga ratus tiga puluh empat sen', 'date': {}, 'money': {'22.512334k ringgit': 'RM22512.334'}}

16. normalize date string to %d/%m/%y

[62]: normalizer.normalize('1 nov 2019') [62]: {'normalize': '01/11/2019', 'date': {'1 nov 2019': datetime.datetime(2019, 11, 1, 0, 0)}, 'money': {}}

[63]: normalizer.normalize('januari 1 1996') [63]: {'normalize': '01/01/1996', 'date': {'januari 1 1996': datetime.datetime(1996, 1, 1, 0, 0)}, 'money': {}}

[64]: normalizer.normalize('januari 2019') [64]: {'normalize': '31/01/2019', 'date': {'januari 2019': datetime.datetime(2019, 1, 31, 0, 0)}, 'money': {}}


17. normalize time string to %H:%M:%S

[65]: normalizer.normalize('2pm') [65]: {'normalize': '14:00:00', 'date': {'2pm': datetime.datetime(2020, 12, 31, 14, 0)}, 'money': {}}

[66]: normalizer.normalize('2:01pm') [66]: {'normalize': '14:01:00', 'date': {'2:01pm': datetime.datetime(2020, 12, 31, 14, 1)}, 'money': {}}

[67]: normalizer.normalize('2AM') [67]: {'normalize': '02:00:00', 'date': {'2am': datetime.datetime(2020, 12, 31, 2, 0)}, 'money': {}}

18. expand repetition shortform

[68]: normalizer.normalize('skit2') [68]: {'normalize': 'sakit-sakit', 'date': {}, 'money': {}}

[69]: normalizer.normalize('xskit2') [69]: {'normalize': 'tak sakit-sakit', 'date': {}, 'money': {}}

[70]: normalizer.normalize('xjdi2') [70]: {'normalize': 'tak jadi-jadi', 'date': {}, 'money': {}}

[71]: normalizer.normalize('xjdi4') [71]: {'normalize': 'tak jadi-jadi-jadi-jadi', 'date': {}, 'money': {}}

[72]: normalizer.normalize('xjdi0') [72]: {'normalize': 'tak jadi', 'date': {}, 'money': {}}

[73]: normalizer.normalize('xjdi') [73]: {'normalize': 'tak jadi', 'date': {}, 'money': {}}

19. normalize NUM SI-UNIT

[74]: normalizer.normalize('61.2 kg') [74]: {'normalize': 'enam puluh satu perpuluhan dua kilogram', 'date': {}, 'money': {}}


[75]: normalizer.normalize('61.2kg') [75]: {'normalize': 'enam puluh satu perpuluhan dua kilogram', 'date': {}, 'money': {}}

[76]: normalizer.normalize('61kg') [76]: {'normalize': 'enam puluh satu kilogram', 'date': {}, 'money': {}}

[77]: normalizer.normalize('61ml') [77]: {'normalize': 'enam puluh satu milliliter', 'date': {}, 'money': {}}

[78]: normalizer.normalize('61m') [78]: {'normalize': 'enam puluh satu meter', 'date': {}, 'money': {}}

[79]: normalizer.normalize('61.3434km') [79]: {'normalize': 'enam puluh satu perpuluhan tiga empat tiga empat kilometer', 'date': {}, 'money': {}}

[80]: normalizer.normalize('61.3434c') [80]: {'normalize': 'enam puluh satu perpuluhan tiga empat tiga empat celsius', 'date': {}, 'money': {}}

[81]: normalizer.normalize('61.3434 c') [81]: {'normalize': 'enam puluh satu perpuluhan tiga empat tiga empat celsius', 'date': {}, 'money': {}}

20. normalize laughing pattern

[82]: normalizer.normalize('dia sakai wkwkwkawkw') [82]: {'normalize': 'dia sakai haha', 'date': {}, 'money': {}}

[83]: normalizer.normalize('dia sakai hhihihu') [83]: {'normalize': 'dia sakai haha', 'date': {}, 'money': {}}

21. normalize mengeluh pattern

[84]: normalizer.normalize('Haih apa lah si yusuff ni . Mama cari rupanya celah ni') [84]: {'normalize': 'Aduh apa lah si yusuff ini . Mama cari rupanya celah ini', 'date': {}, 'money': {}}


[85]: normalizer.normalize('hais sorrylah syazzz') [85]: {'normalize': 'aduh maaf lah syazz', 'date': {}, 'money': {}}

9.19 Stemmer and Lemmatization

This tutorial is available as an IPython notebook at Malaya/example/stemmer.

This module is only trained on standard language structure, so it is not safe to use on local (social media) language structure.

[1]: %%time import malaya CPU times: user 4.81 s, sys: 652 ms, total: 5.47 s Wall time: 4.44 s

[2]: string = 'Benda yg SALAH ni, jgn lah didebatkan. Yg SALAH xkan jadi betul. Ingat tu. Mcm mana kesat sekalipun org sampaikan mesej, dan memang benda tu salah, diam je. Xyah nk tunjuk kau open sangat nk tegur cara org lain berdakwah'
another_string = 'melayu bodoh, dah la gay, sokong lgbt lagi, memang tak guna, http://twitter.com'

9.19.1 Use deep learning model

Load the LSTM + Bahdanau Attention stemming model; this also includes lemmatization. If you are using Tensorflow 2, make sure Tensorflow Addons is already installed,

pip install tensorflow-addons==0.12.0

def deep_model(quantized: bool = False, **kwargs):
    """
    Load LSTM + Bahdanau Attention stemming model; this also includes lemmatization.
    Original size 41.6MB, quantized size 10.6MB.

    Parameters
    ----------
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessarily faster, totally depends on the machine.

    Returns
    -------
    result: malaya.stem.DEEP_STEMMER class
    """

[9]: model= malaya.stem.deep_model()


9.19.2 Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[8]: quantized_model= malaya.stem.deep_model(quantized= True) WARNING:root:Load quantized model will cause accuracy drop.

Stem and lemmatization

def stem(self, string: str, beam_search: bool = True):
    """
    Stem a string; this also includes lemmatization.

    Parameters
    ----------
    string : str
    beam_search : bool, (optional=True)
        If True, use beam search decoder, else use greedy decoder.

    Returns
    -------
    result: str
    """

If you want to speed up inference, set beam_search = False.

[4]: %%time

model.stem(string) CPU times: user 1.22 s, sys: 305 ms, total: 1.52 s Wall time: 540 ms
[4]: 'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

[5]: %%time

model.stem(string, beam_search=False) CPU times: user 285 ms, sys: 102 ms, total: 387 ms Wall time: 289 ms
[5]: 'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

[6]: %%time

quantized_model.stem(string) CPU times: user 1.29 s, sys: 230 ms, total: 1.52 s Wall time: 573 ms
[6]: 'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

[7]: %%time

quantized_model.stem(string, beam_search=False) CPU times: user 331 ms, sys: 105 ms, total: 436 ms Wall time: 329 ms
[7]: 'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

[8]: model.stem(another_string) [8]: 'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com'

[9]: quantized_model.stem(another_string) [9]: 'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com'

[11]: model.stem('saya menyerukanlah') [11]: 'saya seru'

[10]: quantized_model.stem('saya menyerukanlah') [10]: 'saya seru'

9.19.3 Use Sastrawi stemmer

Malaya also includes an interface for the Sastrawi stemmer. We use it for internal purposes. To use it, simply,

def sastrawi():
    """
    Load stemming model using Sastrawi; this also includes lemmatization.

    Returns
    -------
    result: malaya.stem.SASTRAWI class
    """

[3]: sastrawi= malaya.stem.sastrawi()

[4]: sastrawi.stem('saya menyerukanlah') [4]: 'saya seru'

[5]: sastrawi.stem('menarik') [5]: 'tarik'

[6]: sastrawi.stem(another_string)


[6]: 'melayu bodoh dah la gay sokong lgbt lagi memang tak guna http twitter com'

But it is not able to maintain tokens like URLs, hashtags, money, datetimes and user mentions.

9.19.4 Use Naive stemmer

This simply uses regex patterns to do stemming. This method is not able to lemmatize.

def naive():
    """
    Load stemming model using startswith and endswith naively using regex patterns.

    Returns
    -------
    result : malaya.stem.NAIVE class
    """

[7]: naive= malaya.stem.naive()

[8]: naive.stem('saya menyerukanlah') [8]: 'saya yerukan'

[9]: naive.stem('menarik') [9]: 'arik'

[10]: naive.stem(another_string) [10]: 'layu bodoh , dah la gay , sokong lgbt lagi , ang tak guna , http://twitter.com'
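To compare the three stemmers side by side, here is a short sketch reusing the model, sastrawi and naive objects loaded above; the printed results should match the cells above but depend on your installed model version:

for word in ['menarik', 'saya menyerukanlah']:
    # the deep model and Sastrawi lemmatize properly, while the naive
    # regex stemmer tends to over-strip affixes ('menarik' -> 'arik')
    print(word, '|', model.stem(word), '|', sastrawi.stem(word), '|', naive.stem(word))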

9.20 True Case

This tutorial is available as an IPython notebook at Malaya/example/true-case.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time

import malaya CPU times: user 5.52 s, sys: 1.09 s, total: 6.61 s Wall time: 7.26 s


9.20.1 Explanation

Common third-party NLP services like Google Speech-to-Text or PDF-to-Text return case-insensitive output with missing or mistaken punctuation and casing, so True Case can help you:
1. jom makan di us makanan di sana sedap -> jom makan di US, makanan di sana sedap.
2. kuala lumpur menteri di jabatan perdana menteri datuk seri dr mujahid yusof rawa hari ini mengakhiri lawatan kerja lapan hari ke jordan turki dan bosnia herzegovina lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga tiga negara berkenaan -> KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga-tiga negara berkenaan.
True case only:
1. Solves mistaken / missing punctuation.
2. Solves mistaken / insensitive casing.
3. Does not correct any grammar.

9.20.2 List available Transformer model

[2]: malaya.true_case.available_transformer()
[2]:        Size (MB)  Quantized Size (MB)  Sequence Accuracy
     small        42.7                13.1              0.347
     base        234.0                63.8              0.696

9.20.3 Load Transformer model

def transformer(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load transformer encoder-decoder model to True Case.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessarily faster, totally depends on the machine.

    Returns
    -------
    result: malaya.model.tf.TrueCase class
    """

[7]: model= malaya.true_case.transformer()


9.20.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[6]: quantized_model= malaya.true_case.transformer(quantized= True) WARNING:root:Load quantized model will cause accuracy drop.

[8]: string1 = 'jom makan di us makanan di sana sedap'
string2 = 'kuala lumpur menteri di jabatan perdana menteri datuk seri dr mujahid yusof rawa hari ini mengakhiri lawatan kerja lapan hari ke jordan turki dan bosnia herzegovina lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga tiga negara berkenaan'

Predict using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    True case strings using greedy decoder.
    Example, "saya nak makan di us makanan di sana sedap" -> "Saya nak makan di US, makanan di sana sedap."

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
    return self._greedy_decoder(strings)

[9]: from pprint import pprint

[12]: pprint(model.greedy_decoder([string1, string2])) ['Jom makan di US makanan di sana sedap.', 'KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga-tiga negara berkenaan.']

[13]: pprint(quantized_model.greedy_decoder([string1, string2])) ['Jom makan di US makanan di sana sedap.', 'KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga-tiga negara berkenaan.']

[14]: import random



def random_uppercase(string):
    string = [c.upper() if random.randint(0, 1) else c for c in string]
    return ''.join(string)

[15]: r = random_uppercase(string2)
r
[15]: 'KuAlA LUMPUR menTeri di jAbATaN PErDANA MentERI Datuk SERi Dr mujahid YUsof RaWA HARI Ini mEngakhirI LAwaTAN kERJA lApan hARi Ke JOrdaN tURkI dAN bOsnIa heRzEGoVInA lAwaTAN yAng BeRTUjuAN MeNGERAtKaN lAgI HUBUnGAN dua HaLA dENGAn kETiGa TIGA nEgara BerkenAaN'

[16]: pprint(model.greedy_decoder([r])) ['KUALA LUMPUR: Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga tiga negara berkenaan.']

[17]: pprint(quantized_model.greedy_decoder([r])) ['KUALA LUMPUR: Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga tiga negara berkenaan.']

Predict using beam decoder

def beam_decoder(self, strings: List[str]):
    """
    True case strings using beam decoder, beam width size 3, alpha 0.5.
    Example, "saya nak makan di us makanan di sana sedap" -> "Saya nak makan di US, makanan di sana sedap."

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

[18]: pprint(model.beam_decoder([string1, string2])) ['Jom makan di US makanan di sana sedap.', 'KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga-tiga negara berkenaan.']

[19]: pprint(quantized_model.beam_decoder([string1, string2])) ['Jom makan di US makanan di sana sedap.', 'KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga-tiga negara berkenaan.']


9.21 Segmentation

This tutorial is available as an IPython notebook at Malaya/example/segmentation.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[4]: %%time

import malaya CPU times: user 1.95 s, sys: 241 ms, total: 2.19 s Wall time: 2.63 s

A common problem with social media texts is missing spaces, so text segmentation can help you:
1. huseinsukamakan ayam,dia sgtrisaukan -> husein suka makan ayam, dia sgt risaukan.
2. drmahathir sangat menekankan budaya budakzamansekarang -> dr mahathir sangat menekankan budaya budak zaman sekarang.
3. ceritatunnajibrazak -> cerita tun najib razak.
4. TunM sukakan -> Tun M sukakan.
Segmentation only:
1. Solves spacing errors.
2. Does not correct any grammar.

[5]: string1 = 'huseinsukamakan ayam,dia sgtrisaukan'
string2 = 'drmahathir sangat menekankan budaya budakzamansekarang'
string3 = 'ceritatunnajibrazak'
string4 = 'TunM sukakan'
string_hard = 'IPOH-AhliDewanUndangan Negeri(ADUN) HuluKinta, MuhamadArafat Varisai Mahamadmenafikanmesejtularmendakwa beliau akan melompatparti menyokong UMNO membentuk kerajaannegeridiPerak.BeliauyangjugaKetua Penerangan Parti Keadilan Rakyat(PKR) dalam satumesejringkaskepadaSinar Harian menjelaskan perkara itutidakbenarsama sekali.'
string_socialmedia = 'aqxsukalah apeyg tejadidekat mamattu'


9.21.1 Viterbi algorithm

People commonly use the Viterbi algorithm to solve this problem; we also added a Viterbi segmenter using ngrams from bahasa papers and Wikipedia.
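To illustrate the idea (this is not Malaya's internal implementation), here is a minimal dynamic-programming sketch of Viterbi word segmentation with a toy unigram table; the wordlist and probabilities are made up:

import math

# toy unigram log-probabilities; a real segmenter estimates ngram
# probabilities from large corpora (bahasa papers and wikipedia, as above)
log_prob = {'saya': math.log(0.05), 'suka': math.log(0.03), 'makan': math.log(0.02)}
UNKNOWN = math.log(1e-10)  # heavy penalty for out-of-vocabulary chunks

def viterbi_segment(text, max_split_length=20):
    # best[i] holds (score, words) for the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_split_length), i):
            word = text[j:i]
            score = best[j][0] + log_prob.get(word, UNKNOWN)
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [word])
    return ' '.join(best[len(text)][1])

print(viterbi_segment('sayasukamakan'))  # expected: 'saya suka makan'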

def viterbi(max_split_length: int = 20, **kwargs):
    """
    Load Segmenter class using viterbi algorithm.

    Parameters
    ----------
    max_split_length: int, (default=20)
        max length of words in a sentence to segment
    validate: bool, optional (default=True)
        if True, malaya will check model availability and download if not available.

    Returns
    -------
    result : malaya.segmentation.SEGMENTER class
    """

[4]: viterbi= malaya.segmentation.viterbi()

Segmentize

def segment(self, strings: List[str]):
    """
    Segment strings.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

[5]: %%time

viterbi.segment([string1, string2, string3, string4]) CPU times: user 109 ms, sys: 1.04 ms, total: 110 ms Wall time: 110 ms [5]: ['husein suka makan ayam,dia sgt risau kan', 'dr mahathir sangat mene kan kan budaya budak zaman sekarang', 'cerita tu n najib razak', 'Tun M suka kan']

[6]: %%time

viterbi.segment([string_hard, string_socialmedia])


CPU times: user 8.45 ms, sys: 157 µs, total: 8.6 ms Wall time: 8.69 ms
[6]: ['IPOH - Ahli Dewan Undangan Negeri(ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamadmenafikanmesejtularmendakwa belia u akan me lompat part i me nyo ko ng UMNO mem bentuk kerajaannegeridi Perak. Beliauyangjuga Ketua Penerangan Parti Keadilan Rakyat(PKR) Perak dalam satumesejringkaskepada men jel ask an perkara it u tidak benar sama sekali.',
 'aq x suka lah ape yg te jadi dekat mama ttu']

9.21.2 List available Transformer model

[6]: malaya.segmentation.available_transformer()
[6]:                      Size (MB)  Quantized Size (MB)       WER  Suggested length
     small                     42.7                 13.1  0.208520             256.0
     base                     234.0                 63.8  0.177624             256.0
     super-tiny-t5             81.8                 27.1  0.032980             256.0
     super-super-tiny-t5       39.6                 12.0  0.037882             256.0

9.21.3 Load Transformer model

def transformer(model: str = 'small', quantized: bool = False, **kwargs):
    """
    Load transformer encoder-decoder model to Segmentize.

    Parameters
    ----------
    model : str, optional (default='small')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.
        * ``'super-tiny-t5'`` - T5 SUPER TINY parameters.
        * ``'super-super-tiny-t5'`` - T5 SUPER SUPER TINY parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessarily faster, totally depends on the machine.

    Returns
    -------
    result: malaya.model.tf.Segmentation class
    """

[21]: model = malaya.segmentation.transformer(model='small')
quantized_model = malaya.segmentation.transformer(model='small', quantized=True)


WARNING:root:Load quantized model will cause accuracy drop.

[22]: model_base = malaya.segmentation.transformer(model='base')
quantized_model_base = malaya.segmentation.transformer(model='base', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

[11]: super_super_tiny= malaya.segmentation.transformer(model='super-super-tiny-t5')

Predict using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    Segment strings using greedy decoder.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

[10]: %%time

model.greedy_decoder([string1, string2, string3, string4]) CPU times: user 1.12 s, sys: 432 ms, total: 1.55 s Wall time: 959 ms [10]: ['husein suka makan ayam, dia sgt risaukan', 'dr mahathir sangat menekankan budaya budak zaman sekarang', 'cerita tun najib razak', 'Tun M sukakan']

[11]: %%time

quantized_model.greedy_decoder([string1, string2, string3, string4]) CPU times: user 1.12 s, sys: 464 ms, total: 1.58 s Wall time: 888 ms [11]: ['husein suka makan ayam, dia sgt risaukan', 'dr mahathir sangat menekankan budaya budak zaman sekarang', 'cerita tun najib razak', 'Tun M sukakan']

[12]: %%time

model_base.greedy_decoder([string1, string2, string3, string4]) CPU times: user 5.58 s, sys: 2.88 s, total: 8.46 s Wall time: 4.08 s


[12]: ['husein suka makan ayam, dia sgt risaukan', 'dr mahathir sangat menekankan budaya budak zaman sekarang', 'cerita tun najib razak cerita', 'Tun M sukakan Tun M sukakan']

[13]: %%time

quantized_model_base.greedy_decoder([string1, string2, string3, string4]) CPU times: user 5.73 s, sys: 2.96 s, total: 8.69 s Wall time: 3.81 s [13]: ['husein suka makan ayam, dia sgt risaukan', 'dr mahathir sangat menekankan budaya budak zaman sekarang', 'cerita tun najib razak cerita tun', 'Tun M sukakan Tun M sukakan']

[13]: %%time

super_super_tiny.greedy_decoder([string1, string2, string3, string4]) CPU times: user 908 ms, sys: 433 ms, total: 1.34 s Wall time: 288 ms [13]: ['husein suka makan ayam, dia sgt risaukan', 'dr mahathir sangat menekankan budaya budak zaman sekarang', 'cerita tun najib razak', 'Tun M sukakan']

[14]: %%time

model.greedy_decoder([string_hard, string_socialmedia]) CPU times: user 2.52 s, sys: 499 ms, total: 3.02 s Wall time: 768 ms
[14]: ['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg tejadid dekat mamat tu']

[15]: %%time

quantized_model.greedy_decoder([string_hard, string_socialmedia]) CPU times: user 2.62 s, sys: 447 ms, total: 3.07 s Wall time: 756 ms
[15]: ['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg tejadid dekat mamat tu']

[16]: %%time

model_base.greedy_decoder([string_hard, string_socialmedia])


CPU times: user 17.8 s, sys: 10.2 s, total: 28 s Wall time: 5.84 s
[16]: ['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg teja di dekat mamat tu aq xsukalah ape yg teja di dekat mamat tu']

[17]: %%time

quantized_model_base.greedy_decoder([string_hard, string_socialmedia]) CPU times: user 17.6 s, sys: 9.63 s, total: 27.3 s Wall time: 5.85 s
[17]: ['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg teja di dekat mamat tu aq xsukalah ape yg teja di dekat mamat tu']

[14]: %%time

super_super_tiny.greedy_decoder([string_hard, string_socialmedia]) CPU times: user 1.34 s, sys: 527 ms, total: 1.87 s Wall time: 421 ms
[14]: ['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg tejadi dekat mamat tu']

There is a problem with batching strings: a short string might repeat itself in the output. To solve this, give a single string only, as in the examples below; a small wrapper sketch follows them.

[18]: %%time

quantized_model_base.greedy_decoder([string_socialmedia]) CPU times: user 1.37 s, sys: 532 ms, total: 1.9 s Wall time: 652 ms [18]: ['aq xsukalah ape yg teja di dekat mamat tu']

[19]: %%time

quantized_model_base.greedy_decoder([string3]) CPU times: user 648 ms, sys: 228 ms, total: 876 ms Wall time: 289 ms


[19]: ['cerita tun najib razak']

[20]: %%time

quantized_model_base.greedy_decoder([string4]) CPU times: user 495 ms, sys: 202 ms, total: 697 ms Wall time: 225 ms [20]: ['Tun M sukakan']
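If you still want batch-like convenience, a small wrapper (a sketch, not part of Malaya) can feed strings one at a time to avoid the repetition issue:

def segment_one_by_one(model, strings):
    # decode each string alone so short strings cannot repeat themselves
    results = []
    for s in strings:
        results.extend(model.greedy_decoder([s]))
    return results

segment_one_by_one(quantized_model_base, [string3, string4])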

Predict using beam decoder

def beam_decoder(self, strings: List[str]):
    """
    Segment strings using beam decoder, beam width size 3, alpha 0.5.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

[11]: %%time

quantized_model.beam_decoder([string_socialmedia]) CPU times: user 1.38 s, sys: 1.87 s, total: 3.25 s Wall time: 654 ms [11]: ['aq xsukalah ape yg tejadid dekat mamat tu']

[12]: %%time

quantized_model_base.beam_decoder([string_socialmedia]) CPU times: user 6.77 s, sys: 3.71 s, total: 10.5 s Wall time: 2.43 s [12]: ['aq xsukalah ape yg teja di dekat mamat tu']

We can expect the beam decoder to be much slower than the greedy decoder.


9.22 Preprocessing

This tutorial is available as an IPython notebook at Malaya/example/preprocessing.

[1]: %%time import malaya CPU times: user 4.73 s, sys: 664 ms, total: 5.39 s Wall time: 4.38 s

9.22.1 Available rules

We know that social media texts from Twitter, Facebook and Instagram are very noisy, and we want to clean them as much as possible so our machines understand sentence structure much better. In Malaya, we standardize our text preprocessing:
1. Malaya can replace special words with tokens to reduce the dimension curse, e.g. rm10k becomes <money>.
2. Malaya can put tags around special words, e.g. #drmahathir becomes <hashtag> drmahathir </hashtag>.
3. Malaya can expand English contractions.
4. Malaya can translate English words into bahasa Malaysia words. Again, this translation uses a dictionary; it will not understand semantics. The purpose of this translation is just to standardize to bahasa Malaysia.
5. Stemming and lemmatizing, requires a stemmer object.
6. Normalize elongated words, requires a speller object.
7. Expand hashtags, e.g. #drmahathir becomes dr mahathir, requires a segmentation object.

normalize

Supported normalize:
1. hashtag
2. cashtag
3. tag
4. user
5. emphasis
6. censored
7. acronym
8. eastern_emoticons
9. rest_emoticons
10. emoji
11. quotes
12. percent
13. repeat_puncts
14. money
15. email
16. phone
17. number
18. allcaps
19. url
20. date
21. time

You can check the full supported list at malaya.preprocessing.get_normalize(). For example, if you set money and number, and the input string is RM10k, the output is <money>.

annotate

Supported annotate:
1. hashtag
2. allcaps
3. elongated
4. repeated
5. emphasis
6. censored

For example, if you set hashtag, and the input string is #drmahathir, the output is <hashtag> drmahathir </hashtag>.
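As a sketch of how these options combine (the normalize and annotate parameters come from the interface below; the printed output is illustrative, not a recorded run):

import malaya

# keep only money and number normalization, and only hashtag annotation;
# every other rule listed above is disabled
preprocessing = malaya.preprocessing.preprocessing(
    normalize=['money', 'number'],
    annotate=['hashtag'],
)
print(' '.join(preprocessing.process('WASTED RM10k on #badmovies')))
# illustrative output: 'wasted <money> on <hashtag> badmovies </hashtag>'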

[2]: string_1 = 'CANT WAIT for the new season of #mahathirmohamad (^o^)!!! #davidlynch #tvseries :))), TAAAK SAAABAAR!!!'
string_2 = 'kecewanya #johndoe movie and it suuuuucks!!! WASTED RM10... rm10 #badmovies :/'
string_3 = "@husein: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY !!! :-D http://sentimentsymposium.com/."
string_4 = 'aahhh, malasnye nak pegi keje harini #mondayblues'
string_5 = '#drmahathir #najibrazak #1malaysia #mahathirnajib'

9.22.2 Preprocessing Interface

def preprocessing(
    normalize: List[str] = [
        'url', 'email', 'percent', 'money', 'phone',
        'user', 'time', 'date', 'number',
    ],
    annotate: List[str] = [
        'allcaps', 'elongated', 'repeated',
        'emphasis', 'censored', 'hashtag',
    ],
    lowercase: bool = True,
    fix_unidecode: bool = True,
    expand_english_contractions: bool = True,
    translate_english_to_bm: bool = True,
    speller=None,
    segmenter=None,
    stemmer=None,
    **kwargs,
):
    """
    Load Preprocessing class.

    Parameters
    ----------
    normalize: list
        normalizing tokens, can check all supported normalizing at `malaya.preprocessing.get_normalize()`.
    annotate: list
        annotate tokens, only accept ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].
    lowercase: bool
    fix_unidecode: bool
    expand_english_contractions: bool
        expand english contractions
    translate_english_to_bm: bool
        translate english words to bahasa malaysia words
    speller: object
        spelling correction object, need to have a method `correct`
    segmenter: object
        segmentation object, need to have a method `segment`. If provided, it will expand hashtags, #mondayblues == monday blues
    stemmer: object
        stemmer object, need to have a method `stem`. If provided, it will stem or lemmatize the string.

    Returns
    -------
    result : malaya.preprocessing.PREPROCESSING class
    """


9.22.3 Load default parameters

The default parameters are able to translate most English words to bahasa Malaysia.

[3]: %%time preprocessing= malaya.preprocessing.preprocessing() CPU times: user 115 ms, sys: 18.4 ms, total: 134 ms Wall time: 134 ms

[4]: %%time
' '.join(preprocessing.process(string_1))
CPU times: user 2.1 ms, sys: 19 µs, total: 2.12 ms Wall time: 2.12 ms
[4]: ' tak boleh tunggu untuk yang baru musim daripada mahathirmohamad \\(^o^)/ ! davidlynch tvseries , taak saabaar ! '

[5]: %%time
' '.join(preprocessing.process(string_2))
CPU times: user 426 µs, sys: 3 µs, total: 429 µs Wall time: 432 µs
[5]: 'kecewanya johndoe filem dan ia suucks ! dibazirkan . badmovies hashtag> '

[6]: %%time
' '.join(preprocessing.process(string_3))
CPU times: user 413 µs, sys: 0 ns, total: 413 µs Wall time: 416 µs
[6]: ' : boleh tidak tunggu untuk yang sentimen talks ! yaay ! :-d '

[7]: %%time
' '.join(preprocessing.process(string_4))
CPU times: user 391 µs, sys: 0 ns, total: 391 µs Wall time: 398 µs
[7]: 'aahh , malasnye nak pergi kerja hari ini mondayblues '

[8]: %%time
' '.join(preprocessing.process(string_5))
CPU times: user 459 µs, sys: 12 µs, total: 471 µs Wall time: 474 µs
[8]: ' drmahathir najibrazak 1 malaysia mahathirnajib '


9.22.4 Load default parameters with spelling correction to normalize elongated words.

We saw that taak, saabaar and other elongated words are not the original words, so we can use spelling correction to normalize them.

[9]: corrector= malaya.spell.probability()

[10]: %%time preprocessing= malaya.preprocessing.preprocessing(speller= corrector) CPU times: user 85.5 ms, sys: 16.3 ms, total: 102 ms Wall time: 101 ms

[11]: %%time
' '.join(preprocessing.process(string_1))
CPU times: user 630 µs, sys: 7 µs, total: 637 µs Wall time: 640 µs
[11]: ' tak boleh tunggu untuk yang baru musim daripada mahathirmohamad \\(^o^)/ ! davidlynch tvseries , tidak sabar ! '

[12]: %%time
' '.join(preprocessing.process(string_2))
CPU times: user 445 µs, sys: 3 µs, total: 448 µs Wall time: 451 µs
[12]: 'kecewanya johndoe filem dan ia sucks ! dibazirkan . badmovies hashtag> '

[13]: %%time
' '.join(preprocessing.process(string_3))
CPU times: user 640 µs, sys: 12 µs, total: 652 µs Wall time: 665 µs
[13]: ' : boleh tidak tunggu untuk yang sentimen talks ! yay ! :-d '

[14]: %%time
' '.join(preprocessing.process(string_4))
CPU times: user 495 µs, sys: 12 µs, total: 507 µs Wall time: 530 µs
[14]: 'ah , malasnye nak pergi kerja hari ini mondayblues '

[15]: %%time
' '.join(preprocessing.process(string_5))
CPU times: user 327 µs, sys: 6 µs, total: 333 µs Wall time: 346 µs
[15]: ' drmahathir najibrazak 1 malaysia mahathirnajib '


9.22.5 Load default parameters with segmenter to expand hashtags.

We saw drmahathir and najibrazak; we want to expand them to become dr mahathir and najib razak.

[16]: segmenter = malaya.segmentation.transformer(model='small', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

[17]: segmenter= malaya.segmentation.transformer(model='small', quantized= True) WARNING:root:Load quantized model will cause accuracy drop.

[18]: %%time preprocessing= malaya.preprocessing.preprocessing(segmenter= segmenter) CPU times: user 88.3 ms, sys: 18.9 ms, total: 107 ms Wall time: 107 ms

[19]: %%time
' '.join(preprocessing.process(string_1))
CPU times: user 1.61 s, sys: 1.83 s, total: 3.43 s Wall time: 1.06 s
[19]: ' tak boleh tunggu untuk yang baru musim daripada mahathir mohamad \\(^o^)/ ! davidlynch tv series , taak saabaar ! '


[20]: %%time
' '.join(preprocessing.process(string_2))
CPU times: user 726 ms, sys: 375 ms, total: 1.1 s Wall time: 293 ms
[20]: 'kecewanya johndoe filem dan ia suucks ! dibazirkan . bad movies hashtag> '

[21]: %%time
' '.join(preprocessing.process(string_3))
CPU times: user 332 ms, sys: 108 ms, total: 440 ms Wall time: 112 ms
[21]: ' : boleh tidak tunggu untuk yang sentimen talks ! yaay ! :-d '

[22]: %%time
' '.join(preprocessing.process(string_4))
CPU times: user 525 ms, sys: 592 ms, total: 1.12 s Wall time: 237 ms
[22]: 'aahh , malasnye nak pergi kerja hari ini mondayblues '

[23]: %%time
' '.join(preprocessing.process(string_5))
CPU times: user 1.5 s, sys: 575 ms, total: 2.07 s Wall time: 516 ms
[23]: ' dr mahathir najib razak 1 malaysia mahathir najib '

9.22.6 Load default parameters with stemming and lemmatization

[44]: sastrawi= malaya.stem.sastrawi()

[45]: %%time preprocessing= malaya.preprocessing.preprocessing(stemmer= sastrawi) CPU times: user 112 ms, sys: 18.4 ms, total: 130 ms Wall time: 129 ms

[26]: %%time
' '.join(preprocessing.process(string_1))
CPU times: user 11.6 ms, sys: 846 µs, total: 12.5 ms Wall time: 12.2 ms
[26]: ' tak boleh tunggu untuk yang baru musim daripada mahathirmohamad o davidlynch tvseries taak saabaar allcaps> '


[47]: %%time
' '.join(preprocessing.process(string_2))
CPU times: user 5.61 ms, sys: 503 µs, total: 6.11 ms Wall time: 5.71 ms
[47]: 'kecewa johndoe filem dan ia suucks dibazirkan badmovies hashtag> '

[28]: %%time
' '.join(preprocessing.process(string_3))
CPU times: user 2.13 ms, sys: 57 µs, total: 2.19 ms Wall time: 2.25 ms
[28]: ' boleh tidak tunggu untuk yang sentimen talks yaay -d '

[29]: %%time
' '.join(preprocessing.process(string_4))
CPU times: user 1.81 ms, sys: 20 µs, total: 1.83 ms Wall time: 1.91 ms
[29]: 'aahh malasnye nak pergi kerja hari ini mondayblues '

[30]: %%time
' '.join(preprocessing.process(string_5))
CPU times: user 1.91 ms, sys: 13 µs, total: 1.92 ms Wall time: 1.95 ms
[30]: ' drmahathir najibrazak 1 malaysia mahathirnajib '

[46]: %%time
' '.join(preprocessing.process('saya disini berjalan pergi ke putrajaya, #masjidbesi'))
CPU times: user 3.45 ms, sys: 30 µs, total: 3.48 ms Wall time: 3.49 ms
[46]: 'saya sini jalan pergi ke putrajaya masjidbesi '

9.22.7 Disable English translation

But there are basic normalizations that cannot be overridden; for example, for automatically becomes untuk. You can check the entire default normalization mapping at from malaya.texts._tatabahasa import rules_normalizer
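A quick way to peek at that mapping; the import path is taken from the sentence above, and the exact entries printed will depend on your Malaya version:

from malaya.texts._tatabahasa import rules_normalizer

# rules_normalizer maps shortform -> standard word and is always applied
for shortform, standard in list(rules_normalizer.items())[:5]:
    print(shortform, '->', standard)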

[31]: %%time preprocessing= malaya.preprocessing.preprocessing(translate_english_to_bm= False) CPU times: user 96 µs, sys: 1 µs, total: 97 µs Wall time: 101 µs

[32]: %%time
' '.join(preprocessing.process(string_1))


CPU times: user 867 µs, sys: 7 µs, total: 874 µs Wall time: 891 µs
[32]: ' tak boleh wait untuk the new season of mahathirmohamad \\(^o^)/ ! davidlynch tvseries , taak saabaar ! '

[33]: %%time
' '.join(preprocessing.process(string_2))
CPU times: user 509 µs, sys: 9 µs, total: 518 µs Wall time: 538 µs
[33]: 'kecewanya johndoe movie and it suucks ! wasted . badmovies hashtag> '

[34]: %%time
' '.join(preprocessing.process(string_3))
CPU times: user 477 µs, sys: 6 µs, total: 483 µs Wall time: 519 µs
[34]: ' : can not wait untuk the sentiment talks ! yaay ! :-d '

9.22.8 Tokenizer

It is able to tokenize using multiple regex pipelines; you can check the list from malaya.preprocessing.get_normalize()

[35]: tokenizer= malaya.preprocessing.TOKENIZER().tokenize

[36]: tokenizer(string_1) [36]: ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#mahathirmohamad', '(^o^)', '!', '!', '!', '#davidlynch', '#tvseries', ':)))', ',', 'TAAAK', 'SAAABAAR', '!', '!', '!']


[37]: tokenizer(string_2) [37]: ['kecewanya', '#johndoe', 'movie', 'and', 'it', 'suuuuucks', '!', '!', '!', 'WASTED', 'RM10', '.', '.', '.', 'rm10', '#badmovies', ':/']

[38]: tokenizer(string_3) [38]: ['@husein', ':', 'can', "'", 't', 'wait', 'for', 'the', 'Nov 9', '#Sentiment', 'talks', '!', 'YAAAAAAY', '!', '!', '!', ':-D', 'http://sentimentsymposium.com/.']

[39]: tokenizer('saya nak makan ayam harga rm10k') [39]: ['saya', 'nak', 'makan', 'ayam', 'harga', 'rm10k']

9.23 Kesalahan Tatabahasa

This tutorial is available as an IPython notebook at Malaya/example/tatabahasa.

This module is only trained on standard language structure, so it is not safe to use on local (social media) language structure.


[1]: %%time

import malaya from pprint import pprint CPU times: user 4.87 s, sys: 865 ms, total: 5.73 s Wall time: 6.34 s

9.23.1 Model

A common Seq2Seq model is P(yt | X, yt-1): one decoder step generates yt, and this requires the encoder output and the output from the last decoder step, yt-1. So we improved the model to also generate tags, P(yt, zt | X, yt-1, zt-1): one decoder step generates yt and zt, and this requires the encoder output and the outputs from the last decoder step, yt-1 and zt-1. We named this model TransformerTag. There is no paper produced for this model, so feel free to write one about it; check out our implementation at https://github.com/huseinzol05/malaya/tree/master/session/tatabahasa
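Conceptually, one decoder step simply adds a second projection head over the shared decoder state. A minimal numpy sketch; the sizes and weight names are made up for illustration and this is not Malaya's actual graph:

import numpy as np

vocab_size, num_tags, d_model = 32000, 18, 512  # hypothetical sizes

# hypothetical learned projection heads; in the real model these are trained
W_token = np.random.randn(d_model, vocab_size) * 0.01
W_tag = np.random.randn(d_model, num_tags) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(decoder_state):
    # decoder_state summarizes X, yt-1 and zt-1; both heads share it,
    # factorizing P(yt, zt | X, yt-1, zt-1) into two softmaxes
    p_token = softmax(decoder_state @ W_token)  # distribution over subwords yt
    p_tag = softmax(decoder_state @ W_tag)      # distribution over tags zt
    return int(p_token.argmax()), int(p_tag.argmax())

y_t, z_t = decode_step(np.random.randn(d_model))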

9.23.2 List available Transformer Tag models

[2]: malaya.tatabahasa.available_transformer()
INFO:root:tested on 10k kesalahan tatabahasa texts.
[2]:        Size (MB)  Quantized Size (MB)  Sequence Accuracy  Sequence Tagging Accuracy
     small       397.0                100.0           0.860198                   0.963267
     base        875.0                220.0           0.938972                   0.977407

9.23.3 Supported kesalahan tatabahasa

For full description, check out https://tatabahasabm.tripod.com/tata/salahtata.htm

[3]: malaya.tatabahasa.describe()
[3]: class  Description                                        salah                                              betul
     0      PAD
     1      kesambungan subwords
     2      tiada kesalahan
     3      kesalahan frasa nama, Perkara yang diterangkan...  Cili sos                                           sos cili
     4      kesalahan kata jamak                               mereka-mereka                                      mereka
     5      kesalahan kata penguat                             sangat tinggi sekali                               sangat tinggi
     6      kata adjektif dan imbuhan "ter" tanpa penguat.     Sani mendapat markah yang tertinggi sekali.        Sani mendapat markah yang tertinggi.
     7      kesalahan kata hubung                              Sally sedang membaca bila saya tiba di rumahnya.   Sally sedang membaca apabila saya tiba di ruma...
     8      kesalahan kata bilangan                            Beribu peniaga tidak membayar cukai pendapatan.    Beribu-ribu peniaga tidak membayar cukai penda...
     9      kesalahan kata sendi                               Umar telah berpindah daripada sekolah ini bula...  Umar telah berpindah dari sekolah ini bulan lalu.
     10     kesalahan penjodoh bilangan                        Setiap orang pelajar                               Setiap pelajar.
     11     kesalahan kata ganti diri                          Pencuri itu telah ditangkap. Beliau dibawa ke ...  Pencuri itu telah ditangkap. Dia dibawa ke bal...
     12     kesalahan ayat pasif                               Cerpen itu telah dikarang oleh saya.               Cerpen itu telah saya karang.
     13     kesalahan kata tanya                               Kamu berasal dari manakah ?                        Kamu berasal dari mana ?
     14     kesalahan tanda baca                               Kamu berasal dari manakah .                        Kamu berasal dari mana ?
     15     kesalahan kata kerja tak transitif                 Dia kata kepada saya                               Dia berkata kepada saya
     16     kesalahan kata kerja transitif                     Dia suka baca buku                                 Dia suka membaca buku
     17     penggunaan kata yang tidak tepat                   Tembuk Besar negeri Cina dibina oleh Shih Huan...  Tembok Besar negeri Cina dibina oleh Shih Huan...

Right now we are only able to predict up to 15 different kesalahan tatabahasa; hopefully in the future we can scale this up.


9.23.4 Load Transformer Tag model

[4]: model= malaya.tatabahasa.transformer(model='base')

9.23.5 Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[5]: quantized_model= malaya.tatabahasa.transformer(model='base', quantized= True) WARNING:root:Load quantized model will cause accuracy drop.

9.23.6 Predict using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    Fix kesalahan tatabahasa.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

For the TransformerTag model, right now only the greedy_decoder method is supported. Below we use a randomly picked string from the bahasa melayu Wikipedia.

[6]: # https://ms.wikipedia.org/wiki/Bola_sepak
string = 'Pada amnya, hanya penjaga gol sahaja yang dibenarkan menyentuh bola dengan tangan di dalam kawasan golnya'

[7]: model.greedy_decoder([string])
[7]: [[('Pada', 2), ('amnya,', 2), ('hanya', 2), ('penjaga', 2), ('gol', 2), ('sahaja', 2), ('yang', 2), ('dibenarkan', 2), ('menyentuh', 2), ('bola', 2), ('dengan', 2), ('tangan', 2), ('di', 2), ('dalam', 2), ('kawasan', 2), ('golnya', 2)]]

Now assume we have a kesalahan frasa nama, where penjaga gol becomes gol penjaga.

[8]: # https://ms.wikipedia.org/wiki/Bola_sepak
string = 'Pada amnya, hanya gol penjaga sahaja yang dibenarkan menyentuh bola dengan tangan di dalam kawasan golnya'

[9]: model.greedy_decoder([string]) [9]: [[('Pada', 2), ('amnya,', 2), ('hanya', 2), ('penjaga', 3), ('gol', 3), ('sahaja', 2), ('yang', 2), ('dibenarkan', 2), ('menyentuh', 2), ('bola', 2), ('dengan', 2), ('tangan', 2), ('di', 2), ('dalam', 2), ('kawasan', 2), ('golnya', 2)]]

[10]: string='Sani mendapat markah yang tertinggi sekali.' string1='Hassan ialah peserta yang termuda sekali dalam pertandingan itu.' model.greedy_decoder([string, string1]) [10]: [[('Sani', 2), ('mendapat', 2), ('markah', 2), ('yang', 2), ('tertinggi.', 6)], [('Hassan', 2), ('ialah', 2), ('peserta', 2), ('yang', 2), ('termuda', 6), ('dalam', 2), ('pertandingan', 2), ('itu.', 2)]]

[11]: string='Dia kata kepada saya.' model.greedy_decoder([string]) [11]: [[('Dia', 2), ('berkata', 15), ('kepada', 2), ('saya.', 2)]]
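The integer tags map to the classes from malaya.tatabahasa.describe() above. A small sketch to attach descriptions to a prediction, assuming describe() returns the pandas DataFrame printed earlier:

rows = malaya.tatabahasa.describe()
# build a tag -> description lookup from the 'class' and 'Description' columns
tag2desc = dict(zip(rows['class'], rows['Description']))

for word, tag in model.greedy_decoder(['Dia kata kepada saya.'])[0]:
    print(word, '->', tag2desc[tag])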

[12]: import pickle

with open('tests/dataset-tatabahasa.pkl', 'rb') as fopen:
    test_set = pickle.load(fopen)

len(test_set)
[12]: 100


[13]: def get_xy(row):
    x, y, tag = [], [], []

    for i in range(len(row[0])):
        t = [row[0][i][0]]
        y.extend(t)
        t = [row[1][i][0]]
        x.extend(t)
        tag.extend([row[1][i][1]] * len(t))

    return ' '.join(x), ' '.join(y), tag

[14]: x, y, t = get_xy(test_set[0])
x, y, t
[14]: ('Dirk Jan Klaas " Klaas-Jan " Huntelaar ( lahir 12 Ogos 1983 ) merupakan pemain bola sepak Belanda yang bermain seperti posisi penyerang !',
 'Dirk Jan Klaas " Klaas-Jan " Huntelaar ( lahir 12 Ogos 1983 ) merupakan pemain bola sepak Belanda yang bermain di posisi penyerang .',
 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 9, 2, 2, 14])

[15]: model.greedy_decoder([x]) [15]: [[('Dirk', 2), ('Jan', 2), ('Klaas', 2), ('"', 2), ('Klaas-Jan', 2), ('"', 2), ('Huntelaar', 2), ('(', 2), ('lahir', 2), ('12', 2), ('Ogos', 2), ('1983', 2), (')', 2), ('merupakan', 2), ('pemain', 2), ('bola', 2), ('sepak', 2), ('Belanda', 2), ('yang', 2), ('bermain', 2), ('di', 9), ('posisi', 2), ('penyerang', 2), ('.', 14)]]

[16]: quantized_model.greedy_decoder([x])
[16]: [[('Dirk', 2), ('Jan', 2), ('Klaas', 2), ('"', 2), ('Klaas-Jan', 2), ('"', 2), ('Huntelaar', 2), ('(', 2), ('lahir', 2), ('12', 2), ('Ogos', 2), ('1983', 2), (')', 2), ('merupakan', 2), ('pemain', 2), ('bola', 2), ('sepak', 2), ('Belanda', 2), ('yang', 2), ('bermain', 2), ('di', 9), ('posisi', 2), ('penyerang', 2), ('.', 14)]]

[17]: x, y, t = get_xy(test_set[-1])
x, y, t
[17]: ('Pada tahun 2002 , kedua-dua gol beliau menduduki tempat ke-6 dalam 100 Greatest Sporting Moments oleh saluran Channel 4 UK .',
 'Pada tahun 2002 , kedua-dua gol ini menduduki tempat ke-6 dalam 100 Greatest Sporting Moments oleh saluran Channel 4 UK .',
 [2, 2, 2, 2, 2, 2, 11, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

[18]: model.greedy_decoder([x]) [18]: [[('Pada', 2), ('tahun', 2), ('2002', 2), (',', 2), ('kedua-dua', 2), ('gol', 2), ('beliau', 2), ('menduduki', 2), ('tempat', 2), ('ke-6', 2), ('dalam', 2), ('100', 2), ('Greatest', 2), ('Sporting', 2), ('Moments', 2), ('oleh', 2), ('saluran', 2), ('Channel', 2), ('4', 2), ('UK', 2), ('.', 2)]]

[19]: x, y, t = get_xy(test_set[-2])
x, y, t
[19]: ('Gol inilah yang bergelar Goal of the Century dengan undian Internet 2000 sejak FIFA .',
 'Gol inilah yang bergelar Goal of the Century di undian Internet 2000 oleh FIFA .',
 [2, 2, 2, 2, 2, 2, 2, 2, 9, 2, 2, 2, 9, 2, 2])


[20]: model.greedy_decoder([x]) [20]: [[('Gol', 2), ('inilah', 2), ('yang', 2), ('bergelar', 2), ('Goal', 2), ('of', 2), ('the', 2), ('Century', 2), ('dengan', 2), ('undian', 2), ('Internet', 2), ('2000', 2), ('sejak', 2), ('FIFA', 2), ('.', 2)]]

[21]: x, y, t = get_xy(test_set[-3])
x, y, t
[21]: ('Beliau mengambil bola dalam kawasan kepul diri lalu pusing dan luru lebih separuh padang sambil menyentuh bola 11 kali , memintas lima pemain England : ( Glenn Hoddle , Peter Reid , Kenny Sansom , Terry Butcher , dan Terry Fenwick ) serta penjaga gawang Peter Shilton .',
 'Beliau mengambil bola di kawasan pasukan diri lalu berpusing-pusing dan meluru lebih separuh padang sambil menyentuh bola 11 kali , memintas lima pemain England : ( Glenn Hoddle , Peter Reid , Kenny Sansom , Terry Butcher , dan Terry Fenwick ) serta penjaga gawang Peter Shilton .',
 [2, 2, 2, 9, 2, 10, 2, 2, 15, 2, 15, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

[22]: model.greedy_decoder([x])
[22]: [[('Beliau', 2), ('mengambil', 2), ('bola', 2), ('dari', 9), ('kawasan', 2), ('kaki', 10), ('diri', 2), ('lalu', 2), ('berpusing', 15), ('dan', 2), ('meluru', 15), ('lebih', 2), ('separuh', 2), ('padang', 2), ('sambil', 2), ('menyentuh', 2), ('bola', 2), ('11', 2), ('kali', 2), (',', 2), ('memintas', 2), ('lima', 2), ('pemain', 2), ('England', 2), (':', 2), ('(', 2), ('Glenn', 2), ('Hoddle', 2), (',', 2), ('Peter', 2), ('Reid', 2), (',', 2), ('Kenny', 2), ('Sansom', 2), (',', 2), ('Terry', 2), ('Butcher', 2), (',', 2), ('dan', 2), ('Terry', 2), ('Fenwick', 2), (')', 2), ('serta', 2), ('penjaga', 2), ('gawang', 2), ('Peter', 2), ('Shilton', 2), ('.', 2)]]

[24]: quantized_model.greedy_decoder([x])
[24]: [[('Beliau', 2), ('mengambil', 2), ('bola', 2), ('dari', 9), ('kawasan', 2), ('kaki', 10), ('diri', 2), ('lalu', 2), ('berpusing', 15), ('dan', 2), ('meluru', 15), ('lebih', 2), ('separuh', 2), ('padang', 2), ('sambil', 2), ('menyentuh', 2), ('bola', 2), ('11', 2), ('kali', 2), (',', 2), ('memintas', 2), ('lima', 2), ('pemain', 2), ('England', 2), (':', 2), ('(', 2), ('Glenn', 2), ('Hoddle', 2), (',', 2), ('Peter', 2), ('Reid', 2), (',', 2), ('Kenny', 2), ('Sansom', 2), (',', 2), ('Terry', 2), ('Butcher', 2), (',', 2), ('dan', 2), ('Terry', 2), ('Fenwick', 2), (')', 2), ('serta', 2), ('penjaga', 2), ('gawang', 2), ('Peter', 2), ('Shilton', 2), ('.', 2)]]

9.24 Num2Word

This tutorial is available as an IPython notebook at Malaya/example/num2word.

[3]: import malaya
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

9.24.1 To cardinal

def to_cardinal(number):
    """
    Translate from number input to cardinal text representation.

    Parameters
    ----------
    number: real number

    Returns
    -------
    result: str
        cardinal representation
    """

[4]: malaya.num2word.to_cardinal(123456789) [4]: 'seratus dua puluh tiga juta empat ratus lima puluh enam ribu tujuh ratus lapan puluh sembilan'

[5]: malaya.num2word.to_cardinal(10) [5]: 'sepuluh'

[6]: malaya.num2word.to_cardinal(12) [6]: 'dua belas'

[7]: malaya.num2word.to_cardinal(-1234567.89) [7]: 'negatif satu juta dua ratus tiga puluh empat ribu lima ratus enam puluh tujuh perpuluhan lapan sembilan'


9.24.2 To ordinal

def to_ordinal(number):
    """
    Translate from number input to ordinal text representation.

    Parameters
    ----------
    number: real number

    Returns
    -------
    result: str
        ordinal representation
    """

[8]: malaya.num2word.to_ordinal(1) [8]: 'pertama'

[9]: malaya.num2word.to_cardinal(1) [9]: 'satu'

[10]: malaya.num2word.to_ordinal(10) [10]: 'kesepuluh'

[11]: malaya.num2word.to_ordinal(12) [11]: 'kedua belas'

[12]: malaya.num2word.to_cardinal(-123456789) [12]: 'negatif seratus dua puluh tiga juta empat ratus lima puluh enam ribu tujuh ratus lapan puluh sembilan'

9.25 Word2Num

This tutorial is available as an IPython notebook at Malaya/example/word2num.

[1]: import malaya

def word2num(string):
    """
    Translate from string to number, eg 'kesepuluh' -> 10.

    Parameters
    ----------
    string: str

    Returns
    -------
    result: int / float
    """

[3]: malaya.word2num.word2num('dua belas') [3]: 12

[4]: malaya.word2num.word2num('kesebelas') [4]: 11


[5]: malaya.word2num.word2num('negatif kesebelas') [5]: -11

[7]: malaya.word2num.word2num('seratus dua puluh tiga juta empat ratus lima puluh enam ribu tujuh ratus lapan puluh sembilan') [7]: 123456789

[8]: malaya.word2num.word2num('negatif seratus dua puluh tiga juta empat ratus lima puluh enam ribu tujuh ratus lapan puluh sembilan') [8]: -123456789

[9]: malaya.word2num.word2num('negatif satu juta dua ratus tiga puluh empat ribu lima ratus enam puluh tujuh perpuluhan lapan sembilan') [9]: -1234567.89
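For well-formed inputs the two modules behave as inverses; a small round-trip sketch using only the calls shown above (the exact values are taken from the example cells):

import malaya

for n in [12, 123456789, -1234567.89]:
    text = malaya.num2word.to_cardinal(n)
    # word2num should recover the original number from the cardinal text
    assert malaya.word2num.word2num(text) == n, (n, text)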


9.26 Knowledge Graph Triples

Generate MS text -> EN Knowledge Graph Triples format.

This tutorial is available as an IPython notebook at Malaya/example/knowledge-graph-triples.

This module is only trained on standard language structure, so it is not safe to use on local (social media) language structure.

[1]: %%time

import malaya CPU times: user 5.07 s, sys: 867 ms, total: 5.94 s Wall time: 5.26 s


9.26.1 List available Transformer model

[2]: malaya.knowledge_graph.available_transformer()
INFO:root:tested on KELM test set.
[2]:           Size (MB)  Quantized Size (MB)      BLEU  Suggested length
     t5           1250.0                481.0  0.919301             256.0
     small-t5      355.6                195.0  0.910234             512.0
     tiny-t5       208.0                103.0  0.933374             256.0

9.26.2 Load Transformer model

def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
    """
    Load transformer to generate knowledge graphs in triples format from texts,
    MS text -> EN triples format.

    Parameters
    ----------
    model : str, optional (default='small-t5')
        Model architecture supported. Allowed values:

        * ``'t5'`` - T5 BASE parameters.
        * ``'small-t5'`` - T5 SMALL parameters.
        * ``'tiny-t5'`` - T5 TINY parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessarily faster, totally depends on the machine.

    Returns
    -------
    result: malaya.model.t5.KnowledgeGraph class
    """

[3]: model= malaya.knowledge_graph.transformer(model='small-t5') INFO:root:running knowledge-graph-triplet/small-t5 using device /device:CPU:0

9.26.3 Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[4]: quantized_model = malaya.knowledge_graph.transformer(model='small-t5', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running knowledge-graph-triplet/small-t5-quantized using device /device:CPU:0


[5]: string1="Yang Berhormat Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak ialah

˓→ahli politik Malaysia dan merupakan bekas Perdana Menteri Malaysia ke-6 yang mana

˓→beliau menjawat jawatan dari 3 April 2009 hingga 9 Mei 2018. Beliau juga pernah

˓→berkhidmat sebagai bekas Menteri Kewangan dan merupakan Ahli Parlimen Pekan " string2="Pahang ialah negeri yang ketiga terbesar di Malaysia Terletak di lembangan

˓→Sungai Pahang yang amat luas negeri Pahang bersempadan dengan Kelantan di utara

˓→Perak serta di barat di selatan dan Terengganu dan

˓→Laut China Selatan di timur."

Predict using greedy decoder

def greedy_decoder(
    self,
    strings: List[str],
    get_networkx: bool = True,
):
    """
    Generate triples knowledge graph using greedy decoder.
    Example, "Joseph Enanga juga bermain untuk Union Douala." -> "Joseph Enanga member of sports team Union Douala"

    Parameters
    ----------
    strings: List[str]
    get_networkx: bool, optional (default=True)
        If True, will generate networkx.MultiDiGraph.

    Returns
    -------
    result: List[Dict]
    """

[6]: r = model.greedy_decoder([string1, string2])

[7]: r[0]
[7]: {'result': [{'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'occupation',
        'object': 'Politician'},
       {'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'country of citizenship',
        'object': 'Malaysia'},
       {'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'occupation',
        'object': 'Finance minister'},
       {'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'position held',
        'object': 'Member of the Pahang Town Parliament'},
       {'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'occupation',
        'object': 'Prime Minister of Malaysia'},
       {'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'work period (start)',
        'object': '03 April 2009'}],
      'main_object': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
      'triple': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak occupation Politician, country of citizenship Malaysia, occupation Finance minister, position held Member of the Pahang Town Parliament, occupation Prime Minister of Malaysia, work period (start) 03 April 2009.',
      'G': <networkx.classes.multidigraph.MultiDiGraph>}
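Since every element of the result follows the schema above, the predicted triples can be flattened into plain (subject, relation, object) tuples for downstream use; a minimal sketch:

# flatten the predicted triples into (subject, relation, object) tuples.
triples = [(t['subject'], t['relation'], t['object']) for t in r[0]['result']]
for subj, rel, obj in triples:
    print(f'{subj} --[{rel}]--> {obj}')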

[10]: import matplotlib.pyplot as plt
      import networkx as nx

      g = r[0]['G']
      plt.figure(figsize=(6, 6))
      pos = nx.spring_layout(g)
      nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
      nx.draw_networkx_edge_labels(g, pos=pos)
      plt.show()

[11]: g = r[1]['G']
      plt.figure(figsize=(6, 6))
      pos = nx.spring_layout(g)
      nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
      nx.draw_networkx_edge_labels(g, pos=pos)
      plt.show()


[12]: # https://ms.wikipedia.org/wiki/Malaysia

      string = """
      Malaysia secara rasminya Persekutuan Malaysia ialah sebuah negara raja berperlembagaan persekutuan di Asia Tenggara yang terdiri daripada tiga belas negeri dan tiga wilayah persekutuan, yang menduduki bumi berkeluasan 330,803 kilometer persegi (127,720 bt2). Malaysia terbahagi kepada dua kawasan yang mengapit Laut China Selatan, iaitu Semenanjung Malaysia dan Borneo Malaysia (juga Malaysia Barat dan Timur). Malaysia berkongsi sempadan darat dengan Thailand, Indonesia, dan Brunei dan juga sempadan laut dengan Singapura dan Filipina. Ibu negara Malaysia ialah Kuala Lumpur, manakala Putrajaya merupakan pusat kerajaan persekutuan. Pada tahun 2009, Malaysia diduduki oleh 28 juta penduduk dan pada tahun 2017 dianggarkan telah mencecah lebih 30 juta orang yang menduduki di Malaysia.

      Malaysia berakar-umbikan Kerajaan-kerajaan Melayu yang wujud di wilayahnya dan menjadi taklukan Empayar British sejak abad ke-18. Wilayah British pertama di sini dikenali sebagai Negeri-Negeri Selat. Semenanjung Malaysia yang ketika itu dikenali sebagai Tanah Melayu atau Malaya, mula-mula disatukan di bawah komanwel pada tahun 1946, sebelum menjadi Persekutuan Tanah Melayu pada tahun 1948. Pada tahun 1957 Semenanjung Malaysia mencapai Kemerdekaan dan bebas daripada penjajah dan sekali gus menjadi catatan sejarah terpenting bagi Malaysia. Pada tahun 1963, Tanah Melayu bersatu bersama dengan negara Sabah, Sarawak, dan Singapura bagi membentuk Malaysia. Pada tahun 1965, Singapura keluar dari persekutuan untuk menjadi negara kota yang bebas. Semenjak itu, Malaysia menikmati antara ekonomi yang terbaik di Asia, dengan purata pertumbuhan keluaran dalam negara kasarnya (KDNK) kira-kira 6.5% selama 50 tahun pertama kemerdekaannya.

      Ekonomi negara yang selama ini dijana oleh sumber alamnya kini juga berkembang dalam sektor-sektor ukur tanah, sains, kejuruteraan, pendidikan, pelancongan, perkapalan, perdagangan dan perubatan.

      Ketua negara Malaysia ialah Yang di-Pertuan Agong, iaitu raja elektif yang terpilih dan diundi dari kalangan sembilan raja negeri Melayu. Ketua kerajaannya pula ialah Perdana Menteri. Sistem kerajaan Malaysia banyak berdasarkan sistem parlimen Westminster, dan sistem perundangannya juga berasaskan undang-undang am Inggeris.

      Malaysia terletak berdekatan dengan khatulistiwa dan beriklim tropika, serta mempunyai kepelbagaian flora dan fauna, sehingga diiktiraf menjadi salah satu daripada 17 negara megadiversiti. Di Malaysia terletaknya Tanjung Piai, titik paling selatan di seluruh tanah besar Eurasia. Malaysia ialah sebuah negara perintis Persatuan Negara-Negara Asia Tenggara dan Pertubuhan Persidangan Islam, dan juga anggota Kerjasama Ekonomi Asia-Pasifik, Negara-Negara Komanwel, dan Pergerakan Negara-Negara Berkecuali.
      """

[13]: def simple_cleaning(string):
          return ''.join([s for s in string if s not in ',.\'";'])

      string = malaya.text.function.split_into_sentences(string)
      string = [simple_cleaning(s) for s in string if len(s) > 50]
      string
[13]: ['Malaysia secara rasminya Persekutuan Malaysia ialah sebuah negara raja berperlembagaan persekutuan di Asia Tenggara yang terdiri daripada tiga belas negeri dan tiga wilayah persekutuan yang menduduki bumi berkeluasan 330803 kilometer persegi (127720 bt2)',
 'Malaysia terbahagi kepada dua kawasan yang mengapit Laut China Selatan iaitu Semenanjung Malaysia dan Borneo Malaysia (juga Malaysia Barat dan Timur)',
 'Malaysia berkongsi sempadan darat dengan Thailand Indonesia dan Brunei dan juga sempadan laut dengan Singapura dan Filipina',
 'Ibu negara Malaysia ialah Kuala Lumpur manakala Putrajaya merupakan pusat kerajaan persekutuan Pada tahun 2009 Malaysia diduduki oleh 28 juta penduduk dan pada tahun 2017 dianggarkan telah mencecah lebih 30 juta orang yang menduduki di Malaysia',
 'Malaysia berakar-umbikan Kerajaan-kerajaan Melayu yang wujud di wilayahnya dan menjadi taklukan Empayar British sejak abad ke-18',
 'Wilayah British pertama di sini dikenali sebagai Negeri-Negeri Selat',
 'Semenanjung Malaysia yang ketika itu dikenali sebagai Tanah Melayu atau Malaya mula-mula disatukan di bawah komanwel pada tahun 1946 sebelum menjadi Persekutuan Tanah Melayu pada tahun 1948',
 'Pada tahun 1957 Semenanjung Malaysia mencapai Kemerdekaan dan bebas daripada penjajah dan sekali gus menjadi catatan sejarah terpenting bagi Malaysia',
 'Pada tahun 1963 Tanah Melayu bersatu bersama dengan negara Sabah Sarawak dan Singapura bagi membentuk Malaysia',
 'Pada tahun 1965 Singapura keluar dari persekutuan untuk menjadi negara kota yang bebas',
 'Semenjak itu Malaysia menikmati antara ekonomi yang terbaik di Asia dengan purata pertumbuhan keluaran dalam negara kasarnya (KDNK) kira-kira 65% selama 50 tahun pertama kemerdekaannya',
 'Ekonomi negara yang selama ini dijana oleh sumber alamnya kini juga berkembang dalam sektor-sektor ukur tanah sains kejuruteraan pendidikan pelancongan perkapalan perdagangan dan perubatan',
 'Ketua negara Malaysia ialah Yang di-Pertuan Agong iaitu raja elektif yang terpilih dan diundi dari kalangan sembilan raja negeri Melayu',
 'Sistem kerajaan Malaysia banyak berdasarkan sistem parlimen Westminster dan sistem perundangannya juga berasaskan undang-undang am Inggeris',
 'Malaysia terletak berdekatan dengan khatulistiwa dan beriklim tropika serta mempunyai kepelbagaian flora dan fauna sehingga diiktiraf menjadi salah satu daripada 17 negara megadiversiti',
 'Di Malaysia terletaknya Tanjung Piai titik paling selatan di seluruh tanah besar Eurasia',
 'Malaysia ialah sebuah negara perintis Persatuan Negara-Negara Asia Tenggara dan Pertubuhan Persidangan Islam dan juga anggota Kerjasama Ekonomi Asia-Pasifik Negara-Negara Komanwel dan Pergerakan Negara-Negara Berkecuali']

[14]: r = model.greedy_decoder(string)

[15]: g = r[0]['G']

      for i in range(1, len(r), 1):
          g.update(r[i]['G'])

[16]: plt.figure(figsize=(17, 17))
      pos = nx.spring_layout(g)
      nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
      nx.draw_networkx_edge_labels(g, pos=pos)
      plt.show()


[17]: # https://www.utusan.com.my/terkini/2021/07/agong-dukacita-ketua-pembangkang-tuntut-pm-letak-jawatan/
      # https://www.hmetro.com.my/mutakhir/2021/07/736206/kediaman-pm-jadi-tumpuan-media

      string = """
      KUALA LUMPUR: Ketua Pembangkang, Datuk Seri Anwar Ibrahim menggesa Perdana Menteri, Tan Sri Muhyiddin Yassin meletak jawatan susulan kenyataan dikeluarkan Istana Negara berhubung isu Proklamasi Darurat.

      “Ini menunjukkan Kabinet yang diketuai Tan Sri Muhyiddin melanggar Perlembagaan menghina instusi raja Perlembagaan termasuk menteri di Jabatan Perdana Menteri mengelirukan Dewan.

      “Oleh yang demikian, kita menuntut Perdana Menteri meletak jawatan,” ujarnya hari ini.

      Terdahulu, Yang di-Pertuan Agong, Al-Sultan Abdullah Ri’ayatuddin Al-Mustafa Billah Shah menzahirkan rasa dukacita dengan pengumuman pembatalan darurat di Parlimen.

      Perkara itu dimaklumkan Datuk Pengelola Bijaya Diraja Istana Negara, Datuk Indera Ahmad Fadil Shamsudddin dalam satu kenyataan hari ini.

      “Sehubungan dengan itu, Seri Paduka Baginda menzahirkan rasa amat dukacita dengan kenyataan yang telah dibuat pada 26 Julai, 2021 lalu bahawa kerajaan telah membatalkan semua Ordinan Darurat yang telah dimasyhurkan oleh Baginda sepanjang tempoh darurat walhal belum lagi diperkenan baginda,” katanya.

      Kuala Lumpur: Kediaman Perdana Menteri Tan Sri Muhyiddin Yassin menjadi tumpuan petugas media susulan kenyataan yang dikeluarkan Istana Negara berhubung isu pembatalan Ordinan Darurat hari ini.

      Petugas media dilihat mula 'berkampung' di rumah Perdana Menteri yang terletak di Bukit Damansara di sini, sejak 1 tengah hari ini.

      Pemerhatian Bernama mendapati beberapa kenderaan dipercayai membawa menteri dan Peguam Negara memasuki pekarangan kediaman Perdana Menteri pada 1.30 tengah hari.

      Dalam kenyataan Istana Negara itu, Yang di-Pertuan Agong Al-Sultan Abdullah Ri'ayatuddin Al-Mustafa Billah Shah menzahirkan rasa amat dukacita dengan kenyataan di Parlimen pada Isnin bahawa kerajaan membatalkan semua Ordinan Darurat walhal ia belum lagi diperkenan Seri Paduka.

      Yang di-Pertuan Agong juga amat dukacita kerana apa yang diperkenan dan dititahkan kepada Menteri di Jabatan Perdana Menteri (Parlimen dan Undang-Undang) Datuk Seri Takiyuddin Hassan serta Peguam Negara Tan Sri Idrus Harun bahawa cadangan pembatalan semua Ordinan Darurat dibentang dan dibahaskan di Parlimen bagi tujuan diungkaikan tidak dilaksanakan.
      """

[18]: def simple_cleaning(string):
          return ''.join([s for s in string if s not in ',.\'";'])

      string = malaya.text.function.split_into_sentences(string)
      string = [simple_cleaning(s) for s in string if len(s) > 50]
      string
[18]: ['KUALA LUMPUR: Ketua Pembangkang Datuk Seri Anwar Ibrahim menggesa Perdana Menteri Tan Sri Muhyiddin Yassin meletak jawatan susulan kenyataan dikeluarkan Istana Negara berhubung isu Proklamasi Darurat',
 'Ini menunjukkan Kabinet yang diketuai Tan Sri Muhyiddin melanggar Perlembagaan menghina instusi raja Perlembagaan termasuk menteri di Jabatan Perdana Menteri mengelirukan Dewan',
 'Oleh yang demikian kita menuntut Perdana Menteri meletak jawatan ujarnya hari ini',
 'Terdahulu Yang di-Pertuan Agong Al-Sultan Abdullah Riayatuddin Al-Mustafa Billah Shah menzahirkan rasa dukacita dengan pengumuman pembatalan darurat di Parlimen',
 'Perkara itu dimaklumkan Datuk Pengelola Bijaya Diraja Istana Negara Datuk Indera Ahmad Fadil Shamsudddin dalam satu kenyataan hari ini',
 'Sehubungan dengan itu Seri Paduka Baginda menzahirkan rasa amat dukacita dengan kenyataan yang telah dibuat pada 26 Julai 2021 lalu bahawa kerajaan telah membatalkan semua Ordinan Darurat yang telah dimasyhurkan oleh Baginda sepanjang tempoh darurat walhal belum lagi diperkenan baginda katanya',
 'Kuala Lumpur: Kediaman Perdana Menteri Tan Sri Muhyiddin Yassin menjadi tumpuan petugas media susulan kenyataan yang dikeluarkan Istana Negara berhubung isu pembatalan Ordinan Darurat hari ini',
 'Petugas media dilihat mula berkampung di rumah Perdana Menteri yang terletak di Bukit Damansara di sini sejak 1 tengah hari ini',
 'Pemerhatian Bernama mendapati beberapa kenderaan dipercayai membawa menteri dan Peguam Negara memasuki pekarangan kediaman Perdana Menteri pada 130 tengah hari',
 'Dalam kenyataan Istana Negara itu Yang di-Pertuan Agong Al-Sultan Abdullah Riayatuddin Al-Mustafa Billah Shah menzahirkan rasa amat dukacita dengan kenyataan di Parlimen pada Isnin bahawa kerajaan membatalkan semua Ordinan Darurat walhal ia belum lagi diperkenan Seri Paduka',
 'Yang di-Pertuan Agong juga amat dukacita kerana apa yang diperkenan dan dititahkan kepada Menteri di Jabatan Perdana Menteri (Parlimen dan Undang-Undang) Datuk Seri Takiyuddin Hassan serta Peguam Negara Tan Sri Idrus Harun bahawa cadangan pembatalan semua Ordinan Darurat dibentang dan dibahaskan di Parlimen bagi tujuan diungkaikan tidak dilaksanakan']

[19]: r = model.greedy_decoder(string)
WARNING:root:1
WARNING:root:1
(... the same WARNING:root:1 line repeated many times; omitted ...)

[20]: g = r[0]['G']

      for i in range(1, len(r), 1):
          g.update(r[i]['G'])

[21]: plt.figure(figsize=(17, 17))
      pos = nx.spring_layout(g)
      nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
      nx.draw_networkx_edge_labels(g, pos=pos)
      plt.show()


9.27 Knowledge Graph from Dependency

Parse knowledge graph from dependency parsing.

This tutorial is available as an IPython notebook at Malaya/example/knowledge-graph-from-dependency.

This module was only trained on standard language structure, so it is not safe to use on local (colloquial) language structure.


[1]: %%time

     import malaya
CPU times: user 5.14 s, sys: 883 ms, total: 6.03 s
Wall time: 6.61 s

9.27.1 Load dependency parsing models

Read more about dependency parsing at https://malaya.readthedocs.io/en/latest/load-dependency.html. In this example, I am going to load a stack of dependency parsing models.

[2]: quantized_model = malaya.dependency.transformer(version='v1', model='xlnet', quantized=True)
     alxlnet = malaya.dependency.transformer(version='v1', model='alxlnet')
WARNING:root:Load quantized model will cause accuracy drop.

9.27.2 Predict dependency parsing

[3]: s = 'Najib yang juga Ahli Parlimen Pekan memuji sikap Ahli Parlimen Langkawi itu yang mengaku bersalah selepas melanggar SOP kerana tidak mengambil suhu badan ketika masuk ke sebuah surau di Langkawi pada Sabtu lalu'
     tagging, indexing = malaya.stack.voting_stack([quantized_model, alxlnet, quantized_model], s)

[4]: d_object = malaya.dependency.dependency_graph(tagging, indexing)
     d_object.to_graphvis()
[4]: <graphviz visualization of the dependency parse>

9.27.3 Parse knowledge graph from dependency

def parse_from_dependency(tagging: List[Tuple[str, str]],
                          indexing: List[Tuple[str, str]],
                          subjects: List[List[str]] = [['flat', 'subj', 'nsubj', 'csubj']],
                          relations: List[List[str]] = [['acl', 'xcomp', 'ccomp', 'obj', 'conj', 'advcl'], ['obj']],
                          objects: List[List[str]] = [['obj', 'compound', 'flat', 'nmod', 'obl']],
                          get_networkx: bool = True):
    """
    Generate knowledge graphs from dependency parsing; we suggest using dependency parsing v1.

    Parameters
    ----------
    tagging: List[Tuple(str, str)]
        `tagging` result from dependency model.
    indexing: List[Tuple(str, str)]
        `indexing` result from dependency model.
    subjects: List[List[str]], optional
        List of dependency labels for subjects.
    relations: List[List[str]], optional
        List of dependency labels for relations.
    objects: List[List[str]], optional
        List of dependency labels for objects.
    get_networkx: bool, optional (default=True)
        If True, will generate networkx.MultiDiGraph.

    Returns
    -------
    result: Dict[result, G]
    """

[5]: r = malaya.knowledge_graph.parse_from_dependency(tagging, indexing)

[6]: r
[6]: {'result': [{'subject': 'Najib',
        'relation': 'memuji sikap mengaku melanggar SOP',
        'object': 'badan'},
       {'subject': 'Najib',
        'relation': 'memuji sikap mengaku melanggar mengambil masuk',
        'object': 'suhu'},
       {'subject': 'Najib',
        'relation': 'memuji sikap',
        'object': 'Ahli Parlimen Langkawi'}],
      'G': <networkx.classes.multidigraph.MultiDiGraph>}
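The label lists are overridable, so extraction can be tuned per domain; a sketch that reuses the defaults from the signature above, with csubj dropped from the subject labels:

# sketch: restrict which dependency labels count as subjects
# (parameter names and defaults are taken from the signature above).
r_custom = malaya.knowledge_graph.parse_from_dependency(
    tagging, indexing,
    subjects=[['flat', 'subj', 'nsubj']],
    relations=[['acl', 'xcomp', 'ccomp', 'obj', 'conj', 'advcl'], ['obj']],
    objects=[['obj', 'compound', 'flat', 'nmod', 'obl']],
)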

[7]: import matplotlib.pyplot as plt
     import networkx as nx

     g = r['G']
     plt.figure(figsize=(6, 6))
     pos = nx.spring_layout(g)
     nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
     nx.draw_networkx_edge_labels(g, pos=pos)
     plt.show()


9.27.4 Larger knowledge graph

Below I copy-pasted text from different news articles found via a Google search on isu israel (the Israel issue).

[8]: string = """
     Kerajaan gabungan baharu Israel memeterai perjanjian termasuk berkaitan had tempoh jawatan Perdana Menteri, semalam.

     Ia sekali gus akan menamatkan tempoh pemerintahan 12 tahun Benjamin Netanyahu sebagai Perdana Menteri, antara pemimpin negara itu yang lama memegang jawatan berkenaan.

     Gabungan parti yang akan memerintah Israel dijangka memberi fokus kepada isu ekonomi dan sosial, berbanding risiko mendedahkan keretakan dalaman dengan cuba menangani isu diplomatik besar seperti konflik Israel-Palestin.

     Netanyahu, 71, pemimpin Israel paling lama berkhidmat, akan digantikan esok oleh pemimpin gabungan yang buat kali pertama turut disertai sebuah parti minoriti Arab Israel.

     Di bawah perjanjian perkongsian kuasa itu, Naftali Bennett daripada parti ultra nasionalis, Yamina dijangka akan dilantik sebagai Perdana Menteri selama dua tahun.

     Bennett semalam berkata, kerajaan gabungan itu mahu 'mengakhiri krisis politik dua setengah tahun' walaupun tidak jelas berapa lama elemen berbeza dalam gabungan itu akan terus bersatu, lapor agensi berita ArabNews.

     Beliau kemudian akan menyerahkan jawatan itu kepada bekas penyiar TV, Yair Lapid daripada parti Yesh Atid.

     Kandungan perjanjian yang digariskan parti, yang disifatkan Lapid sebagai 'kerajaan perpaduan' antara lain mengehadkan tempoh jawatan Perdana Menteri kepada dua penggal atau lapan tahun.

     Selain itu, pembinaan infrastruktur yang turut merangkumi hospital, universiti dan lapangan terbang baharu serta meluluskan belanjawan dua tahun untuk menstabilkan kewangan negara, yang mana kebuntuan politik yang berpanjangan menyebabkan Israel masih menggunakan versi pro-anggaran daripada belanjawan dasar 2019 yang disahkan pada pertengahan 2018.

     Kandungan lain termasuk mempertahankan status-quo dalam isu agama dan negara, dengan parti Yamina memiliki hak veto dan rancangan menyeluruh pengangkutan di Tebing Barat yang diduduki Israel.

     Sebagai pemimpin pembangkang dan ketua parti terbesar di Parlimen, Netanyahu dijangka terus melakukan apa saja dengan kuasanya untuk menjatuhkan kerajaan

     Dakwaan beberapa pihak yang mengatakan bahawa Israel mempunyai hak untuk mempertahankan diri adalah tidak boleh diterima sama sekali.

     Menurut bekas Perdana Menteri, Tun Dr Mahathir Mohamad, dakwaan itu sebaliknya merujuk kepada warga Palestin yang terpaksa mempertahankan hak mereka daripada diserang oleh rejim zionis Israel sejak seminggu lalu.

     “Saya tidak boleh diterima dengan jawapan Israel yang mendakwa mereka ada hak mempertahan diri. Itu bukan mempertahankan diri, kerana sepanjang tempoh pergolakan ini, mereka menimbulkan konfrontasi dengan rakyat Palestin.

     “Ini kerana, mereka ingin menceroboh lebih banyak tanah di Palestin. Saya yakin bahawa selepas ini, mereka akan mengambil alih Sheikh Jarrah misalnya dan mungkin akan mengambil alih tanah di kawasan terletaknya Masjid Al-Aqsa.

     “Ini adalah taktik Israel, mereka tidak mempertahankan diri mereka, tetapi mereka menyerang pihak lain. Jika mereka mempertahankan diri, maka mereka harus berada di negara mereka sendiri.

     “Tetapi mereka berperang di tanah Palestin, jadi alasan orang Israel bahawa mereka berhak untuk mempertahankan diri adalah tidak dapat diterima.

     “Sememangnya, apa yang berlaku ketika ini ialah rakyat Palestin yang sedang berusaha mempertahankan diri dan mereka sangat lemah,” ujarnya.

     Beliau berkata demikian semasa berucap menerusi satu sesi penstriman bertajuk ‘Palestine: Malaysia With Love’ di laman Facebook beliau pada malam Ahad.

     Terdahulu, Presiden Amerika Syarikat (AS), Joe Biden menyuarakan sokongan tidak berbelah bahagi kepada hak Israel mempertahankan diri daripada serangan roket Hamas dan kumpulan pejuang lain.

     Beliau menyuarakan pendirian itu dalam satu panggilan telefon bersama Presiden Palestin, Mahmoud Abbas. Pada masa sama, Setiausaha Pertahanan, Lloyd Austin pula mengulangi hak Israel untuk mempertahankan diri.

     Arab Saudi menyelaras delegasi negara Arab ke Pertubuhan Bangsa-Bangsa Bersatu (PBB) bagi membincangkan usaha menangani kemelut Israel-Palestin.

     Wakil tetap Arab Saudi ke PBB, Abdallah Al-Mouallami, bertemu dengan wakil tetap China ke PBB, yang juga Presiden Majlis Keselamatan PBB (UNSC) bulan ini, bagi membincangkan perkara itu.

     Ia antara usaha Arab Saudi memimpin negara Arab lain memberi maklumat kepada UNSC mengenai serangan Israel terhadap rakyat Palestin, supaya masyarakat antarabangsa dapat menunaikan tanggungjawab melindungi orang awam.

     Mesyuarat turut dihadiri Faisal Al-Haqbani, iaitu pegawai jawatankuasa politik khas delegasi tetap Arab Saudi ke PBB.

     Al-Mouallami turut bertemu dengan Presiden Perhimpunan Agung PBB, Volkan Bozkır, dalam usaha Arab Saudi memimpin gerakan negara Islam untuk Palestin.

     Pertemuan Kumpulan Islam dengan Presiden Perhimpunan Agung PBB itu bertujuan untuk menjelaskan mengenai serangan Israel baru-baru ini termasuk menggesa masyarakat antarabangsa melindungi orang awam.

     Arab Saudi selalu mendahului sokongan kepada perjuangan Palestin di PBB, sebelum masyarakat antarabangsa berbuat demikian.

     Kelmarin, Menteri Luar Arab Saudi, Putera Faisal bin Farhan, bercakap dengan Menteri Luar Palestin, Riyad Al-Maliki, menerusi panggilan telefon.

     Dalam perbualan itu, beliau menegaskan pihaknya mengutuk amalan haram yang dilakukan pihak berkuasa Israel dan perlu tindakan segera untuk menghentikan pencabulan undang-undang dan nilai kemanusiaan antarabangsa.

     Susulan permintaan Arab Saudi, Pertubuhan Kerjasama Islam (OIC) turut mengadakan mesyuarat tergempar hari ini bagi membincang keadaan di Baitulmaqdis dan Gaza

     Mesyuarat antara menteri luar negara anggota OIC itu juga bermatlamat menangani serangan berterusan oleh Israel di wilayah Palestin

     Rakyat Malaysia yang bertungkus-lumus bertindak sebagai ‘tentera siber’ di media sosial demi membantah tindakan kekerasan zionis Israel terhadap Palestin wajar diberikan penghargaan.

     Hal ini kerana mereka tanpa mengira waktu terus memberikan pencerahan kepada semua pihak sehingga membuka mata penduduk seluruh dunia kekejaman militan Israel dalam siri serangan dan keganasan terhadap penduduk Gaza.

     Presiden Barisan Jemaah Islamiah Se-Malaysia (Berjasa), Zamani Ibrahim antara lain berkata, parti itu mendukung tindakan kerajaan Hamas yang penuh komited berjuang bagi mempertahankan tanah air umat Islam di negara tersebut.

     “Isu Palestin perlu dilihat dalam kerangka pertembungan antara haq dan batil yang mana umat Islam keseluruhannya perlu cakna dan berusaha memahami perkara ini dengan tuntas.

     “Tanah Palestin merupakan milik rakyat Palestin secara sejarah dan undang-undang. Kehadiran militan Israel mengaku Palestin milik mereka pada tahun 1948 dan kemudiannya secara strategik menghalau penduduk asal adalah satu tindakan ‘Settler Colonialism’ yang terkutuk.

     “Isu ini juga bukanlah semata-mata isu kemanusiaan tetapi ia juga mengenai soal penjarahan asing yang merampas tanah milik penduduk asal. Bukan sekadar merampas, malah lebih keji lagi penduduk asal dipenjara, diseksa, dihalau dan dibunuh dengan kejam sekali,” katanya dalam satu kenyataan di sini hari ini.

     Kenyataan itu hadir susulan laporan media antarabangsa berhubung konflik antara Israel yang terus melancarkan serangan udara terhadap Gaza, Palestin sehingga hari ini.

     Ketegangan antara kedua-dua negara tersebut yang bermula sejak Isnin minggu lalu, semakin meningkat sehingga Persatuan Bangsa-Bangsa Bersatu (PBB) memberi amaran ia bakal mencetuskan peperangan secara besar-besaran.

     Siri keganasan itu meletus selepas konflik Israel-Palestin meningkat di Baitulmuqaddis Timur berikutan pasukan keselamatan Israel menyerbu Masjid Al-Asqa dan menyerang penduduk Palestin.

     Dalam pada itu, beliau menggesa kerajaan untuk bertegas dan memainkan peranan selain menjadi negara pendesak kepada badan-badan antarabangsa bagi mengambil tindakan undang-undang menyelamatkan umat Islam sekali gus mempertahankan negara Palestin.

     Malaysia, Indonesia dan Brunei menggesa agar satu sidang tergempar Perhimpunan Agung Pertubuhan Bangsa-Bangsa Bersatu (PBB) diadakan segera bagi menangani isu keganasan melampau rejim Israel ke atas rakyat Palestin.

     Gesaan dibuat menerusi satu kenyataan bersama yang dikeluarkan oleh Perdana Menteri, Tan Sri Muhyiddin Yassin; Presiden Indonesia, Joko Widodo serta Sultan dan Yang Di-Pertuan Brunei Darussalam, Sultan Hassanal Bolkiah hari ini.

     Ketiga-tiga pemimpin negara itu mahu sebuah resolusi keamanan dicapai bagi memastikan kekejaman yang menimpa rakyat Palestin ketika ini dapat dihentikan.

     Mereka turut menuntut pihak berkepentingan untuk melaksanakan gencatan senjata serta menerima penglibatan pemantau antarabangsa di bandar Al-Quds, Palestin untuk memastikan gencatan itu dipatuhi.

     "Kami menggesa komuniti antarabangsa untuk kekal dengan komitmen supaya penyelesaian dua negara (two-state solution) dilaksana dalam mencapai negara Palestin yang merdeka, berdasarkan sempadan sebelum 1967, dengan Baitulmaqdis sebagai ibu negara Palestin.

     "Kami mengulangi solidariti dan komitmen kepada rakyat Palestin, termasuk hak mereka untuk menentukan nasib sendiri dan membentuk negara Palestin yang merdeka dan berdaulat," memetik kenyataan tersebut.

     Ketiga-tiga pemimpin turut menyokong segala usaha antarabangsa yang terarah kepada perdamaian yang panjang di Asia Barat berdasarkan Resolusi PBB dan undang-undang antarabangsa serta undang-undang antarabangsa.

     Pada masa sama, para pemimpin tiga negara itu turut membidas tindakan kejam Israel yang secara jelas melanggar hak asasi manusia, undang-undang antarabangsa dan kemanusiaan.

     Mereka menyifatkan tindakan Israel itu adalah tidak berperikemanusiaan, bersifat penjajah dan ibarat pemerintahan aparteid.

     "Maka, pentingnya tindakan segera dan kolektif dilakukan bagi memastikan tindakan sewajarnya terhadap pelaku," memetik kenyataan itu lagi.
     """

     string = malaya.text.function.split_into_sentences(string)
     len(string)
[8]: 58

[9]: results = []
     for s in string:
         try:
             tagging, indexing = malaya.stack.voting_stack([quantized_model, alxlnet, quantized_model], s)
             r = malaya.knowledge_graph.parse_from_dependency(tagging, indexing)
             results.append(r)
         except:
             pass

     len(results), len(string)
[9]: (34, 58)
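The bare except above silently drops the 24 sentences that failed. A small sketch of a variant that records the failures so they can be inspected afterwards:

# sketch: same loop as above, but failures are kept for later inspection.
results, failed = [], []
for s in string:
    try:
        tagging, indexing = malaya.stack.voting_stack([quantized_model, alxlnet, quantized_model], s)
        results.append(malaya.knowledge_graph.parse_from_dependency(tagging, indexing))
    except Exception as e:
        failed.append((s, repr(e)))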

[10]: g = results[0]['G']

      for i in range(1, len(results), 1):
          g.update(results[i]['G'])

[11]: import matplotlib.pyplot as plt
      import networkx as nx

      plt.figure(figsize=(17, 17))
      pos = nx.spring_layout(g)
      nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
      nx.draw_networkx_edge_labels(g, pos=pos)
      plt.show()




9.28 Text Augmentation

This tutorial is available as an IPython notebook at Malaya/example/augmentation.

[1]: %%time

     import malaya
CPU times: user 4.36 s, sys: 840 ms, total: 5.2 s
Wall time: 4.77 s

9.28.1 Why augmentation

Say you have a very limited labelled corpus and you want to add more examples, but labelling is very costly. That is where text augmentation comes in! We provide a few augmentation interfaces in Malaya.

9.28.2 Load Synonym

Use a dictionary of synonyms to replace words with their synonyms. Synonym data from Malaya-Dataset/90k-synonym.

def synonym(string: str,
            threshold: float = 0.5,
            top_n=5,
            cleaning_function: Callable = augmentation_textcleaning,
            **kwargs):
    """
    augmenting a string using synonym, https://github.com/huseinzol05/Malaya-Dataset#90k-synonym

    Parameters
    ----------
    string: str
    threshold: float, optional (default=0.5)
        random selection for a word.
    top_n: int, (default=5)
        number of nearest neighbors returned. Length of returned result should be the same as top_n.
    cleaning_function: function, (default=malaya.text.function.augmentation_textcleaning)
        function to clean text.

    Returns
    -------
    result: List[str]
    """

[2]: string = 'saya suka makan ayam dan ikan'
     text = 'Perdana Menteri berkata, beliau perlu memperoleh maklumat terperinci berhubung isu berkenaan sebelum kerajaan dapat mengambil sebarang tindakan lanjut. Bagaimanapun, beliau yakin masalah itu dapat diselesaikan dan pentadbiran kerajaan boleh berfungsi dengan baik.'



[3]: malaya.augmentation.synonym(string)
[3]: ['saya suka makan ayam dan ikan',
 'saya mencinta makan ayam jantan dan ikan',
 'saya mencinta makan makan ayam jantan dan ikan',
 'saya suka makan makan ayam dan ikan',
 'saya suka makan ayam dan ikan']
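This is exactly how augmentation is typically used to stretch a small labelled set: each augmented variant inherits the original label. A minimal sketch, where the tiny data list is a made-up stand-in for your own (text, label) pairs:

# sketch: expand a labelled dataset with synonym augmentation; `data` is a
# hypothetical stand-in for real (text, label) pairs.
data = [('saya suka makan ayam dan ikan', 'positive')]
augmented = []
for text_, label in data:
    for aug in malaya.augmentation.synonym(text_, top_n=3):
        augmented.append((aug, label))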

[4]: malaya.augmentation.synonym(text)
[4]: ['Perdana menteri menunjukkan beliau perlu memperoleh maklumat Terperinci menghayati isu berkenaan sebelum kerajaan dapat mengawali sebarang nombor lanjut bagaimanapun beliau beramanah sedih itu dapat diselesaikan dan pengurusannya kerajaan berupaya berfungsi dengan baik',
 'Perdana menteri menunjukkan beliau wajib mengusahakan data Terperinci menghayati penerbitan berkenaan di hadapan jajahan menggunakannya mendapatkan sebarang digit tertua bagaimanapun beliau beramanah suram itu dapatkan diselesaikan dan pengurusannya kabinet boleh mengangkut dengan baik',
 'Ulung uskup merupakan beliau wajib mengusahakan data Terperinci mempunyai penerbitan berkenaan di hadapan kerajaan menggunakannya berkumpul sebarang nombor gelap bagaimanapun beliau beramanah daif itu memperoleh diselesaikan dan pengurusannya kerajaan boleh mencari dengan baik',
 'Ulung menteri merupakan beliau wajib memupuk dokumen Terperinci mempunyai pengeluaran berkenaan sebelum kerajaan menangani berkumpul sebarang nombor gelap masih beliau yakin daif itu tiba diselesaikan dan pengurusannya pemerintah boleh mengesani dengan baik',
 'Perdana uskup menunjukkan beliau wajib pelihara dokumen Terperinci mempunyai keluaran berkenaan sebelumnya kerajaan menangani berkumpul sebarang nombor jahat Bagaimana pun beliau yakin daif itu maju diselesaikan dan pengurusannya komandan boleh mengesani dengan baik']

9.28.3 Load Wordvector

A dictionary of synonyms is quite hard to populate and requires domain experts to help us. Alternatively, we can use a wordvector to find the nearest words.

def wordvector(string: str,
               wordvector,
               threshold: float = 0.5,
               top_n: int = 5,
               soft: bool = False,
               cleaning_function: Callable = augmentation_textcleaning):
    """
    augmenting a string using wordvector.

    Parameters
    ----------
    string: str
    wordvector: object
        wordvector interface object.
    threshold: float, optional (default=0.5)
        random selection for a word.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with nearest jarowrinkler ratio.
        if False, it will throw an exception if a word is not in the dictionary.
    top_n: int, (default=5)
        number of nearest neighbors returned. Length of returned result should be the same as top_n.
    cleaning_function: function, (default=malaya.text.function.augmentation_textcleaning)
        function to clean text.

    Returns
    -------
    result: List[str]
    """

[5]: vocab_wiki, embedded_wiki = malaya.wordvector.load(model='wikipedia')
     word_vector_wiki = malaya.wordvector.WordVector(embedded_wiki, vocab_wiki)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/wordvector.py:114: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/wordvector.py:125: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

[6]: malaya.augmentation.wordvector(string, word_vector_wiki, soft=True)
[6]: ['saya suka makan ayam dan ikan',
 'kamu gemar minum ayam serta ayam',
 'anda pandai tidur ayam atau ular',
 'kami senang mandi ayam mahupun keju',
 'aku ingin berehat ayam tetapi lembu']

[7]: malaya.augmentation.wordvector(text, word_vector_wiki, soft=True)
[7]: ['perdana menteri berkata beliau perlu memperoleh maklumat terperinci berhubung isu berkenaan sebelum kerajaan dapat mengambil sebarang tindakan lanjut bagaimanapun beliau yakin masalah itu dapat diselesaikan dan pentadbiran kerajaan boleh berfungsi dengan baik',
 'perdana kementerian menyatakan beliau perlu memperoleh maklumat terperinci berkaitan persoalan berkaitan selepas kerajaan dapat mendapat sebarang tindakan terperinci walaupun dia sedar gangguan itu boleh dibuktikan serta pentadbiran kerajaan dapat dikelaskan dengan baik',
 'perdana setiausaha mengatakan beliau perlu memperoleh maklumat terperinci berhadapan prosedur tertentu setelah kerajaan dapat menghabiskan sebarang tindakan lanjutan namun baginda bimbang kelemahan itu harus dilaksanakan atau pentadbiran kerajaan harus bertindak dengan baik',
 'perdana jabatan mendapati beliau perlu memperoleh maklumat terperinci sejajar artikel tersebut ketika kerajaan dapat mengubah sebarang tindakan ringkas maka mereka menyangka gejala itu perlu dikesan mahupun pentadbiran kerajaan perlu dirujuk dengan baik',
 'perdana duta mencadangkan beliau perlu memperoleh maklumat terperinci bertentangan kontroversi berlainan sejak kerajaan dapat memakan sebarang tindakan positif tetapi saya takut risiko itu mampu diperhatikan tetapi pentadbiran kerajaan akan dikira dengan baik']
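Note the soft parameter documented above: with soft=False, a word missing from the wordvector vocabulary raises an exception instead of being replaced by the nearest Jaro-Winkler match. A quick sketch, with a deliberately misspelled token:

# sketch: soft=False rejects out-of-vocabulary words instead of guessing.
try:
    malaya.augmentation.wordvector('saya suka makan ayamzz', word_vector_wiki, soft=False)
except Exception as e:
    print('out-of-vocabulary word rejected:', e)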

9.28.4 Load Transformer

The problem with wordvector augmentation is that it replaces a word with a near synonym without understanding the whole sentence context, so Transformer comes to the rescue!

def transformer(string: str,
                model,
                threshold: float = 0.5,
                top_p: float = 0.9,
                top_k: int = 100,
                temperature: float = 1.0,
                top_n: int = 5,
                cleaning_function: Callable = None):
    """
    augmenting a string using transformer + nucleus sampling / top-k sampling.

    Parameters
    ----------
    string: str
    model: object
        transformer interface object. Right now only supported BERT, ALBERT and ELECTRA.
    threshold: float, optional (default=0.5)
        random selection for a word.
    top_p: float, optional (default=0.9)
        cumulative sum of probabilities to sample a word.
        If top_n bigger than 0, the model will use nucleus sampling, else top-k sampling.
    top_k: int, optional (default=100)
        k for top-k sampling.
    temperature: float, optional (default=1.0)
        logits * temperature.
    top_n: int, (default=5)
        number of nearest neighbors returned. Length of returned result should be the same as top_n.
    cleaning_function: function, (default=None)
        function to clean text.

    Returns
    -------
    result: List[str]
    """

[8]: electra = malaya.transformer.load(model='electra')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:240: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:114: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:120: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:127: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:129: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt

[10]: malaya.augmentation.transformer(text, electra)
[10]: ['Perdana Menteri berkata , kerajaan sudah memperoleh maklumat terperinci berhubung masalah berkenaan supaya kerajaan dapat mengambil pelbagai tindakan sewajarnya . Bagaimanapun , beliau yakin masalah itu berjaya diselesaikan dan akhirnya terdahulu boleh diselesaikan dengan baik .',
 'Perdana Menteri berkata , kerajaan perlu memperoleh maklumat terperinci berhubung isu berkenaan supaya kerajaan dapat mengambil serius tindakan segera . Bagaimanapun , beliau berharap masalah itu boleh diselesaikan dan akhirnya kementerian boleh diselesaikan dengan baik .',
 'Perdana Menteri berkata , kerajaan telah memperoleh maklumat terperinci berhubung isu berkenaan supaya kerajaan dapat mengambil beberapa tindakan sewajarnya . Bagaimanapun , beliau berharap masalah itu perlu diselesaikan dan siasatan BN boleh diselesaikan dengan baik .',
 'Perdana Menteri berkata , kerajaan akan memperoleh maklumat terperinci berhubung isu berkenaan supaya kerajaan dapat mengambil sebarang tindakan susulan . Bagaimanapun , beliau mengharapkan masalah itu dapat diselesaikan dan membolehkan tidak boleh ditangani dengan baik .',
 'Perdana Menteri berkata , kerajaan sudah memperoleh maklumat terperinci berhubung isu berkenaan supaya kerajaan dapat mengambil sebarang tindakan lanjut . Bagaimanapun , beliau berharap masalah itu dapat diselesaikan dan hanya masih boleh diselesaikan dengan baik .']
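The sampling knobs documented in the signature above control how adventurous the rewrites are; a small sketch of a more conservative setting:

# sketch: lower temperature and a tighter nucleus give more conservative
# rewrites (parameters as documented in the signature above).
malaya.augmentation.transformer(text, electra, temperature=0.8, top_p=0.8, top_n=3)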


9.29 Prefix Generator

Given an initial sentence, the models will continue generating the text.

This tutorial is available as an IPython notebook at Malaya/example/prefix-generator.

[1]: %%time

     import malaya
     from pprint import pprint
CPU times: user 4.74 s, sys: 901 ms, total: 5.64 s
Wall time: 8.03 s

9.29.1 Load GPT2

Malaya provides a pretrained GPT2 model specific to Malay; we call it GPT2-Bahasa. This interface does not allow custom training. GPT2-Bahasa was pretrained on ~0.9 billion words, from the following datasets:

1. dumping wikipedia (222MB).
2. local news (257MB).
3. local parliament text (45MB).
4. IIUM Confession (74MB).
5. Wattpad (74MB).
6. Academia PDF (42MB).
7. Common-Crawl (3GB).

If you want to download the pretrained GPT2-Bahasa model and use it for custom transfer learning, you can get it at https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/gpt2, together with some notebooks to help you get started. We hope these models are not used to finetune for spreading fake news.


load model

GPT2-Bahasa is only available as 117M and 345M models: 1. 117M is around 442MB. 2. 345M is around 1.2GB.

def gpt2(model: str = '345M',
         generate_length: int = 256,
         temperature: float = 1.0,
         top_k: int = 40,
         **kwargs):
    """
    Load GPT2 model to generate a string given a prefix string.

    Parameters
    ----------
    model : str, optional (default='345M')
        Model architecture supported. Allowed values:

        * ``'117M'`` - GPT2 117M parameters.
        * ``'345M'`` - GPT2 345M parameters.

    generate_length : int, optional (default=256)
        length of sentence to generate.
    temperature : float, optional (default=1.0)
        temperature value, value should between 0 and 1.
    top_k : int, optional (default=40)
        top-k in nucleus sampling selection.

    Returns
    -------
    result: malaya.transformers.gpt2.Model class
    """
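As a small sketch of the knobs above (nothing here beyond the documented parameters), a shorter and less random generator could be loaded like this:

# sketch: shorter, more conservative generation, using the parameters
# documented in the signature above.
model_conservative = malaya.generator.gpt2(model='117M', generate_length=128,
                                           temperature=0.7, top_k=20)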

[3]: model = malaya.generator.gpt2(model='117M')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/gpt2/__init__.py:19: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/gpt2/__init__.py:140: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/gpt2/__init__.py:141: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/gpt2/__init__.py:142: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/gpt2/__init__.py:142: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/gpt2/117M/gpt2-bahasa-117M/model.ckpt

[5]: string = 'ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram,'

generate

model.generate accepts a string.

def generate(self, string: str):
    """
    generate a text given an initial string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: str
    """

[4]: print(model.generate(string))
ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, ara aku yang lain keluar, aku pandang cerita tapi tak ingat, aku takut dan bimbang aku terpaksa marah kerana hati aku yang berada di sekeliling aku tadi tak putus-putus.
Dalam diam, aku juga merasa kagum dan terharu bila aku bangun pagi untuk bangun dan tengok kisah seram ni, masa tu aku terus pandang, bila aku berada dalam bilik yang indah, aku tahu tentang benda yang nak diperkatakan.
“Tu sikit, dengan banyak masa aku nak keluar dan keluar aku dah mula bangun pagi, aku nak keluar lagi, lepas tu nanti terus masuk ke bilik sambil nampak benda yang tak ada yang nak diperkatakan.
Tak tau cerita tu macam benda yang boleh aku buat kalau rasa macam cerita.
Sampai di bilik, aku pun rasa macam, benda yang nak diperkatakan tu bukan benda yang perlu aku buat.
Macam tak percaya apa yang aku buat ni?
Mungkin benda yang nak diperkatakan itu boleh buat aku jugak, cuma benda yang boleh bagi aku kata tak logik atau memang betul.
Cuma yang paling aku nak cakap ni adalah benda pelik yang aku fikir nak nampak yang tak boleh dan kalau tak logik pun tak patut.
So, apa kata dorang mainkan benda yang aku cakap ni.
Rasa pelik dan amat pelik kan?
Macam nak buat orang lain jadi macam benda pelik dan susah sangat nak buat

[6]: model = malaya.generator.gpt2(model='345M')
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/gpt2/345M/gpt2-bahasa-345M/model.ckpt


[7]: string = 'ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram,'
     print(model.generate(string))
ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, omputeh-uteh cerita lama-lama, seram tak boleh bayang
Sebelum kejadian, dalam 2 jam aku buat panggilan polis , lepas tu kira la sendiri nak ke lokasi. Tengok cerita lama..
Sekarang ni, apa yang aku lalui, kita yang jaga diri, kita yang jaga kesihatan dan juga kita yang jaga minda dalam hidup. Maka, inilah jalan penyelesaian terbaiknya.
Jangan lupakan manusia
Orang yang paling ditakuti untuk berjaya dalam hidup, tidak akan jumpa yang tersayang!
Jangan rosakkan masa depannya, ingatlah apa yang kita nak buat, walaupun pahit untuk ditelan.
Jangan lupakan orang lain - masa depan mereka.
Jangan lupakan orang - masa itulah kita yang lebih dicintai.
Jangan lupakan orang - orang yang kita sayang, mereka bukan orang yang tersayang!
Jangan lupakan orang - orang yang kita cinta, mereka cinta pada kita.
Jangan lupakan diri - diri kita - yang kita punya, yang kita tinggal adalah masa lalu kita.
Jangan lupakan orang lain - orang yang kita cinta, lebih indah dari masa lalu kita.
Jangan lupakan semua orang - orang yang tinggal ataupun hidup.
Jangan cuba lupakan diri kita - kerja keras dan selalu ada masa depan kita.
Jangan pernah putus rasa - kecewa kerana kita telah banyak berubah.
Jangan pernah putus putus asa kerana kita

9.29.2 Using Babble method

We can also generate text the GPT2 way using Transformer-Bahasa. Right now only BERT, ALBERT and ELECTRA are supported.

def babble(string: str,
           model,
           generate_length: int = 30,
           leed_out_len: int = 1,
           temperature: float = 1.0,
           top_k: int = 100,
           burnin: int = 15,
           batch_size: int = 5):
    """
    Use pretrained transformer models to generate a string given a prefix string.
    https://github.com/nyu-dl/bert-gen, https://arxiv.org/abs/1902.04094

    Parameters
    ----------
    string: str
    model: object
        transformer interface object. Right now only supported BERT, ALBERT.
    generate_length : int, optional (default=30)
        length of sentence to generate.
    leed_out_len : int, optional (default=1)
        length of extra masks for each iteration.
    temperature: float, optional (default=1.0)
        logits * temperature.
    top_k: int, optional (default=100)
        k for top-k sampling.
    burnin: int, optional (default=15)
        for the first burnin steps, sample from the entire next word distribution, instead of top_k.
    batch_size: int, optional (default=5)
        generate sentences size of batch_size.

    Returns
    -------
    result: List[str]
    """

Make sure you have already installed tensorflow-probability:

pip3 install tensorflow-probability==0.7.0

[10]: # !pip3 install tensorflow-probability==0.7.0

[3]: electra = malaya.transformer.load(model='electra')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:56: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:242: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:115: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:119: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:122: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:128: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:130: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt

[11]: malaya.generator.babble(string, electra)
[11]: ['ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , terseksa juga hidup di sekeliling aku . Diorang tak tahu sebab diorang tahu titik hitam yang mana kita tengok dari mana kita sendiri nampak cerita ke . Haih .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , tengah baca benda besar pasal bumbung bilik . Rasanya sejuk macam pulau harapan . So aku baca cerita seram pelik . Jadi sedih juga dengar cerita seram seram ni .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , lalu ibu ambil pusing bagi buku sejarah . Dah baca marsh pastu aku dah buat thread seram , ada dalam masa terdekat baru bangun . Sedih , hidup lagi',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , mesti seram sampai aku ikut takdir Allah bagi betul2 aib kita kembali menulis mengenai kisah cinta aku ini malam , aku tersedar selepas ada seorang lelaki tersedar .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , sedangkan yang baca pasal negara berpagarism memang patut berterima kasih . Kata ayah , ingatkan boleh mandi atau bilik pun boleh , kena air dan bukannya ikut kemampuan']
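The sampler can be tuned through the parameters documented in the signature above; a small sketch that asks for fewer but longer continuations:

# sketch: fewer, longer babble samples, using the parameters documented
# in the signature above.
malaya.generator.babble(string, electra, generate_length=50, batch_size=2, top_k=50)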


9.29.3 ngrams

You can generate ngrams pretty easily using this interface:

def ngrams(sequence,
           n: int,
           pad_left=False,
           pad_right=False,
           left_pad_symbol=None,
           right_pad_symbol=None):
    """
    generate ngrams.

    Parameters
    ----------
    sequence : List[str]
        list of tokenize words.
    n : int
        ngram size

    Returns
    -------
    ngram: list
    """

[6]: string = 'saya suka makan ayam'

     list(malaya.generator.ngrams(string.split(), n=2))
[6]: [('saya', 'suka'), ('suka', 'makan'), ('makan', 'ayam')]

[7]: list(malaya.generator.ngrams(string.split(), n=2, pad_left=True, pad_right=True))
[7]: [(None, 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', None)]

[8]: list(malaya.generator.ngrams(string.split(), n=2, pad_left=True, pad_right=True,
                                  left_pad_symbol='START'))
[8]: [('START', 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', None)]

[8]: list(malaya.generator.ngrams(string.split(), n=2, pad_left=True, pad_right=True,
                                  left_pad_symbol='START', right_pad_symbol='END'))
[8]: [('START', 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', 'END')]
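A typical use is counting n-gram frequencies over a corpus; a minimal sketch, where the two-line corpus is a hypothetical stand-in for real tokenized text:

# sketch: bigram frequency count with the ngrams helper above; `corpus` is a
# made-up stand-in for your own data.
from collections import Counter

corpus = ['saya suka makan ayam', 'saya suka makan ikan']
bigram_counts = Counter(
    bg for line in corpus for bg in malaya.generator.ngrams(line.split(), n=2)
)
print(bigram_counts.most_common(3))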


9.30 Isi Penting Generator

Generate a long text given isi penting (important facts).

This tutorial is available as an IPython notebook at Malaya/example/isi-penting-generator.

[1]: %%time
     import malaya
     from pprint import pprint
CPU times: user 5.64 s, sys: 1.19 s, total: 6.83 s
Wall time: 7.82 s

9.30.1 List available Transformer

This module was trained heavily on news structure.

[2]: malaya.generator.available_transformer()
[2]:           Size (MB)  Quantized Size (MB)
     t5           1250.0                481.0
     small-t5      355.6                195.0

9.30.2 Load Transformer

Transformer Generator in Malaya is quite unique. Most of the text generative models found on the internet, like GPT2 or Markov chains, simply continue the prefix input from the user, but not Transformer Generator. We want the model to generate an article or karangan (essay), like high school students write, when the user gives isi penting.

def transformer(model: str = 't5', quantized: bool = False, **kwargs):
    """
    Load Transformer model to generate a string given an isi penting.

    Parameters
    ----------
    model : str, optional (default='t5')
        Model architecture supported. Allowed values:

        * ``'t5'`` - T5 BASE parameters.
        * ``'small-t5'`` - T5 SMALL parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster, it totally depends on the machine.

    Returns
    -------
    result: malaya.model.t5.Generator class
    """


[3]: model = malaya.generator.transformer(model='t5', quantized=True)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/generator.py:510: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/generator.py:512: load (from tensorflow.python.saved_model.loader_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.loader.load or tf.compat.v1.saved_model.load. There will be a new function for importing SavedModels in Tensorflow 2.0.
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/generator-sample/t5/base/model/variables/variables

[4]: isi_penting = ['Dr M perlu dikekalkan sebagai perdana menteri',
                    'Muhyiddin perlulah menolong Dr M',
                    'rakyat perlu menolong Muhyiddin']

I just want to test the model given this isi penting; we all know that Dr M and Muhyiddin do not support each other in the real world.

generate

def greedy_decoder(self, strings: List[str]):
    """
    generate a long text given an isi penting.
    The decoder is a greedy decoder with beam width 1 and alpha 0.5.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: str
    """

[5]: pprint(model.greedy_decoder(isi_penting))
(': Presiden Bersatu, Tan Sri Muhyiddin Yassin perlu mengekalkan Tun Dr '
 'Mahathir Mohamad sebagai perdana menteri berbanding Datuk Seri Anwar Ibrahim '
 'yang hanya minta bantuan untuk menyelesaikan kemelut kedudukan '
 'negara.Muhyiddin berkata, ini kerana semua pihak tahu masalah yang dihadapi '
 'oleh Perdana Menteri adalah di luar bidang kuasa beliau sendiri.Katanya, '
 'Muhyiddin perlu membantu beliau kerana beliau percaya rakyat Malaysia tahu '
 'apa yang berlaku di luar bidang kuasa beliau."Apa yang berlaku di luar '
 'bidang kuasa Dr Mahathir... semua tahu bahawa ini berlaku di bawah '
 'kepimpinan Anwar."Muhyiddin dan seluruh rakyat yang tahu apa yang berlaku di '
 'Johor."Ini kerana di Johor ini, majoriti menteri-menteri dalam Pakatan '
 'Harapan banyak sangat ketua-ketua parti."Jadi Muhyiddin perlu bantu Dr '
 'Mahathir sebab rakyat tahu apa yang berlaku di Johor Bahru," katanya dalam '
 'satu kenyataan di sini, pada Jumaat.Dalam pada itu, Muhyiddin berkata, '
 'rakyat juga perlu menolong Muhyiddin untuk menyelesaikan masalah yang '
 'melanda negara ketika ini.Menurutnya, Muhyiddin perlu menggalas tugas dengan '
 'baik dan memastikan keadaan negara berada dalam keadaan baik.')

Pretty good!

[6]: isi_penting = ['Neelofa tetap dengan keputusan untuk berkahwin akhir tahun ini',
                    'Long Tiger sanggup membantu Neelofa',
                    'Tiba-tiba Long Tiger bergaduh dengan Husein']

We can also give any isi penting, even ones that do not make any sense.

[7]: pprint(model.greedy_decoder(isi_penting)) ('Kuala Lumpur: Pelakon, Neelofa tetap dengan keputusan dibuat untuk berkahwin ' 'penutup tahun ini, selepas mengadakan pertemuan dengan Long Tiger. Neelofa ' 'atau nama sebenarnya, Mohd Neelofa Ahmad Noor berkata, dia tidak pernah ' 'merancang untuk berkahwin, namun menegaskan dirinya lebih mengutamakan masa ' 'depan. "Saya seronok bersama keluarga. Kalau kami berkahwin awal tahun ini, ' 'ia mengambil masa yang lama. Itu impian saya tetapi biarlah, selepas setahun ' 'saya berehat, saya akan mula bekerja. "Jadi, apabila sering sesi pertemuan ' 'dengan Long Tiger, saya kena tegas mengenai perkara ini. Bukan soal nak ' 'memalukan diri sendiri tetapi siapa yang boleh menghentam saya," katanya ' 'kepada Bh Online. Dalam sesi pertemuan itu, Neelofa yang juga pengacara ' 'acara Top 5, bergaduh dengan Husein, dalam pergaduhan yang berlaku di ' 'Kompleks Mahkamah Tinggi Syariah di sini, baru-baru ini. Ditanya mengenai ' 'hubungannya dengan wanita itu, Neelofa berkata, mereka masih belum ' 'menyelesaikan perkara itu dengan baik. "Saya tidak tahu pasal semua ini, ' 'tetapi ia akan diselesaikan menerusi cara baik. Tidak kiralah apa yang kami ' 'tidak cakap pun. "Pada mulanya kami hanya mahu membebaskan mereka daripada ' 'sebarang isu, namun selepas beberapa hari bergaduh, kami akhirnya mengambil ' 'keputusan untuk berkahwin dengan Hadiza Aziz. "Jika mereka mahu, kami akan ' 'membendung, namun pada masa yang sama, kami tidak mahu bergaduh dengan ' 'lelaki yang digelar Long Tiger," katanya.')

How about a karangan, like in high school?

[8]: # http://mieadham86.blogspot.com/2016/09/isi-isi-penting-karangan-bahasa-melayu.html
     # KEBAIKAN AMALAN BERGOTONG-ROYONG

     isi_penting = ['Dapat memupuk semangat kerjasama',
                    'Dapat mengeratkan hubungan silaturahim.',
                    'Kebersihan kawasan persekitaran terpelihara.',
                    'Terhindar daripada wabak penyakit seperti Denggi',
                    'Mengisi masa lapang',
                    'Menerapkan nilai-nilai murni dalam kehidupan']

[10]: pprint(model.greedy_decoder(isi_penting)) ('Dewasa ini, kes-kes seumpama denggi semakin menular di kalangan masyarakat. ' 'Justeru, individu yang bertanggungjawab dan berkesan perlu memainkan peranan ' 'penting dalam memastikan persekitaran dalam komuniti terjamin. Persis kata ' 'peribahasa Melayu, melentur buluh biarlah dari rebungnya. Oleh itu, tindakan ' 'yang wajar perlu diambil terutamanya jika kita mengamalkan sikap-sikap di ' 'dalam komuniti supaya kehidupan kita tidak terjejas. Oleh itu, kita perlu ' 'mengamalkan sikap bekerjasama dengan masyarakat dalam memastikan ' 'persekitaran kita selamat. Jika kita sehati, sikap bekerjasama dapat dipupuk ' 'dan dibudayakan dalam masyarakat. Maka, amalan ini secara tidak langsung ' (continues on next page)


(continued from previous page) 'mampu membantu kita supaya tidak hidup lebih sejahtera. Pada masa yang sama, ' 'ia juga dapat mengelakkan berlakunya sebarang masalah kesihatan dan ' 'seterusnya membantu yang mungkin akan berlaku pada masa akan datang. ' 'Masyarakat yang prihatin perlu meluahkan perasaan dan menitik beratkan soal ' 'kebersihan kawasan persekitaran. Bak kata peribahasa Melayu, mencegah lebih ' 'baik daripada merawat. Tamsilnya, pihak kerajaan perlu menjalankan usaha ' 'yang bersungguh-sungguh sebagai tanggungjawab yang diamanahkan. Selain itu, ' 'sikap masyarakat yang mengambil berat tentang kebersihan kawasan ' 'persekitaran dapat membantu mengurangkan masalah kesihatan yang kian ' 'menular. Secara tidak langsung, masyarakat awam akan melahirkan masyarakat ' 'yang peka dan menghargai keberadaan anggota masyarakat di sekeliling mereka. ' 'Bagi memastikan kebersihan kawasan persekitaran terjamin, kita perlu ' 'memastikan komuniti yang berada ditaarapkan dalam keadaan bersih dan terurus ' 'agar keselamatan masyarakat terjamin. Para pekerja dan ahli peniaga perlu ' 'memastikan kebersihan kawasan mereka dijaga dengan baik. Hal ini kerana, ' 'kita akan berhadapan dengan pelbagai masalah kesihatan yang mengakibatkan ' 'Malaysia menjadi negara ketiga yang paling teruk terkena jangkitan demam ' 'denggi pada tahun lepas. Sekiranya kita mempraktikkan amalan berkenaan, kita ' 'akan berhadapan dengan bahaya. Sekiranya aktiviti ini diteruskan, kita akan ' 'terencat daripada jumlah kes penyakit yang menyerang. Secara tidak langsung, ' 'kita akan dapat membendung penularan wabak penyakit di kalangan masyarakat. ' 'Sebagai contoh, wabak denggi di Malaysia berkemungkinan boleh menularkan ' 'jangkitan kepada penduduk di negeri-negeri yang lain. Oleh itu, langkah ini ' 'wajar dan mempunyai sistem pengurusan kebersihan yang terbaik bagi ' 'membolehkan jumlah pesakit yang dirawat di hospital meningkat. Kesannya, ia ' 'dapat membantu kita untuk mengamalkan kaedah yang betul dan matang dalam ' 'kehidupan. Selain itu, sekiranya kita mengamalkan sikap kerja, kita akan ' 'sentiasa berusaha supaya kita terhindar daripada wabak penyakit yang ' 'menyerang penduduk di sekeliling kita. Bak kata peribahasa Melayu, mencegah ' 'lebih baik daripada merawat. Semua pihak perlu berganding bahu bagai aur ' 'dengan tebing untuk menjaga kesihatan dan keselamatan para pekerja dalam ' 'kawasan yang sangat rentan. Kebersihan kawasan persekitaran merupakan elemen ' 'yang penting dalam memastikan persekitaran kita selamat daripada jangkitan ' 'wabak seperti denggi. Kita tentunya tidak mahu ada tempat yang kotor dan ' 'busuk namun kita tidak boleh berbuat demikian kerana ia merupakan elemen ' 'yang tidak boleh dijual beli. Oleh itu, jika kita mengamalkan sikap kerja ' "yang 'membersihkan', kita akan menjadi lebih baik dan selamat daripada wabak " 'penyakit seperti denggi. Jika kita mengamalkan sikap ini, kita akan menjadi ' 'lebih baik dan selamat daripada ancaman penyakit-penyakit yang berbahaya. ' 'Tidak kira apabila kita sudah terbiasa dengan amalan ini, sudah pasti ' 'keselamatan kita akan terjamin. Selain itu, kita perlulah dirikan amalan ' 'seperti rajin mencuci tangan menggunakan sabun atau segala benda lain kerana ' 'kita juga mempunyai tempat yang sesuai untuk membasuh tangan dengan baik. ' 'Perkara ini boleh menjadi perubahan kepada amalan kita dalam kehidupan ' 'apabila kita berusaha untuk membersihkan kawasan yang telah dikenal pasti. 
' 'Secara tidak langsung, kita dapat bertukar-tukar fikiran dan mengamalkan ' 'nilai-nilai murni dalam kehidupan. Hal ini demikian kerana, kita antara ' 'mereka yang merancang untuk melakukan sesuatu bagi mengelakkan berlakunya ' 'kemalangan. Hakikatnya, amalan membasuh tangan menggunakan sabun atau benda ' 'lain adalah berniat buruk kerana akan dapat mengganggu kelancaran proses ' 'pemanduan terutamanya apabila tidur. Kesannya, kita akan mewujudkan ' 'masyarakat yang bertimbang rasa dan bergantung kepada orang lain untuk ' 'melakukan kerja mereka walaupun di mana mereka berada. Selain itu, kita ' 'dapat mengamalkan cara yang betul dalam memastikan kebersihan kawasan ' 'persekitaran adalah terjamin. Kita tidak boleh menyembunyikan diri daripada ' 'pengetahuan umum seperti di tempat awam seperti tempat letak kereta yang ' 'sering digunakan oleh orang ramai. Jika kita menggunakan tandas awam dan ' (continues on next page)


(continued from previous page) 'menggunakan botol air untuk membersihkan kawasan berkenaan, kita akan mudah ' 'terdedah dengan wabak penyakit yang membahayakan kesihatan. Selain itu, kita ' 'juga perlu sentiasa berjaga-jaga dengan memakai penutup mulut dan hidung ' 'jika ada demam. Jika kita tidak mengamalkan kebersihan, besar kemungkinan ia ' 'boleh mengundang kepada penularan wabak penyakit. Bak kata peribahasa ' 'Melayu, mencegah lebih baik daripada merawat. Jika kita membuat keputusan ' 'untuk menutup mulut atau hidung dengan pakaian yang bersih dan bijak, kita ' 'akan menjadi lebih baik daripada menyelamatkan diri sendiri daripada ' 'jangkitan penyakit. Andai kata, pengamal media dapat menggunakan telefon ' 'pintar ketika membuat liputan di media massa, proses ini akan membuatkan ' 'kehidupan mereka lebih mudah dan sukar. Selain itu, proses nyah kuman juga ' 'dapat memastikan kebersihan di kawasan rumah kita terjamin. Contohnya, semua ' 'stesen minyak dan restoran makanan segera perlu memakai penutup mulut dan ' 'hidung secara betul agar penularan wabak penyakit dapat dihentikan. Penonton ' 'yang berada di dalam juga wajar digalakkan untuk menggunakan penutup mulut ' 'dan hidung agar mudah terkena jangkitan kuman. Selain itu, pengisian masa ' 'lapang yang terdapat di kawasan tempat awam dapat mendidik masyarakat untuk ' 'mengamalkan nilai-nilai murni seperti rajin mencuci tangan menggunakan sabun ' 'dan air supaya tidak terdedah kepada virus denggi. Walaupun kita mempunyai ' 'ramai kenalan yang ramai tetapi tidak dapat mengamalkannya kerana kita perlu ' 'adalah rakan yang sedar dan memahami tugas masing-masing. Pelbagai cara yang ' 'boleh kita lakukan bagi memastikan hospital atau klinik-klinik kerajaan ' 'menjadi')

[11]: # http://mieadham86.blogspot.com/2016/09/isi-isi-penting-karangan-bahasa-melayu.html
      # CARA MENJADI MURID CEMERLANG

      isi_penting = ['Rajin berusaha - tidak mudah putus asa',
                     'Menghormati orang yang lebih tua - mendapat keberkatan',
                     'Melibatkan diri secara aktif dalam bidang kokurikulum',
                     'Memberi tumpuan ketika guru mengajar.',
                     'Berdisiplin - menepati jadual yang disediakan.',
                     'Bercita-cita tinggi - mempunyai keazaman yang tinggi untuk berjaya']

[12]: pprint(model.greedy_decoder(isi_penting)) ('Sejak akhir-akhir ini, pelbagai isu yang hangat diperkatakan oleh masyarakat ' 'yang berkait dengan sambutan Hari Raya Aidilfitri. Pelbagai faktor yang ' 'melatari perkara yang berlaku dalam kalangan masyarakat hari ini, khususnya ' 'bagi golongan muda. Dikatakan bahawa kehidupan kita hari ini semakin ' 'mencabar terutamanya kesibukan dalam menjalankan tugas dan mengajar. ' 'Justeru, tidak dinafikan apabila semakin jauh kita, semakin ramai yang ' 'memilih untuk lalai atau tidak mematuhi arahan yang telah ditetapkan. ' 'Mendepani cabaran ini, golongan muda terpaksa menempuhi segala cabaran untuk ' 'menjadi lebih baik dan lebih baik. Minda yang perlu diterapkan, terutama di ' 'dalam kelas untuk mempelajari ilmu pengetahuan. Jika tidak, kita akan ' 'menjadi lebih mudah untuk menilai dan menyelesaikan masalah yang dihadapi. ' 'Oleh itu, kita perlu berfikir untuk menetapkan langkah yang patut atau perlu ' 'dilaksanakan bagi mengatasi masalah yang berlaku. Selain itu, guru-guru juga ' 'harus mendidik peserta-peserta dalam kelas supaya dapat menjalankan kegiatan ' 'dengan lebih serius dan berkesan. Guru-Guru juga seharusnya berusaha untuk ' 'meningkatkan kemahiran mereka dalam kalangan pelajar. Seperti peribahasa ' 'Melayu, melentur buluh biarlah dari rebungnya. Setiap insan mempunyai ' 'peranan masing-masing dan tanggungjawab yang masing-masing. Kesempatan untuk ' 'memberikan nasihat dan teguran adalah lebih penting dan membantu secara ' 'halus dan bijaksana dalam melakukan sesuatu. Selain itu, guru-guru hendaklah ' (continues on next page)


(continued from previous page) 'berani untuk melakukan sesuatu perkara yang memberi manfaat kepada para ' 'pelajar yang lain. Cara ini adalah dengan melakukan aktiviti-aktiviti yang ' 'boleh memberi manfaat kepada para pelajar. Selain itu, guru-guru juga ' 'perlulah menjaga disiplin mereka dengan sebaik-baiknya. Dalam menyampaikan ' 'nasihat dan teguran secara berterusan, pelajar juga boleh melakukan perkara ' 'yang boleh mendatangkan mudarat. Anak-Anak awal pelajar dan rakan-rakan ' 'mereka juga boleh melakukan tugas yang bermanfaat. Keadaan ini membolehkan ' 'mereka untuk lebih berusaha dan memberikan nasihat yang berguna kepada kaum ' 'lain. Oleh itu, mereka perlu sentiasa mengingati dan mendidik pelajar dengan ' 'nilai-nilai yang murni. Setiap orang mempunyai impian yang tinggi untuk ' 'berjaya. Sama ada kita berjaya atau tidak, pencapaian yang diperoleh setelah ' 'tamat belajar akan memberikan kita nilai yang baik dan perlu menjadi contoh ' 'yang baik untuk negara kita.')

9.31 Lexicon Generator

This tutorial is available as an IPython notebook at Malaya/example/lexicon.

[1]: %%time
     import malaya
     import numpy as np
CPU times: user 4.47 s, sys: 1.01 s, total: 5.48 s
Wall time: 5.37 s

9.31.1 Why lexicon

A lexicon is a set of words related to certain domains, for example, words for negative and positive sentiments. The word suka can represent positive sentiment: if suka exists in a sentence, we can say that sentence has positive sentiment. Lexicon-based classification is a common and very fast way people classify a text. It is also pretty naive, because a word can be semantically ambiguous.
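As a concrete illustration of the idea above, a naive lexicon-based classifier can be written in a few lines; lexicon_classify and tiny_lexicon below are hypothetical names for illustration only, not part of Malaya,

def lexicon_classify(text, lexicon):
    # count how many words from each label's word list appear in the text,
    # then pick the label with the highest count
    tokens = text.lower().split()
    scores = {label: sum(token in words for token in tokens)
              for label, words in lexicon.items()}
    return max(scores, key=scores.get)

tiny_lexicon = {'positive': {'suka', 'bagus'}, 'negative': {'benci', 'buruk'}}
print(lexicon_classify('saya suka makan ayam', tiny_lexicon))  # positive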

9.31.2 sentiment lexicon

Malaya provides a small sample sentiment lexicon, simply,

[6]: sentiment_lexicon = malaya.lexicon.sentiment
     sentiment_lexicon.keys()
[6]: dict_keys(['negative', 'positive'])


9.31.3 emotion lexicon

Malaya provides a small sample emotion lexicon, simply,

[3]: emotion_lexicon = malaya.lexicon.emotion
     emotion_lexicon.keys()
[3]: dict_keys(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'])

9.31.4 Lexicon generator

Building a lexicon is time consuming, because it requires domain experts to populate words related to the domains. With the help of a word vector, we can induce words for specific domains given a small annotated lexicon.

Why induce a lexicon from a word vector? Even though a word like suka commonly represents positive sentiment, if the word vector learnt suka in contexts of a different polarity, and its nearest words also represent a different polarity, then suka has a tendency to become negative sentiment.

Malaya provides a lexicon-inducing interface, built on top of Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora. Say you have a lexicon based on the standard language, or bahasa baku, and you want to find a similar lexicon in a social media context; you can use this malaya.lexicon interface. To use this interface, we must initiate malaya.wordvector.load first, and provide at least a small lexicon sample like this,

{'label1': ['word1', 'word2'], 'label2': ['word3', 'word4']}

There can be more than 2 labels; for example, malaya.lexicon.emotion has up to 6 different labels.
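For instance, a custom two-label seed lexicon in standard Malay could look like this; the words are illustrative only,

# hypothetical seed lexicon; any label names and any number of labels work
custom_lexicon = {
    'positive': ['suka', 'gembira', 'bagus', 'cantik'],
    'negative': ['benci', 'sedih', 'buruk', 'marah'],
}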

[5]: vocab, embedded = malaya.wordvector.load(model='socialmedia')
     wordvector = malaya.wordvector.WordVector(embedded, vocab)

9.31.5 random walk

Random walk is the main technique used by the paper; you can read more at section 3.2, Propagating polarities from a seed set.
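Conceptually, polarity mass spreads from the seed words over a word-similarity graph, with a restart probability controlled by beta. The following numpy sketch illustrates that propagation idea under simplified assumptions; it is not Malaya's implementation, and random_walk_scores is a hypothetical name,

import numpy as np

def random_walk_scores(similarity, seed, beta=0.9, iterations=100):
    # row-normalise the similarity matrix into a transition matrix
    transition = similarity / similarity.sum(axis=1, keepdims=True)
    scores = seed.copy()
    for _ in range(iterations):
        # with probability beta walk to a neighbour, otherwise restart at the seeds
        scores = beta * transition.T @ scores + (1 - beta) * seed
    return scores

# 3 words; word 0 is the only positive seed, and words 0 and 1 are similar
similarity = np.array([[1.0, 0.8, 0.1],
                       [0.8, 1.0, 0.1],
                       [0.1, 0.1, 1.0]])
seed = np.array([1.0, 0.0, 0.0])
print(random_walk_scores(similarity, seed))  # word 1 ends up with more positive mass than word 2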

def random_walk(
    lexicon,
    wordvector,
    pool_size=10,
    top_n=20,
    similarity_power=100.0,
    beta=0.9,
    arccos=True,
    normalization=True,
    soft=False,
    silent=False,
):
    """
    Induce lexicon by using the random walk technique, used in the paper
    https://arxiv.org/pdf/1606.02820.pdf

    Parameters
    ----------
    lexicon: dict
        curated lexicon from expert domain, {'label1': [str], 'label2': [str]}.
    wordvector: object
        wordvector interface object.
    pool_size: int, optional (default=10)
        pick top-pool size from each lexicon.
    top_n: int, optional (default=20)
        top_n for each vector will be multiplied with `similarity_power`.
    similarity_power: float, optional (default=100.0)
        extra score for `top_n`; less will generate less induced bias, but with a higher chance of an unbalanced outcome.
    beta: float, optional (default=0.9)
        penalty score; towards 1.0 means less penalty. 0 < beta < 1.
    arccos: bool, optional (default=True)
        covariance distribution for embedded.dot(embedded.T). If false, covariance + 1.
    normalization: bool, optional (default=True)
        normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with the nearest Jaro-Winkler ratio.
        if False, it will throw an exception if a word is not in the dictionary.
    silent: bool, optional (default=False)
        if True, will not print any logs.

    Returns
    -------
    tuple: (labels[argmax(scores), axis = 1], scores, labels)
    """

[5]: %%time

     results, scores, labels = malaya.lexicon.random_walk(sentiment_lexicon, wordvector, pool_size=5)
populating nearest words from wordvector
populating vectors from populated nearest words
random walking from populated vectors
CPU times: user 1min 36s, sys: 16.1 s, total: 1min 52s
Wall time: 28.1 s

[6]: np.unique(list(results.values()), return_counts=True)
[6]: (array(['negative', 'positive'], dtype='

[7]: results [7]: {'serang': 'negative', 'cilegon': 'positive', 'culik': 'negative', 'tanjungpinang': 'positive', 'jenguk': 'negative', (continues on next page)


(continued from previous page) 'luka': 'negative', 'jerawat': 'negative', 'infeksi': 'negative', 'migrain': 'negative', 'penyakit': 'negative', 'penaklukan': 'negative', '4ir': 'positive', 'renjer': 'positive', 'kezhaliman': 'positive', 'proklamator': 'positive', 'kelucahan': 'negative', 'pablisiti': 'positive', 'terjwp': 'positive', '33100': 'positive', 'impos': 'positive', 'kritikan': 'negative', 'mandat': 'negative', 'teguran': 'negative', 'persepsi': 'negative', 'pembelaan': 'negative', 'muflis': 'negative', 'mempelajarinya': 'negative', 'melarat': 'positive', 'dihabisi': 'positive', 'kooperatif': 'positive', 'kelemahan': 'negative', 'keyakinan': 'positive', 'kehendak': 'negative', 'keburukan': 'negative', 'gerombolan': 'negative', 'kelakuan': 'negative', 'antek': 'negative', 'politikus': 'negative', 'ulah': 'negative', 'debu': 'negative', 'kotoran': 'negative', 'polusi': 'negative', 'kuman': 'negative', 'keringat': 'negative', 'sinis': 'negative', 'misterius': 'positive', 'menggemaskan': 'positive', 'emosional': 'negative', 'progresif': 'positive', 'bocor': 'negative', 'pecah': 'negative', 'retak': 'negative', 'rosak': 'negative', 'terbalik': 'negative', 'kekacauan': 'negative', 'penindasan': 'negative', 'perdebatan': 'negative', 'kesombongan': 'negative', 'pengamatan': 'negative', 'permusuhan': 'negative', 'ketidakadilan': 'negative', 'empati': 'negative', (continues on next page)


(continued from previous page) 'perpecahan': 'negative', 'menghasut': 'negative', 'menghukum': 'negative', 'memfitnah': 'negative', 'memaki': 'negative', 'memprovokasi': 'negative', 'bersedih': 'negative', 'mengalah': 'negative', 'terlena': 'negative', 'cemburu': 'negative', 'dikenang': 'negative', 'jatuh': 'negative', 'terjatuh': 'negative', 'putus': 'negative', 'hilang': 'negative', 'hancur': 'negative', 'dipakai': 'negative', 'digunakan': 'negative', 'dikonsumsi': 'negative', 'dipake': 'negative', 'diminum': 'negative', 'harapan': 'negative', 'kebahagiaan': 'positive', 'impian': 'positive', 'cita2': 'negative', 'senyuman': 'positive', 'beban': 'negative', 'resiko': 'negative', 'kerugian': 'negative', 'tekanan': 'negative', 'risiko': 'negative', 'mencaci': 'negative', 'dicaci': 'negative', 'mengejek': 'negative', 'disia': 'negative', 'bengkak': 'negative', 'berair': 'negative', 'lebam': 'negative', 'lenguh': 'negative', 'toksik': 'negative', 'toksin': 'negative', 'pepejal': 'positive', 'kafein': 'negative', 'buih': 'negative', 'terperangkap': 'negative', 'dijumpai': 'negative', 'tersimpan': 'negative', 'tergabung': 'negative', 'bertarung': 'negative', 'rahsia': 'negative', 'cabaran': 'positive', 'petua': 'negative', 'persamaan': 'negative', 'punca': 'negative', 'fail': 'negative', 'failed': 'negative', 'approve': 'negative', (continues on next page)


(continued from previous page) 'consider': 'negative', 'freehair': 'negative', 'munafik': 'negative', 'dungu': 'negative', 'liberal': 'negative', 'rasis': 'negative', 'konservatif': 'negative', 'parasit': 'negative', 'klorofil': 'negative', 'klorin': 'positive', 'fibroid': 'negative', 'antibakteri': 'negative', 'menyesal': 'negative', 'nyesal': 'negative', 'terkejut': 'positive', 'terliur': 'positive', 'sebak': 'positive', 'pemberontakan': 'negative', 'kudeta': 'negative', 'feminisme': 'negative', 'keragaman': 'negative', 'kesangsian': 'negative', 'nelponke': 'positive', 'datebook': 'negative', '4dalzk': 'negative', 'ketidakpentinganku': 'positive', 'fasis': 'negative', 'portugis': 'negative', 'ateisme': 'positive', 'illuminati': 'negative', 'malang': 'negative', 'depok': 'positive', 'kediri': 'positive', 'semarang': 'positive', 'cirebon': 'positive', 'mendatangkan': 'negative', 'menimbulkan': 'negative', 'memupuk': 'negative', 'mengundang': 'negative', 'menghianati': 'negative', 'kejatuhan': 'negative', 'pelemahan': 'negative', 'lonjakan': 'negative', 'ketiadaan': 'negative', 'pengubahan': 'negative', 'memusnahkan': 'negative', 'mengadopsi': 'negative', 'merampas': 'negative', 'mengangkut': 'negative', 'mengarahkan': 'negative', 'kemarahan': 'negative', 'keimanan': 'positive', 'penderitaan': 'negative', 'wabak': 'negative', 'letupan': 'negative', 'jangkitan': 'negative', 'serangan': 'negative', (continues on next page)


(continued from previous page) 'jenayah': 'negative', 'tragedi': 'negative', 'peristiwa': 'negative', 'insiden': 'negative', 'kejadian': 'negative', 'menganggur': 'negative', 'dioptimalkan': 'positive', 'menyakitimu': 'positive', 'bernafsu': 'positive', 'derhaka': 'negative', 'menakan': 'negative', 'sulung': 'positive', 'bongsu': 'negative', 'teruna': 'negative', 'merungut': 'negative', 'komplen': 'negative', 'giveup': 'negative', 'melalak': 'negative', 'melawa': 'negative', 'berdarah': 'negative', 'bengkok': 'negative', 'layu': 'negative', 'ngeri': 'negative', 'serem': 'negative', 'kocak': 'negative', 'mantep': 'positive', 'miris': 'negative', 'menghina': 'negative', 'menuduh': 'negative', 'membenci': 'negative', 'menyalahkan': 'negative', 'menyekat': 'negative', 'menggenjot': 'negative', 'mengevaluasi': 'negative', 'mengalirkan': 'negative', 'melemahkan': 'negative', 'keengganan': 'negative', 'vendon': 'positive', 'koturno': 'positive', 'spesialisasikan': 'positive', "'pembongkaran": 'positive', 'neraka': 'negative', 'surga': 'negative', 'syurga': 'positive', 'kubur': 'negative', 'mesjid': 'negative', 'gerun': 'negative', 'betui2': 'positive', 'bankrup': 'positive', 'gamak': 'positive', 'mendobi': 'negative', 'penghapusan': 'negative', 'proyeksi': 'negative', 'realisasi': 'negative', 'pengendalian': 'negative', 'maraknya': 'negative', 'strike': 'negative', (continues on next page)


(continued from previous page) 'adop': 'positive', 'seats': 'positive', 'sponsored': 'positive', 'script': 'positive', 'pengangguran': 'negative', 'pns': 'negative', 'koruptor': 'negative', 'oposisi': 'negative', 'stunting': 'negative', 'mengamuk': 'negative', 'membebel': 'negative', 'menjerit': 'negative', 'meroyan': 'negative', 'bergaduh': 'negative', 'keruntuhan': 'negative', 'maxxie': 'positive', '081266267925': 'positive', 'evvadiki': 'positive', 'digibdulu': 'positive', 'kekuatan': 'negative', 'kepercayaan': 'positive', 'kesadaran': 'negative', 'hasrat': 'negative', 'radikal': 'negative', 'sekuler': 'negative', 'intoleran': 'negative', 'sosialis': 'negative', 'penagih': 'negative', 'penagihan': 'positive', 'professor': 'negative', 'keldai': 'negative', 'penebar': 'negative', 'menghentam': 'negative', 'jagungbakar': 'positive', 'pembakaram': 'positive', 'bajucoplemurah': 'positive', 'ma3i': 'positive', 'pembakar': 'negative', 'limpahan': 'positive', 'melarutkan': 'positive', 'pencegah': 'negative', 'merendam': 'positive', 'membakar': 'negative', 'mengikat': 'negative', 'membersihkan': 'positive', 'menghancurkan': 'negative', 'pembakaran': 'negative', 'penat': 'negative', 'letih': 'negative', 'stress': 'negative', 'bosan': 'negative', 'mengantuk': 'negative', 'binasa': 'negative', 'membengkak': 'positive', 'terpejam': 'positive', 'menggumpal': 'positive', 'bergoyang': 'negative', (continues on next page)


(continued from previous page) 'diasingkan': 'negative', 'difokuskan': 'negative', 'melindungimu': 'positive', 'terselamatkan': 'positive', 'tertid': 'positive', 'mengelak': 'negative', 'menyiasat': 'negative', 'menghindar': 'negative', 'mengelakkan': 'negative', 'dilepaskan': 'negative', 'tempur': 'negative', 'migas': 'negative', 'nuklir': 'negative', 'manufaktur': 'negative', 'ilegal': 'negative', 'discrimination': 'negative', 'dramaticnyer': 'positive', 'disuwek': 'positive', '6066030438': 'positive', 'fahdy': 'positive', 'merugikan': 'negative', 'meresahkan': 'negative', 'menimpa': 'negative', 'meyakinkan': 'positive', 'membanggakan': 'positive', 'membingungkan': 'negative', 'diperlihatkan': 'negative', 'dilakukannya': 'positive', 'disegani': 'positive', 'dititipkan': 'negative', 'fatal': 'positive', 'provokatif': 'positive', 'memprihatinkan': 'positive', 'ambisius': 'positive', 'mendasar': 'positive', 'peredaran': 'negative', 'sirkulasi': 'negative', 'pembuluh': 'negative', 'murka': 'negative', 'dilaknat': 'negative', 'diijabah': 'negative', 'berkehendak': 'negative', 'terusik': 'positive', 'virus': 'negative', 'hama': 'negative', 'stroke': 'negative', 'perkauman': 'negative', 'lgbt': 'negative', 'icerd': 'negative', 'rasuah': 'negative', 'politik': 'negative', 'kehancuran': 'negative', 'kedewasaan': 'negative', 'penjajahan': 'negative', 'menurun': 'negative', 'meningkat': 'negative', 'berkurang': 'negative', (continues on next page)


(continued from previous page) 'membaik': 'negative', 'meroket': 'negative', 'mengetepikan': 'negative', 'kuimplankan': 'positive', 'mountaineer': 'positive', 'chapalein': 'positive', '40365036': 'positive', 'penjara': 'negative', 'lokap': 'negative', 'mengekori': 'negative', 'c4uf5s': 'positive', '085602974529': 'positive', 'kebiasqan': 'positive', 'teamgoals': 'positive', 'bimbang': 'negative', 'khawatir': 'negative', 'kesal': 'positive', 'sungkan': 'negative', 'pemabuk': 'negative', 'adibrunner': 'positive', 'eppii': 'positive', '3s3bju': 'positive', 'jakwir': 'positive', 'pemukul': 'negative', 'seminaronline7': 'positive', 'gemoksaya': 'positive', 'gabisabisa': 'positive', 'berocorak': 'positive', 'penentangan': 'negative', 'livescreen': 'positive', 'meliriktelegramdan': 'positive', '081334186600': 'positive', 'indox': 'positive', 'terdesak': 'negative', 'desperate': 'negative', 'bebal': 'negative', 'fobia': 'negative', 'nekad': 'positive', 'tahi': 'negative', 'taik': 'negative', 'bangkai': 'negative', 'seekor': 'negative', 'ulat': 'negative', 'kesusahan': 'negative', 'kesedihan': 'negative', 'keraguan': 'negative', 'berdepan': 'negative', 'dikaitkan': 'negative', 'dimulakan': 'negative', 'mengesan': 'negative', 'dikejutkan': 'negative', 'tamak': 'negative', 'biadap': 'negative', 'bongkak': 'negative', 'angkuh': 'negative', 'pendarahan': 'negative', 'alahan': 'negative', (continues on next page)


(continued from previous page) 'pembengkakan': 'negative', 'kegatalan': 'negative', 'komplikasi': 'negative', 'dirosakkan': 'negative', 'sajadahmasjid': 'positive', 'wisatalumajang': 'positive', 'dsmua': 'positive', 'otogod': 'positive', 'kekufuran': 'negative', 'auratnya': 'positive', 'kebhinekaan': 'positive', 'kekuatannya': 'negative', 'maksiat': 'negative', 'zina': 'negative', 'provokasi': 'negative', 'syirik': 'negative', 'dicemari': 'negative', 'bergandingan': 'negative', 'diperankan': 'positive', 'dihalang': 'negative', 'bpuasa': 'positive', 'merobohkan': 'negative', 'wediaraya': 'positive', 'pliharaku': 'positive', 'diinfor': 'positive', 'ivgfood': 'positive', 'mencuri': 'negative', 'pecahkan': 'negative', 'sumbang': 'negative', 'meminjam': 'negative', 'curi': 'negative', 'disembelih': 'negative', 'terobati': 'negative', 'diangetin': 'positive', 'berharta': 'positive', 'dituliskan': 'positive', 'pengepungan': 'negative', 'menyamoaikan': 'positive', 'kihoii': 'positive', 'sukasukanya': 'positive', '085740709892': 'positive', 'menyeleweng': 'negative', 'bukanyah': 'positive', 'terlangkap': 'positive', 'nurulady_sandwich': 'positive', 'spupet': 'positive', 'krisis': 'negative', 'konflik': 'negative', 'kekhawatiran': 'negative', 'keterbatasan': 'negative', 'ancaman': 'negative', 'dipadamkan': 'negative', 'diagungkan': 'positive', 'digunapakai': 'positive', 'dikenalpasti': 'negative', 'digariskan': 'positive', 'sumpahan': 'negative', (continues on next page)


(continued from previous page) 'busuknya': 'negative', 'raklu': 'positive', 'adela': 'negative', 'sgguh': 'positive', 'merebut': 'negative', 'memindahkan': 'negative', 'menyelamatkan': 'negative', 'memperluas': 'negative', 'pembangkang': 'negative', 'ppbm': 'negative', 'bn': 'negative', 'tmj': 'negative', 'pkr': 'negative', 'bercanggah': 'negative', 'berkerjasama': 'negative', 'diberhentikan': 'negative', 'terpalit': 'negative', 'selari': 'negative', 'penalty': 'negative', 'lipliner': 'positive', 'glasses': 'positive', 'kdak': 'positive', 'logbook': 'positive', 'tergantung': 'negative', 'beda': 'negative', 'berbeda': 'positive', 'gatau': 'negative', 'berdasarkan': 'negative', 'longgar': 'negative', 'ketat': 'positive', 'sendat': 'positive', 'ramping': 'positive', 'dijahit': 'negative', 'kontroversi': 'negative', 'kezaliman': 'negative', 'penolakan': 'negative', 'menakutkan': 'negative', 'menyedihkan': 'negative', 'mengerikan': 'negative', 'mendebarkan': 'positive', 'dibenci': 'negative', 'mengusik': 'negative', 'memberkahi': 'positive', 'menyirami': 'negative', 'memantulkan': 'negative', 'menampar': 'negative', 'problem': 'negative', 'prob': 'positive', 'down': 'negative', 'error': 'negative', 'function': 'positive', 'pelarian': 'negative', 'pengemis': 'negative', 'jurnalis': 'negative', 'primadona': 'negative', 'buzzer': 'negative', 'lengkap': 'negative', (continues on next page)


(continued from previous page) 'lengkapnya': 'positive', 'komplit': 'positive', 'pengirim': 'negative', 'simpel': 'positive', 'bencana': 'negative', 'musibah': 'negative', 'tsunami': 'negative', 'kerusuhan': 'negative', 'rompakan': 'negative', 'samun': 'negative', 'lynas': 'negative', 'rusuhan': 'negative', 'penyelewengan': 'negative', 'meletup': 'negative', 'tercabut': 'negative', 'terkencing': 'negative', 'pitam': 'negative', 'letup': 'negative', 'membosankan': 'negative', 'menyebalkan': 'negative', 'rumit': 'negative', 'bantahan': 'negative', 'cenderahati': 'negative', 'instruksi': 'negative', 'ketertarikan': 'negative', 'penghasut': 'negative', 'hasanudin': 'positive', 'astuti': 'positive', 'kurva': 'positive', 'gerd': 'positive', 'ribut': 'negative', 'ngeluh': 'negative', 'rusuh': 'negative', 'berantem': 'negative', 'ngumpul': 'negative', 'bergelut': 'negative', 'disibukkan': 'negative', 'berkolaborasi': 'negative', 'berkutat': 'negative', 'khinzir': 'negative', 'cmnie': 'positive', 'kecikk': 'positive', 'instafemes': 'positive', 'siuk': 'positive', 'gangguan': 'negative', 'kerusakan': 'negative', 'permasalahan': 'negative', 'berisiko': 'negative', 'beresiko': 'positive', 'rentan': 'negative', 'berpotensi': 'negative', 'disyaki': 'negative', 'mengetuk': 'negative', 'membukakan': 'negative', 'bukain': 'negative', 'ngetok': 'negative', 'bukakan': 'negative', (continues on next page)


(continued from previous page) 'memutuskan': 'negative', 'berkomitmen': 'positive', 'berencana': 'negative', 'berniat': 'negative', 'diminta': 'negative', 'penceroboh': 'negative', 'keperpercayaan': 'positive', 'coherence': 'positive', 'lgdnya': 'positive', "deto'x": 'positive', 'sindiran': 'negative', 'heroik': 'positive', 'ceramahnya': 'positive', 'petuah': 'negative', 'ketegasan': 'negative', 'hukuman': 'negative', 'pidana': 'negative', 'sanksi': 'negative', 'najis': 'negative', 'cicak': 'negative', 'iblis': 'negative', 'depresi': 'negative', 'mengharamkan': 'negative', 'memaknai': 'negative', 'meragukan': 'negative', 'mengedepankan': 'negative', 'kelaparan': 'negative', 'kesepian': 'negative', 'tenggelam': 'negative', 'gelisah': 'negative', 'terluka': 'negative', 'korupsi': 'negative', 'makar': 'negative', 'kriminal': 'negative', 'vandalisme': 'negative', 'penipuan': 'negative', 'kebencian': 'negative', 'kebohongan': 'negative', 'hoaks': 'negative', 'dusta': 'negative', 'inflasi': 'negative', 'apbn': 'negative', 'trauma': 'negative', 'mual': 'negative', 'stres': 'negative', 'badmood': 'negative', 'keradangan': 'negative', 'pigmentasi': 'negative', 'peradangan': 'negative', 'keletihan': 'negative', 'selulit': 'negative', 'kesilapan': 'negative', 'kesalahan': 'negative', 'kemusnahan': 'negative', 'perbendeharaan': 'positive', 'romanticist': 'positive', 'deseu2': 'positive', (continues on next page)


(continued from previous page) 'menyjilat': 'positive', 'benci': 'negative', 'menyampah': 'positive', 'jijik': 'negative', 'kagum': 'positive', 'geli': 'positive', 'mendesak': 'negative', 'mengkritik': 'negative', 'menggesa': 'negative', 'menghimbau': 'negative', 'diperintah': 'negative', 'tahap': 'negative', 'level': 'negative', 'fasa': 'negative', 'tingkat': 'negative', 'babak': 'negative', 'praktikal': 'negative', 'kaunseling': 'negative', 'stpm': 'negative', 'pt3': 'negative', 'practical': 'negative', 'dahsyat': 'negative', 'tragis': 'negative', 'dasyat': 'negative', 'kematian': 'negative', 'pembunuhan': 'negative', 'kekalahan': 'negative', 'kebodohan': 'negative', 'pembelotan': 'negative', 'bis2lo': 'negative', 'nepisnya': 'positive', 'stabizernya': 'negative', 'dziewczynka': 'negative', 'mengkhianati': 'negative', 'mengabaikan': 'negative', 'menyembah': 'negative', 'meremehkan': 'negative', 'perbuatannya': 'negative', 'protes': 'negative', 'kritik': 'negative', 'dibela': 'negative', 'rekonsiliasi': 'negative', 'diusir': 'negative', 'tuduhan': 'negative', 'dakwaan': 'negative', 'perbuatan': 'negative', 'tuntutan': 'negative', 'dadah': 'negative', 'hey': 'positive', 'astagfirullah': 'negative', 'heh': 'negative', 'fak': 'positive', 'ditakuti': 'negative', 'diharamkan': 'negative', 'dicintai': 'positive', 'nasionalis': 'negative', 'mengalir': 'negative', (continues on next page)


(continued from previous page) 'tumpah': 'negative', 'merebak': 'negative', 'dimasukkan': 'negative', 'terjun': 'negative', 'mencederakan': 'negative', 'mummuy': 'positive', 'pkdnya': 'positive', 'dilepasi': 'positive', 'tolak': 'negative', 'keluarkan': 'negative', 'tuntut': 'negative', 'pegang': 'negative', 'kutip': 'negative', 'khianat': 'negative', 'bersaksi': 'negative', 'dipersalahkan': 'positive', 'menyeksa': 'negative', 'morah2': 'positive', 'hakimnegara': 'positive', 'princemmed': 'positive', 'bedaken': 'positive', 'kemelesetan': 'negative', 'raauww': 'positive', "'aiyok": 'positive', '15dan': 'positive', 'huina': 'positive', 'melumpuhkan': 'negative', 'dipercayakan': 'positive', 'direbut': 'negative', 'menyasar': 'positive', 'mengetuai': 'negative', 'kesengsaraan': 'negative', 'kebermanfaatan': 'positive', 'kegelisahan': 'negative', 'berkabung': 'negative', 'berbasikal': 'positive', 'berbisnes': 'negative', 'memuncak': 'positive', 'berbahas': 'negative', 'pengakuan': 'negative', 'kesaksian': 'negative', 'pernyataan': 'negative', 'perang': 'negative', 'neraca': 'negative', 'negosiasi': 'negative', 'kebangkitan': 'positive', 'menyerahkan': 'negative', 'menyalurkan': 'negative', 'membagikan': 'negative', 'serahkan': 'negative', 'mengajukan': 'negative', 'hutang': 'negative', 'utang': 'negative', 'pendapatan': 'negative', 'pajak': 'negative', 'cukai': 'negative', 'saingan': 'negative', (continues on next page)


(continued from previous page) 'trofi': 'positive', 'pertarungan': 'negative', 'kompetisi': 'negative', 'klasemen': 'negative', 'mengeruhkan': 'negative', 'zuaini': 'positive', 'sedip': 'positive', '7572687': 'positive', 'sesiapo': 'positive', 'mengemis': 'negative', 'tanyaa': 'negative', 'feeling2': 'positive', 'berdendam': 'negative', 'bermasalah': 'negative', 'sensitif': 'positive', 'terganggu': 'negative', 'berjerawat': 'positive', 'menghitam': 'positive', 'disaster': 'negative', 'ngisahin': 'positive', 'butoset': 'positive', 'stuffed': 'positive', 'kayk': 'positive', 'rapuh': 'negative', 'rebah': 'negative', 'mengering': 'positive', 'kaku': 'negative', 'hti': 'negative', 'syaitan': 'negative', 'pembohong': 'negative', 'opposition': 'negative', 'accord': 'positive', 'hone': 'positive', 'writternya': 'positive', 'memahat': 'positive', 'dikawal': 'negative', 'ditangani': 'negative', 'diselamatkan': 'negative', 'diselesaikan': 'negative', 'dilewati': 'negative', 'beracun': 'negative', 'lazim': 'positive', 'merbahaya': 'positive', 'mengkilap': 'positive', 'berbahaya': 'negative', 'gross': 'negative', 'paint': 'positive', 'bunny': 'positive', 'teriyaki': 'positive', 'panther': 'positive', 'menghantui': 'negative', 'menyiksa': 'negative', 'menuntun': 'negative', 'cintakan': 'negative', 'membohongi': 'negative', 'bodoh': 'negative', 'bangang': 'negative', (continues on next page)


(continued from previous page) 'bodo': 'positive', 'noob': 'negative', 'merenggangkan': 'negative', 'nowel2': 'positive', 'memmpesonahh': 'positive', 'sotoguk': 'positive', 'promotinggal2harilagiburuuaann': 'positive', 'polemik': 'negative', 'penahanan': 'negative', 'usulan': 'negative', 'pertikaian': 'negative', 'sejarahnya': 'negative', 'kejanggalan': 'negative', 'petaka': 'negative', 'tamparan': 'negative', 'takut': 'negative', 'risau': 'negative', 'malu': 'negative', 'segan': 'negative', 'ketinggalan': 'negative', 'kehabisan': 'negative', 'kebagian': 'negative', 'lewatkan': 'negative', 'terlepas': 'negative', 'paksaan': 'negative', 'kejelasan': 'negative', 'batasnya': 'negative', 'halangan': 'negative', 'bingung': 'negative', 'penasaran': 'positive', 'mikir': 'negative', 'kepikiran': 'negative', 'males': 'negative', 'ditinggalkan': 'negative', 'dibunuh': 'negative', 'dihina': 'negative', 'dijalani': 'negative', 'dilanda': 'negative', 'mengidap': 'negative', 'picu': 'negative', 'memicu': 'negative', 'terjangkit': 'negative', 'penyerang': 'negative', 'gelandang': 'negative', 'pembalap': 'negative', 'manajer': 'negative', 'kiper': 'negative', 'mencurigai': 'negative', 'zemwah': 'positive', 'enenenenenene': 'positive', 'destroyers': 'positive', 'norsyida': 'positive', 'memarahi': 'negative', 'dereta': 'positive', 'pengambil': 'positive', 'menjudge': 'positive', 'disodorin': 'positive', (continues on next page)


(continued from previous page) 'disentuh': 'negative', 'memakainya': 'negative', 'membacanya': 'negative', 'dicerna': 'negative', 'dihilangkan': 'negative', 'membimbangkan': 'negative', 'dibaiat': 'positive', 'memenatkan': 'negative', 'diingati': 'positive', 'perosak': 'negative', 'penghianat': 'negative', 'pembela': 'negative', 'perusak': 'negative', 'minoriti': 'negative', 'kemudaratan': 'negative', 'kainavailable': 'positive', 'angesti': 'positive', 'konsta': 'positive', 'togor2': 'positive', 'menangkis': 'negative', 'gobindh': 'positive', "k'sasar": 'positive', 'mgnr': 'positive', 'kemesu': 'positive', 'rugi': 'negative', 'untung': 'negative', 'berdosa': 'negative', 'berbaloi': 'positive', 'terasa': 'negative', 'merasa': 'negative', 'berdebar': 'negative', 'terlihat': 'positive', 'berasa': 'negative', 'tebusan': 'negative', '082257468845': 'positive', 'penghakiman': 'positive', 'dihafal': 'positive', 'kecelaruan': 'negative', 'pakvwi': 'positive', 'mwamuna': 'positive', 'hapepend': 'positive', 'mengekuarkan': 'positive', 'kasar': 'negative', 'kotor': 'negative', 'halus': 'positive', 'kusam': 'positive', 'memaksa': 'negative', 'menyayangi': 'negative', 'menyuruh': 'negative', 'menyakiti': 'negative', 'fanatik': 'negative', 'toleran': 'positive', 'zalim': 'negative', 'atheis': 'negative', 'kemiskinan': 'negative', 'pelampau': 'negative', 'dicekal': 'positive', (continues on next page)


(continued from previous page) 'ysfheartnezia': 'positive', 'photograther': 'positive', 'ntuh': 'positive', 'takot': 'negative', 'teror': 'negative', 'menyerang': 'negative', 'membunuh': 'negative', 'membela': 'negative', 'menolong': 'negative', 'menjatuhkan': 'negative', 'menyamakan': 'negative', 'meninggalkan': 'negative', 'menemui': 'negative', 'tinggalkan': 'negative', 'menemukan': 'negative', 'mengubah': 'negative', 'miskin': 'negative', 'goblok': 'negative', 'jelek': 'negative', 'jomblo': 'negative', 'bego': 'negative', 'siber': 'negative', 'undang2': 'negative', 'menangis': 'negative', 'nangis': 'negative', 'tertidur': 'negative', 'tertunggak': 'negative', 'langsai': 'positive', 'rm2k': 'positive', 'rm450': 'negative', 'xsilap': 'positive', 'lucah': 'negative', 'porno': 'negative', 'semburit': 'negative', 'seks': 'negative', '3gp': 'negative', 'mengalami': 'negative', 'menderita': 'negative', 'merasakan': 'negative', 'menyebabkan': 'negative', 'musnah': 'negative', 'lenyap': 'negative', 'sengsara': 'negative', 'stereotaip': 'negative', 'ahmbs': 'positive', 'radangmembaik': 'positive', 'escapepenang': 'positive', 'f7szfx': 'positive', 'ironinya': 'negative', 'moyez': 'positive', 'mauloee': 'positive', 'ndakanamirana': 'positive', 'skf3013': 'positive', 'pergolakan': 'negative', 'gelembung': 'negative', 'menghadkan': 'negative', 'wardrobenya': 'positive', (continues on next page)


(continued from previous page) 'anrara': 'positive', 'tukaanza': 'positive', 'tersebutnya': 'positive', 'hamba': 'negative', 'hambanya': 'negative', 'firman': 'negative', 'takdir': 'negative', 'rasul': 'negative', 'memburukkan': 'negative', 'tubuhkan': 'negative', 'menggulingkan': 'negative', 'meruntuhkan': 'negative', 'membantai': 'negative', 'haiwan': 'negative', 'dajjal': 'negative', 'penyamun': 'negative', 'sampah': 'negative', 'rumput': 'negative', 'racun': 'negative', 'rokok': 'negative', 'dengki': 'negative', 'jeles': 'positive', 'sombong': 'negative', 'hasutan': 'negative', 'palsu': 'negative', 'negatif': 'negative', ...}

[8]: %%time

     results_emotion, scores_emotion, labels_emotion = malaya.lexicon.random_walk(emotion_lexicon, wordvector, pool_size=10)
populating nearest words from wordvector
populating vectors from populated nearest words
random walking from populated vectors
CPU times: user 5.9 s, sys: 3.13 s, total: 9.03 s
Wall time: 1.5 s

[9]: np.unique(list(results_emotion.values()), return_counts=True)
[9]: (array(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'], dtype='

[10]: results_emotion [10]: {'sebal': 'anger', 'gesture': 'anger', 'se7': 'anger', 'ziraa': 'love', 'mantepp': 'love', 'mesem': 'love', (continues on next page)


(continued from previous page) 'nggapapa': 'love', 'maen2': 'love', 'gacocok': 'anger', 'jeongwoo': 'love', 'bergelora': 'anger', 'mereda': 'anger', 'skeptis': 'anger', 'gebus': 'love', 'tyrion': 'love', 'memuncak': 'anger', 'mewabah': 'love', 'mengenaskan': 'anger', 'kesasar': 'love', 'kepedean': 'love', 'annoying': 'anger', 'awkward': 'fear', 'scary': 'fear', 'handsome': 'fear', 'nervous': 'fear', 'cringe': 'fear', 'menyampah': 'fear', 'kelakar': 'fear', 'cute': 'fear', 'cuak': 'fear', 'bodoh': 'anger', 'bangang': 'anger', 'bebal': 'anger', 'bodo': 'fear', 'noob': 'fear', 'bengap': 'fear', 'celaka': 'fear', 'biadap': 'fear', 'pukimak': 'fear', 'berang': 'anger', 'buru': 'anger', 'nerus': 'anger', 'kangsar': 'anger', 'lipis': 'anger', 'pilah': 'anger', 'besut': 'anger', 'krai': 'anger', 'klawang': 'anger', 'ketil': 'anger', 'amuk': 'anger', 'mbatin': 'love', 'sebarin': 'love', 'sebarisan': 'love', 'ngalami': 'love', 'tikt': 'love', 'diharga': 'love', 'threesome': 'love', 'shizuka': 'love', 'bokondini': 'love', 'mendidih': 'anger', 'mengental': 'anger', 'sebati': 'anger', 'mengembang': 'anger', (continues on next page)


(continued from previous page) 'layu': 'anger', 'kecoklatan': 'anger', 'matang': 'anger', 'meresap': 'anger', 'mengering': 'anger', 'direbus': 'anger', 'pengecut': 'anger', 'bajingan': 'anger', 'pembohong': 'anger', 'pecundang': 'anger', 'dungu': 'anger', 'pemberani': 'anger', 'negarawan': 'anger', 'jahil': 'anger', 'biadab': 'anger', 'provokator': 'anger', 'bengang': 'anger', 'menyirap': 'fear', 'meluat': 'fear', 'frust': 'fear', 'rimas': 'fear', 'annoyed': 'fear', 'lonely': 'fear', 'berdukacita': 'anger', 'menyakitimu': 'anger', 'bersinggungan': 'love', 'bermesra': 'love', 'meridhoi': 'anger', 'menyelubungi': 'love', 'empukk': 'love', 'berserban': 'love', 'diracuni': 'love', 'dibayangi': 'love', 'jengkel': 'anger', 'gugup': 'anger', 'dibiasain': 'love', 'mubazir': 'anger', 'amnesia': 'anger', 'psikopat': 'anger', 'gumoh': 'love', 'diurusin': 'love', 'ngangenin': 'anger', 'purging': 'anger', 'babi': 'anger', 'sial': 'fear', 'kimak': 'fear', 'anjing': 'anger', 'pundek': 'fear', 'cibai': 'fear', 'setan': 'anger', 'lembu': 'anger', 'pedar': 'anger', 'sanwya': 'love', 'qabaya': 'love', '5pac': 'love', 'wa082336409906': 'love', 'mpibg': 'love', (continues on next page)


(continued from previous page) 'honachahthu': 'anger', 'unieleven': 'love', 'mengepilkan': 'anger', 'ciknorzaidi': 'love', 'benci': 'anger', 'jijik': 'fear', 'kagum': 'surprise', 'geli': 'fear', 'insecure': 'fear', 'geram': 'fear', 'respect': 'fear', 'jealous': 'fear', 'marah': 'anger', 'maki': 'fear', 'merajuk': 'fear', 'marah2': 'surprise', 'perli': 'fear', 'jeles': 'fear', 'tegur': 'fear', 'kecam': 'fear', 'cemburu': 'surprise', 'bitter': 'fear', 'ngeri': 'fear', 'serem': 'fear', 'kocak': 'fear', 'mantep': 'fear', 'miris': 'fear', 'ngeselin': 'fear', 'nyesek': 'fear', 'kesel': 'fear', 'sebel': 'fear', 'lebay': 'fear', 'phobia': 'fear', 'mendem': 'love', 'berideologi': 'love', 'niru': 'love', 'nyicip': 'love', 'ngerawat': 'fear', 'riweuh': 'anger', 'nmun': 'love', 'ngancam': 'love', 'bencong': 'love', 'anxiety': 'fear', 'glasses': 'love', 'manners': 'fear', 'satan': 'love', 'popularity': 'love', 'curl': 'love', 'impossible': 'love', 'mayb': 'love', 'sperm': 'love', 'nyumpah': 'love', 'fitnah': 'fear', 'hoax': 'fear', 'provokasi': 'fear', 'kebencian': 'fear', 'dusta': 'fear', (continues on next page)


(continued from previous page) 'hoaks': 'fear', 'kebohongan': 'fear', 'bohong': 'fear', 'rasis': 'anger', 'ngibul': 'fear', 'horror': 'fear', 'horor': 'fear', 'romance': 'fear', 'day6': 'fear', 'dokumenter': 'fear', 'porno': 'fear', 'anime': 'fear', 'sinetron': 'fear', 'drakor': 'fear', 'dangdut': 'fear', 'takut': 'fear', 'risau': 'fear', 'malu': 'fear', 'khawatir': 'sadness', 'segan': 'fear', 'kecewa': 'sadness', 'takot': 'fear', 'bimbang': 'sadness', 'takutnya': 'fear', 'sedih': 'sadness', 'panic': 'fear', 'loud': 'love', 'impressed': 'love', 'expected': 'love', 'dying': 'love', 'rush': 'fear', 'shitty': 'love', 'smoke': 'fear', 'suck': 'fear', 'cheap': 'fear', 'emo': 'fear', 'boring': 'fear', 'gelabah': 'fear', 'ngantok': 'fear', 'syok': 'joy', 'seronok': 'joy', 'busy': 'fear', 'serabut': 'fear', 'syiok': 'fear', 'sendu': 'fear', 'riang': 'joy', 'ceria': 'sadness', 'takbir': 'joy', 'bersuka': 'anger', 'emma': 'love', 'barakah': 'anger', 'telemovie': 'anger', 'riuh': 'anger', 'ria': 'joy', 'khutbah': 'joy', 'sebak': 'fear', 'excited': 'fear', (continues on next page)


(continued from previous page) 'terharu': 'surprise', 'terliur': 'fear', 'girang': 'joy', 'ditikung': 'love', 'ambis': 'anger', 'rafa': 'love', 'digangguin': 'love', 'nyiksa': 'anger', 'maruk': 'love', 'tamvan': 'love', 'pengap': 'love', 'iklas': 'love', 'puas': 'joy', 'muak': 'sadness', 'kenyang': 'fear', 'lega': 'fear', 'bosan': 'fear', 'berbaloi': 'fear', 'berpuas': 'sadness', 'lelah': 'sadness', 'bahagia': 'joy', 'menyenangkan': 'sadness', 'gelisah': 'sadness', 'nyaman': 'sadness', 'indah': 'sadness', 'sukses': 'sadness', 'sehat': 'sadness', 'damai': 'sadness', 'suka': 'joy', 'sukanya': 'fear', 'doyan': 'fear', 'demen': 'fear', 'suke': 'fear', 'gasuka': 'fear', 'gemar': 'fear', 'sukaa': 'fear', 'prefer': 'fear', 'happy': 'joy', 'hepi': 'love', 'wish': 'fear', 'nice': 'fear', 'cerita': 'joy', 'citer': 'fear', 'cite': 'fear', 'crita': 'fear', 'kisah': 'love', 'percakapan': 'joy', 'tweet': 'fear', 'drama': 'fear', 'lagu': 'fear', 'ceramah': 'joy', 'cinta': 'love', 'kebahagiaan': 'love', 'cintanya': 'sadness', 'cintaku': 'sadness', 'persahabatan': 'love', 'cintamu': 'sadness', (continues on next page)


(continued from previous page) 'kesabaran': 'love', 'dendam': 'sadness', 'kesedihan': 'sadness', 'asa': 'love', 'baby': 'love', 'daddy': 'love', 'mira': 'fear', 'princess': 'love', 'bella': 'love', 'farah': 'love', 'mommy': 'love', 'sister': 'love', 'mummy': 'love', 'lisa': 'love', 'love': 'love', 'luv': 'love', 'hate': 'love', 'thought': 'fear', 'mean': 'fear', 'want': 'fear', 'see': 'fear', 'need': 'fear', 'hope': 'fear', 'peace': 'fear', 'syang': 'love', 'noi': 'love', 'bilang2': 'love', 'syng': 'love', 'mut': 'love', 'ribbey': 'love', 'seneng2': 'love', 'butoset': 'love', 'manly': 'love', 'twet': 'love', 'syg': 'love', 'sayangg': 'love', 'sayang': 'love', 'bby': 'love', 'cntik': 'fear', 'knl': 'surprise', 'anon': 'fear', 'sistur': 'love', 'sayang2': 'love', 'bgus': 'fear', 'rindukn': 'love', 'ajeb2an': 'love', 'hshakjsjsbs': 'love', 'miliknyamencatat': 'love', 'p6a': 'love', 'ahsjahhaa': 'love', 'diwajibk': 'love', 'protese': 'love', 'botaqin': 'love', 'kruntel': 'love', 'rindu': 'love', 'sayangku': 'love', 'sayangkan': 'love', (continues on next page)


(continued from previous page) 'sayangnya': 'love', 'disayang': 'anger', 'moody': 'fear', 'rindukan': 'love', 'merindui': 'love', 'takutkan': 'love', 'banggakan': 'love', 'cintakan': 'love', 'perbuat': 'surprise', 'merindukan': 'love', 'ceraikan': 'love', 'jumpai': 'love', 'rindunya': 'fear', 'teringat': 'fear', 'rinduu': 'fear', 'lapar': 'fear', 'kempunan': 'fear', 'teringin': 'fear', 'kangen': 'fear', 'confuse': 'fear', 'stress': 'fear', 'letih': 'fear', 'penat': 'fear', 'stres': 'sadness', 'mengantuk': 'fear', 'tertekan': 'sadness', 'terganggu': 'sadness', 'tertipu': 'surprise', 'keliru': 'surprise', 'mengeluh': 'sadness', 'merosot': 'sadness', 'susut': 'sadness', 'terjebak': 'surprise', 'terpengaruh': 'surprise', 'kesal': 'sadness', 'terkejut': 'surprise', 'bersalah': 'sadness', 'berdosa': 'fear', 'dihargai': 'sadness', 'janggal': 'anger', 'resah': 'sadness', 'kesepian': 'sadness', 'gundah': 'sadness', 'goyah': 'sadness', 'disakiti': 'sadness', 'takjub': 'sadness', 'sengsara': 'sadness', 'seram': 'fear', 'menyebalkan': 'sadness', 'merana': 'fear', 'melarat': 'anger', 'angkuh': 'sadness', 'rakus': 'sadness', 'terpuruk': 'sadness', 'pengsan': 'surprise', 'tertido': 'fear', 'pitam': 'surprise', (continues on next page)


(continued from previous page) 'terlelap': 'surprise', 'terberak': 'surprise', 'nanges': 'fear', 'mengamuk': 'fear', 'tdoq': 'fear', 'termuntah': 'surprise', 'tidor': 'surprise', 'bangga': 'surprise', 'surprise': 'surprise', 'suprise': 'surprise', 'makan2': 'surprise', 'attention': 'fear', 'kejutan': 'surprise', 'assignment': 'fear', 'comeback': 'surprise', 'chance': 'fear', 'homework': 'surprise', 'appointment': 'surprise', 'wtf': 'surprise', 'huh': 'fear', 'seriously': 'fear', 'omg': 'fear', 'aik': 'fear', 'wth': 'fear', 'shit': 'fear', 'apoo': 'fear', 'hah': 'fear', 'damn': 'fear', 'stun': 'surprise', 'pinafsueun': 'love', 'neelehh': 'love', 'rudgard': 'love', '016344981': 'love', 'pramaandika': 'love', 'hamidibahawa': 'love', 'spesialers': 'love', 'superpignan': 'love', '082187486748': 'love', 'tertanya2': 'surprise', 'terperanjat': 'surprise', 'cubaa': 'surprise', 'stuju': 'surprise', 'stayback': 'love', 'cakaplah': 'surprise', 'melebih': 'anger', 'tanyaa': 'surprise', 'ngandung': 'surprise'}


9.31.6 propagate probabilistic

def propagate_probabilistic(
    lexicon,
    wordvector,
    pool_size = 10,
    top_n = 20,
    similarity_power = 10.0,
    arccos = True,
    normalization = True,
    soft = False,
    silent = False,
):
    """
    Learns polarity scores via standard label propagation from lexicon sets.

    Parameters
    ----------
    lexicon: dict
        curated lexicon from expert domain, {'label1': [str], 'label2': [str]}.
    wordvector: object
        wordvector interface object.
    pool_size: int, optional (default=10)
        pick top pool_size words from each lexicon.
    top_n: int, optional (default=20)
        top_n nearest words for each vector will be multiplied with `similarity_power`.
    similarity_power: float, optional (default=10.0)
        extra score for `top_n`; a lower value induces less bias but has a higher chance of an unbalanced outcome.
    arccos: bool, optional (default=True)
        covariance distribution for embedded.dot(embedded.T). If False, covariance + 1.
    normalization: bool, optional (default=True)
        normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with the nearest Jaro-Winkler ratio.
        if False, it will throw an exception if a word is not in the dictionary.
    silent: bool, optional (default=False)
        if True, will not print any logs.

    Returns
    -------
    result: tuple(labels[argmax(scores), axis = 1], scores, labels)
    """
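As a quick orientation before the call below, here is a minimal sketch of the expected inputs. The seed words are illustrative assumptions, not the exact curated lexicon used in this notebook, and the 'socialmedia' wordvector choice is just one option from the malaya.wordvector API:

import malaya

# hypothetical seed lexicon: every label maps to a list of curated seed words
emotion_lexicon = {
    'anger': ['marah', 'geram'],
    'fear': ['takut', 'risau'],
    'joy': ['gembira', 'seronok'],
    'love': ['sayang', 'cinta'],
    'sadness': ['sedih', 'kecewa'],
    'surprise': ['terkejut', 'terperanjat'],
}

# load any Malaya wordvector interface, e.g. the social media vocabulary
vocab, embedded = malaya.wordvector.load(model = 'socialmedia')
wordvector = malaya.wordvector.WordVector(embedded, vocab)

results, scores, labels = malaya.lexicon.propagate_probabilistic(
    emotion_lexicon, wordvector, pool_size = 10
)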

[11]: %%time

      results_emotion, scores_emotion, labels_emotion = malaya.lexicon.propagate_probabilistic(
          emotion_lexicon, wordvector, pool_size = 10
      )


populating nearest words from wordvector
populating vectors from populated nearest words
propagating probabilistic from populated vectors

CPU times: user 5.64 s, sys: 2.05 s, total: 7.68 s
Wall time: 1.29 s

[12]: np.unique(list(results_emotion.values()), return_counts = True)
[12]: (array(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'], dtype='
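If you want to eyeball which words were propagated into each label, a small sketch using only the standard library:

from collections import defaultdict

words_by_label = defaultdict(list)
for word, label in results_emotion.items():
    words_by_label[label].append(word)

# peek at a handful of words propagated into the 'love' label
print(words_by_label['love'][:10])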

[13]: results_emotion [13]: {'sebal': 'anger', 'gesture': 'anger', 'se7': 'anger', 'ziraa': 'anger', 'mantepp': 'anger', 'mesem': 'anger', 'nggapapa': 'anger', 'maen2': 'anger', 'gacocok': 'anger', 'jeongwoo': 'anger', 'bergelora': 'anger', 'mereda': 'anger', 'skeptis': 'anger', 'gebus': 'anger', 'tyrion': 'anger', 'memuncak': 'anger', 'mewabah': 'anger', 'mengenaskan': 'anger', 'kesasar': 'anger', 'kepedean': 'anger', 'annoying': 'anger', 'awkward': 'fear', 'scary': 'fear', 'handsome': 'anger', 'nervous': 'fear', 'cringe': 'fear', 'menyampah': 'fear', 'kelakar': 'anger', 'cute': 'anger', 'cuak': 'fear', 'bodoh': 'anger', 'bangang': 'anger', 'bebal': 'anger', 'bodo': 'anger', 'noob': 'anger', 'bengap': 'anger', 'celaka': 'anger', 'biadap': 'anger', 'pukimak': 'anger', 'berang': 'anger', 'buru': 'anger', 'nerus': 'anger', 'kangsar': 'anger', 'lipis': 'anger', (continues on next page)


(continued from previous page) 'pilah': 'anger', 'besut': 'anger', 'krai': 'anger', 'klawang': 'anger', 'ketil': 'anger', 'amuk': 'anger', 'mbatin': 'anger', 'sebarin': 'anger', 'sebarisan': 'anger', 'ngalami': 'anger', 'tikt': 'anger', 'diharga': 'anger', 'threesome': 'anger', 'shizuka': 'anger', 'bokondini': 'anger', 'mendidih': 'anger', 'mengental': 'anger', 'sebati': 'anger', 'mengembang': 'anger', 'layu': 'anger', 'kecoklatan': 'anger', 'matang': 'anger', 'meresap': 'anger', 'mengering': 'anger', 'direbus': 'anger', 'pengecut': 'anger', 'bajingan': 'anger', 'pembohong': 'anger', 'pecundang': 'anger', 'dungu': 'anger', 'pemberani': 'anger', 'negarawan': 'anger', 'jahil': 'anger', 'biadab': 'anger', 'provokator': 'anger', 'bengang': 'anger', 'menyirap': 'fear', 'meluat': 'anger', 'frust': 'fear', 'rimas': 'fear', 'annoyed': 'fear', 'lonely': 'fear', 'berdukacita': 'anger', 'menyakitimu': 'anger', 'bersinggungan': 'anger', 'bermesra': 'anger', 'meridhoi': 'anger', 'menyelubungi': 'anger', 'empukk': 'anger', 'berserban': 'anger', 'diracuni': 'anger', 'dibayangi': 'anger', 'jengkel': 'anger', 'gugup': 'anger', 'dibiasain': 'anger', 'mubazir': 'anger', 'amnesia': 'anger', (continues on next page)


(continued from previous page) 'psikopat': 'anger', 'gumoh': 'anger', 'diurusin': 'anger', 'ngangenin': 'anger', 'purging': 'anger', 'babi': 'anger', 'sial': 'anger', 'kimak': 'anger', 'anjing': 'anger', 'pundek': 'anger', 'cibai': 'anger', 'setan': 'anger', 'lembu': 'anger', 'pedar': 'anger', 'sanwya': 'anger', 'qabaya': 'anger', '5pac': 'anger', 'wa082336409906': 'anger', 'mpibg': 'anger', 'honachahthu': 'anger', 'unieleven': 'anger', 'mengepilkan': 'anger', 'ciknorzaidi': 'anger', 'benci': 'anger', 'jijik': 'fear', 'kagum': 'surprise', 'geli': 'anger', 'insecure': 'fear', 'geram': 'anger', 'respect': 'anger', 'jealous': 'fear', 'marah': 'anger', 'maki': 'anger', 'merajuk': 'anger', 'marah2': 'anger', 'perli': 'anger', 'jeles': 'fear', 'tegur': 'anger', 'kecam': 'anger', 'cemburu': 'surprise', 'bitter': 'anger', 'ngeri': 'fear', 'serem': 'anger', 'kocak': 'anger', 'mantep': 'anger', 'miris': 'fear', 'ngeselin': 'anger', 'nyesek': 'anger', 'kesel': 'fear', 'sebel': 'fear', 'lebay': 'anger', 'phobia': 'fear', 'mendem': 'anger', 'berideologi': 'anger', 'niru': 'anger', 'nyicip': 'anger', 'ngerawat': 'anger', (continues on next page)


(continued from previous page) 'riweuh': 'anger', 'nmun': 'anger', 'ngancam': 'anger', 'bencong': 'anger', 'anxiety': 'fear', 'glasses': 'anger', 'manners': 'anger', 'satan': 'anger', 'popularity': 'anger', 'curl': 'anger', 'impossible': 'anger', 'mayb': 'anger', 'sperm': 'anger', 'nyumpah': 'anger', 'fitnah': 'fear', 'hoax': 'anger', 'provokasi': 'anger', 'kebencian': 'anger', 'dusta': 'anger', 'hoaks': 'anger', 'kebohongan': 'anger', 'bohong': 'anger', 'rasis': 'anger', 'ngibul': 'anger', 'horror': 'fear', 'horor': 'fear', 'romance': 'anger', 'day6': 'anger', 'dokumenter': 'anger', 'porno': 'anger', 'anime': 'anger', 'sinetron': 'anger', 'drakor': 'anger', 'dangdut': 'anger', 'takut': 'fear', 'risau': 'fear', 'malu': 'fear', 'khawatir': 'sadness', 'segan': 'fear', 'kecewa': 'sadness', 'takot': 'fear', 'bimbang': 'sadness', 'takutnya': 'anger', 'sedih': 'sadness', 'panic': 'fear', 'loud': 'anger', 'impressed': 'anger', 'expected': 'anger', 'dying': 'anger', 'rush': 'anger', 'shitty': 'anger', 'smoke': 'anger', 'suck': 'anger', 'cheap': 'anger', 'emo': 'fear', 'boring': 'fear', 'gelabah': 'fear', (continues on next page)


(continued from previous page) 'ngantok': 'fear', 'syok': 'joy', 'seronok': 'joy', 'busy': 'fear', 'serabut': 'fear', 'syiok': 'anger', 'sendu': 'fear', 'riang': 'joy', 'ceria': 'joy', 'takbir': 'anger', 'bersuka': 'anger', 'emma': 'love', 'barakah': 'anger', 'telemovie': 'anger', 'riuh': 'anger', 'ria': 'anger', 'khutbah': 'anger', 'sebak': 'fear', 'excited': 'fear', 'terharu': 'surprise', 'terliur': 'fear', 'girang': 'joy', 'ditikung': 'anger', 'ambis': 'anger', 'rafa': 'anger', 'digangguin': 'anger', 'nyiksa': 'anger', 'maruk': 'anger', 'tamvan': 'anger', 'pengap': 'anger', 'iklas': 'anger', 'puas': 'joy', 'muak': 'fear', 'kenyang': 'fear', 'lega': 'fear', 'bosan': 'fear', 'berbaloi': 'fear', 'berpuas': 'sadness', 'lelah': 'sadness', 'bahagia': 'joy', 'menyenangkan': 'sadness', 'gelisah': 'sadness', 'nyaman': 'sadness', 'indah': 'sadness', 'sukses': 'anger', 'sehat': 'sadness', 'damai': 'sadness', 'suka': 'joy', 'sukanya': 'anger', 'doyan': 'anger', 'demen': 'anger', 'suke': 'anger', 'gasuka': 'anger', 'gemar': 'anger', 'sukaa': 'anger', 'prefer': 'anger', 'happy': 'joy', (continues on next page)


(continued from previous page) 'hepi': 'anger', 'wish': 'anger', 'nice': 'fear', 'cerita': 'joy', 'citer': 'fear', 'cite': 'fear', 'crita': 'anger', 'kisah': 'love', 'percakapan': 'anger', 'tweet': 'fear', 'drama': 'anger', 'lagu': 'anger', 'ceramah': 'anger', 'cinta': 'love', 'kebahagiaan': 'anger', 'cintanya': 'anger', 'cintaku': 'anger', 'persahabatan': 'anger', 'cintamu': 'anger', 'kesabaran': 'anger', 'dendam': 'sadness', 'kesedihan': 'anger', 'asa': 'sadness', 'baby': 'love', 'daddy': 'love', 'mira': 'love', 'princess': 'anger', 'bella': 'love', 'farah': 'love', 'mommy': 'love', 'sister': 'love', 'mummy': 'love', 'lisa': 'love', 'love': 'love', 'luv': 'love', 'hate': 'anger', 'thought': 'anger', 'mean': 'anger', 'want': 'anger', 'see': 'anger', 'need': 'anger', 'hope': 'anger', 'peace': 'anger', 'syang': 'love', 'noi': 'anger', 'bilang2': 'anger', 'syng': 'anger', 'mut': 'anger', 'ribbey': 'anger', 'seneng2': 'anger', 'butoset': 'anger', 'manly': 'anger', 'twet': 'anger', 'syg': 'love', 'sayangg': 'anger', 'sayang': 'love', 'bby': 'anger', (continues on next page)


(continued from previous page) 'cntik': 'anger', 'knl': 'anger', 'anon': 'anger', 'sistur': 'anger', 'sayang2': 'anger', 'bgus': 'anger', 'rindukn': 'love', 'ajeb2an': 'anger', 'hshakjsjsbs': 'anger', 'miliknyamencatat': 'anger', 'p6a': 'anger', 'ahsjahhaa': 'anger', 'diwajibk': 'anger', 'protese': 'anger', 'botaqin': 'anger', 'kruntel': 'anger', 'rindu': 'love', 'sayangku': 'anger', 'sayangkan': 'anger', 'sayangnya': 'love', 'disayang': 'anger', 'moody': 'fear', 'rindukan': 'love', 'merindui': 'anger', 'takutkan': 'anger', 'banggakan': 'anger', 'cintakan': 'anger', 'perbuat': 'anger', 'merindukan': 'anger', 'ceraikan': 'anger', 'jumpai': 'anger', 'rindunya': 'fear', 'teringat': 'fear', 'rinduu': 'fear', 'lapar': 'fear', 'kempunan': 'fear', 'teringin': 'fear', 'kangen': 'fear', 'confuse': 'fear', 'stress': 'sadness', 'letih': 'fear', 'penat': 'fear', 'stres': 'sadness', 'mengantuk': 'fear', 'tertekan': 'sadness', 'terganggu': 'sadness', 'tertipu': 'surprise', 'keliru': 'sadness', 'mengeluh': 'sadness', 'merosot': 'anger', 'susut': 'anger', 'terjebak': 'sadness', 'terpengaruh': 'surprise', 'kesal': 'sadness', 'terkejut': 'surprise', 'bersalah': 'sadness', 'berdosa': 'fear', (continues on next page)


(continued from previous page) 'dihargai': 'sadness', 'janggal': 'anger', 'resah': 'sadness', 'kesepian': 'sadness', 'gundah': 'anger', 'goyah': 'anger', 'disakiti': 'anger', 'takjub': 'anger', 'sengsara': 'sadness', 'seram': 'fear', 'menyebalkan': 'anger', 'merana': 'sadness', 'melarat': 'anger', 'angkuh': 'anger', 'rakus': 'anger', 'terpuruk': 'anger', 'pengsan': 'surprise', 'tertido': 'anger', 'pitam': 'anger', 'terlelap': 'anger', 'terberak': 'anger', 'nanges': 'anger', 'mengamuk': 'anger', 'tdoq': 'anger', 'termuntah': 'anger', 'tidor': 'anger', 'bangga': 'surprise', 'surprise': 'surprise', 'suprise': 'anger', 'makan2': 'anger', 'attention': 'anger', 'kejutan': 'anger', 'assignment': 'fear', 'comeback': 'anger', 'chance': 'fear', 'homework': 'anger', 'appointment': 'anger', 'wtf': 'surprise', 'huh': 'anger', 'seriously': 'anger', 'omg': 'anger', 'aik': 'anger', 'wth': 'anger', 'shit': 'anger', 'apoo': 'fear', 'hah': 'anger', 'damn': 'anger', 'stun': 'surprise', 'pinafsueun': 'anger', 'neelehh': 'anger', 'rudgard': 'anger', '016344981': 'anger', 'pramaandika': 'anger', 'hamidibahawa': 'anger', 'spesialers': 'anger', 'superpignan': 'anger', '082187486748': 'anger', (continues on next page)


(continued from previous page) 'tertanya2': 'surprise', 'terperanjat': 'anger', 'cubaa': 'anger', 'stuju': 'anger', 'stayback': 'anger', 'cakaplah': 'anger', 'melebih': 'anger', 'tanyaa': 'anger', 'ngandung': 'anger'}

9.31.7 propagate graph

def propagate_graph(
    lexicon,
    wordvector,
    pool_size = 10,
    top_n = 20,
    similarity_power = 10.0,
    normalization = True,
    soft = False,
    silent = False,
):

""" Graph propagation method dapted from Velikovich, Leonid, et al. "The viability of

˓→web-derived polarity lexicons." http://www.aclweb.org/anthology/N10-1119

Parameters ------

lexicon: dict curated lexicon from expert domain, {'label1': [str], 'label2': [str]}. wordvector: object wordvector interface object. pool_size: int, optional (default=10) pick top-pool size from each lexicons. top_n: int, optional (default=20) top_n for each vectors will multiple with `similarity_power`. similarity_power: float, optional (default=10.0) extra score for `top_n`, less will generate less bias induced but high chance

˓→unbalanced outcome. normalization: bool, optional (default=True) normalize word vectors using L2 norm. L2 is good to penalize skewed vectors. soft: bool, optional (default=False) if True, a word not in the dictionary will be replaced with nearest

˓→jarowrinkler ratio. if False, it will throw an exception if a word not in the dictionary. silent: bool, optional (default=False) if True, will not print any logs.

Returns ------tuple: (labels[argmax(scores), axis = 1], scores, labels) """


[14]: %%time

      results_emotion, scores_emotion, labels_emotion = malaya.lexicon.propagate_graph(
          emotion_lexicon, wordvector, pool_size = 10
      )

populating nearest words from wordvector
populating vectors from populated nearest words
propagate graph from populated nearest words
100%|| 452/452 [00:00<00:00, 1830.24it/s]
CPU times: user 16.5 s, sys: 2.2 s, total: 18.7 s
Wall time: 11.8 s

[15]: np.unique(list(results_emotion.values()), return_counts = True)
[15]: (array(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'], dtype='

[16]: results_emotion [16]: {'sebal': 'anger', 'gesture': 'fear', 'se7': 'anger', 'ziraa': 'anger', 'mantepp': 'anger', 'mesem': 'fear', 'nggapapa': 'anger', 'maen2': 'anger', 'gacocok': 'fear', 'jeongwoo': 'anger', 'bergelora': 'anger', 'mereda': 'anger', 'skeptis': 'anger', 'gebus': 'love', 'tyrion': 'fear', 'memuncak': 'anger', 'mewabah': 'anger', 'mengenaskan': 'anger', 'kesasar': 'love', 'kepedean': 'anger', 'annoying': 'anger', 'awkward': 'fear', 'scary': 'fear', 'handsome': 'love', 'nervous': 'fear', 'cringe': 'anger', 'menyampah': 'anger', 'kelakar': 'anger', 'cute': 'love', 'cuak': 'fear', 'bodoh': 'anger', 'bangang': 'anger', (continues on next page)


(continued from previous page) 'bebal': 'anger', 'bodo': 'anger', 'noob': 'anger', 'bengap': 'anger', 'celaka': 'anger', 'biadap': 'anger', 'pukimak': 'anger', 'berang': 'anger', 'buru': 'joy', 'nerus': 'anger', 'kangsar': 'fear', 'lipis': 'anger', 'pilah': 'fear', 'besut': 'anger', 'krai': 'anger', 'klawang': 'anger', 'ketil': 'anger', 'amuk': 'anger', 'mbatin': 'love', 'sebarin': 'anger', 'sebarisan': 'fear', 'ngalami': 'joy', 'tikt': 'anger', 'diharga': 'anger', 'threesome': 'anger', 'shizuka': 'anger', 'bokondini': 'anger', 'mendidih': 'anger', 'mengental': 'anger', 'sebati': 'surprise', 'mengembang': 'anger', 'layu': 'surprise', 'kecoklatan': 'anger', 'matang': 'sadness', 'meresap': 'surprise', 'mengering': 'anger', 'direbus': 'anger', 'pengecut': 'anger', 'bajingan': 'fear', 'pembohong': 'fear', 'pecundang': 'fear', 'dungu': 'fear', 'pemberani': 'anger', 'negarawan': 'anger', 'jahil': 'anger', 'biadab': 'fear', 'provokator': 'fear', 'bengang': 'anger', 'menyirap': 'joy', 'meluat': 'surprise', 'frust': 'surprise', 'rimas': 'sadness', 'annoyed': 'anger', 'lonely': 'love', 'berdukacita': 'anger', 'menyakitimu': 'surprise', 'bersinggungan': 'anger', (continues on next page)


(continued from previous page) 'bermesra': 'anger', 'meridhoi': 'love', 'menyelubungi': 'anger', 'empukk': 'anger', 'berserban': 'anger', 'diracuni': 'anger', 'dibayangi': 'fear', 'jengkel': 'anger', 'gugup': 'anger', 'dibiasain': 'joy', 'mubazir': 'anger', 'amnesia': 'fear', 'psikopat': 'fear', 'gumoh': 'anger', 'diurusin': 'fear', 'ngangenin': 'joy', 'purging': 'joy', 'babi': 'anger', 'sial': 'surprise', 'kimak': 'surprise', 'anjing': 'fear', 'pundek': 'surprise', 'cibai': 'surprise', 'setan': 'fear', 'lembu': 'anger', 'pedar': 'anger', 'sanwya': 'love', 'qabaya': 'love', '5pac': 'love', 'wa082336409906': 'love', 'mpibg': 'love', 'honachahthu': 'love', 'unieleven': 'love', 'mengepilkan': 'surprise', 'ciknorzaidi': 'love', 'benci': 'anger', 'jijik': 'fear', 'kagum': 'surprise', 'geli': 'fear', 'insecure': 'sadness', 'geram': 'sadness', 'respect': 'love', 'jealous': 'anger', 'marah': 'anger', 'maki': 'surprise', 'merajuk': 'surprise', 'marah2': 'surprise', 'perli': 'joy', 'jeles': 'love', 'tegur': 'surprise', 'kecam': 'fear', 'cemburu': 'sadness', 'bitter': 'surprise', 'ngeri': 'fear', 'serem': 'anger', 'kocak': 'fear', 'mantep': 'fear', (continues on next page)


(continued from previous page) 'miris': 'anger', 'ngeselin': 'fear', 'nyesek': 'fear', 'kesel': 'sadness', 'sebel': 'anger', 'lebay': 'fear', 'phobia': 'fear', 'mendem': 'joy', 'berideologi': 'anger', 'niru': 'anger', 'nyicip': 'anger', 'ngerawat': 'love', 'riweuh': 'joy', 'nmun': 'anger', 'ngancam': 'surprise', 'bencong': 'fear', 'anxiety': 'fear', 'glasses': 'love', 'manners': 'fear', 'satan': 'fear', 'popularity': 'love', 'curl': 'surprise', 'impossible': 'fear', 'mayb': 'love', 'sperm': 'anger', 'nyumpah': 'fear', 'fitnah': 'fear', 'hoax': 'fear', 'provokasi': 'anger', 'kebencian': 'love', 'dusta': 'love', 'hoaks': 'anger', 'kebohongan': 'love', 'bohong': 'anger', 'rasis': 'sadness', 'ngibul': 'anger', 'horror': 'fear', 'horor': 'joy', 'romance': 'love', 'day6': 'anger', 'dokumenter': 'anger', 'porno': 'anger', 'anime': 'joy', 'sinetron': 'joy', 'drakor': 'joy', 'dangdut': 'joy', 'takut': 'fear', 'risau': 'surprise', 'malu': 'fear', 'khawatir': 'sadness', 'segan': 'anger', 'kecewa': 'sadness', 'takot': 'surprise', 'bimbang': 'sadness', 'takutnya': 'love', 'sedih': 'sadness', 'panic': 'fear', (continues on next page)


(continued from previous page) 'loud': 'love', 'impressed': 'surprise', 'expected': 'surprise', 'dying': 'joy', 'rush': 'surprise', 'shitty': 'anger', 'smoke': 'surprise', 'suck': 'love', 'cheap': 'fear', 'emo': 'anger', 'boring': 'joy', 'gelabah': 'surprise', 'ngantok': 'surprise', 'syok': 'joy', 'seronok': 'joy', 'busy': 'joy', 'serabut': 'sadness', 'syiok': 'surprise', 'sendu': 'joy', 'riang': 'joy', 'ceria': 'joy', 'takbir': 'joy', 'bersuka': 'love', 'emma': 'love', 'barakah': 'anger', 'telemovie': 'joy', 'riuh': 'surprise', 'ria': 'joy', 'khutbah': 'joy', 'sebak': 'sadness', 'excited': 'joy', 'terharu': 'surprise', 'terliur': 'surprise', 'girang': 'joy', 'ditikung': 'anger', 'ambis': 'anger', 'rafa': 'anger', 'digangguin': 'anger', 'nyiksa': 'fear', 'maruk': 'love', 'tamvan': 'anger', 'pengap': 'anger', 'iklas': 'love', 'puas': 'joy', 'muak': 'sadness', 'kenyang': 'joy', 'lega': 'joy', 'bosan': 'love', 'berbaloi': 'sadness', 'berpuas': 'sadness', 'lelah': 'sadness', 'bahagia': 'joy', 'menyenangkan': 'sadness', 'gelisah': 'sadness', 'nyaman': 'sadness', 'indah': 'sadness', 'sukses': 'sadness', (continues on next page)


(continued from previous page) 'sehat': 'sadness', 'damai': 'sadness', 'suka': 'joy', 'sukanya': 'love', 'doyan': 'fear', 'demen': 'anger', 'suke': 'love', 'gasuka': 'love', 'gemar': 'love', 'sukaa': 'love', 'prefer': 'love', 'happy': 'joy', 'hepi': 'love', 'wish': 'love', 'nice': 'surprise', 'cerita': 'joy', 'citer': 'surprise', 'cite': 'surprise', 'crita': 'surprise', 'kisah': 'love', 'percakapan': 'fear', 'tweet': 'love', 'drama': 'joy', 'lagu': 'joy', 'ceramah': 'surprise', 'cinta': 'love', 'kebahagiaan': 'sadness', 'cintanya': 'anger', 'cintaku': 'sadness', 'persahabatan': 'joy', 'cintamu': 'anger', 'kesabaran': 'fear', 'dendam': 'sadness', 'kesedihan': 'sadness', 'asa': 'sadness', 'baby': 'love', 'daddy': 'love', 'mira': 'love', 'princess': 'love', 'bella': 'joy', 'farah': 'surprise', 'mommy': 'love', 'sister': 'surprise', 'mummy': 'love', 'lisa': 'joy', 'love': 'love', 'luv': 'love', 'hate': 'surprise', 'thought': 'surprise', 'mean': 'surprise', 'want': 'joy', 'see': 'surprise', 'need': 'joy', 'hope': 'surprise', 'peace': 'anger', 'syang': 'love', 'noi': 'fear', (continues on next page)


(continued from previous page) 'bilang2': 'anger', 'syng': 'anger', 'mut': 'fear', 'ribbey': 'anger', 'seneng2': 'anger', 'butoset': 'anger', 'manly': 'anger', 'twet': 'anger', 'syg': 'love', 'sayangg': 'love', 'sayang': 'love', 'bby': 'surprise', 'cntik': 'anger', 'knl': 'surprise', 'anon': 'anger', 'sistur': 'surprise', 'sayang2': 'surprise', 'bgus': 'anger', 'rindukn': 'love', 'ajeb2an': 'surprise', 'hshakjsjsbs': 'anger', 'miliknyamencatat': 'anger', 'p6a': 'anger', 'ahsjahhaa': 'surprise', 'diwajibk': 'anger', 'protese': 'surprise', 'botaqin': 'surprise', 'kruntel': 'anger', 'rindu': 'love', 'sayangku': 'anger', 'sayangkan': 'love', 'sayangnya': 'fear', 'disayang': 'joy', 'moody': 'surprise', 'rindukan': 'love', 'merindui': 'surprise', 'takutkan': 'surprise', 'banggakan': 'surprise', 'cintakan': 'surprise', 'perbuat': 'surprise', 'merindukan': 'joy', 'ceraikan': 'surprise', 'jumpai': 'anger', 'rindunya': 'surprise', 'teringat': 'surprise', 'rinduu': 'surprise', 'lapar': 'sadness', 'kempunan': 'surprise', 'teringin': 'joy', 'kangen': 'joy', 'confuse': 'anger', 'stress': 'sadness', 'letih': 'joy', 'penat': 'joy', 'stres': 'sadness', 'mengantuk': 'joy', 'tertekan': 'sadness', (continues on next page)


(continued from previous page) 'terganggu': 'sadness', 'tertipu': 'sadness', 'keliru': 'sadness', 'mengeluh': 'sadness', 'merosot': 'sadness', 'susut': 'surprise', 'terjebak': 'sadness', 'terpengaruh': 'sadness', 'kesal': 'sadness', 'terkejut': 'surprise', 'bersalah': 'sadness', 'berdosa': 'anger', 'dihargai': 'sadness', 'janggal': 'surprise', 'resah': 'sadness', 'kesepian': 'sadness', 'gundah': 'surprise', 'goyah': 'surprise', 'disakiti': 'anger', 'takjub': 'anger', 'sengsara': 'sadness', 'seram': 'anger', 'menyebalkan': 'fear', 'merana': 'surprise', 'melarat': 'surprise', 'angkuh': 'anger', 'rakus': 'anger', 'terpuruk': 'anger', 'pengsan': 'surprise', 'tertido': 'anger', 'pitam': 'surprise', 'terlelap': 'anger', 'terberak': 'surprise', 'nanges': 'surprise', 'mengamuk': 'surprise', 'tdoq': 'anger', 'termuntah': 'surprise', 'tidor': 'anger', 'bangga': 'anger', 'surprise': 'surprise', 'suprise': 'love', 'makan2': 'fear', 'attention': 'fear', 'kejutan': 'fear', 'assignment': 'fear', 'comeback': 'fear', 'chance': 'love', 'homework': 'fear', 'appointment': 'fear', 'wtf': 'surprise', 'huh': 'love', 'seriously': 'love', 'omg': 'love', 'aik': 'love', 'wth': 'love', 'shit': 'anger', 'apoo': 'anger', (continues on next page)


(continued from previous page) 'hah': 'anger', 'damn': 'love', 'stun': 'surprise', 'pinafsueun': 'anger', 'neelehh': 'anger', 'rudgard': 'anger', '016344981': 'anger', 'pramaandika': 'anger', 'hamidibahawa': 'love', 'spesialers': 'anger', 'superpignan': 'anger', '082187486748': 'anger', 'tertanya2': 'anger', 'terperanjat': 'anger', 'cubaa': 'anger', 'stuju': 'anger', 'stayback': 'anger', 'cakaplah': 'anger', 'melebih': 'anger', 'tanyaa': 'anger', 'ngandung': 'anger'}
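Since both propagation methods label the same vocabulary, a quick hedged sketch to measure how often they agree; the variable names are illustrative, and both calls reuse the lexicon and wordvector from above:

prob_results, _, _ = malaya.lexicon.propagate_probabilistic(
    emotion_lexicon, wordvector, pool_size = 10
)
graph_results, _, _ = malaya.lexicon.propagate_graph(
    emotion_lexicon, wordvector, pool_size = 10
)

# fraction of shared words that received the same label from both methods
shared = set(prob_results) & set(graph_results)
agreement = sum(prob_results[w] == graph_results[w] for w in shared) / len(shared)
print('label agreement on %d shared words: %.2f%%' % (len(shared), agreement * 100))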

9.32 Paraphrase

This tutorial is available as an IPython notebook at Malaya/example/paraphrase.

This module is only trained on standard language structure, so it is not safe to use it for local language structure.

[1]: %%time

     import malaya
     from pprint import pprint

CPU times: user 4.83 s, sys: 714 ms, total: 5.54 s
Wall time: 4.52 s

9.32.1 List available Transformer model

[2]: malaya.paraphrase.available_transformer()
INFO:root:tested on ParaSCI test set.
[2]:           Size (MB)  Quantized Size (MB)      BLEU
     t5           1250.0                481.0  0.608904
     small-t5      355.6                195.0  0.617456
     tiny-t5       208.0                103.0  0.460321


9.32.2 Load Transformer model

def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
    """
    Load Malaya transformer encoder-decoder model to generate a paraphrase given a string.

    Parameters
    ----------
    model : str, optional (default='small-t5')
        Model architecture supported. Allowed values:

        * ``'t5'`` - T5 BASE parameters.
        * ``'small-t5'`` - T5 SMALL parameters.
        * ``'tiny-t5'`` - T5 TINY parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster, it totally depends on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `t5` in model, will return `malaya.model.t5.Paraphrase`.
    """

[6]: t5 = malaya.paraphrase.transformer(model = 'small-t5')
INFO:root:running paraphrase-v2/small-t5 using device /device:CPU:0

9.32.3 Paraphrase simple string

We only provide the greedy_decoder method for T5 models,

def greedy_decoder(self, strings: List[str]):
    """
    paraphrase strings.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[str]
    """

[7]: string = "Beliau yang juga saksi pendakwaan kesembilan berkata, ia bagi mengelak daripada wujud isu digunakan terhadap Najib."
     pprint(string)
('Beliau yang juga saksi pendakwaan kesembilan berkata, ia bagi mengelak '
 'daripada wujud isu digunakan terhadap Najib.')


[8]: pprint(t5.greedy_decoder([string]))
['Beliau yang juga saksi pendakwaan kesembilan berkata, ia bagi mengelak wujud '
 'isu yang digunakan terhadap Najib.']

[11]: string = """
PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu.

Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin.

Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir.

"Jadi ini agak berlawanan dengan keputusan yang kita sudah buat. Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan.

"Semua keputusan mesti dibuat melalui parti. Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti.

"Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM. Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro.

Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah.

Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah.

Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT.

"Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya.

Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional.

"Kenyataan media bukanlah keputusan rasmi. Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat. Kita catat di dalam minit apa yang berlaku di dalam mesyuarat," katanya.
"""

[13]: import re

      # minimum cleaning, just simply to remove newlines.
      def cleaning(string):
          string = string.replace('\n', ' ')
          string = re.sub(r'[ ]+', ' ', string).strip()
          return string



      string = cleaning(string)
      splitted = malaya.text.function.split_into_sentences(string)
      splitted
[13]: ['PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu.',
 'Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin.',
 'Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir.',
 '"Jadi ini agak berlawanan dengan keputusan yang kita sudah buat.',
 'Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan.',
 '"Semua keputusan mesti dibuat melalui parti.',
 'Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti.',
 '"Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM.',
 'Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro.',
 'Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah.',
 'Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah.',
 'Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT.',
 '"Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya.',
 'Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional.',
 '"Kenyataan media bukanlah keputusan rasmi.',
 'Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat.',
 'Kita catat di dalam minit apa yang berlaku di dalam mesyuarat," katanya.']

[20]: t5.greedy_decoder(splitted)
[20]: ['PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak pada mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu.',
 'Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin.',
 'Bekas Setiausaha Agung Bersatu, Datuk Marzuki Yahya, berkata pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir.',
 '"Jadi ini agak berlawanan dengan keputusan yang kita buat.',
 'Saya tidak faham bagaimana Jabatan Pendaftaran Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah membuat keputusan di dalam mesyuarat, bukan seorang dua yang membuat keputusan.',
 '"Semua keputusan mesti dibuat melalui parti.',
 'Walau apa pun perbincangan dibuat di luar keputusan mesyuarat, ini bukan keputusan parti.',
 '"Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM.',
 'Ia sepatutnya dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro.',
 'Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah.',
 'Laporan itu juga menyatakan kedudukan Muhyiddin Yassin memangku jawatan itu juga sah.',
 'Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT.',
 '"Fasal yang disebut itu terpakai jika berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya.',
 'Marzuki juga mempersoalkan kenyataan media yang dibuat oleh beberapa pemimpin parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional.',
 '"Kenyataan media bukanlah keputusan rasmi.',
 'Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak mengubah keputusan yang sudah dibuat di dalam mesyuarat.',
 'Kami mencatat dalam minit apa yang terjadi di dalam pertemuan," katanya.']
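The split-clean-paraphrase-rejoin steps above can be wrapped into one helper. This is only a sketch of the workflow already shown; joining the paraphrased sentences with a single space is an assumption:

def paraphrase_long_text(model, string):
    # reuse the minimal cleaning and sentence splitter from the cells above
    string = cleaning(string)
    splitted = malaya.text.function.split_into_sentences(string)
    # paraphrase sentence by sentence, then join back into one document
    return ' '.join(model.greedy_decoder(splitted))

# paraphrased = paraphrase_long_text(t5, string)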

9.33 Emotion Analysis

This tutorial is available as an IPython notebook at Malaya/example/emotion.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya

CPU times: user 4.7 s, sys: 630 ms, total: 5.33 s
Wall time: 4.3 s
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

[2]: anger_text = 'babi la company ni, aku dah la penat datang dari jauh'
     fear_text = 'takut doh tengok cerita hantu tadi'
     happy_text = 'bestnya dapat tidur harini, tak payah pergi kerja'
     love_text = 'aku sayang sgt dia dah doh'
     sadness_text = 'kecewa tengok kerajaan baru ni, janji ape pun tak dapat'
     surprise_text = 'sakit jantung aku, terkejut dengan cerita hantu tadi'


9.33.1 Get label

[3]: malaya.emotion.label
[3]: ['anger', 'fear', 'happy', 'love', 'sadness', 'surprise']

All models follow the same sklearn-style interface: predict to get a batch of labels, predict_proba to get a batch of probabilities.

9.33.2 Load multinomial model

def multinomial(**kwargs):
    """
    Load multinomial emotion model.

    Returns
    -------
    result : malaya.model.ml.Bayes class
    """

[4]: model = malaya.emotion.multinomial()

Predict batch of strings

def predict(self, strings: List[str]):
    """
    classify list of strings.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[str]
    """

[5]: model.predict([anger_text])
[5]: ['anger']

[6]: model.predict(
         [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text]
     )
[6]: ['anger', 'fear', 'happy', 'love', 'sadness', 'surprise']


Predict batch of strings with probability

def predict_proba(self, strings: List[str]):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[dict[str, float]]
    """

[7]: model.predict_proba( [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text] ) [7]: [{'anger': 0.32948272681734814, 'fear': 0.13959708810717708, 'happy': 0.14671455153216045, 'love': 0.12489192355631354, 'sadness': 0.1285972541671178, 'surprise': 0.13071645581988448}, {'anger': 0.11379406005377896, 'fear': 0.4006934391283133, 'happy': 0.11389665647702245, 'love': 0.12481915233837086, 'sadness': 0.0991261507380643, 'surprise': 0.14767054126445014}, {'anger': 0.14667998117610198, 'fear': 0.1422732633232615, 'happy': 0.29984520430807293, 'love': 0.1409005078277281, 'sadness': 0.13374705318404811, 'surprise': 0.13655399018078768}, {'anger': 0.1590563839629243, 'fear': 0.14687344690114268, 'happy': 0.1419948160674701, 'love': 0.279550441361504, 'sadness': 0.1285927908584157, 'surprise': 0.14393212084854254}, {'anger': 0.13425914937312508, 'fear': 0.12053328146716755, 'happy': 0.14923350911233682, 'love': 0.10289492749919464, 'sadness': 0.36961334597699913, 'surprise': 0.12346578657117815}, {'anger': 0.06724850384395685, 'fear': 0.1283628050361525, 'happy': 0.05801958643852813, 'love': 0.06666524240157067, 'sadness': 0.06537667186293224, 'surprise': 0.6143271904168589}]


9.33.3 List available Transformer models

[8]: malaya.emotion.available_transformer()
INFO:root:tested on 20% test set.
[8]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.6               111.00          0.99786       0.99773         0.99779
     tiny-bert         57.4                15.40          0.99692       0.99696         0.99694
     albert            48.6                12.80          0.99740       0.99773         0.99757
     tiny-albert       22.4                 5.98          0.99325       0.99378         0.99351
     xlnet            446.5               118.00          0.99773       0.99775         0.99774
     alxlnet           46.8                13.30          0.99663       0.99697         0.99680

Make sure you check the accuracy chart here first before selecting a model, https://malaya.readthedocs.io/en/latest/Accuracy.html#emotion-analysis You might want to use Tiny-Albert, a very small model at only 22.4MB, whose accuracy is still top notch.

9.33.4 Load Transformer model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer emotion model.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster, it totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.softmax.transformer function
    """

[11]: model = malaya.emotion.transformer(model = 'albert')
INFO:tensorflow:loading sentence piece model


INFO:tensorflow:loading sentence piece model

9.33.5 Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model, it totally depends on the machine.

[12]: quantized_model = malaya.emotion.transformer(model = 'albert', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model
INFO:tensorflow:loading sentence piece model

Predict batch of strings

def predict(self, strings: List[str]):
    """
    classify list of strings.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[str]
    """

[13]: model.predict(
          [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text]
      )
[13]: ['anger', 'fear', 'happy', 'love', 'sadness', 'surprise']

[14]: quantized_model.predict(
          [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text]
      )
[14]: ['anger', 'fear', 'happy', 'love', 'sadness', 'surprise']
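If you are curious how much quantization changes latency on your own machine, a rough timing sketch; as noted above, the numbers depend entirely on the hardware, so treat this as a measurement recipe rather than a claim:

import time

def average_latency(m, strings, n = 5):
    # average wall time of n predict calls
    start = time.time()
    for _ in range(n):
        m.predict(strings)
    return (time.time() - start) / n

texts = [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text]
print('float32 :', average_latency(model, texts))
print('int8    :', average_latency(quantized_model, texts))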

Predict batch of strings with probability

def predict_proba(self, strings: List[str]):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[dict[str, float]]
    """

[16]: model.predict_proba( [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text] ) [16]: [{'anger': 0.9998901, 'fear': 3.2524113e-05, 'happy': 2.620931e-05, 'love': 2.2871463e-05, 'sadness': 9.782951e-06, 'surprise': 1.8502667e-05}, {'anger': 1.6941378e-05, 'fear': 0.9999205, 'happy': 9.070281e-06, 'love': 2.044179e-05, 'sadness': 6.7731107e-06, 'surprise': 2.6314676e-05}, {'anger': 0.15370166, 'fear': 0.0013852724, 'happy': 0.8268689, 'love': 0.011433229, 'sadness': 0.0011807577, 'surprise': 0.005430276}, {'anger': 1.2597201e-05, 'fear': 1.7600481e-05, 'happy': 9.667115e-06, 'love': 0.9999331, 'sadness': 1.3735416e-05, 'surprise': 1.3399296e-05}, {'anger': 1.9176923e-05, 'fear': 1.1163729e-05, 'happy': 6.353941e-06, 'love': 7.004002e-06, 'sadness': 0.99994576, 'surprise': 1.0511084e-05}, {'anger': 5.8739704e-05, 'fear': 1.9771342e-05, 'happy': 1.8316741e-05, 'love': 2.2319455e-05, 'sadness': 3.646786e-05, 'surprise': 0.9998443}]

[4]: quantized_model.predict_proba( [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text] ) [4]: [{'anger': 0.99988353, 'fear': 3.5938003e-05, 'happy': 2.7778764e-05, 'love': 2.3541537e-05, 'sadness': 9.574292e-06, 'surprise': 1.9607493e-05}, (continues on next page)


(continued from previous page) {'anger': 1.6855265e-05, 'fear': 0.9999219, 'happy': 9.185196e-06, 'love': 2.0216348e-05, 'sadness': 6.6679663e-06, 'surprise': 2.5186611e-05}, {'anger': 0.22842072, 'fear': 0.001628682, 'happy': 0.7477462, 'love': 0.014303649, 'sadness': 0.0013838055, 'surprise': 0.00651699}, {'anger': 1.28296715e-05, 'fear': 1.7833345e-05, 'happy': 9.577061e-06, 'love': 0.9999324, 'sadness': 1.3832815e-05, 'surprise': 1.34745715e-05}, {'anger': 1.9776813e-05, 'fear': 1.1116885e-05, 'happy': 6.3422367e-06, 'love': 6.905633e-06, 'sadness': 0.9999455, 'surprise': 1.0316757e-05}, {'anger': 5.8218586e-05, 'fear': 2.07504e-05, 'happy': 1.8061248e-05, 'love': 2.1852256e-05, 'sadness': 3.5944133e-05, 'surprise': 0.99984515}]

Open emotion visualization dashboard

def predict_words(
    self, string: str, method: str = 'last', visualization: bool = True
):
    """
    classify words.

    Parameters
    ----------
    string : str
    method : str, optional (default='last')
        Attention layer supported. Allowed values:

        * ``'last'`` - attention from last layer.
        * ``'first'`` - attention from first layer.
        * ``'mean'`` - average attentions from all layers.

    visualization: bool, optional (default=True)
        If True, it will open the visualization dashboard.

    Returns
    -------
    result: dict
    """


By default, when you call predict_words it will open a browser with the visualization dashboard; you can disable this with visualization=False.
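A minimal sketch of capturing the result instead of opening the browser; the exact keys of the returned dict are not shown in this notebook, so treat them as version dependent:

# returns the attention result as a dict instead of opening the dashboard
r = model.predict_words(sadness_text, visualization = False)
print(list(r.keys()) if isinstance(r, dict) else r)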

[ ]: model.predict_words(sadness_text)

[18]: from IPython.core.display import Image, display

display(Image('emotion-dashboard.png', width=800))

9.33.6 Vectorize

Let's say you want to visualize sentence / word level in a lower dimension, you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    vectorize list of strings.

    Parameters
    ----------
    strings: List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result: np.array
    """


Sentence level

[5]: texts = [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text]
     r = quantized_model.vectorize(texts, method = 'first')

[6]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(r)
     tsne.shape
[6]: (6, 2)

[8]: plt.figure(figsize = (7, 7))
     plt.scatter(tsne[:, 0], tsne[:, 1])
     labels = texts
     for label, x, y in zip(
         labels, tsne[:, 0], tsne[:, 1]
     ):
         label = (
             '%s, %.3f' % (label[0], label[1])
             if isinstance(label, list)
             else label
         )
         plt.annotate(
             label,
             xy = (x, y),
             xytext = (0, 0),
             textcoords = 'offset points',
         )
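Besides t-SNE, the sentence vectors in r can be compared directly; a small sketch with sklearn's cosine similarity, under the assumption that r is the (6, dim) array returned by the vectorize call above:

from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarity between the six example sentences
sim = cosine_similarity(r)
# similarity of the anger example against every other example
print(sim[0])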


Word level

[9]: r = quantized_model.vectorize(texts, method = 'word')

[10]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[11]: tsne = TSNE().fit_transform(y)
      tsne.shape
[11]: (49, 2)

[12]: plt.figure(figsize = (7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(
          labels, tsne[:, 0], tsne[:, 1]
      ):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy = (x, y),
              xytext = (0, 0),
              textcoords = 'offset points',
          )


Pretty good, the model is able to recognize the cluster at the top right as the surprise emotion.

9.33.7 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

[4]: multinomial = malaya.emotion.multinomial()

[6]: malaya.stack.predict_stack([multinomial, model], [anger_text])
[6]: [{'anger': 0.5739743139312979,
  'fear': 0.002130791264743306,
  'happy': 0.0019609404077070573,
  'love': 0.0016901068202818533,
  'sadness': 0.001121633002361737,
  'surprise': 0.0015551851123993595}]

[7]: malaya.stack.predict_stack([multinomial, model], [anger_text, sadness_text])
[7]: [{'anger': 0.5739743139312979,
  'fear': 0.002130791264743306,
  'happy': 0.0019609404077070573,
  'love': 0.0016901068202818533,
  'sadness': 0.001121633002361737,
  'surprise': 0.0015551858768478731},
 {'anger': 0.0016541454680912208,
  'fear': 0.0011659984542562358,
  'happy': 0.001014179551389293,
  'love': 0.0008495638318424924,
  'sadness': 0.5854571761989077,
  'surprise': 0.001159149836587787}]


9.34 Language Detection

This tutorial is available as an IPython notebook at Malaya/example/language-detection.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
     import fasttext

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_addons/utils/ensure_tf_install.py:68: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.2.0 and strictly below 2.4.0 (nightly versions are not supported).
The versions of TensorFlow you are currently using is 2.4.1 and is not supported.
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version.
You can find the compatibility matrix in TensorFlow Addon's readme: https://github.com/tensorflow/addons
  UserWarning,
CPU times: user 5.17 s, sys: 990 ms, total: 6.16 s
Wall time: 6.67 s

9.34.1 List available language detected

[2]: malaya.language_detection.label
[2]: ['eng', 'ind', 'malay', 'manglish', 'other', 'rojak']

[4]: chinese_text = 'Muiriel'
     english_text = 'i totally love it man'
     indon_text = 'menjabat saleh perombakan menjabat periode komisi energi fraksi partai pengurus partai periode periode partai terpilih periode menjabat komisi perdagangan investasi persatuan periode'
     malay_text = 'beliau berkata program Inisitif Peduli Rakyat (IPR) yang diperkenalkan oleh kerajaan negeri Selangor lebih besar sumbangannya'
     socialmedia_malay_text = 'nti aku tengok dulu tiket dari kl pukul berapa ada nahh'
     socialmedia_indon_text = 'saking kangen papanya pas vc anakku nangis'
     rojak_text = 'jadi aku tadi bikin ini gengs dan dijual haha salad only k dan haha drinks only k'
     manglish_text = 'power lah even shopback come to edmw riao'

9.34.2 Load Fast-text model

Make sure fasttext is already installed; if not, simply,

pip install fasttext

def fasttext(quantized: bool = True, **kwargs):
    """
    Load Fasttext language detection model.
    Original size is 353MB, quantized size 31.1MB.

    Parameters
    ----------
    quantized: bool, optional (default=True)
        if True, load quantized fasttext model. Else, load original fasttext model.

    Returns
    -------
    result : malaya.model.ml.LanguageDetection class
    """

In this example, I am going to compare with the pretrained fasttext from Facebook, https://fasttext.cc/docs/en/language-identification.html
Simply download the pretrained model,

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz

[4]: model = fasttext.load_model('lid.176.ftz')
     fast_text = malaya.language_detection.fasttext()

[5]: model.predict([''])
[5]: ([['__label__km']], array([[0.99841499]]))

[6]: fast_text.predict([''])
[6]: ['other']

Language detection in Malaya is not trying to tackle every possible language in the world, just hyperlocal languages.
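One hedged pattern for combining both detectors: keep Malaya's hyperlocal labels, and fall back to Facebook's 176-language model whenever Malaya answers 'other'. detect below is a hypothetical helper built on the two models loaded above:

def detect(string):
    label = fast_text.predict([string])[0]
    if label != 'other':
        return label
    # fall back to the Facebook lid.176 model for non-hyperlocal text
    fb_labels, _ = model.predict([string])
    return fb_labels[0][0].replace('__label__', '')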

[7]: model.predict(['suka makan ayam dan daging'])
[7]: ([['__label__id']], array([[0.6334154]]))


[8]: fast_text.predict_proba(['suka makan ayam dan daging']) [8]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.8817721009254456, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}]

[9]: model.predict(malay_text)
[9]: (('__label__ms',), array([0.57101035]))

[10]: fast_text.predict_proba([malay_text]) [10]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.9999504089355469, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}]

[11]: model.predict(socialmedia_malay_text)
[11]: (('__label__id',), array([0.7870034]))

[12]: fast_text.predict_proba([socialmedia_malay_text]) [12]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.9996305704116821, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}]

[13]: model.predict(socialmedia_indon_text)
[13]: (('__label__fr',), array([0.2912012]))

[14]: fast_text.predict_proba([socialmedia_indon_text]) [14]: [{'eng': 0.0, 'ind': 1.0000293254852295, 'malay': 0.0, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}]

[15]: model.predict(rojak_text)
[15]: (('__label__id',), array([0.87948251]))

[16]: fast_text.predict_proba([rojak_text]) [16]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.0, 'manglish': 0.0, (continues on next page)


(continued from previous page) 'other': 0.0, 'rojak': 0.9994134306907654}]

[17]: model.predict(manglish_text)
[17]: (('__label__en',), array([0.89707506]))

[18]: fast_text.predict_proba([manglish_text]) [18]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.0, 'manglish': 1.00004243850708, 'other': 0.0, 'rojak': 0.0}]

[19]: model.predict(chinese_text)
[19]: (('__label__zh',), array([0.97311586]))

[20]: fast_text.predict_proba([chinese_text]) [20]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.0, 'manglish': 0.0, 'other': 0.9921814203262329, 'rojak': 0.0}]

[21]: fast_text.predict_proba([indon_text,malay_text]) [21]: [{'eng': 0.0, 'ind': 1.0000287294387817, 'malay': 0.0, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}, {'eng': 0.0, 'ind': 0.0, 'malay': 0.9999504089355469, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}]

9.34.3 Load Deep learning model

The deep learning model is slightly more accurate than the fast-text model; you can check the accuracy comparison here, https://malaya.readthedocs.io/en/latest/Accuracy.html#language-detection

def deep_model(quantized: bool = False, **kwargs):
    """
    Load deep learning language detection model.
    Original size is 51.2MB, quantized size 12.8MB.

    Parameters
    ----------
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster, it totally depends on the machine.

    Returns
    -------
    result : malaya.model.tf.DeepLang class
    """

[5]: deep = malaya.language_detection.deep_model()
     quantized_deep = malaya.language_detection.deep_model(quantized = True)

[6]: deep.predict_proba([indon_text]) [6]: [{'eng': 3.6145184e-06, 'ind': 0.9998913, 'malay': 5.4685424e-05, 'manglish': 5.768742e-09, 'other': 5.8103424e-06, 'rojak': 4.4987162e-05}]

[7]: quantized_deep.predict_proba([indon_text]) [7]: [{'eng': 3.6145184e-06, 'ind': 0.9998913, 'malay': 5.4685424e-05, 'manglish': 5.768742e-09, 'other': 5.8103424e-06, 'rojak': 4.4987162e-05}]

[24]: deep.predict_proba([malay_text]) [24]: [{'eng': 9.500837e-11, 'ind': 0.0004703698, 'malay': 0.9991295, 'manglish': 1.602048e-13, 'other': 1.9133091e-07, 'rojak': 0.0004000054}]

[8]: quantized_deep.predict_proba([malay_text]) [8]: [{'eng': 9.500829e-11, 'ind': 0.00047036994, 'malay': 0.99912965, 'manglish': 1.6020499e-13, 'other': 1.9133095e-07, 'rojak': 0.00040000546}]

[25]: deep.predict_proba([indon_text,malay_text]) [25]: [{'eng': 3.6145207e-06, 'ind': 0.9998909, 'malay': 5.468535e-05, 'manglish': 5.7687397e-09, 'other': 5.8103406e-06, 'rojak': 4.4987148e-05}, {'eng': 9.500837e-11, (continues on next page)


(continued from previous page) 'ind': 0.0004703698, 'malay': 0.9991295, 'manglish': 1.602048e-13, 'other': 1.9133091e-07, 'rojak': 0.0004000056}]

[9]: quantized_deep.predict_proba([indon_text,malay_text]) [9]: [{'eng': 3.614522e-06, 'ind': 0.9998913, 'malay': 5.4685373e-05, 'manglish': 5.768742e-09, 'other': 5.8103424e-06, 'rojak': 4.4987162e-05}, {'eng': 9.500829e-11, 'ind': 0.00047036994, 'malay': 0.99912965, 'manglish': 1.6020499e-13, 'other': 1.9133095e-07, 'rojak': 0.0004000057}]

[26]: deep.predict_proba([socialmedia_malay_text]) [26]: [{'eng': 1.4520887e-09, 'ind': 0.0064318455, 'malay': 0.9824693, 'manglish': 2.1923141e-13, 'other': 1.06363805e-05, 'rojak': 0.0110881105}]

[10]: quantized_deep.predict_proba([socialmedia_malay_text]) [10]: [{'eng': 1.4520903e-09, 'ind': 0.006431847, 'malay': 0.98246956, 'manglish': 2.1923168e-13, 'other': 1.0636383e-05, 'rojak': 0.011088113}]

[27]: deep.predict_proba([socialmedia_indon_text]) [27]: [{'eng': 4.0632068e-07, 'ind': 0.9999995, 'malay': 6.871639e-10, 'manglish': 7.4285925e-11, 'other': 1.5928721e-07, 'rojak': 4.892652e-10}]

[28]: deep.predict_proba([rojak_text, malay_text]) [28]: [{'eng': 0.0040922514, 'ind': 0.02200061, 'malay': 0.0027574676, 'manglish': 9.336553e-06, 'other': 0.00023811469, 'rojak': 0.97090226}, {'eng': 9.500837e-11, (continues on next page)


(continued from previous page) 'ind': 0.0004703698, 'malay': 0.9991295, 'manglish': 1.602048e-13, 'other': 1.9133091e-07, 'rojak': 0.0004000056}]

9.35 NSFW Detection

This tutorial is available as an IPython notebook at Malaya/example/nsfw.

Pretty simple and straightforward, just to detect whether a text is NSFW or not.

[1]: %%time
     import malaya

CPU times: user 4.05 s, sys: 741 ms, total: 4.79 s
Wall time: 4.59 s

9.35.1 Get label

[2]: malaya.nsfw.label
[2]: ['sex', 'gambling', 'negative']

9.35.2 Load lexicon model

Pretty naive but really effective; the lexicon was gathered at Malay-Dataset/corpus/nsfw.

def lexicon(**kwargs):
    """
    Load Lexicon NSFW model.

    Returns
    -------
    result : malaya.text.lexicon.nsfw.Lexicon class
    """

[3]: lexicon_model = malaya.nsfw.lexicon()

[4]: string1 = 'xxx sgt panas, best weh'
     string2 = 'jmpa dekat kl sentral'
     string3 = 'Rolet Dengan Wang Sebenar'


Predict batch of strings

[5]: lexicon_model.predict([string1, string2, string3])
[5]: ['sex', 'negative', 'gambling']

9.35.3 Load multinomial model

All model interfaces follow the sklearn interface starting from v3.4,

def multinomial(**kwargs):
    """
    Load multinomial NSFW model.

    Returns
    -------
    result : malaya.model.ml.Bayes class
    """

[7]: model = malaya.nsfw.multinomial()

Predict batch of strings

[8]: model.predict([string1, string2, string3])
[8]: ['sex', 'negative', 'gambling']

Predict batch of strings with probability

[9]: model.predict_proba([string1, string2, string3])
[9]: [{'sex': 0.9357058034930408,
  'gambling': 0.02616353532998711,
  'negative': 0.03813066117697173},
 {'sex': 0.027541900360621846,
  'gambling': 0.03522626245360637,
  'negative': 0.9372318371857732},
 {'sex': 0.01865380888750343,
  'gambling': 0.9765340760395791,
  'negative': 0.004812115072918792}]
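A simple hedged way to turn these probabilities into a boolean flag; the 0.7 threshold is an arbitrary assumption you should tune on your own data:

def is_nsfw(string, threshold = 0.7):
    proba = model.predict_proba([string])[0]
    label = max(proba, key = proba.get)
    # anything confidently labelled other than 'negative' is treated as NSFW
    return label != 'negative' and proba[label] >= threshold

print(is_nsfw(string1), is_nsfw(string2), is_nsfw(string3))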

9.36 Relevancy Analysis

This tutorial is available as an IPython notebook at Malaya/example/relevancy.

This module is only trained on standard language structure, so it is not safe to use it for local language structure.


[1]: %%time
     import malaya

CPU times: user 4.19 s, sys: 554 ms, total: 4.74 s
Wall time: 3.84 s

9.36.1 Explanation

Positive relevancy: the article or piece of text is relevant; the tendency is high that it is not fake news. It can carry a positive or negative sentiment.
Negative relevancy: the article or piece of text is not relevant; the tendency is high that it is fake news. It can carry a positive or negative sentiment.
Right now the relevancy module only supports deep learning models.

[2]: negative_text = 'Roti Massimo Mengandungi DNA Babi. Roti produk Massimo keluaran Syarikat The Italian Baker mengandungi DNA babi. Para pengguna dinasihatkan supaya tidak memakan produk massimo. Terdapat pelbagai produk roti keluaran syarikat lain yang boleh dimakan dan halal. Mari kita sebarkan berita ini supaya semua rakyat Malaysia sedar dengan apa yang mereka makna setiap hari. Roti tidak halal ada DNA babi jangan makan ok.'
     positive_text = 'Jabatan Kemajuan Islam Malaysia memperjelaskan dakwaan sebuah mesej yang dikitar semula, yang mendakwa kononnya kod E dikaitkan dengan kandungan lemak babi sepertimana yang tular di media sosial. . Tular: November 2017 . Tular: Mei 2014 JAKIM ingin memaklumkan kepada masyarakat berhubung maklumat yang telah disebarkan secara meluas khasnya melalui media sosial berhubung kod E yang dikaitkan mempunyai lemak babi. Untuk makluman, KOD E ialah kod untuk bahan tambah (aditif) dan ianya selalu digunakan pada label makanan di negara Kesatuan Eropah. Menurut JAKIM, tidak semua nombor E yang digunakan untuk membuat sesuatu produk makanan berasaskan dari sumber yang haram. Sehubungan itu, sekiranya sesuatu produk merupakan produk tempatan dan mendapat sijil Pengesahan Halal Malaysia, maka ia boleh digunakan tanpa was-was sekalipun mempunyai kod E-kod. Tetapi sekiranya produk tersebut bukan produk tempatan serta tidak mendapat sijil pengesahan halal Malaysia walaupun menggunakan e-kod yang sama, pengguna dinasihatkan agar berhati-hati dalam memilih produk tersebut.'

9.36.2 List available Transformer models

[3]: malaya.relevancy.available_transformer()
INFO:root:tested on 20% test set.
[3]:               Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score  max length
     bert              425.6               111.00          0.89320       0.89195         0.89256       512.0
     tiny-bert          57.4                15.40          0.87179       0.86324         0.86695       512.0
     albert             48.6                12.80          0.89798       0.86008         0.87209       512.0
     tiny-albert        22.4                 5.98          0.82157       0.83410         0.82416       512.0
     xlnet             446.6               118.00          0.92707       0.92103         0.92381       512.0
     alxlnet            46.8                13.30          0.91135       0.90446         0.90758       512.0
     bigbird           458.0               116.00          0.88093       0.86832         0.87352      1024.0
     tiny-bigbird       65.0                16.90          0.86558       0.85871         0.86176      1024.0

Make sure you check the accuracy chart at https://malaya.readthedocs.io/en/latest/Accuracy.html#relevancy before selecting a model. You might want to use ALXLNET: it has a very small size, 46.8MB, yet its accuracy is still top notch.
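Since available_transformer returns a pandas DataFrame, you can also pick a model programmatically, for example the smallest model that still clears a macro f1-score threshold. A small sketch, assuming the column names shown above:

df = malaya.relevancy.available_transformer()
# smallest model with macro f1-score above 0.90
candidates = df[df['macro f1-score'] > 0.90]
print(candidates.sort_values('Size (MB)').index[0])  # 'alxlnet'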

9.36.3 Load Transformer model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer relevancy model.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
        * ``'bigbird'`` - Google BigBird BASE parameters.
        * ``'tiny-bigbird'`` - Malaya BigBird BASE parameters.
    quantized : bool, optional (default=False)
        If True, will load an 8-bit quantized model. A quantized model is not
        necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.MulticlassBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.MulticlassXLNET`.
        * if `bigbird` in model, will return `malaya.model.xlnet.MulticlassBigBird`.
    """

[4]: model = malaya.relevancy.transformer(model='tiny-bigbird')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

9.36.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[5]: quantized_model = malaya.relevancy.transformer(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

Predict batch of strings

def predict(self, strings: List[str]):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[str]
    """

[7]: %%time
     model.predict([negative_text, positive_text])
CPU times: user 2.04 s, sys: 520 ms, total: 2.56 s
Wall time: 1.23 s
[7]: ['not relevant', 'relevant']

[8]: %%time
     quantized_model.predict([negative_text, positive_text])
CPU times: user 5.08 s, sys: 823 ms, total: 5.91 s
Wall time: 2.96 s
[8]: ['not relevant', 'relevant']

Predict batch of strings with probability

def predict_proba(self, strings: List[str]):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[dict[str, float]]
    """

[9]: %%time
     model.predict_proba([negative_text, positive_text])
CPU times: user 1.46 s, sys: 403 ms, total: 1.86 s
Wall time: 319 ms
[9]: [{'not relevant': 0.9896912, 'relevant': 0.010308762},
     {'not relevant': 0.007830339, 'relevant': 0.9921697}]

[10]: %%time
      quantized_model.predict_proba([negative_text, positive_text])
CPU times: user 2.98 s, sys: 386 ms, total: 3.37 s
Wall time: 583 ms
[10]: [{'not relevant': 0.9999988, 'relevant': 1.2511766e-06},
      {'not relevant': 9.157779e-06, 'relevant': 0.9999908}]

Open relevancy visualization dashboard

def predict_words(self, string: str, method: str = 'last', visualization: bool = True):
    """
    Classify words.

    Parameters
    ----------
    string : str
    method : str, optional (default='last')
        Attention layer supported. Allowed values:

        * ``'last'`` - attention from last layer.
        * ``'first'`` - attention from first layer.
        * ``'mean'`` - average attentions from all layers.
    visualization : bool, optional (default=True)
        If True, it will open the visualization dashboard.

    Returns
    -------
    result : dict
    """

By default, calling predict_words opens a browser with the visualization dashboard; you can disable this with visualization=False. This method is not available for BigBird models.

[11]: model.predict_words(negative_text)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
----> 1 model.predict_words(negative_text)

~/Documents/Malaya/malaya/model/abstract.py in predict_words(self, string, **kwargs)
     21
     22     def predict_words(self, string, **kwargs):
---> 23         raise NotImplementedError
     24

NotImplementedError:
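If you switch between architectures programmatically, it can be convenient to guard predict_words, since BigBird-based models raise NotImplementedError as shown above. A small defensive sketch:

def safe_predict_words(model, string, **kwargs):
    # BigBird models do not implement attention-based word-level prediction
    try:
        return model.predict_words(string, **kwargs)
    except NotImplementedError:
        print('predict_words is not available for this model')
        return None

safe_predict_words(model, negative_text)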

[9]: quantized_model.predict_words(negative_text)

[10]: from IPython.core.display import Image, display

display(Image('relevancy-dashboard.png', width=800))


9.36.5 Vectorize

Let's say you want to visualize sentence / word level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result : np.array
    """

Sentence level

[10]: texts = [negative_text, positive_text]
      r = model.vectorize(texts, method='first')

[11]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(r)
      tsne.shape
[11]: (2, 2)

[12]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = texts
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Word level

[13]: r = quantized_model.vectorize(texts, method='word')

[14]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[15]: tsne = TSNE().fit_transform(y)
      tsne.shape
[15]: (211, 2)

[16]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )


Pretty good, the model is able to cluster the bottom left as positive relevancy.

9.36.6 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

[ ]: albert = malaya.relevancy.transformer(model='albert')
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:loading sentence piece model

[14]: malaya.stack.predict_stack([albert, model], [positive_text, negative_text])
[14]: [{'not relevant': 3.1056952e-06, 'relevant': 0.9999934},
      {'not relevant': 0.99982065, 'relevant': 3.868528e-05}]
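Conceptually, stacking aggregates the per-class probabilities from each model into a single score; a geometric mean is a common choice because it penalizes disagreement between models. A minimal sketch of that idea (see the Stack page above for the aggregation malaya.stack actually uses):

import numpy as np

def stack_gmean(probas):
    # probas: one {label: probability} dict per model for the same string
    labels = probas[0].keys()
    return {
        label: float(np.exp(np.mean([np.log(p[label]) for p in probas])))
        for label in labels
    }

per_model = [m.predict_proba([positive_text])[0] for m in [albert, model]]
print(stack_gmean(per_model))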


9.37 Sentiment Analysis

This tutorial is available as an IPython notebook at Malaya/example/sentiment.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 4.37 s, sys: 709 ms, total: 5.08 s
Wall time: 5.45 s

[2]: string1 = 'Sis, students from overseas were brought back because they are not in their countries which is if something happens to them, its not the other countries’ responsibility. Student dalam malaysia ni dah dlm tggjawab kerajaan. Mana part yg tak faham?'
     string2 = 'Harap kerajaan tak bukak serentak. Slowly release week by week. Focus on economy related industries dulu'
     string3 = 'Idk if aku salah baca ke apa. Bayaran rm350 utk golongan umur 21 ke bawah shj ? Anyone? If 21 ke atas ok lah. If umur 21 ke bawah? Are you serious? Siapa yg lebih byk komitmen? Aku hrp aku salah baca. Aku tk jumpa artikel tu'
     string4 = 'Jabatan Penjara Malaysia diperuntukkan RM20 juta laksana program pembangunan Insan kepada banduan. Majikan yang menggaji bekas banduan, bekas penagih dadah diberi potongan cukai tambahan sehingga 2025.'

9.37.1 Load multinomial model

def multinomial(**kwargs):
    """
    Load multinomial sentiment model.

    Returns
    -------
    result : malaya.model.ml.Bayes class
    """

[3]: model = malaya.sentiment.multinomial()

Predict batch of strings

def predict(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[str]
    """

[4]: model.predict([string1, string2])
[4]: ['neutral', 'neutral']

Disable neutral probability,

[ ]: model.predict([string1, string2], add_neutral=False)

Predict batch of strings with probability

def predict_proba(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[dict[str, float]]
    """

[5]: model.predict_proba([string1, string2])
[5]: [{'negative': 0.008213267932937583, 'positive': 0.17867320670623799, 'neutral': 0.8131135253608244},
     {'negative': 0.010098264096992408, 'positive': 0.009901735903007554, 'neutral': 0.98}]

Disable neutral probability,

[6]: model.predict_proba([string1, string2], add_neutral=False)
[6]: [{'negative': 0.4106633966468791, 'positive': 0.589336603353119},
     {'negative': 0.5049132048496204, 'positive': 0.49508679515037773}]
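Intuitively, add_neutral rescales the binary positive/negative probabilities so that predictions close to 0.5 fall into a neutral class. A toy illustration of one such scheme; this is for intuition only and is not necessarily Malaya's exact formula:

def add_neutral_toy(proba, alpha=2.0):
    # Illustrative only: stretch each binary probability and treat the
    # leftover mass as 'neutral'. Malaya's exact rescaling may differ.
    positive = max(proba['positive'] * alpha - 1, 0)
    negative = max(proba['negative'] * alpha - 1, 0)
    return {'negative': negative, 'positive': positive,
            'neutral': 1 - positive - negative}

print(add_neutral_toy({'negative': 0.4107, 'positive': 0.5893}))
# neutral dominates, mirroring the predict_proba output above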


9.37.2 List available Transformer models

[8]: malaya.sentiment.available_transformer()
INFO:root:tested on 20% test set.
[8]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.6               111.00          0.99330       0.99330         0.99329
     tiny-bert         57.4                15.40          0.98774       0.98774         0.98774
     albert            48.6                12.80          0.99227       0.99226         0.99226
     tiny-albert       22.4                 5.98          0.98554       0.98550         0.98551
     xlnet            446.6               118.00          0.99353       0.99353         0.99353
     alxlnet           46.8                13.30          0.99188       0.99188         0.99188

Make sure you check the accuracy chart at https://malaya.readthedocs.io/en/latest/Accuracy.html#sentiment-analysis before selecting a model. You might want to use Tiny-Albert: it has a very small size, 22.4MB, yet its accuracy is still top notch.

9.37.3 Load Transformer model

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer sentiment model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
    quantized : bool, optional (default=False)
        If True, will load an 8-bit quantized model. A quantized model is not
        necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.softmax.transformer function
    """

[14]: model = malaya.sentiment.transformer(model='xlnet')


9.37.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[10]: quantized_model = malaya.sentiment.transformer(model='xlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

Predict batch of strings

def predict(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[str]
    """

[12]: %%time
      model.predict([string1, string2])
CPU times: user 4.08 s, sys: 1.44 s, total: 5.51 s
Wall time: 4.67 s
[12]: ['positive', 'negative']

[13]: %%time
      quantized_model.predict([string1, string2])
CPU times: user 3.51 s, sys: 1.33 s, total: 4.84 s
Wall time: 3.8 s
[13]: ['positive', 'positive']

Predict batch of strings with probability

def predict_proba(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[dict[str, float]]
    """

[7]: %%time
     model.predict_proba([string1, string2])
CPU times: user 5.05 s, sys: 2.6 s, total: 7.65 s
Wall time: 10.1 s
[7]: [{'negative': 0.00032528088, 'positive': 0.96747196, 'neutral': 0.03220278},
     {'negative': 0.98301303, 'positive': 0.0001698712, 'neutral': 0.016817093}]

[5]: %%time
     quantized_model.predict_proba([string1, string2])
CPU times: user 1.64 s, sys: 387 ms, total: 2.03 s
Wall time: 1.43 s
[5]: [{'negative': 0.0007685767, 'positive': 0.9231422, 'neutral': 0.0760892},
     {'negative': 8.198959e-06, 'positive': 0.9991802, 'neutral': 0.00081157684}]

[13]: model.predict_proba([string1, string2], add_neutral=False)
[13]: [{'negative': 0.029847767, 'positive': 0.97015226},
      {'negative': 0.1034979, 'positive': 0.89650214}]

[5]: quantized_model.predict_proba([string1, string2], add_neutral=False)
[5]: [{'negative': 0.004556194, 'positive': 0.9954438},
     {'negative': 0.07760632, 'positive': 0.9223937}]
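Since a quantized model is not guaranteed to be faster, it is worth benchmarking both variants on your own machine before committing to one. A minimal timing sketch:

import time

def bench(m, strings, n=3):
    # average wall time of n predict_proba calls
    start = time.time()
    for _ in range(n):
        m.predict_proba(strings)
    return (time.time() - start) / n

print('float32:', bench(model, [string1, string2]))
print('int8   :', bench(quantized_model, [string1, string2]))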

Open sentiment visualization dashboard

By default, calling predict_words opens a browser with the visualization dashboard; you can disable this with visualization=False.

[15]: model.predict_words(string1)

[16]: from IPython.core.display import Image, display

display(Image('sentiment-dashboard.png', width=800))


9.37.5 Vectorize

Let's say you want to visualize sentence / word level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result : np.array
    """

Sentence level

[5]: r = quantized_model.vectorize([string1, string2, string3, string4], method='first')

[6]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(r)
     tsne.shape
[6]: (4, 2)

[7]: plt.figure(figsize=(7, 7))
     plt.scatter(tsne[:, 0], tsne[:, 1])
     labels = [string1, string2, string3, string4]
     for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
         label = (
             '%s, %.3f' % (label[0], label[1])
             if isinstance(label, list)
             else label
         )
         plt.annotate(
             label,
             xy=(x, y),
             xytext=(0, 0),
             textcoords='offset points',
         )

Word level

[8]: r = quantized_model.vectorize([string1, string2, string3, string4], method='word')

[9]: x, y = [], []
     for row in r:
         x.extend([i[0] for i in row])
         y.extend([i[1] for i in row])

[10]: tsne = TSNE().fit_transform(y)
      tsne.shape
[10]: (129, 2)

[11]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Pretty good, the model is able to cluster the top left as positive sentiment and the bottom right as negative sentiment.

9.37.6 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

[5]: multinomial = malaya.sentiment.multinomial()
     alxlnet = malaya.sentiment.transformer(model='alxlnet')

[8]: malaya.stack.predict_stack([multinomial, alxlnet, model], [string1, string2])
[8]: [{'negative': 0.0005453552136673502, 'positive': 0.5603020846001405, 'neutral': 0.05399025419995675},
     {'negative': 0.0002248290781177622, 'positive': 0.21361579430243546, 'neutral': 0.022142383292097452}]

If you do not want neutral in predict_stack, simply override the parameter,

[9]: malaya.stack.predict_stack([multinomial, alxlnet, model], [string1, string2], add_neutral=False)
[9]: [{'negative': 0.05828375571937787, 'positive': 0.8221586003437801},
     {'negative': 0.014352668987571138, 'positive': 0.7835866999009022}]

9.38 Subjectivity Analysis

This tutorial is available as an IPython notebook at Malaya/example/subjectivity.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 4.22 s, sys: 535 ms, total: 4.75 s
Wall time: 3.84 s

9.38.1 Explanation

Positive subjectivity: based on or influenced by personal feelings, tastes, or opinions. Can be a positive or negative sentiment. Negative subjectivity: based on a report or a fact. Can be a positive or negative sentiment.

[2]: negative_text = 'Kerajaan negeri Kelantan mempersoalkan motif kenyataan Menteri Kewangan yang hanya menyebut Kelantan penerima terbesar bantuan kewangan dari Kerajaan Persekutuan. Sedangkan menurut Timbalan Menteri Besarnya, Datuk Mohd Amar Nik Abdullah, negeri lain yang lebih maju dari Kelantan turut mendapat pembiayaan dan pinjaman.'
     positive_text = 'kerajaan sebenarnya sangat bencikan rakyatnya, minyak naik dan segalanya'

     string1 = 'Sis, students from overseas were brought back because they are not in their countries which is if something happens to them, its not the other countries’ responsibility. Student dalam malaysia ni dah dlm tggjawab kerajaan. Mana part yg tak faham?'
     string2 = 'Harap kerajaan tak bukak serentak. Slowly release week by week. Focus on economy related industries dulu'
     string3 = 'Idk if aku salah baca ke apa. Bayaran rm350 utk golongan umur 21 ke bawah shj ? Anyone? If 21 ke atas ok lah. If umur 21 ke bawah? Are you serious? Siapa yg lebih byk komitmen? Aku hrp aku salah baca. Aku tk jumpa artikel tu'
     string4 = 'Jabatan Penjara Malaysia diperuntukkan RM20 juta laksana program pembangunan Insan kepada banduan. Majikan yang menggaji bekas banduan, bekas penagih dadah diberi potongan cukai tambahan sehingga 2025.'


9.38.2 Load multinomial model

def multinomial(**kwargs):
    """
    Load multinomial subjectivity model.

    Returns
    -------
    result : malaya.model.ml.Bayes class
    """

[3]: model = malaya.subjectivity.multinomial()

Predict batch of strings

def predict(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[str]
    """

[4]: model.predict([positive_text, negative_text])
[4]: ['neutral', 'negative']

Disable neutral probability,

[5]: model.predict([positive_text, negative_text], add_neutral=False)
[5]: ['positive', 'negative']

Predict batch of strings with probability

def predict_proba(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[dict[str, float]]
    """

[6]: model.predict_proba([positive_text, negative_text], add_neutral=False)
[6]: [{'negative': 0.420659316666446, 'positive': 0.5793406833335559},
     {'negative': 0.7906212884104161, 'positive': 0.2093787115895868}]

9.38.3 List available Transformer models

[7]: malaya.subjectivity.available_transformer()
INFO:root:tested on 20% test set.
[7]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.6               111.00          0.92004       0.91748         0.91663
     tiny-bert         57.4                15.40          0.91023       0.90228         0.90301
     albert            48.6                12.80          0.90544       0.90299         0.90300
     tiny-albert       22.4                 5.98          0.89457       0.89469         0.89461
     xlnet            446.6               118.00          0.91916       0.91753         0.91761
     alxlnet           46.8                13.30          0.90862       0.90835         0.90817

Make sure you check the accuracy chart at https://malaya.readthedocs.io/en/latest/Accuracy.html#subjectivity-analysis before selecting a model. You might want to use Tiny-Albert: it has a very small size, 22.4MB, yet its accuracy is still top notch.

9.38.4 Load Transformer model

All model interfaces follow the sklearn interface starting from v3.4.

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer subjectivity model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
    quantized : bool, optional (default=False)
        If True, will load an 8-bit quantized model. A quantized model is not
        necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.BinaryBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.BinaryXLNET`.
    """

[12]: model = malaya.subjectivity.transformer(model='albert')
INFO:tensorflow:loading sentence piece model

9.38.5 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[14]: quantized_model = malaya.subjectivity.transformer(model='albert', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model

Predict batch of strings

def predict(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[str]
    """

[15]: model.predict([negative_text, positive_text])
[15]: ['negative', 'negative']

[17]: quantized_model.predict([negative_text, positive_text])
[17]: ['negative', 'negative']

Predict batch of strings with probability

def predict_proba(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[dict[str, float]]
    """

[18]: model.predict_proba([negative_text, positive_text])
[18]: [{'negative': 0.9956738, 'positive': 4.326162e-05, 'neutral': 0.0042829514},
      {'negative': 0.9615872, 'positive': 0.00038412912, 'neutral': 0.038028657}]

[16]: quantized_model.predict_proba([negative_text, positive_text])
[16]: [{'negative': 0.9954784, 'positive': 4.521673e-05, 'neutral': 0.0044763684},
      {'negative': 0.9612684, 'positive': 0.00038731584, 'neutral': 0.038344264}]

Open subjectivity visualization dashboard

By default, calling predict_words opens a browser with the visualization dashboard; you can disable this with visualization=False.

[12]: model.predict_words(negative_text)

[13]: from IPython.core.display import Image, display

display(Image('subjective-dashboard.png', width=800))


9.38.6 Vectorize

Let's say you want to visualize sentence / word level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result : np.array
    """

Sentence level

[8]: texts = [negative_text, positive_text, string1, string2]
     r = quantized_model.vectorize(texts, method='first')

[9]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(r)
     tsne.shape
[9]: (4, 2)

[11]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = texts
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Word level

[12]: r = quantized_model.vectorize(texts, method='word')

[13]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[14]: tsne = TSNE().fit_transform(y)
      tsne.shape
[14]: (109, 2)

[15]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Pretty good, the model is able to cluster the top side as positive subjectivity and the bottom side as negative subjectivity.

9.38.7 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

[6]: multinomial = malaya.subjectivity.multinomial()
     alxlnet = malaya.subjectivity.transformer(model='alxlnet')

[12]: malaya.stack.predict_stack([multinomial, model, alxlnet], [positive_text])
[12]: [{'negative': 0.19735892950073536, 'positive': 0.003119166818228667, 'neutral': 0.1160071232668102}]

[13]: malaya.stack.predict_stack([multinomial, model, alxlnet], [positive_text], add_neutral=False)
[13]: [{'negative': 0.7424157666636825, 'positive': 0.04498033797670938}]


9.39 Toxicity Analysis

This tutorial is available as an IPython notebook at Malaya/example/toxicity.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 4.18 s, sys: 558 ms, total: 4.74 s
Wall time: 3.85 s

9.39.1 Get labels

[2]: malaya.toxicity.label
[2]: ['severe toxic', 'obscene', 'identity attack', 'insult', 'threat',
     'asian', 'atheist', 'bisexual', 'buddhist', 'christian', 'female',
     'heterosexual', 'indian', 'homosexual, gay or lesbian',
     'intellectual or learning disability', 'male', 'muslim',
     'other disability', 'other gender', 'other race or ethnicity',
     'other religion', 'other sexual orientation', 'physical disability',
     'psychiatric or mental illness', 'transgender', 'malay', 'chinese']

[4]: string = 'Benda yg SALAH ni, jgn lah didebatkan. Yg SALAH xkan jadi betul. Ingat tu. Mcm mana kesat sekalipun org sampaikan mesej, dan memang benda tu salah, diam je. Xyah nk tunjuk kau open sangat nk tegur cara org lain berdakwah.'
     another_string = 'melayu bodoh, dah la gay, sokong lgbt lagi, memang tak guna'
     string1 = 'Sis, students from overseas were brought back because they are not in their countries which is if something happens to them, its not the other countries’ responsibility. Student dalam malaysia ni dah dlm tggjawab kerajaan. Mana part yg tak faham?'
     string2 = 'Harap kerajaan tak bukak serentak. Slowly release week by week. Focus on economy related industries dulu'

9.39.2 Load multinomial model

def multinomial(**kwargs):
    """
    Load multinomial toxicity model.

    Returns
    -------
    result : malaya.model.ml.MultilabelBayes class
    """

[9]: model = malaya.toxicity.multinomial()

Predict batch of strings

def predict(self, strings: List[str]):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[str]
    """

[6]: model.predict([string])
[6]: [['severe toxic', 'obscene', 'identity attack', 'insult', 'indian', 'malay', 'chinese']]

Predict batch of strings with probability

def predict_proba(self, strings: List[str]):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[dict[str, float]]
    """

[7]: model.predict_proba([string])
[7]: [{'severe toxic': 0.997487040981572, 'obscene': 0.9455379277616331, 'identity attack': 0.8274699625500679,
      'insult': 0.5607594945618526, 'threat': 0.024772971511820983, 'asian': 0.0221240002096628,
      'atheist': 0.013774558637508741, 'bisexual': 0.0024495807483865223, 'buddhist': 0.004640372956039871,
      'christian': 0.052795457745171054, 'female': 0.05289744129561423, 'heterosexual': 0.008128507494633362,
      'indian': 0.9023637357823499, 'homosexual, gay or lesbian': 0.04385664232535533,
      'intellectual or learning disability': 0.0014981591337876019, 'male': 0.07976929455558882,
      'muslim': 0.08806420077375651, 'other disability': 0.0, 'other gender': 0.0,
      'other race or ethnicity': 0.0017014040578187566, 'other religion': 0.0017333144620482767,
      'other sexual orientation': 0.00122606681013474, 'physical disability': 0.001489522998169223,
      'psychiatric or mental illness': 0.027125947355667267, 'transgender': 0.012349564445375391,
      'malay': 0.9991900346707605, 'chinese': 0.9886782229459774}]
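For a multilabel model like toxicity, predict is equivalent to thresholding predict_proba: the labels returned are those whose probability crosses a cutoff. A quick check, assuming a 0.5 threshold:

probs = model.predict_proba([string])[0]
labels = [label for label, p in probs.items() if p >= 0.5]
print(labels)
# matches model.predict([string])[0]:
# ['severe toxic', 'obscene', 'identity attack', 'insult', 'indian', 'malay', 'chinese']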

9.39.3 List available Transformer models

[8]: malaya.toxicity.available_transformer()
INFO:root:tested on 20% test set.
[8]:              Size (MB)  Quantized Size (MB)  micro precision  micro recall  micro f1-score
     bert             425.6               111.00          0.86098       0.77313         0.81469
     tiny-bert         57.4                15.40          0.83535       0.79611         0.81526
     albert            48.6                12.80          0.86054       0.76973         0.81261
     tiny-albert       22.4                 5.98          0.83535       0.79611         0.81526
     xlnet            446.6               118.00          0.77904       0.83829         0.80758
     alxlnet           46.8                13.30          0.83376       0.80221         0.81768

9.39.4 Load ALXLNET model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer toxicity model.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
    quantized : bool, optional (default=False)
        If True, will load an 8-bit quantized model. A quantized model is not
        necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya.model.bert.SigmoidBERT class
    """

[16]: model = malaya.toxicity.transformer(model='alxlnet')

9.39.5 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[ ]: quantized_model = malaya.toxicity.transformer(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.


Predict batch of strings

def predict(self, strings: List[str]):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[List[str]]
    """

[12]: model.predict([string, another_string])
[12]: [['obscene'],
      ['severe toxic', 'obscene', 'identity attack', 'insult', 'malay']]

Predict batch of strings with probability

def predict_proba(self, strings: List[str]):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[dict[str, float]]
    """

[14]: model.predict_proba([string, another_string])
[14]: [{'severe toxic': 0.30419078, 'obscene': 0.07300964, 'identity attack': 0.02309686,
       'insult': 0.14792377, 'threat': 0.0043829083, 'asian': 0.00018724799,
       'atheist': 0.0013933778, 'bisexual': 0.0005682409, 'buddhist': 0.0006982982,
       'christian': 0.00010216236, 'female': 0.0062876344, 'heterosexual': 3.6597252e-05,
       'indian': 0.020283729, 'homosexual, gay or lesbian': 0.0008122027,
       'intellectual or learning disability': 0.00015977025, 'male': 0.0007993579,
       'muslim': 0.054483294, 'other disability': 0.00017657876, 'other gender': 0.00018069148,
       'other race or ethnicity': 6.273389e-05, 'other religion': 0.0011053085,
       'other sexual orientation': 0.0013027787, 'physical disability': 0.00010755658,
       'psychiatric or mental illness': 0.00078335404, 'transgender': 0.00080055,
       'malay': 0.0033579469, 'chinese': 0.20889702},
      {'severe toxic': 0.99571323, 'obscene': 0.91805434, 'identity attack': 0.95676684,
       'insult': 0.7667657, 'threat': 0.02582252, 'asian': 0.00074103475,
       'atheist': 0.0012175143, 'bisexual': 0.07754475, 'buddhist': 0.004547477,
       'christian': 0.0019699335, 'female': 0.03404945, 'heterosexual': 0.029964417,
       'indian': 0.021356285, 'homosexual, gay or lesbian': 0.13626209,
       'intellectual or learning disability': 0.021410972, 'male': 0.029543608,
       'muslim': 0.06485465, 'other disability': 0.0006414652, 'other gender': 0.04015115,
       'other race or ethnicity': 0.010606945, 'other religion': 0.001650244,
       'other sexual orientation': 0.04054076, 'physical disability': 0.0025109593,
       'psychiatric or mental illness': 0.0022883855, 'transgender': 0.01127643,
       'malay': 0.9658916, 'chinese': 0.33373892}]

[15]: quantized_model.predict_proba([string, another_string])
[15]: [{'severe toxic': 0.28386846, 'obscene': 0.25873762, 'identity attack': 0.021321118,
       'insult': 0.19023287, 'threat': 0.005617261, 'asian': 0.00022211671,
       'atheist': 0.000109523535, 'bisexual': 0.0019034147, 'buddhist': 0.00038090348,
       'christian': 0.0016773939, 'female': 0.007807076, 'heterosexual': 0.0001899302,
       'indian': 0.049388766, 'homosexual, gay or lesbian': 0.00043603778,
       'intellectual or learning disability': 0.0012571216, 'male': 0.0043218136,
       'muslim': 0.018054605, 'other disability': 0.0011820793, 'other gender': 0.00044164062,
       'other race or ethnicity': 0.00012764335, 'other religion': 0.0009614825,
       'other sexual orientation': 0.0040558875, 'physical disability': 0.0005840957,
       'psychiatric or mental illness': 0.0023525357, 'transgender': 0.003135711,
       'malay': 0.0013717413, 'chinese': 0.0051787198},
      {'severe toxic': 0.9966523, 'obscene': 0.82459927, 'identity attack': 0.97338796,
       'insult': 0.49216133, 'threat': 0.010962069, 'asian': 0.0034621954,
       'atheist': 0.0007635355, 'bisexual': 0.044597328, 'buddhist': 0.0061615705,
       'christian': 0.0029616058, 'female': 0.023250878, 'heterosexual': 0.0038115382,
       'indian': 0.0068957508, 'homosexual, gay or lesbian': 0.084989995,
       'intellectual or learning disability': 0.006228268, 'male': 0.070231974,
       'muslim': 0.055434316, 'other disability': 0.00017631054, 'other gender': 0.02043128,
       'other race or ethnicity': 0.0032926202, 'other religion': 0.0035361946,
       'other sexual orientation': 0.018447628, 'physical disability': 0.0007721717,
       'psychiatric or mental illness': 0.004228982, 'transgender': 0.0046984255,
       'malay': 0.7579823, 'chinese': 0.8585954}]

Open toxicity visualization dashboard

By default, calling predict_words opens a browser with the visualization dashboard; you can disable this with visualization=False.

[13]: model.predict_words(another_string)

[14]: from IPython.core.display import Image, display

display(Image('toxicity-dashboard.png', width=800))


9.39.6 Vectorize

Let's say you want to visualize sentence / word level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result : np.array
    """

Sentence level

[8]: texts = [string, another_string, string1, string2]
     r = quantized_model.vectorize(texts, method='first')

[9]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(r)
     tsne.shape
[9]: (4, 2)

[11]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = texts
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Word level

[17]: r = quantized_model.vectorize(texts, method='word')

[18]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[19]: tsne = TSNE().fit_transform(y)
      tsne.shape
[19]: (107, 2)

[20]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Pretty good; the outliers are toxic words.

9.39.7 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

[16]: albert = malaya.toxicity.transformer(model='albert')
INFO:tensorflow:loading sentence piece model

[18]: malaya.stack.predict_stack([model, albert], [another_string])
[18]: [{'severe toxic': 0.9968317, 'obscene': 0.43022493, 'identity attack': 0.90531594,
       'insult': 0.42289576, 'threat': 0.0058603976, 'asian': 0.000983668,
       'atheist': 0.0005495089, 'bisexual': 0.0009623809, 'buddhist': 0.0003632398,
       'christian': 0.0018632574, 'female': 0.006050684, 'heterosexual': 0.0025569045,
       'indian': 0.0056869243, 'homosexual, gay or lesbian': 0.012232827,
       'intellectual or learning disability': 0.00091394753, 'male': 0.011594971,
       'muslim': 0.0042621437, 'other disability': 0.00027529505, 'other gender': 0.0010361207,
       'other race or ethnicity': 0.0012320877, 'other religion': 0.00091365684,
       'other sexual orientation': 0.0027996385, 'physical disability': 0.00010540871,
       'psychiatric or mental illness': 0.000815311, 'transgender': 0.0016718076,
       'malay': 0.96644485, 'chinese': 0.05199418}]


9.40 Doc2Vec

This tutorial is available as an IPython notebook at Malaya/example/doc2vec.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 4.19 s, sys: 598 ms, total: 4.79 s
Wall time: 4.16 s

[2]: string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
     string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
     string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
     string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'

[3]: news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'
     tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'


9.40.1 Doc2Vec using Word Vector

def doc2vec_wordvector(wordvector):
    """
    Doc2vec interface for text similarity using Word Vector.

    Parameters
    ----------
    wordvector : object
        malaya.wordvector.WordVector object; should have `get_vector_by_name` method.

    Returns
    -------
    result : malaya.similarity.Doc2VecSimilarity
    """

Using Interface

I will use malaya.wordvector.load(model='news'); it is pretty accurate for local issues.

[4]: %%time
     vocab_news, embedded_news = malaya.wordvector.load(model='news')
     w2v = malaya.wordvector.WordVector(embedded_news, vocab_news)
     doc2vec = malaya.similarity.doc2vec_wordvector(w2v)
CPU times: user 178 ms, sys: 118 ms, total: 296 ms
Wall time: 301 ms

predict batch of strings with probability

def predict_proba(
    self,
    left_strings: List[str],
    right_strings: List[str],
    aggregation: Callable = np.mean,
    similarity: str = 'cosine',
    soft: bool = False,
):
    """
    Calculate similarity for two different batches of texts.

    Parameters
    ----------
    left_strings : list of str
    right_strings : list of str
    aggregation : Callable, optional (default=numpy.mean)
    similarity : str, optional (default='cosine')
        Similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.
    soft : bool, optional (default=False)
        If True, a word not inside the word vector will be replaced with the
        nearest word; else, it will be skipped.

    Returns
    -------
    result : List[float]
    """

[5]: %%time
     doc2vec.predict_proba([string1], [string2])
CPU times: user 1.53 ms, sys: 786 µs, total: 2.31 ms
Wall time: 1.68 ms
[5]: array([0.89971105])

[6]: %%time
     doc2vec.predict_proba([string1, string2], [string3, string4])
CPU times: user 2.55 ms, sys: 1.44 ms, total: 3.99 ms
Wall time: 2.73 ms
[6]: array([0.91679387, 0.82348571])

[7]: %%time
     doc2vec.predict_proba([string1, string2], [string3, tweet1])
CPU times: user 1.68 ms, sys: 381 µs, total: 2.06 ms
Wall time: 1.75 ms
[7]: array([0.91679387, 0.78542261])
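Conceptually, doc2vec_wordvector aggregates the word vectors of each string (np.mean by default) and compares the aggregated vectors with the chosen similarity function. A minimal sketch of the cosine case, skipping out-of-vocabulary words as soft=False does; get_vector_by_name is the method the interface requires:

import numpy as np

def doc_vector(wv, string):
    # average the word vectors; skip out-of-vocabulary tokens (soft=False)
    vectors = []
    for token in string.lower().split():
        try:
            vectors.append(wv.get_vector_by_name(token))
        except Exception:
            continue
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_vector(w2v, string1), doc_vector(w2v, string2)))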

visualize heatmap

def heatmap(
    self,
    strings: List[str],
    aggregation: Callable = np.mean,
    similarity: str = 'cosine',
    soft: bool = False,
    visualize: bool = True,
    annotate: bool = True,
    figsize: Tuple[int, int] = (7, 7),
):
    """
    Plot a heatmap based on the similarity output.

    Parameters
    ----------
    strings : list of str
        List of strings.
    aggregation : Callable, optional (default=numpy.mean)
    similarity : str, optional (default='cosine')
        Similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.
    soft : bool, optional (default=False)
        If True, a word not inside the word vector will be replaced with the
        nearest word; else, it will be skipped.
    visualize : bool
        If True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        Figure size for plot.

    Returns
    -------
    result : list
        List of results.
    """

[8]: doc2vec.heatmap([string1, string2, string3, string4])


Different similarity functions will return different percentages.
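To see this in numbers, call predict_proba with each supported similarity value:

for similarity in ['cosine', 'euclidean', 'manhattan']:
    print(similarity, doc2vec.predict_proba([string1], [string2], similarity=similarity))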

9.40.2 Doc2Vec using Vectorizer Model

We can use any Vectorizer model provided by Malaya with the encoder similarity interface, for example BERT or XLNET. Again, these encoder models are not trained to do similarity classification; they just encode the strings into vector representations.

def doc2vec_vectorizer(vectorizer):
    """
    Doc2vec interface for text similarity using Vectorizer model.

    Parameters
    ----------
    vectorizer : object
        Vectorizer interface object, BERT, XLNET; should have `vectorize` method.

    Returns
    -------
    result : malaya.similarity.VectorizerSimilarity
    """

using ALXLNET

[9]: alxlnet = malaya.transformer.load(model='alxlnet')
     doc2vec_vectorizer = malaya.similarity.doc2vec_vectorizer(alxlnet)
INFO:tensorflow:memory input None
INFO:tensorflow:Use float type
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:810: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:271: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:110: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/alxlnet-model/base/alxlnet-base/model.ckpt

predict for 2 strings with probability

def predict_proba(
    self,
    left_strings: List[str],
    right_strings: List[str],
    similarity: str = 'cosine',
):
    """
    Calculate similarity for two different batches of texts.

    Parameters
    ----------
    left_strings : list of str
    right_strings : list of str
    similarity : str, optional (default='cosine')
        Similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.

    Returns
    -------
    result : List[float]
    """

[11]: %%time
      doc2vec_vectorizer.predict_proba([string1], [string2])
CPU times: user 1.49 s, sys: 103 ms, total: 1.59 s
Wall time: 1.34 s
[11]: array([0.89992255], dtype=float32)

[12]: %%time
      doc2vec_vectorizer.predict_proba([string1, string2], [string3, string4])
CPU times: user 504 ms, sys: 118 ms, total: 621 ms
Wall time: 139 ms
[12]: array([0.64460504, 0.63204634], dtype=float32)
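Since doc2vec_vectorizer only wraps the model's vectorize output with a similarity function, you can reproduce the cosine score manually. A sketch, assuming the vectorizer exposes the vectorize method required by the interface:

import numpy as np

v = alxlnet.vectorize([string1, string2])
manual = float(np.dot(v[0], v[1]) / (np.linalg.norm(v[0]) * np.linalg.norm(v[1])))
print(manual)  # should be close to the predict_proba result above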

visualize heatmap

def heatmap(
    self,
    strings: List[str],
    similarity: str = 'cosine',
    visualize: bool = True,
    annotate: bool = True,
    figsize: Tuple[int, int] = (7, 7),
):
    """
    Plot a heatmap based on the similarity output.

    Parameters
    ----------
    strings : list of str
        List of strings.
    similarity : str, optional (default='cosine')
        Similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.
    visualize : bool
        If True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        Figure size for plot.

    Returns
    -------
    result : list
        List of results.
    """

[13]: doc2vec_vectorizer.heatmap([string1, string2, string3, string4])


9.41 Semantic Similarity

This tutorial is available as an IPython notebook at Malaya/example/semantic-similarity.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 5.18 s, sys: 1.07 s, total: 6.25 s
Wall time: 7.5 s

[2]: string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
     string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
     string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
     string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'

[3]: news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'
     tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'

9.41.1 List available Transformer models

[4]: malaya.similarity.available_transformer()
INFO:root:tested on 20% test set.
[4]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             423.4                111.0          0.88315       0.88656         0.88405
     tiny-bert         56.6                 15.0          0.87210       0.87546         0.87292
     albert            48.3                 12.8          0.87164       0.87146         0.87155
     tiny-albert       21.9                  6.0          0.82234       0.82383         0.82295
     xlnet            448.7                119.0          0.80866       0.76775         0.77112
     alxlnet           49.0                 13.9          0.88756       0.88700         0.88727

We trained on Quora Question Pairs, translated SNLI and translated MNLI. Make sure you check the accuracy chart at https://malaya.readthedocs.io/en/latest/Accuracy.html#similarity before selecting a model. You might want to use ALXLNET: it has a very small size, 49MB, yet its accuracy is still top notch.


9.41.2 Load transformer model

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer similarity model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
    quantized : bool, optional (default=False)
        If True, will load an 8-bit quantized model. A quantized model is not
        necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.SiameseBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.SiameseXLNET`.
    """

[5]: model = malaya.similarity.transformer(model='alxlnet')

9.41.3 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[6]: quantized_model = malaya.similarity.transformer(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

predict batch of strings with probability

def predict_proba(self, strings_left: List[str], strings_right: List[str]):
    """
    Calculate similarity for two different batches of texts.

    Parameters
    ----------
    strings_left : List[str]
    strings_right : List[str]

    Returns
    -------
    result : List[float]
    """

You need to give a list of left strings and a list of right strings; the first left string will be compared with the first right string, and so on. The similarity model only supports predict_proba.

[7]: model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])
[7]: array([0.99828064, 0.01076903, 0.9603669 , 0.9075881 ], dtype=float32)

[8]: quantized_model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])
[8]: array([0.9987801 , 0.00554545, 0.8729592 , 0.49839294], dtype=float32)

visualize heatmap

def heatmap(
    self,
    strings: List[str],
    visualize: bool = True,
    annotate: bool = True,
    figsize: Tuple[int, int] = (7, 7),
):
    """
    Plot a heatmap based on the similarity output.

    Parameters
    ----------
    strings : list of str
        List of strings.
    visualize : bool
        If True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        Figure size for plot.

    Returns
    -------
    result : list
        List of results.
    """

[9]: model.heatmap([string1, string2, string3, string4])


9.41.4 Vectorize

Let's say you want to visualize sentences in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str]):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : np.array
    """

[10]: texts = [string1, string2, string3, string4, news1, tweet1]
      r = quantized_model.vectorize(texts)

[11]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(r)
      tsne.shape
[11]: (6, 2)

[12]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = texts
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )


9.41.5 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html. If you want to stack semantic similarity models, you need to pass the labels using the strings_right parameter,

malaya.stack.predict_stack([model1, model2], List[str], strings_right=List[str])

strings_right will be passed as **kwargs.

[13]: alxlnet = malaya.similarity.transformer(model='alxlnet')
      albert = malaya.similarity.transformer(model='albert')
      tiny_bert = malaya.similarity.transformer(model='tiny-bert')

[14]: malaya.stack.predict_stack([alxlnet, albert, tiny_bert], [string1, string2, news1, news1], strings_right=[string3, string4, tweet1, string1])
[14]: array([0.9968965 , 0.17514098, 0.11507297, 0.01998391], dtype=float32)

9.42 Unsupervised Keyword Extraction

We can use any Vectorizer Model to calculate Top N keywords.

This tutorial is available as an IPython notebook at Malaya/example/unsupervised-keyword-extraction.

[1]: import malaya

[2]: # https://www.bharian.com.my/berita/nasional/2020/06/698386/isu-bersatu-tun-m-6-yang-lain-saman-muhyiddin

     string = """
     Dalam saman itu, plaintif memohon perisytiharan, antaranya mereka adalah ahli BERSATU yang sah, masih lagi memegang jawatan dalam parti (bagi pemegang jawatan) dan layak untuk bertanding pada pemilihan parti.

     Mereka memohon perisytiharan bahawa semua surat pemberhentian yang ditandatangani Muhammad Suhaimi bertarikh 28 Mei lalu dan pengesahan melalui mesyuarat Majlis Pimpinan Tertinggi (MPT) parti bertarikh 4 Jun lalu adalah tidak sah dan terbatal.

     Plaintif juga memohon perisytiharan bahawa keahlian Muhyiddin, Hamzah dan Muhammad Suhaimi di dalam BERSATU adalah terlucut, berkuat kuasa pada 28 Februari 2020 dan/atau 29 Februari 2020, menurut Fasal 10.2.3 perlembagaan parti.

     Yang turut dipohon, perisytiharan bahawa Seksyen 18C Akta Pertubuhan 1966 adalah tidak terpakai untuk menghalang pelupusan pertikaian berkenaan oleh mahkamah.

     Perisytiharan lain ialah Fasal 10.2.6 Perlembagaan BERSATU tidak terpakai di atas hal melucutkan/ memberhentikan keahlian semua plaintif.
     """

[3]: import re

     # minimum cleaning, just simply to remove newlines.
     def cleaning(string):
         string = string.replace('\n', ' ')
         string = re.sub('[^A-Za-z\-() ]+', ' ', string).strip()
         string = re.sub(r'[ ]+', ' ', string).strip()
         return string

     string = cleaning(string)

9.42.1 Use RAKE algorithm

The original implementation is from https://github.com/aneesha/RAKE. Malaya added an attention mechanism into the RAKE algorithm.

def rake(
    string: str,
    model = None,
    vectorizer = None,
    top_k: int = 5,
    atleast: int = 1,
    stopwords = get_stopwords,
    **kwargs
):
    """
    Extract keywords using Rake algorithm.

    Parameters
    ----------
    string: str
    model: Object, optional (default=None)
        Transformer model or any model has `attention` method.
    vectorizer: Object, optional (default=None)
        Prefer `sklearn.feature_extraction.text.CountVectorizer` or,
        `malaya.text.vectorizer.SkipGramCountVectorizer`.
        If None, will generate ngram automatically based on `stopwords`.
    top_k: int, optional (default=5)
        return top-k results.
    ngram: tuple, optional (default=(1,1))
        n-grams size.
    atleast: int, optional (default=1)
        at least count appeared in the string to accept as candidate.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]
        For automatic Ngram generator.

    Returns
    -------
    result: Tuple[float, str]
    """
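For intuition, classic RAKE scores a candidate phrase by summing the degree-to-frequency ratio of each of its words over phrase co-occurrence counts. A minimal sketch of that scoring (without Malaya's attention weighting):

from collections import defaultdict

def rake_scores(phrases):
    # word frequency and word degree (co-occurrence within phrases)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        words = phrase.split()
        for w in words:
            freq[w] += 1
            degree[w] += len(words)  # w co-occurs with every word in the phrase
    # a phrase's score is the sum of its words' degree / frequency ratios
    return {p: sum(degree[w] / freq[w] for w in p.split()) for p in phrases}

print(rake_scores(['mesyuarat majlis pimpinan tertinggi', 'pimpinan parti']))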


auto-ngram

This will auto-generate N-size ngrams for keyword candidates.

[4]: malaya.keyword_extraction.rake(string)
[4]: [(0.11666666666666665, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),
     (0.08888888888888888, 'mesyuarat Majlis Pimpinan Tertinggi'),
     (0.08888888888888888, 'Seksyen C Akta Pertubuhan'),
     (0.05138888888888888, 'parti bertarikh Jun'),
     (0.04999999999999999, 'keahlian Muhyiddin Hamzah')]

auto-ngram with Attention

This will use the attention mechanism as the scores. I will use small-electra in this example.

[5]: electra = malaya.transformer.load(model='small-electra')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:242: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:120: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/small/electra-small/model.ckpt

[6]: malaya.keyword_extraction.rake(string, model=electra)
[6]: [(0.21135464299906287, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),
     (0.1707678937548548, 'terlucut berkuat kuasa'),
     (0.1665075410114966, 'Muhammad Suhaimi'),
     (0.16204322474881924, 'mesyuarat Majlis Pimpinan Tertinggi'),
     (0.08333932270307894, 'Seksyen C Akta Pertubuhan')]


using vectorizer

[7]: from malaya.text.vectorizer import SkipGramCountVectorizer

stopwords = malaya.text.function.get_stopwords()
vectorizer = SkipGramCountVectorizer(
    token_pattern=r'[\S]+',
    ngram_range=(1, 3),
    stop_words=stopwords,
    lowercase=False,
    skip=2,
)

[8]: malaya.keyword_extraction.rake(string, vectorizer=vectorizer)
[8]: [(0.0017052987393271276, 'parti memohon perisytiharan'),
     (0.0017036368782590756, 'memohon perisytiharan BERSATU'),
     (0.0017012023597074357, 'memohon perisytiharan sah'),
     (0.0017012023597074357, 'sah memohon perisytiharan'),
     (0.0016992809994779549, 'perisytiharan BERSATU sah')]

fixed-ngram with Attention

[9]: malaya.keyword_extraction.rake(string, model=electra, vectorizer=vectorizer)
[9]: [(0.011575973734122788, 'Suhaimi terlucut kuasa'),
     (0.011181844743375292, 'Suhaimi terlucut berkuat'),
     (0.011115823052133569, 'Hamzah Suhaimi terlucut'),
     (0.011088263093292463, 'Muhammad Suhaimi terlucut'),
     (0.010932739982610082, 'Suhaimi BERSATU terlucut')]

9.42.2 Use Textrank algorithm

Malaya simply uses the Textrank algorithm.

def textrank(
    string: str,
    model = None,
    vectorizer = None,
    top_k: int = 5,
    atleast: int = 1,
    stopwords = get_stopwords,
    **kwargs
):
    """
    Extract keywords using Textrank algorithm.

    Parameters
    ----------
    string: str
    model: Object, optional (default='None')
        model has `fit_transform` or `vectorize` method.
    vectorizer: Object, optional (default=None)
        Prefer `sklearn.feature_extraction.text.CountVectorizer` or,
        `malaya.text.vectorizer.SkipGramCountVectorizer`.
        If None, will generate ngram automatically based on `stopwords`.
    top_k: int, optional (default=5)
        return top-k results.
    atleast: int, optional (default=1)
        at least count appeared in the string to accept as candidate.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]

    Returns
    -------
    result: Tuple[float, str]
    """
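For intuition, Textrank builds a similarity graph over keyword candidates and runs PageRank on it; candidates central to the graph score highest. A minimal sketch using TF-IDF vectors (an illustration of the idea, not Malaya's internals):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_scores(candidates, damping=0.85, iterations=50):
    # similarity graph between candidate phrases
    vectors = TfidfVectorizer().fit_transform(candidates)
    sim = cosine_similarity(vectors)
    np.fill_diagonal(sim, 0.0)
    # column-normalize so each node distributes its score proportionally
    norm = sim / np.maximum(sim.sum(axis=0, keepdims=True), 1e-10)
    scores = np.full(len(candidates), 1.0 / len(candidates))
    for _ in range(iterations):  # power iteration of PageRank
        scores = (1 - damping) / len(candidates) + damping * norm.dot(scores)
    return sorted(zip(scores, candidates), reverse=True)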

[10]: from sklearn.feature_extraction.text import TfidfVectorizer

      tfidf = TfidfVectorizer()

auto-ngram with TFIDF

This will auto-generate N-size ngrams for keyword candidates.

[11]: malaya.keyword_extraction.textrank(string, model=tfidf)
[11]: [(0.00015733542072521276, 'plaintif memohon perisytiharan'),
      (0.00012558967703709954, 'Fasal perlembagaan parti'),
      (0.00011514137183023093, 'Fasal Perlembagaan BERSATU'),
      (0.00011505528232050447, 'parti'),
      (0.00010763519022276223, 'memohon perisytiharan')]

auto-ngram with Attention

This will auto-generate N-size ngrams for keyword candidates.

[12]: electra = malaya.transformer.load(model='small-electra')
      albert = malaya.transformer.load(model='albert')
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/small/electra-small/model.ckpt
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/albert-model/base/albert-base/model.ckpt

[13]: malaya.keyword_extraction.textrank(string, model=electra)
[13]: [(6.318265869072403e-05, 'dipohon perisytiharan'),
      (6.316746537201306e-05, 'pemegang jawatan'),
      (6.316118885596658e-05, 'parti bertarikh Jun'),
      (6.316104343935219e-05, 'Februari'),
      (6.315818745707347e-05, 'plaintif')]

[14]: malaya.keyword_extraction.textrank(string, model=albert)
[14]: [(7.964654245909712e-05, 'Fasal Perlembagaan BERSATU'),
      (7.746139567779304e-05, 'mesyuarat Majlis Pimpinan Tertinggi'),
      (7.522448275120805e-05, 'Muhammad Suhaimi'),
      (7.520443949997106e-05, 'pengesahan'),
      (7.519602119292121e-05, 'terbatal Plaintif')]

Or you can use any classification model to find keywords sensitive to a specific domain.

[15]: sentiment = malaya.sentiment.transformer(model='xlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

[16]: malaya.keyword_extraction.textrank(string, model=sentiment)
[16]: [(6.592349998684001e-05, 'pengesahan'),
      (6.522374046273496e-05, 'parti'),
      (6.519787313586387e-05, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),
      (6.50355056789609e-05, 'memegang jawatan'),
      (6.48614030622403e-05, 'pemilihan parti')]

fixed-ngram with Attention

[17]: stopwords = malaya.text.function.get_stopwords()
      vectorizer = SkipGramCountVectorizer(
          token_pattern=r'[\S]+',
          ngram_range=(1, 3),
          stop_words=stopwords,
          lowercase=False,
          skip=2,
      )

[18]: malaya.keyword_extraction.textrank(string, model=electra, vectorizer=vectorizer)
[18]: [(5.652169440330057e-09, 'plaintif perisytiharan'),
      (5.652075728462069e-09, 'perisytiharan ahli sah'),
      (5.651996176263403e-09, 'Plaintif perisytiharan keahlian'),
      (5.651931485635611e-09, 'Perisytiharan'),
      (5.651703407437562e-09, 'plaintif memohon perisytiharan')]

[19]: malaya.keyword_extraction.textrank(string, model=albert, vectorizer=vectorizer)
[19]: [(7.237609487831676e-09, 'Perisytiharan Fasal Perlembagaan'),
      (7.237148398598793e-09, 'Fasal Perlembagaan melucutkan'),
      (7.234637484224076e-09, 'Pimpinan Tertinggi (MPT)'),
      (7.2318264874552195e-09, 'Majlis Pimpinan (MPT)'),
      (7.231510832160389e-09, 'Perisytiharan Fasal BERSATU')]

9.42.3 Use Attention mechanism

Use the attention mechanism from a transformer model to get important keywords.

def attention(
    string: str,
    model,
    vectorizer = None,
    top_k: int = 5,
    atleast: int = 1,
    stopwords = get_stopwords,
    **kwargs
):
    """
    Extract keywords using Attention mechanism.

    Parameters
    ----------
    string: str
    model: Object
        Transformer model or any model has `attention` method.
    vectorizer: Object, optional (default=None)
        Prefer `sklearn.feature_extraction.text.CountVectorizer` or,
        `malaya.text.vectorizer.SkipGramCountVectorizer`.
        If None, will generate ngram automatically based on `stopwords`.
    top_k: int, optional (default=5)
        return top-k results.
    atleast: int, optional (default=1)
        at least count appeared in the string to accept as candidate.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]

    Returns
    -------
    result: Tuple[float, str]
    """
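Conceptually, this pools the transformer's attention weight on each word, then scores a candidate phrase by its words' pooled weights. A minimal sketch, assuming `model.attention([text])` returns `[[(word, weight), ...]]` (the shape Malaya transformer models expose); an illustration, not Malaya's exact code:

def attention_scores(model, text, candidates):
    # pool attention weight per word across the string
    weights = {}
    for word, weight in model.attention([text])[0]:
        weights[word] = weights.get(word, 0.0) + weight
    # score each candidate phrase by the summed weight of its words
    scores = {c: sum(weights.get(w, 0.0) for w in c.split()) for c in candidates}
    total = sum(scores.values()) or 1.0
    return {c: s / total for c, s in scores.items()}  # normalize to sum to 1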

auto-ngram

This will auto-generate N-size ngrams for keyword candidates.

[20]: malaya.keyword_extraction.attention(string, model=electra)
[20]: [(0.9452064615567696, 'menghalang pelupusan pertikaian'),
      (0.00748668920928296, 'Fasal Perlembagaan BERSATU'),
      (0.005130746086467051, 'ahli BERSATU'),
      (0.005036596770673816, 'melucutkan memberhentikan keahlian'),
      (0.004883705096775167, 'BERSATU')]

[21]: malaya.keyword_extraction.attention(string, model=albert)
[21]: [(0.16196376947988833, 'plaintif memohon perisytiharan'),
      (0.09294069270557498, 'memohon perisytiharan'),
      (0.06902307677431335, 'plaintif'),
      (0.05584833292678144, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),
      (0.05206227265177878, 'dipohon perisytiharan')]


fixed-ngram

[22]: malaya.keyword_extraction.attention(string, model=electra, vectorizer=vectorizer)
[22]: [(0.037611192232396125, 'pertikaian mahkamah Perlembagaan'),
      (0.03757121639209162, 'pertikaian mahkamah Fasal'),
      (0.037563414917813766, 'terpakai pertikaian mahkamah'),
      (0.03756289871618318, 'menghalang pertikaian mahkamah'),
      (0.037561437116523086, 'pelupusan pertikaian mahkamah')]

[23]: malaya.keyword_extraction.attention(string, model=albert, vectorizer=vectorizer)
[23]: [(0.0073900373302097505, 'saman plaintif memohon'),
      (0.006895211361267655, 'Dalam plaintif memohon'),
      (0.006638399608830277, 'plaintif memohon BERSATU'),
      (0.0062231449129606375, 'Dalam saman memohon'),
      (0.006196574312595335, 'plaintif memohon perisytiharan')]

9.42.4 Use similarity mechanism

def similarity(
    string: str,
    model,
    vectorizer = None,
    top_k: int = 5,
    atleast: int = 1,
    stopwords = get_stopwords,
    **kwargs,
):
    """
    Extract keywords using Sentence embedding VS keyword embedding similarity.

    Parameters
    ----------
    string: str
    model: Object
        Transformer model or any model has `vectorize` method.
    vectorizer: Object, optional (default=None)
        Prefer `sklearn.feature_extraction.text.CountVectorizer` or,
        `malaya.text.vectorizer.SkipGramCountVectorizer`.
        If None, will generate ngram automatically based on `stopwords`.
    top_k: int, optional (default=5)
        return top-k results.
    atleast: int, optional (default=1)
        at least count appeared in the string to accept as candidate.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]

    Returns
    -------
    result: Tuple[float, str]
    """

It is best to use with malaya.similarity.transformer(model = 'alxlnet').
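Under the hood the idea is to embed the whole string and every candidate phrase, then rank candidates by cosine similarity against the sentence embedding. A minimal sketch assuming any model with a `vectorize` method (an illustration, not Malaya's internals):

from sklearn.metrics.pairwise import cosine_similarity

def similarity_scores(model, text, candidates, top_k=5):
    # first row is the full sentence, the rest are candidate phrases
    vectors = model.vectorize([text] + candidates)
    sims = cosine_similarity(vectors[:1], vectors[1:])[0]
    return sorted(zip(sims, candidates), reverse=True)[:top_k]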


[4]: alxlnet = malaya.similarity.transformer(model='alxlnet')

[5]: malaya.keyword_extraction.similarity(string, model=alxlnet)
[5]: [(0.817958, 'terbatal Plaintif'),
     (0.79831344, 'memohon perisytiharan'),
     (0.7925713, 'melucutkan memberhentikan keahlian'),
     (0.7921115, 'plaintif memohon perisytiharan'),
     (0.76372087, 'Seksyen C Akta Pertubuhan')]

9.43 Keyphrase similarity

Finetuning transformers to calculate similarity between sentences and keyphrases.

This tutorial is available as an IPython notebook at Malaya/example/keyphrase-similarity.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: import malaya
     import numpy as np

9.43.1 List available Transformer models

[2]: malaya.keyword_extraction.available_transformer()
INFO:root:tested on 20% test set.
[2]:            Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert           443.0                112.0          0.99403       0.99568         0.99485
     tiny-bert       59.5                 15.1          0.99494       0.99707         0.99600
     alxlnet         53.0                 14.0          0.98170       0.99182         0.98663
     xlnet          472.0                120.0          0.99667       0.99819         0.99742

We trained on Twitter Keyphrase Bahasa and Malaysia Entities. Example training set,

[3]: # !wget https://raw.githubusercontent.com/huseinzol05/Malay-Dataset/master/keyphrase/twitter-bahasa/topics.json

     import json

     with open('topics.json') as fopen:
         topics = set(json.load(fopen).keys())

     list_topics = list(topics)
     len(list_topics)
[3]: 949

[4]: import random

     def get_data(data):
         if len(set(data[1]) & topics) and random.random() > 0.2:
             t = random.choice(data[1])
             label = 1
         else:
             s = set(data[1]) | set()
             t = random.choice(list(topics - s))
             label = 0
         return data[0], t, label

[14]: data = ('Peguam dikuarantin, kes 1MDB ditangguh', ['najib razak'])

[23]: get_data(data)
[23]: ('Peguam dikuarantin, kes 1MDB ditangguh', 'najib razak', 1)

Sometimes it will return a random topic from the corpus and give label 0.

[22]: get_data(data)
[22]: ('Peguam dikuarantin, kes 1MDB ditangguh', 'car camera', 0)

9.43.2 Load transformer model

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer keyword similarity model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.KeyphraseBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.KeyphraseXLNET`.
    """

[5]: tiny_bert = malaya.keyword_extraction.transformer(model='tiny-bert')
     alxlnet = malaya.keyword_extraction.transformer(model='alxlnet')

[6]: # !wget https://raw.githubusercontent.com/huseinzol05/Malay-Dataset/master/keyphrase/twitter-bahasa/testset-keyphrase.json

     with open('testset-keyphrase.json') as fopen:
         testset = json.load(fopen)

[7]: testset[:10]
[7]: [['Takdak gambar raya ', 'myburgerlab restaurant', 0],
     ['Menyampah aku tngk cerita ayda jebat pukul 7 ni, mcm bodoh je', 'samsung smartphone', 0],
     ['Alhamdulillah ala kulli hal.', 'fifa', 0],
     ['@mubimalaysia @sharifahamani 018 9828689 no fon saya kak amani', 'jabatan parlimen malaysia', 0],
     ['Ya ga seru lah, seruan lawan Burnley sama Brighton ', 'asus smartphone', 0],
     ['Senyumlah. Allah tahu kau sedih tapi senyumlah. Dont give up now. Yelah life is not always rainbow and flowers kan. Kadang-kadang you stuck on cloudy day and step on thorn.', 'shop property', 0],
     ['Serabut fikir pasal kerja.. Pastu memalukan diri kat lrt! Kau ingat kau anak menteri ke nak naik free? Confident je https://t.co/Qe3XgyUllt', 'pengangkutan awam', 1],
     ['Melaka peeps !!! I jumpe satu restaurant yg best sangat kalau nk lepak2 or makan with whole family !! Tmpt selesa ! Makanan pun sedap ! Harga pun not bad !!! \n\nrestaurant markisar https://t.co/dnwkqqlt5z', 'gejala sosial', 1],
     ['@farzanamahmud7 haha eh air nira kan sedap minum sejuk time panas camnie haha', 'beverage', 1],
     ['Ayah...kenapa kita gak punya mobil? Pertanyaan Bintang siang ini saat hujan turun dgn derasnya.\nSemua orang yg punya mobil itu cuman titipan nak, nah kebetulan ayah gak kebagian dititipin, kebagian dititipan bawa spd motor. Enak naik motor, adem. Hehe.\nBintang ketawa https://t.co/ToYBTEHJoA', 'motorcycle type', 1]]


predict batch of strings with probability

def predict_proba(self, strings_left: List[str], strings_right: List[str]):
    """
    calculate similarity for two different batch of texts.

    Parameters
    ----------
    strings_left : List[str]
    strings_right : List[str]

    Returns
    -------
    result : List[float]
    """

You need to give a list of left strings and a list of right strings; the first left string will be compared with the first right string, and so on. The similarity model only supports predict_proba.

[8]: texts, keyphrases, labels = [], [], []
     for i in range(10):
         texts.append(testset[i][0])
         keyphrases.append(testset[i][1])
         labels.append(testset[i][2])

[9]: np.around(tiny_bert.predict_proba(texts, keyphrases))
[9]: array([0., 0., 0., 0., 0., 0., 1., 1., 1., 1.], dtype=float32)

[10]: np.around(alxlnet.predict_proba(texts, keyphrases))
[10]: array([1., 1., 1., 1., 0., 1., 1., 1., 1., 1.], dtype=float32)

[12]: np.around(tiny_bert.predict_proba(texts, keyphrases)) == np.array(labels)
[12]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True])

[13]: np.around(alxlnet.predict_proba(texts, keyphrases)) == np.array(labels)
[13]: array([False,  True, False, False, False, False,  True,  True,  True,  True])

9.43.3 Vectorize

Let's say you want to visualize sentences in lower dimension, you can use model.vectorize,

def vectorize(self, strings: List[str]):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: np.array
    """

[16]: v_texts = tiny_bert.vectorize(texts)
      v_keyphrases = tiny_bert.vectorize(keyphrases)
      v_texts.shape, v_keyphrases.shape
[16]: ((10, 312), (10, 312))

[25]: from sklearn.metrics.pairwise import cosine_similarity

      similarities = cosine_similarity(v_keyphrases, v_texts)

[22]: import matplotlib.pyplot as plt
      import seaborn as sns

      sns.set()

[28]: plt.figure(figsize=(7, 7))
      g = sns.heatmap(
          similarities,
          cmap='Blues',
          xticklabels=keyphrases,
          yticklabels=texts,
          annot=True,
      )
      plt.show()

[29]: v_texts = alxlnet.vectorize(texts)
      v_keyphrases = alxlnet.vectorize(keyphrases)
      v_texts.shape, v_keyphrases.shape
[29]: ((10, 768), (10, 768))

[30]: similarities = cosine_similarity(v_keyphrases, v_texts)

[31]: plt.figure(figsize=(7, 7))
      g = sns.heatmap(
          similarities,
          cmap='Blues',
          xticklabels=keyphrases,
          yticklabels=texts,
          annot=True,
      )
      plt.show()

[32]: text = 'Peguam dikuarantin, kes 1MDB ditangguh'
      label = 'najib razak'

[33]: v = tiny_bert.vectorize([text, label])

[34]: cosine_similarity(v)
[34]: array([[0.99999994, 0.48644015],
             [0.48644015, 1.0000002 ]], dtype=float32)

[35]: v = alxlnet.vectorize([text, label])

[36]: cosine_similarity(v)
[36]: array([[1.0000002, 0.3488139],
             [0.3488139, 1.0000001]], dtype=float32)


9.44 Entities Recognition

This tutorial is available as an IPython notebook at Malaya/example/entities.

This module is only trained on standard language structure, so it is not safe to use it for local language structure.

[1]: %%time
     import malaya
CPU times: user 4.79 s, sys: 950 ms, total: 5.74 s
Wall time: 6.79 s

9.44.1 Describe supported entities

[2]: import pandas as pd

     pd.set_option('display.max_colwidth', -1)
     malaya.entity.describe()
[2]:            Tag                                                             Description
     0         OTHER  other
     1           law  law, regulation, related law documents, documents, etc
     2      location  location, place
     3  organization  organization, company, government, facilities, etc
     4        person  person, group of people, believes, unique arts (eg; food, drink), etc
     5      quantity  numbers, quantity
     6          time  date, day, time, etc
     7         event  unique event happened, etc

9.44.2 Describe supported Ontonotes 5 entities

[3]: malaya.entity.describe_ontonotes5()
[3]:            Tag                                            Description
     0        OTHER  other
     1      ADDRESS  Address of physical location.
     2       PERSON  People, including fictional.
     3         NORP  Nationalities or religious or political groups.
     4          FAC  Buildings, airports, highways, bridges, etc.
     5          ORG  Companies, agencies, institutions, etc.
     6          GPE  Countries, cities, states.
     7          LOC  Non-GPE locations, mountain ranges, bodies of water.
     8      PRODUCT  Objects, vehicles, foods, etc. (Not services.)
     9        EVENT  Named hurricanes, battles, wars, sports events, etc.
     10 WORK_OF_ART  Titles of books, songs, etc.
     11         LAW  Named documents made into laws.
     12    LANGUAGE  Any named language.
     13        DATE  Absolute or relative dates or periods.
     14        TIME  Times smaller than a day.
     15     PERCENT  Percentage, including "%".
     16       MONEY  Monetary values, including unit.
     17    QUANTITY  Measurements, as of weight or distance.
     18     ORDINAL  "first", "second", etc.
     19    CARDINAL  Numerals that do not fall under another type.

9.44.3 List available Transformer NER models

[4]: malaya.entity.available_transformer()
INFO:root:tested on 20% test set.
[4]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.4               111.00          0.99291       0.97864         0.98537
     tiny-bert         57.7                15.40          0.98151       0.94754         0.96134
     albert            48.6                12.80          0.98026       0.95332         0.96492
     tiny-albert       22.4                 5.98          0.96100       0.90363         0.92374
     xlnet            446.6               118.00          0.99344       0.98154         0.98725
     alxlnet           46.8                13.30          0.99215       0.97575         0.98337

Make sure you check the accuracy chart here first before selecting a model: https://malaya.readthedocs.io/en/latest/models-accuracy.html#Entities-Recognition

9.44.4 List available Transformer NER Ontonotes 5 models

[5]: malaya.entity.available_transformer_ontonotes5()
INFO:root:tested on 20% test set.
[5]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.4               111.00          0.94460       0.93244         0.93822
     tiny-bert         57.7                15.40          0.91908       0.91635         0.91704
     albert            48.6                12.80          0.93010       0.92341         0.92636
     tiny-albert       22.4                 5.98          0.90298       0.88251         0.89145
     xlnet            446.6               118.00          0.93814       0.95021         0.94388
     alxlnet           46.8                13.30          0.93244       0.92942         0.93047

Make sure you check the accuracy chart here first before selecting a model: https://malaya.readthedocs.io/en/latest/models-accuracy.html#Entities-Recognition-Ontonotes5

[36]: string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'
      string1 = 'memperkenalkan Husein, dia sangat comel, berumur 25 tahun, bangsa melayu, agama islam, tinggal di cyberjaya malaysia, bercakap bahasa melayu, semua membaca buku undang-undang kewangan, dengar laju Siti Nurhaliza - Seluruh Cinta sambil makan ayam goreng KFC'

9.44.5 Load Transformer model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer Entity Tagging model, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.tag.transformer function
    """

[7]: model = malaya.entity.transformer(model='alxlnet')
INFO:root:running entity/alxlnet using device /device:CPU:0


Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[8]: quantized_model = malaya.entity.transformer(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity/alxlnet-quantized using device /device:CPU:0

Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: Tuple[str, str]
    """

[9]: model.predict(string)
[9]: [('KUALA', 'location'), ('LUMPUR', 'location'), (':', 'OTHER'), ('Sempena', 'OTHER'), ('sambutan', 'OTHER'),
     ('Aidilfitri', 'event'), ('minggu', 'time'), ('depan', 'time'), (',', 'OTHER'), ('Perdana', 'person'),
     ('Menteri', 'person'), ('Tun', 'person'), ('Dr', 'person'), ('Mahathir', 'person'), ('Mohamad', 'person'),
     ('dan', 'OTHER'), ('Menteri', 'organization'), ('Pengangkutan', 'organization'), ('Anthony', 'person'),
     ('Loke', 'person'), ('Siew', 'person'), ('Fook', 'person'), ('menitipkan', 'OTHER'), ('pesanan', 'OTHER'),
     ('khas', 'OTHER'), ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'),
     ('mahu', 'OTHER'), ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'OTHER'), ('halaman', 'location'),
     ('masing-masing', 'OTHER'), ('.', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'),
     ('terbitan', 'OTHER'), ('Jabatan', 'organization'), ('Keselamatan', 'organization'), ('Jalan', 'organization'),
     ('Raya', 'organization'), ('(', 'organization'), ('JKJR', 'organization'), (')', 'organization'),
     ('itu', 'OTHER'), (',', 'OTHER'), ('Dr', 'person'), ('Mahathir', 'person'), ('menasihati', 'OTHER'),
     ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'), ('dan', 'OTHER'),
     ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'OTHER'), ('ketika', 'OTHER'),
     ('memandu', 'OTHER'), ('.', 'OTHER')]

[37]: model.predict(string1)
[37]: [('memperkenalkan', 'OTHER'), ('Husein', 'person'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'OTHER'), ('25', 'OTHER'), ('tahun', 'OTHER'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'person'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'person'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'location'),
      ('malaysia', 'location'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'OTHER'), ('melayu', 'person'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'person'),
      ('Nurhaliza', 'person'), ('-', 'OTHER'), ('Seluruh', 'OTHER'), ('Cinta', 'OTHER'), ('sambil', 'OTHER'),
      ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'location')]

[11]: quantized_model.predict(string)
[11]: [('KUALA', 'location'), ('LUMPUR', 'location'), (':', 'OTHER'), ('Sempena', 'OTHER'), ('sambutan', 'OTHER'),
      ('Aidilfitri', 'event'), ('minggu', 'time'), ('depan', 'time'), (',', 'OTHER'), ('Perdana', 'person'),
      ('Menteri', 'person'), ('Tun', 'person'), ('Dr', 'person'), ('Mahathir', 'person'), ('Mohamad', 'person'),
      ('dan', 'OTHER'), ('Menteri', 'person'), ('Pengangkutan', 'person'), ('Anthony', 'person'),
      ('Loke', 'person'), ('Siew', 'person'), ('Fook', 'person'), ('menitipkan', 'OTHER'), ('pesanan', 'OTHER'),
      ('khas', 'OTHER'), ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'),
      ('mahu', 'OTHER'), ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'OTHER'), ('halaman', 'OTHER'),
      ('masing-masing', 'OTHER'), ('.', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'),
      ('terbitan', 'OTHER'), ('Jabatan', 'organization'), ('Keselamatan', 'organization'), ('Jalan', 'organization'),
      ('Raya', 'organization'), ('(', 'organization'), ('JKJR', 'organization'), (')', 'organization'),
      ('itu', 'OTHER'), (',', 'OTHER'), ('Dr', 'person'), ('Mahathir', 'person'), ('menasihati', 'OTHER'),
      ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'), ('dan', 'OTHER'),
      ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'OTHER'), ('ketika', 'OTHER'),
      ('memandu', 'OTHER'), ('.', 'OTHER')]

[38]: quantized_model.predict(string1)
[38]: [('memperkenalkan', 'OTHER'), ('Husein', 'person'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'OTHER'), ('25', 'OTHER'), ('tahun', 'OTHER'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'person'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'person'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'location'),
      ('malaysia', 'location'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'OTHER'), ('melayu', 'person'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'person'),
      ('Nurhaliza', 'person'), ('-', 'OTHER'), ('Seluruh', 'OTHER'), ('Cinta', 'OTHER'), ('sambil', 'OTHER'),
      ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'organization')]

Group similar tags

def analyze(self, string: str):
    """
    Analyze a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
    """
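For intuition, analyze is essentially predict followed by grouping contiguous tokens that share a tag into spans with offsets. A minimal sketch of that grouping (an illustration, not Malaya's exact code):

def group_tags(tagged):
    # merge runs of identical tags into span dictionaries with offsets
    spans = []
    for offset, (word, tag) in enumerate(tagged):
        if spans and spans[-1]['type'] == tag:
            spans[-1]['text'].append(word)
            spans[-1]['endOffset'] = offset + 1
        else:
            spans.append({'text': [word], 'type': tag, 'score': 1.0,
                          'beginOffset': offset, 'endOffset': offset + 1})
    return spans

group_tags([('KUALA', 'location'), ('LUMPUR', 'location'), (':', 'OTHER')])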

[13]: model.analyze(string)
[13]: [{'text': ['KUALA', 'LUMPUR'], 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 2},
      {'text': [':', 'Sempena', 'sambutan'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 2, 'endOffset': 5},
      {'text': ['Aidilfitri'], 'type': 'event', 'score': 1.0, 'beginOffset': 5, 'endOffset': 6},
      {'text': ['minggu'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 6, 'endOffset': 7},
      {'text': ['depan'], 'type': 'time', 'score': 1.0, 'beginOffset': 7, 'endOffset': 8},
      {'text': [','], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 8, 'endOffset': 9},
      {'text': ['Perdana', 'Menteri', 'Tun', 'Dr', 'Mahathir', 'Mohamad'], 'type': 'person', 'score': 1.0, 'beginOffset': 9, 'endOffset': 15},
      {'text': ['dan'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 15, 'endOffset': 16},
      {'text': ['Menteri', 'Pengangkutan'], 'type': 'organization', 'score': 1.0, 'beginOffset': 16, 'endOffset': 18},
      {'text': ['Anthony', 'Loke', 'Siew', 'Fook'], 'type': 'person', 'score': 1.0, 'beginOffset': 18, 'endOffset': 22},
      {'text': ['menitipkan', 'pesanan', 'khas', 'kepada', 'orang', 'ramai', 'yang', 'mahu', 'pulang', 'ke', 'kampung', 'halaman', 'masing-masing', '.', 'Dalam', 'video', 'pendek', 'terbitan'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 22, 'endOffset': 40},
      {'text': ['Jabatan', 'Keselamatan', 'Jalan', 'Raya', '(', 'JKJR', ')'], 'type': 'organization', 'score': 1.0, 'beginOffset': 40, 'endOffset': 47},
      {'text': ['itu', ','], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 47, 'endOffset': 49},
      {'text': ['Dr', 'Mahathir'], 'type': 'person', 'score': 1.0, 'beginOffset': 49, 'endOffset': 51},
      {'text': ['menasihati', 'mereka', 'supaya', 'berhenti', 'berehat', 'dan', 'tidur', 'sebentar', 'sekiranya', 'mengantuk', 'ketika', 'memandu', '.'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 51, 'endOffset': 64}]

[39]: model.analyze(string1)
[39]: [{'text': ['memperkenalkan'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1},
      {'text': ['Husein'], 'type': 'person', 'score': 1.0, 'beginOffset': 1, 'endOffset': 2},
      {'text': [',', 'dia', 'sangat', 'comel', ',', 'berumur', '25', 'tahun', ',', 'bangsa'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 2, 'endOffset': 12},
      {'text': ['melayu'], 'type': 'person', 'score': 1.0, 'beginOffset': 12, 'endOffset': 13},
      {'text': [',', 'agama'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 13, 'endOffset': 15},
      {'text': ['islam'], 'type': 'person', 'score': 1.0, 'beginOffset': 15, 'endOffset': 16},
      {'text': [',', 'tinggal', 'di'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 16, 'endOffset': 19},
      {'text': ['cyberjaya', 'malaysia'], 'type': 'location', 'score': 1.0, 'beginOffset': 19, 'endOffset': 21},
      {'text': [',', 'bercakap', 'bahasa'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 21, 'endOffset': 24},
      {'text': ['melayu'], 'type': 'person', 'score': 1.0, 'beginOffset': 24, 'endOffset': 25},
      {'text': [',', 'semua', 'membaca', 'buku', 'undang-undang', 'kewangan', ',', 'dengar', 'laju'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 25, 'endOffset': 34},
      {'text': ['Siti', 'Nurhaliza'], 'type': 'person', 'score': 1.0, 'beginOffset': 34, 'endOffset': 36},
      {'text': ['-', 'Seluruh', 'Cinta', 'sambil', 'makan', 'ayam', 'goreng'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 36, 'endOffset': 43},
      {'text': ['KFC'], 'type': 'organization', 'score': 1.0, 'beginOffset': 43, 'endOffset': 44}]

Vectorize

Let's say you want to visualize word level in lower dimension, you can use model.vectorize,

def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string: List[str]

    Returns
    -------
    result: np.array
    """

[15]: strings = [string,
                 'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',
                 'contact Husein at [email protected]',
                 'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek']

[16]: r = [quantized_model.vectorize(string) for string in strings]

[17]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[18]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(y)
      tsne.shape
[18]: (124, 2)


[19]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')

Pretty good, the model is able to cluster similar entities.


9.44.6 Load Transformer Ontonotes 5 model

def transformer_ontonotes5(
    model: str = 'xlnet', quantized: bool = False, **kwargs
):
    """
    Load Transformer Entity Tagging model trained on Ontonotes 5 Bahasa, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.tag.transformer function
    """

[20]: albert = malaya.entity.transformer_ontonotes5(model='albert')
INFO:root:running entity-ontonotes5/albert using device /device:CPU:0

[21]: alxlnet = malaya.entity.transformer_ontonotes5(model='alxlnet')
INFO:root:running entity-ontonotes5/alxlnet using device /device:CPU:0

Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[22]: quantized_albert = malaya.entity.transformer_ontonotes5(model='albert', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity-ontonotes5/albert-quantized using device /device:CPU:0

[23]: quantized_alxlnet = malaya.entity.transformer_ontonotes5(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity-ontonotes5/alxlnet-quantized using device /device:CPU:0


Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: Tuple[str, str]
    """

[24]: albert.predict(string)
[24]: [('KUALA', 'GPE'), ('LUMPUR', 'GPE'), (':', 'OTHER'), ('Sempena', 'OTHER'), ('sambutan', 'OTHER'),
      ('Aidilfitri', 'DATE'), ('minggu', 'OTHER'), ('depan', 'OTHER'), (',', 'OTHER'), ('Perdana', 'OTHER'),
      ('Menteri', 'OTHER'), ('Tun', 'PERSON'), ('Dr', 'PERSON'), ('Mahathir', 'PERSON'), ('Mohamad', 'PERSON'),
      ('dan', 'OTHER'), ('Menteri', 'OTHER'), ('Pengangkutan', 'OTHER'), ('Anthony', 'PERSON'), ('Loke', 'PERSON'),
      ('Siew', 'PERSON'), ('Fook', 'PERSON'), ('menitipkan', 'OTHER'), ('pesanan', 'OTHER'), ('khas', 'OTHER'),
      ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'), ('mahu', 'OTHER'),
      ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'OTHER'), ('halaman', 'OTHER'), ('masing-masing', 'OTHER'),
      ('.', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'), ('terbitan', 'OTHER'),
      ('Jabatan', 'ORG'), ('Keselamatan', 'ORG'), ('Jalan', 'ORG'), ('Raya', 'ORG'), ('(', 'ORG'),
      ('JKJR', 'ORG'), (')', 'ORG'), ('itu', 'OTHER'), (',', 'OTHER'), ('Dr', 'PERSON'), ('Mahathir', 'PERSON'),
      ('menasihati', 'OTHER'), ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'),
      ('dan', 'OTHER'), ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'OTHER'),
      ('ketika', 'OTHER'), ('memandu', 'OTHER'), ('.', 'OTHER')]

[25]: alxlnet.predict(string)
[25]: [('KUALA', 'EVENT'), ('LUMPUR', 'EVENT'), (':', 'OTHER'), ('Sempena', 'OTHER'), ('sambutan', 'DATE'),
      ('Aidilfitri', 'DATE'), ('minggu', 'DATE'), ('depan', 'DATE'), (',', 'OTHER'), ('Perdana', 'OTHER'),
      ('Menteri', 'OTHER'), ('Tun', 'PERSON'), ('Dr', 'PERSON'), ('Mahathir', 'PERSON'), ('Mohamad', 'PERSON'),
      ('dan', 'OTHER'), ('Menteri', 'OTHER'), ('Pengangkutan', 'OTHER'), ('Anthony', 'PERSON'), ('Loke', 'PERSON'),
      ('Siew', 'PERSON'), ('Fook', 'PERSON'), ('menitipkan', 'OTHER'), ('pesanan', 'OTHER'), ('khas', 'OTHER'),
      ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'), ('mahu', 'OTHER'),
      ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'OTHER'), ('halaman', 'OTHER'), ('masing-masing', 'OTHER'),
      ('.', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'), ('terbitan', 'OTHER'),
      ('Jabatan', 'ORG'), ('Keselamatan', 'ORG'), ('Jalan', 'ORG'), ('Raya', 'ORG'), ('(', 'ORG'),
      ('JKJR', 'ORG'), (')', 'ORG'), ('itu', 'OTHER'), (',', 'OTHER'), ('Dr', 'OTHER'), ('Mahathir', 'PERSON'),
      ('menasihati', 'OTHER'), ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'),
      ('dan', 'OTHER'), ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'OTHER'),
      ('ketika', 'OTHER'), ('memandu', 'OTHER'), ('.', 'OTHER')]

[40]: albert.predict(string1)
[40]: [('memperkenalkan', 'OTHER'), ('Husein', 'PERSON'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'DATE'), ('25', 'DATE'), ('tahun', 'DATE'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'OTHER'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'OTHER'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'GPE'),
      ('malaysia', 'GPE'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'OTHER'), ('melayu', 'OTHER'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'WORK_OF_ART'),
      ('Nurhaliza', 'WORK_OF_ART'), ('-', 'WORK_OF_ART'), ('Seluruh', 'WORK_OF_ART'), ('Cinta', 'WORK_OF_ART'),
      ('sambil', 'OTHER'), ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'ORG')]

[41]: alxlnet.predict(string1)
[41]: [('memperkenalkan', 'OTHER'), ('Husein', 'PERSON'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'OTHER'), ('25', 'DATE'), ('tahun', 'DATE'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'OTHER'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'NORP'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'GPE'),
      ('malaysia', 'GPE'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'LANGUAGE'), ('melayu', 'LANGUAGE'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'WORK_OF_ART'),
      ('Nurhaliza', 'WORK_OF_ART'), ('-', 'WORK_OF_ART'), ('Seluruh', 'WORK_OF_ART'), ('Cinta', 'WORK_OF_ART'),
      ('sambil', 'OTHER'), ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'OTHER')]

[28]: quantized_albert.predict(string)
[28]: [('KUALA', 'GPE'), ('LUMPUR', 'GPE'), (':', 'OTHER'), ('Sempena', 'OTHER'), ('sambutan', 'OTHER'),
      ('Aidilfitri', 'DATE'), ('minggu', 'OTHER'), ('depan', 'OTHER'), (',', 'OTHER'), ('Perdana', 'OTHER'),
      ('Menteri', 'OTHER'), ('Tun', 'PERSON'), ('Dr', 'PERSON'), ('Mahathir', 'PERSON'), ('Mohamad', 'PERSON'),
      ('dan', 'OTHER'), ('Menteri', 'OTHER'), ('Pengangkutan', 'OTHER'), ('Anthony', 'PERSON'), ('Loke', 'PERSON'),
      ('Siew', 'PERSON'), ('Fook', 'PERSON'), ('menitipkan', 'OTHER'), ('pesanan', 'OTHER'), ('khas', 'OTHER'),
      ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'), ('mahu', 'OTHER'),
      ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'OTHER'), ('halaman', 'OTHER'), ('masing-masing', 'OTHER'),
      ('.', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'), ('terbitan', 'OTHER'),
      ('Jabatan', 'ORG'), ('Keselamatan', 'ORG'), ('Jalan', 'ORG'), ('Raya', 'ORG'), ('(', 'ORG'),
      ('JKJR', 'ORG'), (')', 'ORG'), ('itu', 'OTHER'), (',', 'OTHER'), ('Dr', 'PERSON'), ('Mahathir', 'PERSON'),
      ('menasihati', 'OTHER'), ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'),
      ('dan', 'OTHER'), ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'OTHER'),
      ('ketika', 'OTHER'), ('memandu', 'OTHER'), ('.', 'OTHER')]

[42]: quantized_alxlnet.predict(string1)
[42]: [('memperkenalkan', 'OTHER'), ('Husein', 'PERSON'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'DATE'), ('25', 'DATE'), ('tahun', 'DATE'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'OTHER'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'OTHER'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'GPE'),
      ('malaysia', 'GPE'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'OTHER'), ('melayu', 'OTHER'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'WORK_OF_ART'),
      ('Nurhaliza', 'WORK_OF_ART'), ('-', 'X'), ('Seluruh', 'WORK_OF_ART'), ('Cinta', 'WORK_OF_ART'),
      ('sambil', 'OTHER'), ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'OTHER')]

Group similar tags

def analyze(self, string: str):
    """
    Analyze a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
    """

[30]: alxlnet.analyze(string1)
[30]: [{'text': ['memperkenalkan', 'Husein', ',', 'dia', 'sangat', 'comel', ','], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 0, 'endOffset': 7},
      {'text': ['berumur', '25', 'tahun'], 'type': 'DATE', 'score': 1.0, 'beginOffset': 7, 'endOffset': 10},
      {'text': [',', 'bangsa', 'melayu', ',', 'agama'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 10, 'endOffset': 15},
      {'text': ['islam'], 'type': 'NORP', 'score': 1.0, 'beginOffset': 15, 'endOffset': 16},
      {'text': [',', 'tinggal', 'di'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 16, 'endOffset': 19},
      {'text': ['cyberjaya'], 'type': 'GPE', 'score': 1.0, 'beginOffset': 19, 'endOffset': 20},
      {'text': ['malaysia', ',', 'bercakap', 'bahasa', 'melayu', ',', 'semua', 'membaca', 'buku', 'undang-undang', 'kewangan', ',', 'dengar', 'laju'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 20, 'endOffset': 34},
      {'text': ['Justin', 'Bieber'], 'type': 'ORG', 'score': 1.0, 'beginOffset': 34, 'endOffset': 36},
      {'text': ['-', 'Baby'], 'type': 'X', 'score': 1.0, 'beginOffset': 36, 'endOffset': 38},
      {'text': ['sambil', 'makan', 'ayam', 'goreng'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 38, 'endOffset': 42},
      {'text': ['KFC'], 'type': 'ORG', 'score': 1.0, 'beginOffset': 42, 'endOffset': 43}]


Vectorize

Let's say you want to visualize word level in lower dimension, you can use model.vectorize,

def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string: List[str]

    Returns
    -------
    result: np.array
    """

[31]: strings = [string, string1]
      r = [quantized_model.vectorize(string) for string in strings]

[32]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[33]: tsne = TSNE().fit_transform(y)
      tsne.shape
[33]: (107, 2)

[48]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')


Pretty good, the model is able to cluster similar entities.

9.44.7 Load general Malaya entity model

This model is able to classify:

1. date
2. money
3. temperature
4. distance
5. volume
6. duration
7. phone
8. email
9. url
10. time
11. datetime
12. local and generic foods, can check available rules in malaya.texts._food
13. local and generic drinks, can check available rules in malaya.texts._food


We can insert BERT or any deep learning model by passing malaya.entity.general_entity(model = model), as long as the model has a predict method and returns [(string, label), (string, label)]. This is optional; see the sketch below.
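Any object with that interface works. A minimal sketch with a hypothetical stand-in tagger (DummyTagger is not part of Malaya):

class DummyTagger:
    # general_entity only requires a `predict` method that returns
    # [(token, label), ...]; this hypothetical tagger labels everything OTHER.
    def predict(self, string: str):
        return [(token, 'OTHER') for token in string.split()]

dummy_entity = malaya.entity.general_entity(model=DummyTagger())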

[32]: entity = malaya.entity.general_entity(model=model)

[33]: entity.predict('Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais')
[33]: {'PERSON': ['Husein'],
      'OTHER': ['baca buku Perlembagaan yang berharga', 'ringgit dekat kfc sungai petani', ', suhu 32 celcius, sambil makan ayam goreng dan milo o ais'],
      'CARDINAL': ['3k'],
      'DATE': ['minggu lepas,', '2019'],
      'TIME': ['2 ptg'],
      'MONEY': ['2 oktober'],
      'date': {'2 oktober 2019': datetime.datetime(2019, 10, 2, 0, 0), 'minggu lalu': datetime.datetime(2021, 2, 11, 13, 27, 58, 82807)},
      'money': {'3k ringgit': 'RM3000.0'},
      'temperature': ['32 celcius'],
      'distance': [],
      'volume': [],
      'duration': [],
      'phone': [],
      'email': [],
      'url': [],
      'time': {'2 PM': datetime.datetime(2021, 2, 18, 14, 0)},
      'datetime': {'2 ptg 2 oktober 2019': datetime.datetime(2019, 10, 2, 14, 0)},
      'food': ['ayam goreng'],
      'drink': ['milo o ais'],
      'weight': []}

[34]: entity.predict('contact Husein at [email protected]')
[34]: {'OTHER': ['contact Husein at [email protected]'],
      'date': {},
      'money': {},
      'temperature': [],
      'distance': [],
      'volume': [],
      'duration': [],
      'phone': [],
      'email': ['[email protected]'],
      'url': [],
      'time': {},
      'datetime': {},
      'food': [],
      'drink': [],
      'weight': []}

[35]: entity.predict('tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek')
[35]: {'OTHER': ['tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik', 'dekat'],
      'DATE': ['esok'],
      'ORG': ['Restoran Sebulek'],
      'date': {'esok': datetime.datetime(2021, 2, 19, 13, 27, 58, 505853)},
      'money': {},
      'temperature': [],
      'distance': [],
      'volume': [],
      'duration': [],
      'phone': [],
      'email': [],
      'url': [],
      'time': {},
      'datetime': {},
      'food': ['nasi dagang'],
      'drink': ['milo tarik', 'jus apple'],
      'weight': []}

9.44.8 Voting stack model
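malaya.stack.voting_stack tags the string with every model and keeps, for each token, the most common tag across models. A minimal sketch of that majority vote (an illustration, not Malaya's exact code):

from collections import Counter

def majority_vote(models, string):
    # tag with every model, then keep the most frequent tag per token
    tagged = [m.predict(string) for m in models]
    results = []
    for position, (word, _) in enumerate(tagged[0]):
        tags = [t[position][1] for t in tagged]
        results.append((word, Counter(tags).most_common(1)[0][0]))
    return results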

[43]: malaya.stack.voting_stack([albert, alxlnet, alxlnet], string1)
[43]: [('memperkenalkan', 'OTHER'), ('Husein', 'PERSON'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'DATE'), ('25', 'DATE'), ('tahun', 'DATE'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'OTHER'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'OTHER'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'GPE'),
      ('malaysia', 'GPE'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'OTHER'), ('melayu', 'LANGUAGE'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'WORK_OF_ART'),
      ('Nurhaliza', 'WORK_OF_ART'), ('-', 'WORK_OF_ART'), ('Seluruh', 'WORK_OF_ART'), ('Cinta', 'WORK_OF_ART'),
      ('sambil', 'OTHER'), ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'ORG')]

9.45 Part-of-Speech Recognition

This tutorial is available as an IPython notebook at Malaya/example/part-of-speech.

This module is only trained on standard language structure, so it is not safe to use it for local language structure.

[1]: %%time
     import malaya
CPU times: user 4.09 s, sys: 556 ms, total: 4.65 s
Wall time: 3.75 s

9.45.1 Describe supported POS

[2]: malaya.pos.describe()
[2]:      Tag                           Description
     0     ADJ                 Adjective, kata sifat
     1     ADP                            Adposition
     2     ADV               Adverb, kata keterangan
     3     ADX   Auxiliary verb, kata kerja tambahan
     4   CCONJ  Coordinating conjuction, kata hubung
     5     DET              Determiner, kata penentu
     6    NOUN                      Noun, kata nama
     7     NUM                        Number, nombor
     8    PART                              Particle
     9    PRON                   Pronoun, kata ganti
     10  PROPN     Proper noun, kata ganti nama khas
     11  SCONJ             Subordinating conjunction
     12    SYM                                Symbol
     13   VERB                      Verb, kata kerja
     14      X                                 Other


9.45.2 List available Transformer POS models

[3]: malaya.pos.available_transformer()
INFO:root:tested on 20% test set.
[3]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             426.4               111.00          0.93280       0.93129         0.93181
     tiny-bert         57.7                15.40          0.92810       0.92649         0.92704
     albert            48.7                12.80          0.93199       0.91948         0.92547
     tiny-albert       22.4                 5.98          0.90579       0.89501         0.90002
     xlnet            446.6               118.00          0.93303       0.93222         0.93236
     alxlnet           46.8                13.30          0.92732       0.93046         0.92819

Make sure you check the accuracy chart here first before selecting a model: https://malaya.readthedocs.io/en/latest/Accuracy.html#pos-recognition. You might want to use Tiny-Albert; it is very small (22.4MB), but its accuracy is still top notch.

[4]: string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

9.45.3 Load Transformer model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer POS Tagging model, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.tag.transformer function
    """

[5]: model = malaya.pos.transformer(model='albert')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:loading sentence piece model
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.


9.45.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[6]: quantized_model = malaya.pos.transformer(model='albert', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model

Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : List[Tuple[str, str]]
    """

[7]: model.predict(string)
[7]: [('KUALA', 'PROPN'), ('LUMPUR:', 'PROPN'), ('Sempena', 'ADP'), ('sambutan', 'NOUN'),
     ('Aidilfitri', 'NOUN'), ('minggu', 'NOUN'), ('depan,', 'ADJ'), ('Perdana', 'PROPN'),
     ('Menteri', 'PROPN'), ('Tun', 'PROPN'), ('Dr', 'PROPN'), ('Mahathir', 'PROPN'),
     ('Mohamad', 'PROPN'), ('dan', 'CCONJ'), ('Menteri', 'PROPN'), ('Pengangkutan', 'PROPN'),
     ('Anthony', 'PROPN'), ('Loke', 'PROPN'), ('Siew', 'PROPN'), ('Fook', 'PROPN'),
     ('menitipkan', 'VERB'), ('pesanan', 'NOUN'), ('khas', 'ADJ'), ('kepada', 'ADP'),
     ('orang', 'NOUN'), ('ramai', 'ADJ'), ('yang', 'PRON'), ('mahu', 'ADV'),
     ('pulang', 'VERB'), ('ke', 'ADP'), ('kampung', 'NOUN'), ('halaman', 'NOUN'),
     ('masing-masing.', 'DET'), ('Dalam', 'ADP'), ('video', 'NOUN'), ('pendek', 'ADJ'),
     ('terbitan', 'NOUN'), ('Jabatan', 'PROPN'), ('Keselamatan', 'PROPN'), ('Jalan', 'PROPN'),
     ('Raya', 'PROPN'), ('(JKJR)', 'PUNCT'), ('itu,', 'DET'), ('Dr', 'PROPN'),
     ('Mahathir', 'PROPN'), ('menasihati', 'VERB'), ('mereka', 'PRON'), ('supaya', 'SCONJ'),
     ('berhenti', 'VERB'), ('berehat', 'VERB'), ('dan', 'CCONJ'), ('tidur', 'VERB'),
     ('sebentar', 'NOUN'), ('sekiranya', 'SCONJ'), ('mengantuk', 'ADJ'), ('ketika', 'SCONJ'),
     ('memandu.', 'VERB')]

[8]: quantized_model.predict(string)
[8]: [('KUALA', 'PROPN'), ('LUMPUR:', 'PROPN'), ('Sempena', 'ADP'), ('sambutan', 'NOUN'),
     ('Aidilfitri', 'NOUN'), ('minggu', 'NOUN'), ('depan,', 'ADJ'), ('Perdana', 'PROPN'),
     ('Menteri', 'PROPN'), ('Tun', 'PROPN'), ('Dr', 'PROPN'), ('Mahathir', 'PROPN'),
     ('Mohamad', 'PROPN'), ('dan', 'CCONJ'), ('Menteri', 'PROPN'), ('Pengangkutan', 'PROPN'),
     ('Anthony', 'PROPN'), ('Loke', 'PROPN'), ('Siew', 'PROPN'), ('Fook', 'PROPN'),
     ('menitipkan', 'VERB'), ('pesanan', 'NOUN'), ('khas', 'ADJ'), ('kepada', 'ADP'),
     ('orang', 'NOUN'), ('ramai', 'ADJ'), ('yang', 'PRON'), ('mahu', 'ADV'),
     ('pulang', 'VERB'), ('ke', 'ADP'), ('kampung', 'NOUN'), ('halaman', 'NOUN'),
     ('masing-masing.', 'DET'), ('Dalam', 'ADP'), ('video', 'NOUN'), ('pendek', 'ADJ'),
     ('terbitan', 'NOUN'), ('Jabatan', 'PROPN'), ('Keselamatan', 'PROPN'), ('Jalan', 'PROPN'),
     ('Raya', 'PROPN'), ('(JKJR)', 'PUNCT'), ('itu,', 'DET'), ('Dr', 'PROPN'),
     ('Mahathir', 'PROPN'), ('menasihati', 'VERB'), ('mereka', 'PRON'), ('supaya', 'SCONJ'),
     ('berhenti', 'VERB'), ('berehat', 'VERB'), ('dan', 'CCONJ'), ('tidur', 'VERB'),
     ('sebentar', 'NOUN'), ('sekiranya', 'SCONJ'), ('mengantuk', 'ADJ'), ('ketika', 'SCONJ'),
     ('memandu.', 'VERB')]
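Because predict returns plain (word, tag) tuples, post-processing is ordinary Python. A small sketch (not a cell from the original notebook) that keeps only the proper nouns:

# filter tokens by tag; 'PROPN' comes from the tag set in describe() above
propn = [word for word, tag in model.predict(string) if tag == 'PROPN']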

Group similar tags

def analyze(self, string: str):
    """
    Analyze a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : {'words': List[str],
              'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
    """

[9]: model.analyze(string)
[9]: {'words': ['KUALA', 'LUMPUR:', 'Sempena', 'sambutan', 'Aidilfitri', 'minggu', 'depan,',
               'Perdana', 'Menteri', 'Tun', 'Dr', 'Mahathir', 'Mohamad', 'dan', 'Menteri',
               'Pengangkutan', 'Anthony', 'Loke', 'Siew', 'Fook', 'menitipkan', 'pesanan',
               'khas', 'kepada', 'orang', 'ramai', 'yang', 'mahu', 'pulang', 'ke', 'kampung',
               'halaman', 'masing-masing.', 'Dalam', 'video', 'pendek', 'terbitan', 'Jabatan',
               'Keselamatan', 'Jalan', 'Raya', '(JKJR)', 'itu,', 'Dr', 'Mahathir',
               'menasihati', 'mereka', 'supaya', 'berhenti', 'berehat', 'dan', 'tidur',
               'sebentar', 'sekiranya', 'mengantuk', 'ketika', 'memandu.'],
     'tags': [{'text': 'KUALA LUMPUR:', 'type': 'PROPN', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1},
              {'text': 'Sempena', 'type': 'ADP', 'score': 1.0, 'beginOffset': 2, 'endOffset': 2},
              {'text': 'sambutan Aidilfitri minggu', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 3, 'endOffset': 5},
              {'text': 'depan,', 'type': 'ADJ', 'score': 1.0, 'beginOffset': 6, 'endOffset': 6},
              {'text': 'Perdana Menteri Tun Dr Mahathir Mohamad', 'type': 'PROPN', 'score': 1.0, 'beginOffset': 7, 'endOffset': 12},
              {'text': 'dan', 'type': 'CCONJ', 'score': 1.0, 'beginOffset': 13, 'endOffset': 13},
              {'text': 'Menteri Pengangkutan Anthony Loke Siew Fook', 'type': 'PROPN', 'score': 1.0, 'beginOffset': 14, 'endOffset': 19},
              {'text': 'menitipkan', 'type': 'VERB', 'score': 1.0, 'beginOffset': 20, 'endOffset': 20},
              {'text': 'pesanan', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 21, 'endOffset': 21},
              {'text': 'khas', 'type': 'ADJ', 'score': 1.0, 'beginOffset': 22, 'endOffset': 22},
              {'text': 'kepada', 'type': 'ADP', 'score': 1.0, 'beginOffset': 23, 'endOffset': 23},
              {'text': 'orang', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 24, 'endOffset': 24},
              {'text': 'ramai', 'type': 'ADJ', 'score': 1.0, 'beginOffset': 25, 'endOffset': 25},
              {'text': 'yang', 'type': 'PRON', 'score': 1.0, 'beginOffset': 26, 'endOffset': 26},
              {'text': 'mahu', 'type': 'ADV', 'score': 1.0, 'beginOffset': 27, 'endOffset': 27},
              {'text': 'pulang', 'type': 'VERB', 'score': 1.0, 'beginOffset': 28, 'endOffset': 28},
              {'text': 'ke', 'type': 'ADP', 'score': 1.0, 'beginOffset': 29, 'endOffset': 29},
              {'text': 'kampung halaman', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 30, 'endOffset': 31},
              {'text': 'masing-masing.', 'type': 'DET', 'score': 1.0, 'beginOffset': 32, 'endOffset': 32},
              {'text': 'Dalam', 'type': 'ADP', 'score': 1.0, 'beginOffset': 33, 'endOffset': 33},
              {'text': 'video', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 34, 'endOffset': 34},
              {'text': 'pendek', 'type': 'ADJ', 'score': 1.0, 'beginOffset': 35, 'endOffset': 35},
              {'text': 'terbitan', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 36, 'endOffset': 36},
              {'text': 'Jabatan Keselamatan Jalan Raya', 'type': 'PROPN', 'score': 1.0, 'beginOffset': 37, 'endOffset': 40},
              {'text': '(JKJR)', 'type': 'PUNCT', 'score': 1.0, 'beginOffset': 41, 'endOffset': 41},
              {'text': 'itu,', 'type': 'DET', 'score': 1.0, 'beginOffset': 42, 'endOffset': 42},
              {'text': 'Dr Mahathir', 'type': 'PROPN', 'score': 1.0, 'beginOffset': 43, 'endOffset': 44},
              {'text': 'menasihati', 'type': 'VERB', 'score': 1.0, 'beginOffset': 45, 'endOffset': 45},
              {'text': 'mereka', 'type': 'PRON', 'score': 1.0, 'beginOffset': 46, 'endOffset': 46},
              {'text': 'supaya', 'type': 'SCONJ', 'score': 1.0, 'beginOffset': 47, 'endOffset': 47},
              {'text': 'berhenti berehat', 'type': 'VERB', 'score': 1.0, 'beginOffset': 48, 'endOffset': 49},
              {'text': 'dan', 'type': 'CCONJ', 'score': 1.0, 'beginOffset': 50, 'endOffset': 50},
              {'text': 'tidur', 'type': 'VERB', 'score': 1.0, 'beginOffset': 51, 'endOffset': 51},
              {'text': 'sebentar', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 52, 'endOffset': 52},
              {'text': 'sekiranya', 'type': 'SCONJ', 'score': 1.0, 'beginOffset': 53, 'endOffset': 53},
              {'text': 'mengantuk', 'type': 'ADJ', 'score': 1.0, 'beginOffset': 54, 'endOffset': 54},
              {'text': 'ketika', 'type': 'SCONJ', 'score': 1.0, 'beginOffset': 55, 'endOffset': 55}]}
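The beginOffset / endOffset fields index into the words list, so each grouped chunk can be reconstructed by hand. A minimal sketch, assuming only the output shape shown above:

result = model.analyze(string)
# join words[beginOffset:endOffset + 1] to recover each chunk,
# e.g. ('Perdana Menteri Tun Dr Mahathir Mohamad', 'PROPN')
chunks = [
    (' '.join(result['words'][t['beginOffset']:t['endOffset'] + 1]), t['type'])
    for t in result['tags']
]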

9.45.5 Vectorize

Let's say you want to visualize word-level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, string: str):
    """
    Vectorize a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : np.array
    """

[10]: strings = [string,
                 'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',
                 'contact Husein at [email protected]',
                 'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek']

[11]: r = [quantized_model.vectorize(string) for string in strings]

[12]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[13]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(y)
      tsne.shape
[13]: (108, 2)


[14]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Pretty good, the model is able to cluster similar parts of speech together.


9.45.6 Voting stack model

[16]: alxlnet = malaya.pos.transformer(model='alxlnet')
      malaya.stack.voting_stack([model, alxlnet, alxlnet], string)
[16]: [('KUALA', 'PROPN'), ('LUMPUR:', 'PROPN'), ('Sempena', 'ADP'), ('sambutan', 'NOUN'),
      ('Aidilfitri', 'PROPN'), ('minggu', 'NOUN'), ('depan,', 'ADJ'), ('Perdana', 'PROPN'),
      ('Menteri', 'PROPN'), ('Tun', 'PROPN'), ('Dr', 'PROPN'), ('Mahathir', 'PROPN'),
      ('Mohamad', 'PROPN'), ('dan', 'CCONJ'), ('Menteri', 'PROPN'), ('Pengangkutan', 'PROPN'),
      ('Anthony', 'PROPN'), ('Loke', 'PROPN'), ('Siew', 'PROPN'), ('Fook', 'PROPN'),
      ('menitipkan', 'VERB'), ('pesanan', 'NOUN'), ('khas', 'ADJ'), ('kepada', 'ADP'),
      ('orang', 'NOUN'), ('ramai', 'ADJ'), ('yang', 'PRON'), ('mahu', 'ADV'),
      ('pulang', 'VERB'), ('ke', 'ADP'), ('kampung', 'NOUN'), ('halaman', 'NOUN'),
      ('masing-masing.', 'ADV'), ('Dalam', 'ADP'), ('video', 'NOUN'), ('pendek', 'ADJ'),
      ('terbitan', 'NOUN'), ('Jabatan', 'NOUN'), ('Keselamatan', 'PROPN'), ('Jalan', 'PROPN'),
      ('Raya', 'PROPN'), ('(JKJR)', 'PUNCT'), ('itu,', 'DET'), ('Dr', 'PROPN'),
      ('Mahathir', 'PROPN'), ('menasihati', 'VERB'), ('mereka', 'PRON'), ('supaya', 'SCONJ'),
      ('berhenti', 'VERB'), ('berehat', 'VERB'), ('dan', 'CCONJ'), ('tidur', 'VERB'),
      ('sebentar', 'ADV'), ('sekiranya', 'SCONJ'), ('mengantuk', 'ADJ'), ('ketika', 'SCONJ'),
      ('memandu.', 'VERB')]
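Conceptually, voting_stack takes the majority tag across models for each token; passing alxlnet twice gives it two votes. A rough sketch of the idea (the real implementation lives in malaya.stack and may differ in detail):

from collections import Counter

def majority_vote(per_model_predictions):
    # per_model_predictions: one List[Tuple[word, tag]] per model, same tokenization
    words = [word for word, _ in per_model_predictions[0]]
    tag_columns = zip(*[[tag for _, tag in pred] for pred in per_model_predictions])
    return [
        (word, Counter(tags).most_common(1)[0][0])
        for word, tags in zip(words, tag_columns)
    ]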


9.46 Dependency Parsing

This tutorial is available as an IPython notebook at Malaya/example/dependency.

This module is only trained on standard language structure, so it is not safe to use it for local (colloquial) language structure.

[1]: %%time
     import malaya
CPU times: user 5.15 s, sys: 925 ms, total: 6.07 s
Wall time: 6.8 s

9.46.1 Describe supported dependencies

[2]: malaya.dependency.describe()
INFO:root:you can read more from https://universaldependencies.org/treebanks/id_pud/index.html
[2]:              Tag                Description
     0            acl   clausal modifier of noun
     1          advcl  adverbial clause modifier
     2         advmod         adverbial modifier
     3           amod        adjectival modifier
     4          appos      appositional modifier
     5            aux                  auxiliary
     6           case               case marking
     7          ccomp         clausal complement
     8       compound                   compound
     9  compound:plur            plural compound
     10          conj                   conjunct
     11           cop                     copula
     12         csubj            clausal subject
     13           dep                  dependent
     14           det                 determiner
     15         fixed      multi-word expression
     16          flat                       name
     17          iobj            indirect object
     18          mark                     marker
     19          nmod           nominal modifier
     20         nsubj            nominal subject
     21           obj              direct object
     22     parataxis                  parataxis
     23          root                       root
     24         xcomp    open clausal complement

9.46.2 List available transformer Dependency models

def available_transformer(version: str = 'v2'):
    """
    List available transformer dependency parsing models.

    Parameters
    ----------
    version : str, optional (default='v2')
        Version supported. Allowed values:

        * ``'v1'`` - version 1, maintained for knowledge graph purposes.
        * ``'v2'`` - trained on a bigger dataset, better version.
    """

[5]: malaya.dependency.available_transformer()
INFO:root:tested on 20% test set.
[5]:              Size (MB)  Quantized Size (MB)  Arc Accuracy  Types Accuracy  Root Accuracy
     bert             455.0               114.00      0.820450         0.79970        0.98936
     tiny-bert         69.7                17.50      0.795252         0.72470        0.98939
     albert            60.8                15.30      0.821895         0.79752        1.00000
     tiny-albert       33.4                 8.51      0.786500         0.75870        1.00000
     xlnet            480.2               121.00      0.848110         0.82741        0.92101
     alxlnet           61.2                16.40      0.849290         0.82810        0.92099
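The older v1 models, kept for the knowledge-graph pipeline, can be listed the same way; a small sketch based on the docstring above (output omitted here):

malaya.dependency.available_transformer(version='v1')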

9.46.3 Load xlnet dependency model

def transformer(version: str = 'v2', model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer Dependency Parsing model, transfer learning Transformer + biaffine attention.

    Parameters
    ----------
    version : str, optional (default='v2')
        Version supported. Allowed values:

        * ``'v1'`` - version 1, maintained for knowledge graph purposes.
        * ``'v2'`` - trained on a bigger dataset, better version.

    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.DependencyBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.DependencyXLNET`.
    """

[4]: model = malaya.dependency.transformer(model='albert')
INFO:root:running dependency-v2/albert using device /device:CPU:0

9.46.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[5]: quantized_model = malaya.dependency.transformer(model='albert', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running dependency-v2/albert-quantized using device /device:CPU:0

9.46.5 Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : Tuple
    """

[6]: string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

[7]: d_object, tagging, indexing = model.predict(string)
     d_object.to_graphvis()
[7]: (graphviz render not captured in this export)

[8]: d_object, tagging, indexing = quantized_model.predict(string)
     d_object.to_graphvis()
[8]: (graphviz render not captured in this export)
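Besides the graph object, predict also returns the raw tagging (dependency type per word) and indexing (head position per word), which is exactly what malaya.dependency.dependency_graph consumes later in this tutorial. A quick sketch (not a cell from the original notebook) to peek at them:

print(tagging[:5])
print(indexing[:5])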

9.46.6 Voting stack model

[10]: alxlnet = malaya.dependency.transformer(model='alxlnet')
      tagging, indexing = malaya.stack.voting_stack([model, model, alxlnet], string)
      malaya.dependency.dependency_graph(tagging, indexing).to_graphvis()
INFO:root:running dependency-v2/alxlnet using device /device:CPU:0
[10]: (graphviz render not captured in this export)

9.46.7 Harder example

[13]: # https://www.astroawani.com/berita-malaysia/terbaik-tun-kita-geng-najib-razak-puji-tun-m-297884

      s = """
      KUALA LUMPUR: Dalam hal politik, jarang sekali untuk melihat dua figura ini - bekas Perdana Menteri, Datuk Seri Najib Razak dan Tun Dr Mahathir Mohamad mempunyai 'pandangan yang sama' atau sekapal. Namun, situasi itu berbeza apabila melibatkan isu ketidakpatuhan terhadap prosedur operasi standard (SOP). Najib, yang juga Ahli Parlimen Pekan memuji sikap Ahli Parlimen Langkawi itu yang mengaku bersalah selepas melanggar SOP kerana tidak mengambil suhu badan ketika masuk ke sebuah surau di Langkawi pada Sabtu lalu.
      """

[14]: d_object, tagging, indexing = model.predict(s)
      d_object.to_graphvis()
[14]: (graphviz render not captured in this export)

[15]: tagging, indexing = malaya.stack.voting_stack([model, model, alxlnet], s)
      malaya.dependency.dependency_graph(tagging, indexing).to_graphvis()
[15]: (graphviz render not captured in this export)

9.46.8 Dependency graph object

To initiate a dependency graph from dependency models, you need to call malaya.dependency.dependency_graph.

[16]: graph = malaya.dependency.dependency_graph(tagging, indexing)
      graph
[16]: (object repr not captured in this export)

Generate graphvis

[17]: graph.to_graphvis()
[17]: (graphviz render not captured in this export)

Get nodes

[17]: graph.nodes
[17]: defaultdict(

˓→.()>, {0: {'address': 0, 'word': None, 'lemma': None, 'ctag': 'TOP', 'tag': 'TOP', 'feats': None, 'head': None, 'deps': defaultdict(list, {'root': [11]}), 'rel': None}, 1: {'address': 1, 'word': 'KUALA', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, {'flat': [2], 'obl': [5], 'punct': [7]}), 'rel': 'nsubj'}, 11: {'address': 11, 'word': 'melihat', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 0, 'deps': defaultdict(list, {'nsubj': [1], 'advmod': [8, 9], 'case': [10], 'advcl': [29], (continues on next page)


(continued from previous page) 'dep': [42]}), 'rel': 'root'}, 2: {'address': 2, 'word': 'LUMPUR', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 1, 'deps': defaultdict(list, {}), 'rel': 'flat'}, 3: {'address': 3, 'word': ':', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 5, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 5: {'address': 5, 'word': 'hal', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 1, 'deps': defaultdict(list, {'punct': [3], 'case': [4], 'compound': [6]}), 'rel': 'obl'}, 4: {'address': 4, 'word': 'Dalam', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 5, 'deps': defaultdict(list, {}), 'rel': 'case'}, 6: {'address': 6, 'word': 'politik', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 5, 'deps': defaultdict(list, {}), 'rel': 'compound'}, 7: {'address': 7, 'word': ',', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 1, 'deps': defaultdict(list, {}), 'rel': 'punct'}, (continues on next page)


(continued from previous page) 8: {'address': 8, 'word': 'jarang', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, {}), 'rel': 'advmod'}, 9: {'address': 9, 'word': 'sekali', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, {}), 'rel': 'advmod'}, 10: {'address': 10, 'word': 'untuk', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, {}), 'rel': 'case'}, 12: {'address': 12, 'word': 'dua', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 13, 'deps': defaultdict(list, {}), 'rel': 'nummod'}, 13: {'address': 13, 'word': 'figura', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 29, 'deps': defaultdict(list, {'nummod': [12], 'punct': [15], 'compound:plur': [16], 'flat': [17]}), 'rel': 'obj'}, 29: {'address': 29, 'word': 'mempunyai', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, (continues on next page)


(continued from previous page) {'obj': [13, 31], 'punct': [37], 'mark': [38]}), 'rel': 'advcl'}, 14: {'address': 14, 'word': 'ini', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 17, 'deps': defaultdict(list, {}), 'rel': 'det'}, 17: {'address': 17, 'word': 'Perdana', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 13, 'deps': defaultdict(list, {'det': [14], 'flat': [18], 'punct': [19], 'appos': [20], 'conj': [25]}), 'rel': 'flat'}, 15: {'address': 15, 'word': '-', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 13, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 16: {'address': 16, 'word': 'bekas', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 13, 'deps': defaultdict(list, {}), 'rel': 'compound:plur'}, 18: {'address': 18, 'word': 'Menteri', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 17, 'deps': defaultdict(list, {}), 'rel': 'flat'}, 19: {'address': 19, 'word': ',', 'lemma': '_', 'ctag': '_', 'tag': '_', (continues on next page)


(continued from previous page) 'feats': '_', 'head': 17, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 20: {'address': 20, 'word': 'Datuk', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 17, 'deps': defaultdict(list, {'flat': [21]}), 'rel': 'appos'}, 21: {'address': 21, 'word': 'Seri', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 20, 'deps': defaultdict(list, {'flat': [22]}), 'rel': 'flat'}, 22: {'address': 22, 'word': 'Najib', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 21, 'deps': defaultdict(list, {'flat': [23]}), 'rel': 'flat'}, 23: {'address': 23, 'word': 'Razak', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 22, 'deps': defaultdict(list, {}), 'rel': 'flat'}, 24: {'address': 24, 'word': 'dan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 25, 'deps': defaultdict(list, {}), 'rel': 'cc'}, 25: {'address': 25, 'word': 'Tun', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 17, 'deps': defaultdict(list, {'cc': [24], 'flat': [26]}), (continues on next page)


(continued from previous page) 'rel': 'conj'}, 26: {'address': 26, 'word': 'Dr', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 25, 'deps': defaultdict(list, {'flat': [27]}), 'rel': 'flat'}, 27: {'address': 27, 'word': 'Mahathir', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 26, 'deps': defaultdict(list, {'flat': [28]}), 'rel': 'flat'}, 28: {'address': 28, 'word': 'Mohamad', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 27, 'deps': defaultdict(list, {}), 'rel': 'flat'}, 30: {'address': 30, 'word': "'", 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 31, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 31: {'address': 31, 'word': 'pandangan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 29, 'deps': defaultdict(list, {'punct': [30], 'amod': [33]}), 'rel': 'obj'}, 32: {'address': 32, 'word': 'yang', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 36, 'deps': defaultdict(list, {}), 'rel': 'nsubj'}, 36: {'address': 36, 'word': 'sekapal', (continues on next page)


(continued from previous page) 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 33, 'deps': defaultdict(list, {'nsubj': [32], 'punct': [34], 'cc': [35]}), 'rel': 'conj'}, 33: {'address': 33, 'word': 'sama', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 31, 'deps': defaultdict(list, {'conj': [36]}), 'rel': 'amod'}, 34: {'address': 34, 'word': "'", 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 36, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 35: {'address': 35, 'word': 'atau', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 36, 'deps': defaultdict(list, {}), 'rel': 'cc'}, 37: {'address': 37, 'word': '.', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 29, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 38: {'address': 38, 'word': 'Namun', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 29, 'deps': defaultdict(list, {}), 'rel': 'mark'}, 39: {'address': 39, 'word': ',', 'lemma': '_', 'ctag': '_', (continues on next page)


(continued from previous page) 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 42: {'address': 42, 'word': 'berbeza', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, {'punct': [39, 54, 89], 'nsubj': [40], 'advcl': [44], 'dep': [55]}), 'rel': 'dep'}, 40: {'address': 40, 'word': 'situasi', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {'det': [41]}), 'rel': 'nsubj'}, 41: {'address': 41, 'word': 'itu', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 40, 'deps': defaultdict(list, {}), 'rel': 'det'}, 43: {'address': 43, 'word': 'apabila', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 44, 'deps': defaultdict(list, {}), 'rel': 'mark'}, 44: {'address': 44, 'word': 'melibatkan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {'mark': [43], 'obj': [45]}), 'rel': 'advcl'}, 45: {'address': 45, 'word': 'isu', 'lemma': '_', (continues on next page)


(continued from previous page) 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 44, 'deps': defaultdict(list, {'compound': [46], 'nmod': [48]}), 'rel': 'obj'}, 46: {'address': 46, 'word': 'ketidakpatuhan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 45, 'deps': defaultdict(list, {}), 'rel': 'compound'}, 47: {'address': 47, 'word': 'terhadap', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 48, 'deps': defaultdict(list, {}), 'rel': 'case'}, 48: {'address': 48, 'word': 'prosedur', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 45, 'deps': defaultdict(list, {'case': [47], 'compound': [49], 'amod': [50], 'appos': [52]}), 'rel': 'nmod'}, 49: {'address': 49, 'word': 'operasi', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 48, 'deps': defaultdict(list, {}), 'rel': 'compound'}, 50: {'address': 50, 'word': 'standard', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 48, 'deps': defaultdict(list, {}), 'rel': 'amod'}, 51: {'address': 51, 'word': '(', (continues on next page)


(continued from previous page) 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 52, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 52: {'address': 52, 'word': 'SOP', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 48, 'deps': defaultdict(list, {'punct': [51, 53]}), 'rel': 'appos'}, 53: {'address': 53, 'word': ')', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 52, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 54: {'address': 54, 'word': '.', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 55: {'address': 55, 'word': 'Najib', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {'punct': [56], 'nsubj': [59], 'acl': [62]}), 'rel': 'dep'}, 56: {'address': 56, 'word': ',', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 55, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 57: {'address': 57, 'word': 'yang', 'lemma': '_', 'ctag': '_', (continues on next page)


(continued from previous page) 'tag': '_', 'feats': '_', 'head': 59, 'deps': defaultdict(list, {}), 'rel': 'nsubj'}, 59: {'address': 59, 'word': 'Ahli', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 55, 'deps': defaultdict(list, {'nsubj': [57], 'advmod': [58], 'flat': [60]}), 'rel': 'nsubj'}, 58: {'address': 58, 'word': 'juga', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 59, 'deps': defaultdict(list, {}), 'rel': 'advmod'}, 60: {'address': 60, 'word': 'Parlimen', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 59, 'deps': defaultdict(list, {'flat': [61]}), 'rel': 'flat'}, 61: {'address': 61, 'word': 'Pekan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 60, 'deps': defaultdict(list, {}), 'rel': 'flat'}, 62: {'address': 62, 'word': 'memuji', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 55, 'deps': defaultdict(list, {'obj': [63]}), 'rel': 'acl'}, 63: {'address': 63, 'word': 'sikap', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', (continues on next page)


(continued from previous page) 'head': 62, 'deps': defaultdict(list, {'flat': [64], 'acl': [69]}), 'rel': 'obj'}, 64: {'address': 64, 'word': 'Ahli', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 63, 'deps': defaultdict(list, {'flat': [65]}), 'rel': 'flat'}, 65: {'address': 65, 'word': 'Parlimen', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 64, 'deps': defaultdict(list, {'flat': [66]}), 'rel': 'flat'}, 66: {'address': 66, 'word': 'Langkawi', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 65, 'deps': defaultdict(list, {'det': [67]}), 'rel': 'flat'}, 67: {'address': 67, 'word': 'itu', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 66, 'deps': defaultdict(list, {}), 'rel': 'det'}, 68: {'address': 68, 'word': 'yang', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 69, 'deps': defaultdict(list, {}), 'rel': 'nsubj'}, 69: {'address': 69, 'word': 'mengaku', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 63, 'deps': defaultdict(list, {'nsubj': [68], 'xcomp': [70]}), 'rel': 'acl'}, (continues on next page)


(continued from previous page) 70: {'address': 70, 'word': 'bersalah', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 69, 'deps': defaultdict(list, {'xcomp': [72]}), 'rel': 'xcomp'}, 71: {'address': 71, 'word': 'selepas', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 72, 'deps': defaultdict(list, {}), 'rel': 'case'}, 72: {'address': 72, 'word': 'melanggar', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 70, 'deps': defaultdict(list, {'case': [71], 'obj': [73], 'advcl': [76]}), 'rel': 'xcomp'}, 73: {'address': 73, 'word': 'SOP', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 72, 'deps': defaultdict(list, {}), 'rel': 'obj'}, 74: {'address': 74, 'word': 'kerana', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 76, 'deps': defaultdict(list, {}), 'rel': 'mark'}, 76: {'address': 76, 'word': 'mengambil', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 72, 'deps': defaultdict(list, {'mark': [74], 'advmod': [75], 'obj': [77], (continues on next page)


(continued from previous page) 'advcl': [80]}), 'rel': 'advcl'}, 75: {'address': 75, 'word': 'tidak', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 76, 'deps': defaultdict(list, {}), 'rel': 'advmod'}, 77: {'address': 77, 'word': 'suhu', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 76, 'deps': defaultdict(list, {'compound': [78]}), 'rel': 'obj'}, 78: {'address': 78, 'word': 'badan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 77, 'deps': defaultdict(list, {}), 'rel': 'compound'}, 79: {'address': 79, 'word': 'ketika', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 80, 'deps': defaultdict(list, {}), 'rel': 'mark'}, 80: {'address': 80, 'word': 'masuk', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 76, 'deps': defaultdict(list, {'mark': [79], 'obl': [83, 85, 87]}), 'rel': 'advcl'}, 81: {'address': 81, 'word': 'ke', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 83, 'deps': defaultdict(list, {}), 'rel': 'case'}, 83: {'address': 83, (continues on next page)


(continued from previous page) 'word': 'surau', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 80, 'deps': defaultdict(list, {'case': [81], 'det': [82]}), 'rel': 'obl'}, 82: {'address': 82, 'word': 'sebuah', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 83, 'deps': defaultdict(list, {}), 'rel': 'det'}, 84: {'address': 84, 'word': 'di', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 85, 'deps': defaultdict(list, {}), 'rel': 'case'}, 85: {'address': 85, 'word': 'Langkawi', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 80, 'deps': defaultdict(list, {'case': [84]}), 'rel': 'obl'}, 86: {'address': 86, 'word': 'pada', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 87, 'deps': defaultdict(list, {}), 'rel': 'case'}, 87: {'address': 87, 'word': 'Sabtu', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 80, 'deps': defaultdict(list, {'case': [86], 'amod': [88]}), 'rel': 'obl'}, 88: {'address': 88, 'word': 'lalu', 'lemma': '_', 'ctag': '_', (continues on next page)


(continued from previous page) 'tag': '_', 'feats': '_', 'head': 87, 'deps': defaultdict(list, {}), 'rel': 'amod'}, 89: {'address': 89, 'word': '.', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {}), 'rel': 'punct'}})
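Each node is keyed by its address and carries the word, its head, and its relation, so individual tokens can be looked up directly. A small sketch using the accessor this tutorial uses later:

node = graph.get_by_address(42)
node['word'], node['rel'], node['head']
# ('berbeza', 'dep', 11), per the dump above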

Flatten the graph

[20]: list(graph.triples())
[20]: [(('melihat', '_'), 'nsubj', ('KUALA', '_')), (('KUALA', '_'), 'flat', ('LUMPUR', '_')),
      (('KUALA', '_'), 'obl', ('hal', '_')), (('hal', '_'), 'punct', (':', '_')),
      (('hal', '_'), 'case', ('Dalam', '_')), (('hal', '_'), 'compound', ('politik', '_')),
      (('KUALA', '_'), 'punct', (',', '_')), (('melihat', '_'), 'advmod', ('jarang', '_')),
      (('melihat', '_'), 'advmod', ('sekali', '_')), (('melihat', '_'), 'case', ('untuk', '_')),
      (('melihat', '_'), 'advcl', ('mempunyai', '_')), (('mempunyai', '_'), 'obj', ('figura', '_')),
      (('figura', '_'), 'nummod', ('dua', '_')), (('figura', '_'), 'punct', ('-', '_')),
      (('figura', '_'), 'compound:plur', ('bekas', '_')), (('figura', '_'), 'flat', ('Perdana', '_')),
      (('Perdana', '_'), 'det', ('ini', '_')), (('Perdana', '_'), 'flat', ('Menteri', '_')),
      (('Perdana', '_'), 'punct', (',', '_')), (('Perdana', '_'), 'appos', ('Datuk', '_')),
      (('Datuk', '_'), 'flat', ('Seri', '_')), (('Seri', '_'), 'flat', ('Najib', '_')),
      (('Najib', '_'), 'flat', ('Razak', '_')), (('Perdana', '_'), 'conj', ('Tun', '_')),
      (('Tun', '_'), 'cc', ('dan', '_')), (('Tun', '_'), 'flat', ('Dr', '_')),
      (('Dr', '_'), 'flat', ('Mahathir', '_')), (('Mahathir', '_'), 'flat', ('Mohamad', '_')),
      (('mempunyai', '_'), 'obj', ('pandangan', '_')), (('pandangan', '_'), 'punct', ("'", '_')),
      (('pandangan', '_'), 'amod', ('sama', '_')), (('sama', '_'), 'conj', ('sekapal', '_')),
      (('sekapal', '_'), 'nsubj', ('yang', '_')), (('sekapal', '_'), 'punct', ("'", '_')),
      (('sekapal', '_'), 'cc', ('atau', '_')), (('mempunyai', '_'), 'punct', ('.', '_')),
      (('mempunyai', '_'), 'mark', ('Namun', '_')), (('melihat', '_'), 'dep', ('berbeza', '_')),
      (('berbeza', '_'), 'punct', (',', '_')), (('berbeza', '_'), 'nsubj', ('situasi', '_')),
      (('situasi', '_'), 'det', ('itu', '_')), (('berbeza', '_'), 'advcl', ('melibatkan', '_')),
      (('melibatkan', '_'), 'mark', ('apabila', '_')), (('melibatkan', '_'), 'obj', ('isu', '_')),
      (('isu', '_'), 'compound', ('ketidakpatuhan', '_')), (('isu', '_'), 'nmod', ('prosedur', '_')),
      (('prosedur', '_'), 'case', ('terhadap', '_')), (('prosedur', '_'), 'compound', ('operasi', '_')),
      (('prosedur', '_'), 'amod', ('standard', '_')), (('prosedur', '_'), 'appos', ('SOP', '_')),
      (('SOP', '_'), 'punct', ('(', '_')), (('SOP', '_'), 'punct', (')', '_')),
      (('berbeza', '_'), 'punct', ('.', '_')), (('berbeza', '_'), 'dep', ('Najib', '_')),
      (('Najib', '_'), 'punct', (',', '_')), (('Najib', '_'), 'nsubj', ('Ahli', '_')),
      (('Ahli', '_'), 'nsubj', ('yang', '_')), (('Ahli', '_'), 'advmod', ('juga', '_')),
      (('Ahli', '_'), 'flat', ('Parlimen', '_')), (('Parlimen', '_'), 'flat', ('Pekan', '_')),
      (('Najib', '_'), 'acl', ('memuji', '_')), (('memuji', '_'), 'obj', ('sikap', '_')),
      (('sikap', '_'), 'flat', ('Ahli', '_')), (('Ahli', '_'), 'flat', ('Parlimen', '_')),
      (('Parlimen', '_'), 'flat', ('Langkawi', '_')), (('Langkawi', '_'), 'det', ('itu', '_')),
      (('sikap', '_'), 'acl', ('mengaku', '_')), (('mengaku', '_'), 'nsubj', ('yang', '_')),
      (('mengaku', '_'), 'xcomp', ('bersalah', '_')), (('bersalah', '_'), 'xcomp', ('melanggar', '_')),
      (('melanggar', '_'), 'case', ('selepas', '_')), (('melanggar', '_'), 'obj', ('SOP', '_')),
      (('melanggar', '_'), 'advcl', ('mengambil', '_')), (('mengambil', '_'), 'mark', ('kerana', '_')),
      (('mengambil', '_'), 'advmod', ('tidak', '_')), (('mengambil', '_'), 'obj', ('suhu', '_')),
      (('suhu', '_'), 'compound', ('badan', '_')), (('mengambil', '_'), 'advcl', ('masuk', '_')),
      (('masuk', '_'), 'mark', ('ketika', '_')), (('masuk', '_'), 'obl', ('surau', '_')),
      (('surau', '_'), 'case', ('ke', '_')), (('surau', '_'), 'det', ('sebuah', '_')),
      (('masuk', '_'), 'obl', ('Langkawi', '_')), (('Langkawi', '_'), 'case', ('di', '_')),
      (('masuk', '_'), 'obl', ('Sabtu', '_')), (('Sabtu', '_'), 'case', ('pada', '_')),
      (('Sabtu', '_'), 'amod', ('lalu', '_')), (('berbeza', '_'), 'punct', ('.', '_'))]


Check whether the graph contains cycles

[21]: graph.contains_cycle()
[21]: False

Generate networkx

Make sure you have networkx installed; if not, simply run:

pip install networkx

[22]: digraph = graph.to_networkx()
      digraph
[22]: (object repr not captured in this export)

[23]: import networkx as nx
      import matplotlib.pyplot as plt

      nx.draw_networkx(digraph)
      plt.show()
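Since contains_cycle() returned False earlier, the networkx view should agree; a quick cross-check sketch using a standard networkx call:

nx.is_directed_acyclic_graph(digraph)  # expect True for a well-formed parse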

[24]: digraph.edges()
[24]: OutMultiEdgeDataView([(1, 11), (2, 1), (3, 5), (4, 5), (5, 1), (6, 5), (7, 1), (8, 11),
      (9, 11), (10, 11), (12, 13), (13, 29), (14, 17), (15, 13), (16, 13), (17, 13), (18, 17),
      (19, 17), (20, 17), (21, 20), (22, 21), (23, 22), (24, 25), (25, 17), (26, 25), (27, 26),
      (28, 27), (29, 11), (30, 31), (31, 29), (32, 36), (33, 31), (34, 36), (35, 36), (36, 33),
      (37, 29), (38, 29), (39, 42), (40, 42), (41, 40), (42, 11), (43, 44), (44, 42), (45, 44),
      (46, 45), (47, 48), (48, 45), (49, 48), (50, 48), (51, 52), (52, 48), (53, 52), (54, 42),
      (55, 42), (56, 55), (57, 59), (58, 59), (59, 55), (60, 59), (61, 60), (62, 55), (63, 62),
      (64, 63), (65, 64), (66, 65), (67, 66), (68, 69), (69, 63), (70, 69), (71, 72), (72, 70),
      (73, 72), (74, 76), (75, 76), (76, 72), (77, 76), (78, 77), (79, 80), (80, 76), (81, 83),
      (82, 83), (83, 80), (84, 85), (85, 80), (86, 87), (87, 80), (88, 87), (89, 42)])

[25]: digraph.nodes()
[25]: NodeView((1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
      23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
      45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
      67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88,
      89))


[26]: labels = {i: graph.get_by_address(i)['word'] for i in digraph.nodes()}
      labels
[26]: {1: 'KUALA', 2: 'LUMPUR', 3: ':', 4: 'Dalam', 5: 'hal', 6: 'politik', 7: ',',
      8: 'jarang', 9: 'sekali', 10: 'untuk', 11: 'melihat', 12: 'dua', 13: 'figura',
      14: 'ini', 15: '-', 16: 'bekas', 17: 'Perdana', 18: 'Menteri', 19: ',', 20: 'Datuk',
      21: 'Seri', 22: 'Najib', 23: 'Razak', 24: 'dan', 25: 'Tun', 26: 'Dr', 27: 'Mahathir',
      28: 'Mohamad', 29: 'mempunyai', 30: "'", 31: 'pandangan', 32: 'yang', 33: 'sama',
      34: "'", 35: 'atau', 36: 'sekapal', 37: '.', 38: 'Namun', 39: ',', 40: 'situasi',
      41: 'itu', 42: 'berbeza', 43: 'apabila', 44: 'melibatkan', 45: 'isu',
      46: 'ketidakpatuhan', 47: 'terhadap', 48: 'prosedur', 49: 'operasi', 50: 'standard',
      51: '(', 52: 'SOP', 53: ')', 54: '.', 55: 'Najib', 56: ',', 57: 'yang', 58: 'juga',
      59: 'Ahli', 60: 'Parlimen', 61: 'Pekan', 62: 'memuji', 63: 'sikap', 64: 'Ahli',
      65: 'Parlimen', 66: 'Langkawi', 67: 'itu', 68: 'yang', 69: 'mengaku', 70: 'bersalah',
      71: 'selepas', 72: 'melanggar', 73: 'SOP', 74: 'kerana', 75: 'tidak', 76: 'mengambil',
      77: 'suhu', 78: 'badan', 79: 'ketika', 80: 'masuk', 81: 'ke', 82: 'sebuah',
      83: 'surau', 84: 'di', 85: 'Langkawi', 86: 'pada', 87: 'Sabtu', 88: 'lalu', 89: '.'}

[27]: plt.figure(figsize=(15, 5))
      nx.draw_networkx(digraph, labels=labels)
      plt.show()


9.46.9 Vectorize

Let's say you want to visualize word-level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, string: str):
    """
    Vectorize a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : np.array
    """

[28]: r = quantized_model.vectorize(s)

[29]: x = [i[0] for i in r]
      y = [i[1] for i in r]

[30]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(y)
      tsne.shape
[30]: (89, 2)

[31]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )


9.47 Constituency Parsing

This tutorial is available as an IPython notebook at Malaya/example/constituency.

This module is only trained on standard language structure, so it is not safe to use it for local (colloquial) language structure.

[1]: %%time

     import malaya
CPU times: user 5.92 s, sys: 1.6 s, total: 7.52 s
Wall time: 11.5 s


9.47.1 What is constituency parsing

Constituency parsing assigns a sentence its own syntactic structure, defined by a certain standard. For example,

[2]: from IPython.core.display import Image, display

     display(Image('constituency.png', width=500))

Read more at the Stanford notes: https://web.stanford.edu/~jurafsky/slp3/13.pdf. The context-free grammar depends entirely on the language, so for Bahasa we follow https://github.com/famrashel/idn-treebank.


9.47.2 List available transformer Constituency models

[2]: malaya.constituency.available_transformer()
INFO:root:tested on 20% test set.
[2]:              Size (MB)  Quantized Size (MB)  Recall  Precision  FScore  CompleteMatch  TaggingAccuracy
     bert             470.0                118.0   78.96      81.78   80.35          10.37            91.59
     tiny-bert        125.0                 31.8   74.89      78.79   76.79           9.01            91.17
     albert           180.0                 45.7   77.57      80.50   79.01           5.77            90.30
     tiny-albert       56.7                 14.5   67.21      74.89   70.84           2.11            87.75
     xlnet            498.0                126.0   81.52      85.18   83.31          11.71            91.71

Make sure you check the accuracy chart here first before selecting a model: https://malaya.readthedocs.io/en/latest/models-accuracy.html#Constituency-Parsing. The best model in terms of accuracy is XLNET.

[3]: string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

9.47.3 Load xlnet constituency model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer Constituency Parsing model, transfer learning Transformer + self-attentive parsing.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya.model.tf.Constituency class
    """

[5]: model = malaya.constituency.transformer(model='xlnet')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:73: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:75: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:68: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

9.47.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[4]: quantized_model = malaya.constituency.transformer(model='xlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.


9.47.5 Parse into NLTK Tree

Make sure you have nltk installed; if not, simply run:

pip install nltk

We prefer to parse into an NLTK tree, so we can play around with children / subtrees.

def parse_nltk_tree(self, string: str):
    """
    Parse a string into an NLTK Tree; to make it useful, make sure you already installed tkinter.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : nltk.Tree object
    """

[10]: tree = model.parse_nltk_tree(string)

[11]: tree
[11]: (NLTK tree render not captured in this export)

[5]: tree = quantized_model.parse_nltk_tree(string)

[6]: tree
[6]: (NLTK tree render not captured in this export)
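Because the result is a standard nltk.Tree, the usual tree API applies. A minimal sketch (the 'NP' label is an assumption; the actual label inventory follows the idn-treebank grammar mentioned above):

# collect the phrase text of every NP subtree, if that label exists in the parse
noun_phrases = [' '.join(st.leaves()) for st in tree.subtrees(lambda t: t.label() == 'NP')]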

9.47.6 Parse into Tree

This is a simple Tree object defined at malaya.text.trees.

def parse_tree(self, string):
    """
    Parse a string into string treebank format.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : malaya.text.trees.InternalTreebankNode class
    """

[6]: tree = model.parse_tree(string)

9.47.7 Vectorize

Let's say you want to visualize word-level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, string: str):
    """
    Vectorize a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : np.array
    """


[5]: r = quantized_model.vectorize(string)

[7]: x = [i[0] for i in r]
     y = [i[1] for i in r]

[9]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(y)
     tsne.shape
[9]: (14, 2)

[10]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )



9.48 Abstractive

This tutorial is available as an IPython notebook at Malaya/example/abstractive-summarization.

This module is only trained on standard language structure, so it is not safe to use it for local (colloquial) language structure.

This module is trained heavily on news structure.

[1]: %%time
     import malaya
     from pprint import pprint
CPU times: user 4.79 s, sys: 711 ms, total: 5.51 s
Wall time: 4.46 s

[2]: import re

     # minimum cleaning, just simply to remove newlines.
     def cleaning(string):
         string = string.replace('\n', ' ')
         string = re.sub(r'[ ]+', ' ', string).strip()
         return string

I am going to simply copy-paste some local news into this notebook. I searched for isu mahathir on Google News, link here.

link: https://www.hmetro.com.my/mutakhir/2020/05/580438/peletakan-jawatan-tun-m-ditolak-bukan-lagi-isu

Title: Peletakan jawatan Tun M ditolak, bukan lagi isu.

Body: PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu.

Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin.

Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir.

“Jadi ini agak berlawanan dengan keputusan yang kita sudah buat. Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan.

“Semua keputusan mesti dibuat melalui parti. Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti.

“Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM. Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti,” katanya kepada Harian Metro.

Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah.

Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah.

Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT.

“Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak,” katanya.

Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional.

“Kenyataan media bukanlah keputusan rasmi. Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat. Kita catat di dalam minit apa yang berlaku di dalam mesyuarat,” katanya.

[3]: string = """
     PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu.

     Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin.

     Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir.

     "Jadi ini agak berlawanan dengan keputusan yang kita sudah buat. Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan.

     "Semua keputusan mesti dibuat melalui parti. Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti.

     "Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM. Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro.

     Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah.

     Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah.

     Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT.

     "Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya.

     Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional.

     "Kenyataan media bukanlah keputusan rasmi. Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat. Kita catat di dalam minit apa yang berlaku di dalam mesyuarat," katanya.
     """
     string = cleaning(string)

Link: https://www.malaysiakini.com/news/525953

Title: Mahathir jangan hipokrit isu kes mahkamah Riza, kata Takiyuddin

Body: Menteri undang-undang Takiyuddin Hassan berkata kerajaan berharap Dr Mahathir Mohamad tidak bersikap hipokrit dengan mengatakan beliau tertanya-tanya dan tidak faham dengan keputusan mahkamah melepas tanpa membebaskan (DNAA) Riza Aziz, anak tiri bekas perdana menteri Najib Razak, dalam kes pengubahan wang haram membabitkan dana 1MDB.

Pemimpin PAS itu berkata ini kerana keputusan itu dibuat oleh peguam negara dan dilaksanakan oleh timbalan pendakwa raya yang mengendalikan kes tersebut pada akhir 2019.

“Saya merujuk kepada kenyataan Dr Mahathir tentang tindakan Mahkamah Sesyen memberikan pelepasan tanpa pembebasan (discharge not amounting to acquittal) kepada Riza Aziz baru-baru ini.

“Kerajaan berharap Dr Mahathir tidak bersikap hipokrit dengan mengatakan beliau ‘tertanya-tanya’, keliru dan tidak faham terhadap suatu keputusan yang dibuat oleh Peguam Negara dan dilaksanakan oleh Timbalan Pendakwa Raya yang mengendalikan kes ini pada akhir tahun 2019,” katanya dalam satu kenyataan hari ini.

Riza pada Khamis dilepas tanpa dibebaskan daripada lima pertuduhan pengubahan wang berjumlah AS$248 juta (RM1.08 bilion).

Dalam persetujuan yang dicapai antara pihak Riza dan pendakwaan, beliau dilepas tanpa dibebaskan atas pertuduhan itu dengan syarat memulangkan semula aset dari luar negara dengan nilai anggaran AS$107.3 juta (RM465.3 juta).

Ekoran itu, Mahathir antara lain menyuarakan kekhuatirannya berkenaan persetujuan itu dan mempersoalkan jika pihak yang didakwa atas tuduhan mencuri boleh terlepas daripada tindakan jika memulangkan semula apa yang dicurinya.

“Dia curi berbilion-bilion... Dia bagi balik kepada kerajaan. Dia kata kepada kerajaan, ‘Nah, duit yang aku curi. Sekarang ini, jangan ambil tindakan terhadap aku.’ Kita pun kata, ‘Sudah bagi balik duit okey lah’,” katanya.

Menjelaskan bahawa beliau tidak mempersoalkan keputusan mahkamah, Mahathir pada masa sama berkata ia menunjukkan undang-undang mungkin perlu dipinda.

Mengulas lanjut, Takiyuddin yang juga setiausaha agung PAS berkata kenyataan Mahathir tidak munasabah sebagai bekas perdana menteri.

“Kerajaan berharap Dr Mahathir tidak terus bertindak mengelirukan rakyat dengan mengatakan beliau ‘keliru’.

“Kerajaan PN akan terus bertindak mengikut undang-undang dan berpegang kepada prinsip kebebasan badan kehakiman dan proses perundangan yang sah,” katanya.

[4]: string2 = """
     Menteri undang-undang Takiyuddin Hassan berkata kerajaan berharap Dr Mahathir Mohamad tidak bersikap hipokrit dengan mengatakan beliau tertanya-tanya dan tidak faham dengan keputusan mahkamah melepas tanpa membebaskan (DNAA) Riza Aziz, anak tiri bekas perdana menteri Najib Razak, dalam kes pengubahan wang haram membabitkan dana 1MDB.

     Pemimpin PAS itu berkata ini kerana keputusan itu dibuat oleh peguam negara dan dilaksanakan oleh timbalan pendakwa raya yang mengendalikan kes tersebut pada akhir 2019.

     “Saya merujuk kepada kenyataan Dr Mahathir tentang tindakan Mahkamah Sesyen memberikan pelepasan tanpa pembebasan (discharge not amounting to acquittal) kepada Riza Aziz baru-baru ini.

     “Kerajaan berharap Dr Mahathir tidak bersikap hipokrit dengan mengatakan beliau ‘tertanya-tanya’, keliru dan tidak faham terhadap suatu keputusan yang dibuat oleh Peguam Negara dan dilaksanakan oleh Timbalan Pendakwa Raya yang mengendalikan kes ini pada akhir tahun 2019,” katanya dalam satu kenyataan hari ini.

     Riza pada Khamis dilepas tanpa dibebaskan daripada lima pertuduhan pengubahan wang berjumlah AS$248 juta (RM1.08 bilion).

     Dalam persetujuan yang dicapai antara pihak Riza dan pendakwaan, beliau dilepas tanpa dibebaskan atas pertuduhan itu dengan syarat memulangkan semula aset dari luar negara dengan nilai anggaran AS$107.3 juta (RM465.3 juta).

     Ekoran itu, Mahathir antara lain menyuarakan kekhuatirannya berkenaan persetujuan itu dan mempersoalkan jika pihak yang didakwa atas tuduhan mencuri boleh terlepas daripada tindakan jika memulangkan semula apa yang dicurinya.

     "Dia curi berbilion-bilion...Dia bagi balik kepada kerajaan. Dia kata kepada kerajaan, 'Nah, duit yang aku curi. Sekarang ini, jangan ambil tindakan terhadap aku.' Kita pun kata, 'Sudah bagi balik duit okey lah'," katanya.

     Menjelaskan bahawa beliau tidak mempersoalkan keputusan mahkamah, Mahathir pada masa sama berkata ia menunjukkan undang-undang mungkin perlu dipinda.

     Mengulas lanjut, Takiyuddin yang juga setiausaha agung PAS berkata kenyataan Mahathir tidak munasabah sebagai bekas perdana menteri.

     "Kerajaan berharap Dr Mahathir tidak terus bertindak mengelirukan rakyat dengan mengatakan beliau ‘keliru’.

     “Kerajaan PN akan terus bertindak mengikut undang-undang dan berpegang kepada prinsip kebebasan badan kehakiman dan proses perundangan yang sah,” katanya.
     """

     string2 = cleaning(string2)

9.48.1 List available Transformer models

[2]: malaya.summarization.abstractive.available_transformer()
INFO:root:tested on 12k CNN + DailyNews test set.
[2]:                Size (MB)  Quantized Size (MB)   ROUGE-1   ROUGE-2   ROUGE-L  Suggested length
     t5                1250.0                481.0  0.371740  0.184714  0.258272             512.0
     small-t5           355.6                195.0  0.366970  0.177330  0.254670             512.0
     tiny-t5            208.0                103.0  0.302676  0.119321  0.202918             512.0
     pegasus            894.0                225.0  0.251093  0.066789  0.155907             512.0
     small-pegasus      293.0                 74.2  0.290123  0.118788  0.192322             512.0
     bigbird            910.0                230.0  0.267346  0.072391  0.161326            1536.0
     small-bigbird      303.0                 77.3  0.246203  0.058961  0.151590            1536.0

1. t5 is the multitask Transformer from the T5 paper.
2. pegasus is a Pegasus model pretrained with the sentence-gap objective.
3. bigbird is a BigBird encoder finetuned with the sentence-gap Pegasus objective.

9.48.2 Load Transformer model

def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
    """
    Load Malaya transformer encoder-decoder model to generate a summary given a string.

    Parameters
    ----------
    model : str, optional (default='small-t5')
        Model architecture supported. Allowed values:

        * ``'t5'`` - T5 BASE parameters.
        * ``'small-t5'`` - T5 SMALL parameters.
        * ``'tiny-t5'`` - T5 TINY parameters.
        * ``'pegasus'`` - Pegasus BASE parameters.
        * ``'small-pegasus'`` - Pegasus SMALL parameters.
        * ``'bigbird'`` - BigBird + Pegasus BASE parameters.
        * ``'small-bigbird'`` - BigBird + Pegasus SMALL parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `t5` in model, will return `malaya.model.t5.Summarization`.
        * if `bigbird` in model, will return `malaya.model.bigbird.Summarization`.
        * if `pegasus` in model, will return `malaya.model.pegasus.Summarization`.
    """

For T5 models, you need to install tensorflow-text and make sure the tensorflow-text version matches your tensorflow version:

# if your TF version is 1.15.4
pip3 install tensorflow-text==1.15.1

# if your TF version is 2.5
pip3 install tensorflow-text==2.5.0
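A quick way to check which release you need before installing (a minimal sketch, not part of Malaya):

import tensorflow as tf

# Print the installed tensorflow version, then install the matching
# tensorflow-text release, e.g. '2.5.0' -> pip3 install tensorflow-text==2.5.0
print(tf.__version__)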

[6]: # t5 = malaya.summarization.abstractive.transformer(model = 't5')

[3]: model = malaya.summarization.abstractive.transformer(model='pegasus')
small_model = malaya.summarization.abstractive.transformer(model='small-pegasus')

Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from a quantized model, and it is not necessarily faster than the normal 32-bit float model; speed depends entirely on the machine.

[8]: t5 = malaya.summarization.abstractive.transformer(model='t5', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running abstractive-summarization-v2/pegasus-quantized using device /device:CPU:0

Predict using greedy decoder

def greedy_decoder(
    self,
    strings: List[str],
    postprocess: bool = False,
    **kwargs,
):
    """
    Summarize strings using greedy decoder.

    Parameters
    ----------
    strings: List[str]
    postprocess: bool, optional (default=False)
        If True, will filter sentence generated using ROUGE score and removed international news publisher.

    Returns
    -------
    result: List[str]
    """

[17]: pprint(t5.greedy_decoder([string]))
['Kenyataan media yang dibuat beberapa pimpinan parti tidak mengubah keputusan '
 'mesyuarat. Kenyataan media tidak mengubah keputusan mesyuarat. Marzuki '
 'berkata peletakan jawatan Dr Mahathir adalah sah. Beliau berkata peletakan '
 'jawatan itu sudah diputuskan oleh semua pihak.']

[20]: pprint(t5.greedy_decoder([string2]))
['"Kerajaan berharap Dr Mahathir tidak bersikap hipokrit," kata menteri '
 'undang-undang. Riza Aziz, anak tiri Najib Razak, dilepas tanpa dibebaskan '
 'daripada lima pertuduhan pengubahan wang haram. Mahathir mengatakan Riza '
 'adalah "tertanya-tanya" dan tidak faham. Mahathir juga mempersoalkan jika '
 'pihak yang didakwa mencuri boleh terlepas daripada tindakan jika memulangkan '
 'semula aset.']

[9]: pprint(model.greedy_decoder([string]))
['Bekas Pengerusi JPT , Datuk Dr . Ismail Mustafa mempersoalkan keputusan '
 'Mesyuarat Agung Bersatu . Mustafar dilantik sebagai Ketua Parti Bersatu '
 'Sabah dan akan digantikan dengan kepimpinan parti . Mesyuarat ini bukan '
 'untuk Mesyuarat Agung Bersatu tetapi untuk Mesyuarat Parti Bersatu yang akan '
 'diadakan pada 1 Februari ini .']

[7]: pprint(small_model.greedy_decoder([string]))
['Dengan adanya sebarang bayaran balik , pihak - pihak bersetuju untuk membuat '
 'perubahan . Dr Mahathir Mohamad ditolak oleh MPT . Ketua Parti Pribumi '
 'Bersatu Malaysia ( USDP ) menyatakan kedudukan Yassin tidak lagi sah .']

Predict using nucleus decoder

def nucleus_decoder(
    self,
    strings: List[str],
    top_p: float = 0.7,
    temperature: float = 0.2,
    postprocess: bool = False,
    **kwargs,
):
    """
    Summarize strings using nucleus decoder.

    Parameters
    ----------
    strings: List[str]
    top_p: float, (default=0.7)
        cumulative distribution and cut off as soon as the CDF exceeds `top_p`.
    temperature: float, (default=0.2)
        logits * -log(random.uniform) * temperature.
    postprocess: bool, optional (default=False)
        If True, will filter sentence generated using ROUGE score and removed international news publisher.

    Returns
    -------
    result: List[str]
    """

[8]: pprint(model.nucleus_decoder([string]))
['Bekas Pengerusi JPT , Datuk Dr . Ismail Mustafa mempersoalkan keputusan '
 'Mesyuarat Agung Bersatu . Mustafar dilantik sebagai Ketua Parti Bersatu '
 'Sabah dan akan digantikan dengan kepimpinan parti . Memetik laporan media '
 'yang menyebut Dr . Mahathir sebagai seorang lagi ahli bukan Melayu .']

[9]: pprint(small_model.nucleus_decoder([string]))
['Dengan adanya sebarang bayaran balik , pihak - pihak bersetuju untuk membuat '
 'semua keputusan di dalam sebuah mesyuarat . Dr Mahathir Mohamad ditolak oleh '
 'MPT dalam mesyuarat khas . Ketua Parti Pribumi Bersatu Malaysia ( USDP ) '
 'menyatakan pihak tidak pernah mengubah keputusan .']

9.49 Long Text Abstractive Summarization

This tutorial is available as an IPython notebook at Malaya/example/long-text-abstractive.

This module is only trained on standard language structure, so it is not safe to use it on local (colloquial) language structure.

This module is trained heavily on news structure.

[1]: %%time
import malaya
from pprint import pprint
CPU times: user 5.08 s, sys: 861 ms, total: 5.94 s
Wall time: 5.23 s

[2]: import re

# minimum cleaning, just simply to remove newlines.
def cleaning(string):
    string = string.replace('\n', ' ')
    string = re.sub(r'[ ]+', ' ', string).strip()
    return string
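A quick sanity check of cleaning on a hypothetical input, just to illustrate the behaviour:

# Newlines become spaces, then repeated spaces collapse into one.
print(cleaning('baris pertama\nbaris   kedua'))
# -> 'baris pertama baris kedua'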

9.49.1 List available Transformer models

[2]: malaya.summarization.abstractive.available_transformer()
INFO:root:tested on 12k CNN + DailyNews test set.
[2]:                Size (MB)  Quantized Size (MB)   ROUGE-1   ROUGE-2   ROUGE-L  Suggested length
     t5                1250.0                481.0  0.341030  0.149940  0.236550             512.0
     small-t5           355.6                195.0  0.366970  0.177330  0.254670             512.0
     tiny-t5            208.0                103.0  0.341030  0.149940  0.236550             512.0
     pegasus            894.0                225.0  0.251093  0.066789  0.155907             512.0
     small-pegasus      293.0                 74.2  0.290123  0.118788  0.192322             512.0
     bigbird            910.0                230.0  0.267346  0.072391  0.161326            1536.0
     small-bigbird      303.0                 77.3  0.246203  0.058961  0.151590            1536.0

If you look at Suggested length, bigbird accepts inputs 3x longer than the T5 and PEGASUS models. Below is an example where I combined 3 news articles into 1 string; a chunking sketch for the shorter models follows.
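For models with a 512-token suggested length, one simple workaround is to chunk a long input first. This is a minimal sketch, not a Malaya API; chunk_words and its word-based budget are assumptions, since the real limit is measured in subword tokens:

def chunk_words(string, max_words=512):
    # Naive word-based chunking; a subword tokenizer would be more accurate.
    words = string.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

Each chunk can then be summarized independently and the partial summaries concatenated, at the cost of losing cross-chunk context that bigbird would keep.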

[4]: # https://www.astroawani.com/berita-malaysia/lagipun-kalau-isu-yang-berlaku-198788-najib-boleh-ingat-dia-nak-mandikan-keris-dengan-darah-tun-m-282301
# https://www.bharian.com.my/berita/nasional/2021/02/785400/kabinet-tidak-pernah-bangkitkan-isu-konflik-kepentingan-najib
# https://www.astroawani.com/berita-malaysia/laporan-audit-1mdb-najib-gagal-gugurkan-sri-ram-daripada-pasukan-pendakwaan-283003
# https://www.astroawani.com/berita-politik/tun-m-menyesal-lantik-tommy-thomas-sebagai-peguam-negara-281460
# https://www.astroawani.com/berita-malaysia/tun-mahathir-mahu-saya-letak-jawatan-kerana-tentangan-orang-melayu-tommy-thomas-280066

string= """ Perseteruan antara dua bekas Perdana Menteri, Tun Dr Mahathir Mohamad dan Datuk Seri

˓→Najib Tun Razak belum ada penghujungnya dengan masing-masing berbalas kenyataan di

˓→media sosial. Selepas Najib menyanggah kenyataan tidak campur tangan dalam badan kehakiman negara

˓→dan mentertawakannya, Dr Mahathir membalasnya dengan meminta Ahli Parlimen Pekan

˓→itu memberi perhatian kepada kes 1Malaysia Development Berhad (1MDB). Dr Mahathir juga secara sinis berkata, jika Najib boleh mengingati isu yang berlaku

˓→pada 1987 dan 1988 - isu pemilihan UMNO dan pengharaman parti itu, Najib juga boleh

˓→mengingati peristiwa dia ingin mandikan keris dengan darah. "Saya rasa Najib tak payah campur tangan dengan tuduhan terhadap saya. Dia harus

˓→fokus kes curi duit rakyat berbilion-bilion dalam 1MDB. "Dia juga perlu bagi perhatian saman Tommy Thomas yang kait dia dengan pembunuhan ˓→Altantuya. Lagipun, kalau isu yang berlaku 1987/88, Najib boleh ingat (peristiwa)(continues on next page) ˓→yang dia'hunus' keris," kata Dr Mahathir.

9.49. Long Text Abstractive Summarization 453 malaya Documentation

(continued from previous page) Sebelum ini, Najib menyanggah dakwaan Dr Mahathir yang mendakwa pembatalan

˓→pendaftaran UMNO pada 1998 sebagai bukti tidak campur tangan dalam badan kehakiman

˓→negara. Menyokong hujahnya, Najib berkongsi apa yang berlaku pada tahun tersebut hingga

˓→menyebabkan UMNO diharamkan dan tertubuhnya UMNO baharu.

Mahkamah Tinggi di sini, hari ini, diberitahu Kabinet tidak pernah membangkitkan isu

˓→mengenai konflik kepentingan Datuk Seri Najib Razak dalam 1Malaysia Development

˓→Berhad (1MDB).

Menurut Kod Etika Bagi Anggota Pentadbiran dan Ahli Parlimen adalah menjadi amalan

˓→ahli Kabinet untuk mengisytiharkan konflik kepentingan sekiranya mempunyai

˓→pembabitan di dalam hal yang dibincangkan dalam Mesyuarat Jemaah Menteri.

Perkara itu dimaklumkan bekas Timbalan Ketua Setiausaha (Kabinet) Bahagian Kabinet

˓→Perlembagaan dan Perhubungan Jabatan Perdana Menteri (JPM), Tan Sri Mazidah Abdul

˓→Majid, dalam keterangannya pada perbicaraan kes 1MDB yang dihadapi Najib.

Kod etika itu antara lain turut menyatakan bahawa ahli Kabinet yang mempunyai

˓→kepentingan peribadi dan bercanggah dengan kepentingan kerajaan, atau membabitkan

˓→ahli keluarga, perlu meninggalkan mesyuarat dan merekodkan ketidakhadiran mereka.

Di dalam 1MDB, Najib memegang tiga jawatan iaitu Perdana Menteri, Menteri Kewangan

˓→dan Pengerusi Pengerusi Badan Penasihat 1MDB

Menjawab soalan peguam Tan Sri Muhammad Shafee Abdullah, sama ada terdapat ahli

˓→Kabinet yang membangkitkan isu bahawa Menteri Kewangan tidak seharusnya membabitkan

˓→diri dalam perbincangan itu kerana konflik kepentingan, Maziah menjawab:"Tiada."

Muhammad Shafee: Ada sesiapa yang membangkitkan perkara berhubung hal 1MBD dengan

˓→Najib?

Maziah: Tidak

Muhammad Shafee: Timbalan Perdana Menteri (ketika itu) adalah Tan Sri Muhyiddin

˓→Yassin, manakala Datuk Seri adalah bekas Menteri Kewangan II.

˓→Mereka juga tidak pernah membangkitkan hal konflik kepentingan?

Mazidah: Ya, tidak pernah.

Sementara itu, menjawab soalan Timbalan Pendakwa Raya, Ahmad Akram Gharib sama ada

˓→beliau mengetahui bahawa Najib mempunyai kepentingan peribadi dalam 1MDB, Mazidah

˓→berkata:"Tidak"

Ahmad Akram: Adakah anda mengetahui bahawa Najib secara peribadi menerima duit

˓→daripada 1MDB?

Maziah: Tidak

Ahmad Akram: Sekiranya Najib secara peribadi menerima wang daripada 1MDB adakah itu

˓→konflik kepentingan dan melanggar Kod Etika Bagi Anggota Pentadbiran dan Ahli

˓→Parlimen.

Maziah: Menurut pandangan peribadi saya, ya, namun kerana ia membabitkan menteri dan

˓→perdana menteri, saya cadangkan untuk dapatkan pandangan Peguam Negara.

Terdahulu, di awal prosiding, Maziah turut memberitahu mahkamah bahawa Najib tidak

˓→pernah menyebut nama ahli perniagaan dalam buruan, Low Taek Jho atau Jho(continues Low onpada next page)

˓→mesyuarat Kabinet sebagai individu yang membantu beliau mendapatkan sumbangan

˓→daripada kerabat diraja Arab Saudi. 454 Chapter 9. Contents: malaya Documentation

(continued from previous page)

"Sekiranya perkara itu dimaklumkan kepada Kabinet, maka, ia akan dicatat dalam minit

˓→mesyuarat," katanya.

Tambah Maziah, beliau hanya mendengar dan mengetahui mengenai nama Jho Low selepas

˓→timbul isu membabitkan 1MDB.

Najib, 68, menghadapi empat pertuduhan menggunakan kedudukannya untuk memperoleh

˓→suapan berjumlah RM2.3 bilion daripada dana 1MDB dan 21 pertuduhan pengubahan wang

˓→haram membabitkan jumlah wang yang sama.

Perbicaraan di hadapan Hakim Collin Lawrence Sequerah bersambung Isnin ini.

Datuk Seri Najib Tun Razak hari ini gagal dalam satu lagi cubaannya untuk

˓→menggugurkan bekas Hakim Mahkamah Persekutuan Datuk Seri Gopal Sri Ram daripada

˓→mengetuai pasukan pendakwaan dalam kes meminda laporan audit 1Malaysia Development

˓→Berhad (1MDB) melibatkan bekas perdana menteri itu. Ini merupakan cubaan kali ketiga Najib untuk menggugurkan Sri Ram sebagai pendakwa

˓→dalam kes jenayah berkaitan 1MDB itu. Sebelum ini, satu permohonan difailkan dalam

˓→satu lagi kes 1MDB di hadapan hakim berbeza dan cubaan kedua menerusi prosiding

˓→sivil. Ketika menolak permohonan Ahli Parlimen Pekan itu, Hakim Mahkamah Tinggi Mohamed

˓→Zaini Mazlan berkata dakwaan Najib bahawa Sri Ram terlibat dalam siasatan

˓→terhadapnya sebagai tidak ada merit. “Tidak ada bukti kukuh untuk menyokong dakwaan pemohon dan kekal sebagai hipotesis

˓→semata-mata. Isu ini telah disiasat oleh pihak pendakwaan selaku responden semasa

˓→permohonan pemohon (Najib) yang terdahulu. “Isu ini sudah dibincangkan dan diputuskan. Keputusan oleh mahkamah lain sebelum ini

˓→kekal dan tidak boleh diterbalikkan,” kata hakim. Mohamed Zaini ketika memberi alasan penolakan berkata pemohon, antara lain,

˓→menjadikan komunikasi di antara bekas Peguam Negara Tan Sri Mohamed Apandi Ali dan

˓→Sri Ram sebagai bukti berat sebelah Sri Ram terhadap pemohon dan kebimbangan

˓→pemohon berhubung perkara itu adalah tidak berasas. “Seperti juga individu lain, Sri Ram berhak mempunyai pandangan peribadi. Itu sahaja.

˓→Namun, pertimbangan berbeza akan dibuat jika beliau menunjukkan sikap berat sebelah

˓→semasa melaksanakan tugas sebagai pendakwa raya kanan. Pandangan peribadi beliau

˓→tidak boleh dianggap akan menghalang tanggungjawab beliau sebagai pendakwa raya

˓→kanan. “Tambahan pula, kejadian itu, seperti yang dikemukakan oleh responden, berlaku ketika

˓→sebelum pelantikan Sri Ram sebagai pendakwa raya kanan. Turut penting ialah pemohon

˓→tidak membuat sebarang aduan mengenai tindak-tanduk Sri Ram ketika menjalankan

˓→perbicaraan melibatkan pemohon. Ini mengukuhkan hujah responden bahawa Sri Ram

˓→bersikap terbuka semasa menjalankan tugas sebagai pendakwa raya kanan,” kata hakim. Mohamed Zaini seterusnya berkata pada akhirnya, mahkamah bertanggungjawab memastikan

˓→sesebuah perbicaraan dilaksanakan secara adil demi mendapatkan keadilan. “Mahkamah akan membantu mana-mana pihak yang dilayan secara tidak adil, jika perkara

˓→tersebut berlaku. Sehubunggan itu, permohonan pemohon ditolak,” katanya. Perbicaraan kes laporan audit 1MDB itu akan bersambung pada 22 Feb ini. Pada prosiding hari ini, Timbalan Pendakwa Raya Ahmad Akram Gharib bertindak bagi

˓→pihak pendakwaan, manakala Najib diwakili peguam Nur Syahirah Hanapiah. Najib, 67, dan bekas Ketua Pegawai Eksekutif 1MDB Arul Kanda Kandasamy, 45,

˓→dibicarakan atas tuduhan meminda laporan audit 1MDB. Ahli Parlimen Pekan itu dituduh menggunakan kedudukannya untuk mengarahkan pindaan ke

˓→atas laporan audit akhir 1MDB sebelum dibentangkan kepada Jawatankuasa Kira-Kira

˓→Wang Negara bagi mengelakkan sebarang tindakan diambil terhadapnya, sementara Arul

˓→Kanda didakwa bersubahat dengan Najib dalam membuat pindaan ke atas laporan

˓→tersebut bagi melindungi Najib daripada dikenakan tindakan. (continues on next page)

9.49. Long Text Abstractive Summarization 455 malaya Documentation

(continued from previous page) """

string= cleaning(string)

[5]: len(string.split())
[5]: 1020

9.49.2 Load Transformer model

def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
    """
    Load Malaya transformer encoder-decoder model to generate a summary given a string.

    Parameters
    ----------
    model : str, optional (default='small-t5')
        Model architecture supported. Allowed values:

        * ``'t5'`` - T5 BASE parameters.
        * ``'small-t5'`` - T5 SMALL parameters.
        * ``'tiny-t5'`` - T5 TINY parameters.
        * ``'pegasus'`` - Pegasus BASE parameters.
        * ``'small-pegasus'`` - Pegasus SMALL parameters.
        * ``'bigbird'`` - BigBird + Pegasus BASE parameters.
        * ``'small-bigbird'`` - BigBird + Pegasus SMALL parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `t5` in model, will return `malaya.model.t5.Summarization`.
        * if `bigbird` in model, will return `malaya.model.bigbird.Summarization`.
        * if `pegasus` in model, will return `malaya.model.pegasus.Summarization`.
    """

[4]: bigbird = malaya.summarization.abstractive.transformer(model='bigbird')
bigbird_small = malaya.summarization.abstractive.transformer(model='small-bigbird')
INFO:root:running abstractive-summarization-v2/bigbird using device /device:CPU:0
INFO:root:running abstractive-summarization-v2/small-bigbird using device /device:CPU:0


Predict using greedy decoder

def greedy_decoder(
    self,
    strings: List[str],
    postprocess: bool = False,
    **kwargs,
):
    """
    Summarize strings using greedy decoder.

    Parameters
    ----------
    strings: List[str]
    postprocess: bool, optional (default=False)
        If True, will filter sentence generated using ROUGE score and removed international news publisher.

    Returns
    -------
    result: List[str]
    """

[8]: pprint(t5.greedy_decoder([string], mode='ringkasan'))
['Perdana Menteri, Dr Mahathir membalas dengan meminta Najib untuk memberi '
 'perhatian kepada kes 1MDB. Mahkamah Tinggi diberitahu Kabinet tidak pernah '
 'membangkitkan isu konflik kepentingan Najib. Hakim menolak permintaan Najib '
 'untuk menggugurkan Hakim Mahkamah Persekutuan Gopal Sri Ram. Ini adalah '
 'percubaan ketiga Najib untuk menggugurkan Sri Ram sebagai pendakwa dalam kes '
 '1MDB.']

[7]: pprint(bigbird.greedy_decoder([string]))
['Bekas Perdana Menteri , Datuk Seri Najib Razak tidak akan memberi keterangan '
 'di hadapan mahkamah berhubung kes berkenaan . Seorang lagi responden '
 'menyatakan jika perbicaraan kes terhadap Najib Razak telah diadakan , beliau '
 'akan didakwa secara peribadi . Pendakwa raya menyatakan Najib Razak tidak '
 'mempunyai bukti seberat sebelah dalam kes tersebut . Bekas Perdana Menteri , '
 'Najib Razak tidak mempunyai pandangan mengenai perkara ini .']

[8]: pprint(bigbird_small.greedy_decoder([string]))
['BARU : Seri Najib berkata , beliau tidak pernah secara peribadi meminta '
 'untuk menggugurkan nama Serie - A dalam kes tersebut . Dia berkata bahawa '
 'dia tidak pernah secara peribadi menyentuh perkara itu sebelum ini . Russell '
 'Maziah berkata , beliau tidak pernah secara peribadi meminta untuk '
 'menggugurkan nama Serie A . S .']


Predict using nucleus decoder

def nucleus_decoder(
    self,
    strings: List[str],
    top_p: float = 0.7,
    temperature: float = 0.2,
    postprocess: bool = False,
    **kwargs,
):
    """
    Summarize strings using nucleus decoder.

    Parameters
    ----------
    strings: List[str]
    top_p: float, (default=0.7)
        cumulative distribution and cut off as soon as the CDF exceeds `top_p`.
    temperature: float, (default=0.2)
        logits * -log(random.uniform) * temperature.
    postprocess: bool, optional (default=False)
        If True, will filter sentence generated using ROUGE score and removed international news publisher.

    Returns
    -------
    result: List[str]
    """

[9]: pprint(bigbird.nucleus_decoder([string]))
['Bekas Perdana Menteri , Datuk Seri Najib Razak tidak akan memberi keterangan '
 'di hadapan mahkamah berhubung kes berkenaan . Seorang lagi responden '
 'menyatakan sama ada Najib Razak menyertai parti politik yang dikaitkan '
 'dengan pembunuhan tokoh - tokoh penting negara . Pendakwa raya menyatakan '
 'Peguam Negara tidak akan membuat pindaan terhadap kes yang sedang dihadapi .']

[10]: pprint(bigbird_small.nucleus_decoder([string]))
['BARU : Seri Najib berkata , beliau masih belum mendedahkan pandangan ahli '
 'Kabinet . Mahathir Mohamad awas dakwaan konflik interestsome tidak pernah '
 'dibincangkan sebelum ini . Beliau berkata , pegawai istana yang dinaikinya '
 'itu berkata , mereka tidak pernah secara peribadi merupakan perkara yang '
 'betul untuk dilakukan .']

9.50 Extractive

This tutorial is available as an IPython notebook at Malaya/example/extractive-summarization.

[1]: %%time
import malaya
from pprint import pprint
CPU times: user 4.74 s, sys: 636 ms, total: 5.37 s
Wall time: 4.35 s

9.50.1 List available skip-thought models

[2]: malaya.summarization.extractive.available_skipthought()
[2]:                   Size (MB)
     lstm                   55.4
     residual-network       99.7

• 'lstm' - LSTM skip-thought deep learning model trained on news dataset.
• 'residual-network' - CNN residual network with Bahdanau Attention skip-thought deep learning model trained on wikipedia dataset.

9.50.2 Algorithm

We use TextRank from networkx as the scoring algorithm; a sketch of the idea follows.
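Conceptually, TextRank scores each sentence (or word) by running PageRank over a similarity graph. A minimal sketch of that idea, to illustrate the algorithm rather than Malaya's internal code; textrank_scores is a name made up here:

import networkx as nx
import numpy as np

def textrank_scores(vectors):
    # Cosine-similarity matrix between sentence vectors.
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = normed @ normed.T
    # Build a weighted graph and score nodes with PageRank
    # (older networkx versions use nx.from_numpy_matrix instead).
    graph = nx.from_numpy_array(similarity)
    return nx.pagerank(graph)

Higher-scoring nodes correspond to sentences that are most similar to many other sentences, which is why they make good extractive summary candidates.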

[3]: isu_kerajaan = [
    'Kenyataan kontroversi Setiausaha Agung Barisan Nasional (BN), Datuk Seri Mohamed Nazri Aziz berhubung sekolah vernakular merupakan pandangan peribadi beliau',
    'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO \n\nkerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara',
    '"Saya ingin menegaskan dua perkara penting',
    'Pertama pendirian beliau tersebut adalah pandangan peribadi yang tidak mewakili pendirian dan pandangan UMNO',
    '"Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan sekolah vernakular di Malaysia',
    'UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita," katanya dalam satu kenyataan akhbar malam ini',
    'Mohamed Nazri semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan Tamil baru-baru ini disalah petik pihak media',
    'Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak orang Melayu dan bumiputera',
    'Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan',
    'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini',
    'Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan',
    '"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya',
    'Beliau turut menegaskan Mohamed Nazri telah mengambil pertanggungjawaban dengan membuat penjelasan maksud sebenarnya ucapanny di Semenyih, Selangor tersebut',
]

[4]: isu_string = '\n\n\n\nDUA legenda hebat dan ‘The living legend’ ini sudah memartabatkan bidang muzik sejak lebih tiga dekad lalu. Jika Datuk Zainal Abidin, 59, dikenali sebagai penyanyi yang memperjuangkan konsep ‘world music’, Datuk Sheila Majid, 55, pula lebih dikenali dengan irama jazz dan R&B.\n\nNamun, ada satu persamaan yang mengeratkan hubungan mereka kerana sama-sama mencintai bidang muzik sejak dulu.\n\nKetika ditemui dalam sesi fotografi yang diatur di Balai Berita, baru-baru ini, Zainal berkata, dia lebih ‘senior’ daripada Sheila kerana bermula dengan kumpulan Headwind sebelum menempa nama sebagai penyanyi solo.\n\n“Saya mula berkawan rapat dengan Sheila ketika sama-sama bernaung di bawah pengurusan Roslan Aziz Productions (RAP) selepas membina karier sebagai artis solo.\n\n“Namun, selepas tidak lagi bernaung di bawah RAP, kami juga membawa haluan karier seni masing-masing selepas itu,” katanya.\n\nJusteru katanya, dia memang menanti peluang berganding dengan Sheila dalam satu konsert.\n\nPenyanyi yang popular dengan lagu Hijau dan Ikhlas Tapi Jauh itu mengakui mereka memang ada keserasian ketika bergandingan kerana membesar pada era muzik yang sama.\n\n“Kami memang meminati bidang muzik dan saling memahami antara satu sama lain. Mungkin kerana kami berdua sudah berada pada tahap di puncak karier muzik masing-masing.\n\n“Saya bersama Sheila serta Datuk Afdlin Shauki akan terbabit dalam satu segmen yang ditetapkan.\n\n“Selain persembahan solo, saya juga berduet dengan Sheila dan Afdlin dalam segmen interaktif ini. Setiap penyanyi akan menyampaikan enam hingga tujuh lagu setiap seorang sepanjang konsert yang berlangsung tiga hari ini,” katanya.\n\nBagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan bersama Zainal cuma tiada publisiti ketika itu.\n\n“Kami pernah terbabit dengan showcase dan majlis korporat sebelum ini. Selain itu, Zainal juga terbabit dengan Konsert Legenda yang membabitkan jelajah empat lokasi sebelum ini.\n\n“Sebab itu, saya sukar menolak untuk bekerjasama dengannya dalam Festival KL Jamm yang dianjurkan buat julung kali dan berkongsi pentas dalam satu konsert bertaraf antarabangsa,” katanya.\n\n\n\nFESTIVAL KL Jamm bakal menggabungkan pelbagai genre muzik seperti rock, hip hop, jazz dan pop dengan lebih 100 persembahan, 20 ‘showcase’ dan pameran.\n\nKonsert berbayar\n\n\n\nMewakili golongan anak seni, Sheila menaruh harapan semoga Festival KL Jamm akan menjadi platform buat artis yang sudah ada nama dan artis muda untuk membuat persembahan, sekali gus sama-sama memartabatkan industri muzik tempatan.\n\nMenurut Sheila, dia juga mencadangkan lebih banyak tempat diwujudkan untuk menggalakkan artis muda membuat persembahan, sekali gus menggilap bakat mereka.\n\n“Berbanding pada zaman saya dulu, artis muda sekarang tidak banyak tempat khusus untuk mereka menyanyi dan menonjolkan bakat di tempat awam.\n\n“Rata-rata hanya sekadar menyanyi di laman Instagram dan cuma dikenali menerusi satu lagu. Justeru, bagaimana mereka mahu buat showcase kalau hanya dikenali dengan satu lagu?” katanya.\n\nPada masa sama, Sheila juga merayu peminat tempatan untuk sama-sama memberi sokongan pada penganjuran festival KL Jamm sekali gus mencapai objektifnya.\n\n“Peminat perlu ubah persepsi negatif mereka dengan menganggap persembahan artis tempatan tidak bagus.\n\n“Kemasukan artis luar juga perlu dilihat dari sudut yang positif kerana kita perlu belajar bagaimana untuk menjadi bagus seperti mereka,” katanya.\n\nSementara itu, Zainal pula berharap festival itu akan mendidik orang ramai untuk menonton konsert berbayar serta memberi sokongan pada artis tempatan.\n\n“Ramai yang hanya meminati artis tempatan tetapi tidak mahu mengeluarkan sedikit wang untuk membeli tiket konsert mereka.\n\n“Sedangkan artis juga menyanyi untuk kerjaya dan ia juga punca pendapatan bagi menyara hidup,” katanya.\n\nFestival KL Jamm bakal menghimpunkan barisan artis tempatan baru dan nama besar dalam konsert iaitu Datuk Ramli Sarip, Datuk Afdlin Shauki, Zamani, Amelina, Radhi OAG, Dr Burn, Santesh, Rabbit Mac, Sheezy, kumpulan Bunkface, Ruffedge, Pot Innuendo, artis dari Kartel (Joe Flizzow, Sona One, Ila Damia, Yung Raja, Faris Jabba dan Abu Bakarxli) dan Malaysia Pasangge (artis India tempatan).\n\nManakala, artis antarabangsa pula membabitkan J Arie (Hong Kong), NCT Dream (Korea Selatan) dan DJ Sura (Korea Selatan).\n\nKL Jamm dianjurkan Music Unlimited International Sdn Bhd dan bakal menggabungkan pelbagai genre muzik seperti rock, hip hop, jazz dan pop dengan lebih 100 persembahan, 20 ‘showcase’, pameran dan perdagangan berkaitan.\n\nFestival tiga hari itu bakal berlangsung di Pusat Pameran dan Perdagangan Antarabangsa Malaysia (MITEC), Kuala Lumpur pada 26 hingga 28 April ini.\n\nMaklumat mengenai pembelian tiket dan keterangan lanjut boleh melayari www.kljamm.com.'

9.50.3 Load SKLearn Interface

Load a decomposition model and a text vectorizer from sklearn,

def sklearn(model, vectorizer):
    """
    sklearn interface for summarization.

    Parameters
    ----------
    model : object
        Should have `fit_transform` method. Commonly:

        * ``sklearn.decomposition.TruncatedSVD`` - LSA algorithm.
        * ``sklearn.decomposition.LatentDirichletAllocation`` - LDA algorithm.

    vectorizer : object
        Should have `fit_transform` method. Commonly:

        * ``sklearn.feature_extraction.text.TfidfVectorizer`` - TFIDF algorithm.
        * ``sklearn.feature_extraction.text.CountVectorizer`` - Bag-of-Word algorithm.
        * ``malaya.text.vectorizer.SkipGramCountVectorizer`` - Skip Gram Bag-of-Word algorithm.
        * ``malaya.text.vectorizer.SkipGramTfidfVectorizer`` - Skip Gram TFIDF algorithm.

    Returns
    -------
    result: malaya.model.extractive_summarization.SKLearn
    """

[5]: from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from malaya.text.vectorizer import SkipGramCountVectorizer, SkipGramTfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

stopwords = malaya.text.function.get_stopwords()

[6]: vectorizer = SkipGramCountVectorizer(
    max_df=0.95,
    min_df=1,
    ngram_range=(1, 3),
    stop_words=stopwords,
    skip=2,
)

[7]: svd = TruncatedSVD(n_components=30)

[8]: model = malaya.summarization.extractive.sklearn(svd, vectorizer)
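The same interface accepts the other combinations listed in the docstring above; for example, a sketch pairing LDA with a plain TF-IDF vectorizer (the hyperparameters here are illustrative, not tuned):

# Same Malaya interface, different decomposition + vectorizer pair.
lda = LatentDirichletAllocation(n_components=30)
tfidf = TfidfVectorizer(max_df=0.95, min_df=1, stop_words=stopwords)
model_lda = malaya.summarization.extractive.sklearn(lda, tfidf)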


Sentence level

This will predict scores for each sentence,

def sentence_level(
    self,
    corpus,
    isi_penting: str = None,
    top_k: int = 3,
    important_words: int = 10,
    **kwargs
):
    """
    Summarize list of strings / string on sentence level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    top_k: int, (default=3)
        number of summarized strings.
    important_words: int, (default=10)
        number of important words.

    Returns
    -------
    dict: {'summary', 'top-words', 'cluster-top-words', 'score'}
    """

[9]: r = model.sentence_level(isu_kerajaan)
r.keys()
[9]: dict_keys(['summary', 'top-words', 'cluster-top-words', 'score'])

[10]: pprint(r['summary'])
('"Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan '
 'sekolah vernakular di Malaysia. kerana parti itu menghormati serta memahami '
 'keperluan sekolah vernakular dalam negara. Timbalan Presiden UMNO, Datuk '
 'Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian '
 'serta pandangan UMNO .')

[11]: r['cluster-top-words']
[11]: ['tugas presiden umno',
 'nazri',
 'vernakular',
 'menghormati',
 'tugas umno',
 'pendirian pandangan',
 'sekolah']

[12]: r['score'][:20]
[12]: [('Kenyataan', 0.07025715941848468),
 ('kontroversi', 0.07025715941848468),
 ('Setiausaha', 0.07025715941848468),
 ('Agung', 0.07025715941848468),
 ('Barisan', 0.07025715941848468),
 ('Nasional', 0.07025715941848468),
 ('(BN),', 0.07025715941848468),
 ('Datuk', 0.07025715941848468),
 ('Seri', 0.07025715941848468),
 ('Mohamed', 0.07025715941848468),
 ('Nazri', 0.07025715941848468),
 ('Aziz', 0.07025715941848468),
 ('berhubung', 0.07025715941848468),
 ('sekolah', 0.07025715941848468),
 ('vernakular', 0.07025715941848468),
 ('merupakan', 0.07025715941848468),
 ('pandangan', 0.07025715941848468),
 ('peribadi', 0.07025715941848468),
 ('beliau.', 0.07025715941848468),
 ('Timbalan', 0.11720420897327886)]

[13]: r = model.sentence_level(isu_kerajaan, isi_penting='Mohamed Nazri')
pprint(r['summary'])
('Beliau turut menegaskan Mohamed Nazri telah mengambil pertanggungjawaban '
 'dengan membuat penjelasan maksud sebenarnya ucapanny di Semenyih, Selangor '
 'tersebut. Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua '
 'pihak perlu menghormati hak orang Melayu dan bumiputera. Mohamed Nazri '
 'semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan '
 'Tamil baru-baru ini disalah petik pihak media.')

Word level

This will predict scores for each word. This interface will not return a summary, just a score for each word.

def word_level(
    self,
    corpus,
    isi_penting: str = None,
    window_size: int = 10,
    important_words: int = 10,
    **kwargs
):
    """
    Summarize list of strings / string on word level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    window_size: int, (default=10)
        window size for each word.
    important_words: int, (default=10)
        number of important words.

    Returns
    -------
    dict: {'top-words', 'cluster-top-words', 'score'}
    """

[14]: r = model.word_level(isu_kerajaan, isi_penting='Mohamed Nazri')
r.keys()
[14]: dict_keys(['top-words', 'cluster-top-words', 'score'])

[15]: r['score'][:20]
[15]: [('Kenyataan', 0.16809629126117476),
 ('kontroversi', 0.2133122956188781),
 ('Setiausaha', 0.2946152484378591),
 ('Agung', 0.3335993239450466),
 ('Barisan', 0.3779873115316621),
 ('Nasional', 0.4254942849807996),
 ('(BN),', 0.4402120384348302),
 ('Datuk', 0.4402120384348302),
 ('Seri', 0.4398985388456278),
 ('Mohamed', 0.42780575539932747),
 ('Nazri', 0.42780575539932747),
 ('Aziz', 0.41245035042151057),
 ('berhubung', 0.3879585108632802),
 ('sekolah', 0.37430313847686075),
 ('vernakular', 0.3727791362231256),
 ('merupakan', 0.3683025784505585),
 ('pandangan', 0.34671436914445564),
 ('peribadi', 0.32911022738895535),
 ('beliau.', 0.32911022738895535),
 ('Timbalan', 0.31232539921078645)]
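Since word level only returns raw scores, you can rank them yourself; a small follow-up sketch in plain Python, nothing Malaya-specific:

# Sort (word, score) pairs by score to inspect the highest-ranked words.
top10 = sorted(r['score'], key=lambda pair: pair[1], reverse=True)[:10]
print(top10)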

9.50.4 Load Doc2Vec interface

Doc2Vec interface using Malaya WordVector.

def doc2vec(wordvector):
    """
    Doc2Vec interface for summarization.

    Parameters
    ----------
    wordvector : object
        malaya.wordvector.WordVector object.
        should have `get_vector_by_name` method.

    Returns
    -------
    result: malaya.model.extractive_summarization.Doc2Vec
    """

[16]: vocab_news, embedded_news = malaya.wordvector.load_news()
w2v = malaya.wordvector.load(embedded_news, vocab_news)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/wordvector.py:120: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/wordvector.py:131: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

[17]: model = malaya.summarization.extractive.doc2vec(w2v)

Sentence level

This will predict scores for each sentence,

def sentence_level(
    self,
    corpus,
    isi_penting: str = None,
    top_k: int = 3,
    aggregation = np.mean,
    soft: bool = False,
    **kwargs
):
    """
    Summarize list of strings / string on sentence level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    top_k: int, (default=3)
        number of summarized strings.
    aggregation: Callable, optional (default=numpy.mean)
        Aggregation method for Doc2Vec.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with the nearest JaroWinkler ratio.
        if False, it will return an embedding full of zeros.

    Returns
    -------
    dict: {'summary', 'score'}
    """

[18]: %%time

r = model.sentence_level(isu_kerajaan)
pprint(r['summary'])
('Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten '
 'dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik '
 'termasuk hak untuk beragama serta mendapat pendidikan. Kata Nazri dalam '
 'kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak '
 'orang Melayu dan bumiputera. Kata beliau, komitmen UMNO dan BN berhubung '
 'perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, '
 'pengiktirafan dan pemberian peruntukan yang diperlukan.')
CPU times: user 9.18 ms, sys: 1.64 ms, total: 10.8 ms
Wall time: 10.3 ms

[19]: r['score'][:20]
[19]: [('Kenyataan', 0.07027066163158477),
 ('kontroversi', 0.07027066163158477),
 ('Setiausaha', 0.07027066163158477),
 ('Agung', 0.07027066163158477),
 ('Barisan', 0.07027066163158477),
 ('Nasional', 0.07027066163158477),
 ('(BN),', 0.07027066163158477),
 ('Datuk', 0.07027066163158477),
 ('Seri', 0.07027066163158477),
 ('Mohamed', 0.07027066163158477),
 ('Nazri', 0.07027066163158477),
 ('Aziz', 0.07027066163158477),
 ('berhubung', 0.07027066163158477),
 ('sekolah', 0.07027066163158477),
 ('vernakular', 0.07027066163158477),
 ('merupakan', 0.07027066163158477),
 ('pandangan', 0.07027066163158477),
 ('peribadi', 0.07027066163158477),
 ('beliau.', 0.07027066163158477),
 ('Timbalan', 0.07037457887922696)]

[20]: r = model.sentence_level(isu_kerajaan, isi_penting='Mohamed Nazri')
pprint(r['summary'])
('Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten '
 'dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik '
 'termasuk hak untuk beragama serta mendapat pendidikan. Kata Nazri dalam '
 'kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak '
 'orang Melayu dan bumiputera. Kata beliau, komitmen UMNO dan BN berhubung '
 'perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, '
 'pengiktirafan dan pemberian peruntukan yang diperlukan.')

Word level

This will predict scores for each word. This interface will not return a summary, just a score for each word.

def word_level(
    self,
    corpus,
    isi_penting: str = None,
    window_size: int = 10,
    aggregation = np.mean,
    soft: bool = False,
    **kwargs
):
    """
    Summarize list of strings / string on word level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    window_size: int, (default=10)
        window size for each word.
    aggregation: Callable, optional (default=numpy.mean)
        Aggregation method for Doc2Vec.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with the nearest JaroWinkler ratio.
        if False, it will return an embedding full of zeros.

    Returns
    -------
    dict: {'score'}
    """

[21]: %%time

r = model.word_level(isu_string, window_size=5)
r['score'][:20]
CPU times: user 71.7 ms, sys: 3.52 ms, total: 75.2 ms
Wall time: 72 ms
[21]: [('DUA', 0.10217850225566492),
 ('legenda', 0.10256503728273535),
 ('hebat', 0.10135587890905765),
 ('dan', 0.10158261783201081),
 ("'The", 0.10185488894991031),
 ('living', 0.10207170990828254),
 ("legend'", 0.10205730536399951),
 ('ini', 0.10202961616004474),
 ('sudah', 0.10210986074917726),
 ('memartabatkan', 0.10203142215244121),
 ('bidang', 0.10433985839774515),
 ('muzik', 0.10437382317184496),
 ('sejak', 0.10437382707131633),
 ('lebih', 0.10444438297719007),
 ('tiga', 0.1044684279620556),
 ('dekad', 0.10444221573136775),
 ('lalu.', 0.10206875552003145),
 ('Jika', 0.10127214013691332),
 ('Datuk', 0.10122756975172667),
 ('Zainal', 0.10127912371730628)]


9.50.5 Load Encoder summarization

We leverage the power of deep encoder models like skip-thought or Transformer to do extractive summarization for us.

def encoder(vectorizer):
    """
    Encoder interface for summarization.

    Parameters
    ----------
    vectorizer : object
        encoder interface object, eg, BERT, skip-thought, XLNET, ALBERT, ALXLNET.
        should have `vectorize` method.

    Returns
    -------
    result: malaya.model.extractive_summarization.Encoder
    """

[22]: lstm = malaya.summarization.extractive.deep_skipthought(model='lstm')
encoder = malaya.summarization.extractive.encoder(lstm)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:root:vectorizer model does not have `attention` method, `top-words` will not work

If we load an encoder model that does not have an ``attention`` method, it will not return ``top-words``.

[23]: alxlnet = malaya.transformer.load(model='alxlnet')
encoder_alxlnet = malaya.summarization.extractive.encoder(alxlnet)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/xlnet.py:70: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/xlnet.py:253: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/xlnet.py:253: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:697: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:memory input None
INFO:tensorflow:Use float type
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:704: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:809: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.

WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:271: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:109: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/__init__.py:95: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/__init__.py:96: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/__init__.py:100: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/__init__.py:103: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/alxlnet-model/base/alxlnet-base/model.ckpt

We can also load a domain-specific transformer, eg, sentiment, so our extractive summary is more sensitive towards sentiment.

[24]: alxlnet = malaya.sentiment.transformer(model='alxlnet', quantized=True)
encoder_sentiment = malaya.summarization.extractive.encoder(alxlnet)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:root:vectorizer model does not have `attention` method, `top-words` will not work

Sentence level

This will predict scores for each sentence,

def sentence_level(
    self,
    corpus,
    isi_penting: str = None,
    top_k: int = 3,
    important_words: int = 10,
    batch_size: int = 16,
    **kwargs
):
    """
    Summarize list of strings / string on sentence level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    top_k: int, (default=3)
        number of summarized strings.
    important_words: int, (default=10)
        number of important words.
    batch_size: int, (default=16)
        for each feed-forward, we only feed N size of texts for each batch.
        This to prevent OOM.

    Returns
    -------
    dict: {'summary', 'top-words', 'cluster-top-words', 'score'}
    """

[25]: %%time

r = encoder.sentence_level(isu_string, isi_penting='antarabangsa')
pprint(r['summary'])
('Bagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan '
 'bersama Zainal cuma tiada publisiti ketika itu. "Sebab itu, saya sukar '
 'menolak untuk bekerjasama dengannya dalam Festival KL Jamm yang dianjurkan '
 'buat julung kali dan berkongsi pentas dalam satu konsert bertaraf '
 'antarabangsa," katanya. Mewakili golongan anak seni, Sheila menaruh harapan '
 'semoga Festival KL Jamm akan menjadi platform buat artis yang sudah ada nama '
 'dan artis muda untuk membuat persembahan, sekali gus sama-sama memartabatkan '
 'industri muzik tempatan.')
CPU times: user 400 ms, sys: 107 ms, total: 507 ms
Wall time: 347 ms

[26]: %%time

r = encoder_alxlnet.sentence_level(isu_string, isi_penting='antarabangsa')
pprint(r['summary'])
('Bagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan '
 'bersama Zainal cuma tiada publisiti ketika itu. "Kami pernah terbabit dengan '
 'showcase dan majlis korporat sebelum ini. Selain itu, Zainal juga terbabit '
 'dengan Konsert Legenda yang membabitkan jelajah empat lokasi sebelum ini.')
CPU times: user 45.9 s, sys: 3.49 s, total: 49.4 s
Wall time: 9.69 s

[27]: %%time

r = encoder_sentiment.sentence_level(isu_string, isi_penting='antarabangsa')
pprint(r['summary'])
('"Rata-rata hanya sekadar menyanyi di laman Instagram dan cuma dikenali '
 'menerusi satu lagu. Namun, ada satu persamaan yang mengeratkan hubungan '
 'mereka kerana sama-sama mencintai bidang muzik sejak dulu. Justeru, '
 'bagaimana mereka mahu buat showcase kalau hanya dikenali dengan satu lagu"?')
CPU times: user 23.5 s, sys: 2.54 s, total: 26 s
Wall time: 6.65 s

Word level

This will predict scores for each word. This interface will not return a summary, just a score for each word.

def word_level(
    self,
    corpus,
    isi_penting: str = None,
    window_size: int = 10,
    important_words: int = 10,
    batch_size: int = 16,
    **kwargs
):
    """
    Summarize list of strings / string on word level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    window_size: int, (default=10)
        window size for each word.
    important_words: int, (default=10)
        number of important words.
    batch_size: int, (default=16)
        for each feed-forward, we only feed N size of texts for each batch.
        This to prevent OOM.

    Returns
    -------
    dict: {'summary', 'top-words', 'cluster-top-words', 'score'}
    """

[29]: %%time

r = encoder.word_level(isu_string)
r['score'][:20]
CPU times: user 2.21 s, sys: 125 ms, total: 2.33 s
Wall time: 540 ms
[29]: [('DUA', 0.6887927),
 ('legenda', 0.66629106),
 ('hebat', 0.68231773),
 ('dan', 0.7088285),
 ("'The", 0.7100761),
 ('living', 0.7477336),
 ("legend'", 0.7506831),
 ('ini', 0.7550354),
 ('sudah', 0.7749422),
 ('memartabatkan', 0.68789077),
 ('bidang', 0.75334096),
 ('muzik', 0.77209735),
 ('sejak', 0.77883065),
 ('lebih', 0.75666404),
 ('tiga', 0.7756305),
 ('dekad', 0.77535397),
 ('lalu.', 0.7996583),
 ('Jika', 0.7532508),
 ('Datuk', 0.73633057),
 ('Zainal', 0.7388078)]

[32]: %%time

r = encoder_alxlnet.word_level(isu_string)
r['score'][:20]
CPU times: user 3min 20s, sys: 21.3 s, total: 3min 41s
Wall time: 39.9 s
[32]: [('DUA', 0.65859354),
 ('legenda', 0.67562085),
 ('hebat', 0.6850947),
 ('dan', 0.6763989),
 ("'The", 0.67380124),
 ('living', 0.68667483),
 ("legend'", 0.65132356),
 ('ini', 0.92398083),
 ('sudah', 0.90694207),
 ('memartabatkan', 0.9024261),
 ('bidang', 0.8732655),
 ('muzik', 0.56828856),
 ('sejak', 0.24973005),
 ('lebih', 0.27954298),
 ('tiga', 0.27266973),
 ('dekad', 0.83495057),
 ('lalu.', 0.9397317),
 ('Jika', 0.9280187),
 ('Datuk', 0.92466754),
 ('Zainal', 0.93014336)]

[33]: %%time

r = encoder_sentiment.word_level(isu_string)
r['score'][:20]
CPU times: user 3min 33s, sys: 23.9 s, total: 3min 57s
Wall time: 42.6 s
[33]: [('DUA', 0.7614572),
 ('legenda', 0.7729454),
 ('hebat', 0.7851631),
 ('dan', 0.838418),
 ("'The", 0.88925576),
 ('living', 0.5610372),
 ("legend'", 0.89473295),
 ('ini', 0.8995718),
 ('sudah', 0.87563884),
 ('memartabatkan', 0.8340343),
 ('bidang', 0.87825197),
 ('muzik', 0.867957),
 ('sejak', 0.8548101),
 ('lebih', 0.740687),
 ('tiga', 0.8771104),
 ('dekad', 0.89566004),
 ('lalu.', 0.7821031),
 ('Jika', 0.9377091),
 ('Datuk', 0.9417584),
 ('Zainal', 0.8979969)]


9.51 MS to EN

This tutorial is available as an IPython notebook at Malaya/example/ms-en-translation.

This module is only trained on standard language structure, so it is not safe to use it on local (colloquial) language structure.

[1]: %%time

import malaya
CPU times: user 5.11 s, sys: 1.06 s, total: 6.17 s
Wall time: 7.45 s

9.51.1 List available Transformer models

[2]: malaya.translation.ms_en.available_transformer()
INFO:root:tested on 100k MS-EN sentences.
[2]:               Size (MB)  Quantized Size (MB)   BLEU  Suggested length
     small              42.7                 13.4  0.626             256.0
     base              234.0                 82.7  0.792             256.0
     large             815.0                244.0  0.714             256.0
     bigbird           246.0                 63.7  0.678            1024.0
     small-bigbird      50.4                 13.1  0.586            1024.0

We tested on 100k MS-EN sentences.

9.51.2 Load Transformer models

def transformer(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load Transformer encoder-decoder model to translate MS-to-EN.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.
        * ``'large'`` - Transformer LARGE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: malaya.model.tf.Translation class
    """


[1]: transformer = malaya.translation.ms_en.transformer()
transformer_small = malaya.translation.ms_en.transformer(model='small')
transformer_large = malaya.translation.ms_en.transformer(model='large')

9.51.3 Load Quantized model

To load the 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[4]: quantized_transformer = malaya.translation.ms_en.transformer(quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

9.51.4 Translate

Using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    translate list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

Using beam decoder

def beam_decoder(self, strings: List[str]):
    """
    translate list of strings using beam decoder, beam width 3, alpha 0.5.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
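beam_decoder shares the same call signature as greedy_decoder; a minimal sketch (the example sentence is our own, output omitted since beam search is slower than greedy decoding):

# hedged sketch: beam search decoding with the same interface as greedy_decoder
translated = transformer.beam_decoder(['Perdana Menteri menjelaskan perkara itu hari ini.'])
print(translated[0])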

For better results, always split by end of sentences.
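For instance, a minimal sketch of a naive sentence splitter feeding greedy_decoder (the splitting regex is our own assumption, not part of the library):

import re

# hedged sketch: split on sentence-ending punctuation so every element
# stays well under the model's suggested length of 256
def split_sentences(long_text):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', long_text) if s.strip()]

sentences = split_sentences('Ayat pertama. Ayat kedua! Ayat ketiga?')
# then feed the pieces to the model, e.g. transformer.greedy_decoder(sentences)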

[5]: from pprint import pprint


[6]: # https://www.sinarharian.com.my/article/89678/BERITA/Politik/Saya-tidak-mahu-sentuh-isu-politik-Muhyiddin

string_news1 = 'TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.'
pprint(string_news1)
('TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh '
 'mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal '
 'kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas '
 'berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika '
 'berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan '
 'Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.')

[7]: # https://www.sinarharian.com.my/article/90021/BERITA/Politik/Tun-Mahathir-Anwar-disaran-bersara-untuk-selesai-kemelut-politik

string_news2 = 'ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.'
pprint(string_news2)
('ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila '
 'masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. '
 'Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya '
 'mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun '
 'Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri '
 'Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.')

[8]: string_news3 = 'Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan lanjutan tempoh.'
pprint(string_news3)
('Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, '
 'kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi '
 'mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat '
 'asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) '
 'pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan '
 'lanjutan tempoh.')

[9]: # https://qcikgubm.blogspot.com/2018/02/contoh-soalan-dan-jawapan-karangan.html

string_karangan = 'Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. Setelah menyedari hakikat ini, para pelajar akan lebih berminat untuk menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat membantu memberikan pengetahuan am tentang kerjaya ini'
pprint(string_karangan)
('Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang '
 'akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di '
 'Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang '
 'masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar '
 'berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi '
 'masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar '
 'disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki '
 'sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. '
 'Setelah menyedari hakikat ini, para pelajar akan lebih berminat untuk '
 'menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat '
 'membantu memberikan pengetahuan am tentang kerjaya ini')

[10]: # https://www.parlimen.gov.my/bills-dewan-rakyat.html?uweb=dr#, RUU Kumpulan Wang Simpanan Pekerja (Pindaan) 2019

string_parlimen = 'Subfasal 6(b) bertujuan untuk memasukkan subseksyen baharu 39(3) dan (4) ke dalam Akta 452. Subseksyen (3) yang dicadangkan bertujuan untuk menjadikan suatu kesalahan bagi mana-mana orang yang meninggalkan Malaysia tanpa membayar caruman yang tertunggak dan kena dibayar atau mengemukakan jaminan bagi pembayarannya. Subseksyen (4) yang dicadangkan memperuntukkan bahawa bagi maksud seksyen 39 Akta 452, “caruman” termasuklah apa-apa dividen atau caj lewat bayar yang kena dibayar ke atas mana-mana caruman.'
pprint(string_parlimen)
('Subfasal 6(b) bertujuan untuk memasukkan subseksyen baharu 39(3) dan (4) ke '
 'dalam Akta 452. Subseksyen (3) yang dicadangkan bertujuan untuk menjadikan '
 'suatu kesalahan bagi mana-mana orang yang meninggalkan Malaysia tanpa '
 'membayar caruman yang tertunggak dan kena dibayar atau mengemukakan jaminan '
 'bagi pembayarannya. Subseksyen (4) yang dicadangkan memperuntukkan bahawa '
 'bagi maksud seksyen 39 Akta 452, “caruman” termasuklah apa-apa dividen atau '
 'caj lewat bayar yang kena dibayar ke atas mana-mana caruman.')

[11]: string_random1 = 'saya menikmati filem mengenai makhluk asing yang menyerang bumi. <> Saya fikir fiksyen sains adalah genre yang luar biasa untuk apa sahaja. Sains masa depan, teknologi, perjalanan masa, perjalanan FTL, semuanya adalah konsep yang menarik. <> Saya sendiri peminat fiksyen sains!'
pprint(string_random1)
('saya menikmati filem mengenai makhluk asing yang menyerang bumi. <> Saya '
 'fikir fiksyen sains adalah genre yang luar biasa untuk apa sahaja. Sains '
 'masa depan, teknologi, perjalanan masa, perjalanan FTL, semuanya adalah '
 'konsep yang menarik. <> Saya sendiri peminat fiksyen sains!')

[12]: string_random2 = 'Fiksyen sains <> saya menikmati filem mengenai makhluk asing yang menyerang bumi. <> Fiksyen sains (sering dipendekkan menjadi SF atau sci-fi) adalah genre fiksyen spekulatif, biasanya berurusan dengan konsep khayalan seperti sains dan teknologi futuristik, perjalanan angkasa, perjalanan waktu, lebih cepat daripada perjalanan ringan, alam semesta selari, dan kehidupan di luar bumi .'
pprint(string_random2)
('Fiksyen sains <> saya menikmati filem mengenai makhluk asing yang menyerang '
 'bumi. <> Fiksyen sains (sering dipendekkan menjadi SF atau sci-fi) adalah '
 'genre fiksyen spekulatif, biasanya berurusan dengan konsep khayalan seperti '
 'sains dan teknologi futuristik, perjalanan angkasa, perjalanan waktu, lebih '
 'cepat daripada perjalanan ringan, alam semesta selari, dan kehidupan di luar '
 'bumi .')


Comparing with Google Translate

These screenshots were taken on 4th July 2020; Google constantly updates its models, so Google Translate may improve in the future.

string_news1

[13]: from IPython.core.display import Image, display

display(Image('string1.png', width=450))

Tan Sri Muhyiddin Yassin said he did not want to touch on current political issues, instead focusing on the welfare of the people and revitalizing the country’s economy following the Covid-19 pandemic. The prime minister explained this when speaking at a Leaders’ Meetings with leaders of the Gambir State Legislative Assembly (assembly) at the Bukit Gambir Multipurpose Hall today.

string_karangan

[14]: display(Image('string2.png', width=450))


Additionally, career fairs help students determine which careers they will pursue. As we know, the job market in Malaysia is very broad and many of the jobs in the country are still vacant because it is difficult to find a truly qualified workforce. For example, the medical sector in Malaysia is facing a significant shortage of labor force, in particular by specialists due to the resignation of doctors and medical professionals to enter the private sector as well as expanding health and medical services. Upon realizing this fact, students will be more interested in the field of medicine as the career exhibitions help provide a wealth of knowledge about this profession.

string_parlimen

[15]: display(Image('string3.png', width=450))


Subsection 6 (b) seeks to introduce new subsections 39 (3) and (4) into Act 452. Subsection (3) is intended to make it an offense for any person to leave Malaysia without paying any outstanding and payable contribution or submit a guarantee for payment. The proposed subsection (4) provides that for the purposes of section 39 of Act 452, “contribution” includes any dividend or late payment chargeable on any contribution.

Translate transformer base

[13]: %%time

pprint(transformer.greedy_decoder([string_news1, string_news2, string_news3]))
['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues at the moment, instead focusing on the welfare of the '
 "people and efforts to revitalize the affected country's economy following "
 'the Covid-19 pandemic. The prime minister explained the matter when speaking '
 'at a Leadership Meeting with Gambir State Assembly (DUN) leaders at the '
 'Bukit Gambir Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political turmoil has not ended when it '
 "has failed to finalize the Prime Minister's candidate agreed upon. Sik MP "
 'Ahmad Tarmizi Sulaiman said he had suggested former United Nations (UN) '
 "Indigenous Party chairman Tun Dr Mahathir Mohamad and People's Justice Party "
 '(PKR) president Datuk Seri Anwar Ibrahim resign from politics as a solution.',
 'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'relaxation was given as the government was aware of the problems they had to '
 'renew the document. He added that for foreigners who had passed the social '
 'visit during the Movement Control Order (CPP) they could go to the nearest '
 'Immigration Department office for further extension.']
CPU times: user 23.9 s, sys: 14.5 s, total: 38.4 s
Wall time: 9.95 s

[14]: %%time

pprint(quantized_transformer.greedy_decoder([string_news1, string_news2, string_news3]))
['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues at the moment, instead focusing on the welfare of the '
 "people and efforts to revitalize the affected country's economy following "
 'the Covid-19 pandemic. The prime minister explained the matter when speaking '
 'at a Leadership Meeting with Gambir State Assembly (DUN) leaders at the '
 'Bukit Gambir Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political turmoil has not ended when it '
 "has failed to finalize the Prime Minister's candidate agreed upon. Sik MP "
 'Ahmad Tarmizi Sulaiman said he had suggested former United Nations (UN) '
 "Indigenous Party chairman Tun Dr Mahathir Mohamad and People's Justice Party "
 '(PKR) president Datuk Seri Anwar Ibrahim resign from politics as a solution.',
 'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'relaxation was given as the government was aware of the problems they had to '
 'renew the document. He added that for foreigners who had passed the social '
 'visit during the Movement Control Order (CPP) they could go to the nearest '
 'Immigration Department office for an extension.']
CPU times: user 23.5 s, sys: 13.9 s, total: 37.5 s
Wall time: 9.58 s

[15]: %%time

pprint(transformer.greedy_decoder([string_karangan, string_parlimen]))
['In addition, career exhibitions help students determine their careers. As we '
 'know, the career market in Malaysia is very broad and there are still many '
 'job sectors in the country that are still vacant because it is difficult to '
 'find a truly qualified workforce. For example, the medical sector in '
 'Malaysia is facing a critical shortage of labor, especially specialists due '
 'to the resignation of doctors and physicians to enter the private sector and '
 'develop health and medical services. Upon realizing this fact, students will '
 'be more interested in medicine because the exhibition careers are very '
 'helpful in providing general knowledge of this career.',
 'Subclause 6 (b) seeks to introduce new subsections 39 (3) and (4) into Act '
 '452. Subsection (3) proposed to make an offense for any person leaving '
 'Malaysia without paying a deferred and payable contribution or filing a '
 'guarantee for payment. Subsection (4) proposed provides that for the purpose '
 'of section 39 of Act 452, the "contribution" includes any dividend or late '
 'payment charge payable on any contribution.']
CPU times: user 30.8 s, sys: 17 s, total: 47.7 s
Wall time: 10.6 s

[16]: %%time

pprint(quantized_transformer.greedy_decoder([string_karangan, string_parlimen]))
['In addition, career exhibitions help students determine their careers. As we '
 'know, the career market in Malaysia is very broad and there are still many '
 'job sectors in the country that are still vacant because it is difficult to '
 'find a truly qualified workforce. For example, the medical sector in '
 'Malaysia is facing a critical shortage of labor, especially specialists due '
 'to the resignation of doctors and physicians to enter the private sector and '
 'develop health and medical services. Upon realizing this fact, students will '
 'be more interested in the medical field as the career exhibitions are very '
 'helpful to provide general knowledge of this career.',
 'Subclause 6 (b) seeks to introduce new subsections 39 (3) and (4) into Act '
 '452. Subsection (3) proposed to make an offense for any person leaving '
 'Malaysia without paying a deferred and payable contribution or to submit a '
 'guarantee for his payment. Subsection (4) proposed provides that for the '
 'purpose of section 39 of Act 452, the "contribution" includes any dividend '
 'or late payment charge payable on any contribution.']
CPU times: user 30.9 s, sys: 16.8 s, total: 47.7 s
Wall time: 10.8 s

[17]: %%time

result = transformer.greedy_decoder([string_random1, string_random2])
pprint(result)
['I enjoy movies about aliens attacking the earth. <> I think science fiction '
 'is an incredible genre for anything. Future science, technology, time '
 "travel, FTL travel, everything is an exciting concept. <> I'm a science "
 'fiction fan!',
 'Science fiction <> I enjoy movies about aliens invading the earth. <> '
 'Science fiction (often shortened to SF or sci-fi) is a genre of speculative '
 'fiction, usually dealing with imaginary concepts such as science and '
 'futuristic technology, space travel, time travel, faster than light travel, '
 'parallel universe, and life abroad.']
CPU times: user 19.1 s, sys: 10.3 s, total: 29.5 s
Wall time: 6.77 s

[18]: %%time

result = quantized_transformer.greedy_decoder([string_random1, string_random2])
pprint(result)
['I enjoy movies about aliens attacking the earth. <> I think science fiction '
 'is an incredible genre for anything. Future science, technology, time '
 "travel, FTL travel, everything is an exciting concept. <> I'm a science "
 'fiction fan!',
 'Science fiction <> I enjoy movies about aliens invading the earth. <> '
 'Science fiction (often shortened to SF or sci-fi) is a genre of speculative '
 'fiction, usually dealing with imaginary concepts such as science and '
 'futuristic technology, space travel, time travel, faster than light travel, '
 'parallel universe, and life abroad.']
CPU times: user 19.2 s, sys: 11.7 s, total: 30.9 s
Wall time: 6.66 s

Translate transformer small

[19]: %%time

pprint(transformer_small.greedy_decoder([string_news1, string_news2, string_news3]))
['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues at this time, instead focusing on the welfare of the people '
 "and efforts to revitalize the country's economy affected following the "
 'Covid-19 pandemic. The Prime Minister explained the matter when speaking at '
 'the Leaders Meeting with the leaders of the Gambir State Assembly (DUN) '
 'community at the Bukit Gambir Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political turmoil has not been expected '
 "when it still fails to finalize the Prime Minister's candidate agreed "
 'together. Sik MP Ahmad Tarmizi Sulaiman said the party had suggested former '
 'United Nations Indigenous Party (UN) chairman Tun Dr Mahathir Mohamad and '
 "President of the People's Justice Party (PKR), Datuk Seri Anwar Ibrahim "
 'resigned from politics as a solution.',
 'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'relaxation was given as the government was aware of the problems they faced '
 'to renew the document. He said in addition, for foreigners who had passed '
 'the social visit expired during the Movement Control Order (PKP) could go to '
 'the nearest Immigration Department office for further time.']
CPU times: user 3.46 s, sys: 871 ms, total: 4.33 s
Wall time: 1.5 s

[20]: %%time

pprint(transformer_small.greedy_decoder([string_karangan, string_parlimen]))
['In addition, career exhibitions help students determine their careers. As we '
 'know, the career market in Malaysia is very broad and many employment '
 'sectors in the country are still vacant because it is difficult to find a '
 'truly qualified workforce. For example, the medical sector in Malaysia is '
 'facing critical labor shortages, especially specialists as specialists '
 'resign as well as medical professionals to enter the private sector and the '
 'development of health and medical services. After realizing this fact that '
 'students will be more interested in medicine as the exhibition of careers is '
 'helping to provide general knowledge of this career in this career as this '
 'career is especially in providing general knowledge of this career as this '
 'career as this career is difficult to gain knowledge of this career as it is '
 'difficult to gain knowledge of this career is difficult to find the career '
 'as it is difficult to find the career as it is difficult to enter this '
 'career as it is difficult to enter this career as it is difficult to gain '
 'through this career is difficult to enter the private sector and medical '
 'career as it is difficult to find the field of career as it is difficult to '
 'gain knowledge of this career as it is difficult to find the field of career '
 'as it is difficult to find the field of career as it is especially in the '
 'field of career as it is difficult to find the field of this career as it is '
 'difficult to find the field of career as it is difficult to find that it is '
 'difficult to find that it is difficult to',
 'Subclaal 6 (b) aims to include a new subsection of 39 (3) and (4) into Act '
 '452. Subsection (3) proposed aimed at making a mistake for any person who '
 'leaves Malaysia without paying for outstanding contributions and paid or to '
 'provide guarantees for his payment. Subsection (4) proposed provides that '
 'for the purpose of section 39 of Act 452, "contributions" including any '
 'dividend or late payment charge to be paid to any contribution.']
CPU times: user 9.94 s, sys: 1.8 s, total: 11.7 s
Wall time: 3.14 s

[21]: %%time

result = transformer_small.greedy_decoder([string_random1, string_random2])
pprint(result)
['I enjoy movies about aliens attacking the earth. <> I think science fiction '
 'is a great genre for whatever future science, technology, travel, FTL '
 'travel, all of which is an interesting concept. <> I personally love science '
 'fiction!',
 'science fiction <> I enjoy movies about aliens who attack the earth. <> The '
 'science fiction (often shortened to SF or sci-fi) is a speculative fiction '
 'genre, usually dealing with the concept of imaginary science and futuristic '
 'technology, space travel, travel, faster than light travel, parallel '
 'universe, and outer life.']
CPU times: user 2.64 s, sys: 402 ms, total: 3.04 s
Wall time: 773 ms

Translate transformer large

[ ]: %%time

pprint(transformer_large.greedy_decoder([string_news1, string_news2, string_news3]))

[ ]: %%time

pprint(transformer_large.greedy_decoder([string_karangan, string_parlimen]))

[ ]: %%time

result = transformer_large.greedy_decoder([string_random1, string_random2])
pprint(result)

9.52 EN to MS

This tutorial is available as an IPython notebook at Malaya/example/en-ms-translation.

This module is only trained on standard language structure, so it is not safe to use on local (colloquial) language structure.

[1]: %%time

import malaya


CPU times: user 5.24 s, sys: 1.15 s, total: 6.4 s
Wall time: 10 s

9.52.1 List available Transformer models

[2]: malaya.translation.en_ms.available_transformer()
INFO:root:tested on 77k EN-MS sentences.
[2]:                Size (MB)  Quantized Size (MB)   BLEU  Suggested length
     small               42.7                 13.4  0.512             256.0
     base               234.0                 82.7  0.696             256.0
     large              817.0                244.0  0.699             256.0
     bigbird            246.0                 63.7  0.551            1024.0
     small-bigbird       50.4                 13.1  0.522            1024.0

We tested on 77k EN-MS sentences.

9.52.2 Load Transformer models

def transformer(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load Transformer encoder-decoder model to translate EN-to-MS.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.
        * ``'large'`` - Transformer LARGE parameters.
        * ``'bigbird'`` - BigBird BASE parameters.
        * ``'small-bigbird'`` - BigBird SMALL parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        if `bigbird` in model, return malaya.model.bigbird.Translation
        else, return malaya.model.tf.Translation
    """

[3]: transformer = malaya.translation.en_ms.transformer()
transformer_small = malaya.translation.en_ms.transformer(model='small')
transformer_large = malaya.translation.en_ms.transformer(model='large')


9.52.3 Load Quantized model

To load the 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[4]: quantized_transformer = malaya.translation.en_ms.transformer(quantized=True)

9.52.4 Translate

Using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    translate list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

Using beam decoder

def beam_decoder(self, strings: List[str]):
    """
    translate list of strings using beam decoder, beam width 3, alpha 0.5.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
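As on the MS-EN side, beam decoding keeps the same call shape as greedy decoding; a minimal sketch with the short string defined below (output omitted):

# hedged sketch: beam decoder for EN-to-MS, same interface as greedy_decoder
print(transformer.beam_decoder(['i am in medical school.'])[0])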

[5]: from pprint import pprint

[6]: # https://www.malaymail.com/news/malaysia/2020/07/01/dr-mahathir-again-claims-anwar-lacks-popularity-with-malays-to-be-pakatans/1880420

string_news1 = 'KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the prime minister candidate as he is allegedly not "popular" among the Malays, Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said the PKR president needs someone like himself in order to acquire support from the Malays and win the election.'
pprint(string_news1)
('KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the '
 'prime minister candidate as he is allegedly not "popular" among the Malays, '
 'Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said '
 'the PKR president needs someone like himself in order to acquire support '
 'from the Malays and win the election.')

[7]: # https://edition.cnn.com/2020/07/06/politics/new-york-attorney-general-blm/index.html

string_news2 = '(CNN)New York Attorney General Letitia James on Monday ordered the Black Lives Matter Foundation -- which she said is not affiliated with the larger Black Lives Matter movement -- to stop collecting donations in New York. "I ordered the Black Lives Matter Foundation to stop illegally accepting donations that were intended for the #BlackLivesMatter movement. This foundation is not affiliated with the movement, yet it accepted countless donations and deceived goodwill," James tweeted.'
pprint(string_news2)
('(CNN)New York Attorney General Letitia James on Monday ordered the Black '
 'Lives Matter Foundation -- which she said is not affiliated with the larger '
 'Black Lives Matter movement -- to stop collecting donations in New York. "I '
 'ordered the Black Lives Matter Foundation to stop illegally accepting '
 'donations that were intended for the #BlackLivesMatter movement. This '
 'foundation is not affiliated with the movement, yet it accepted countless '
 'donations and deceived goodwill," James tweeted.')

[8]: # https://www.thestar.com.my/business/business-news/2020/07/04/malaysia-worries-new-eu-food-rules-could-hurt-palm-oil-exports

string_news3 = 'Amongst the wide-ranging initiatives proposed are a sustainable food labelling framework, a reformulation of processed foods, and a sustainability chapter in all EU bilateral trade agreements. The EU also plans to publish a proposal for a legislative framework for sustainable food systems by 2023 to ensure all foods on the EU market become increasingly sustainable.'
pprint(string_news3)
('Amongst the wide-ranging initiatives proposed are a sustainable food '
 'labelling framework, a reformulation of processed foods, and a '
 'sustainability chapter in all EU bilateral trade agreements. The EU also '
 'plans to publish a proposal for a legislative framework for sustainable food '
 'systems by 2023 to ensure all foods on the EU market become increasingly '
 'sustainable.')

[9]: # https://jamesclear.com/articles

string_article1 = 'This page shares my best articles to read on topics like health, happiness, creativity, productivity and more. The central question that drives my work is, “How can we live better?” To answer that question, I like to write about science-based ways to solve practical problems.'
pprint(string_article1)
('This page shares my best articles to read on topics like health, happiness, '
 'creativity, productivity and more. The central question that drives my work '
 'is, “How can we live better?” To answer that question, I like to write about '
 'science-based ways to solve practical problems.')

[10]: # https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536

string_article2 = 'Fuzzy matching at scale. From 3.7 hours to 0.2 seconds. How to perform intelligent string matching in a way that can scale to even the biggest data sets. Data in the real world is messy. Dealing with messy data sets is painful and burns through time which could be spent analysing the data itself.'
pprint(string_article2)
('Fuzzy matching at scale. From 3.7 hours to 0.2 seconds. How to perform '
 'intelligent string matching in a way that can scale to even the biggest data '
 'sets. Data in the real world is messy. Dealing with messy data sets is '
 'painful and burns through time which could be spent analysing the data '
 'itself.')

[11]: random_string1 = 'i am in medical school.'
random_string2 = 'Emmerdale is the debut studio album,songs were not released in the U.S <> These songs were not released in the U.S. edition of said album and were previously unavailable on any U.S. release.'
pprint(random_string2)
('Emmerdale is the debut studio album,songs were not released in the U.S <> '
 'These songs were not released in the U.S. edition of said album and were '
 'previously unavailable on any U.S. release.')

Comparing with Google Translate

These screenshots were taken on 7th July 2020; the results may improve in the future.

string_news1

[11]: from IPython.core.display import Image, display

display(Image('en-string1.png', width=450))


KUALA LUMPUR, 1 Julai - Anwar Ibrahim tidak sesuai menjadi calon perdana menteri kerana dia dikatakan tidak “popular” di kalangan orang Melayu, kata Tun Dr Mahathir Mohamad. Bekas perdana menteri itu dilaporkan mengatakan bahawa presiden PKR memerlukan seseorang seperti dirinya untuk mendapatkan sokongan orang Melayu dan memenangi pilihan raya.

string_news2

[12]: display(Image('en-string2.png', width=450))


(CNN) Peguam Negara New York, Letitia James pada hari Isnin memerintahkan Yayasan Black Lives Matter - yang menurutnya tidak berafiliasi dengan gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan derma di New York. “Saya memerintahkan Black Lives Matter Foundation untuk berhenti secara haram menerima sumbangan yang ditujukan untuk gerakan #BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun ia menerima banyak sumbangan dan menipu muhibah,” tweet James.

string_news3

[13]: display(Image('en-string3.png', width=450))


Di antara inisiatif luas yang dicadangkan adalah kerangka pelabelan makanan yang berkelanjutan, penyusunan semula makanan yang diproses, dan bab keberlanjutan dalam semua perjanjian perdagangan dua hala EU. EU juga berencana untuk menerbitkan proposal untuk kerangka perundangan untuk sistem makanan lestari pada tahun 2023 untuk memastikan semua makanan di pasar EU menjadi semakin

random_string2

[14]: display(Image('en-string4.png', width=450))


Emmerdale adalah album studio sulung, lagu-lagu tidak dirilis di A.S.

Translate transformer base

[12]: %%time

pprint(transformer.greedy_decoder([string_news1, string_news2, string_news3]))
['KUALA LUMPUR 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai menjadi calon '
 'Perdana Menteri kerana beliau didakwa tidak "popular" dalam kalangan orang '
 'Melayu, Tun Dr Mahathir Mohamad mendakwa, bekas Perdana Menteri itu '
 'dilaporkan berkata Presiden PKR itu memerlukan seseorang seperti dirinya '
 'bagi mendapatkan sokongan daripada orang Melayu dan memenangi pilihan raya.',
 '(CNN) Peguam Negara New York Letitia James pada hari Isnin memerintahkan '
 'Black Lives Matter Foundation - yang menurutnya tidak berafiliasi dengan '
 'gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan '
 'sumbangan di New York. "Saya memerintahkan Black Lives Matter Foundation '
 'untuk berhenti menerima sumbangan secara haram yang bertujuan untuk gerakan '
 '#BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun '
 'ia menerima banyak sumbangan dan muhibah yang ditipu," tweet James.',
 'Di antara inisiatif luas yang diusulkan adalah kerangka pelabelan makanan '
 'yang berkelanjutan, reformulasi makanan yang diproses, dan bab keberlanjutan '
 'dalam semua perjanjian perdagangan dua hala EU. EU juga berencana untuk '
 'menerbitkan proposal untuk kerangka perundangan untuk sistem makanan lestari '
 'pada tahun 2023 untuk memastikan semua makanan di pasar EU menjadi semakin '
 'lestari.']
CPU times: user 24.3 s, sys: 14 s, total: 38.3 s
Wall time: 11.6 s

[13]: %%time

pprint(transformer.greedy_decoder([string_article1, string_article2]))
['Halaman ini berkongsi artikel terbaik saya untuk dibaca mengenai topik '
 'seperti kesihatan, kebahagiaan, kreativiti, produktiviti dan banyak lagi. '
 'Soalan utama yang mendorong kerja saya adalah, "Bagaimana kita dapat hidup '
 'lebih baik?" Untuk menjawab soalan itu, saya suka menulis mengenai kaedah '
 'berasaskan sains untuk menyelesaikan masalah praktikal.',
 'Pemadanan kabur pada skala. Dari 3.7 jam hingga 0.2 saat. Cara melakukan '
 'pemadanan rentetan pintar dengan cara yang dapat meningkatkan bahkan set '
 'data terbesar. Data di dunia nyata tidak kemas. Berurusan dengan set data '
 'yang tidak kemas menyakitkan dan terbakar sepanjang masa yang dapat '
 'dihabiskan untuk menganalisis data itu sendiri.']
CPU times: user 15.9 s, sys: 9.21 s, total: 25.2 s
Wall time: 6.32 s

[14]: %%time

pprint(transformer.greedy_decoder([random_string1, random_string2]))
['saya di sekolah perubatan.',
 'Emmerdale adalah album studio debut, lagu-lagu tidak dikeluarkan di A.S <> '
 'Lagu-lagu ini tidak dikeluarkan dalam edisi A.S. album tersebut dan '
 'sebelumnya tidak tersedia pada sebarang pelepasan A.S.']
CPU times: user 9.98 s, sys: 5.52 s, total: 15.5 s
Wall time: 4.23 s

Translate transformer small

[15]: %%time

pprint(transformer_small.greedy_decoder([string_news1, string_news2, string_news3]))
['KUALA LUMPUR 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai kerana calon '
 'perdana menteri kerana didakwa tidak "popular" dalam kalangan orang Melayu, '
 'Tun Dr Mahathir Mohamad mendakwa. Bekas perdana menteri itu dilaporkan '
 'berkata, presiden PKR itu memerlukan seseorang seperti dirinya sendiri untuk '
 'memperoleh sokongan daripada orang Melayu dan memenangi pilihan raya.hari '
 'ini, Datuk Seri Anwar Ibrahim tidak sesuai untuk menjadi calon',
 '(CNN) Peguam Negara New York Letitia James pada hari Isnin memerintahkan '
 'Yayasan Black Lives Matter - yang menurutnya tidak berafiliasi dengan '
 'gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan '
 'sumbangan di New York. "Saya memerintahkan Yayasan Black Lives Matter untuk '
 'berhenti menerima sumbangan secara haram yang bertujuan untuk gerakan '
 '#BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun '
 'ia menerima banyak sumbangan dan muhibah yang menipu," tweet James.',
 'Amongst inisiatif luas yang dicadangkan adalah kerangka kerja kerja kerja '
 'makanan yang berkelanjutan, penyusunan semula makanan yang diproses, dan bab '
 'kelestarian dalam semua perjanjian perdagangan dua hala EU. EU juga '
 'merancang untuk menerbitkan cadangan kerangka perundangan untuk sistem '
 'makanan lestari pada tahun 2023 untuk memastikan semua makanan di pasaran EU '
 'semakin lestari.']
CPU times: user 3.69 s, sys: 773 ms, total: 4.46 s
Wall time: 1.61 s

[16]: %%time

pprint(transformer_small.greedy_decoder([string_article1, string_article2]))
['Halaman ini berkongsi artikel terbaik saya untuk membaca topik seperti '
 'kesihatan, kebahagiaan, kreativiti, produktiviti dan banyak lagi. Soalan '
 'pusat yang mendorong karya saya adalah, "Bagaimana kita dapat hidup lebih '
 'baik?" Untuk menjawab soalan itu, saya suka menulis mengenai cara berasaskan '
 'sains untuk menyelesaikan masalah praktikal.',
 'Pemadanan Fuzzy pada skala. Dari 3.7 jam hingga 0.2 saat. Cara melakukan '
 'pemadanan rentetan pintar dengan cara yang dapat meningkatkan set data '
 'terbesar bahkan. Data di dunia nyata tidak kemas. Berurusan dengan set data '
 'yang tidak kemas menyakitkan dan terbakar melalui masa yang dapat dihabiskan '
 'untuk menganalisis data itu sendiri.']
CPU times: user 2.45 s, sys: 384 ms, total: 2.84 s
Wall time: 738 ms

[17]: %%time

pprint(transformer_small.greedy_decoder([random_string1, random_string2]))
['saya berada di sekolah perubatan.',
 'Emmerdale adalah album studio sulung, lagu-lagu tidak dikeluarkan di A.S <> '
 'Lagu-lagu ini tidak dikeluarkan di edisi A.S. yang dikatakan album dan '
 'sebelumnya tidak tersedia di mana-mana pelepasan A.S.']
CPU times: user 1.7 s, sys: 291 ms, total: 1.99 s
Wall time: 535 ms

Translate transformer large

[18]: %%time

pprint(transformer_large.greedy_decoder([string_news1, string_news2, string_news3]))
['KUALA LUMPUR 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai menjadi calon '
 'Perdana Menteri kerana beliau didakwa tidak "popular" dalam kalangan orang '
 'Melayu, kata Tun Dr Mahathir Mohamad. Bekas Perdana Menteri itu dilaporkan '
 'berkata, Presiden PKR memerlukan seseorang seperti dirinya bagi mendapatkan '
 'sokongan daripada orang Melayu dan memenangi pilihan raya.',
 '(CNN) Peguam Negara New York Letitia James pada hari Isnin memerintahkan '
 'Black Lives Matter Foundation - yang menurutnya tidak berafiliasi dengan '
 'gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan '
 'sumbangan di New York. "Saya memerintahkan Black Lives Matter Foundation '
 'untuk berhenti menerima sumbangan secara haram yang bertujuan untuk gerakan '
 '#BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun '
 'ia menerima banyak sumbangan dan muhibah yang ditipu," tweet James.',
 'Di antara inisiatif luas yang diusulkan adalah kerangka pelabelan makanan '
 'berkelanjutan, penyusunan semula makanan yang diproses, dan bab '
 'keberlanjutan dalam semua perjanjian perdagangan dua hala EU. EU juga '
 'berencana untuk menerbitkan proposal untuk kerangka perundangan untuk sistem '
 'makanan berkelanjutan pada tahun 2023 untuk memastikan semua makanan di '
 'pasar EU menjadi semakin berkelanjutan.']
CPU times: user 1min 3s, sys: 26.5 s, total: 1min 30s
Wall time: 26.1 s

[19]: %%time

pprint(transformer_large.greedy_decoder([string_article1, string_article2]))
['Halaman ini berkongsi artikel terbaik saya untuk membaca topik seperti '
 'kesihatan, kebahagiaan, kreativiti, produktiviti dan banyak lagi. Soalan '
 'utama yang mendorong karya saya adalah, "Bagaimana kita dapat hidup lebih '
 'baik?" Untuk menjawab soalan itu, saya suka menulis mengenai kaedah '
 'berasaskan sains untuk menyelesaikan masalah praktikal.',
 'Pemadanan kabur pada skala. Dari 3.7 jam hingga 0.2 saat. Cara melakukan '
 'pemadanan rentetan pintar dengan cara yang dapat meningkatkan skala ke set '
 'data terbesar. Data di dunia nyata tidak kemas. Berurusan dengan set data '
 'yang tidak kemas menyakitkan dan terbakar sepanjang masa yang dapat '
 'dihabiskan untuk menganalisis data itu sendiri.']
CPU times: user 48.9 s, sys: 19.5 s, total: 1min 8s
Wall time: 12.5 s

[20]: %%time

pprint(transformer_large.greedy_decoder([random_string1, random_string2]))
['saya di sekolah perubatan.',
 'Emmerdale adalah album studio debut, lagu-lagu tidak dikeluarkan di AS <> '
 'Lagu-lagu ini tidak dikeluarkan dalam edisi A.S. album tersebut dan '
 'sebelumnya tidak tersedia untuk sebarang pelepasan A.S.']
CPU times: user 29.9 s, sys: 12.3 s, total: 42.2 s
Wall time: 7.36 s


9.53 Long Text Translation

This tutorial is available as an IPython notebook at Malaya/example/long-text-translation.

This module is only trained on standard language structure, so it is not safe to use on local (colloquial) language structure.

[1]: %%time

import malaya
CPU times: user 4.71 s, sys: 916 ms, total: 5.63 s
Wall time: 6.2 s


9.53.1 List available Transformer models

[2]: malaya.translation.ms_en.available_transformer()
INFO:root:tested on 100k MS-EN sentences.
[2]:                Size (MB)  Quantized Size (MB)   BLEU  Suggested length
     small               42.7                 13.4  0.626             256.0
     base               234.0                 82.7  0.792             256.0
     large              815.0                244.0  0.714             256.0
     bigbird            246.0                 63.7  0.678            1024.0
     small-bigbird       50.4                 13.1  0.586            1024.0

If you look at Suggested length, bigbird is 4x longer than the normal transformer models, so it can infer longer text without needing to partition it first. Let's do some examples: we are going to compare the small and base models with the small-bigbird and bigbird models on the MS-EN translation task. Feel free to test EN-MS translation too.
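If you still want the small or base models on text longer than their suggested length, one workaround is to partition the input yourself; a minimal sketch (the chunking heuristic and max_words threshold are our own assumptions, not a Malaya API):

import re

# hedged sketch: pack sentences into chunks below the suggested length,
# translate each chunk with the greedy decoder, then join the chunk translations
def translate_long(model, text, max_words=150):
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s]
    chunks, current = [], []
    for sentence in sentences:
        if current and len(' '.join(current + [sentence]).split()) > max_words:
            chunks.append(' '.join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(' '.join(current))
    return ' '.join(model.greedy_decoder(chunks))

# usage: translate_long(transformer, string), once `string` is defined below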

9.53.2 Load Transformer models

def transformer(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load Transformer encoder-decoder model to translate MS-to-EN.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.
        * ``'large'`` - Transformer LARGE parameters.
        * ``'bigbird'`` - BigBird BASE parameters.
        * ``'small-bigbird'`` - BigBird SMALL parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        if `bigbird` in model, return malaya.model.bigbird.Translation
        else, return malaya.model.tf.Translation
    """

[3]: transformer = malaya.translation.ms_en.transformer()
transformer_small = malaya.translation.ms_en.transformer(model='small')

[4]: bigbird = malaya.translation.ms_en.transformer(model='bigbird')
quantized_bigbird = malaya.translation.ms_en.transformer(model='bigbird', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.


[5]: bigbird_small = malaya.translation.ms_en.transformer(model='small-bigbird')
quantized_bigbird_small = malaya.translation.ms_en.transformer(model='small-bigbird', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

9.53.3 Long text examples

[6]: # https://www.bharian.com.my/berita/nasional/2021/02/785400/kabinet-tidak-pernah-bangkitkan-isu-konflik-kepentingan-najib
string = """
KUALA LUMPUR: Mahkamah Tinggi di sini, hari ini, diberitahu Kabinet tidak pernah membangkitkan isu mengenai konflik kepentingan Datuk Seri Najib Razak dalam 1Malaysia Development Berhad (1MDB).

Menurut Kod Etika Bagi Anggota Pentadbiran dan Ahli Parlimen adalah menjadi amalan ahli Kabinet untuk mengisytiharkan konflik kepentingan sekiranya mempunyai pembabitan di dalam hal yang dibincangkan dalam Mesyuarat Jemaah Menteri.

Perkara itu dimaklumkan bekas Timbalan Ketua Setiausaha (Kabinet) Bahagian Kabinet Perlembagaan dan Perhubungan Jabatan Perdana Menteri (JPM), Tan Sri Mazidah Abdul Majid, dalam keterangannya pada perbicaraan kes 1MDB yang dihadapi Najib.

Kod etika itu antara lain turut menyatakan bahawa ahli Kabinet yang mempunyai kepentingan peribadi dan bercanggah dengan kepentingan kerajaan, atau membabitkan ahli keluarga, perlu meninggalkan mesyuarat dan merekodkan ketidakhadiran mereka.

Di dalam 1MDB, Najib memegang tiga jawatan iaitu Perdana Menteri, Menteri Kewangan dan Pengerusi Badan Penasihat 1MDB

Menjawab soalan peguam Tan Sri Muhammad Shafee Abdullah, sama ada terdapat ahli Kabinet yang membangkitkan isu bahawa Menteri Kewangan tidak seharusnya membabitkan diri dalam perbincangan itu kerana konflik kepentingan, Maziah menjawab: "Tiada."

Muhammad Shafee: Ada sesiapa yang membangkitkan perkara berhubung hal 1MBD dengan Najib?

Maziah: Tidak

Muhammad Shafee: Timbalan Perdana Menteri (ketika itu) adalah Tan Sri Muhyiddin Yassin, manakala Datuk Seri Ahmad Husni Hanadzlah adalah bekas Menteri Kewangan II. Mereka juga tidak pernah membangkitkan hal konflik kepentingan?

Mazidah: Ya, tidak pernah.

Sementara itu, menjawab soalan Timbalan Pendakwa Raya, Ahmad Akram Gharib sama ada beliau mengetahui bahawa Najib mempunyai kepentingan peribadi dalam 1MDB, Mazidah berkata: "Tidak"

Ahmad Akram: Adakah anda mengetahui bahawa Najib secara peribadi menerima duit daripada 1MDB?

Maziah: Tidak

Ahmad Akram: Sekiranya Najib secara peribadi menerima wang daripada 1MDB adakah itu konflik kepentingan dan melanggar Kod Etika Bagi Anggota Pentadbiran dan Ahli Parlimen.

Maziah: Menurut pandangan peribadi saya, ya, namun kerana ia membabitkan menteri dan perdana menteri, saya cadangkan untuk dapatkan pandangan Peguam Negara.

Terdahulu, di awal prosiding, Maziah turut memberitahu mahkamah bahawa Najib tidak pernah menyebut nama ahli perniagaan dalam buruan, Low Taek Jho atau Jho Low pada mesyuarat Kabinet sebagai individu yang membantu beliau mendapatkan sumbangan daripada kerabat diraja Arab Saudi.

"Sekiranya perkara itu dimaklumkan kepada Kabinet, maka, ia akan dicatat dalam minit mesyuarat," katanya.

Tambah Maziah, beliau hanya mendengar dan mengetahui mengenai nama Jho Low selepas timbul isu membabitkan 1MDB.

Najib, 68, menghadapi empat pertuduhan menggunakan kedudukannya untuk memperoleh suapan berjumlah RM2.3 bilion daripada dana 1MDB dan 21 pertuduhan pengubahan wang haram membabitkan jumlah wang yang sama.

Perbicaraan di hadapan Hakim Collin Lawrence Sequerah bersambung Isnin ini.
"""

[7]: # https://www.astroawani.com/berita-malaysia/laporan-audit-1mdb-najib-gagal-gugurkan-sri-ram-daripada-pasukan-pendakwaan-283003
string2 = """
KUALA LUMPUR: Datuk Seri Najib Tun Razak hari ini gagal dalam satu lagi cubaannya untuk menggugurkan bekas Hakim Mahkamah Persekutuan Datuk Seri Gopal Sri Ram daripada mengetuai pasukan pendakwaan dalam kes meminda laporan audit 1Malaysia Development Berhad (1MDB) melibatkan bekas perdana menteri itu.
Ini merupakan cubaan kali ketiga Najib untuk menggugurkan Sri Ram sebagai pendakwa dalam kes jenayah berkaitan 1MDB itu. Sebelum ini, satu permohonan difailkan dalam satu lagi kes 1MDB di hadapan hakim berbeza dan cubaan kedua menerusi prosiding sivil.
Ketika menolak permohonan Ahli Parlimen Pekan itu, Hakim Mahkamah Tinggi Mohamed Zaini Mazlan berkata dakwaan Najib bahawa Sri Ram terlibat dalam siasatan terhadapnya sebagai tidak ada merit.
“Tidak ada bukti kukuh untuk menyokong dakwaan pemohon dan kekal sebagai hipotesis semata-mata. Isu ini telah disiasat oleh pihak pendakwaan selaku responden semasa permohonan pemohon (Najib) yang terdahulu.
“Isu ini sudah dibincangkan dan diputuskan. Keputusan oleh mahkamah lain sebelum ini kekal dan tidak boleh diterbalikkan,” kata hakim.
Mohamed Zaini ketika memberi alasan penolakan berkata pemohon, antara lain, menjadikan komunikasi di antara bekas Peguam Negara Tan Sri Mohamed Apandi Ali dan Sri Ram sebagai bukti berat sebelah Sri Ram terhadap pemohon dan kebimbangan pemohon berhubung perkara itu adalah tidak berasas.
“Seperti juga individu lain, Sri Ram berhak mempunyai pandangan peribadi. Itu sahaja. Namun, pertimbangan berbeza akan dibuat jika beliau menunjukkan sikap berat sebelah semasa melaksanakan tugas sebagai pendakwa raya kanan. Pandangan peribadi beliau tidak boleh dianggap akan menghalang tanggungjawab beliau sebagai pendakwa raya kanan.
“Tambahan pula, kejadian itu, seperti yang dikemukakan oleh responden, berlaku ketika sebelum pelantikan Sri Ram sebagai pendakwa raya kanan. Turut penting ialah pemohon tidak membuat sebarang aduan mengenai tindak-tanduk Sri Ram ketika menjalankan perbicaraan melibatkan pemohon. Ini mengukuhkan hujah responden bahawa Sri Ram bersikap terbuka semasa menjalankan tugas sebagai pendakwa raya kanan,” kata hakim.
Mohamed Zaini seterusnya berkata pada akhirnya, mahkamah bertanggungjawab memastikan sesebuah perbicaraan dilaksanakan secara adil demi mendapatkan keadilan.
“Mahkamah akan membantu mana-mana pihak yang dilayan secara tidak adil, jika perkara tersebut berlaku. Sehubungan itu, permohonan pemohon ditolak,” katanya.
Perbicaraan kes laporan audit 1MDB itu akan bersambung pada 22 Feb ini.
Pada prosiding hari ini, Timbalan Pendakwa Raya Ahmad Akram Gharib bertindak bagi pihak pendakwaan, manakala Najib diwakili peguam Nur Syahirah Hanapiah.
Najib, 67, dan bekas Ketua Pegawai Eksekutif 1MDB Arul Kanda Kandasamy, 45, dibicarakan atas tuduhan meminda laporan audit 1MDB.
Ahli Parlimen Pekan itu dituduh menggunakan kedudukannya untuk mengarahkan pindaan ke atas laporan audit akhir 1MDB sebelum dibentangkan kepada Jawatankuasa Kira-Kira Wang Negara bagi mengelakkan sebarang tindakan diambil terhadapnya, sementara Arul Kanda didakwa bersubahat dengan Najib dalam membuat pindaan ke atas laporan tersebut bagi melindungi Najib daripada dikenakan tindakan.
"""

[8]: # https://miasa.org.my/blog/2018/08/artikel-sokongan-rakan-senasib-dapat-mengurangkan-kemasukan-semula/
string3 = """
3 Ogos 2018 - Kajian terbaharu yang dijalankan oleh University College London (UCL) mendapati khidmat penjagaan kesihatan mental daripada pekerja sokongan (support worker) yang pernah mengharungi sendiri penyakit ini mungkin boleh membantu mengurangkan kebarangkalian pesakit yang baru keluar dari penjagaan kesihatan mental akut (acute mental health care) daripada dimasukkan semula ke unit berkenaan.

Kertas penyelidikan tersebut yang diterbitkan pada hari ini dalam jurnal The Lancet mendapati bilangan pesakit mental yang dimasukkan semula ke unit penjagaan akut dalam tempoh setahun ialah 24% jauh lebih rendah bagi kumpulan pesakit yang ditawarkan pekerja sokongan berbentuk rakan senasib, berbanding dengan kumpulan pesakit yang hanya diberi buku kerja pemulihan peribadi sahaja.

“Pesakit yang dibenarkan keluar (discharge) dari perkhidmatan krisis komuniti sering dimasukkan semula ke unit penjagaan akut. Ini bukan sahaja membantutkan pemulihan, malah menggunapakai sumber yang sepatutnya dikhususkan untuk penambahbaikan jangka panjang fungsi dan kualiti hidup pesakit. Pekerja sokongan daripada golongan rakan senasib berupaya memberikan sokongan dan dorongan yang benar-benar mesra dan berempati kerana ianya datang daripada pengalaman peribadi mereka sendiri, di samping menjadi contoh tauladan (role model) yang baik untuk pemulihan pesakit,” kata pengarang utama, Profesor Sonia Johnson (Psikiatri, UCL).

Di United Kingdom (UK), lebih separuh daripada jumlah pesakit yang dimasukkan ke unit penjagaan akut dimasukkan semula ke unit berkenaan dalam tempoh setahun, namun tiada bukti kukuh yang menjelaskan kaedah untuk mengurangkan jumlah ini.

Sebenarnya, sokongan daripada individu yang pernah mengalami masalah kesihatan mental telah dipraktikkan dalam program seperti Pelaksanaan Pemulihan melalui Perubahan Organisasi NHS (UK) dan juga Pelan Tindakan Pemulihan Kesihatan Amerika Syarikat. Penyelidikan ini merupakan ujian terkawal rawak pertama yang seumpamanya untuk menilai keberkesanan program sokongan rakan senasib. Walaupun begitu, masih banyak penyelidikan yang perlu dijalankan sebelum strategi ini boleh dilaksanakan secara menyeluruh di UK, contohnya untuk memahami sebab-sebab mengapa program ini berkesan.

Pastinya, campur tangan atau intervensi dalam pengurusan kendiri (self-management) mungkin boleh membantu pesakit menguruskan kesihatan mental mereka dengan lebih baik. Dalam kajian ini, penyelidik menyediakan peserta sama ada dengan buku kerja pemulihan peribadi sahaja (220 orang) atau buku kerja berserta bantuan pekerja sokongan rakan senasib (221 orang). Peserta juga dibenarkan meneruskan penjagaan biasa mereka.

Kajian ini dijalankan menerusi enam buah pasukan penyelesaian krisis kesihatan mental (crisis resolution team) di England dan peserta dipilih hanya selepas mereka dibenarkan keluar dari unit krisis oleh pasukan penyelesaian krisis. Peserta kajian terdiri daripada pelbagai diagnosis antaranya termasuk skizofrenia, bipolar, psikosis, kemurungan, kecelaruan keresahan, kecelaruan stres pascatrauma (PTSD), dan kecelaruan personaliti.

Peserta yang menerima sokongan rakan senasib ditawarkan 10 sesi pertemuan mingguan selama satu jam. Dalam sesi pertemuan ini, pekerja sokongan mendengar masalah peserta dan bermatlamat menyemaikan harapan dengan berkongsi kemahiran dan strategi pengurusan penyakit (coping strategies) yang dikuasai sewaktu mereka sendiri berada dalam proses pemulihan. Kesemua pekerja sokongan diberikan latihan terlebih dahulu dalam kemahiran mendengar, kesedaran budaya, pendedahan kendiri, dan kerahsiaan, serta cara menggunakan buku kerja.

Rekod kesihatan peserta dipantau oleh penyelidik selama satu tahun untuk menentukan sama ada peserta dimasukkan semula ke unit penjagaan akut (seperti wad pesakit akut, pasukan penyelesaian krisis, rumah krisis, dan perkhidmatan penjagaan harian akut) atau tidak.

Selepas tamat tempoh satu tahun, kajian mendapati peratusan kemasukan semula ke unit penjagaan akut adalah lebih rendah bagi kumpulan peserta yang menerima intervensi berbanding dengan kelompok kawalan -- dengan 29% peserta dimasukkan semula setelah menerima sokongan rakan senasib, berbanding dengan 38% peserta yang hanya menerima buku kerja.

Kadar penyerapan intervensi juga baik -- 72% daripada peserta yang ditawarkan khidmat pekerja sokongan rakan senasib dan buku kerja menghadiri sekurang-kurangnya tiga sesi pertemuan, manakala satu pertiga menghadiri kesemua 10 sesi pertemuan.

“Kami sedia maklum bahawa banyak pengguna perkhidmatan penjagaan krisis kesihatan mental merasakan perkhidmatan ini terhenti secara mendadak kerana kurangnya penjagaan susulan yang siap sedia. Kajian kami menunjukkan bahawa pekerja sokongan rakan senasib dapat membantu mengisi kelompangan ini dengan membantu pengguna perkhidmatan (pesakit) membangunkan strategi pengurusan kendiri dan pelan pemulihan dengan cara tersendiri dan lebih bermakna, seterusnya dapat membantu pesakit terus bertahan sesudah berlakunya sesuatu krisis,” kata pengarang bersama, Dr Brynmor Lloyd-Evans (Psikiatri, UCL).
"""


Do simple cleaning: remove newlines and weird characters.

[9]: import re
from unidecode import unidecode

def cleaning(string):
    # replace newlines with spaces, transliterate unusual characters to ASCII,
    # then collapse repeated spaces and strip the ends
    return re.sub(r'[ ]+', ' ', unidecode(string.replace('\n', ' '))).strip()

[10]: string = cleaning(string)
string2 = cleaning(string2)
string3 = cleaning(string3)

[11]: len(string.split()), len(string2.split()), len(string3.split())
[11]: (406, 433, 649)

All three passages are far beyond the 256 suggested length of the small and base models, but within the 1024 suggested length of the bigbird models, so expect the shorter models to degrade on these inputs.

9.53.4 Translate using greedy decoder

[12]: from pprint import pprint

[13]: pprint(transformer.greedy_decoder([string]))
['KUALA LUMPUR: The High Court here today said the Cabinet had not raised the '
 "issue of Datuk Seri Najib Razak's conflict in 1Malaysia Development Berhad "
 '(1MDB). According to the Code of Ethics for Administrators and Members of '
 'Parliament it is customary practice for Cabinet members to declare conflicts '
 'of interest if they have been involved in matters discussed at the 1MDB '
 'Cabinet. The matter was announced by the former Deputy Chief Secretary '
 '(Cabinet) of the Constitutional and Liaison Division of the Prime Minister, '
 'Tan Sri Mazidah Abdul Majid, in his statement at the 1MDB hearing. The code '
 'of ethics is also stated that the Cabinet members have personal interests '
 'and conflicts with the government, or involving family members, had to leave '
 'meetings and record their absence. In 1MDB, Najib held three positions, '
 "namely the Prime Minister's Office of Finance and the Prime Minister's "
 "Chairman of the Prime Minister's Office of the Prime Minister, namely Tan "
 'Sri Mazram Abdullah Abdullah Abdullah Shafee Mazidah Mazghani Mazram, who '
 'had been informed in the former cabinet of the matter of the 1MDB.']

[14]: pprint(transformer_small.greedy_decoder([string])) ['KUALA LUMPUR: The High Court here today, said that the Cabinet had never ' "raised the issue of Datuk Seri Najib Razak's interest in 1Malaysia " 'Development Berhad (1MDB). According to the Code of Ethics For Members of ' 'the Administration and Member of Parliament is the practice of Cabinet ' 'members to declare conflict of interest if he had been discussed in matters ' 'discussed at the Cabinet. Article. The matter was informed that former ' 'Secretary-General (Cabinet) of the Cabinet, he said. Tan Sri Mazidah Abdul ' 'Majid, in his statement at the 1MDB trial, in his statement at the 1MDB ' 'case, Najib, the case of the 1MDB, he said. Najib, he had not been informed ' 'that the case of the case, Najib, he said. Najib, he said. Najib, he had not ' 'been informed that the case, he had informed that the case, Najib had ' 'informed that the 1MDB, he had not been informed that the case, he had ' 'informed that the 1MDB, he had informed that the case, he had given that he ' 'had informed that the case, he had informed that the case, he had informed ' 'of the case, Najib, he had given that the case, he had informed that the ' 'case, Najib, he had not been informed that the 1MDB, he had not been ' (continues on next page)


(continued from previous page) 'informed that he had informed that the case, he had given the case, he had ' 'given him that the case, he had not been informed that the case, Najib, he ' 'had informed that he had informed that he had informed that he had been ' 'informed that he had informed that the 1MDB, he had informed of the 1MDB, he ' "had informed that the 1MDB, he had informed that the 1MDB's case, Najib, he " 'had informed of the case. Najib, he had informed that the case. Najib, he ' 'had been that the case, he had not been informed that he had been informed ' 'that he had informed of the case, he had informed that the 1MDB, he had been ' 'informed that the case, he had been informed that the 1MDB, Najib, he had ' 'been that he had informed of the case. Najib, he had informed of the case, ' 'he had been informed of the 1MDB, he had not been informed of the case. ' 'Najib, he had not been informed of the case. Najib, he had not been informed ' 'of the case. Najib, he had been informed of the case, he had been that the ' 'case, he had informed that the 1MDB, he had informed of the case, he had ' 'been informed of the case. Najib, he had been informed of the case. Najib, ' 'he had been informed that the case. Najib, he had not been informed that the ' 'case. Najib, he had informed of the case. Najib, he had informed that the ' 'case. Najib, he had informed that the case. Najib, he had been informed that ' 'he had been informed that the 1MDB, he had informed that the case, in his ' 'case. Najib, in the case. Najib, and Najib, he had informed that the case. ' 'Najib, he had informed that the case, he had informed that the case, he had ' 'informed that the case. Najib, he had informed that the case. Najib, he had ' 'given him that the case, he had not been informed that the case, he had been ' 'that the case, he had not been informed that he had not been informed that ' 'the case, he had informed that he had not been informed that he had informed ' 'that he had been informed that the case, he had been that the case. Najib, ' 'he had informed that he had informed that the case. Najib, he had informed ' 'that he had been informed that he had been informed of the case. Najib, he ' 'had been informed of the case. Najib, he had been informed that he had been ' 'informed that he had been informed that the case. Najib, he had been ' 'informed that the case. Najib, he had been informed that he had repeatedly ' 'informed that the 1MDB, he had been informed that the 1MDB, he had been ' 'informed of the case. Najib, he had informed that the case. Najib, and ' 'Najib, he had informed that the 1MDB, he had informed that the case. Najib, ' 'and Najib, he had not been that the case. Najib, and that he had informed ' 'that the case. Najib, he had informed that he had informed that he had ' 'repeatedly informed that he had informed that the case, he had informed that ' 'the 1MDB, he had been informed that the case of the 1MDB, he had been ' 'informed that he had informed that he had informed that he had informed that ' 'he had not been informed that he had been informed that the case. Najib, and ' "that the 1MDB's case. Najib, he had informed that he had not informed that " 'the case. Najib, and that the case. Najib, he had not been informed that the ' 'case. Najib, he had informed of the case of the case. Najib, he had informed ' 'that the case. Najib, he had not been informed that the case. 
Najib, he had ' 'informed that he had not been informed that he had repeatedly informed of ' 'the 1MDB, he had not been informed that he had informed of the case. Najib, ' 'he had been that the case, he had informed that the case. Najib, and Najib, ' 'he had been informed that the case, he had repeatedly informed that the ' 'case. Najib, he had given the case. Najib, and that the case. Najib, he had ' 'repeatedly, he']

Normal transformer models are not able to infer long texts, so we need to partition the text first, for example by splitting it into sentences.

[15]: partition_string = malaya.text.function.split_into_sentences(string)
      partition_string


[15]: ['KUALA LUMPUR: Mahkamah Tinggi di sini, hari ini, diberitahu Kabinet tidak pernah membangkitkan isu mengenai konflik kepentingan Datuk Seri Najib Razak dalam 1Malaysia Development Berhad (1MDB).',
 'Menurut Kod Etika Bagi Anggota Pentadbiran dan Ahli Parlimen adalah menjadi amalan ahli Kabinet untuk mengisytiharkan konflik kepentingan sekiranya mempunyai pembabitan di dalam hal yang dibincangkan dalam Mesyuarat Jemaah Menteri.',
 'Perkara itu dimaklumkan bekas Timbalan Ketua Setiausaha (Kabinet) Bahagian Kabinet Perlembagaan dan Perhubungan Jabatan Perdana Menteri (JPM), Tan Sri Mazidah Abdul Majid, dalam keterangannya pada perbicaraan kes 1MDB yang dihadapi Najib.',
 'Kod etika itu antara lain turut menyatakan bahawa ahli Kabinet yang mempunyai kepentingan peribadi dan bercanggah dengan kepentingan kerajaan, atau membabitkan ahli keluarga, perlu meninggalkan mesyuarat dan merekodkan ketidakhadiran mereka.',
 'Di dalam 1MDB, Najib memegang tiga jawatan iaitu Perdana Menteri, Menteri Kewangan dan Pengerusi Pengerusi Badan Penasihat 1MDB Menjawab soalan peguam Tan Sri Muhammad Shafee Abdullah, sama ada terdapat ahli Kabinet yang membangkitkan isu bahawa Menteri Kewangan tidak seharusnya membabitkan diri dalam perbincangan itu kerana konflik kepentingan, Maziah menjawab: "Tiada".',
 'Muhammad Shafee: Ada sesiapa yang membangkitkan perkara berhubung hal 1MBD dengan Najib?',
 'Maziah: Tidak Muhammad Shafee: Timbalan Perdana Menteri (ketika itu) adalah Tan Sri Muhyiddin Yassin, manakala Datuk Seri Ahmad Husni Hanadzlah adalah bekas Menteri Kewangan II.',
 'Mereka juga tidak pernah membangkitkan hal konflik kepentingan?',
 'Mazidah: Ya, tidak pernah.',
 'Sementara itu, menjawab soalan Timbalan Pendakwa Raya, Ahmad Akram Gharib sama ada beliau mengetahui bahawa Najib mempunyai kepentingan peribadi dalam 1MDB, Mazidah berkata: "Tidak" Ahmad Akram: Adakah anda mengetahui bahawa Najib secara peribadi menerima duit daripada 1MDB?',
 'Maziah: Tidak Ahmad Akram: Sekiranya Najib secara peribadi menerima wang daripada 1MDB adakah itu konflik kepentingan dan melanggar Kod Etika Bagi Anggota Pentadbiran dan Ahli Parlimen.',
 'Maziah: Menurut pandangan peribadi saya, ya, namun kerana ia membabitkan menteri dan perdana menteri, saya cadangkan untuk dapatkan pandangan Peguam Negara.',
 'Terdahulu, di awal prosiding, Maziah turut memberitahu mahkamah bahawa Najib tidak pernah menyebut nama ahli perniagaan dalam buruan, Low Taek Jho atau Jho Low pada mesyuarat Kabinet sebagai individu yang membantu beliau mendapatkan sumbangan daripada kerabat diraja Arab Saudi.',
 '"Sekiranya perkara itu dimaklumkan kepada Kabinet, maka, ia akan dicatat dalam minit mesyuarat," katanya.',
 'Tambah Maziah, beliau hanya mendengar dan mengetahui mengenai nama Jho Low selepas timbul isu membabitkan 1MDB.',
 'Najib, 68, menghadapi empat pertuduhan menggunakan kedudukannya untuk memperoleh suapan berjumlah RM2.3 bilion daripada dana 1MDB dan 21 pertuduhan pengubahan wang haram membabitkan jumlah wang yang sama.',
 'Perbicaraan di hadapan Hakim Collin Lawrence Sequerah bersambung Isnin ini.']

[16]: pprint(transformer.greedy_decoder(partition_string)) ['KUALA LUMPUR: The High Court here today said the Cabinet had never raised ' "the issue of Datuk Seri Najib Razak's conflict of interest in 1Malaysia " 'Development Berhad (1MDB).', 'According to the Code of Ethics for Administrative Members and Members of ' 'Parliament it is the practice of Cabinet members to declare conflicts of ' 'interest if they have any involvement in the matter discussed at the Cabinet ' 'Meeting.', 'The matter was informed by former Deputy Secretary-General (Cabinet) of the ' "Constitutional and Liaison Division of the Prime Minister's Department " (continues on next page)


(continued from previous page) '(JPM), Tan Sri Mazidah Abdul Majid, in his testimony at the 1MDB trial ' 'hearing facing Najib.', 'The code of ethics, among others, states that Cabinet members who have ' 'personal interests and conflict with the interests of the government, or ' 'involve family members, must leave meetings and record their absence.', 'In 1MDB, Najib held three positions namely Prime Minister, Finance Minister ' 'and Chairman of 1MDB Advisory Board Answering the question of lawyer Tan Sri ' 'Muhammad Shafee Abdullah, whether there were Cabinet members who raised the ' 'issue that the Minister of Finance should not be involved in the discussion ' 'because of conflict of interest, Maziah replied: "None".', 'Muhammad Shafee: Is there anyone raising questions about 1MBD with Najib?', 'Maziah: Muhammad Shafee: Deputy Prime Minister (then) was Tan Sri Muhyiddin ' 'Yassin, while Datuk Seri Ahmad Husni Hanadzlah was the former Minister of ' 'Finance II.', 'They have never raised any issues of conflict of interest?', 'Mazidah: Yes, never.', "Meanwhile, responding to Deputy Public Prosecutor Ahmad Akram Gharib's " 'question as to whether he knew Najib had personal interests in 1MDB, Mazidah ' 'said: "No" Ahmad Akram: Do you know that Najib personally received money ' 'from 1MDB?', 'Maziah: No Ahmad Akram: If Najib personally receives money from 1MDB whether ' 'it is a conflict of interest and violates the Code of Ethics for ' 'Administrative Members and Members of Parliament.', 'Maziah: In my personal opinion, yes, but because it involves ministers and ' "prime ministers, I suggest to get the Attorney General's views.", 'Earlier, at the beginning of the proceedings, Maziah also told the court ' 'that Najib had never mentioned the name of the businessman in the game, Low ' 'Taek Jho or Jho Low at a Cabinet meeting as an individual who helped him get ' "donations from Saudi Arabia's royal family.", '"If the matter is communicated to the Cabinet, then it will be recorded at ' 'the minutes of the meeting," he said.', "Maziah added that he only heard and knew about Jho Low's name after the 1MDB " 'issue.', 'Najib, 68, faces four charges of using his position to obtain RM2.3 billion ' 'in bribes from 1MDB funds and 21 counts of money laundering involving the ' 'same amount.', 'The trial before Judge Collin Lawrence Sequerah continues this Monday.']

[17]: pprint(transformer_small.greedy_decoder(partition_string)) ['KUALA LUMPUR: The High Court here today said the Cabinet had never raised ' 'the issue of conflict of interest between Datuk Seri Najib Razak in ' '1Malaysia Development Berhad (1MDB).', 'According to the Code of Ethics For Members of the Administration and MPs is ' 'the practice of Cabinet members to declare conflicts of interest if they ' 'have involvement in matters discussed at the Cabinet Meeting.', 'The matter was informed by former Deputy Secretary-General (Cabinet) of the ' "Cabinet of the Constitution and the Relations of the Prime Minister's " 'Department (JPM), Tan Sri Mazidah Abdul Majid, in his statement at the 1MDB ' 'case trial against Najib.', 'The code of ethics also states that Cabinet members with personal interests ' 'and conflict with the interests of the government, or involving family ' 'members, need to leave the meeting and record their absence.', 'In 1MDB, Najib held three positions, Prime Minister, Finance Minister and ' 'Chairman of the 1MDB Advisory Agency Responding to lawyer Tan Sri Muhammad ' 'Shafee Abdullah, whether Cabinet members raised the issue that the Minister ' 'of Finance should not involve the discussion because of the conflict of ' (continues on next page)


(continued from previous page) 'interest, Maziah replied: "No".', 'Muhammad Shafee: Someone who raises things about 1MBD with Najib?', 'Maziah: Muhammad Shafee: Deputy Prime Minister (then) was Tan Sri Muhyiddin ' 'Yassin, while Datuk Seri Ahmad Husni Hanadzlah was the former Minister of ' 'Finance II.', 'They have never raised conflicts of interest?', 'Mazidah: Yes, never.', "Meanwhile, responding to Deputy Public Prosecutor Ahmad Akram Gharib's " 'question whether he knew that Najib had personal interests in 1MDB, Mazidah ' 'said: "No" Ahmad Akram: Do you know that Najib personally receives money ' 'from 1MDB?', 'Maziah: Ahmad Akram: If Najib personally receives money from 1MDB whether it ' 'is a conflict of interest and violates the Code of Ethics for Members of the ' 'Administration and Member of Parliament.', 'Maziah: According to my personal view, yes, but because it involves ' 'ministers and prime ministers, I propose to get the views of the Attorney ' 'General.', 'Earlier, in the early proceedings, Maziah also told the court that Najib had ' 'never mentioned the name of a businessman in the hunt, Low Taek Jho or Jho ' 'Low at a Cabinet meeting as an individual who helped him get a donation from ' "Saudi Arabia's royal family.", '"If the matter is communicated to the Cabinet, then it will be recorded in ' 'the minutes of the meeting," he said.', "Maziah added that he only heard and learned about Jho Low's name after the " 'issue of involving 1MDB.', 'Najib, 68, faces four charges of using his position to obtain a RM2.3 ' 'billion bribe from 1MDB funds and 21 counts of money laundering involving ' 'the same amount of money.', 'The trial before Judge Collin Lawrence Sequerah continues this Monday.']
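To get a single translated paragraph back, the per-sentence outputs can be joined again. Below is a minimal sketch reusing the split_into_sentences helper and greedy_decoder shown above; the translate_long_text wrapper is our own illustration, not part of the Malaya API:

[ ]: def translate_long_text(text, model):
         # partition the long text into sentences the model can handle
         sentences = malaya.text.function.split_into_sentences(text)
         # translate all sentences in one batched call
         translated = model.greedy_decoder(sentences)
         # join the per-sentence translations back into a single paragraph
         return ' '.join(translated)

     translate_long_text(string, transformer)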

9.53.5 Problem with partitioning

The problem with partitioning is that the model assumes element N + 1 has no relationship with element N, and vice versa: the attention mechanism cannot attend across partitions, so any context that spans sentence boundaries is lost. We introduced BigBird, whose sparse attention scales to much longer inputs, to solve this problem.
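For reference, the BigBird models used in the next cells were loaded earlier in this notebook. A minimal sketch of that loading step, assuming the malaya.translation.ms_en.transformer API with 'bigbird' and 'small-bigbird' model names (verify against malaya.translation.ms_en.available_transformer()):

[ ]: # load the BigBird MS-to-EN models; model names here are assumptions,
     # check malaya.translation.ms_en.available_transformer() for the exact list
     bigbird_small = malaya.translation.ms_en.transformer(model = 'small-bigbird')
     bigbird = malaya.translation.ms_en.transformer(model = 'bigbird')
     # quantized variants trade a little accuracy for smaller size and faster inference
     quantized_bigbird_small = malaya.translation.ms_en.transformer(model = 'small-bigbird', quantized = True)
     quantized_bigbird = malaya.translation.ms_en.transformer(model = 'bigbird', quantized = True)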

[18]: pprint(bigbird_small.greedy_decoder([string])) ['KUALA LUMPUR: The High Court here today, was told the Cabinet never raised ' 'issues about the conflict of interest of Datuk Seri Najib Razak in 1Malaysia ' 'Development Berhad (1MDB). According to the Code of Ethics for Members of ' 'Administration and Member of Parliament is the practice of Cabinet members ' 'to declare conflicts of interest if they have involvement in matters ' 'discussed in the Cabinet Meeting. The matter was informed by the former ' 'Deputy Secretary General (Cabinet) of the Cabinet of the Constitution and ' "Relations of the Prime Minister's Department (JPM), Tan Sri Mazidah Abdul " 'Majid, in his statement at the 1MDB case trial faced by Najib. The code of ' 'ethics, among others, also stated that Cabinet members who have personal ' 'interests and contradict the interests of the government, or involving ' 'family members, should leave the meeting and record their absence. In 1MDB, ' 'Najib holds three positions, namely the Prime Minister, Finance Minister and ' 'Chairman of the 1MDB Advisory Body Answering a question from lawyer Tan Sri ' 'Muhammad Shafee Abdullah, whether there are Cabinet members who raise the ' 'issue that the Minister of Finance should not engage in the discussion ' 'because conflicts of interest, Maziah replied: "No." Muhammad Shafee: ' (continues on next page)


(continued from previous page) 'Someone raised matters regarding 1MBD with Najib? Maziah: Not Muhammad ' 'Shafee: Deputy Prime Minister (then) is Tan Sri Muhyiddin Yassin, while ' 'Datuk Seri Ahmad Husni Hanadzlah is a former Finance Minister II. They also ' 'never raised matters of interest conflict? Mazidah: Yes, never. Meanwhile, ' 'answering a question from Deputy Public Prosecutor Ahmad Akram Gharib ' 'whether he knew that Najib had personal interest in 1MDB, Mazidah said: "No" ' 'Ahmad Akram: Do you know that Najib personally received money from 1MDB? ' 'Maziah: No Ahmad Akram: If Najib personally receives money from 1MDB is it a ' 'conflict of interest and violates the Code of Ethics for Members of ' 'Administration and Member of Parliament. Maziah: According to my personal ' 'opinion, yes, but because it involves ministers and prime minister, I ' 'suggest to get the views of the Attorney General. Earlier, at the beginning ' 'of the proceedings, Maziah also told the court that Najib had never ' 'mentioned the name of a businessman in the hunt, Low Taek Jho or Jho Low at ' 'a Cabinet meeting as an individual who helped him get donations from the ' 'royal family of Saudi Arabia. "If the matter was informed to the Cabinet, ' 'then, it will be recorded in the minutes of the meeting," he said. Maziah ' "added that he only heard and found out about Jho Low's name after the issue " 'involving 1MDB. Najib, 68, faced four charges of using his position to ' 'obtain bribes amounting to RM2.3 billion from 1MDB funds and 21 charges of ' 'money laundering involving the same amount of money. The trial before Judge ' 'Collin Lawrence Sequerah continues this Monday.']

[19]: pprint(quantized_bigbird_small.greedy_decoder([string])) ['KUALA LUMPUR: The High Court here today, was told the Cabinet never raised ' 'issues about the conflict of interest of Datuk Seri Najib Razak in 1Malaysia ' 'Development Berhad (1MDB). According to the Code of Ethics for Members of ' 'Administration and Member of Parliament is the practice of Cabinet members ' 'to declare conflicts of interest if they have involvement in matters ' 'discussed in the Cabinet Meeting. The matter was informed by the former ' 'Deputy Secretary General (Cabinet) of the Cabinet of the Constitution and ' "Relations of the Prime Minister's Department (JPM), Tan Sri Mazidah Abdul " 'Majid, in his statement at the 1MDB case trial faced by Najib. The code of ' 'ethics, among others, also stated that Cabinet members who have personal ' 'interests and contradict the interests of the government, or involve family ' 'members, should leave the meeting and record their absence. In 1MDB, Najib ' 'holds three positions, namely the Prime Minister, Finance Minister and ' 'Chairman of the 1MDB Advisory Body Answering a question from lawyer Tan Sri ' 'Muhammad Shafee Abdullah, whether there are Cabinet members who raise the ' 'issue that the Minister of Finance should not engage in the discussion ' 'because conflicts of interest, Maziah replied: "No." Muhammad Shafee: ' 'Someone raised matters regarding 1MBD with Najib? Maziah: Not Muhammad ' 'Shafee: Deputy Prime Minister (then) is Tan Sri Muhyiddin Yassin, while ' 'Datuk Seri Ahmad Husni Hanadzlah is a former Finance Minister II. They also ' 'never raised matters of interest conflict? Mazidah: Yes, never. Meanwhile, ' 'answering the question of Deputy Public Prosecutor, Ahmad Akram Gharib ' 'whether he knew that Najib had personal interest in 1MDB, Mazidah said: "No" ' 'Ahmad Akram: Do you know that Najib personally received money from 1MDB? ' 'Maziah: Not Ahmad Akram: If Najib personally receives money from 1MDB is it ' 'a conflict of interest and violates the Code of Ethics for Members of ' 'Administration and Member of Parliament. Maziah: According to my personal ' 'opinion, yes, but because it involves ministers and prime minister, I ' 'suggest to get the views of the Attorney General. Earlier, at the beginning ' 'of the proceedings, Maziah also told the court that Najib never mentioned ' 'the name of a businessman in the hunt, Low Taek Jho or Jho Low at a Cabinet ' 'meeting as an individual who helped him get donations from the royal family ' (continues on next page)


(continued from previous page) 'of Saudi Arabia. "If the matter was informed to the Cabinet, then, it will ' 'be recorded in the minutes of the meeting," he said. Maziah added that he ' "only heard and found out about Jho Low's name after the issue involving " '1MDB. Najib, 68, faced four charges of using his position to obtain bribes ' 'amounting to RM2.3 billion from 1MDB funds and 21 charges of money ' 'laundering involving the same amount of money. The trial before Judge Collin ' 'Lawrence Sequerah continues this Monday.']

[20]: pprint(bigbird.greedy_decoder([string])) ['KUALA LUMPUR: The High Court here today was told that the Cabinet had never ' 'raised the issue of the conflict of interest of Datuk Seri Najib Razak in ' '1Malaysia Development Berhad (1MDB). According to the Code of Ethics for ' 'Administrative Members and Members of Parliament, it is the practice of ' 'Cabinet members to declare conflicts of interest if they have involvement in ' 'matters discussed in the Cabinet Meeting. The matter was informed by the ' 'former Deputy Secretary General (Cabinet) of the Cabinet Division of the ' "Constitution and the Relations of the Prime Minister's Department (JPM), Tan " 'Sri Mazidah Abdul Majid, in his testimony at the 1MDB case trial faced by ' 'Najib. The code of ethics, among others, also states that Cabinet members ' 'who have personal interests and are contrary to the interests of the ' 'government, or involving family members, have to leave the meeting and ' 'record their absence. In 1MDB, Najib held three positions, namely the Prime ' 'Minister, Finance Minister and Chairman of the Chairman of the 1MDB Advisory ' 'Body, answered a question from lawyer Tan Sri Muhammad Shafee Abdullah, ' 'whether there were Cabinet members who raised the issue that the Minister of ' 'Finance should not be involved in the discussion due to conflicts of ' 'interest, Maziah replied: "No." Muhammad Shafee: There are anyone who raised ' 'the matter of 1MBD with Najib? Maziah: No Muhammad Shafee: The Deputy Prime ' 'Minister (then) is Tan Sri Muhyiddin Yassin, while Datuk Seri Ahmad Husni ' 'Hanadzlah is the former Minister of Finance II. They have never raised the ' 'conflict of interest? Mazidah: Yes, never. Meanwhile, answering a question ' 'from Deputy Public Prosecutor Ahmad Akram Gharib whether he knew that Najib ' 'had a personal interest in 1MDB, Mazidah said: "No" Ahmad Akram: Do you know ' 'that Najib personally receives money from 1MDB? Maziah: No Ahmad Akram: If ' 'Najib personally receives money from 1MDB is that a conflict of interest and ' 'violates the Code of Ethics for Administrative Members and Members of ' 'Parliament. Maziah: According to my personal view, yes, but because it ' 'involves ministers and prime ministers, I propose to get the views of the ' 'Attorney General. Earlier, at the beginning of the proceedings, Maziah also ' 'told the court that Najib had never mentioned the name of a businessman in ' 'the hunt, Low Taek Jho or Jho Low at a Cabinet meeting as an individual who ' 'helped him get donations from the royal family of Saudi Arabia. "If the ' 'matter is informed to the Cabinet, then, it will be recorded in the minutes ' 'of the meeting," he said. Maziah added that he only heard and knew about Jho ' "Low's name after an issue involving 1MDB. Najib, 68, faces four charges of " 'using his position to obtain bribes amounting to RM2.3 billion from 1MDB ' 'funds and 21 charges of money laundering involving the same amount of money. ' 'The trial before Judge Collin Lawrence Sequerah continues this Monday.']

[21]: pprint(quantized_bigbird.greedy_decoder([string])) ['KUALA LUMPUR: The High Court here today was told that the Cabinet had never ' 'raised the issue of the conflict of interest of Datuk Seri Najib Razak in ' '1Malaysia Development Berhad (1MDB). According to the Code of Ethics for ' 'Administrative Members and Members of Parliament, it is the practice of ' 'Cabinet members to declare conflicts of interest if they have involvement in ' (continues on next page)


(continued from previous page) 'matters discussed in the Cabinet Meeting. The matter was informed by the ' 'former Deputy Secretary General (Cabinet) of the Cabinet Division of the ' "Constitution and the Relations of the Prime Minister's Department (JPM), Tan " 'Sri Mazidah Abdul Majid, in his testimony at the 1MDB case trial faced by ' 'Najib. The code of ethics, among others, also states that Cabinet members ' 'who have personal interests and are contrary to the interests of the ' 'government, or involving family members, have to leave the meeting and ' 'record their absence. In 1MDB, Najib held three positions, namely the Prime ' 'Minister, Finance Minister and Chairman of the Chairman of the 1MDB Advisory ' 'Body, answered a question from lawyer Tan Sri Muhammad Shafee Abdullah, ' 'whether there were Cabinet members who raised the issue that the Minister of ' 'Finance should not be involved in the discussion due to conflicts of ' 'interest, Maziah replied: "No." Muhammad Shafee: There are anyone who raised ' 'the matter of 1MBD with Najib? Maziah: No Muhammad Shafee: The Deputy Prime ' 'Minister (then) is Tan Sri Muhyiddin Yassin, while Datuk Seri Ahmad Husni ' 'Hanadzlah is the former Minister of Finance II. They have never raised the ' 'conflict of interest? Mazidah: Yes, never. Meanwhile, answering a question ' 'from the Deputy Public Prosecutor, Ahmad Akram Gharib whether he knew that ' 'Najib had a personal interest in 1MDB, Mazidah said: "No" Ahmad Akram: Do ' 'you know that Najib personally receives money from 1MDB? Maziah: No Ahmad ' 'Akram: If Najib personally receives money from 1MDB is that there is a ' 'conflict of interest and violates the Code of Ethics for Administrative ' 'Members and Members of Parliament. Maziah: According to my personal view, ' 'yes, but because it involves ministers and prime ministers, I propose to get ' 'the views of the Attorney General. Earlier, at the beginning of the ' 'proceedings, Maziah also told the court that Najib had never mentioned the ' 'name of a businessman in the hunt, Low Taek Jho or Jho Low at a Cabinet ' 'meeting as an individual who helped him get donations from the royal family ' 'of Saudi Arabia. "If the matter is informed to the Cabinet, then, it will be ' 'recorded in the minutes of the meeting," he said. Maziah added that he only ' "heard and knew about Jho Low's name after an issue involving 1MDB. Najib, " '68, faces four charges of using his position to obtain bribes amounting to ' 'RM2.3 billion from 1MDB funds and 21 charges of money laundering involving ' 'the same amount of money. The trial before Judge Collin Lawrence Sequerah ' 'continues this Monday.']

[22]: pprint(bigbird_small.greedy_decoder([string2])) ['KUALA LUMPUR: Datuk Seri Najib Tun Razak today failed in another attempt to ' 'drop former Federal Court Judge Datuk Seri Gopal Sri Ram from leading the ' 'prosecution team in the case of amending the 1Malaysia Development Berhad ' "(1MDB) audit report involving the former prime minister. This is Najib's " 'third attempt to drop Sri Ram as prosecutor in the 1MDB-related criminal ' 'case. Earlier, an application was filed in another 1MDB case before a ' 'different judge and a second attempt through civil proceedings. While ' 'rejecting the application of the Pekan Member of Parliament, High Court ' "Judge Mohamed Zaini Mazlan said Najib's allegation that Sri Ram was involved " 'in the investigation into him as there was no merit. "There is no solid ' "evidence to support the applicant's allegation and remain a mere hypothesis. " 'This issue has been investigated by the prosecution as the previous ' 'respondent during the application of the applicant (Najib)." This issue has ' 'been discussed and decided. The decision by other courts previously remained ' 'and could not be reversed, "said the judge. Mohamed Zaini when giving ' 'reasons for rejection said applicants, among others, made communication ' 'between former Attorney General Tan Sri Mohamed Apandi Ali and Sri Ram as ' "evidence of Sri Ram's biased side of the applicant and the applicant's " 'concern regarding the matter was baseless. "As well as other individuals, ' (continues on next page)


(continued from previous page) "Sri Ram has the right to have personal views. That's all. However, different " 'considerations will be made if he shows a biased attitude while performing ' 'his duties as a senior prosecutor. His personal view cannot be considered to ' 'prevent his responsibilities as a senior prosecutor." Furthermore, the ' 'incident, as presented by the respondent, happened when before the ' 'appointment of Sri Ram as a senior prosecutor. Also important is that the ' "applicant did not make any complaint about Sri Ram's actions while " 'conducting a trial involving the applicant. This strengthened the ' "respondents' argument that Sri Ram was open while carrying out his duties as " 'a senior prosecutor, "said the judge. Mohamed Zaini further said that in the ' 'end, the court was responsible for ensuring that a trial was implemented ' 'fairly for justice." The court will help any party treated unfairly, if the ' "matter happened. The relationship, the applicant's application was rejected, " '"he said. The trial of the 1MDB audit report case will continue on Feb 22. ' "At today's proceedings, Deputy Public Prosecutor Ahmad Akram Gharib acted on " 'behalf of the prosecution, while Najib was represented by lawyer Nur ' 'Syahirah Hanapiah. Najib, 67, and former 1MDB Chief Executive Officer Arul ' 'Kanda Kandasamy, 45, were tried on charges of amending the 1MDB audit ' 'report. The Pekan Member of Parliament was accused of using his position to ' 'direct amendments to the 1MDB final audit report before being tabled to the ' 'Public Accounts Committee to prevent any action from being taken against ' 'him, while Arul Kanda was charged with abetting Najib in making amendments ' 'to the report to protect Najib from being prosecuted.']

[23]: pprint(quantized_bigbird_small.greedy_decoder([string2])) ['KUALA LUMPUR: Datuk Seri Najib Tun Razak today failed in another attempt to ' 'drop former Federal Court Judge Datuk Seri Gopal Sri Ram from leading the ' 'prosecution team in the case of amending the 1Malaysia Development Berhad ' "(1MDB) audit report involving the former prime minister. This is Najib's " 'third attempt to drop Sri Ram as prosecutor in the 1MDB-related criminal ' 'case. Earlier, an application was filed in another 1MDB case before a ' 'different judge and a second attempt through civil proceedings. While ' 'rejecting the application of the Pekan Member of Parliament, High Court ' "Judge Mohamed Zaini Mazlan said Najib's allegation that Sri Ram was involved " 'in the investigation into him as there is no merit. "There is no solid ' "evidence to support the applicant's allegation and remain a mere hypothesis. " 'This issue has been investigated by the prosecution as the previous ' 'respondent during the application of the applicant (Najib)." This issue has ' 'been discussed and decided. The decision by other courts previously remained ' 'and could not be reversed, "said the judge. Mohamed Zaini when giving ' 'reasons for rejection said applicants, among others, made communication ' 'between former Attorney General Tan Sri Mohamed Apandi Ali and Sri Ram as ' "evidence of Sri Ram's biased side of the applicant and the applicant's " 'concern regarding the matter was baseless. "As well as other individuals, ' 'Sri Ram has the right to have a personal view. That alone. However, ' 'different considerations will be made if he shows a biased attitude while ' 'performing his duties as a senior prosecutor. His personal view cannot be ' 'considered to prevent his responsibility as a senior prosecutor." ' 'Furthermore, the incident, as presented by the respondent, happened when ' 'before the appointment of Sri Ram as a senior prosecutor. Also important is ' "that the applicant did not make any complaint about Sri Ram's actions while " "conducting trial involving the applicant. This strengthened the respondents' " 'argument that Sri Ram was open while carrying out his duties as a senior ' 'prosecutor, "said the judge. Mohamed Zaini further said that in the end, the ' 'court was responsible for ensuring that a trial was implemented fairly for ' 'justice." The court will help any party treated unfairly, if the matter ' (continues on next page)


(continued from previous page) 'happened. The connection, the applicant\'s application was rejected, "he ' 'said. The trial of the 1MDB audit report case will continue on Feb 22. At ' "today's proceedings, Deputy Public Prosecutor Ahmad Akram Gharib acted on " 'behalf of the prosecution, while Najib was represented by lawyer Nur ' 'Syahirah Hanapiah. Najib, 67, and former 1MDB Chief Executive Officer Arul ' 'Kanda Kandasamy, 45, were tried on charges of amending the 1MDB audit ' 'report. The Pekan Member of Parliament was accused of using his position to ' 'direct amendments to the 1MDB final audit report before being tabled to the ' 'Public Accounts Committee to prevent any action from being taken against ' 'him, while Arul Kanda was charged with abetting Najib in making amendments ' 'to the report to protect Najib from being prosecuted.']

[24]: pprint(bigbird.greedy_decoder([string2])) ['KUALA LUMPUR: Datuk Seri Najib Tun Razak today failed in another attempt to ' 'drop former Federal Court Judge Datuk Seri Gopal Sri Ram from leading the ' 'prosecution team in the case of amending the 1Malaysia Development Berhad ' "(1MDB) audit report involving the former prime minister. This is Najib's " 'third attempt to drop Sri Ram as a prosecution in the 1MDB-related criminal ' 'case. Previously, an application was filed in another 1MDB case before a ' 'different judge and a second attempt through civil proceedings. While ' "rejecting the Pekan Member of Parliament's application, High Court Judge " "Mohamed Zaini Mazlan said Najib's allegation that Sri Ram was involved in " 'his investigation into him as no merit. "There is no solid evidence to ' "support the applicant's allegations and remain a mere hypothesis. This issue " 'has been investigated by the prosecution as the respondent during the ' 'application of the applicant (Najib) earlier." This issue has been discussed ' 'and decided. The decision by other courts previously remains and cannot be ' 'reversed, "said the judge. Mohamed Zaini when giving the reason for refusing ' 'said the applicant, among others, made communication between former ' 'Attorney-General Tan Sri Mohamed Apandi Ali and Sri Ram as proof of Sri ' "Ram's bias towards the applicant and the applicant's concern over the matter " 'was baseless. "Like other individuals, Sri Ram has the right to have a ' "personal view. That's all. However, different considerations will be made if " 'he shows a biased attitude while performing his duties as a senior ' 'prosecutor. His personal views should not be considered to hinder his ' 'responsibility as a senior prosecutor. "Furthermore, the incident, as ' 'submitted by the respondent, occurred before the appointment of Sri Ram as ' 'senior prosecutor. It is also important that the applicant did not make any ' "complaints about Sri Ram's actions during the trial involving the applicant. " "This strengthens the respondents' argument that Sri Ram was open while " 'carrying out his duties as senior prosecutor," said the judge. Mohamed Zaini ' 'further said that in the end, the court was responsible for ensuring that a ' 'trial was held fairly in order to obtain justice. "The court will assist any ' 'party treated unfairly, if the matter happens. In relation to that, the ' 'applicant\'s application is rejected," he said. The trial of the 1MDB audit ' "report case will continue on Feb 22. In today's proceedings, Deputy Public " 'Prosecutor Ahmad Akram Gharib acted on behalf of the prosecution, while ' 'Najib was represented by lawyer Nur Syahirah Hanapiah. Najib, 67, and former ' '1MDB Chief Executive Officer Arul Kanda Kandasamy, 45, were tried on charges ' 'of amending the 1MDB audit report. The Pekan Member of Parliament was ' 'accused of using his position to order amendments to the 1MDB final audit ' 'report before being tabled to the Public Accounts Committee to prevent any ' 'action being taken against him, while Arul Kanda was accused of abetting ' 'Najib in amending the report to protect Najib from being prosecuted.']


[25]: pprint(quantized_bigbird.greedy_decoder([string2])) ['KUALA LUMPUR: Datuk Seri Najib Tun Razak today failed in another attempt to ' 'drop former Federal Court Judge Datuk Seri Gopal Sri Ram from leading the ' 'prosecution team in the case of amending the 1Malaysia Development Berhad ' "(1MDB) audit report involving the former prime minister. This is Najib's " 'third attempt to drop Sri Ram as a prosecution in the 1MDB-related criminal ' 'case. Previously, an application was filed in another 1MDB case before a ' 'different judge and a second attempt through civil proceedings. While ' "rejecting the Pekan Member of Parliament's application, High Court Judge " "Mohamed Zaini Mazlan said Najib's allegation that Sri Ram was involved in an " 'investigation into him as no merit. "There is no solid evidence to support ' "the applicant's allegations and remain a mere hypothesis. This issue has " 'been investigated by the prosecution as the respondent during the applicant ' '(Najib) application." This issue has been discussed and decided. The ' 'decision by other courts has previously remained and cannot be reversed, ' '"said the judge. Mohamed Zaini when giving the reason for his rejection said ' 'the applicant, among others, made communication between former ' 'Attorney-General Tan Sri Mohamed Apandi Ali and Sri Ram as proof of Sri ' "Ram's bias towards the applicant and the applicant's concern over the matter " 'was baseless. "Like other individuals, Sri Ram has the right to have a ' "personal view. That's all. However, different considerations will be made if " 'he shows a biased attitude while performing his duties as a senior ' 'prosecutor. His personal views should not be considered to hinder his ' 'responsibility as a senior prosecutor. "Furthermore, the incident, as ' 'submitted by the respondent, occurred before the appointment of Sri Ram as ' 'senior prosecutor. It is also important that the applicant did not make any ' "complaints about Sri Ram's actions during the trial involving the applicant. " "This strengthens the respondents' argument that Sri Ram is open while " 'carrying out his duties as senior prosecutor," said the judge. Mohamed Zaini ' 'further said that in the end, the court was responsible for ensuring that a ' 'trial was held fairly in order to obtain justice. "The court will assist any ' 'party treated unfairly, if the matter happens. In relation to that, the ' 'applicant\'s application is rejected," he said. The trial of the 1MDB audit ' "report case will continue on Feb 22. At today's proceedings, Deputy Public " 'Prosecutor Ahmad Akram Gharib acted on behalf of the prosecution, while ' 'Najib was represented by lawyer Nur Syahirah Hanapiah. Najib, 67, and former ' '1MDB Chief Executive Officer Arul Kanda Kandasamy, 45, were tried on charges ' 'of amending the 1MDB audit report. The Pekan Member of Parliament was ' 'accused of using his position to order amendments to the 1MDB final audit ' 'report before being tabled to the Public Accounts Committee to prevent any ' 'action being taken against him, while Arul Kanda was charged with abetting ' "Najib in amending Najib's report to protect Najib from being prosecuted."]

[21]: pprint(bigbird_small.greedy_decoder([string3])) ['3 August 2018 - The latest study conducted by University College London ' '(UCL) found that mental healthcare services from support workers who have ' 'been facing the disease themselves may help reduce the probability of ' 'patients who have just left out of acute mental health care from ' 're-introduced to the unit. The research paper published today in the journal ' 'The Lancet found that the number of mental patients re-induced into acute ' 'care units within a year was 24% much lower for the group of patients ' 'offered by support workers in the form of senasib friends, compared to ' 'groups of patients who are only given personal rehabilitation work books. ' '"The pain allowed to go out (discharge) from community crisis services is ' 'often re-entered into acute care units. This not only helps recovery, but ' 'also uses resources that should be reserved for long-term improvement of ' (continues on next page)


(continued from previous page) "patients' functions and quality of life. Support workers from the group of " 'senasib friends are able to provide truly friendly and empatth support and ' 'encouragement because they come from their own personal experience, in ' 'addition to being a good example of role model for patient rehabilitation, ' '"said the main author, Professor Sonia Johnson (Psychiiatri, UCL). In the ' 'United Kingdom (UK), more than half of the number of patients included in ' 'the acute care unit is re-entered into the unit within a year, but there is ' 'no solid evidence to explain the method to reduce this number. In fact, ' 'support from individuals who have experienced mental health problems has ' 'been practiced in programs such as the Implementation of Rehabilitation ' 'through NHS Organizational Change (UK) as well as the United States Health ' 'Rehabilitation Action Plan. This research is the first random controlled ' 'test to evaluate the effectiveness of senasib partner support programs. ' 'However, there are still many research that needs to be carried out before ' 'this strategy can be implemented comprehensively in the UK, for example to ' 'understand the reasons why this program is effective. Certainly, ' 'intervention or intervention in self-management (self-management) may help ' 'patients manage their mental health better. In this study, researchers ' 'provide participants either with personal rehabilitation work books only ' '(220 people) or working books with help with senasib partner support workers ' '(221 people). Participants are also allowed to continue their regular care. ' 'This study was conducted through six mental health crisis solution teams in ' 'England and participants were selected only after they were allowed to leave ' 'the crisis unit by crisis resolution team. Study participants consisted of ' 'various diagnosiss including schizophrenia, bipolar, psychosis, depression, ' 'anxiety confusion, post-trauma stress disorders (PTSD), and personality ' 'confusion. Participants who received support from senasib friends were ' 'offered 10 weekly meeting sessions for an hour. During this meeting session, ' "support workers heard participants' problems and aim to instill hope by " 'sharing the skills and strategies of disease management (coping strategies) ' 'which were mastered when they themselves were in the process of recovery. ' 'All support workers are given prior training in listening skills, cultural ' 'awareness, self-disclosure, and confidentiality, as well as how to use work ' "books. Participants' health records were monitored by researchers for one " 'year to determine whether participants were re-entered to acute care units ' '(such as acute patient wards, crisis solutions, crisis homes, and acute ' 'daily care services) or not. After completing a year, the study found that ' 'the percentage of re-entry into acute care units is lower for the group of ' 'participants who receive intervention compared to the control group - with ' '29% of participants re-in-reduced after receiving support from senasib ' 'friends, compared to 38% participants who only received working books. The ' 'intervention absorption rate also well - 72% of participants offered the ' 'services of senasib support employees and working books attended at least ' 'three meeting sessions, while one third attended all 10 meeting sessions. ' '"We are aware that many users of mental health crisis care services felt ' 'this service stopped sharply due to lack of ready-up care. 
Our study shows ' 'that senasib partner support workers can help fill this gap by helping ' 'service users (accuraration) to develop self-management strategies and ' 'recovery plans in their own way and more meaningful, in turn can help ' 'patients continue to survive after a crisis, "said the joint author, Dr ' 'Brynmor Lloyd-Evans (Psikiatrialt) to help in turn can help patients ' 'continue to survive after a crisis after a crisis after a crisis," said ' 'after a crisis," said. together," said that. (Piatri, (s after receiving job ' 'management services of the services, Dr imized service users with at least ' 'three-inforceditionary support services (Philly, and more meaningfully and ' 'more meaningfully, Dr (Planties, in turn help patients with a crisis, in ' 'turn can help patients continue to help']


[22]: pprint(quantized_bigbird_small.greedy_decoder([string3])) ['3 August 2018 - The latest study conducted by University College London ' '(UCL) found that mental healthcare services from support workers who have ' 'been going through the disease themselves may help reduce the probability of ' 'patients who have just left from acute mental health care from re-increased ' 'to the unit. The research paper published today in the journal The Lancet ' 'found that the number of mental patients re-induced into acute care units ' 'within a year is 24% much lower for the group of patients offered by support ' 'workers in the form of dustry partners, compared to groups of patients who ' 'are only given personal rehabilitation work books. "The pain allowed to go ' 'out (discharge) from community crisis services is often re-entered to acute ' 'care units. This not only hinders recovery, but also uses resources that ' "should be reserved for long-term improvement of patients' functions and " 'quality of life. Support workers from the group of senasib friends are able ' 'to provide truly friendly and empath support and encouragement because they ' 'come from their own personal experience, in addition to being a good example ' 'of role model for patient rehabilitation, "said the main author, Professor ' 'Sonia Johnson (Psychiiatri, UCL). In the United Kingdom (UK), more than half ' 'of the number of patients included in the acute care unit is re-entered into ' 'the unit within a year, but there is no solid evidence to explain the method ' 'to reduce this number. In fact, support from individuals who have ' 'experienced mental health problems has been practiced in programs such as ' 'the Implementation of Rehabilitation through NHS Organizational Change (UK) ' 'as well as the United States Health Recovery Action Plan. This research is ' 'the first random controlled test to evaluate the effectiveness of senasib ' 'support programs. Even so, there are still many research that needs to be ' 'carried out before this strategy can be implemented comprehensively in the ' 'UK, for example to understand the reasons why this program is effective. ' 'Certainly, intervention or intervention in self-management may help patients ' 'manage their mental health better. In this study, researchers provide ' 'participants either with personal rehabilitation work books (220 people) or ' 'working books with help from senasib partner support workers (221 people). ' 'Participants are also allowed to continue their regular care. This study was ' 'conducted through six mental health crisis solution teams in England and ' 'participants were selected only after they were allowed to leave the crisis ' 'unit by crisis resolution team. Study participants consisted of various ' 'diagnosiss including schizophrenia, bipolar, psychosis, depression, anxiety ' 'confusion, post-trauma stress disorders (PTSD), and personality confusion. ' 'Participants who received support from senasib friends were offered 10 ' 'weekly meeting sessions for an hour. During this meeting session, support ' "workers heard participants' problems and aim to instill hope by sharing the " 'skills and strategies of disease management when they themselves were in the ' 'process of recovery. All support workers are given prior training in ' 'listening skills, cultural awareness, self-disclosure, and confidentiality, ' "as well as how to use work books. 
Participants' health records were " 'monitored by researchers for one year to determine whether participants were ' 're-entered to acute care units (such as acute patient wards, crisis ' 'solutions, crisis homes, and acute daily care services) or not. After the ' 'end of one year, the study found that the percentage of re-entry to acute ' 'care units is lower for the group of participants who receive intervention ' 'compared to the control group - with 29% of participants re-increased after ' 'receiving the support of the senasib partner, compared to 38% participants ' 'who only received work books. The intervention absorption rate also - 72% of ' 'participants offered the services of senasib support employees and working ' 'books attended at least three meeting sessions, while one third attended all ' '10 meeting sessions. "We are aware that many users of mental health crisis ' 'care services feel that this service stopped dramatically due to lack of ' (continues on next page)


(continued from previous page) 'follow-up care ready. Our study shows that senasib peer support workers can ' 'help fill this gap by helping service users (accuratic) develop ' 'self-management strategies and recovery plans in their own way and more ' 'meaningfully, in turn can help patients continue to survive after a crisis, ' '"said the joint author, Dr Brynmor Lloyd-Evans (Pcyclist, UCL).']

[26]: pprint(bigbird.greedy_decoder([string3])) ['3 August 2018 - A recent study conducted by University College London (UCL) ' 'found that mental healthcare services from support workers who have ' 'undergone the disease themselves may help reduce the probability of patients ' 'who have just left acute mental health care (acute mental health care) from ' 'being re-entered to the unit. The research paper published today in the ' 'journal The Lancet found that the number of mental patients re-entered to ' 'acute care units within a year was 24% much lower for the group of patients ' 'offered by support workers in the form of senib friends, compared to the ' 'group of patients who were only given personal rehabilitation workbooks ' 'only. "Patients allowed to leave (discharge) from community crisis services ' 'are often re-entered to acute care units. This not only hinders recovery, ' 'but also uses resources that should be reserved for long-term improvement of ' 'patient function and quality of life. Support workers from the senasib ' 'friends are able to provide support and encouragement that is truly friendly ' 'and empathic because it comes from their own personal experience, in ' 'addition to being a good example of models (role models) for patient ' 'recovery, "said lead author Professor Sonia Johnson (Psychiatric, UCL). In ' 'the United Kingdom (UK), more than half of the total number of patients ' 'admitted to acute care units are re-entered to the unit within a year, but ' 'no solid evidence explains the method to reduce this number.In fact, support ' 'from individuals who have had mental health problems has been practiced in ' 'programs such as Rehabilitation Implementation through NHS Organization ' 'Change (UK) and also the United States Health Rehabilitation Action ' 'Plan.This research is the first such randomized test to evaluate the ' 'effectiveness of senmitarian support programs.However, there is still a lot ' 'of research that needs to be done before this strategy can be implemented ' 'comprehensively in the UK, for example to understand the reasons why the ' 'program is effective. Certainly, intervention or intervention in ' 'self-management may help patients manage their mental health better.In this ' 'study, the researcher provided participants either with personal ' 'rehabilitation workbooks only (220 people) or workbooks with the help of ' 'support employees of senasib friends (221 people). Participants are also ' 'allowed to continue their normal care. This study was conducted through six ' 'teams to resolve the mental health crisis (crisis resolution team) in ' 'England and participants were selected only after they were allowed to leave ' 'the crisis unit by the crisis-solving team. Study participants consisted of ' 'various diagnosiss including schizophrenia, bipolar, psychosis, depression, ' 'anxiety disorder, post-trauma stress disorder (PTSD), and personality ' 'disorder. Participants who received the support of senmiline friends were ' 'offered 10 weekly meeting sessions for an hour. During this meeting session, ' "support workers listened to participants' problems and aimed at instilling " 'hope by sharing skills and strategies of disease management (coping ' 'strategies) controlled while they themselves were in the process of ' 'recovery. All support workers are given prior training in listening skills, ' 'cultural awareness, self-disclosure, and confidentiality, as well as how to ' "use workbooks. 
Participants' health records are monitored by researchers for " 'a year to determine whether participants are re-entered into acute care ' 'units (such as acute patient wards, crisis-solving teams, crisis houses, ' 'crisis houses, and acute daily care services) or not. After a year, the ' (continues on next page)


(continued from previous page) 'study found that the percentage of re-entry to acute care units was lower ' 'for groups of participants who received the intervention compared to the ' 'control group - with 29% of participants re-entered after receiving support ' 'from their peers, compared to 38% of participants who only received ' 'workbooks. The intervention absorption rate is also good - 72% of ' 'participants offered the services of support partner support workers and ' 'workbooks attended at least three meeting sessions, while one-third attended ' 'all 10 meeting sessions. "We are aware that many users of mental health ' 'crisis care services feel this service ceases sharply due to lack of ' 'ready-made follow-up care. Our study shows that support employees of senied ' 'partners can help fill out this gap by helping service users (pesakit) ' 'develop self-management strategies and recovery and more meaningfully, ' 'further helping patients survive after a crisis, " said the co-winnary ' 'author, " said the co-star, Dr Brynmorynmor Lloyd-E).']

[27]: pprint(quantized_bigbird.greedy_decoder([string3]))
['August 3, 2018 - A recent study conducted by University College London (UCL) '
 'found that mental healthcare services from support workers who have '
 'undergone the disease themselves may help reduce the probability of patients '
 'who have just left acute mental health care (acute mental health care) from '
 'being re-entered to the unit. The research paper published today in the '
 'journal The Lancet found that the number of mental patients re-entered into '
 'acute care units within a year was 24% much lower for the group of patients '
 'offered by support workers in the form of senib partners, compared to the '
 'group of patients who were only given personal rehabilitation workbooks '
 'only. "Puncharge allowed out of community crisis services are often '
 're-entered to acute care units. This not only hinders recovery, but also '
 'uses resources that should be reserved for long-term improvement of patient '
 'function and quality of life. Support workers from the senasib partners are '
 'able to provide support and encouragement that is truly friendly and '
 'empathistic because it comes from their own personal experience, in addition '
 'to being a good example of a good role model for patient recovery, "said '
 'lead author Professor Sonia Johnson (Psychiatrist, UCL). In the United '
 'Kingdom (UK), more than half of the total number of patients admitted to '
 'acute care units are re-entered to the unit within a year, but no solid '
 'evidence explains the method to reduce this amount. In fact, support from '
 'individuals who have had mental health problems has been practiced in '
 'programs such as Rehabilitation Implementation through NHS Organization '
 'Change (UK) and also the United States Health Rehabilitation Action '
 'Plan. This research is the first such random controlled test to evaluate the '
 'effectiveness of the senasib support program. However, there is still a lot '
 'of research that needs to be done before this strategy can be implemented '
 'comprehensively in the UK, for example to understand the reasons why the '
 'program is effective. Certainly, intervention or intervention in '
 'self-management may help patients manage their mental health better. In this '
 'study, the researcher provided participants either with personal '
 'rehabilitation workbooks only (220 people) or workbooks with the help of '
 'support staff of senasib friends (221 people). Participants are also allowed '
 'to continue their normal care. This study was conducted through six teams '
 'solving the mental health crisis (crisis resolution team) in England and '
 'participants were selected only after they were allowed to leave the crisis '
 'unit by the crisis-solving team. Study participants consisted of various '
 'diagnosis including schizophrenia, bipolar, psychosis, depression, anxiety '
 'disorder, post-trauma stress disorder (PTSD), and personality disorder. '
 'Participants who received the support of senasib friends were offered 10 '
 'weekly meeting sessions for an hour. During this meeting session, support '
 "workers listened to participants' problems and aimed at instilling hope by "
 'sharing skills and strategies of disease management (coping strategies) '
 'controlled while they were in the process of recovery. All support workers '
 'are given prior training in listening skills, cultural awareness, '
 'self-disclosure, and confidentiality, as well as how to use workbooks. '
 "Participants' health records are monitored by researchers for a year to "
 'determine whether participants are re-entered into acute care units (such as '
 'acute patient wards, crisis-solving teams, crisis houses, crisis houses, and '
 'acute daily care services) or not. After a year, the study found that the '
 'percentage of re-entry to acute care units is lower for groups of '
 'participants who receive interventions compared to control groups - with 29% '
 'of participants re-entered after receiving support from their peers, '
 'compared to 38% of participants who only received workbooks. The '
 'intervention absorption rate is also good - 72% of participants offered the '
 'services of support partner support and workbooks attend at least three '
 'meeting sessions, while one-thirds attend all 10 meeting sessions. "We are '
 'aware that many users of mental health crisis care services feel this '
 'service is abruptly due to lack of ready-made follow-up care. Our study '
 'shows that support staff support workers can help fill this gap by helping '
 'service users (pesakit) develop self-management strategies and recovery '
 'plans in their own way, further helping patients survive after a crisis," '
 'said the occurrence, " said the co-started author Dr Bryn of the author, Dr '
 'Brynousa.']


9.54 SQUAD

This tutorial is available as an IPython notebook at Malaya/example/qa-squad.

This module is only trained on standard language structure, so it is not safe to use for local language structure.

[1]: %%time

     import malaya
     from pprint import pprint
CPU times: user 5.14 s, sys: 1 s, total: 6.14 s
Wall time: 7.01 s


9.54.1 What is SQUAD

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, e.g.,

{
    'title': 'Normans',
    'paragraphs': [
        {
            'context': 'Orang Norman (Norman: Nourmands; Perancis: Normands; Latin: Normanni) ialah orang-orang yang pada abad ke-10 dan ke-11 memberikan nama mereka kepada Normandy, sebuah wilayah di Perancis. Mereka diturunkan daripada Norse ("Norman" berasal daripada penyerang "Norseman") dan lanun dari Denmark, Iceland dan Norway yang, di bawah pimpinan mereka Rollo, bersetuju untuk bersumpah fealty kepada Raja Charles III dari Francia Barat. Melalui generasi asimilasi dan percampuran dengan penduduk asli Frankish dan Roman-Gaulish, keturunan mereka akan beransur-ansur bergabung dengan budaya Carolingian yang berpusat di Francia Barat. Identiti budaya dan etnik yang berbeza dari orang Norman muncul pada mulanya pada separuh pertama abad ke-10, dan ia terus berkembang pada abad-abad yang berjaya.',
            'qas': [
                {
                    'question': 'Di negara manakah Normandy berada?',
                    'answers': [
                        {'text': 'Perancis', 'answer_start': 177},
                        {'text': 'Perancis', 'answer_start': 177},
                        {'text': 'Perancis', 'answer_start': 177},
                        {'text': 'Perancis', 'answer_start': 177},
                    ],
                    'id': '56ddde6b9a695914005b9628',
                    'is_impossible': False,
                }
            ],
        }
    ],
}

So we need to give a long paragraph and multiple questions, and the model will return answers based on that paragraph, with start and end spans. Read more about the SQuAD dataset at https://rajpurkar.github.io/SQuAD-explorer/.
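To make the span convention concrete, below is a minimal sketch (toy data, not part of Malaya) showing that answer_start is a character offset into context, which is also how the start / end values returned by the models later in this section should be read:

# toy SQuAD-style record: `answer_start` is a character offset into `context`
context = 'Normandy ialah sebuah wilayah di Perancis.'
answer = {'text': 'Perancis', 'answer_start': context.index('Perancis')}

span = context[answer['answer_start']:answer['answer_start'] + len(answer['text'])]
assert span == answer['text']  # offsets are character-based, not token-based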

9.54.2 List available Transformer models

[2]: malaya.qa.available_transformer_squad()
INFO:root:tested on SQUAD V2 Dev set.
[2]:              Size (MB)  Quantized Size (MB)     exact        f1    total
     tiny-bert         60.9                15.30  53.45758  56.79821  11858.0
     bert             452.0               113.00  57.16810  61.48740  11858.0
     albert            58.1                14.60  58.97284  63.12757  11858.0
     tiny-albert       24.8                 6.35  50.00843  50.00843  11858.0
     xlnet            478.0               120.00  62.74245  66.56101  11858.0
     alxlnet           58.4                15.60  61.97503  65.89765  11858.0


9.54.3 Load Transformer model

[3]: xlnet_model = malaya.qa.transformer_squad(model='xlnet')
     albert_model = malaya.qa.transformer_squad(model='albert')

9.54.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from a quantized model, and it is not necessarily faster than the normal 32-bit float model; it depends entirely on the machine.

[4]: quantized_xlnet_model = malaya.qa.transformer_squad(model='xlnet', quantized=True)
     quantized_albert_model = malaya.qa.transformer_squad(model='albert', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:root:Load quantized model will cause accuracy drop.

9.54.5 Copy from Wikipedia and news

[5]: # https://ms.wikipedia.org/wiki/Mohd_Najib_bin_Abdul_Razak

     p_wikipedia = """
     Najib razak telah dipilih untuk Parlimen Malaysia pada tahun 1976,
     pada usia 23 tahun, menggantikan bapanya duduk di kerusi Pekan yang berpangkalan di Pahang.
     Dari tahun 1982 hingga 1986 beliau menjadi Menteri Besar (Ketua Menteri) Pahang,
     sebelum memasuki persekutuan Kabinet Tun Dr Mahathir Mohamad pada tahun 1986 sebagai Menteri Kebudayaan, Belia dan Sukan.
     Beliau telah berkhidmat dalam pelbagai jawatan Kabinet sepanjang baki tahun 1980-an dan 1990-an,
     termasuk sebagai Menteri Pertahanan dan Menteri Pelajaran.
     Beliau menjadi Timbalan Perdana Menteri pada 7 Januari 2004,
     berkhidmat di bawah Perdana Menteri Tun Dato' Seri Abdullah Ahmad Badawi,
     sebelum menggantikan Badawi setahun selepas Barisan Nasional mengalami kerugian besar dalam pilihan raya 2008.
     Di bawah kepimpinan beliau, Barisan Nasional memenangi pilihan raya 2013,
     walaupun buat kali pertama dalam sejarah Malaysia pembangkang memenangi majoriti undi popular.
     """
     q_wikipedia = ['Siapakah Menteri Besar Pahang',
                    'Apakah jawatan yang pernah dipegang oleh Najib Razak']

[6]: # https://www.malaysiakini.com/news/574914

     p_news = """
     Bekas perdana menteri Najib Razak mempersoalkan tindakan polis yang menurutnya tidak
     serta-merta mengeluarkan kenyataan berhubung dakwaan Adun Perikatan Nasional (PN) "merancang" insiden rogol.
     Sedangkan, kata ahli parlimen Pekan itu, polis pantas mengeluarkan kenyataan apabila
     dia dilapor terlupa mengimbas MySejahtera sebelum masuk restoran.
     "Berita Najib lupa scan MySejahtera tular, kenyataan polis terus keluar.
     Berita Dr Mahathir Mohamad lupa scan, kenyataan, polis serta-merta keluar.
     "Sebab itu saya pelik kenapa pihak polis belum sempat keluar apa-apa kenyataan
     berhubung kes seorang gadis membuat laporan polis untuk dakwa Adun PN rancang
     insiden rogolnya," katanya di Facebook hari ini.
     Najib merujuk dakwaan seorang wanita yang mendakwa dirogol kenalan kepada Adun Gombak Setia, Hilman Idham.
     Wanita itu mendakwa ahli politik dari Bersatu berkenaan merancang insiden yang berlaku pada 5 Dis lalu.
     Menurut laporan polis pada 8 Mei, mangsa mendakwa kejadian itu berlaku di sebuah
     hotel di Selangor, yang pada masa itu berada di bawah perintah kawalan pergerakan bersyarat (PKPB).
     """

     q_news = ['siapakah yang mempersoalkan tindakan polis', 'siapakah Adun Gombak Setia']

9.54.6 Predict

def predict(
    self,
    paragraph_text: str,
    question_texts: List[str],
    doc_stride: int = 128,
    max_query_length: int = 64,
    max_answer_length: int = 64,
    n_best_size: int = 20,
):
    """
    Predict span from questions given a paragraph.

    Parameters
    ----------
    paragraph_text: str
    question_texts: List[str]
        List of questions; results really depend on case-sensitive questions.
    doc_stride: int, optional (default=128)
        striding size to split a paragraph into multiple texts.
    max_query_length: int, optional (default=64)
        Maximum length of question tokens.
    max_answer_length: int, optional (default=64)
        Maximum length of answer tokens.

    Returns
    -------
    result: List[{'text': 'text', 'start': 0, 'end': 1}]
    """
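An illustrative sketch of what doc_stride does conceptually: a long paragraph is split into overlapping windows so every answer span falls fully inside at least one window. This mirrors the usual BERT-style sliding-window trick and is an assumption for illustration, not Malaya's exact internals:

def sliding_windows(tokens, window=8, doc_stride=4):
    # each window starts `doc_stride` tokens after the previous one,
    # so neighbouring windows overlap by `window - doc_stride` tokens
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += doc_stride
    return windows

print(sliding_windows(list(range(14)), window=8, doc_stride=4))
# [[0, ..., 7], [4, ..., 11], [8, ..., 13]]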

[7]: xlnet_model.predict(p_wikipedia, q_wikipedia)
[7]: [{'text': 'Najib razak', 'start': 0, 'end': 11},
     {'text': 'Pekan yang berpangkalan di Pahang', 'start': 123, 'end': 157}]

[9]: albert_model.predict(p_wikipedia, q_wikipedia)
[9]: [{'text': 'Najib razak', 'start': 0, 'end': 11},
     {'text': 'Menteri Pertahanan dan Menteri Pelajaran', 'start': 475, 'end': 516}]


[10]: quantized_xlnet_model.predict(p_wikipedia, q_wikipedia)
[10]: [{'text': 'Najib razak', 'start': 0, 'end': 11},
      {'text': 'Pekan yang berpangkalan di Pahang', 'start': 123, 'end': 157}]

[11]: quantized_albert_model.predict(p_wikipedia, q_wikipedia)
[11]: [{'text': 'Najib razak', 'start': 0, 'end': 11},
      {'text': 'Menteri Pertahanan dan Menteri Pelajaran', 'start': 475, 'end': 516}]

[12]: xlnet_model.predict(p_news, q_news)
[12]: [{'text': 'Bekas perdana menteri Najib Razak', 'start': 0, 'end': 33},
      {'text': 'Hilman Idham', 'start': 791, 'end': 804}]

[13]: albert_model.predict(p_news, q_news)
[13]: [{'text': 'Bekas perdana menteri Najib Razak', 'start': 0, 'end': 33},
      {'text': 'Hilman Idham', 'start': 791, 'end': 804}]

[14]: quantized_xlnet_model.predict(p_news, q_news)
[14]: [{'text': 'Bekas perdana menteri Najib Razak', 'start': 0, 'end': 33},
      {'text': 'Hilman Idham', 'start': 791, 'end': 804}]

[15]: quantized_albert_model.predict(p_news, q_news)
[15]: [{'text': 'Bekas perdana menteri Najib Razak', 'start': 0, 'end': 33},
      {'text': 'Hilman Idham', 'start': 791, 'end': 804}]
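Since a quantized model is not guaranteed to be faster (as noted above), a rough single-run timing sketch like the one below can check the trade-off on your own machine; repeat and average for anything rigorous:

import time

for name, m in [('float32 xlnet', xlnet_model),
                ('int8 xlnet', quantized_xlnet_model)]:
    start = time.time()
    m.predict(p_news, q_news)
    print(name, '%.2f seconds' % (time.time() - start))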

9.54.7 Vectorize

Let's say you want to visualize sentence / word level in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    vectorize list of strings.

    Parameters
    ----------
    strings: List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result: np.array
    """
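As a toy illustration of what these methods mean (numpy only; this is not Malaya's code, and the hidden size is just an assumption for the example):

import numpy as np

hidden = np.random.rand(12, 768)    # pretend: 12 tokens, 768-dim hidden states
v_first = hidden[0]                 # method='first' (first token, CLS-style)
v_last = hidden[-1]                 # method='last'  (last token)
v_mean = hidden.mean(axis=0)        # method='mean'  (average over all tokens)
# method='word' returns one vector per token instead of a single vector per
# string, which is what the "Word level" example below relies on.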


Sentence level

[25]: strings = ['Siapakah Menteri Besar Pahang',
                 'Apakah jawatan yang pernah dipegang oleh Najib Razak',
                 'Najib razak',
                 'Menteri Pertahanan dan Menteri Pelajaran',
                 'Bekas perdana menteri Najib Razak',
                 'Hilman Idham',
                 'Tun Dr Mahathir Mohamad pada tahun 1986',
                 'Berita Najib lupa scan MySejahtera tular, kenyataan polis terus keluar']

[26]: r = quantized_xlnet_model.vectorize(strings, method='first')

[27]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(r)
      tsne.shape
[27]: (8, 2)

[28]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = strings
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )


Word level

[33]: r = quantized_xlnet_model.vectorize(strings, method='word')

[34]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[35]: tsne = TSNE().fit_transform(y)
      tsne.shape
[35]: (43, 2)

[36]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

9.55 Classification

This tutorial is available as an IPython notebook at Malaya/example/zeroshot-classification.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 4.86 s, sys: 709 ms, total: 5.57 s
Wall time: 5 s
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))


9.55.1 What is zero-shot classification

Commonly we supervise a machine learning model on specific labels: negative / positive for sentiment, anger / happy / sadness for emotion, and so on. The model cannot give an output for a label it has never seen; we cannot ask how much 'jealous' is in a text when the emotion model's supported labels are only {anger, happy, sadness}. Imagine trying to identify a text without ever having seen one 'jealous' label before: impossible. Zero-shot learning tries to solve this problem. It refers to the process by which a machine learns to recognize objects (image, text, any features) without any labeled training data for those classes. Yin et al. (2019) stated in their paper that any pretrained language model finetuned on text similarity can act as an out-of-the-box zero-shot text classifier. So, we are going to use transformer models from malaya.similarity.transformer with a few tweaks.
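To make the idea concrete, here is a minimal conceptual sketch of similarity-based zero-shot classification; the similarity function below is hypothetical and simplified, and the exact tweak Malaya applies to its similarity models may differ:

def zero_shot_sketch(text, labels, similarity):
    # `similarity` is a hypothetical callable returning P(text matches label);
    # every label is scored independently against the same text, so any label
    # string can be supplied at inference time and the probabilities
    # need not sum to 1.
    return {label: similarity(text, label) for label in labels}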

9.55.2 List available Transformer models

[2]: malaya.zero_shot.classification.available_transformer()
[2]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             423.4                111.0          0.88315       0.88656         0.88405
     tiny-bert         56.6                 15.0          0.87210       0.87546         0.87292
     albert            48.3                 12.8          0.87164       0.87146         0.87155
     tiny-albert       21.9                  6.0          0.82234       0.82383         0.82295
     xlnet            448.7                119.0          0.80866       0.76775         0.77112
     alxlnet           49.0                 13.9          0.88756       0.88700         0.88727

We trained on Quora Question Pairs, translated SNLI, and translated MNLI. Make sure you check the accuracy chart here before selecting a model: https://malaya.readthedocs.io/en/latest/Accuracy.html#similarity. You might want to use ALXLNET, a very small size (49MB), but the accuracy is still top notch.

9.55.3 Load transformer model

In this example, I am going to load alxlnet; feel free to use any of the available models above.

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer zero-shot model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya.model.bert.ZeroshotBERT class
    """

[3]: model = malaya.zero_shot.classification.transformer(model='alxlnet')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

9.55.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from a quantized model, and it is not necessarily faster than the normal 32-bit float model; it depends entirely on the machine.

[4]: quantized_model = malaya.zero_shot.classification.transformer(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

predict batch

def predict_proba(self, strings: List[str], labels: List[str]):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]
    labels : List[str]

    Returns
    -------
    list: list of float
    """

Because it is zero-shot, we need to give labels to the model.

[5]: # copy from twitter

     string = 'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai'

[6]: model.predict_proba([string], labels=['najib razak', 'mahathir', 'kerajaan', 'PRU', 'anarki'])
[6]: [{'najib razak': 0.02823881,
      'mahathir': 0.059464306,
      'kerajaan': 0.0032106405,
      'PRU': 0.9422462,
      'anarki': 0.9644167}]

[7]: quantized_model.predict_proba([string], labels=['najib razak', 'mahathir', 'kerajaan', 'PRU', 'anarki'])
[7]: [{'najib razak': 0.004405794,
      'mahathir': 0.015691597,
      'kerajaan': 0.0154573675,
      'PRU': 0.8233098,
      'anarki': 0.34632725}]

Quite good.

[8]: string = 'tolong order foodpanda jab, lapar'

[9]: model.predict_proba([string], labels=['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery'])
[9]: [{'makan': 0.54341537,
      'makanan': 0.9774909,
      'novel': 0.00090197776,
      'buku': 0.00044378178,
      'kerajaan': 0.0028080132,
      'food delivery': 0.8143844}]

The model understood that order foodpanda has a close relationship with makan, makanan and food delivery.

[10]: string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'

[11]: model.predict_proba([string], labels=['makan', 'makanan', 'novel', 'buku', 'kerajaan',
                                            'food delivery', 'kerajaan jahat', 'kerajaan prihatin',
                                            'bantuan rakyat'])
[11]: [{'makan': 0.008046242,
       'makanan': 0.0016310408,
       'novel': 0.00044678123,
       'buku': 0.00071050954,
       'kerajaan': 0.98634493,
       'food delivery': 0.0009665733,
       'kerajaan jahat': 0.1006222,
       'kerajaan prihatin': 0.9954796,
       'bantuan rakyat': 0.35266426}]

9.55.5 Vectorize

Let's say you want to visualize sentence / word level in a lower dimension; you can use model.vectorize,

def vectorize(
    self, strings: List[str], labels: List[str], method: str = 'first'
):
    """
    vectorize a string.

    Parameters
    ----------
    strings: List[str]
    labels : List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result: np.array
    """

Sentence level

[4]: texts = ['kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan',
              'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai',
              'tolong order foodpanda jab, lapar',
              'Hapuskan vernacular school first, only then we can talk about UiTM']
     labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
               'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat']
     r = quantized_model.vectorize(texts, labels, method='first')

The vectorize method of the zero-shot classification model returns 2 values, (combined, vector).

[5]: r[0][:5]
[5]: [('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan', 'makan'),
     ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan', 'makanan'),
     ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan', 'novel'),
     ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan', 'buku'),
     ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan', 'kerajaan')]

[6]: r[1]
[6]: array([[-0.00587193, -0.7214614 , -0.7524409 , ...,  0.31107777,
              1.022762  ,  0.28308758],
            [ 0.63863456,  0.12698255,  0.67567766, ...,  0.7627216 ,
              0.56795114, -0.37056473],
            [-0.90291303,  0.93581504,  0.05650915, ...,  0.5578094 ,
              1.1304276 ,  0.5470246 ],
            ...,
            [-2.1161728 , -1.4592253 ,  0.5284856 , ...,  0.28636536,
             -0.36558965, -0.8226106 ],
            [-2.2050292 , -0.14624506,  0.19812807, ...,  0.1307496 ,
             -0.20792441,  0.18430969],
            [-2.5969799 ,  0.4205628 ,  0.18376699, ...,  0.124988  ,
             -0.9915105 , -0.10085672]], dtype=float32)

[7]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(r[1])
     tsne.shape
[7]: (36, 2)
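The 36 rows are expected: each of the 4 texts is paired with each of the 9 labels, and 4 × 9 = 36, so every row is the vector of one (text, label) pair.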

[8]: unique_labels = list(set([i[1] for i in r[0]]))
     palette = plt.cm.get_cmap('hsv', len(unique_labels))

[9]: plt.figure(figsize=(7, 7))

     for label in unique_labels:
         indices = [i for i in range(len(r[0])) if r[0][i][1] == label]
         plt.scatter(tsne[indices, 0], tsne[indices, 1],
                     cmap=palette(unique_labels.index(label)), label=label)

     labels = [i[0] for i in r[0]]
     for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
         label = (
             '%s, %.3f' % (label[0], label[1])
             if isinstance(label, list)
             else label
         )
         plt.annotate(
             label,
             xy=(x, y),
             xytext=(0, 0),
             textcoords='offset points',
         )
     plt.legend()


[9]:

Word level

[28]: texts = ['kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan',
               'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai',
               'tolong order foodpanda jab, lapar',
               'Hapuskan vernacular school first, only then we can talk about UiTM']
      labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat']
      r = quantized_model.vectorize(texts, labels, method='word')

[29]: x, y, labels = [], [], []
      for no, row in enumerate(r[1]):
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])
          labels.extend([r[0][no][1]] * len(row))

[30]: tsne = TSNE().fit_transform(y)
      tsne.shape
[30]: (315, 2)

[31]: unique_labels = list(set(labels))
      palette = plt.cm.get_cmap('hsv', len(unique_labels))

[32]: plt.figure(figsize=(7, 7))

      for label in unique_labels:
          indices = [i for i in range(len(labels)) if labels[i] == label]
          plt.scatter(tsne[indices, 0], tsne[indices, 1],
                      cmap=palette(unique_labels.index(label)), label=label)

      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )
      plt.legend()
[32]:


9.55.6 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html. If you want to stack zero-shot classification models, you need to pass labels using a keyword parameter,

malaya.stack.predict_stack([model1, model2], List[str], labels=List[str])

We will pass labels as **kwargs.
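As a rough mental model, stacking scores each string with every model and then combines the per-label probabilities; the sketch below assumes a geometric mean as the combining function, which may differ from what malaya.stack.predict_stack actually uses:

import numpy as np

def predict_stack_sketch(models, strings, labels):
    # one list of per-string label->probability dicts per model
    all_probs = [m.predict_proba(strings, labels=labels) for m in models]
    stacked = []
    for i in range(len(strings)):
        combined = {
            label: float(np.exp(np.mean([np.log(p[i][label]) for p in all_probs])))
            for label in labels
        }
        stacked.append(combined)
    return stacked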

[10]: alxlnet = malaya.zero_shot.classification.transformer(model='alxlnet')
      albert = malaya.zero_shot.classification.transformer(model='albert')
      tiny_bert = malaya.zero_shot.classification.transformer(model='tiny-bert')
WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:loading sentence piece model

[11]: string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'
      labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat']
      malaya.stack.predict_stack([alxlnet, albert, tiny_bert], [string], labels=labels)
[11]: [{'makan': 0.0044827852,
       'makanan': 0.0027062024,
       'novel': 0.0020867025,
       'buku': 0.013082165,
       'kerajaan': 0.8859287,
       'food delivery': 0.0028363755,
       'kerajaan jahat': 0.018133936,
       'kerajaan prihatin': 0.9922408,
       'bantuan rakyat': 0.909674}]


9.56 Topic Modeling

This tutorial is available as an IPython notebook at Malaya/example/topic-modeling.

[1]: from IPython.core.display import display, HTML
     display(HTML(""))

[2]: import pandas as pd
     import malaya
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))


[3]: df = pd.read_csv('tests/02032018.csv', sep=';')
     df = df.iloc[3:, 1:]
     df.columns = ['text', 'label']
     corpus = df.text.tolist()

You can get this file from Malaya/tests. This csv is already stemmed.

9.56.1 Load Transformer

We can use a Transformer model to build topic modeling for the corpus we have; the power of attention!

def attention(
    corpus: List[str],
    n_topics: int,
    vectorizer,
    cleaning = simple_textcleaning,
    stopwords = get_stopwords,
    ngram: Tuple[int, int] = (1, 3),
    batch_size: int = 10,
):
    """
    Use attention from a transformer model to do topic modelling based on corpus / list of strings given.

    Parameters
    ----------
    corpus: list
    n_topics: int, (default=10)
        size of decomposition column.
    vectorizer: object
    cleaning: function, (default=malaya.text.function.simple_textcleaning)
        function to clean the corpus.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]
    ngram: tuple, (default=(1,3))
        n-grams size to train a corpus.
    batch_size: int, (default=10)
        size of strings for each vectorization and attention.

    Returns
    -------
    result: malaya.topic_modelling.AttentionTopic class
    """

[4]: electra = malaya.transformer.load(model='electra')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:56: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:240: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:115: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:119: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:122: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:128: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:130: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt

[5]: attention = malaya.topic_model.attention(corpus, n_topics=10, vectorizer=electra)


Get topics

def top_topics(
    self, len_topic: int, top_n: int = 10, return_df: bool = True
):
    """
    Print important topics based on decomposition.

    Parameters
    ----------
    len_topic: int
        size of topics.
    top_n: int, optional (default=10)
        top n of each topic.
    return_df: bool, optional (default=True)
        return as pandas.DataFrame, else JSON.
    """

[6]: attention.top_topics(5, top_n=10, return_df=True)
[6]:      topic 0          topic 1       topic 2          topic 3         topic 4
     0       kwsp           negara          umno          menteri          projek
     1   mahkamah         malaysia         parti          perdana          hutang
     2       dana           rakyat           pas           bahasa        malaysia
     3   syarikat       pengalaman      kerajaan  perdana menteri             mdb
     4        bon        berkongsi           ros         kerajaan     kementerian
     5    dakwaan         kerajaan  perlembagaan          laporan           rumah
     6  kelulusan       berkembang     keputusan              isu        kerajaan
     7       bank            parti       menteri            pelan         gembira
     8       jppm        kemudahan       bersatu        pemilihan      pendekatan
     9  kenyataan  rakyat malaysia           isu       penjelasan  gembira projek

Get topics as string

def get_topics(self, len_topic: int):
    """
    Return important topics based on decomposition.

    Parameters
    ----------
    len_topic: int
        size of topics.

    Returns
    -------
    result: List[str]
    """

[7]: attention.get_topics(10)
[7]: [(0, 'kwsp mahkamah dana syarikat bon dakwaan kelulusan bank jppm kenyataan'),
     (1, 'negara malaysia rakyat pengalaman berkongsi kerajaan berkembang parti kemudahan rakyat malaysia'),
     (2, 'umno parti pas kerajaan ros perlembagaan keputusan menteri bersatu isu'),
     (3, 'menteri perdana bahasa perdana menteri kerajaan laporan isu pelan pemilihan penjelasan'),
     (4, 'projek hutang malaysia mdb kementerian rumah kerajaan gembira pendekatan gembira projek'),
     (5, 'bayar rakyat selesaikan raya pilihan raya ppsmi bincang bayar tutup mca jppm'),
     (6, 'kapal malaysia asli low jho jho low negara wang berita islam'),
     (7, 'undi parti pimpinan pakatan sokong pucuk suara pucuk suara bertanding suara pucuk pimpinan'),
     (8, 'pertumbuhan hutang harga pendapatan produk malaysia kaya kenaikan kumpulan peningkatan'),
     (9, 'lancar rakyat teknikal berjalan lancar kerja buku bahasa berjalan catatan berlaku')]

9.56.2 Train LDA2Vec model

def lda2vec(
    corpus: List[str],
    vectorizer,
    n_topics: int = 10,
    cleaning = simple_textcleaning,
    stopwords = get_stopwords,
    window_size: int = 2,
    embedding_size: int = 128,
    epoch: int = 10,
    switch_loss: int = 3,
    **kwargs,
):
    """
    Train a LDA2Vec model to do topic modelling based on corpus / list of strings given.

    Parameters
    ----------
    corpus: list
    vectorizer : object
        Should have `fit_transform` method. Commonly:

        * ``sklearn.feature_extraction.text.TfidfVectorizer`` - TFIDF algorithm.
        * ``sklearn.feature_extraction.text.CountVectorizer`` - Bag-of-Word algorithm.
        * ``malaya.text.vectorizer.SkipGramCountVectorizer`` - Skip Gram Bag-of-Word algorithm.
        * ``malaya.text.vectorizer.SkipGramTfidfVectorizer`` - Skip Gram TFIDF algorithm.
    n_topics: int, (default=10)
        size of decomposition column.
    cleaning: function, (default=malaya.text.function.simple_textcleaning)
        function to clean the corpus.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]
    embedding_size: int, (default=128)
        embedding size of lda2vec tensors.
    epoch: int, (default=10)
        one complete iteration.
    switch_loss: int, (default=3)
        baseline to switch from document based loss to document + word based loss.

    Returns
    -------
    result: malaya.topic_modelling.DeepTopic class
    """

[8]: from malaya.text.vectorizer import SkipGramCountVectorizer

     stopwords = malaya.text.function.get_stopwords()
     vectorizer = SkipGramCountVectorizer(
         max_df=0.95,
         min_df=1,
         ngram_range=(1, 3),
         stop_words=stopwords,
         skip=2,
     )

[9]: lda2vec = malaya.topic_model.lda2vec(corpus, vectorizer, n_topics=10, switch_loss=5000, epoch=5)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/model/lda2vec.py:43: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/model/lda2vec.py:46: The name tf.truncated_normal is deprecated. Please use tf.random.truncated_normal instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/model/lda2vec.py:54: The name tf.random_normal is deprecated. Please use tf.random.normal instead.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/model/lda2vec.py:117: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

minibatch loop: 100%|| 2187/2187 [00:22<00:00, 95.41it/s, cost=40.5, epoch=1]
minibatch loop: 100%|| 2187/2187 [00:24<00:00, 88.48it/s, cost=12.9, epoch=2]
minibatch loop: 100%|| 2187/2187 [00:23<00:00, 93.20it/s, cost=591, epoch=3]
minibatch loop: 100%|| 2187/2187 [00:23<00:00, 91.28it/s, cost=479, epoch=4]
minibatch loop: 100%|| 2187/2187 [00:24<00:00, 89.11it/s, cost=449, epoch=5]


Get topics

def top_topics(
    self, len_topic: int, top_n: int = 10, return_df: bool = True
):
    """
    Print important topics based on decomposition.

    Parameters
    ----------
    len_topic: int
        size of topics.
    top_n: int, optional (default=10)
        top n of each topic.
    return_df: bool, optional (default=True)
        return as pandas.DataFrame, else JSON.
    """

[10]: lda2vec.top_topics(5, top_n=10, return_df=True)
[10]:                                topic 0                       topic 1
      0                    bank dakwaan wang           bank negara dakwaan
      1             dakwaan pemindahan akaun             bank dakwaan wang
      2                 bank pemindahan wang             bank rhb syarikat
      3  menangguhkan menangguhkan kebenaran            kerajaan berkaitan
      4            subjek menjadikan ranking  penilaian tahunan ditawarkan
      5      persendirian dibenarkan lingkup           bank milik syarikat
      6                    menjadikan subjek     subjek menjadikan ranking
      7                            wang bank          bank pemindahan wang
      8                     dolar bank milik         mendedahkan ruang had
      9                      luas menjadikan              dolar bank milik

                          topic 2                              topic 3
      0                        ros                    bank dakwaan wang
      1               perlembagaan            subjek menjadikan ranking
      2                     lancar  menangguhkan menangguhkan kebenaran
      3                     status                    menjadikan subjek
      4                 dihentikan      mencadangkan pembangkang azizah
      5                   berjalan                    bank rhb syarikat
      6            sedar mengambil             dakwaan pemindahan akaun
      7  sahkan perolehi keputusan                           wang dolar
      8            berjalan lancar                  bank negara dakwaan
      9                    pilihan                     jabatan malaysia

                           topic 4
      0                        ros
      1               perlembagaan
      2                   berjalan
      3                     lancar
      4                    pilihan
      5                 dihentikan
      6                     status
      7            sedar mengambil
      8  sahkan perolehi keputusan
      9            berjalan lancar


Important sentences based on topics

def get_sentences(self, len_sentence: int, k: int = 0):
    """
    Return important sentences related to selected column based on decomposition.

    Parameters
    ----------
    len_sentence: int
    k: int, (default=0)
        index of decomposition matrix.

    Returns
    -------
    result: List[str]
    """

[11]: lda2vec.get_sentences(5)
[11]: ['bank negara dakwaan pemindahan wang akaun dolar bank rhb milik syarikat persendirian mendedahkan dibenarkan ruang lingkup had perundangan',
      'jho low anak kapal ditahan perairan indonesia',
      'tumpuan pekan najib tumpuan langkawi',
      'april berbangkit status memegang jawatan umno',
      'membantu negara negara maju bidang perancangan ekonomi kewangan perdagangan pertanian pendidikan latihan teknikal industri diplomasi']

Get topics as string

def get_topics(self, len_topic: int):
    """
    Return important topics based on decomposition.

    Parameters
    ----------
    len_topic: int
        size of topics.

    Returns
    -------
    result: List[str]
    """

[12]: lda2vec.get_topics(10)
[12]: [(0, 'bank dakwaan wang dakwaan pemindahan akaun bank pemindahan wang menangguhkan menangguhkan kebenaran subjek menjadikan ranking persendirian dibenarkan lingkup menjadikan subjek wang bank dolar bank milik luas menjadikan'),
      (1, 'bank negara dakwaan bank dakwaan wang bank rhb syarikat kerajaan berkaitan penilaian tahunan ditawarkan bank milik syarikat subjek menjadikan ranking bank pemindahan wang mendedahkan ruang had dolar bank milik'),
      (2, 'ros perlembagaan lancar status dihentikan berjalan sedar mengambil sahkan perolehi keputusan berjalan lancar pilihan'),
      (3, 'bank dakwaan wang subjek menjadikan ranking menangguhkan menangguhkan kebenaran menjadikan subjek mencadangkan pembangkang azizah bank rhb syarikat dakwaan pemindahan akaun wang dolar bank negara dakwaan jabatan malaysia'),
      (4, 'ros perlembagaan berjalan lancar pilihan dihentikan status sedar mengambil sahkan perolehi keputusan berjalan lancar'),
      (5, 'ros perlembagaan dihentikan lancar sedar mengambil rakyat berjalan status sahkan perolehi keputusan berjalan lancar'),
      (6, 'ros perlembagaan lancar sedar mengambil status rakyat berjalan lancar berjalan pilihan dihentikan'),
      (7, 'bank dakwaan wang menjadikan subjek bank rhb syarikat subjek menjadikan ranking menangguhkan menangguhkan kebenaran bank milik syarikat dakwaan pemindahan akaun kerajaan berkaitan dolar bank milik jabatan malaysia'),
      (8, 'ros perlembagaan lancar sedar mengambil pilihan berjalan lancar kebenaran sahkan perolehi keputusan berjalan rakyat'),
      (9, 'ros perlembagaan lancar berjalan sedar mengambil status dihentikan sahkan perolehi keputusan pilihan rakyat')]

Visualize topics

This will initiate a pyLDAvis object; to understand pyLDAvis more, read https://github.com/bmabey/pyLDAvis.

def visualize_topics(self, notebook_mode: bool = False, mds: str = 'pcoa'):
    """
    Print important topics based on decomposition.

    Parameters
    ----------
    mds : str, optional (default='pcoa')
        2D Decomposition. Allowed values:

        * ``'pcoa'`` - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
        * ``'mmds'`` - Dimension reduction via Multidimensional scaling
        * ``'tsne'`` - Dimension reduction via t-distributed stochastic neighbor embedding
    """

[15]: lda2vec.visualize_topics(notebook_mode=True)
[15]: PreparedData(topic_coordinates=              x         y  topics  cluster       Freq
      topic
      2      0.199065  0.003735       1        1  11.141104
      5      0.148336 -0.006479       2        1  11.104436
      9      0.158898 -0.003680       3        1  11.097365
      4      0.115346  0.008548       4        1  10.745593
      6      0.073904 -0.001631       5        1  10.744272
      8      0.022029 -0.000809       6        1  10.435264
      1     -0.141205  0.009112       7        1   8.966058
      0     -0.154737 -0.010546       8        1   8.874525
      7     -0.194209  0.000224       9        1   8.538020
      3     -0.227427  0.001526      10        1   8.353364, topic_info=
                                            Term       Freq      Total Category  logprob  loglift
      13313                                  ros  25.000000  25.000000  Default  30.0000  30.0000
      11954                         perlembagaan  21.000000  21.000000  Default  29.0000  29.0000
      6611                                lancar  14.000000  14.000000  Default  28.0000  28.0000
      9101                     menjadikan subjek   7.000000   7.000000  Default  27.0000  27.0000
      14440            subjek menjadikan ranking   7.000000   7.000000  Default  26.0000  26.0000
      ...                                    ...        ...        ...      ...      ...      ...
      11888       perjanjian keselamatan lawatan   1.280732   3.392128  Topic10  -7.0652   1.5085
      6914                        luas subjek qs   1.368553   3.673588  Topic10  -6.9989   1.4951
      1061                   bank milik syarikat   1.615340   4.865867  Topic10  -6.8331   1.3798
      8449                 mendedahkan ruang had   1.695390   5.254410  Topic10  -6.7847   1.3514
      12556                      positif membina   1.328130   3.781593  Topic10  -7.0288   1.4361

      [762 rows x 6 columns], token_table=       Topic      Freq                    Term
      term
      69         1  0.362784                   adnan
      69         2  0.181392                   adnan
      69         3  0.181392                   adnan
      69         4  0.181392                   adnan
      69         5  0.181392                   adnan
      ...      ...       ...                     ...
      15956      1  0.346625  wang keluarga putihnya
      15956      2  0.346625  wang keluarga putihnya
      15956      3  0.346625  wang keluarga putihnya
      16017      1  0.368357                   wujud
      16017      3  0.368357                   wujud

      [1123 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[3, 6, 10, 5, 7, 9, 2, 1, 8, 4])
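Since the returned object is a standard pyLDAvis PreparedData (as shown above), you should also be able to export it to a standalone HTML file with the usual pyLDAvis API; a minimal sketch (the output file name is just an example):

import pyLDAvis

prepared = lda2vec.visualize_topics(notebook_mode=True)
pyLDAvis.save_html(prepared, 'lda2vec-topics.html')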


9.56.3 Train SKLearn LDA model

[16]: from sklearn.decomposition import LatentDirichletAllocation

      lda = malaya.topic_model.sklearn(
          corpus,
          LatentDirichletAllocation,
          vectorizer=vectorizer,
          n_topics=10,
      )
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)

Print topics

def top_topics(
    self, len_topic: int, top_n: int = 10, return_df: bool = True
):
    """
    Print important topics based on decomposition.

    Parameters
    ----------
    len_topic: int
        size of topics.
    top_n: int, optional (default=10)
        top n of each topic.
    return_df: bool, optional (default=True)
        return as pandas.DataFrame, else JSON.
    """

[17]: lda.top_topics(5, top_n=10, return_df=True)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)
[17]:                  topic 0     topic 1        topic 2        topic 3
      0                 negara    malaysia         rakyat          sukan
      1                    mdb      negara         negara          saham
      2               malaysia       parti           umno     sukan suka
      3             perniagaan    kerajaan      keputusan        berlaku
      4                   ahli         mdb         hutang       kerajaan
      5          negara bidang     menteri       tindakan         rendah
      6            negara maju      bidang            isu    kepentingan
      7 membantu negara bidang        kuok            air         sumber
      8        membantu negara     berlaku  hutang hutang   meningkatkan
      9     negara maju bidang  pendidikan         negeri  diterjemahkan

                    topic 4
      0             menteri
      1              rakyat
      2            kerajaan
      3             perdana
      4     perdana menteri
      5            malaysia
      6                anak
      7               nilai
      8                  ph
      9               beban

Important sentences based on topics

def get_sentences(self, len_sentence: int, k: int = 0):
    """
    Return important sentences related to selected column based on decomposition.

    Parameters
    ----------
    len_sentence: int
    k: int, (default=0)
        index of decomposition matrix.

    Returns
    -------
    result: List[str]
    """

[18]: lda.get_sentences(5)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)
[18]: ['catatan itu menunjukkan exco pkr selangor elizabeth wong adun pkr chua yee ling dan ahli majlis selayang daripada pkr fok wai mun di sebuah acara perayaan cina tetapi membabitkan saya dalam gambar itu',
      'rakyat malaysia yang berfikiran waras akan ingat bagaimana mahathir memulakan serangan berita palsu terhadap mdb apabila menyatakan dengan salah bahawa rm bilion telah hilang hanya untuk dibuktikan berulang kali bahawa kenyataan itu adalah salah',
      'sehingga hari ini selain dakwaan asas yang terkandung dalam tuntutan sivil itu doj belum mengemukakan sebarang bukti kukuh bahawa jho low ialah pemilik sebenar kapal mewah itu ataupun ia dibeli menggunakan dana daripada mdb',
      'sebagai negara yang menandatangani who pertubuhan kesihatan sedunia kami juga komited untuk mencapai strategi sektor kesihatan global dengan matlamat untuk menghapuskan viral hepatitis menjelang tahun',
      'mdb berulang kali menjelaskan bahawa walaupun ia mempunyai urusan perniagaan dengan aabar bvi mdb tidak mempunyai sebarang urusan perniagaan dengan jho low dan yang lebih penting mdb bukanlah pihak dalam tuntutan sivil doj']


Get topics

def get_topics(self, len_topic: int):
    """
    Return important topics based on decomposition.

    Parameters
    ----------
    len_topic: int

    Returns
    -------
    result: List[str]
    """

[19]: lda.get_topics(10)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)
[19]: [(0, 'negara mdb malaysia perniagaan ahli negara bidang negara maju membantu negara bidang membantu negara negara maju bidang'),
      (1, 'malaysia negara parti kerajaan mdb menteri bidang kuok berlaku pendidikan'),
      (2, 'rakyat negara umno keputusan hutang tindakan isu air hutang hutang negeri'),
      (3, 'sukan saham sukan suka berlaku kerajaan rendah kepentingan sumber meningkatkan diterjemahkan'),
      (4, 'menteri rakyat kerajaan perdana perdana menteri malaysia anak nilai ph beban'),
      (5, 'kerajaan malaysia dana negara pendapatan asli peningkatan awam usaha tertinggi'),
      (6, 'projek masyarakat harga isu rm malaysia rakyat hutang dijual sokongan'),
      (7, 'pembangunannya negara selatan negara selatan negara malaysia berkongsi pengalaman berkongsi pengalaman pengalaman pembangunannya negara berkongsi pengalaman negara pembangunannya negara'),
      (8, 'projek negara parti syarikat kerajaan harapan undi malaysia berjalan asli'),
      (9, 'parti bahasa faktor berita perlembagaan umno kelulusan amanah pas islam')]


Visualize topics

def visualize_topics(self, notebook_mode: bool = False, mds: str = 'pcoa'):
    """
    Print important topics based on decomposition.

    Parameters
    ----------
    mds : str, optional (default='pcoa')
        2D Decomposition. Allowed values:

        * ``'pcoa'`` - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
        * ``'mmds'`` - Dimension reduction via Multidimensional scaling
        * ``'tsne'`` - Dimension reduction via t-distributed stochastic neighbor embedding
    """

[20]: lda.visualize_topics(notebook_mode=True)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)
[20]: PreparedData(topic_coordinates=              x         y  topics  cluster       Freq
      topic
      0     -0.121414  0.120838       1        1  12.155887
      8      0.121421  0.008191       2        1  12.127153
      1     -0.094837 -0.150771       3        1  12.120930
      6      0.036948 -0.021168       4        1  10.627565
      7      0.012586  0.017853       5        1  10.233161
      4      0.018758 -0.007332       6        1   9.582676
      3      0.002527  0.011784       7        1   9.234506
      2      0.001481  0.019098       8        1   9.193199
      5      0.009064  0.001161       9        1   7.791288
      9      0.013466  0.000347      10        1   6.933636, topic_info=
                                       Term      Freq     Total Category  logprob  loglift
      10784  pembangunannya negara selatan  4.000000  4.000000  Default   30.000  30.0000
      9966                  negara selatan  4.000000  4.000000  Default   29.000  29.0000
      12683                         projek  8.000000  8.000000  Default   28.000  28.0000
      15683                           umno  6.000000  6.000000  Default   27.000  27.0000
      9232                         menteri  9.000000  9.000000  Default   26.000  26.0000
      ...                              ...       ...       ...      ...      ...      ...
      12520                        politik  0.933800  3.714995  Topic10   -7.224   1.2879
      10490                      pelaburan  0.933799  2.001124  Topic10   -7.224   1.9066
      4266                            ilmu  0.933799  2.449582  Topic10   -7.224   1.7044
      10378                      pas parti  0.933799  2.001433  Topic10   -7.224   1.9064
      2262                            buku  0.933799  1.943558  Topic10   -7.224   1.9358

      [766 rows x 6 columns], token_table=       Topic      Freq                    Term
      term
      108        1  0.333772                    ahli
      108        3  0.166886                    ahli
      108        4  0.166886                    ahli
      108        5  0.166886                    ahli
      108        9  0.166886                    ahli
      ...      ...       ...                     ...
      15820      8  0.410521                   usaha
      15820      9  0.410521                   usaha
      15842      9  0.995617    usaha penambahbaikan
      15860      6  0.633754              usia beban
      15863      6  0.633754   usia beban pemikiran

      [1044 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[1, 9, 2, 7, 8, 5, 4, 3, 6, 10])

9.57 Clustering

This tutorial is available as an IPython notebook at Malaya/example/clustering.

This module is visualized using matplotlib as a static image, which can produce a saturated graph. Use the returned values to visualize with a dynamic plotting library instead.

[1]: %%time
     import malaya
CPU times: user 4.83 s, sys: 722 ms, total: 5.55 s
Wall time: 4.92 s

9.57.1 Cluster same word structure based on POS and Entities

[2]: string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

[3]: entity = malaya.entity.transformer(model='albert', quantized=True)
     pos = malaya.pos.transformer(model='albert', quantized=True)


WARNING:root:Load quantized model will cause accuracy drop.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:loading sentence piece model
INFO:tensorflow:loading sentence piece model
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model
INFO:tensorflow:loading sentence piece model

[4]: result_entities = entity.predict(string)
     result_pos = pos.predict(string)


Generate Ngram using POS and Entity tagging

def pos_entities_ngram(
    result_pos: List[Tuple[str, str]],
    result_entities: List[Tuple[str, str]],
    ngram: Tuple[int, int] = (1, 3),
    accept_pos: List[str] = ['NOUN', 'PROPN', 'VERB'],
    accept_entities: List[str] = [
        'law',
        'location',
        'organization',
        'person',
        'time',
    ],
):
    """
    generate ngrams.

    Parameters
    ----------
    result_pos : List[Tuple[str, str]]
        result from POS recognition.
    result_entities : List[Tuple[str, str]]
        result of Entities recognition.
    ngram : Tuple[int, int]
        ngram sizes.
    accept_pos : List[str]
        accepted POS elements.
    accept_entities : List[str]
        accept entities elements.

    Returns
    -------
    result: list
    """

[5]: generated_grams = malaya.generator.pos_entities_ngram(
         result_pos,
         result_entities,
         ngram=(1, 3),
         accept_pos=['NOUN', 'PROPN', 'VERB'],
         accept_entities=['law', 'location', 'organization', 'person', 'time'],
     )
     generated_grams
[5]: ['tidur sebentar', 'Jabatan Keselamatan', 'menitipkan', 'pesanan', '(JKJR)',
     'Menteri Tun', 'Jalan Raya', 'Raya', 'Mahathir Mohamad Menteri',
     'Mahathir menasihati berhenti', 'berehat', 'pesanan orang pulang',
     'kampung halaman', 'halaman video terbitan', 'Tun Dr Mahathir',
     'terbitan Jabatan Keselamatan', 'video terbitan', '(JKJR) Dr Mahathir', 'video',
     'Siew Fook menitipkan', 'Loke', 'LUMPUR: sambutan Aidilfitri', 'Siew Fook',
     'Menteri', 'berehat tidur sebentar', 'Raya (JKJR) Dr', 'sebentar', 'Tun',
     'Menteri Pengangkutan Anthony', 'Jabatan', 'orang pulang kampung',
     'Jabatan Keselamatan Jalan', 'memandu.', 'Loke Siew Fook', 'Mohamad Menteri',
     'Menteri Pengangkutan', 'terbitan', 'Keselamatan Jalan Raya', 'KUALA',
     'Pengangkutan Anthony', 'minggu Perdana', 'berhenti berehat tidur', 'Fook',
     'Tun Dr', 'Dr Mahathir', 'halaman', 'Dr Mahathir menasihati', 'Aidilfitri minggu',
     'Anthony Loke Siew', 'LUMPUR:', 'minggu Perdana Menteri', 'Dr Mahathir Mohamad',
     'menasihati', 'Perdana Menteri', 'Pengangkutan Anthony Loke', 'Jalan Raya (JKJR)',
     'Raya (JKJR)', 'sebentar memandu.', 'KUALA LUMPUR: sambutan',
     'video terbitan Jabatan', 'Perdana Menteri Tun', 'KUALA LUMPUR:', 'kampung',
     'Mahathir', '(JKJR) Dr', 'orang', 'Keselamatan Jalan', 'halaman video',
     'sambutan', 'Mohamad', 'Anthony Loke', 'pulang kampung halaman',
     'Mahathir Mohamad', 'Pengangkutan', 'Anthony', 'tidur', 'Fook menitipkan',
     'menitipkan pesanan orang', 'sambutan Aidilfitri', 'sambutan Aidilfitri minggu',
     'pulang', 'Mahathir menasihati', 'pulang kampung', 'berhenti', 'pesanan orang',
     'Keselamatan', 'Jalan', 'Aidilfitri', 'Siew', 'menitipkan pesanan',
     'Mohamad Menteri Pengangkutan', 'menasihati berhenti', 'kampung halaman video',
     'Perdana', 'Fook menitipkan pesanan', 'berehat tidur', 'Aidilfitri minggu Perdana',
     'minggu', 'orang pulang', 'Menteri Tun Dr', 'tidur sebentar memandu.',
     'berhenti berehat', 'menasihati berhenti berehat', 'Loke Siew', 'terbitan Jabatan',
     'Dr', 'LUMPUR: sambutan']

Cluster similar sentences based on Unigram

def cluster_words(list_words: List[str], lowercase: bool = False):
    """
    cluster similar words based on structure, eg,
    ['mahathir mohamad', 'mahathir'] = ['mahathir mohamad'].
    big O = n^2

    Parameters
    ----------
    list_words : List[str]
    lowercase: bool, optional (default=False)
        if True, will group using lowercase but maintain the original form.

    Returns
    -------
    string: List[str]
    """

[6]: malaya.cluster.cluster_words(generated_grams)
[6]: ['berehat tidur sebentar', 'Raya (JKJR) Dr', 'menitipkan pesanan orang',
     'Anthony Loke Siew', 'sambutan Aidilfitri minggu', 'minggu Perdana Menteri',
     'Dr Mahathir Mohamad', 'Menteri Pengangkutan Anthony', 'Pengangkutan Anthony Loke',
     'Mahathir Mohamad Menteri', 'Jalan Raya (JKJR)', 'Mahathir menasihati berhenti',
     'pesanan orang pulang', 'orang pulang kampung', 'KUALA LUMPUR: sambutan',
     'Jabatan Keselamatan Jalan', 'video terbitan Jabatan', 'Loke Siew Fook',
     'halaman video terbitan', 'Keselamatan Jalan Raya', 'Perdana Menteri Tun',
     'Mohamad Menteri Pengangkutan', 'kampung halaman video', 'berhenti berehat tidur',
     'Fook menitipkan pesanan', 'Tun Dr Mahathir', 'Aidilfitri minggu Perdana',
     'terbitan Jabatan Keselamatan', 'tidur sebentar memandu.', 'Menteri Tun Dr',
     '(JKJR) Dr Mahathir', 'menasihati berhenti berehat', 'Siew Fook menitipkan',
     'pulang kampung halaman', 'Dr Mahathir menasihati', 'LUMPUR: sambutan Aidilfitri']

9.57.2 Cluster Part-Of-Speech

def cluster_pos(result: List[Tuple[str, str]]):
    """
    cluster similar POS.

    Parameters
    ----------
    result: List[Tuple[str, str]]

    Returns
    -------
    result: Dict[str, List[str]]
    """


[7]: malaya.cluster.cluster_pos(result_pos) [7]: {'ADJ': ['depan,', 'khas', 'ramai', 'pendek', 'mengantuk'], 'ADP': ['Sempena', 'kepada', 'ke', 'Dalam'], 'ADV': ['mahu'], 'ADX': [], 'CCONJ': ['dan'], 'DET': ['masing-masing.', 'itu,'], 'NOUN': ['sambutan Aidilfitri minggu', 'pesanan', 'orang', 'kampung halaman', 'video', 'terbitan', 'sebentar'], 'NUM': [], 'PART': [], 'PRON': ['yang', 'mereka'], 'PROPN': ['KUALA LUMPUR:', 'Perdana Menteri Tun Dr Mahathir Mohamad', 'Menteri Pengangkutan Anthony Loke Siew Fook', 'Jabatan Keselamatan Jalan Raya', 'Dr Mahathir'], 'PUNCT': ['(JKJR)'], 'SCONJ': ['supaya', 'sekiranya', 'ketika'], 'SYM': [], 'VERB': ['menitipkan', 'pulang', 'menasihati', 'berhenti berehat', 'tidur', 'memandu.'], 'X': []}
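The grouping behaviour is easy to emulate: merge consecutive tokens that share a tag into one phrase, then bucket the phrases by tag. A hedged sketch; cluster_tags_sketch is hypothetical and the real cluster_pos may differ in details:

from collections import defaultdict

def cluster_tags_sketch(tagged):
    # tagged: List[Tuple[word, tag]]; consecutive tokens with the same tag
    # are merged into one phrase, then bucketed by tag
    clusters = defaultdict(list)
    current_words, current_tag = [], None
    for word, tag in tagged:
        if current_words and tag != current_tag:
            clusters[current_tag].append(' '.join(current_words))
            current_words = []
        current_words.append(word)
        current_tag = tag
    if current_words:
        clusters[current_tag].append(' '.join(current_words))
    return dict(clusters)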

9.57.3 Cluster Entities

def cluster_entities(result: List[Tuple[str, str]]):
    """
    cluster similar Entities.

    Parameters
    ----------
    result: List[Tuple[str, str]]

    Returns
    -------
    result: Dict[str, List[str]]
    """

[8]: malaya.cluster.cluster_entities(result_entities)
[8]: {'OTHER': ['Sempena sambutan',
  'minggu depan,',
  'dan',
  'menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan',
  'itu,',
  'menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'],
 'law': [],
 'location': ['KUALA LUMPUR:'],
 'organization': ['Jabatan Keselamatan Jalan Raya (JKJR)'],
 'person': ['Perdana Menteri Tun Dr Mahathir Mohamad',
  'Menteri Pengangkutan Anthony Loke Siew Fook',
  'Dr Mahathir'],
 'quantity': [],
 'time': [],
 'event': ['Aidilfitri'],
 'X': []}

9.57.4 Load example data

[4]: %matplotlib inline

     import pandas as pd
     df = pd.read_csv('tests/02032018.csv', sep = ';')
     df = df.iloc[3:, 1:]
     df.columns = ['text', 'label']
     corpus = df.text.tolist()

You can get this file at Malaya/tests. This CSV is already stemmed.

[5]: model = malaya.sentiment.transformer(model = 'alxlnet', quantized = True)
     similarity_model = malaya.similarity.transformer(model = 'alxlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:root:Load quantized model will cause accuracy drop.

9.57.5 Generate scatter plot for unsupervised clustering

def cluster_scatter(
    corpus: List[str],
    vectorizer,
    num_clusters: int = 5,
    titles: List[str] = None,
    colors: List[str] = None,
    stopwords = get_stopwords,
    cleaning = simple_textcleaning,
    clustering = KMeans,
    decomposition = MDS,
    ngram: Tuple[int, int] = (1, 3),
    figsize: Tuple[int, int] = (17, 9),
    batch_size: int = 20,
):
    """
    plot scatter plot on similar text clusters.

    Parameters
    ----------
    corpus: List[str]
    vectorizer: class
        vectorizer class.
    num_clusters: int, (default=5)
        size of unsupervised clusters.
    titles: List[str], (default=None)
        list of titles, length must same with corpus.
    colors: List[str], (default=None)
        list of colors, length must same with num_clusters.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]
    ngram: Tuple[int, int], (default=(1,3))
        n-grams size to train a corpus.
    cleaning: function, (default=malaya.texts.function.simple_textcleaning)
        function to clean the corpus.
    batch_size: int, (default=20)
        size of strings for each vectorization and attention. Only useful if use transformer vectorizer.

    Returns
    -------
    dictionary: {'X': X, 'Y': Y, 'labels': clusters, 'vector': transformed_text_clean, 'titles': titles}
    """

[11]: result_scatter = malaya.cluster.cluster_scatter(corpus, vectorizer = model)
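Since batch_size is documented as useful only for transformer vectorizers, a plain scikit-learn vectorizer should also be accepted. A hedged sketch, assuming any object exposing the usual fit/transform interface works (unverified against every Malaya version):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range = (1, 3), max_df = 0.95, min_df = 2)
result_tfidf = malaya.cluster.cluster_scatter(corpus, vectorizer = tfidf, num_clusters = 5)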


9.57.6 Generate dendrogram plot for unsupervised clustering

def cluster_dendogram(
    corpus: List[str],
    vectorizer,
    titles: List[str] = None,
    stopwords = get_stopwords,
    cleaning = simple_textcleaning,
    random_samples: float = 0.3,
    ngram: Tuple[int, int] = (1, 3),
    figsize: Tuple[int, int] = (17, 9),
    batch_size: int = 20,
):
    """
    plot hierarchical dendogram with similar texts.

    Parameters
    ----------
    corpus: List[str]
    vectorizer: class
        vectorizer class.
    titles: List[str], (default=None)
        list of titles, length must same with corpus.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]
    cleaning: function, (default=malaya.text.function.simple_textcleaning)
        function to clean the corpus.
    random_samples: float, (default=0.3)
        random samples from the corpus, 0.3 means 30%.
    ngram: Tuple[int, int], (default=(1,3))
        n-grams size to train a corpus.
    batch_size: int, (default=20)
        size of strings for each vectorization and attention. Only useful if use transformer vectorizer.

    Returns
    -------
    dictionary: {'linkage_matrix': linkage_matrix, 'titles': titles}
    """

[12]: result_scatter = malaya.cluster.cluster_dendogram(corpus, vectorizer = model)
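Because the function returns the raw linkage matrix, the dendrogram can be re-plotted or customised outside Malaya with scipy. A minimal sketch, assuming the 'linkage_matrix' and 'titles' keys documented above:

import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

result = malaya.cluster.cluster_dendogram(corpus, vectorizer = model)
plt.figure(figsize = (17, 9))
hierarchy.dendrogram(result['linkage_matrix'], labels = result['titles'])
plt.show()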


9.57.7 Generate undirected graph for unsupervised clustering

def cluster_graph(
    corpus: List[str],
    vectorizer,
    threshold: float = 0.9,
    num_clusters: int = 5,
    titles: List[str] = None,
    colors: List[str] = None,
    stopwords = get_stopwords,
    ngram: Tuple[int, int] = (1, 3),
    cleaning = simple_textcleaning,
    clustering = KMeans,
    figsize: Tuple[int, int] = (17, 9),
    with_labels: bool = True,
    batch_size: int = 20,
):
    """
    plot undirected graph with similar texts.

    Parameters
    ----------
    corpus: List[str]
    vectorizer: class
        vectorizer class.
    threshold: float, (default=0.9)
        0.9 means, 90% above absolute pearson correlation.
    num_clusters: int, (default=5)
        size of unsupervised clusters.
    titles: List[str], (default=None)
        list of titles, length must same with corpus.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str] or List[str] or Tuple[str].
    cleaning: function, (default=malaya.texts.function.simple_textcleaning)
        function to clean the corpus.
    ngram: Tuple[int, int], (default=(1,3))
        n-grams size to train a corpus.
    batch_size: int, (default=20)
        size of strings for each vectorization and attention. Only useful if use transformer vectorizer.

    Returns
    -------
    dictionary: {'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}
    """

[15]: result_scatter = malaya.cluster.cluster_graph(corpus, vectorizer = similarity_model, threshold = 0.9)
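The threshold parameter decides which pairs of texts get an edge: two texts are connected only when the absolute pearson correlation of their vectors is at least the threshold. A small numpy sketch of that rule, where the vectors are random stand-ins for real embeddings:

import numpy as np

vectors = np.random.rand(5, 768)                # stand-in text embeddings
correlation = np.abs(np.corrcoef(vectors))      # 5 x 5 pairwise correlation matrix
adjacency = (correlation >= 0.9) & ~np.eye(5, dtype = bool)
edges = list(zip(*np.where(adjacency)))         # pairs of texts that share an edge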

9.57.8 Generate undirected graph for Entities and topics relationship

def cluster_entity_linking(
    corpus: List[str],
    vectorizer,
    entity_model,
    topic_modeling_model,
    threshold: float = 0.3,
    topic_decomposition: int = 2,
    topic_length: int = 10,
    fuzzy_ratio: int = 70,
    accepted_entities: List[str] = [
        'law',
        'location',
        'organization',
        'person',
        'event',
    ],
    cleaning = simple_textcleaning,
    colors: List[str] = None,
    stopwords = get_stopwords,
    max_df: float = 1.0,
    min_df: int = 1,
    ngram: Tuple[int, int] = (2, 3),
    figsize: Tuple[int, int] = (17, 9),
    batch_size: int = 20,
):
    """
    plot undirected graph for Entities and topics relationship.

    Parameters
    ----------
    corpus: list or str
    vectorizer: class
    titles: list
        list of titles, length must same with corpus.
    colors: list
        list of colors, length must same with num_clusters.
    threshold: float, (default=0.3)
        0.3 means, 30% above absolute pearson correlation.
    topic_decomposition: int, (default=2)
        size of decomposition.
    topic_length: int, (default=10)
        size of topic models.
    fuzzy_ratio: int, (default=70)
        size of ratio for fuzzywuzzy.
    max_df: float, (default=1.0)
        maximum of a word selected based on document frequency.
    min_df: int, (default=1)
        minimum of a word selected on based on document frequency.
    ngram: tuple, (default=(2,3))
        n-grams size to train a corpus.
    cleaning: function, (default=simple_textcleaning)
        function to clean the corpus.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str] or List[str] or Tuple[str]

    Returns
    -------
    dictionary: {'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}
    """

[6]: from sklearn.feature_extraction.text import TfidfVectorizer

     tf_vectorizer = TfidfVectorizer(
         ngram_range = (1, 3),
         min_df = 2,
         max_df = 0.95,
     )
     topic_model = malaya.topic_model.lda

     result_linking = malaya.cluster.cluster_entity_linking(
         corpus,
         tf_vectorizer,
         entity,
         topic_model,
     )
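fuzzy_ratio controls how aggressively entities are matched against topic keywords: a link is kept only when the fuzzywuzzy similarity score reaches the given value. A quick illustration; the exact scores are approximate:

from fuzzywuzzy import fuzz

# with fuzzy_ratio = 70, a pair is linked only when fuzz.ratio(a, b) >= 70
fuzz.ratio('Dr Mahathir', 'Mahathir')      # roughly 84, linked
fuzz.ratio('Dr Mahathir', 'Anthony Loke')  # much lower, not linked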

9.58 Stacking

This tutorial is available as an IPython notebook at Malaya/example/stacking.

9.58.1 Why Stacking?

Sometimes a single model is not good enough, so you need to use multiple models to get a better result! This is called stacking.

[1]: %%time
     import malaya
CPU times: user 5.67 s, sys: 1.34 s, total: 7 s
Wall time: 7.97 s
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))


[3]: albert = malaya.sentiment.transformer('albert', quantized = True)
     alxlnet = malaya.sentiment.transformer('alxlnet', quantized = True)
     multinomial = malaya.sentiment.multinomial()
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model
INFO:tensorflow:loading sentence piece model
WARNING:root:Load quantized model will cause accuracy drop.

9.58.2 Stack multiple sentiment models

malaya.stack.predict_stack provides an easy stacking solution for Malaya models. And not just for sentiment models, any classification model can use malaya.stack.predict_stack.

def predict_stack(models, strings: List[str], aggregate: Callable = gmean, **kwargs):
    """
    Stacking for predictive models.

    Parameters
    ----------
    models: List[Callable]
        list of models.
    strings: List[str]
    aggregate : Callable, optional (default=scipy.stats.mstats.gmean)
        Aggregate function.

    Returns
    -------
    result: dict
    """

[4]: malaya.stack.predict_stack([albert, multinomial, alxlnet], ['harga minyak tak menentu']) [4]: [{'negative': 0.5016266912464752, 'positive': 4.4445397894955644e-05, 'neutral': 0.004399656207132555}]
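The default aggregate is the geometric mean, so the stacked score for each label is simply gmean over the per-model probabilities. A tiny sketch with made-up scores:

import numpy as np
from scipy.stats.mstats import gmean

# hypothetical 'negative' probabilities from three models
negative_scores = np.array([0.7, 0.3, 0.9])
gmean(negative_scores)  # single stacked 'negative' score, roughly 0.57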

To disable neutral, simply pass add_neutral = False.

[5]: malaya.stack.predict_stack([albert, multinomial, alxlnet], ['harga minyak tak menentu'], add_neutral= False) [5]: [{'negative': 0.8257116478969977, 'positive': 0.0016922961136002735}]
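Since aggregate is documented as any Callable, the geometric mean can also be swapped out, for example for scipy's harmonic mean, assuming the callable is applied the same way as the default gmean:

from scipy.stats.mstats import hmean

malaya.stack.predict_stack(
    [albert, multinomial, alxlnet],
    ['harga minyak tak menentu'],
    aggregate = hmean,
)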


9.58.3 Stack tagging models

For tagging models, we use majority voting stacking, so you need more than 2 models for it to work well; otherwise it will pick randomly between 2 models. malaya.stack.voting_stack provides an easy interface for this kind of stacking, but it can only be used for Entities, POS and Dependency Parsing recognition. A minimal sketch of the voting idea follows the docstring below.

def voting_stack(models, text):
    """
    Stacking for POS and Entities Recognition models.

    Parameters
    ----------
    models: list
        list of models
    text: str
        string to predict

    Returns
    -------
    result: list
    """
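Majority voting itself is simple: align the tag sequences token by token and pick the most common tag. A minimal sketch of the idea; voting_sketch is hypothetical, not Malaya's exact code:

from collections import Counter

def voting_sketch(predictions):
    # predictions: one tag sequence per model, all over the same tokens
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*predictions)]

voting_sketch([
    ['PROPN', 'NOUN', 'VERB'],
    ['PROPN', 'PROPN', 'VERB'],
    ['PROPN', 'PROPN', 'NOUN'],
])  # ['PROPN', 'PROPN', 'VERB']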

[9]: string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

     albert = malaya.pos.transformer('albert')
     bert = malaya.pos.transformer('bert')
     malaya.stack.voting_stack([albert, bert], string)
[9]: [('Kuala', 'PROPN'),
 ('Lumpur:', 'PROPN'),
 ('Sempena', 'ADP'),
 ('sambutan', 'NOUN'),
 ('Aidilfitri', 'PROPN'),
 ('minggu', 'NOUN'),
 ('depan,', 'ADJ'),
 ('Perdana', 'PROPN'),
 ('Menteri', 'PROPN'),
 ('Tun', 'PROPN'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('Mohamad', 'PROPN'),
 ('dan', 'CCONJ'),
 ('Menteri', 'PROPN'),
 ('Pengangkutan', 'PROPN'),
 ('Anthony', 'PROPN'),
 ('Loke', 'PROPN'),
 ('Siew', 'PROPN'),
 ('Fook', 'PROPN'),
 ('menitipkan', 'VERB'),
 ('pesanan', 'NOUN'),
 ('khas', 'ADJ'),
 ('kepada', 'ADP'),
 ('orang', 'NOUN'),
 ('ramai', 'ADJ'),
 ('yang', 'PRON'),
 ('mahu', 'ADV'),
 ('pulang', 'VERB'),
 ('ke', 'ADP'),
 ('kampung', 'NOUN'),
 ('halaman', 'NOUN'),
 ('masing-masing.', 'DET'),
 ('Dalam', 'ADP'),
 ('video', 'NOUN'),
 ('pendek', 'ADJ'),
 ('terbitan', 'NOUN'),
 ('Jabatan', 'PROPN'),
 ('Keselamatan', 'PROPN'),
 ('Jalan', 'PROPN'),
 ('Raya', 'PROPN'),
 ('(JKJR)', 'PUNCT'),
 ('itu,', 'DET'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('menasihati', 'VERB'),
 ('mereka', 'PRON'),
 ('supaya', 'SCONJ'),
 ('berhenti', 'VERB'),
 ('berehat', 'VERB'),
 ('dan', 'CCONJ'),
 ('tidur', 'VERB'),
 ('sebentar', 'ADV'),
 ('sekiranya', 'SCONJ'),
 ('mengantuk', 'NOUN'),
 ('ketika', 'SCONJ'),
 ('memandu.', 'VERB')]

[10]: string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

      xlnet = malaya.dependency.transformer(model = 'xlnet')
      alxlnet = malaya.dependency.transformer(model = 'alxlnet')

[11]: tagging, indexing = malaya.stack.voting_stack([xlnet, xlnet, alxlnet], string)
      malaya.dependency.dependency_graph(tagging, indexing).to_graphvis()


9.59 Finetune ALXLNET-Bahasa

This tutorial is available as an IPython notebook at Malaya/finetune/alxlnet.

In this notebook, I am going to show how to finetune pretrained ALXLNET-Bahasa using Tensorflow Estimator. TF-Estimator is a great module created by the Tensorflow team to train a model over a very long period.

[1]: # !pip3 install tensorflow==1.15

9.59.1 Download pretrained model

https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/alxlnet#download. In this example, we are going to try the BASE size. Just uncomment below to download the pretrained model and tokenizer.

[2]: # !wget https://f000.backblazeb2.com/file/malaya-model/bert-bahasa/alxlnet-base-500k-20-10-2020.gz
     # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/preprocess/sp10m.cased.v9.model
     # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/alxlnet/config/alxlnet-base_config.json
     # !tar -zxf alxlnet-base-500k-20-10-2020.gz
     !ls
__pycache__                          modeling.py
alxlnet-base                         prepro_utils.py
alxlnet-base-500k-20-10-2020.gz      sp10m.cased.v9.model
alxlnet-base_config.json             tf-estimator-text-classification.ipynb
custom_modeling.py                   xlnet.py
model_utils.py

[3]: !ls alxlnet-base
model.ckpt-500000.data-00000-of-00001  model.ckpt-500000.meta
model.ckpt-500000.index

There is a helper function in malaya/finetune/utils.py to help us train the model on a single GPU or multiple GPUs.

[4]: import sys

     sys.path.insert(0, '../')
     import utils

9.59.2 Load dataset

We are just going to train on a very small bahasa news sentiment dataset.

[5]: import pandas as pd

     df = pd.read_csv('../sentiment-data-v2.csv')
     df.head()


[5]:       label                                                text
     0  Negative   Lebih-lebih lagi dengan kemudahan internet da...
     1  Positive  boleh memberi teguran kepada parti tetapi perl...
     2  Negative  Adalah membingungkan mengapa masyarakat Cina b...
     3  Positive  Kami menurunkan defisit daripada 6.7 peratus p...
     4  Negative         Ini masalahnya. Bukan rakyat, tetapi sistem

[6]: labels = df['label'].values.tolist()
     texts = df['text'].values.tolist()
     unique_labels = sorted(list(set(labels)))
     unique_labels
[6]: ['Negative', 'Positive']

[7]: import xlnet
     import numpy as np
     import tensorflow as tf
     import model_utils
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/model_utils.py:334: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

[8]: import sentencepiece as spm
     from prepro_utils import preprocess_text, encode_ids

     sp_model = spm.SentencePieceProcessor()
     sp_model.Load('sp10m.cased.v9.model')

     SEG_ID_A = 0
     SEG_ID_B = 1
     SEG_ID_CLS = 2
     SEG_ID_SEP = 3
     SEG_ID_PAD = 4

     special_symbols = {
         '<unk>': 0,
         '<s>': 1,
         '</s>': 2,
         '<cls>': 3,
         '<sep>': 4,
         '<pad>': 5,
         '<mask>': 6,
         '<eod>': 7,
         '<eop>': 8,
     }

     VOCAB_SIZE = 32000
     UNK_ID = special_symbols['<unk>']
     CLS_ID = special_symbols['<cls>']
     SEP_ID = special_symbols['<sep>']
     MASK_ID = special_symbols['<mask>']
     EOD_ID = special_symbols['<eod>']

def tokenize_fn(text):
    text = preprocess_text(text, lower = False)
    return encode_ids(sp_model, text)

def token_to_ids(text, maxlen = 512):
    tokens_a = tokenize_fn(text)
    if len(tokens_a) > maxlen - 2:
        tokens_a = tokens_a[: (maxlen - 2)]
    segment_id = [SEG_ID_A] * len(tokens_a)
    tokens_a.append(SEP_ID)
    tokens_a.append(CLS_ID)
    segment_id.append(SEG_ID_A)
    segment_id.append(SEG_ID_CLS)
    input_mask = [0.0] * len(tokens_a)
    assert len(tokens_a) == len(input_mask) == len(segment_id)
    return {
        'input_id': tokens_a,
        'input_mask': input_mask,
        'segment_id': segment_id,
    }

1. input_id, integer representation of tokenized words, sorted based on sentencepiece weightage. 2. input_mask, attention masking. During training, shorter sequences will be padded with 1, and we do not want the model to learn padded values as part of the context. https://github.com/zihangdai/xlnet/blob/master/classifier_utils.py#L113 3. segment_id, used for text pair classification; in this case, we can simply put 0.

[9]: token_to_ids(texts[0])
[9]: {'input_id': [1620, 13, 5177, 53, 33, 2808, 3168, 24, 3400, 807, 21, 16179, 31, 742, 578, 17153, 9, 4, 3],
 'input_mask': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]}

9.59.3 TF-Estimator

TF-Estimator requires 2 parts: 1. Input pipeline, https://www.tensorflow.org/api_docs/python/tf/data/Dataset 2. Model definition, https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator

9.59.4 Data pipeline

[10]: def generate():
          while True:
              for i in range(len(texts)):
                  if len(texts[i]) > 5:
                      d = token_to_ids(texts[i])
                      d['label'] = [unique_labels.index(labels[i])]
                      d.pop('tokens', None)
                      yield d

[11]: g = generate()
      next(g)
[11]: {'input_id': [1620, 13, 5177, 53, 33, 2808, 3168, 24, 3400, 807, 21, 16179, 31, 742, 578, 17153, 9, 4, 3],
 'input_mask': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
 'label': [0]}

It must be a function that returns a function.

def get_dataset(batch_size = 32, shuffle_size = 32):
    def get():
        return dataset
    return get

[12]: def get_dataset(batch_size = 32, shuffle_size = 32):
          def get():
              dataset = tf.data.Dataset.from_generator(
                  generate,
                  {'input_id': tf.int32, 'input_mask': tf.float32, 'segment_id': tf.int32, 'label': tf.int32},
                  output_shapes = {
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
              )
              dataset = dataset.shuffle(shuffle_size)
              dataset = dataset.padded_batch(
                  batch_size,
                  padded_shapes = {
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
                  padding_values = {
                      'input_id': tf.constant(0, dtype = tf.int32),
                      'input_mask': tf.constant(1.0, dtype = tf.float32),
                      'segment_id': tf.constant(4, dtype = tf.int32),
                      'label': tf.constant(0, dtype = tf.int32),
                  },
              )
              return dataset
          return get

Test data pipeline using tf.Session

[13]: tf.reset_default_graph()
      sess = tf.InteractiveSession()
      iterator = get_dataset()()
      iterator = iterator.make_one_shot_iterator().get_next()
WARNING:tensorflow:From :4: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.

[14]: iterator [14]: {'input_id': , 'input_mask': , 'segment_id': , 'label': }

[15]: sess.run(iterator)
[15]: {'input_id': array([[ 19, 4084, 1500, ..., 0, 0, 0],
        [2740, 9369, 31, ..., 0, 0, 0],
        [1084, 791, 835, ..., 0, 0, 0],
        ...,
        [ 767, 250, 51, ..., 0, 0, 0],
        [3593, 21, 7901, ..., 0, 0, 0],
        [8097, 2519, 271, ..., 0, 0, 0]], dtype=int32),
 'input_mask': array([[0., 0., 0., ..., 1., 1., 1.],
        [0., 0., 0., ..., 1., 1., 1.],
        [0., 0., 0., ..., 1., 1., 1.],
        ...,
        [0., 0., 0., ..., 1., 1., 1.],
        [0., 0., 0., ..., 1., 1., 1.],
        [0., 0., 0., ..., 1., 1., 1.]], dtype=float32),
 'segment_id': array([[0, 0, 0, ..., 4, 4, 4],
        [0, 0, 0, ..., 4, 4, 4],
        [0, 0, 0, ..., 4, 4, 4],
        ...,
        [0, 0, 0, ..., 4, 4, 4],
        [0, 0, 0, ..., 4, 4, 4],
        [0, 0, 0, ..., 4, 4, 4]], dtype=int32),
 'label': array([[0], [1], [0], [0], [1], [1], [1], [1], [1], [1], [0], [0], [0], [1], [1], [0], [1], [1], [0], [1], [0], [1], [1], [1], [1], [0], [1], [0], [1], [1], [1], [1]], dtype=int32)}

9.59.5 Model definition

It must be a function that accepts 4 parameters.

def model_fn(features, labels, mode, params):

[16]: kwargs = dict(
          is_training = True,
          use_tpu = False,
          use_bfloat16 = False,
          dropout = 0.1,
          dropatt = 0.1,
          init = 'normal',
          init_range = 0.1,
          init_std = 0.05,
          clamp_len = -1,
      )

xlnet_parameters = xlnet.RunConfig(**kwargs)
xlnet_config = xlnet.XLNetConfig(json_path = 'alxlnet-base_config.json')
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/xlnet.py:70: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

[17]: epoch = 10
      batch_size = 32
      warmup_proportion = 0.1
      num_train_steps = 10
      num_warmup_steps = int(num_train_steps * warmup_proportion)
      learning_rate = 2e-5

      training_parameters = dict(
          decay_method = 'poly',
          train_steps = num_train_steps,
          learning_rate = learning_rate,
          warmup_steps = num_warmup_steps,
          min_lr_ratio = 0.0,
          weight_decay = 0.00,
          adam_epsilon = 1e-8,
          num_core_per_host = 1,
          lr_layer_decay_rate = 1,
          use_tpu = False,
          use_bfloat16 = False,
          dropout = 0.0,
          dropatt = 0.0,
          init = 'normal',
          init_range = 0.1,
          init_std = 0.05,
          clip = 1.0,
          clamp_len = -1,
      )
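With num_train_steps = 10 and warmup_proportion = 0.1, num_warmup_steps = int(10 * 0.1) = 1, so the learning rate ramps up for the first step only and then follows the polynomial ('poly') decay schedule down towards min_lr_ratio.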

[18]: class Parameter:
          def __init__(
              self,
              decay_method,
              warmup_steps,
              weight_decay,
              adam_epsilon,
              num_core_per_host,
              lr_layer_decay_rate,
              use_tpu,
              learning_rate,
              train_steps,
              min_lr_ratio,
              clip,
              **kwargs
          ):
              self.decay_method = decay_method
              self.warmup_steps = warmup_steps
              self.weight_decay = weight_decay
              self.adam_epsilon = adam_epsilon
              self.num_core_per_host = num_core_per_host
              self.lr_layer_decay_rate = lr_layer_decay_rate
              self.use_tpu = use_tpu
              self.learning_rate = learning_rate
              self.train_steps = train_steps
              self.min_lr_ratio = min_lr_ratio
              self.clip = clip

      training_parameters = Parameter(**training_parameters)
      init_checkpoint = 'alxlnet-base/model.ckpt-500000'


[19]: def model_fn(features, labels, mode, params):
          Y = tf.cast(features['label'][:, 0], tf.int32)

          xlnet_model = xlnet.XLNetModel(
              xlnet_config = xlnet_config,
              run_config = xlnet_parameters,
              input_ids = tf.transpose(features['input_id'], [1, 0]),
              seg_ids = tf.transpose(features['segment_id'], [1, 0]),
              input_mask = tf.transpose(features['input_mask'], [1, 0]),
          )

          output_layer = xlnet_model.get_sequence_output()
          output_layer = tf.transpose(output_layer, [1, 0, 2])

          logits_seq = tf.layers.dense(output_layer, 2)
          logits = logits_seq[:, 0]

          loss = tf.reduce_mean(
              tf.nn.sparse_softmax_cross_entropy_with_logits(
                  logits = logits, labels = Y
              )
          )

          tf.identity(loss, 'train_loss')

          accuracy = tf.metrics.accuracy(
              labels = Y, predictions = tf.argmax(logits, axis = 1)
          )
          tf.identity(accuracy[1], name = 'train_accuracy')

          variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

          assignment_map, initialized_variable_names = utils.get_assignment_map_from_checkpoint(
              variables, init_checkpoint
          )

          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

          if mode == tf.estimator.ModeKeys.TRAIN:
              train_op, _, _ = model_utils.get_train_op(training_parameters, loss)
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode = mode, loss = loss, train_op = train_op
              )

          elif mode == tf.estimator.ModeKeys.EVAL:
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode = tf.estimator.ModeKeys.EVAL,
                  loss = loss,
                  eval_metric_ops = {'accuracy': accuracy},
              )

          return estimator_spec
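Note the classification head here: xlnet_model.get_sequence_output() returns a representation for every token, and logits_seq[:, 0] slices a single time step out as the sentence representation. The BERT example in the next section uses model.get_pooled_output() instead.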


9.59.6 Initiate training session

[20]: train_dataset = get_dataset()

[21]: train_hooks = [
          tf.train.LoggingTensorHook(
              ['train_accuracy', 'train_loss'], every_n_iter = 1
          )
      ]
      utils.run_training(
          train_fn = train_dataset,
          model_fn = model_fn,
          model_dir = 'finetuned-alxlnet-base',
          num_gpus = 1,
          log_step = 1,
          save_checkpoint_step = epoch,
          max_steps = epoch,
          train_hooks = train_hooks,
      )
WARNING:tensorflow:From ../utils.py:62: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From ../utils.py:62: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

INFO:tensorflow:Using config: {'_model_dir': 'finetuned-alxlnet-base', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 1, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7a5c1b2e48>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/xlnet.py:253: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/xlnet.py:253: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/custom_modeling.py:696: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.


INFO:tensorflow:memory input None
INFO:tensorflow:Use float type
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/custom_modeling.py:703: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/custom_modeling.py:808: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/tensorflow_core/python/layers/core.py:271: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/custom_modeling.py:109: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow: name = model/transformer/r_w_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/r_r_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/word_embedding/lookup_table:0, shape = (32000, 128), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/word_embedding/lookup_table_2:0, shape = (128, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/r_s_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/seg_embed:0, shape = (12, 2, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/q/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/k/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/v/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/r/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/o/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*


INFO:tensorflow: name = model/transformer/layer_shared/ff/layer_1/kernel:0, shape = (768, 3072), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/ff/layer_1/bias:0, shape = (3072,), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/ff/layer_2/kernel:0, shape = (3072, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/ff/layer_2/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/ff/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/ff/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = dense/kernel:0, shape = (768, 2)
INFO:tensorflow: name = dense/bias:0, shape = (2,)
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/model_utils.py:105: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/model_utils.py:119: The name tf.train.polynomial_decay is deprecated. Please use tf.compat.v1.train.polynomial_decay instead.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/model_utils.py:136: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/model_utils.py:150: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into finetuned-alxlnet-base/model.ckpt.
INFO:tensorflow:train_accuracy = 0.34375, train_loss = 1.0174254
INFO:tensorflow:loss = 1.0174254, step = 1
INFO:tensorflow:global_step/sec: 0.0483137
INFO:tensorflow:train_accuracy = 0.4375, train_loss = 0.7347818 (20.699 sec)
INFO:tensorflow:loss = 0.7347818, step = 2 (20.698 sec)
INFO:tensorflow:global_step/sec: 0.0969523
INFO:tensorflow:train_accuracy = 0.4375, train_loss = 0.9868502 (10.314 sec)
INFO:tensorflow:loss = 0.9868502, step = 3 (10.315 sec)
INFO:tensorflow:global_step/sec: 0.139668
INFO:tensorflow:train_accuracy = 0.453125, train_loss = 0.7655573 (7.159 sec)
INFO:tensorflow:loss = 0.7655573, step = 4 (7.159 sec)
INFO:tensorflow:global_step/sec: 0.16895
INFO:tensorflow:train_accuracy = 0.48125, train_loss = 0.7466663 (5.919 sec)
INFO:tensorflow:loss = 0.7466663, step = 5 (5.920 sec)
INFO:tensorflow:global_step/sec: 0.131785
INFO:tensorflow:train_accuracy = 0.44270834, train_loss = 0.94823694 (7.588 sec)
INFO:tensorflow:loss = 0.94823694, step = 6 (7.588 sec)
INFO:tensorflow:global_step/sec: 0.15756
INFO:tensorflow:train_accuracy = 0.42410713, train_loss = 0.8999996 (6.347 sec)
INFO:tensorflow:loss = 0.8999996, step = 7 (6.346 sec)


INFO:tensorflow:global_step/sec: 0.144356
INFO:tensorflow:train_accuracy = 0.41796875, train_loss = 0.92889994 (6.927 sec)
INFO:tensorflow:loss = 0.92889994, step = 8 (6.927 sec)
INFO:tensorflow:global_step/sec: 0.121323
INFO:tensorflow:train_accuracy = 0.41666666, train_loss = 1.0723866 (8.242 sec)
INFO:tensorflow:loss = 1.0723866, step = 9 (8.242 sec)
INFO:tensorflow:Saving checkpoints for 10 into finetuned-alxlnet-base/model.ckpt.
INFO:tensorflow:global_step/sec: 0.130973
INFO:tensorflow:train_accuracy = 0.409375, train_loss = 1.0876663 (7.635 sec)
INFO:tensorflow:loss = 1.0876663, step = 10 (7.636 sec)
INFO:tensorflow:Loss for final step: 1.0876663.


9.60 Finetune BERT-Bahasa

This tutorial is available as an IPython notebook at Malaya/finetune/bert.

In this notebook, I am going to show how to finetune pretrained BERT-Bahasa using Tensorflow Estimator. TF-Estimator is a great module created by the Tensorflow team to train a model over a very long period.

[1]: # !pip3 install bert-tensorflow==1.0.1 tensorflow==1.15

9.60.1 Download pretrained model

https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/bert#download. In this example, we are going to try the BASE size. Just uncomment below to download the pretrained model and tokenizer.

[19]: # !wget https://f000.backblazeb2.com/file/malaya-model/bert-bahasa/bert-base-2020-10-08.tar.gz
      # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/bert/BERT.wordpiece
      # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/bert/config/BASE_config.json
      # !tar -zxf bert-base-2020-10-08.tar.gz
      !ls
BASE_config.json    bert-base-2020-10-08.tar.gz
BERT.wordpiece      tf-estimator-text-classification.ipynb
bert-base

[3]: !ls bert-base
model.ckpt-1000000.data-00000-of-00001  model.ckpt-1000000.meta
model.ckpt-1000000.index

There is a helper function in malaya/finetune/utils.py to help us train the model on a single GPU or multiple GPUs.


[4]: import sys

     sys.path.insert(0, '../')
     import utils

9.60.2 Load dataset

We are just going to train on a very small bahasa news sentiment dataset.

[5]: import pandas as pd

     df = pd.read_csv('../sentiment-data-v2.csv')
     df.head()
[5]:       label                                                text
     0  Negative   Lebih-lebih lagi dengan kemudahan internet da...
     1  Positive  boleh memberi teguran kepada parti tetapi perl...
     2  Negative  Adalah membingungkan mengapa masyarakat Cina b...
     3  Positive  Kami menurunkan defisit daripada 6.7 peratus p...
     4  Negative         Ini masalahnya. Bukan rakyat, tetapi sistem

[6]: labels = df['label'].values.tolist()
     texts = df['text'].values.tolist()
     unique_labels = sorted(list(set(labels)))
     unique_labels
[6]: ['Negative', 'Positive']

[7]: import tensorflow as tf
     import bert
     from bert import run_classifier
     from bert import optimization
     from bert import tokenization
     from bert import modeling
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/bert/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

[8]: tokenizer = tokenization.FullTokenizer(vocab_file = 'BERT.wordpiece', do_lower_case = False)
     tokens = tokenizer.tokenize('Husein Comel tersangat sangatlah')
     tokens
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/bert/tokenization.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

[8]: ['Husein', 'Comel', 'tersangat', 'sangatlah']

[9]: tokenizer.convert_tokens_to_ids(tokens) [9]: [31560, 17094, 26759, 30559]


[10]: def token_to_ids(text, maxlen = 512):
          tokens_a = tokenizer.tokenize(text)
          if len(tokens_a) > maxlen - 2:
              tokens_a = tokens_a[:(maxlen - 2)]
          tokens = ['[CLS]'] + tokens_a + ['[SEP]']
          segment_id = [0] * len(tokens)
          input_mask = [1] * len(tokens)
          input_id = tokenizer.convert_tokens_to_ids(tokens)
          return {'tokens': tokens, 'input_id': input_id,
                  'input_mask': input_mask, 'segment_id': segment_id}

1. tokens, tokenized words. 2. input_id, integer representation of tokenized words, sorted based on wordpiece weightage. 3. input_mask, attention masking. During training, shorter sequences will be padded with 0, and we do not want the model to learn padded values as part of the context. 4. segment_id, used for text pair classification; in this case, we can simply put 0.

[11]: token_to_ids(texts[0])
[11]: {'tokens': ['[CLS]', 'Lebih', '-', 'lebih', 'lagi', 'dengan', 'kemudahan', 'internet', 'dan', 'laman', 'sosial', ',', 'taktik', 'ini', 'semakin', 'mudah', 'dikembangkan', '.', '[SEP]'],
 'input_id': [2, 4015, 17, 2009, 2088, 1822, 5714, 6332, 1766, 3062, 3558, 16, 20153, 1828, 3718, 2766, 20018, 18, 3],
 'input_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

9.60.3 TF-Estimator

TF-Estimator requires 2 parts: 1. Input pipeline, https://www.tensorflow.org/api_docs/python/tf/data/Dataset 2. Model definition, https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator

9.60.4 Data pipeline

[12]: def generate():
          while True:
              for i in range(len(texts)):
                  if len(texts[i]) > 5:
                      d = token_to_ids(texts[i])
                      d['label'] = [unique_labels.index(labels[i])]
                      d.pop('tokens', None)
                      yield d

[13]: g = generate()
      next(g)
[13]: {'input_id': [2, 4015, 17, 2009, 2088, 1822, 5714, 6332, 1766, 3062, 3558, 16, 20153, 1828, 3718, 2766, 20018, 18, 3],
 'input_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'label': [0]}

It must be a function that returns a function.

def get_dataset(batch_size = 32, shuffle_size = 32):
    def get():
        return dataset
    return get


[14]: def get_dataset(batch_size = 32, shuffle_size = 32):
          def get():
              dataset = tf.data.Dataset.from_generator(
                  generate,
                  {'input_id': tf.int32, 'input_mask': tf.int32, 'segment_id': tf.int32, 'label': tf.int32},
                  output_shapes = {
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
              )
              dataset = dataset.shuffle(shuffle_size)
              dataset = dataset.padded_batch(
                  batch_size,
                  padded_shapes = {
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
                  padding_values = {
                      'input_id': tf.constant(0, dtype = tf.int32),
                      'input_mask': tf.constant(0, dtype = tf.int32),
                      'segment_id': tf.constant(0, dtype = tf.int32),
                      'label': tf.constant(0, dtype = tf.int32),
                  },
              )
              return dataset
          return get
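Note the padding values differ from the ALXLNET pipeline: here input_mask pads with 0 and segment_id with 0, following BERT's convention where 1 marks a real token, while the XLNet-style pipeline padded input_mask with 1.0 and segment_id with SEG_ID_PAD = 4.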

Test data pipeline using tf.Session

[15]: tf.reset_default_graph()
      sess = tf.InteractiveSession()
      iterator = get_dataset()()
      iterator = iterator.make_one_shot_iterator().get_next()
WARNING:tensorflow:From :4: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.

[16]: iterator [16]: {'input_id': , 'input_mask': , 'segment_id': , 'label': }

[17]: sess.run(iterator)


[17]: {'input_id': array([[ 2, 2009, 12237, ..., 0, 0, 0], [ 2, 3543, 7554, ..., 0, 0, 0], [ 2, 2007, 8065, ..., 0, 0, 0], ..., [ 2, 3566, 3841, ..., 0, 0, 0], [ 2, 3217, 1011, ..., 0, 0, 0], [ 2, 6009, 4177, ..., 0, 0, 0]], dtype=int32), 'input_mask': array([[1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0], ..., [1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0]], dtype=int32), 'segment_id': array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=int32), 'label': array([[0], [1], [1], [0], [0], [0], [1], [1], [0], [1], [0], [0], [1], [1], [1], [1], [1], [1], [0], [1], [1], [0], [1], [1], [0], [1], [1], [0], [1], [0], [0], [1]], dtype=int32)}


9.60.5 Model definition

It must be a function that accepts 4 parameters.

def model_fn(features, labels, mode, params):

[21]: bert_config = modeling.BertConfig.from_json_file('BASE_config.json')
      bert_config.__dict__
[21]: {'vocab_size': 32000,
 'hidden_size': 768,
 'num_hidden_layers': 12,
 'num_attention_heads': 12,
 'hidden_act': 'gelu',
 'intermediate_size': 3072,
 'hidden_dropout_prob': 0.1,
 'attention_probs_dropout_prob': 0.1,
 'max_position_embeddings': 512,
 'type_vocab_size': 2,
 'initializer_range': 0.02,
 'directionality': 'bidi',
 'pooler_fc_size': 768,
 'pooler_num_attention_heads': 12,
 'pooler_num_fc_layers': 3,
 'pooler_size_per_head': 128,
 'pooler_type': 'first_token_transform'}

[29]: epoch = 10
      warmup_proportion = 0.1
      num_warmup_steps = int(epoch * warmup_proportion)
      learning_rate = 2e-5
      init_checkpoint = 'bert-base/model.ckpt-1000000'

[33]: def model_fn(features, labels, mode, params):
          Y = tf.cast(features['label'][:, 0], tf.int32)

          model = modeling.BertModel(
              config = bert_config,
              is_training = True,
              input_ids = features['input_id'],
              input_mask = features['input_mask'],
              token_type_ids = features['segment_id'],
              use_one_hot_embeddings = False,
          )
          output_layer = model.get_pooled_output()
          logits = tf.layers.dense(output_layer, 2)
          loss = tf.reduce_mean(
              tf.nn.sparse_softmax_cross_entropy_with_logits(
                  logits = logits, labels = Y
              )
          )

          tf.identity(loss, 'train_loss')

          accuracy = tf.metrics.accuracy(
              labels = Y, predictions = tf.argmax(logits, axis = 1)
          )
          tf.identity(accuracy[1], name = 'train_accuracy')

          variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

          assignment_map, initialized_variable_names = utils.get_assignment_map_from_checkpoint(
              variables, init_checkpoint
          )

          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

          if mode == tf.estimator.ModeKeys.TRAIN:
              train_op = optimization.create_optimizer(loss, learning_rate, epoch, num_warmup_steps, False)
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode = mode, loss = loss, train_op = train_op
              )

          elif mode == tf.estimator.ModeKeys.EVAL:
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode = tf.estimator.ModeKeys.EVAL,
                  loss = loss,
                  eval_metric_ops = {'accuracy': accuracy},
              )

          return estimator_spec

9.60.6 Initiate training session

[35]: train_dataset = get_dataset()

[36]: train_hooks = [
          tf.train.LoggingTensorHook(
              ['train_accuracy', 'train_loss'], every_n_iter = 1
          )
      ]
      utils.run_training(
          train_fn = train_dataset,
          model_fn = model_fn,
          model_dir = 'finetuned-bert-base',
          num_gpus = 1,
          log_step = 1,
          save_checkpoint_step = epoch,
          max_steps = epoch,
          train_hooks = train_hooks,
      )
INFO:tensorflow:Using config: {'_model_dir': 'finetuned-bert-base', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 1, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fca3eb97080>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow: name = bert/embeddings/word_embeddings:0, shape = (32000, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/embeddings/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/query/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/query/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/key/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/key/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/value/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/value/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/output/dense/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/intermediate/dense/kernel:0, shape = (768, 3072), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/intermediate/dense/bias:0, shape = (3072,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/output/dense/kernel:0, shape = (3072, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_1/attention/self/query/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_1/attention/self/query/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_1/attention/self/key/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*


(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_1/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* (continues on next page)


(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_3/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* (continues on next page)


(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_4/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* (continues on next page)

9.60. Finetune BERT-Bahasa 585 malaya Documentation

(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_6/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* (continues on next page)

586 Chapter 9. Contents: malaya Documentation

(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_8/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/self/key/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* (continues on next page)

9.60. Finetune BERT-Bahasa 587 malaya Documentation

(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_10/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/output/LayerNorm/beta:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/key/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/output/LayerNorm/beta:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/pooler/dense/kernel:0, shape = (768, 768), *INIT_FROM_ ˓→CKPT* (continues on next page)

588 Chapter 9. Contents: malaya Documentation

INFO:tensorflow:  name = bert/pooler/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = dense/kernel:0, shape = (768, 2)
INFO:tensorflow:  name = dense/bias:0, shape = (2,)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into finetuned-bert-base/model.ckpt.
INFO:tensorflow:train_accuracy = 0.34375, train_loss = 0.7432811
INFO:tensorflow:loss = 0.7432811, step = 1
INFO:tensorflow:global_step/sec: 0.0707289
INFO:tensorflow:train_accuracy = 0.4375, train_loss = 1.6084869 (14.139 sec)
INFO:tensorflow:loss = 1.6084869, step = 2 (14.138 sec)
INFO:tensorflow:global_step/sec: 0.17299
INFO:tensorflow:train_accuracy = 0.5416667, train_loss = 0.71116924 (5.781 sec)
INFO:tensorflow:loss = 0.71116924, step = 3 (5.781 sec)
INFO:tensorflow:global_step/sec: 0.181334
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 3 vs previous value: 3. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
INFO:tensorflow:train_accuracy = 0.546875, train_loss = 0.6678002 (5.516 sec)
INFO:tensorflow:loss = 0.6678002, step = 4 (5.515 sec)
INFO:tensorflow:global_step/sec: 0.0801607
INFO:tensorflow:train_accuracy = 0.5125, train_loss = 1.4128941 (12.474 sec)
INFO:tensorflow:loss = 1.4128941, step = 5 (12.475 sec)
INFO:tensorflow:global_step/sec: 0.185281
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 5 vs previous value: 5. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
INFO:tensorflow:train_accuracy = 0.49479166, train_loss = 1.22251 (5.398 sec)
INFO:tensorflow:loss = 1.22251, step = 6 (5.398 sec)
INFO:tensorflow:global_step/sec: 0.14771
INFO:tensorflow:train_accuracy = 0.4955357, train_loss = 0.75944936 (6.769 sec)
INFO:tensorflow:loss = 0.75944936, step = 7 (6.769 sec)
INFO:tensorflow:global_step/sec: 0.129142
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 7 vs previous value: 7. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
INFO:tensorflow:train_accuracy = 0.52734375, train_loss = 0.4374127 (7.745 sec)
INFO:tensorflow:loss = 0.4374127, step = 8 (7.745 sec)
INFO:tensorflow:global_step/sec: 0.185809
INFO:tensorflow:train_accuracy = 0.5590278, train_loss = 0.47080472 (5.380 sec)
INFO:tensorflow:loss = 0.47080472, step = 9 (5.380 sec)
INFO:tensorflow:Saving checkpoints for 10 into finetuned-bert-base/model.ckpt.
INFO:tensorflow:global_step/sec: 0.122564
INFO:tensorflow:train_accuracy = 0.5625, train_loss = 0.6999684 (8.159 sec)
INFO:tensorflow:loss = 0.6999684, step = 10 (8.160 sec)
INFO:tensorflow:Loss for final step: 0.6999684.

[ ]:


9.61 Finetune XLNET-Bahasa

This tutorial is available as an IPython notebook at Malaya/finetune/xlnet.

In this notebook, I am going to show how to finetune a pretrained XLNET-Bahasa using Tensorflow Estimator. TF-Estimator is a great module created by the Tensorflow team to train a model over a very long period.

[2]: # !pip3 install tensorflow==1.15 xlnet-tensorflow

9.61.1 Download pretrained model

https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/xlnet#download. In this example, we are going to try the BASE size. Just uncomment below to download the pretrained model and tokenizer.

[4]: # !wget https://f000.backblazeb2.com/file/malaya-model/bert-bahasa/xlnet-base-500k-20-10-2020.gz
     # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/preprocess/sp10m.cased.v9.model
     # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/xlnet/config/xlnet-base_config.json
     # !tar -zxf xlnet-base-500k-20-10-2020.gz
     !ls
sp10m.cased.v9.model                     xlnet-base-500k-20-10-2020.gz
tf-estimator-text-classification.ipynb   xlnet-base_config.json
xlnet-base

[5]: !ls xlnet-base
model.ckpt-500000.data-00000-of-00001  model.ckpt-500000.meta
model.ckpt-500000.index                xlnet-base_config.json

There is a helper module, malaya/finetune/utils.py, to help us train the model on a single GPU or multiple GPUs.

[6]: import sys

     sys.path.insert(0, '../')
     import utils

9.61.2 Load dataset

We are just going to train on a very small bahasa news sentiment dataset.

[7]: import pandas as pd

     df = pd.read_csv('../sentiment-data-v2.csv')
     df.head()
[7]:       label                                               text
     0  Negative  Lebih-lebih lagi dengan kemudahan internet da...
     1  Positive  boleh memberi teguran kepada parti tetapi perl...
     2  Negative  Adalah membingungkan mengapa masyarakat Cina b...
     3  Positive  Kami menurunkan defisit daripada 6.7 peratus p...
     4  Negative  Ini masalahnya. Bukan rakyat, tetapi sistem

[8]: labels = df['label'].values.tolist()
     texts = df['text'].values.tolist()
     unique_labels = sorted(list(set(labels)))
     unique_labels
[8]: ['Negative', 'Positive']

[10]: import numpy as np
      import tensorflow as tf
      from xlnet import model_utils
      from xlnet import xlnet
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/model_utils.py:295: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

[11]: import sentencepiece as spm
      from xlnet.prepro_utils import preprocess_text, encode_ids

      sp_model = spm.SentencePieceProcessor()
      sp_model.Load('sp10m.cased.v9.model')

      SEG_ID_A = 0
      SEG_ID_B = 1
      SEG_ID_CLS = 2
      SEG_ID_SEP = 3
      SEG_ID_PAD = 4

      # the special token strings follow the original XLNet vocabulary
      special_symbols = {
          '<unk>': 0,
          '<s>': 1,
          '</s>': 2,
          '<cls>': 3,
          '<sep>': 4,
          '<pad>': 5,
          '<mask>': 6,
          '<eod>': 7,
          '<eop>': 8,
      }

      VOCAB_SIZE = 32000
      UNK_ID = special_symbols['<unk>']
      CLS_ID = special_symbols['<cls>']
      SEP_ID = special_symbols['<sep>']
      MASK_ID = special_symbols['<mask>']
      EOD_ID = special_symbols['<eod>']

      def tokenize_fn(text):
          text = preprocess_text(text, lower=False)
          return encode_ids(sp_model, text)


      def token_to_ids(text, maxlen=512):
          tokens_a = tokenize_fn(text)
          if len(tokens_a) > maxlen - 2:
              tokens_a = tokens_a[: maxlen - 2]
          segment_id = [SEG_ID_A] * len(tokens_a)
          tokens_a.append(SEP_ID)
          tokens_a.append(CLS_ID)
          segment_id.append(SEG_ID_A)
          segment_id.append(SEG_ID_CLS)
          input_mask = [0.0] * len(tokens_a)
          assert len(tokens_a) == len(input_mask) == len(segment_id)
          return {
              'input_id': tokens_a,
              'input_mask': input_mask,
              'segment_id': segment_id,
          }

1. input_id, integer representation of tokenized words, sorted based on sentencepiece weightage.
2. input_mask, attention masking. Short sentences will be padded with 1, because we do not want the model to learn the padded values as part of the context. https://github.com/zihangdai/xlnet/blob/master/classifier_utils.py#L113
3. segment_id, used for text pair classification; in this case, we can simply put 0.
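To make the masking concrete, here is a small plain-Python sketch (not part of the original notebook) that pads a short token_to_ids() output to a fixed length, using the same padding values as the padded_batch call defined later (input_id padded with 0, input_mask with 1.0, segment_id with SEG_ID_PAD = 4):

def pad_example(d, maxlen):
    # pad every field of a token_to_ids()-style dict up to maxlen
    pad = maxlen - len(d['input_id'])
    return {
        'input_id': d['input_id'] + [0] * pad,
        'input_mask': d['input_mask'] + [1.0] * pad,
        'segment_id': d['segment_id'] + [4] * pad,
    }

pad_example({'input_id': [1620, 13, 4, 3],
             'input_mask': [0.0, 0.0, 0.0, 0.0],
             'segment_id': [0, 0, 0, 2]}, maxlen=6)
# -> {'input_id': [1620, 13, 4, 3, 0, 0],
#     'input_mask': [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
#     'segment_id': [0, 0, 0, 2, 4, 4]}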

[12]: token_to_ids(texts[0])
[12]: {'input_id': [1620, 13, 5177, 53, 33, 2808, 3168, 24, 3400, 807, 21, 16179, 31, 742, 578, 17153, 9, 4, 3],
      'input_mask': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]}

9.61.3 TF-Estimator

TF-Estimator requires 2 parts:

1. an input pipeline, https://www.tensorflow.org/api_docs/python/tf/data/Dataset
2. a model definition, https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator

[13]: def generate():
          while True:
              for i in range(len(texts)):
                  if len(texts[i]) > 5:
                      d = token_to_ids(texts[i])
                      d['label'] = [unique_labels.index(labels[i])]
                      d.pop('tokens', None)
                      yield d

[14]: g = generate()
      next(g)
[14]: {'input_id': [1620, 13, 5177, 53, 33, 2808, 3168, 24, 3400, 807, 21, 16179, 31, 742, 578, 17153, 9, 4, 3],
      'input_mask': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
      'label': [0]}

It must be a function that returns a function:

def get_dataset(batch_size=32, shuffle_size=32):
    def get():
        return dataset
    return get

[15]: def get_dataset(batch_size=32, shuffle_size=32):
          def get():
              dataset = tf.data.Dataset.from_generator(
                  generate,
                  {'input_id': tf.int32, 'input_mask': tf.float32,
                   'segment_id': tf.int32, 'label': tf.int32},
                  output_shapes={
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
              )
              dataset = dataset.shuffle(shuffle_size)
              dataset = dataset.padded_batch(
                  batch_size,
                  padded_shapes={
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
                  padding_values={
                      'input_id': tf.constant(0, dtype=tf.int32),
                      'input_mask': tf.constant(1.0, dtype=tf.float32),
                      'segment_id': tf.constant(4, dtype=tf.int32),
                      'label': tf.constant(0, dtype=tf.int32),
                  },
              )
              return dataset
          return get


Test the data pipeline using tf.Session.

[17]: tf.reset_default_graph()
      sess = tf.InteractiveSession()
      iterator = get_dataset()()
      iterator = iterator.make_one_shot_iterator().get_next()
WARNING:tensorflow:From <ipython-input>:4: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.

[18]: iterator
[18]: {'input_id': <tf.Tensor 'IteratorGetNext:0' shape=(?, ?) dtype=int32>,
      'input_mask': <tf.Tensor 'IteratorGetNext:1' shape=(?, ?) dtype=float32>,
      'segment_id': <tf.Tensor 'IteratorGetNext:3' shape=(?, ?) dtype=int32>,
      'label': <tf.Tensor 'IteratorGetNext:2' shape=(?, ?) dtype=int32>}

[19]: sess.run(iterator)
[19]: {'input_id': array([[1084,  791,  835, ...,    0,    0,    0],
             [ 256, 8993,    9, ...,    0,    0,    0],
             [8110,   87, 1743, ...,    0,    0,    0],
             ...,
             [ 767,  250,   51, ...,    0,    0,    0],
             [ 398, 8269,  742, ...,    9,    4,    3],
             [3593,   21, 7901, ...,    0,    0,    0]], dtype=int32),
      'input_mask': array([[0., 0., 0., ..., 1., 1., 1.],
             [0., 0., 0., ..., 1., 1., 1.],
             [0., 0., 0., ..., 1., 1., 1.],
             ...,
             [0., 0., 0., ..., 1., 1., 1.],
             [0., 0., 0., ..., 0., 0., 0.],
             [0., 0., 0., ..., 1., 1., 1.]], dtype=float32),
      'segment_id': array([[0, 0, 0, ..., 4, 4, 4],
             [0, 0, 0, ..., 4, 4, 4],
             [0, 0, 0, ..., 4, 4, 4],
             ...,
             [0, 0, 0, ..., 4, 4, 4],
             [0, 0, 0, ..., 0, 0, 2],
             [0, 0, 0, ..., 4, 4, 4]], dtype=int32),
      'label': array([[0], [0], [0], [1], [0], [1], [0], [1], [0], [1], [1], [0], [1],
             [1], [1], [0], [1], [0], [1], [1], [1], [1], [1], [1], [1], [0], [0],
             [1], [0], [1], [0], [1]], dtype=int32)}

9.61.4 Model definition

It must be a function that accepts 4 parameters:

def model_fn(features, labels, mode, params):

[22]: kwargs = dict(
          is_training=True,
          use_tpu=False,
          use_bfloat16=False,
          dropout=0.1,
          dropatt=0.1,
          init='normal',
          init_range=0.1,
          init_std=0.05,
          clamp_len=-1,
      )

      xlnet_parameters = xlnet.RunConfig(**kwargs)
      xlnet_config = xlnet.XLNetConfig(json_path='xlnet-base_config.json')
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/xlnet.py:64: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

[26]: epoch = 10
      batch_size = 32
      warmup_proportion = 0.1
      num_train_steps = 10
      num_warmup_steps = int(num_train_steps * warmup_proportion)
      learning_rate = 2e-5

      training_parameters = dict(
          decay_method='poly',
          train_steps=num_train_steps,
          learning_rate=learning_rate,
          warmup_steps=num_warmup_steps,
          min_lr_ratio=0.0,
          weight_decay=0.00,
          adam_epsilon=1e-8,
          num_core_per_host=1,
          lr_layer_decay_rate=1,
          use_tpu=False,
          use_bfloat16=False,
          dropout=0.0,
          dropatt=0.0,
          init='normal',
          init_range=0.1,
          init_std=0.05,
          clip=1.0,
          clamp_len=-1,
      )

[27]: class Parameter:
          def __init__(
              self,
              decay_method,
              warmup_steps,
              weight_decay,
              adam_epsilon,
              num_core_per_host,
              lr_layer_decay_rate,
              use_tpu,
              learning_rate,
              train_steps,
              min_lr_ratio,
              clip,
              **kwargs
          ):
              self.decay_method = decay_method
              self.warmup_steps = warmup_steps
              self.weight_decay = weight_decay
              self.adam_epsilon = adam_epsilon
              self.num_core_per_host = num_core_per_host
              self.lr_layer_decay_rate = lr_layer_decay_rate
              self.use_tpu = use_tpu
              self.learning_rate = learning_rate
              self.train_steps = train_steps
              self.min_lr_ratio = min_lr_ratio
              self.clip = clip

      training_parameters = Parameter(**training_parameters)
      init_checkpoint = 'xlnet-base/model.ckpt-500000'

[28]: def model_fn(features, labels, mode, params):
          Y = tf.cast(features['label'][:, 0], tf.int32)

          xlnet_model = xlnet.XLNetModel(
              xlnet_config=xlnet_config,
              run_config=xlnet_parameters,
              input_ids=tf.transpose(features['input_id'], [1, 0]),
              seg_ids=tf.transpose(features['segment_id'], [1, 0]),
              input_mask=tf.transpose(features['input_mask'], [1, 0]),
          )

          output_layer = xlnet_model.get_sequence_output()
          output_layer = tf.transpose(output_layer, [1, 0, 2])

          logits_seq = tf.layers.dense(output_layer, 2)
          logits = logits_seq[:, 0]

          loss = tf.reduce_mean(
              tf.nn.sparse_softmax_cross_entropy_with_logits(
                  logits=logits, labels=Y
              )
          )

          tf.identity(loss, 'train_loss')

          accuracy = tf.metrics.accuracy(
              labels=Y, predictions=tf.argmax(logits, axis=1)
          )
          tf.identity(accuracy[1], name='train_accuracy')

          variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

          assignment_map, initialized_variable_names = utils.get_assignment_map_from_checkpoint(
              variables, init_checkpoint
          )

          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

          if mode == tf.estimator.ModeKeys.TRAIN:
              train_op, _, _ = model_utils.get_train_op(training_parameters, loss)
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode=mode, loss=loss, train_op=train_op
              )

          elif mode == tf.estimator.ModeKeys.EVAL:
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode=tf.estimator.ModeKeys.EVAL,
                  loss=loss,
                  eval_metric_ops={'accuracy': accuracy},
              )

          return estimator_spec


9.61.5 Initiate training session

[29]: train_dataset = get_dataset()

[ ]: train_hooks = [
         tf.train.LoggingTensorHook(
             ['train_accuracy', 'train_loss'], every_n_iter=1
         )
     ]
     utils.run_training(
         train_fn=train_dataset,
         model_fn=model_fn,
         model_dir='finetuned-xlnet-base',
         num_gpus=1,
         log_step=1,
         save_checkpoint_step=epoch,
         max_steps=epoch,
         train_hooks=train_hooks,
     )
WARNING:tensorflow:From ../utils.py:62: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
WARNING:tensorflow:From ../utils.py:62: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
INFO:tensorflow:Using config: {'_model_dir': 'finetuned-xlnet-base', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options { rewrite_options { meta_optimizer_iterations: ONE } }
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 1, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f31fb236fd0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/xlnet.py:221: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/xlnet.py:221: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/modeling.py:453: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:memory input None
INFO:tensorflow:Use float type <dtype: 'float32'>
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/modeling.py:460: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/modeling.py:535: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/tensorflow_core/python/layers/core.py:271: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/modeling.py:67: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = model/transformer/r_w_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/r_r_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/word_embedding/lookup_table:0, shape = (32000, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/r_s_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/seg_embed:0, shape = (12, 2, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/q/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/k/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/v/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/r/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/o/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/layer_1/kernel:0, shape = (768, 3072), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/layer_1/bias:0, shape = (3072,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/layer_2/kernel:0, shape = (3072, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/layer_2/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
[... the same *INIT_FROM_CKPT* entries repeat for model/transformer/layer_1 through model/transformer/layer_11 ...]
INFO:tensorflow:  name = dense/kernel:0, shape = (768, 2)
INFO:tensorflow:  name = dense/bias:0, shape = (2,)
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/model_utils.py:96: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/model_utils.py:108: The name tf.train.polynomial_decay is deprecated. Please use tf.compat.v1.train.polynomial_decay instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/model_utils.py:123: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/model_utils.py:131: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into finetuned-xlnet-base/model.ckpt.
INFO:tensorflow:train_accuracy = 0.5, train_loss = 0.8626036
INFO:tensorflow:loss = 0.8626036, step = 1

[ ]:

9.62 Crawler

There is no official compiled package for the Crawler inside Malaya, but it’s included in the repository.


9.62.1 From Source

The crawler is actively developed on Github. You need to clone the public repo:

git clone https://github.com/huseinzol05/malaya

You need to install dependencies before you are able to use the crawler.

For ubuntu / debian based:

pip3 install bs4 newspaper3k fake_useragent unidecode
apt-get install libxml2-dev libxslt-dev libjpeg-dev zlib1g-dev libpng12-dev -y
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

For Mac OS:

brew install libxml2 libxslt
brew install libtiff libjpeg webp little-cms2
pip3 install bs4 newspaper3k fake_useragent unidecode
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

And start the crawler:

python3 crawl/main.py -i "isu mahathir" -s 2009 -e 2019 -l 10

9.62.2 Get Help

You can check the help for the crawler:

python3 crawl/main.py --help

usage: main.py [-h] -i ISSUE -s START -e END -l LIMIT [-p SLEEP] [-m MALAYA]

optional arguments:
  -h, --help            show this help message and exit
  -i ISSUE, --issue ISSUE
                        issue to search
  -s START, --start START
                        year start to crawl
  -e END, --end END     year end to crawl
  -l LIMIT, --limit LIMIT
                        limit of articles to crawl
  -p SLEEP, --sleep SLEEP
                        seconds to sleep for every 10 articles
  -m MALAYA, --malaya MALAYA
                        boolean to use Malaya


9.62.3 How to start

python3 crawl/main.py -i "isu mahathir" -s 2009 -e 2019 -l 10

The data will be saved inside crawl/ as .json. An example of a returned result:

{ "title":"Mahathir alergik isu agama", "url":"http://www.utusan.com.my/berita/politik/mahathir-alergik-isu-agama-1.

˓→583393", "authors":[ "Muhammad Hasif Idris" ], "top-image":"http://www.utusan.com.my/polopoly_fs/1.557597!/image/image.jpg_gen/

˓→derivatives/landscape_650/image.jpg", "text":"KOTA BHARU 2 Jan. \u2013 Pas menyifatkan tindakan Pengerusi Parti

˓→Pribumi Bersatu Malaysia (PPBM), Tun Dr. Mahathir Mohamad seolah-olah alergik

˓→dengan isu agama kerana sering mengeluarkan kenyataan yang menyerlahkan

˓→kejahilannya sendiri.\n\nNaib Presiden Pas, Datuk Mohd. Amar Nik Abdullah berkata,

˓→pandangan yang diberikan oleh Dr. Mahathir menunjukkan beliau tidak boleh menerima

˓→hakikat sebenar yang berlaku.\n\n\u201cSejak dahulu lagi, bila beliau (Dr.

˓→Mahathir) cakap bab agama, tak layak pun, bukan saya hendak merendah-rendahkannya.

˓→Namun, beliau tiada kelayakan untuk bercakap, lagi baik diam, apabila bercakap

˓→nampak kejahilan diri sendiri.\n\n\u201cMalah, beliau seolah-olah alergik dengan

˓→isu agama, apabila memberikan respon nampak keras, macam tidak boleh terima. Saya

˓→tidak tahu apa perasaan sebenar beliau sebab sejak dari dahulu lagi dia tidak suka

˓→Pas, orang UMNO mana hendak suka Pas,\u201d katanya.\n\nBeliau berkata demikian

˓→ketika ditemui pemberita selepas Majlis Amanat Khas Tahun Baharu 2018 dan

˓→Perhimpunan Penjawat Awam Kelantan di Kompleks Kota Darul Naim di sini hari ini.\n\

˓→nYang turut hadir Menteri Besar, Datuk Ahmad Yakob. - UTUSAN ONLINE", "keyword":[ "kota", "nampak", "pas", "mahathir", "sebenar", "alergik isu agama" ], "summary":"beliau ditemui pemberita majlis amanat khas tahun baharu perhimpunan

˓→penjawat awam kelantan kompleks kota darul naim. yang hadir menteri besar datuk

˓→ahmad yakob. utusan online", "news":"", "date":"01-01-2018", "language":"MALAY" }


9.62.4 Parameters

issue: (string) An issue or search term you want to crawl; if your search is a sentence, you need to include double quotes, "isu terkini".
start: (int) Year of news to start from, eg, 2009.
end: (int) Year of news to end at, eg, 2020.
limit: (int) Limit of news you want to crawl, eg, if you put 100, you will get more or less 100 articles.
sleep: (int) Seconds to let the crawler sleep to prevent an IP block, eg, 10 represents 10 seconds.
malaya: (bool) Boolean to use Malaya; if False, summary and language will not be returned, but Malaya is not required to be installed on the local machine.
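Putting the parameters together, a full invocation might look like the command below; the issue, year range, limit, and the boolean string for -m are purely illustrative:

python3 crawl/main.py -i "isu terkini" -s 2015 -e 2020 -l 100 -p 10 -m true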

9.63 Donation

9.63.1 Special shoutout for donators

1. Ahmad Syazwan, since Jul 6, 2019.
2. Ang Chin Han, since Jul 7, 2019.
3. Hafiz Azmi, since Jun 4, 2020.
4. Lee Fai, since Jun 4, 2020.
5. Norhidayah Azman, since Jul 6, 2019.
6. Rashidee Mohd Rashid, since Jun 4, 2020.

9.63.2 Patreon

Support Malaya and Malay-Dataset development on Patreon.

9.63. Donation 609 malaya Documentation

9.63.3 BuyMeACoffee

Support Malaya and Malay-Dataset development on BuyMeACoffee.

9.63.4 Initiatives

1. Pay for B2 storage and egress! I store checkpoints and big datasets inside B2.
2. Pay linguists to validate my dataset and corpus, improving active learning.
3. Maintaining and developing new features for all these projects takes a considerable amount of time, and I am currently exploring the possibility of working on Malaya full time.

9.64 How Malaya gathered corpus?

Note: This tutorial is available as an IPython notebook here.

We use a translator to translate from a validated English dataset to a Bahasa dataset. Everyone agrees that Google Translate is the best online translator in this world, but the problem is that subscribing to the API from Google Cloud is insanely expensive. The good thing about https://translate.google.com/ is that it is open to the public internet! So we just coded a headless browser using Selenium with PhantomJS as the backbone, that's all! You can check the source code here, translator/.

from translate_selenium import Translate, Translate_Concurrent
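For intuition, here is a minimal sketch of the headless-browser idea; this is not the actual code in translator/, it assumes an older Selenium release that still ships the PhantomJS driver, and the CSS selector is a hypothetical placeholder:

from selenium import webdriver

# Sketch only: drive the public translate.google.com page without a screen.
# webdriver.PhantomJS existed in older Selenium releases; it was removed in Selenium 4.
driver = webdriver.PhantomJS()
driver.get('https://translate.google.com/#en/ms/how%20are%20you')
# The translated string is then scraped out of the rendered DOM; the selector
# depends on the page markup at the time and is a placeholder here.
result = driver.find_element_by_css_selector('.translated-text').text
driver.quit()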

9.64.1 Translate a sentence

with open('sample-joy') as fopen:
    dataset = list(filter(None, fopen.read().split('\n')))

len(dataset)

18

translator = Translate(from_lang='en', to_lang='ms')

You can get the list of supported languages here: https://cloud.google.com/translate/docs/languages

%%time
translator.translate(dataset[0])

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.23 s

'seorang lelaki yang saya mengagumi begitu banyak meminta saya untuk pergi bersamanya'

1.23 seconds; that is a very long time to translate a single sentence. What if you have 100k sentences? It will cost you around 123,000 seconds! Far too long to wait! So, we provide a multithreaded translator to concurrently translate multiple sentences.
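A quick back-of-envelope check of that number, assuming ~1.23 seconds per sentence for a single translator and, as an illustration, 10 translators running in parallel:

# Rough throughput estimate, not a benchmark.
seconds_per_sentence = 1.23
n_sentences = 100_000

single = seconds_per_sentence * n_sentences   # 123,000 s, roughly 34 hours
parallel = single / 10                        # ~12,300 s with 10 translators
print(f'1 translator: {single:,.0f} s, 10 translators: {parallel:,.0f} s')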

9.64.2 Translate batch of strings

translators = Translate_Concurrent(batch_size=3, from_lang='en', to_lang='ms')

%%time
translators.translate_batch(dataset[:3])

100%|| 1/1 [00:01<00:00, 1.44s/it]

CPU times: user 8 ms, sys: 12 ms, total: 20 ms
Wall time: 1.44 s

['kawan yang sudah berkahwin rapat hanya mempunyai anak pertamanya',
 'pengenalan rapat menangis untuk saya saya merasa gembira kerana ada yang peduli',
 'seorang lelaki yang saya mengagumi begitu banyak meminta saya untuk pergi bersamanya']

See, we translated 3 sentences in almost the same wall time as a single sentence. You can increase batch_size to any size you want; the limit is now your machine spec, and this method should not get your IP blocked by Google. Malaya has already tested it on more than 300k sentences. Remember, 1 translator takes quite a toll; here I spawned 10 translators, as seen from my top,

  PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
14628 husein   20   0 3175700 398980 43036 S  33.6  2.4  5:38.05 phantomjs
14652 husein   20   0 3188824 408880 43084 S  29.9  2.5  5:34.62 phantomjs
14489 husein   20   0 3204708 411520 43064 S  28.6  2.5  5:35.29 phantomjs
14466 husein   20   0 3171668 400304 43008 S  24.6  2.5  5:26.74 phantomjs
14443 husein   20   0 3181056 403228 42916 S  21.9  2.5  5:26.24 phantomjs
14512 husein   20   0 3187592 416036 42956 S  20.3  2.6  5:30.03 phantomjs
14558 husein   20   0 3206104 419800 43640 S  19.9  2.6  5:30.76 phantomjs
14535 husein   20   0 3179416 405508 43196 S  18.3  2.5  5:27.54 phantomjs
14420 husein   20   0 3202472 422448 43064 S  17.6  2.6  5:26.78 phantomjs
14581 husein   20   0 3181132 401892 43056 S  16.3  2.5  5:33.48 phantomjs

A single translator costs me around,

  PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
14628 husein   20   0 3175700 398980 43036 S  33.6  2.4  5:38.05 phantomjs

My machine specifications,

H/W path Device Class Description ======system G1.Sniper H6 (To be filled by O.E.M.) /0 bus G1.Sniper H6 /0/3d processor Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz (continues on next page)


/0/42                 memory      16GiB System Memory
/0/42/0               memory      DIMM [empty]
/0/42/1               memory      8GiB DIMM DDR3 Synchronous 1600 MHz (0.6ns)
/0/42/2               memory      DIMM [empty]
/0/42/3               memory      8GiB DIMM DDR3 Synchronous 1600 MHz (0.6ns)
/0/100                bridge      4th Gen Core Processor DRAM Controller
/0/100/1              bridge      Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller
/0/100/1/0            display     GM206 [GeForce GTX 960]
/0/100/1/0.1          multimedia  NVIDIA Corporation

So, beware of your machine!
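
As a rough capacity check before deciding how many translators to spawn, you can size the pool from the top output above, where each PhantomJS instance holds roughly 400 MB resident:

# back-of-envelope sizing, assuming ~400 MB RES per PhantomJS instance,
# the figure observed in the top output above
ram_gb = 16                    # this machine's RAM
per_instance_mb = 400          # approximate RES per phantomjs process
max_translators = (ram_gb * 1024) // per_instance_mb
print(max_translators)         # ~40, before leaving headroom for anything else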

9.65 References

9.65.1 Recurrent neural network

Malaya uses Long Short-Term Memory (LSTM) cells for all RNN gates.

LSTM References:

1. Hochreiter, Sepp; Schmidhuber, Jürgen (1997-11-01). “Long Short-Term Memory”. Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735.

Malaya uses the recurrent neural network architecture in some models.

Sentiment Analysis

1. malaya.deep_sentiment('luong')
2. malaya.deep_sentiment('bahdanau')
3. malaya.deep_sentiment('hierarchical')

Toxicity Analysis

1. malaya.deep_toxic('luong')
2. malaya.deep_toxic('bahdanau')
3. malaya.deep_toxic('hierarchical')


Entities Recognition

1. malaya.deep_entities('entity-network')

POS Recognition

1. malaya.deep_pos('entity-network')

Stemmer

1. malaya.deep_stemmer()

You can read more about Recurrent Neural Network here.
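
For context, these legacy factory calls followed the same load-then-call pattern as the rest of the Malaya API. A hedged sketch; the predict interface shown here is an assumption, not a documented guarantee for these old models:

import malaya

# load a legacy attention-based RNN sentiment model; 'bahdanau' is one of
# the variants listed above (the predict method shown is an assumption)
model = malaya.deep_sentiment('bahdanau')
print(model.predict('kerajaan sebenarnya sangat sayangkan rakyatnya'))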

References

1. Li, Xiangang; Wu, Xihong (2014-10-15). “Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition”. arXiv:1410.4281 [cs.CL].
2. Hochreiter, Sepp; Schmidhuber, Jürgen (1997-11-01). “Long Short-Term Memory”. Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735.
3. Schmidhuber, Jürgen (January 2015). “Deep Learning in Neural Networks: An Overview”. Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637.

9.65.2 Bidirectional recurrent neural network

Malaya uses Long Short-Term Memory (LSTM) cells for all BiRNN gates.

LSTM References:

1. Hochreiter, Sepp; Schmidhuber, Jürgen (1997-11-01). “Long Short-Term Memory”. Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735.

Malaya uses bidirectional recurrent neural networks in some models.

Sentiment Analysis

1. malaya.deep_sentiment('bidirectional')

Entities Recognition

1. malaya.deep_entities('concat')
2. malaya.deep_entities('bahdanau')
3. malaya.deep_entities('luong')


POS Recognition

1. malaya.deep_pos('concat')
2. malaya.deep_pos('bahdanau')
3. malaya.deep_pos('luong')

Normalizer

1. malaya.deep_normalizer()

Topics & Influencers Analysis

1. malaya.deep_siamese_get_topics()
2. malaya.deep_siamese_get_influencers()
3. malaya.deep_get_topics()
4. malaya.deep_get_influencers()

Summarization

1. malaya.summarize_deep_learning()

You can read more about Bidirectional Recurrent Neural Network here.

References

1. M. Schuster, K.K. Paliwal: “Bidirectional recurrent neural networks”, November 1997; https://ieeexplore.ieee.org/document/650093

9.65.3 Seq2Seq

Malaya uses Seq2Seq in some models.

Normalizer

1. malaya.deep_normalizer()

Stemmer

1. malaya.deep_stemmer()

You can read more about Seq2Seq here.


References

1. Ilya Sutskever, Oriol Vinyals: “Sequence to Sequence Learning with Neural Networks”, 2014; arXiv:1409.3215, http://arxiv.org/abs/1409.3215.

9.65.4 Conditional Random Field

Malaya uses Conditional Random Fields (CRF) in some models.

Entities Recognition

1. malaya.deep_entities('concat')
2. malaya.deep_entities('bahdanau')
3. malaya.deep_entities('luong')
4. malaya.deep_entities('entity-network')

POS Recognition

1. malaya.deep_pos('concat')
2. malaya.deep_pos('bahdanau')
3. malaya.deep_pos('luong')
4. malaya.deep_pos('entity-network')

You can read more about CRF here.

References

1. Zhiheng Huang, Wei Xu: “Bidirectional LSTM-CRF Models for Sequence Tagging”, 2015; arXiv:1508.01991, http://arxiv.org/abs/1508.01991.

9.65.5 BERT (Deep Bidirectional Transformers)

Malaya uses BERT in some models.

Sentiment Analysis

1. malaya.deep_sentiment('bert')


References

1. Jacob Devlin, Ming-Wei Chang, Kenton Lee: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018; arXiv:1810.04805, http://arxiv.org/abs/1810.04805.

9.65.6 Entity-Network

Malaya uses Entity-Network in some models.

Sentiment Analysis

1. malaya.deep_sentiment('entity-network')

Toxicity Analysis

1. malaya.deep_toxic('entity-network')

Entities Recognition

1. malaya.deep_entities('entity-network')

POS Recognition

1. malaya.deep_pos('entity-network')

References

1. Andrea Madotto: “Question Dependent Recurrent Entity Network for Question Answering”, 2017; arXiv:1707.07922, http://arxiv.org/abs/1707.07922.

9.65.7 Skip-thought Vector

Malaya uses skip-thought vectors in some models.

Summarization

1. malaya.summarize_deep_learning()


Topics & Influencers Analysis

1. malaya.deep_get_topics()
2. malaya.deep_get_influencers()

References

1. Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun: “Skip-Thought Vectors”, 2015; arXiv:1506.06726, http://arxiv.org/abs/1506.06726.

9.65.8 Siamese Network

Malaya uses siamese networks in some models.

Topics & Influencers Analysis

1. malaya.deep_siamese_get_topics()
2. malaya.deep_siamese_get_influencers()

References

1. Anfeng He, Chong Luo, Xinmei Tian: “A Twofold Siamese Network for Real-Time Object Tracking”, 2018; arXiv:1802.08817, http://arxiv.org/abs/1802.08817.

9.65.9 Normalizer

References

1. N. Samsudin, Mazidah Puteh, Abdul Razak Hamdan, Mohd Zakree Ahmad Nazri: “Normalization of noisy texts in Malaysian online reviews”; https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews

9.65.10 XGBoost

Malaya uses XGBoost in some models.

Sentiment Analysis

1. malaya.sentiment.pretrained_xgb_sentiment()


Language Detection

1. malaya.xgb_detect_languages()

References

1. Tianqi Chen: “XGBoost: A Scalable Tree Boosting System”, 2016; arXiv:1603.02754, http://arxiv.org/abs/1603.02754. DOI: 10.1145/2939672.2939785.

9.65.11 Multinomial

Malaya uses multinomial Naive Bayes in some models.

Sentiment Analysis

1. malaya.sentiment.pretrained_bayes_sentiment()

Language Detection

1. malaya.multinomial_detect_languages()

Toxicity Analysis

1. malaya.multinomial_detect_toxic()

References

1. https://medium.com/@johnm.kovachi/implementing-a-multinomial-naive-bayes-classifier-from-scratch-with-python-e70de6a3b92e

9.65.12 Logistic Regression

Malaya uses logistic regression in some models.

Toxicity Analysis

1. malaya.logistics_detect_toxic()

References

1. https://itnext.io/machine-learning-sentiment-analysis-of-movie-reviews-using-logisticregression-62e9622b4532
