Malaya Documentation
malaya Documentation
huseinzol05
Sep 16, 2021

GETTING STARTED

1 Documentation
2 Installing from the PyPI
3 Features
4 Pretrained Models
5 References
6 Acknowledgement
7 Contributing
8 License
9 Contents:
  9.1 Speech Toolkit
  9.2 Dataset
  9.3 Installation
  9.4 Malaya Cache
  9.5 Running on Windows
  9.6 API
  9.7 Contributing
  9.8 GPU Environment
  9.9 Devices
  9.10 Precision Mode
  9.11 Quantization
  9.12 Deployment
  9.13 Transformer
  9.14 Word Vector
  9.15 Word and sentence tokenizer
  9.16 Spelling Correction
  9.17 Coreference Resolution
  9.18 Normalizer
  9.19 Stemmer and Lemmatization
  9.20 True Case
  9.21 Segmentation
  9.22 Preprocessing
  9.23 Kesalahan Tatabahasa
  9.24 Num2Word
  9.25 Word2Num
  9.26 Knowledge Graph Triples
  9.27 Knowledge Graph from Dependency
  9.28 Text Augmentation
  9.29 Prefix Generator
  9.30 Isi Penting Generator
  9.31 Lexicon Generator
  9.32 Paraphrase
  9.33 Emotion Analysis
  9.34 Language Detection
  9.35 NSFW Detection
  9.36 Relevancy Analysis
  9.37 Sentiment Analysis
  9.38 Subjectivity Analysis
  9.39 Toxicity Analysis
  9.40 Doc2Vec
  9.41 Semantic Similarity
  9.42 Unsupervised Keyword Extraction
  9.43 Keyphrase similarity
  9.44 Entities Recognition
  9.45 Part-of-Speech Recognition
  9.46 Dependency Parsing
  9.47 Constituency Parsing
  9.48 Abstractive
  9.49 Long Text Abstractive Summarization
  9.50 Extractive
  9.51 MS to EN
  9.52 EN to MS
  9.53 Long Text Translation
  9.54 SQUAD
  9.55 Classification
  9.56 Topic Modeling
  9.57 Clustering
  9.58 Stacking
  9.59 Finetune ALXLNET-Bahasa
  9.60 Finetune BERT-Bahasa
  9.61 Finetune XLNET-Bahasa
  9.62 Crawler
  9.63 Donation
  9.64 How Malaya gathered corpus?
  9.65 References
Python Module Index
Index

Malaya is a Natural Language Toolkit library for bahasa Malaysia, powered by Tensorflow deep learning.

GETTING STARTED

CHAPTER ONE
DOCUMENTATION

Proper documentation is available at https://malaya.readthedocs.io/

CHAPTER TWO
INSTALLING FROM THE PYPI

CPU version:

$ pip install malaya

GPU version:

$ pip install malaya[gpu]

Only Python 3.6.0 and above and Tensorflow 1.15.0 and above are supported. We recommend using virtualenv for development. All examples were tested on Tensorflow versions 1.15.4, 2.4.1 and 2.5.

CHAPTER THREE
FEATURES

• Augmentation, augment any text using a dictionary of synonyms, Wordvector or Transformer-Bahasa.
• Constituency Parsing, breaking a text into sub-phrases using finetuned Transformer-Bahasa.
• Coreference Resolution, finding all expressions that refer to the same entity in a text using Dependency Parsing models.
• Dependency Parsing, extracting a dependency parse of a sentence using finetuned Transformer-Bahasa.
• Emotion Analysis, detect and recognize 6 different emotions in text using finetuned Transformer-Bahasa.
• Entities Recognition, locate and classify named entities mentioned in text using finetuned Transformer-Bahasa.
• Generator, generate any text given a context using T5-Bahasa, GPT2-Bahasa or Transformer-Bahasa.
• Keyword Extraction, provides RAKE, TextRank and an Attention Mechanism hybrid with Transformer-Bahasa.
• Knowledge Graph, generate a Knowledge Graph using T5-Bahasa or parse one from Dependency Parsing models.
• Language Detection, using Fast-text and a Sparse Deep Learning model to classify Malay (formal and social media), Indonesian (formal and social media), Rojak language and Manglish.
• Normalizer, using local Malaysian NLP research hybridized with Transformer-Bahasa to normalize any bahasa text.
• Num2Word, convert numbers to cardinal or ordinal representation.
• Paraphrase, provides Abstractive Paraphrase using T5-Bahasa and Transformer-Bahasa.
• Part-of-Speech Recognition, grammatical tagging of words in a text using finetuned Transformer-Bahasa.
• Question Answer, reading comprehension using finetuned Transformer-Bahasa.
• Relevancy Analysis, detect and recognize relevancy of text using finetuned Transformer-Bahasa.
• Sentiment Analysis, detect and recognize polarity of text using finetuned Transformer-Bahasa.
• Text Similarity, provides an interface for lexical similarity and deep semantic similarity using finetuned Transformer-Bahasa.
• Spell Correction, using local Malaysian NLP research hybridized with Transformer-Bahasa to auto-correct any bahasa word, and NeuSpell using T5-Bahasa.
• Stemmer, using a state-of-the-art BPE LSTM Seq2Seq model with attention for Bahasa stemming.
• Subjectivity Analysis, detect and recognize self-opinion polarity of text using finetuned Transformer-Bahasa.
• Kesalahan Tatabahasa (grammatical errors), fix grammatical errors using TransformerTag-Bahasa.
• Summarization, provides Abstractive T5-Bahasa and an Extractive interface using Transformer-Bahasa, skip-thought and Doc2Vec.
• Topic Modelling, provides Transformer-Bahasa, LDA2Vec, LDA, NMF and LSA interfaces for easy topic modelling with topic visualization.
• Toxicity Analysis, detect and recognize 27 different toxicity patterns in text using finetuned Transformer-Bahasa.
• Transformer, provides an easy interface to load pretrained Malaya language models.
• Translation, provides Neural Machine Translation using Transformer for EN to MS and MS to EN.
• Word2Num, convert cardinal or ordinal representations to numbers.
• Word2Vec, provides pretrained bahasa Wikipedia and bahasa news Word2Vec, with an easy interface and visualization.
• Zero-shot classification, provides a Zero-shot classification interface using Transformer-Bahasa to recognize texts without any labeled training data.
• Hybrid 8-bit Quantization, provides hybrid 8-bit quantization for all models to reduce inference time up to 2x and model size up to 4x.
• Longer Sequences Transformer, provides BigBird + Pegasus for longer Abstractive Summarization, Neural Machine Translation and Relevancy Analysis sequences.

CHAPTER FOUR
PRETRAINED MODELS

Malaya also released Bahasa pretrained models; simply check Malaya/pretrained-model.
• ALBERT, a Lite BERT for Self-supervised Learning of Language Representations, https://arxiv.org/abs/1909.11942
• ALXLNET, a Lite XLNET, no paper produced.
• BERT, Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805
• BigBird, Transformers for Longer Sequences, https://arxiv.org/abs/2007.14062
• ELECTRA, Pre-training Text Encoders as
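The Num2Word feature listed above converts numbers to their Malay cardinal or ordinal word forms. As a minimal, self-contained sketch of the idea — not Malaya's actual implementation; the function name `num2word_ms` and the 0-99 coverage are illustrative assumptions only — a Malay cardinal converter could look like this:

```python
# Minimal sketch of a Num2Word-style converter for Malay cardinals.
# Illustrative only: covers 0-99; a real implementation also handles
# ordinals, large numbers and decimals.

UNITS = ["kosong", "satu", "dua", "tiga", "empat",
         "lima", "enam", "tujuh", "lapan", "sembilan"]

def num2word_ms(n: int) -> str:
    """Convert an integer in [0, 99] to its Malay cardinal words."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0-99")
    if n < 10:
        return UNITS[n]
    if n == 10:
        return "sepuluh"                    # 10 is irregular
    if n == 11:
        return "sebelas"                    # 11 is irregular
    if n < 20:
        return f"{UNITS[n % 10]} belas"     # 12-19: "<unit> belas"
    tens = f"{UNITS[n // 10]} puluh"        # 20, 30, ...: "<unit> puluh"
    return tens if n % 10 == 0 else f"{tens} {UNITS[n % 10]}"
```

For example, `num2word_ms(21)` returns `"dua puluh satu"`. The companion Word2Num feature is the inverse mapping, from word forms back to integers.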