malaya Documentation

huseinzol05

Sep 16, 2021

GETTING STARTED

1 Documentation
2 Installing from the PyPI
3 Features
4 Pretrained Models
5 References
6 Acknowledgement
7 Contributing
8 License
9 Contents:
    9.1 Speech Toolkit
    9.2 Dataset
    9.3 Installation
    9.4 Malaya Cache
    9.5 Running on Windows
    9.6 API
    9.7 Contributing
    9.8 GPU Environment
    9.9 Devices
    9.10 Precision Mode
    9.11 Quantization
    9.12 Deployment
    9.13 Transformer
    9.14 Vector
    9.15 Word and sentence tokenizer
    9.16 Spelling Correction
    9.17 Coreference Resolution
    9.18 Normalizer
    9.19 Stemmer and Lemmatization
    9.20 True Case
    9.21 Segmentation
    9.22 Preprocessing
    9.23 Kesalahan Tatabahasa
    9.24 Num2Word
    9.25 Word2Num
    9.26 Knowledge Graph Triples
    9.27 Knowledge Graph from Dependency
    9.28 Text Augmentation
    9.29 Prefix Generator
    9.30 Isi Penting Generator
    9.31 Lexicon Generator
    9.32 Paraphrase
    9.33 Emotion Analysis
    9.34 Language Detection
    9.35 NSFW Detection
    9.36 Relevancy Analysis
    9.37 Sentiment Analysis
    9.38 Subjectivity Analysis
    9.39 Toxicity Analysis
    9.40 Doc2Vec
    9.41 ......
    9.42 Unsupervised Keyword Extraction
    9.43 Keyphrase similarity
    9.44 Entities Recognition
    9.45 Part-of-Speech Recognition
    9.46 Dependency Parsing
    9.47 Constituency Parsing
    9.48 Abstractive
    9.49 Long Text Abstractive Summarization
    9.50 Extractive
    9.51 MS to EN
    9.52 EN to MS
    9.53 Long Text Translation
    9.54 SQUAD
    9.55 Classification
    9.56 Topic Modeling
    9.57 Clustering
    9.58 Stacking
    9.59 Finetune ALXLNET-Bahasa
    9.60 Finetune BERT-Bahasa
    9.61 Finetune XLNET-Bahasa
    9.62 Crawler
    9.63 Donation
    9.64 How Malaya gathered corpus?
    9.65 References

Python Module Index

Index


Malaya is a Natural-Language-Toolkit library for bahasa Malaysia, powered by Tensorflow Deep Learning.


CHAPTER ONE

DOCUMENTATION

Proper documentation is available at https://malaya.readthedocs.io/


CHAPTER TWO

INSTALLING FROM THE PYPI

CPU version

$ pip install malaya

GPU version

$ pip install malaya[gpu]

Only Python 3.6.0 and above and Tensorflow 1.15.0 and above are supported. We recommend using virtualenv for development. All examples were tested on Tensorflow versions 1.15.4, 2.4.1 and 2.5.
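To verify the installation, a minimal sanity check (the printed path depends on your machine; __home__ is the default cache folder described in the Malaya Cache section):

import malaya
print(malaya.__home__)  # location of the default cache folder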


CHAPTER THREE

FEATURES

• Augmentation, augment any text using a dictionary of synonyms, Wordvector or Transformer-Bahasa.
• Constituency Parsing, breaking a text into sub-phrases using finetuned Transformer-Bahasa.
• Coreference Resolution, finding all expressions that refer to the same entity in a text using Dependency Parsing models.
• Dependency Parsing, extracting a dependency parse of a sentence using finetuned Transformer-Bahasa.
• Emotion Analysis, detect and recognize 6 different emotions in texts using finetuned Transformer-Bahasa.
• Entities Recognition, seeks to locate and classify named entities mentioned in text using finetuned Transformer-Bahasa.
• Generator, generate any texts given a context using T5-Bahasa, GPT2-Bahasa or Transformer-Bahasa.
• Keyword Extraction, provide RAKE, TextRank and Attention Mechanism hybrid with Transformer-Bahasa.
• Knowledge Graph, generate Knowledge Graph using T5-Bahasa or parse from Dependency Parsing models.
• Language Detection, using Fast-text and a Sparse Deep learning Model to classify Malay (formal and social media), Indonesian (formal and social media), Rojak language and Manglish.
• Normalizer, using local Malaysian NLP research hybrid with Transformer-Bahasa to normalize any bahasa texts.
• Num2Word, convert from numbers to cardinal or ordinal representation.
• Paraphrase, provide Abstractive Paraphrase using T5-Bahasa and Transformer-Bahasa.
• Part-of-Speech Recognition, grammatical tagging is the process of marking up a word in a text using finetuned Transformer-Bahasa.
• Question Answering, reading comprehension using finetuned Transformer-Bahasa.
• Relevancy Analysis, detect and recognize relevancy of texts using finetuned Transformer-Bahasa.
• Sentiment Analysis, detect and recognize polarity of texts using finetuned Transformer-Bahasa.
• Text Similarity, provide interface for lexical similarity and deep semantic similarity using finetuned Transformer-Bahasa.
• Spell Correction, using local Malaysian NLP research hybrid with Transformer-Bahasa to auto-correct any bahasa words, and NeuSpell using T5-Bahasa.
• Stemmer, using BPE LSTM Seq2Seq with attention, state-of-the-art for Bahasa stemming.
• Subjectivity Analysis, detect and recognize self-opinion polarity of texts using finetuned Transformer-Bahasa.
• Kesalahan Tatabahasa, fix grammatical errors (kesalahan tatabahasa) using TransformerTag-Bahasa.


• Summarization, provide Abstractive using T5-Bahasa, and an Extractive interface using Transformer-Bahasa, skip-thought and Doc2Vec.
• Topic Modelling, provide Transformer-Bahasa, LDA2Vec, LDA, NMF and LSA interfaces for easy topic modelling with topic visualization.
• Toxicity Analysis, detect and recognize 27 different toxicity patterns of texts using finetuned Transformer-Bahasa.
• Transformer, provide easy interface to load Malaya Pretrained Language models.
• Translation, provide Neural Machine Translation using Transformer for EN to MS and MS to EN.
• Word2Num, convert from cardinal or ordinal representation to numbers.
• WordVector, provide pretrained bahasa wikipedia and bahasa news Word2Vec, with easy interface and visualization.
• Zero-shot classification, provide Zero-shot classification interface using Transformer-Bahasa to recognize texts without any labeled training data.
• Hybrid 8-bit Quantization, provide hybrid 8-bit quantization for all models to reduce inference time up to 2x and model size up to 4x.
• Longer Sequences Transformer, provide BigBird + Pegasus for longer Abstractive Summarization, Neural Machine Translation and Relevancy Analysis sequences.

CHAPTER FOUR

PRETRAINED MODELS

Malaya also released Bahasa pretrained models, simply check at Malaya/pretrained-model
• ALBERT, A Lite BERT for Self-supervised Learning of Language Representations, https://arxiv.org/abs/1909.11942
• ALXLNET, a Lite XLNET, no paper produced.
• BERT, Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805
• BigBird, Transformers for Longer Sequences, https://arxiv.org/abs/2007.14062
• ELECTRA, Pre-training Text Encoders as Discriminators Rather Than Generators, https://arxiv.org/abs/2003.10555
• GPT2, Language Models are Unsupervised Multitask Learners, https://github.com/openai/gpt-2
• LM-Transformer, exactly like T5, but uses Tensor2Tensor instead of Mesh Tensorflow with a little tweak, no paper produced.
• PEGASUS, Pre-training with Extracted Gap-sentences for Abstractive Summarization, https://arxiv.org/abs/1912.08777
• T5, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, https://arxiv.org/abs/1910.10683
• TinyBERT, Distilling BERT for Natural Language Understanding, https://arxiv.org/abs/1909.10351
• Word2Vec, Efficient Estimation of Word Representations in Vector Space, https://arxiv.org/abs/1301.3781
• XLNET, Generalized Autoregressive Pretraining for Language Understanding, https://arxiv.org/abs/1906.08237
• FNet, FNet: Mixing Tokens with Fourier Transforms, https://arxiv.org/abs/2105.03824


CHAPTER FIVE

REFERENCES

If you use our software for research, please cite:

@misc{Malaya, Natural-Language-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow,
  author = {Husein, Zolkepli},
  title = {Malaya},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huseinzol05/malaya}}
}


CHAPTER SIX

ACKNOWLEDGEMENT

Thanks to KeyReply for sponsoring the private cloud to train Malaya models; without it, this library would collapse entirely. Also, thanks to Tensorflow Research Cloud for free TPU access.


CHAPTER SEVEN

CONTRIBUTING

Thank you for contributing to this library, it really helps a lot. Feel free to contact me with suggestions, or contribute in other forms. We accept everything, not just code!


CHAPTER EIGHT

LICENSE


CHAPTER NINE

CONTENTS:

9.1 Speech Toolkit

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Tensorflow Deep Learning. We maintain it in a separate repository, https://github.com/huseinzol05/malaya-speech

9.1.1 Documentation

Proper documentation is available at https://malaya-speech.readthedocs.io/

9.1.2 Installing from the PyPI

CPU version

$ pip install malaya-speech

GPU version

$ pip install malaya-speech[gpu]

Only Python 3.6.0 and above and Tensorflow 1.15.0 and above are supported. We recommend using virtualenv for development. All examples were tested on Tensorflow versions 1.15.4 and 2.4.1.

9.1.3 Features

• Age Detection, detect age in speech using Finetuned Speaker Vector.
• Speaker Diarization, diarizing speakers using Pretrained Speaker Vector.
• Emotion Detection, detect emotions in speech using Finetuned Speaker Vector.
• Gender Detection, detect genders in speech using Finetuned Speaker Vector.
• Language Detection, detect hyperlocal languages in speech using Finetuned Speaker Vector.
• Multispeaker Separation, Multispeaker separation using FastSep on 8k Wav.
• Noise Reduction, reduce multilevel noises using STFT UNET.
• Speaker Change, detect changing speakers using Finetuned Speaker Vector.


• Speaker Overlap, detect overlapping speakers using Finetuned Speaker Vector.
• Speaker Vector, calculate similarity between speakers using Pretrained Speaker Vector.
• Speech Enhancement, enhance voice activities using Waveform UNET.
• SpeechSplit Conversion, detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch using PyWorld and PySPTK.
• Speech-to-Text, End-to-End Speech to Text for Malay and Mixed (Malay and Singlish) using RNN-Transducer and Wav2Vec2 CTC.
• Super Resolution, Super Resolution 4x for Waveform.
• Text-to-Speech, Text to Speech for Malay and Singlish using Tacotron2 and FastSpeech2.
• Vocoder, convert Mel to Waveform using MelGAN, Multiband MelGAN and Universal MelGAN Vocoder.
• Voice Activity Detection, detect voice activities using Finetuned Speaker Vector.
• Voice Conversion, Many-to-One, One-to-Many, Many-to-Many, and Zero-shot Voice Conversion.
• Hybrid 8-bit Quantization, provide hybrid 8-bit quantization for all models to reduce inference time up to 2x and model size up to 4x.

9.1.4 Pretrained Models

Malaya-Speech also released pretrained models, simply check at malaya-speech/pretrained-model
• Wave UNET, Multi-Scale Neural Network for End-to-End Audio Source Separation, https://arxiv.org/abs/1806.03185
• Wave ResNet UNET, added ResNet style into Wave UNET, no paper produced.
• Wave ResNext UNET, added ResNext style into Wave UNET, no paper produced.
• Deep Speaker, An End-to-End Neural Speaker Embedding System, https://arxiv.org/pdf/1705.02304.pdf
• SpeakerNet, 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification, https://arxiv.org/abs/2010.12653
• VGGVox, a large-scale speaker identification dataset, https://arxiv.org/pdf/1706.08612.pdf
• GhostVLAD, Utterance-level Aggregation For Speaker Recognition In The Wild, https://arxiv.org/abs/1902.10107
• Conformer, Convolution-augmented Transformer for Speech Recognition, https://arxiv.org/abs/2005.08100
• ALConformer, A lite Conformer, no paper produced.
• Jasper, An End-to-End Convolutional Neural Acoustic Model, https://arxiv.org/abs/1904.03288
• Tacotron2, Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, https://arxiv.org/abs/1712.05884
• FastSpeech2, Fast and High-Quality End-to-End Text to Speech, https://arxiv.org/abs/2006.04558
• MelGAN, Generative Adversarial Networks for Conditional Waveform Synthesis, https://arxiv.org/abs/1910.06711
• Multi-band MelGAN, Faster Waveform Generation for High-Quality Text-to-Speech, https://arxiv.org/abs/2005.05106
• SRGAN, Modified version of SRGAN to do 1D Convolution, Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, https://arxiv.org/abs/1609.04802


• Speech Enhancement UNET, https://github.com/haoxiangsnr/Wave-U-Net-for-Speech-Enhancement
• Speech Enhancement ResNet UNET, added ResNet style into Speech Enhancement UNET, no paper produced.
• Speech Enhancement ResNext UNET, added ResNext style into Speech Enhancement UNET, no paper produced.
• Universal MelGAN, Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains, https://arxiv.org/abs/2011.09631
• FastVC, Faster and Accurate Voice Conversion using Transformer, no paper produced.
• FastSep, Faster and Accurate Speech Separation using Transformer, no paper produced.
• wav2vec 2.0, A Framework for Self-Supervised Learning of Speech Representations, https://arxiv.org/abs/2006.11477
• FastSpeechSplit, Unsupervised Speech Decomposition Via Triple Information Bottleneck using Transformer, no paper produced.
• Sepformer, Attention is All You Need in Speech Separation, https://arxiv.org/abs/2010.13154
• FastSpeechSplit, Faster and Accurate Speech Split Conversion using Transformer, no paper produced.
• HuBERT, Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, https://arxiv.org/pdf/2106.07447v1.pdf

9.1.5 References

If you use our software for research, please cite:

@misc{Malaya, Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow,
  author = {Husein, Zolkepli},
  title = {Malaya-Speech},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huseinzol05/malaya-speech}}
}

9.1.6 Acknowledgement

Thanks to KeyReply for sponsoring the private cloud to train Malaya-Speech models; without it, this library would collapse entirely.

9.2 Dataset

We want to open-source not just the code, but also the datasets, so everyone can validate them. You can check Malay-Dataset for our open datasets.


9.2.1 Citation

1. Please cite the repository if you use any of these corpora.
2. Please at least email us first before distributing the data. Remember, we give all this hard work away for free.
3. You only see the data, but nobody sees how much it cost us to make it public.

9.2.2 Donation

1. We want to make sure downloaders get the best bandwidth and top speed, so we host everything on S3. Please consider a donation to prevent top-speed shutdowns or broken links!
2. Husein really needs money to survive; he is still a human. 7053174643, CIMB Click, Husein Zolkepli

9.3 Installation

9.3.1 Installing/Upgrading From the PyPI

CPU version

$ pip install malaya

GPU version

$ pip install malaya[gpu]

We recommend using virtualenv for development.

9.3.2 From Source

Malaya is actively developed on Github. You can clone the public repo:

git clone https://github.com/huseinzol05/malaya

Once you have the source, you can install it into your site-packages with:

python setup.py install

9.3.3 Python

Malaya only supports Python 3.6 and above.


9.3.4 Tensorflow

Malaya only supports Tensorflow 1.15 and above. All examples were tested on Tensorflow versions 1.15.4, 2.4.1 and 2.5.

9.4 Malaya Cache

This tutorial is available as an IPython notebook at Malaya/example/caching.

9.4.1 Default Cache location

You can check where Malaya's default caching folder is. The caching folder stores any models, vocabs, and rules downloaded for specific modules.

[1]: import malaya

[2]: malaya.__home__
[2]: '/Users/huseinzolkepli/Malaya'

9.4.2 Change default Cache location

To change the default cache location, you need to set the MALAYA_CACHE OS environment variable before importing Malaya,

export MALAYA_CACHE=/Users/huseinzolkepli/Documents/Malaya

Or you can set it in your bash environment to make it permanent.

[1]: import os

os.environ['MALAYA_CACHE']='/Users/huseinzolkepli/Documents/malaya-cache'

[2]: import malaya

malaya.__home__
[2]: '/Users/huseinzolkepli/Documents/malaya-cache'

9.4.3 Cache subdirectories

Starting from version 1.0, Malaya puts models in subdirectories. You can print them by simply,

[3]: malaya.utils.print_cache()
Malaya/
    keyword-extraction/
        alxlnet/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        tiny-bert/
            model.pb
            sp10m.cased.bert.model
            sp10m.cased.bert.vocab
            version
    qa-squad/
        albert/
            model.pb
            sp10m.cased.v10.model
            sp10m.cased.v10.vocab
            version
        albert-quantized/
            model.pb
            sp10m.cased.v10.model
            sp10m.cased.v10.vocab
            version
        alxlnet/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        bert/
        tiny-bert/
            model.pb
            sp10m.cased.bert.model
            sp10m.cased.bert.vocab
            version
        xlnet/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        xlnet-quantized/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
    sentiment/
        albert/
            model.pb
            sp10m.cased.v10.model
            sp10m.cased.v10.vocab
            version
        alxlnet/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        bert/
            model.pb
            sp10m.cased.bert.model
            sp10m.cased.bert.vocab
            version
        xlnet/
            model.pb
        xlnet-quantized/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
    similarity/
        albert/
            model.pb
            sp10m.cased.v10.model
            sp10m.cased.v10.vocab
            version
        alxlnet/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        alxlnet-quantized/
            model.pb
            sp10m.cased.v9.model
            sp10m.cased.v9.vocab
            version
        tiny-bert/
            model.pb
            sp10m.cased.bert.model
            sp10m.cased.bert.vocab
            version
    stem/
        lstm-bahdanau/
            model.pb
            stemmer.yttm
            version
    translation-en-ms/
        version
    wordvector/
        news/
            version
            wordvector.json
            wordvector.npy

9.4.4 Deleting specific model

Let's say you want to clear some space. Starting from version 1.0, you can specifically choose which model you want to delete.

[4]: malaya.utils.delete_cache('wordvector/news')
[4]: True

What happens if a directory does not exist?

[7]: malaya.utils.delete_cache('wordvector/news2')
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
in <module>
----> 1 malaya.utils.delete_cache('wordvector/news2')

~/Documents/tf-1.15/env/lib/python3.7/site-packages/malaya_boilerplate-0.0.1-py3.7.egg/malaya_boilerplate/utils.py in delete_cache(location)
    188     if not os.path.exists(location):
    189         raise Exception(
--> 190             f'folder not exist, please check path from `{__package__}.utils.print_cache()`'
    191         )
    192     if not os.path.isdir(location):

Exception: folder not exist, please check path from `malaya.utils.print_cache()`
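If you script cache cleanups, a small defensive wrapper avoids this exception; this is a sketch, not part of the Malaya API:

import malaya

def safe_delete(path):
    # delete a cached model only if it exists; swallow the
    # 'folder not exist' exception raised by malaya.utils.delete_cache
    try:
        return malaya.utils.delete_cache(path)
    except Exception as e:
        print(f'skipped {path}: {e}')
        return False

safe_delete('wordvector/news')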

9.4.5 Purge cache

You can also delete all models, purging the cache entirely, by simply,

[8]: malaya.utils.delete_all_cache
[8]:

I am not going to run it, because I prefer to keep my cache for now.

9.5 Running on Windows

9.5.1 UnicodeDecodeError: ‘charmap’ codec can’t decode byte

To solve this, go to Windows Settings > Administrative language settings > Change system locale, and check Beta: Use Unicode UTF-8 for worldwide language support. After restarting, everything works well. For the full discussion, check issue 25.

9.5.2 youtokentome failed to build

YouTokenToMe requires Cython and Microsoft Visual C++ 14.0 to compile, and Windows users usually break on this part, so we need to install Malaya without YouTokenToMe.

pip install malaya --no-deps
pip install tensorflow>=1.15

If we skip YouTokenToMe, we are not able to use,
1. language-detection module, https://malaya.readthedocs.io/en/latest/load-language-detection.html
2. True Case module, https://malaya.readthedocs.io/en/latest/load-true-case.html
3. Multinomial model in emotion analysis, https://malaya.readthedocs.io/en/latest/load-emotion.html#Load-multinomial-model


4. Multinomial model in sentiment analysis, https://malaya.readthedocs.io/en/latest/load-sentiment.html#Load-multinomial-model
5. Multinomial model in subjectivity analysis, https://malaya.readthedocs.io/en/latest/load-subjectivity.html#Load-multinomial-model
6. Multinomial model in toxicity analysis, https://malaya.readthedocs.io/en/latest/load-toxic.html#Load-multinomial-model

If you still need these models, you need to install Cython,

pip install cython

Then install Visual Studio from https://docs.microsoft.com/en-us/visualstudio/install/create-an-offline-installation-of-visual-studio?view=vs-2019, choose Visual Studio 2019 Build Tools (vs_buildtools.exe), and follow https://stackoverflow.com/questions/43847542/rc-exe-no-longer-found-in-vs-2015-command-prompt

9.5.3 Unable to use any T5 models

T5 depends on tensorflow-text, and currently there is no official tensorflow-text binary released for Windows, so there are no T5 models for Windows users. The T5 models are,
1. malaya.summarization.abstractive.transformer
2. malaya.generator.transformer
3. malaya.paraphrase.transformer

9.6 API

9.6.1 malaya

9.6.2 malaya.augmentation

malaya.augmentation.synonym(string: str, threshold: float = 0.5, top_n=5, cleaning=<function augmentation_textcleaning>, **kwargs)
    Augment a string using synonyms, https://github.com/huseinzol05/Malaya-Dataset#90k-synonym
    Parameters
        • string (str) –
        • threshold (float, optional (default=0.5)) – random selection for a word.
        • top_n (int, (default=5)) – number of nearest neighbors returned. Length of returned result should be top_n.
        • cleaning (function, (default=malaya.text.function.augmentation_textcleaning)) – function to clean text.
    Returns result
    Return type List[str]

malaya.augmentation.wordvector(string: str, wordvector, threshold: float = 0.5, top_n: int = 5, soft: bool = False, cleaning=<function augmentation_textcleaning>)
    Augment a string using a wordvector.
    Parameters
        • string (str) –
        • wordvector (object) – wordvector interface object.
        • threshold (float, optional (default=0.5)) – random selection for a word.
        • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest jarowinkler ratio. if False, it will throw an exception if a word is not in the dictionary.
        • top_n (int, (default=5)) – number of nearest neighbors returned. Length of returned result should be top_n.
        • cleaning (function, (default=malaya.text.function.augmentation_textcleaning)) – function to clean text.
    Returns result
    Return type List[str]

malaya.augmentation.transformer(string: str, model, threshold: float = 0.5, top_p: float = 0.9, top_k: int = 100, temperature: float = 1.0, top_n: int = 5, cleaning=None)
    Augment a string using transformer + nucleus sampling / top-k sampling.
    Parameters
        • string (str) –
        • model (object) – transformer interface object. Right now only supports BERT, ALBERT and ELECTRA.
        • threshold (float, optional (default=0.5)) – random selection for a word.
        • top_p (float, optional (default=0.8)) – cumulative sum of probabilities to sample a word. If top_p bigger than 0, the model will use nucleus sampling, else top-k sampling.
        • top_k (int, optional (default=100)) – k for top-k sampling.
        • temperature (float, optional (default=0.8)) – logits * temperature.
        • top_n (int, (default=5)) – number of nearest neighbors returned. Length of returned result should be top_n.
        • cleaning (function, (default=None)) – function to clean text.
    Returns result
    Return type List[str]

malaya.augmentation.replace_similar_consonants(word: str, threshold: float = 0.8)
    Naively replace consonants with similar consonants in a word.
    Parameters
        • word (str) –
        • threshold (float, optional (default=0.8)) –


    Returns result
    Return type List[str]

malaya.augmentation.replace_similar_vowels(word: str, threshold: float = 0.8)
    Naively replace vowels with similar vowels in a word.
    Parameters
        • word (str) –
        • threshold (float, optional (default=0.8)) –
    Returns result
    Return type List[str]

malaya.augmentation.socialmedia_form(word: str)
    Augment a word into social media form.
    Parameters word (str) –
    Returns result
    Return type List[str]

malaya.augmentation.vowel_alternate(word: str, threshold: float = 0.5)
    Augment a word into a vowel-alternate form. vowel_alternate('singapore') -> sngpore, vowel_alternate('kampung') -> kmpng, vowel_alternate('ayam') -> aym
    Parameters
        • word (str) –
        • threshold (float, optional (default=0.5)) –
    Returns result
    Return type str
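A minimal usage sketch for this module; the input sentence is arbitrary and the returned candidates are illustrative only:

import malaya

# synonym augmentation, returns a List[str] of augmented candidates
malaya.augmentation.synonym('saya suka makan ayam', top_n=3)

# naive character-level augmentations on a single word
malaya.augmentation.replace_similar_consonants('ayam')
malaya.augmentation.vowel_alternate('kampung')  # e.g. 'kmpng' per the docstring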

9.6.3 malaya.cluster

malaya.cluster.cluster_words(list_words: List[str], lowercase: bool = False)
    Cluster similar words based on structure, eg, ['mahathir mohamad', 'mahathir'] = ['mahathir mohamad']. big O = n^2
    Parameters
        • list_words (List[str]) –
        • lowercase (bool, optional (default=True)) – if True, will group using lowercase but maintain the original form.
    Returns string
    Return type List[str]

malaya.cluster.cluster_pos(result: List[Tuple[str, str]])
    Cluster similar POS.
    Parameters result (List[Tuple[str, str]]) –
    Returns result
    Return type Dict[str, List[str]]


malaya.cluster.cluster_entities(result: List[Tuple[str, str]])
    Cluster similar Entities.
    Parameters result (List[Tuple[str, str]]) –
    Returns result
    Return type Dict[str, List[str]]

malaya.cluster.cluster_tagging(result: List[Tuple[str, str]])
    Cluster any tagging results, as long as the data passed is [(string, label), (string, label)].
    Parameters result (List[Tuple[str, str]]) –
    Returns result
    Return type Dict[str, List[str]]

malaya.cluster.cluster_scatter(corpus: List[str], vectorizer, num_clusters: int = 5, titles: List[str] = None, colors: List[str] = None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, clustering=..., decomposition=..., ngram: Tuple[int, int] = (1, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)
    Plot a scatter plot of similar text clusters.
    Parameters
        • corpus (List[str]) –
        • vectorizer (class) – vectorizer class.
        • num_clusters (int, (default=5)) – size of unsupervised clusters.
        • titles (List[str], (default=None)) – list of titles, length must be the same as corpus.
        • colors (List[str], (default=None)) – list of colors, length must be the same as num_clusters.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]
        • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.
        • cleaning (function, (default=malaya.texts.function.simple_textcleaning)) – function to clean the corpus.
        • batch_size (int, (default=10)) – size of strings for each vectorization and attention. Only useful if using a transformer vectorizer.
    Returns dictionary
    Return type {'X': X, 'Y': Y, 'labels': clusters, 'vector': transformed_text_clean, 'titles': titles}

malaya.cluster.cluster_dendogram(corpus: List[str], vectorizer, titles: List[str] = None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, random_samples: float = 0.3, ngram: Tuple[int, int] = (1, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)
    Plot a hierarchical dendogram of similar texts.
    Parameters
        • corpus (List[str]) –


        • vectorizer (class) – vectorizer class.
        • num_clusters (int, (default=5)) – size of unsupervised clusters.
        • titles (List[str], (default=None)) – list of titles, length must be the same as corpus.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]
        • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
        • random_samples (float, (default=0.3)) – random samples from the corpus, 0.3 means 30%.
        • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.
        • batch_size (int, (default=20)) – size of strings for each vectorization and attention. Only useful if using a transformer vectorizer.
    Returns dictionary
    Return type {'linkage_matrix': linkage_matrix, 'titles': titles}

malaya.cluster.cluster_graph(corpus: List[str], vectorizer, threshold: float = 0.9, num_clusters: int = 5, titles: List[str] = None, colors: List[str] = None, stopwords=<function get_stopwords>, ngram: Tuple[int, int] = (1, 3), cleaning=<function simple_textcleaning>, clustering=..., figsize: Tuple[int, int] = (17, 9), with_labels: bool = True, batch_size: int = 20)
    Plot an undirected graph of similar texts.
    Parameters
        • corpus (List[str]) –
        • vectorizer (class) – vectorizer class.
        • threshold (float, (default=0.9)) – 0.9 means, 90% above absolute pearson correlation.
        • num_clusters (int, (default=5)) – size of unsupervised clusters.
        • titles (List[str], (default=True)) – list of titles, length must be the same as corpus.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str] or List[str] or Tuple[str].
        • cleaning (function, (default=malaya.texts.function.simple_textcleaning)) – function to clean the corpus.
        • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.
        • batch_size (int, (default=20)) – size of strings for each vectorization and attention. Only useful if using a transformer vectorizer.
    Returns dictionary
    Return type {'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}

malaya.cluster.cluster_entity_linking(corpus: List[str], vectorizer, entity_model, topic_modeling_model, threshold: float = 0.3, topic_decomposition: int = 2, topic_length: int = 10, fuzzy_ratio: int = 70, accepted_entities: List[str] = ['law', 'location', 'organization', 'person', 'event'], cleaning=<function simple_textcleaning>, colors: List[str] = None, stopwords=<function get_stopwords>, max_df: float = 1.0, min_df: int = 1, ngram: Tuple[int, int] = (2, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)
    Plot an undirected graph for Entities and topics relationship.
    Parameters
        • corpus (list or str) –
        • vectorizer (class) –
        • titles (list) – list of titles, length must be the same as corpus.
        • colors (list) – list of colors, length must be the same as num_clusters.
        • threshold (float, (default=0.3)) – 0.3 means, 30% above absolute pearson correlation.
        • topic_decomposition (int, (default=2)) – size of decomposition.
        • topic_length (int, (default=10)) – size of topic models.
        • fuzzy_ratio (int, (default=70)) – size of ratio for fuzzywuzzy.
        • max_df (float, (default=0.95)) – maximum of a word selected based on document frequency.
        • min_df (int, (default=2)) – minimum of a word selected based on document frequency.
        • ngram (tuple, (default=(1,3))) – n-grams size to train a corpus.
        • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str] or List[str] or Tuple[str]
    Returns dictionary
    Return type {'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}
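A small sketch of the pure-text helper, which needs no model; the grouping in the comment is illustrative:

import malaya

words = ['mahathir mohamad', 'mahathir', 'najib razak', 'najib']
# structurally-similar words collapse into their longest form,
# e.g. ['mahathir mohamad', 'najib razak']
malaya.cluster.cluster_words(words)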

9.6.4 malaya.constituency

malaya.constituency.available_transformer()
    List available transformer models.

malaya.constituency.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
    Load Transformer Constituency Parsing model, transfer learning Transformer + self-attentive parsing.
    Parameters
        • model (str, optional (default='bert')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.


            – 'tiny-bert' - Google BERT TINY parameters.
            – 'albert' - Google ALBERT BASE parameters.
            – 'tiny-albert' - Google ALBERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result
    Return type malaya.model.tf.Constituency class
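A loading sketch using only the loader calls documented above; the parse methods on the returned class are covered in the Constituency Parsing section:

import malaya

malaya.constituency.available_transformer()
model = malaya.constituency.transformer(model='xlnet', quantized=False)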

9.6.5 malaya.coref

malaya.coref.parse_from_dependency(models, string: str, references: List[str] = ['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'], rejected_references: List[str] = ['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka', 'nya'], acceptable_subjects: List[str] = ['flat', 'subj', 'nsubj', 'csubj', 'obj'], acceptable_nested_subjects: List[str] = ['compound', 'flat'], split_nya: bool = True, aggregate: Callable = <function mean>, top_k: int = 20)
    Apply Coreference Resolution using stacks of dependency models.
    Parameters
        • models (list) – list of dependency models, must have a vectorize method.
        • string (str) –
        • references (List[str], optional (default=['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'])) – list of references.
        • rejected_references (List[str], optional (default=['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'])) – list of rejected references during populating subjects.
        • acceptable_subjects (List[str], optional) – List of dependency labels for subjects.
        • acceptable_nested_subjects (List[str], optional) – List of dependency labels for nested subjects, eg, syarikat (obl) facebook (compound).
        • split_nya (bool, optional (default=True)) – split nya, eg, disifatkannya -> disifatkan, nya.
        • aggregate (Callable, optional (default=numpy.mean)) – Aggregate function to aggregate list of vectors from model.vectorize.
        • top_k (int, optional (default=20)) – only accept near top_k to assume a coherence.
    Returns result – {'text': ['Husein','Zolkepli','suka','makan','ayam','.','Dia','pun','suka','makan','daging','.'], 'coref': {6: {'index': [0, 1], 'text': ['Husein', 'Zolkepli']}}}
    Return type Dict[text, coref]
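A usage sketch built from the docstring's own example; it stacks a single dependency model, loaded via malaya.dependency.transformer (documented below):

import malaya

model = malaya.dependency.transformer(model='alxlnet')
string = 'Husein Zolkepli suka makan ayam. Dia pun suka makan daging.'
# 'Dia' (token index 6) should resolve to ['Husein', 'Zolkepli']
malaya.coref.parse_from_dependency([model], string)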


9.6.6 malaya.dependency

malaya.dependency.describe()
    Describe Dependency supported.

malaya.dependency.dependency_graph(tagging, indexing)
    Return a helper object for dependency parser results. Only accepts tagging and indexing outputs from dependency models.

malaya.dependency.available_transformer(version: str = 'v2')
    List available transformer dependency parsing models.
    Parameters version (str, optional (default='v2')) – Version supported. Allowed values:
        • 'v1' - version 1, maintained for knowledge graph.
        • 'v2' - Trained on a bigger dataset, better version.

malaya.dependency.transformer(version: str = 'v2', model: str = 'xlnet', quantized: bool = False, **kwargs)
    Load Transformer Dependency Parsing model, transfer learning Transformer + biaffine attention.
    Parameters
        • version (str, optional (default='v2')) – Version supported. Allowed values:
            – 'v1' - version 1, maintained for knowledge graph.
            – 'v2' - Trained on a bigger dataset, better version.
        • model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.
            – 'tiny-bert' - Google BERT TINY parameters.
            – 'albert' - Google ALBERT BASE parameters.
            – 'tiny-albert' - Google ALBERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
            – 'alxlnet' - Malaya ALXLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result – List of model classes:
        • if bert in model, will return malaya.model.bert.DependencyBERT.
        • if xlnet in model, will return malaya.model.xlnet.DependencyXLNET.
    Return type model
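A loading sketch; the loader and the dependency_graph helper are documented above, while the predict call and its return shape are an assumption based on the Dependency Parsing tutorial:

import malaya

model = malaya.dependency.transformer(version='v2', model='xlnet')
# assumption: predict(string) returns (d_object, tagging, indexing),
# whose tagging/indexing feed the documented helper below
d_object, tagging, indexing = model.predict('Dr Mahathir menasihati mereka')
graph = malaya.dependency.dependency_graph(tagging, indexing)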


9.6.7 malaya.emotion

malaya.emotion.available_transformer()
    List available transformer emotion analysis models.

malaya.emotion.multinomial(**kwargs)
    Load multinomial emotion model.
    Returns result
    Return type malaya.model.ml.MulticlassBayes class

malaya.emotion.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
    Load Transformer emotion model.
    Parameters
        • model (str, optional (default='bert')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.
            – 'tiny-bert' - Google BERT TINY parameters.
            – 'albert' - Google ALBERT BASE parameters.
            – 'tiny-albert' - Google ALBERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
            – 'alxlnet' - Malaya ALXLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result – List of model classes:
        • if bert in model, will return malaya.model.bert.MulticlassBERT.
        • if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.
    Return type model
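A loading sketch showing the two documented entry points; prediction methods on the returned classes are covered in the Emotion Analysis section:

import malaya

fast_model = malaya.emotion.multinomial()  # lightweight baseline
deep_model = malaya.emotion.transformer(model='albert', quantized=True)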

9.6.8 malaya.entity

malaya.entity.describe()
    Describe Entities supported.

malaya.entity.describe_ontonotes5()
    Describe OntoNotes5 Entities supported. https://spacy.io/api/annotation#named-entities

malaya.entity.available_transformer()
    List available transformer Entity Tagging models.

malaya.entity.available_transformer_ontonotes5()
    List available transformer Entity Tagging models trained on OntoNotes 5 Bahasa.

malaya.entity.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
    Load Transformer Entity Tagging model, transfer learning Transformer + CRF.
    Parameters


        • model (str, optional (default='bert')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.
            – 'tiny-bert' - Google BERT TINY parameters.
            – 'albert' - Google ALBERT BASE parameters.
            – 'tiny-albert' - Google ALBERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
            – 'alxlnet' - Malaya ALXLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result – List of model classes:
        • if bert in model, will return malaya.model.bert.TaggingBERT.
        • if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.
    Return type model

malaya.entity.transformer_ontonotes5(model: str = 'xlnet', quantized: bool = False, **kwargs)
    Load Transformer Entity Tagging model trained on OntoNotes 5 Bahasa, transfer learning Transformer + CRF.
    Parameters
        • model (str, optional (default='bert')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.
            – 'tiny-bert' - Google BERT TINY parameters.
            – 'albert' - Google ALBERT BASE parameters.
            – 'tiny-albert' - Google ALBERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
            – 'alxlnet' - Malaya ALXLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result – List of model classes:
        • if bert in model, will return malaya.model.bert.TaggingBERT.
        • if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.
    Return type model

malaya.entity.general_entity(model=None)
    Load Regex-based general entities tagging along with another supervised entity tagging model.
    Parameters model (object) – model must have a predict method. Make sure the predict method returns [(string, label), (string, label)].
    Returns result
    Return type malaya.text.entity.EntityRegex class
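A sketch combining the two documented loaders; general_entity wraps any model with a compatible predict method:

import malaya

model = malaya.entity.transformer(model='alxlnet')
# regex-based general entities (dates, money, ...) stacked on the model
combined = malaya.entity.general_entity(model=model)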

9.6.9 malaya.generator

malaya.generator.ngrams(sequence, n: int, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)
    Generate ngrams.
    Parameters
        • sequence (List[str]) – list of tokenized words.
        • n (int) – ngram size
    Returns result
    Return type List[Tuple[str, str]]

malaya.generator.pos_entities_ngram(result_pos: List[Tuple[str, str]], result_entities: List[Tuple[str, str]], ngram: Tuple[int, int] = (1, 3), accept_pos: List[str] = ['NOUN', 'PROPN', 'VERB'], accept_entities: List[str] = ['law', 'location', 'organization', 'person', 'time'])
    Generate ngrams.
    Parameters
        • result_pos (List[Tuple[str, str]]) – result from POS recognition.
        • result_entities (List[Tuple[str, str]]) – result of Entities recognition.
        • ngram (Tuple[int, int]) – ngram sizes.
        • accept_pos (List[str]) – accepted POS elements.
        • accept_entities (List[str]) – accepted entities elements.
    Returns result
    Return type list

malaya.generator.sentence_ngram(sentence: str, ngram: Tuple[int, int] = (1, 3))
    Generate ngrams for a text.
    Parameters
        • sentence (str) –
        • ngram (tuple) – ngram sizes.
    Returns result
    Return type list

malaya.generator.babble(string: str, model, generate_length: int = 30, leed_out_len: int = 1, temperature: float = 1.0, top_k: int = 100, burnin: int = 15, batch_size: int = 5)
    Use pretrained transformer models to generate a string given a prefix string. https://github.com/nyu-dl/bert-gen, https://arxiv.org/abs/1902.04094
    Parameters
        • string (str) –
        • model (object) – transformer interface object. Right now only supports BERT, ALBERT and ELECTRA.


        • generate_length (int, optional (default=256)) – length of sentence to generate.
        • leed_out_len (int, optional (default=1)) – length of extra masks for each iteration.
        • temperature (float, optional (default=1.0)) – logits * temperature.
        • top_k (int, optional (default=100)) – k for top-k sampling.
        • burnin (int, optional (default=15)) – for the first burnin steps, sample from the entire next word distribution, instead of top_k.
        • batch_size (int, optional (default=5)) – generate batch_size sentences.
    Returns result
    Return type List[str]

malaya.generator.available_gpt2()
    List available gpt2 generator models.

malaya.generator.gpt2(model: str = '345M', generate_length: int = 256, temperature: float = 1.0, top_k: int = 40, **kwargs)
    Load GPT2 model to generate a string given a prefix string.
    Parameters
        • model (str, optional (default='345M')) – Model architecture supported. Allowed values:
            – '117M' - GPT2 117M parameters.
            – '345M' - GPT2 345M parameters.
        • generate_length (int, optional (default=256)) – length of sentence to generate.
        • temperature (float, optional (default=1.0)) – temperature value, value should be between 0 and 1.
        • top_k (int, optional (default=40)) – top-k in nucleus sampling selection.
    Returns result
    Return type malaya.transformers.gpt2.Model class

malaya.generator.available_transformer()
    List available transformer models.

malaya.generator.transformer(model: str = 't5', quantized: bool = False, **kwargs)
    Load Transformer model to generate a string given an isi penting (important points).
    Parameters
        • model (str, optional (default='base')) – Model architecture supported. Allowed values:
            – 't5' - T5 BASE parameters.
            – 'small-t5' - T5 SMALL parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.


    Returns result – List of model classes:
        • if t5 in model, will return malaya.model.t5.Generator.
    Return type model
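The ngram helpers are pure functions, so their outputs are deterministic; a quick sketch:

import malaya

tokens = ['saya', 'suka', 'makan', 'ayam']
# bigrams: [('saya', 'suka'), ('suka', 'makan'), ('makan', 'ayam')]
malaya.generator.ngrams(tokens, n=2)

# all 1- to 3-grams of a raw sentence
malaya.generator.sentence_ngram('saya suka makan ayam', ngram=(1, 3))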

9.6.10 malaya.keyword_extraction

malaya.keyword_extraction.rake(string: str, model=None, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)
    Extract keywords using the Rake algorithm.
    Parameters
        • string (str) –
        • model (Object, optional (default=None)) – Transformer model or any model that has an attention method.
        • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.
        • top_k (int, optional (default=5)) – return top-k results.
        • atleast (int, optional (default=1)) – minimum count appearing in the string to accept as a candidate.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]. For the automatic ngram generator.
    Returns result
    Return type Tuple[float, str]

malaya.keyword_extraction.textrank(string: str, model=None, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)
    Extract keywords using the Textrank algorithm.
    Parameters
        • string (str) –
        • model (Object, optional (default=None)) – model that has a fit_transform or vectorize method.
        • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.
        • top_k (int, optional (default=5)) – return top-k results.
        • atleast (int, optional (default=1)) – minimum count appearing in the string to accept as a candidate.


        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]
    Returns result
    Return type Tuple[float, str]

malaya.keyword_extraction.attention(string: str, model, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)
    Extract keywords using an Attention mechanism.
    Parameters
        • string (str) –
        • model (Object) – Transformer model or any model that has an attention method.
        • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.
        • top_k (int, optional (default=5)) – return top-k results.
        • atleast (int, optional (default=1)) – minimum count appearing in the string to accept as a candidate.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]
    Returns result
    Return type Tuple[float, str]

malaya.keyword_extraction.similarity(string: str, model, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)
    Extract keywords using sentence embedding vs keyword embedding similarity.
    Parameters
        • string (str) –
        • model (Object) – Transformer model or any model that has a vectorize method.
        • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.
        • top_k (int, optional (default=5)) – return top-k results.
        • atleast (int, optional (default=1)) – minimum count appearing in the string to accept as a candidate.
        • stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]
    Returns result
    Return type Tuple[float, str]

malaya.keyword_extraction.available_transformer()
    List available transformer keyword similarity models.


malaya.keyword_extraction.transformer(model: str = 'bert', quantized: bool = False, **kwargs)
    Load Transformer keyword similarity model.
    Parameters
        • model (str, optional (default='bert')) – Model architecture supported. Allowed values:
            – 'bert' - Google BERT BASE parameters.
            – 'tiny-bert' - Google BERT TINY parameters.
            – 'xlnet' - Google XLNET BASE parameters.
            – 'alxlnet' - Malaya ALXLNET BASE parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result – List of model classes:
        • if bert in model, will return malaya.model.bert.KeyphraseBERT.
        • if xlnet in model, will return malaya.model.xlnet.KeyphraseXLNET.
    Return type model
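rake works without any model when the ngram generator is built automatically from stopwords; a minimal sketch (the sklearn vectorizer for textrank is an assumption based on the parameter docs, which only require a fit_transform or vectorize method):

import malaya
from sklearn.feature_extraction.text import TfidfVectorizer

string = 'Kerajaan akan membentangkan bajet baharu di Parlimen minggu depan'
# returns top-k (score, keyword) tuples using auto-generated ngrams
malaya.keyword_extraction.rake(string, top_k=5)

# textrank expects a model with fit_transform or vectorize
malaya.keyword_extraction.textrank(string, model=TfidfVectorizer(), top_k=5)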

9.6.11 malaya.knowledge_graph

malaya.knowledge_graph.parse_from_dependency(tagging: List[Tuple[str, str]], indexing: List[Tuple[str, str]], subjects: List[List[str]] = [['flat', 'subj', 'nsubj', 'csubj']], relations: List[List[str]] = [['acl', 'xcomp', 'ccomp', 'obj', 'conj', 'advcl'], ['obj']], objects: List[List[str]] = [['obj', 'compound', 'flat', 'nmod', 'obl']], get_networkx: bool = True)
    Generate knowledge graphs from dependency parsing; we suggest using dependency parsing v1.
    Parameters
        • tagging (List[Tuple(str, str)]) – tagging result from a dependency model.
        • indexing (List[Tuple(str, str)]) – indexing result from a dependency model.
        • subjects (List[List[str]], optional) – List of dependency labels for subjects.
        • relations (List[List[str]], optional) – List of dependency labels for relations.
        • objects (List[List[str]], optional) – List of dependency labels for objects.
        • get_networkx (bool, optional (default=True)) – If True, will generate networkx.MultiDiGraph.
    Returns result
    Return type Dict[result, G]

malaya.knowledge_graph.available_transformer()
    List available transformer models.

malaya.knowledge_graph.transformer(model: str = 'small-t5', quantized: bool = False, **kwargs)
    Load transformer to generate knowledge graphs in triples format from texts, MS text -> EN triples format.
    Parameters
        • model (str, optional (default='small-t5')) – Model architecture supported. Allowed values:
            – 't5' - T5 BASE parameters.
            – 'small-t5' - T5 SMALL parameters.
            – 'tiny-t5' - T5 TINY parameters.
        • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.
    Returns result
    Return type malaya.model.t5.KnowledgeGraph class
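A sketch of the dependency route (v1, as the docstring suggests); the predict call and its return shape are an assumption based on the Dependency Parsing tutorial:

import malaya

model = malaya.dependency.transformer(version='v1', model='xlnet')
# assumption: predict returns (d_object, tagging, indexing)
d_object, tagging, indexing = model.predict('Dr Mahathir menasihati mereka')
result = malaya.knowledge_graph.parse_from_dependency(tagging, indexing)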

9.6.12 malaya.language_detection

malaya.language_detection.fasttext(quantized: bool = True, **kwargs)
    Load Fasttext language detection model. Original size is 353MB, quantized size 31.1MB.
    Parameters quantized (bool, optional (default=True)) – if True, load quantized fasttext model. Else, load original fasttext model.
    Returns result
    Return type malaya.model.ml.LanguageDetection class

malaya.language_detection.deep_model(quantized: bool = False, **kwargs)
    Load deep learning language detection model. Original size is 51.2MB, quantized size 12.8MB.
    Parameters quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster, it totally depends on the machine.

    Returns result
    Return type malaya.model.tf.DeepLang class
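A loading sketch; prediction methods on the returned classes are covered in the Language Detection section:

import malaya

fast = malaya.language_detection.fasttext(quantized=True)
deep = malaya.language_detection.deep_model()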

9.6.13 malaya.lexicon

malaya.lexicon.random_walk(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, beta: float = 0.9, arccos: bool = True, normalization: bool = True, soft: bool = False, silent: bool = False)
    Induce a lexicon using the random walk technique, used in the paper https://arxiv.org/pdf/1606.02820.pdf
    Parameters
        • lexicon (Dict[str, List[str]]) – curated lexicon from an expert domain, {'label1': [str], 'label2': [str]}.
        • wordvector (object) – wordvector interface object.


        • pool_size (int, optional (default=10)) – pick top-pool_size from each lexicon.
        • top_n (int, optional (default=20)) – top_n for each vector will be multiplied with similarity_power.
        • similarity_power (float, optional (default=10.0)) – extra score for top_n; less will generate less induced bias but a high chance of an unbalanced outcome.
        • beta (float, optional (default=0.9)) – penalty score, towards 1.0 means less penalty. 0 < beta < 1.
        • arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T). If False, covariance + 1.
        • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
        • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest jarowinkler ratio. if False, it will throw an exception if a word is not in the dictionary.
        • silent (bool, optional (default=False)) – if True, will not print any logs.
    Returns result
    Return type tuple(labels[argmax(scores), axis = 1], scores, labels)

malaya.lexicon.propagate_probabilistic(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, arccos: bool = True, normalization: bool = True, soft: bool = False, silent: bool = False)
    Learns polarity scores via standard label propagation from lexicon sets.
    Parameters
        • lexicon (Dict[str, List[str]]) – curated lexicon from an expert domain, {'label1': [str], 'label2': [str]}.
        • wordvector (object) – wordvector interface object.
        • pool_size (int, optional (default=10)) – pick top-pool_size from each lexicon.
        • top_n (int, optional (default=20)) – top_n for each vector will be multiplied with similarity_power.
        • similarity_power (float, optional (default=10.0)) – extra score for top_n; less will generate less induced bias but a high chance of an unbalanced outcome.
        • arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T). If False, covariance + 1.
        • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
        • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest jarowinkler ratio. if False, it will throw an exception if a word is not in the dictionary.
        • silent (bool, optional (default=False)) – if True, will not print any logs.
    Returns result
    Return type tuple(labels[argmax(scores), axis = 1], scores, labels)


malaya.lexicon.propagate_graph(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, normalization: bool = True, soft: bool = False, silent: bool = False)
    Graph propagation method adapted from Velikovich, Leonid, et al. "The viability of web-derived polarity lexicons." http://www.aclweb.org/anthology/N10-1119
    Parameters
        • lexicon (Dict[str, List[str]]) – curated lexicon from an expert domain, {'label1': [str], 'label2': [str]}.
        • wordvector (object) – wordvector interface object.
        • pool_size (int, optional (default=10)) – pick top-pool_size from each lexicon.
        • top_n (int, optional (default=20)) – top_n for each vector will be multiplied with similarity_power.
        • similarity_power (float, optional (default=10.0)) – extra score for top_n; less will generate less induced bias but a high chance of an unbalanced outcome.
        • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
        • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest jarowinkler ratio. if False, it will throw an exception if a word is not in the dictionary.
        • silent (bool, optional (default=False)) – if True, will not print any logs.
    Returns result
    Return type tuple(labels[argmax(scores), axis = 1], scores, labels)
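A usage sketch; the seed lexicon is an arbitrary toy example, and the wordvector object is a placeholder for any wordvector interface (see the Vector section):

import malaya

seed = {'positive': ['baik', 'gembira'], 'negative': ['teruk', 'sedih']}
wv = ...  # placeholder: any wordvector interface object
results, scores, labels = malaya.lexicon.random_walk(seed, wv, pool_size=5)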

9.6.14 malaya.normalize

malaya.normalize.normalizer(speller=None, **kwargs)
    Load a Normalizer using any spelling correction model.
    Parameters speller (spelling correction object, optional (default=None)) –
    Returns result
    Return type malaya.normalize.Normalizer class

class malaya.normalize.Normalizer

    normalize(string: str, check_english: bool = True, normalize_text: bool = True, normalize_entity: bool = True, normalize_url: bool = False, normalize_email: bool = False, normalize_year: bool = True, normalize_telephone: bool = True)
        Normalize a string.
        Parameters
            • string (str) –
            • check_english (bool, (default=True)) – check whether a word is in the english dictionary.
            • normalize_text (bool, (default=True)) – if True, will try to replace shortforms with the internal corpus.


            • normalize_entity (bool, (default=True)) – normalize entities; only affects date, datetime, time and money pattern strings.
            • normalize_url (bool, (default=False)) – if True, replace :// with empty and . with dot. https://huseinhouse.com -> https huseinhouse dot com.
            • normalize_email (bool, (default=False)) – if True, replace @ with di, . with dot. [email protected] -> husein dot zol kosong lima di gmail dot com.
            • normalize_year (bool, (default=True)) – if True, tahun 1987 -> tahun sembilan belas lapan puluh tujuh. if True, 1970-an -> sembilan belas tujuh puluh an. if False, tahun 1987 -> tahun seribu sembilan ratus lapan puluh tujuh.
            • normalize_telephone (bool, (default=True)) – if True, no 012-1234567 -> no kosong satu dua, satu dua tiga empat lima enam tujuh
        Returns string
        Return type normalized string
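A usage sketch, reusing the docstring's own year example; the expected output in the comment comes straight from the parameter docs:

import malaya

normalizer = malaya.normalize.normalizer()
# 'tahun 1987' -> 'tahun sembilan belas lapan puluh tujuh' when normalize_year=True
normalizer.normalize('tahun 1987', normalize_year=True)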

9.6.15 malaya.nsfw

malaya.nsfw.lexicon(**kwargs)
    Load Lexicon NSFW model.
    Returns result
    Return type malaya.text.lexicon.nsfw.Lexicon class

malaya.nsfw.multinomial(**kwargs)
    Load multinomial NSFW model.
    Returns result
    Return type malaya.model.ml.BAYES class
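A loading sketch for both documented NSFW detectors:

import malaya

lexicon_model = malaya.nsfw.lexicon()    # lexicon/rule based
bayes_model = malaya.nsfw.multinomial()  # multinomial baseline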

9.6.16 malaya.num2word

malaya.num2word.to_cardinal(number)
    Translate from a number to its cardinal text representation.
    Parameters number (real number) –
    Returns result – cardinal representation
    Return type str

malaya.num2word.to_ordinal(number)
    Translate from a number to its ordinal text representation.
    Parameters number (real number) –
    Returns result – ordinal representation
    Return type str

malaya.num2word.to_ordinal_num(number)
    Translate from a number to its ordinal numbering text representation.
    Parameters number (int) –
    Returns result – ordinal numbering representation


Return type str

malaya.num2word.to_currency(value)
Translate from number input to cardinal currency text representation.
Parameters value (int) –
Returns result – cardinal currency representation
Return type str

malaya.num2word.to_year(value)
Translate from number input to cardinal year text representation.
Parameters value (int) –
Returns result – cardinal year representation
Return type str
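A short sketch; the outputs in the comments are illustrative expectations, not verified:

    import malaya

    malaya.num2word.to_cardinal(15)    # e.g. 'lima belas'
    malaya.num2word.to_ordinal(15)     # e.g. 'kelima belas'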

9.6.17 malaya.paraphrase

malaya.paraphrase.available_transformer()
List available transformer models.

malaya.paraphrase.transformer(model: str = 'small-t5', quantized: bool = False, **kwargs)
Load Malaya transformer encoder-decoder model to generate a paraphrase given a string.
Parameters
• model (str, optional (default='small-t5')) – Model architecture supported. Allowed values:
– 't5' - T5 BASE parameters.
– 'small-t5' - T5 SMALL parameters.
– 'tiny-t5' - T5 TINY parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if t5 in model, will return malaya.model.t5.Paraphrase.
Return type model
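A minimal sketch; the sentence is hypothetical and greedy_decoder comes from malaya.model.t5.Paraphrase (see 9.6.46):

    import malaya

    model = malaya.paraphrase.transformer(model='small-t5')
    model.greedy_decoder(['Malaya ialah perpustakaan NLP untuk bahasa Melayu.'])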

9.6.18 malaya.pos

malaya.pos.describe()
Describe Part-Of-Speech supported.

malaya.pos.available_transformer()
List available transformer Part-Of-Speech Tagging models.

malaya.pos.naive(string: str)
Recognize POS in a string using Regex.
Parameters string (str) –
Returns string


Return type List[Tuple[str, str]]

malaya.pos.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
Load Transformer POS Tagging model, transfer learning Transformer + CRF.
Parameters
• model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.TaggingBERT.
• if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.
Return type model
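A minimal sketch; the sentence is hypothetical and predict comes from malaya.model.bert.TaggingBERT (see 9.6.41):

    import malaya

    model = malaya.pos.transformer(model='tiny-bert')
    model.predict('Kuala Lumpur ialah ibu negara Malaysia.')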

9.6.19 malaya.preprocessing

malaya.preprocessing.unpack_english_contractions(text)
Replace English contractions in text str with their unshortened forms. N.B. The "'d" and "'s" forms are ambiguous (had/would, is/has/possessive), so are left as-is.
Important Note: The function is taken from textacy (https://github.com/chartbeat-labs/textacy).

malaya.preprocessing.preprocessing(normalize: List[str] = ['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'], annotate: List[str] = ['allcaps', 'elongated', 'repeated', 'emphasis', 'censored', 'hashtag'], lowercase: bool = True, fix_unidecode: bool = True, expand_english_contractions: bool = True, translate_english_to_bm: bool = True, speller=None, segmenter=None, stemmer=None, **kwargs)
Load Preprocessing class.
Parameters
• normalize (list) – normalizing tokens, can check all supported normalizing at malaya.preprocessing.get_normalize().
• annotate (list) – annotate tokens, only accepts ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].
• lowercase (bool) –
• fix_unidecode (bool) –
• expand_english_contractions (bool) – expand english contractions.


• translate_english_to_bm (bool) – translate english words to bahasa malaysia words.
• speller (object) – spelling correction object, needs to have a correct method.
• segmenter (object) – segmentation object, needs to have a segment method. If provided, it will expand hashtags, #mondayblues == monday blues.
• stemmer (object) – stemmer object, needs to have a stem method. If provided, it will stem or lemmatize the string.
Returns result
Return type malaya.preprocessing.Preprocessing class

class malaya.preprocessing.Tokenizer

tokenize(text)
Tokenize string.
Parameters text (str) –
Returns result
Return type List[str]

class malaya.preprocessing.Preprocessing
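A minimal tokenizer sketch; a no-argument constructor is assumed and the input string is hypothetical:

    import malaya

    tokenizer = malaya.preprocessing.Tokenizer()  # assumed no required arguments
    tokenizer.tokenize('Husein comel, kan?')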

9.6.20 malaya.qa

malaya.qa.available_transformer_squad()
List available Transformer Span models.

malaya.qa.transformer_squad(model: str = 'xlnet', quantized: bool = False, **kwargs)
Load Transformer Span model trained on SQUAD V2 dataset.
Parameters
• model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.model.tf.SQUAD class
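A minimal loading sketch; the question-answering interface lives on the returned malaya.model.tf.SQUAD object, which is not documented in this section:

    import malaya

    model = malaya.qa.transformer_squad(model='tiny-bert')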


9.6.21 malaya.relevancy

malaya.relevancy.available_transformer()
List available transformer relevancy analysis models.

malaya.relevancy.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
Load Transformer relevancy model.
Parameters
• model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
– 'bigbird' - Google BigBird BASE parameters.
– 'tiny-bigbird' - Malaya BigBird BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.MulticlassBERT.
• if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.
• if bigbird in model, will return malaya.model.bigbird.MulticlassBigBird.
Return type model
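A minimal sketch; the input string is hypothetical and predict_proba comes from malaya.model.bert.MulticlassBERT (see 9.6.41):

    import malaya

    model = malaya.relevancy.transformer(model='tiny-bert')
    model.predict_proba(['Kerajaan mengumumkan bantuan baharu untuk semua rakyat.'])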

9.6.22 malaya.segmentation

malaya.segmentation.viterbi(max_split_length: int = 20, **kwargs)
Load Segmenter class using Viterbi algorithm.
Parameters
• max_split_length (int, (default=20)) – maximum length of words in a sentence to segment.
• validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns result
Return type malaya.segmentation.Segmenter class

malaya.segmentation.available_transformer()
List available transformer models.


malaya.segmentation.transformer(model: str = 'small', quantized: bool = False, **kwargs)
Load transformer encoder-decoder model for segmentation.
Parameters
• model (str, optional (default='small')) – Model architecture supported. Allowed values:
– 'small' - Transformer SMALL parameters.
– 'base' - Transformer BASE parameters.
– 'super-tiny-t5' - T5 SUPER TINY parameters.
– 'super-super-tiny-t5' - T5 SUPER SUPER TINY parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.model.tf.Segmentation class

class malaya.segmentation.Segmenter

segment(strings: List[str])
Segment strings. Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"
Parameters strings (List[str]) –
Returns result
Return type List[str]
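A minimal sketch using the Viterbi segmenter and the example input above:

    import malaya

    segmenter = malaya.segmentation.viterbi()
    segmenter.segment(['sayasygkan negarasaya'])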

9.6.23 malaya.sentiment

malaya.sentiment.available_transformer()
List available transformer sentiment analysis models.

malaya.sentiment.multinomial(**kwargs)
Load multinomial sentiment model.
Returns result
Return type malaya.model.ml.Bayes class

malaya.sentiment.transformer(model: str = 'bert', quantized: bool = False, **kwargs)
Load Transformer sentiment model.
Parameters
• model (str, optional (default='bert')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.


• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.BinaryBERT.
• if xlnet in model, will return malaya.model.xlnet.BinaryXLNET.
Return type model
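A minimal sketch; the input string is hypothetical and predict_proba comes from malaya.model.bert.BinaryBERT (see 9.6.41):

    import malaya

    model = malaya.sentiment.transformer(model='tiny-albert')
    model.predict_proba(['kerajaan sebenarnya sangat prihatin dengan rakyat'])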

9.6.24 malaya.spell

class malaya.spell.Probability(corpus, sp_tokenizer=None)
The SpellCorrector extends the functionality of Peter Norvig's spell-corrector in http://norvig.com/spell-correct.html and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews
Also added custom vowels augmentation.

P(word)
Probability of word.

correct(word: str, **kwargs)
Most probable spelling correction for word.
Parameters word (str) –
Returns result
Return type str

class malaya.spell.Symspell(model, verbosity, corpus, k=10)
The SymspellCorrector extends the functionality of symspeller, https://github.com/mammothb/symspellpy and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews
Also added custom vowels augmentation.

edit_step(word)
Generate candidates given a word.
Parameters word (str) –
Returns result
Return type {candidate1, candidate2}

edit_candidates(word)
Generate candidates given a word.
Parameters word (str) –
Returns result
Return type List[str]

correct(word: str, **kwargs)
Most probable spelling correction for word.
Parameters word (str) –
Returns result
Return type str


correct_text(text: str)
Correct all the words within a text, returning the corrected text.
Parameters text (str) –
Returns result
Return type str

correct_match(match)
Spell-correct word in match, and preserve proper upper/lower/title case.

malaya.spell.probability(sentence_piece: bool = False, **kwargs)
Train a Probability Spell Corrector.
Parameters sentence_piece (bool, optional (default=False)) – if True, reduce possible augmentation states using sentence piece.
Returns result
Return type malaya.spell.Probability class

malaya.spell.symspell(max_edit_distance_dictionary: int = 2, prefix_length: int = 7, term_index: int = 0, count_index: int = 1, top_k: int = 10, **kwargs)
Load a symspell Spell Corrector for Malay.
Returns result
Return type malaya.spell.Symspell class

malaya.spell.jamspell(model: str = 'wiki', **kwargs)
Load a jamspell Spell Corrector for Malay.
Parameters model (str, optional (default='wiki')) – Supported models. Allowed values:
• 'wiki+news' - Wikipedia + News, 337MB.
• 'wiki' - Wikipedia, 148MB.
• 'news' - local news, 215MB.
Returns result
Return type malaya.spell.JamSpell class

malaya.spell.spylls(model: str = 'libreoffice-pejam', **kwargs)
Load a spylls Spell Corrector for Malay.
Parameters model (str, optional (default='libreoffice-pejam')) – Model spelling correction supported. Allowed values:
• 'libreoffice-pejam' - from LibreOffice pEJAm, https://extensions.libreoffice.org/en/extensions/show/3868
Returns result
Return type malaya.spell.Spylls class

malaya.spell.available_transformer()
List available transformer models.

malaya.spell.transformer(model: str = 'small-t5', quantized: bool = False, **kwargs)
Load a Transformer Spell Corrector.
Parameters


• model (str, optional (default='small-t5')) – Model architecture supported. Allowed values:
– 'small-t5' - T5 SMALL parameters.
– 'tiny-t5' - T5 TINY parameters.
– 'super-tiny-t5' - T5 SUPER TINY parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.model.t5.Spell class

malaya.spell.transformer_encoder(model, sentence_piece: bool = False, **kwargs)
Load a Transformer Encoder Spell Corrector. Currently only BERT and ALBERT are supported.
Parameters sentence_piece (bool, optional (default=False)) – if True, reduce possible augmentation states using sentence piece.
Returns result
Return type malaya.spell.Transformer class

class malaya.spell.Transformer

correct(word: str, string: str, index: int = -1, batch_size: int = 20)
Correct a word within a text, returning the corrected word.
Parameters
• word (str) –
• string (str) – entire string; word must be a word inside string.
• index (int, optional (default=-1)) – index of word in the string; if -1, will try to use string.index(word).
• batch_size (int, optional (default=20)) – batch size to insert into model.
Returns result
Return type str

correct_text(text: str, batch_size: int = 20)
Correct all the words within a text, returning the corrected text.
Parameters
• text (str) –
• batch_size (int, optional (default=20)) – batch size to insert into model.
Returns result
Return type str

correct_word(word: str, string: str, batch_size: int = 20)
Spell-correct word in match, and preserve proper upper/lower/title case.
Parameters
• word (str) –
• string (str) – entire string; word must be a word inside string.


• batch_size (int, optional (default=20)) – batch size to insert into model.
Returns result
Return type str
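A minimal sketch of the probability and symspell correctors; the misspelled inputs are hypothetical:

    import malaya

    corrector = malaya.spell.probability()
    corrector.correct('suke')       # most probable correction for a single word

    sym = malaya.spell.symspell()
    sym.correct_text('kerajaan patut bagi pencen awal kpd warga emas')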

54 Chapter 9. Contents: malaya Documentation


9.6.25 malaya.stack

malaya.stack.voting_stack(models, text: str)
Stacking for POS, Entities and Dependency models.
Parameters
• models (list) – list of models.
• text (str) – string to predict.
Returns result
Return type list

malaya.stack.predict_stack(models, strings: List[str], aggregate: Callable = scipy.stats.mstats.gmean, **kwargs)
Stacking for predictive models.
Parameters
• models (List[Callable]) – list of models.
• strings (List[str]) –
• aggregate (Callable, optional (default=scipy.stats.mstats.gmean)) – aggregate function.
Returns result
Return type dict
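A minimal stacking sketch; the choice of sentiment models and the input string are illustrative:

    import malaya

    albert = malaya.sentiment.transformer(model='tiny-albert')
    bert = malaya.sentiment.transformer(model='tiny-bert')

    # geometric mean (the default aggregate) over both models' probabilities
    malaya.stack.predict_stack([albert, bert], ['kerajaan sangat prihatin dengan rakyat'])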

9.6.26 malaya.stem

malaya.stem.naive()
Load stemming model that naively uses startswith and endswith regex patterns.
Returns result
Return type malaya.stem.Naive class

malaya.stem.sastrawi()
Load stemming model using Sastrawi; this also includes lemmatization.
Returns result
Return type malaya.stem.Sastrawi class

malaya.stem.deep_model(quantized: bool = False, **kwargs)
Load LSTM + Bahdanau Attention stemming model; this also includes lemmatization. Original size 41.6MB, quantized size 10.6MB.
Parameters quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.stem.DeepStemmer class


class malaya.stem.DeepStemmer

stem(string: str, beam_search: bool = False)
Stem a string; this also includes lemmatization.
Parameters
• string (str) –
• beam_search (bool, (default=False)) – if True, use beam search decoder, else use greedy decoder.
Returns result
Return type str

class malaya.stem.Sastrawi

class malaya.stem.Naive
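A minimal sketch; the input string is hypothetical:

    import malaya

    model = malaya.stem.deep_model()
    model.stem('saya sangat sukakan makanan itu')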

9.6.27 malaya.subjectivity

malaya.subjectivity.available_transformer()
List available transformer subjectivity analysis models.

malaya.subjectivity.multinomial(**kwargs)
Load multinomial subjectivity model.
Parameters validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns result
Return type malaya.model.ml.Bayes class

malaya.subjectivity.transformer(model: str = 'bert', quantized: bool = False, **kwargs)
Load Transformer subjectivity model.
Parameters
• model (str, optional (default='bert')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.BinaryBERT.
• if xlnet in model, will return malaya.model.xlnet.BinaryXLNET.


Return type model
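A minimal sketch; the input string is hypothetical:

    import malaya

    model = malaya.subjectivity.multinomial()
    model.predict(['saya rasa keputusan itu tidak adil'])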

9.6.28 malaya.tatabahasa

malaya.tatabahasa.describe()
Describe kesalahan tatabahasa (grammatical errors) supported. Full description at https://tatabahasabm.tripod.com/tata/salahtata.htm

malaya.tatabahasa.available_transformer()
List available transformer models.

malaya.tatabahasa.transformer(model: str = 'base', quantized: bool = False, **kwargs)
Load Malaya transformer encoder-decoder + tagging model to correct kesalahan tatabahasa in a text.
Parameters
• model (str, optional (default='base')) – Model architecture supported. Allowed values:
– 'small' - Malaya Transformer Tag SMALL parameters.
– 'base' - Malaya Transformer Tag BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.model.tf.Tatabahasa class
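A minimal loading sketch; the correction interface lives on the returned malaya.model.tf.Tatabahasa object, which is not documented in this section:

    import malaya

    model = malaya.tatabahasa.transformer(model='base')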

9.6.29 malaya.summarization.abstractive

malaya.summarization.abstractive.available_transformer()
List available transformer models.

malaya.summarization.abstractive.transformer(model: str = 'small-t5', quantized: bool = False, **kwargs)
Load Malaya transformer encoder-decoder model to generate a summary given a string.
Parameters
• model (str, optional (default='small-t5')) – Model architecture supported. Allowed values:
– 't5' - T5 BASE parameters.
– 'small-t5' - T5 SMALL parameters.
– 'tiny-t5' - T5 TINY parameters.
– 'pegasus' - Pegasus BASE parameters.
– 'small-pegasus' - Pegasus SMALL parameters.
– 'bigbird' - BigBird + Pegasus BASE parameters.
– 'small-bigbird' - BigBird + Pegasus SMALL parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:


• if t5 in model, will return malaya.model.t5.Summarization.
• if bigbird in model, will return malaya.model.bigbird.Summarization.
• if pegasus in model, will return malaya.model.pegasus.Summarization.
Return type model
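A minimal sketch; the article string is a hypothetical placeholder and greedy_decoder comes from malaya.model.t5.Summarization (see 9.6.46):

    import malaya

    model = malaya.summarization.abstractive.transformer(model='small-t5')
    article = 'KUALA LUMPUR: Kerajaan hari ini mengumumkan beberapa inisiatif baharu untuk membantu golongan B40.'  # hypothetical
    model.greedy_decoder([article])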

9.6.30 malaya.summarization.extractive

malaya.summarization.extractive.encoder(vectorizer)
Encoder interface for summarization.
Parameters vectorizer (object) – encoder interface object, eg, BERT, XLNET, ALBERT, ALXLNET. Should have a vectorize method.
Returns result
Return type malaya.model.extractive_summarization.Encoder

malaya.summarization.extractive.doc2vec(wordvector)
Doc2Vec interface for summarization.
Parameters wordvector (object) – malaya.wordvector.WordVector object. Should have a get_vector_by_name method.
Returns result
Return type malaya.model.extractive_summarization.Doc2Vec

malaya.summarization.extractive.sklearn(model, vectorizer)
sklearn interface for summarization.
Parameters
• model (object) – Should have a fit_transform method. Commonly:
– sklearn.decomposition.TruncatedSVD - LSA algorithm.
– sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.
• vectorizer (object) – Should have a fit_transform method. Commonly:
– sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
– sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.
– malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.
– malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.
Returns result
Return type malaya.model.extractive_summarization.SKLearn
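A minimal sketch; passing sklearn instances is an assumption based on the fit_transform note above, the corpus is hypothetical, and sentence_level comes from malaya.model.extractive_summarization.SKLearn (see 9.6.43):

    import malaya
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    model = malaya.summarization.extractive.sklearn(TruncatedSVD(), TfidfVectorizer())
    corpus = 'Kerajaan mengumumkan bantuan baharu. Harga minyak dijangka stabil. Rakyat menyambut baik langkah itu.'  # hypothetical
    result = model.sentence_level(corpus, top_k=2)
    result['summary']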


9.6.31 malaya.similarity

malaya.similarity.doc2vec_wordvector(wordvector)
Doc2vec interface for text similarity using Word Vector.
Parameters wordvector (object) – malaya.wordvector.WordVector object. Should have a get_vector_by_name method.
Returns result
Return type malaya.similarity.Doc2VecSimilarity

malaya.similarity.doc2vec_vectorizer(vectorizer)
Doc2vec interface for text similarity using Encoder model.
Parameters vectorizer (object) – encoder interface object, BERT, XLNET. Should have a vectorize method.
Returns result
Return type malaya.similarity.VectorizerSimilarity

malaya.similarity.available_transformer()
List available transformer similarity models.

malaya.similarity.transformer(model: str = 'bert', quantized: bool = False, **kwargs)
Load Transformer similarity model.
Parameters
• model (str, optional (default='bert')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.SiameseBERT.
• if xlnet in model, will return malaya.model.xlnet.SiameseXLNET.
Return type model

class malaya.similarity.VectorizerSimilarity

predict_proba(left_strings: List[str], right_strings: List[str], similarity: str = 'cosine')
Calculate similarity for two different batches of texts.
Parameters
• left_strings (list of str) –


• right_strings (list of str) –
• similarity (str, optional (default='cosine')) – similarity supported. Allowed values:
– 'cosine' - cosine similarity.
– 'euclidean' - euclidean similarity.
– 'manhattan' - manhattan similarity.
Returns result
Return type List[float]

heatmap(strings: List[str], similarity: str = 'cosine', visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))
Plot a heatmap based on output from similarity.
Parameters
• strings (list of str) – list of strings.
• similarity (str, optional (default='cosine')) – similarity supported. Allowed values:
– 'cosine' - cosine similarity.
– 'euclidean' - euclidean similarity.
– 'manhattan' - manhattan similarity.
• visualize (bool) – if True, it will render plt.show, else return data.
• figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns result – list of results
Return type list

class malaya.similarity.Doc2VecSimilarity

predict_proba(left_strings: List[str], right_strings: List[str], aggregation: Callable = numpy.mean, similarity: str = 'cosine', soft: bool = False)
Calculate similarity for two different batches of texts.
Parameters
• left_strings (list of str) –
• right_strings (list of str) –
• aggregation (Callable, optional (default=numpy.mean)) –
• similarity (str, optional (default='cosine')) – similarity supported. Allowed values:
– 'cosine' - cosine similarity.
– 'euclidean' - euclidean similarity.
– 'manhattan' - manhattan similarity.
• soft (bool, optional (default=False)) – if True, a word not inside the word vector will be replaced with the nearest word, else it will be skipped.
Returns result


Return type List[float]

heatmap(strings: List[str], aggregation: Callable = numpy.mean, similarity: str = 'cosine', soft: bool = False, visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))
Plot a heatmap based on output from similarity.
Parameters
• strings (list of str) – list of strings.
• aggregation (Callable, optional (default=numpy.mean)) –
• similarity (str, optional (default='cosine')) – similarity supported. Allowed values:
– 'cosine' - cosine similarity.
– 'euclidean' - euclidean similarity.
– 'manhattan' - manhattan similarity.
• soft (bool, optional (default=False)) – if True, a word not inside the word vector will be replaced with the nearest word, else it will be skipped.
• visualize (bool) – if True, it will render plt.show, else return data.
• figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns result – list of results.
Return type list
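A minimal sketch; the two strings are hypothetical paraphrases and predict_proba comes from malaya.model.bert.SiameseBERT (see 9.6.41):

    import malaya

    model = malaya.similarity.transformer(model='tiny-bert')
    model.predict_proba(['tolong order makanan'], ['pesan makanan untuk saya'])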

9.6.32 malaya.topic_model

malaya.topic_model.available_vectorizer()
List available vectorizers for topic modeling.

malaya.topic_model.sklearn(corpus: List[str], model, vectorizer, n_topics: int, cleaning=malaya.text.function.simple_textcleaning, stopwords=malaya.texts.function.get_stopwords, **kwargs)
Train a SKlearn model to do topic modelling based on corpus / list of strings given.
Parameters
• corpus (list) –
• model (object) – Should have a fit_transform method. Commonly:
– sklearn.decomposition.TruncatedSVD - LSA algorithm.
– sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.
– sklearn.decomposition.NMF - NMF algorithm.
• vectorizer (object) – Should have a fit_transform method. Commonly:
– sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
– sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.
– malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.


– malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.
• n_topics (int, (default=10)) – size of decomposition column.
• cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
• stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
Returns result
Return type malaya.topic_modelling.Topic class

malaya.topic_model.lda2vec(corpus: List[str], vectorizer, n_topics: int = 10, cleaning=malaya.text.function.simple_textcleaning, stopwords=malaya.texts.function.get_stopwords, window_size: int = 2, embedding_size: int = 128, epoch: int = 10, switch_loss: int = 1000, **kwargs)
Train a LDA2Vec model to do topic modelling based on corpus / list of strings given.
Parameters
• corpus (list) –
• vectorizer (object) – Should have a fit_transform method. Commonly:
– sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
– sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.
– malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.
– malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.
• n_topics (int, (default=10)) – size of decomposition column.
• cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
• stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
• embedding_size (int, (default=128)) – embedding size of lda2vec tensors.
• epoch (int, (default=10)) – training iterations, how many loops needed to train.
• switch_loss (int, (default=1000)) – baseline to switch from document-based loss to document + word based loss.
Returns result
Return type malaya.topic_modelling.DeepTopic class

malaya.topic_model.attention(corpus: List[str], n_topics: int, vectorizer, cleaning=malaya.text.function.simple_textcleaning, stopwords=malaya.texts.function.get_stopwords, ngram: Tuple[int, int] = (1, 3), batch_size: int = 10)
Use attention from a transformer model to do topic modelling based on corpus / list of strings given.
Parameters
• corpus (list) –


• n_topics (int, (default=10)) – size of decomposition column.
• vectorizer (object) –
• cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
• stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
• ngram (tuple, (default=(1,3))) – n-gram size to train the corpus.
• batch_size (int, (default=10)) – size of strings for each vectorization and attention.
Returns result
Return type malaya.topic_modelling.AttentionTopic class

class malaya.topic_model.AttentionTopic

top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)
Print important topics based on decomposition.
Parameters
• len_topic (int) – size of topics.
• top_n (int, optional (default=10)) – top n of each topic.
• return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic: int)
Return important topics based on decomposition.
Parameters len_topic (int) – size of topics.
Returns result
Return type List[str]

class malaya.topic_model.DeepTopic

visualize_topics(notebook_mode: bool = False, mds: str = 'pcoa')
Print important topics based on decomposition.
Parameters mds (str, optional (default='pcoa')) – 2D Decomposition. Allowed values:
• 'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
• 'mmds' - Dimension reduction via Multidimensional scaling
• 'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding

top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)
Print important topics based on decomposition.
Parameters
• len_topic (int) – size of topics.
• top_n (int, optional (default=10)) – top n of each topic.


• return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic: int)
Return important topics based on decomposition.
Parameters len_topic (int) – size of topics.
Returns result
Return type List[str]

get_sentences(len_sentence: int, k: int = 0)
Return important sentences related to selected column based on decomposition.
Parameters
• len_sentence (int) –
• k (int, (default=0)) – index of decomposition matrix.
Returns result
Return type List[str]

class malaya.topic_model.Topic

visualize_topics(notebook_mode: bool = False, mds: str = 'pcoa')
Print important topics based on decomposition.
Parameters mds (str, optional (default='pcoa')) – 2D Decomposition. Allowed values:
• 'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
• 'mmds' - Dimension reduction via Multidimensional scaling
• 'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding

top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)
Print important topics based on decomposition.
Parameters
• len_topic (int) – size of topics.
• top_n (int, optional (default=10)) – top n of each topic.
• return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic: int)
Return important topics based on decomposition.
Parameters len_topic (int) –
Returns result
Return type List[str]

get_sentences(len_sentence: int, k: int = 0)
Return important sentences related to selected column based on decomposition.
Parameters
• len_sentence (int) –


• k (int, (default=0)) – index of decomposition matrix.
Returns result
Return type List[str]
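A minimal sketch; the tiny corpus is hypothetical and passing sklearn instances for model and vectorizer is an assumption based on the fit_transform note above:

    import malaya
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        'kerajaan mengumumkan bantuan baharu untuk rakyat',
        'harga minyak dunia dijangka kekal stabil',
        'pasukan bola sepak negara menang malam tadi',
    ]  # hypothetical corpus
    lda = malaya.topic_model.sklearn(corpus, LatentDirichletAllocation(n_components=2), CountVectorizer(), n_topics=2)
    lda.top_topics(2, top_n=5)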

9.6.33 malaya.toxicity

malaya.toxicity.available_transformer()
List available transformer toxicity analysis models.

malaya.toxicity.multinomial(**kwargs)
Load multinomial toxicity model.
Returns result
Return type malaya.model.ml.MultilabelBayes class

malaya.toxicity.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)
Load Transformer toxicity model.
Parameters
• model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bert in model, will return malaya.model.bert.SigmoidBERT.
• if xlnet in model, will return malaya.model.xlnet.SigmoidXLNET.
Return type model
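A minimal sketch; the input string is hypothetical and predict comes from malaya.model.ml.MultilabelBayes (see 9.6.44):

    import malaya

    model = malaya.toxicity.multinomial()
    model.predict(['bodoh betul komen macam ni'])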

9.6.34 malaya.transformer

malaya.transformer.available_transformer()
List available transformer models.

malaya.transformer.load(model: str = 'electra', pool_mode: str = 'last', **kwargs)
Load transformer model.
Parameters
• model (str, optional (default='electra')) – Model architecture supported. Allowed values:


– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
– 'electra' - Google ELECTRA BASE parameters.
– 'small-electra' - Google ELECTRA SMALL parameters.
• pool_mode (str, optional (default='last')) – Model logits architecture supported. Only usable if model in ['xlnet', 'alxlnet']. Allowed values:
– 'last' - last of the sequence.
– 'first' - first of the sequence.
– 'mean' - mean of the sequence.
– 'attn' - attention of the sequence.
Returns result – one of the following model classes:
• if bert in model, will return malaya.transformers.bert.Model.
• if xlnet in model, will return malaya.transformers.xlnet.Model.
• if albert in model, will return malaya.transformers.albert.Model.
• if electra in model, will return malaya.transformers.electra.Model.
Return type model

9.6.35 malaya.translation.en_ms

malaya.translation.en_ms.available_transformer()
List available transformer models.

malaya.translation.en_ms.transformer(model: str = 'base', quantized: bool = False, **kwargs)
Load transformer encoder-decoder model to translate EN-to-MS.
Parameters
• model (str, optional (default='base')) – Model architecture supported. Allowed values:
– 'small' - Transformer SMALL parameters.
– 'base' - Transformer BASE parameters.
– 'large' - Transformer LARGE parameters.
– 'bigbird' - BigBird BASE parameters.
– 'small-bigbird' - BigBird SMALL parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.


Returns result – one of the following model classes:
• if bigbird in model, return malaya.model.bigbird.Translation.
• else, return malaya.model.tf.Translation.
Return type model
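A minimal sketch; the sentence is hypothetical and greedy_decoder comes from malaya.model.tf.Translation (see 9.6.47):

    import malaya

    model = malaya.translation.en_ms.transformer(model='base')
    model.greedy_decoder(['I really love eating nasi lemak.'])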

9.6.36 malaya.translation.ms_en

malaya.translation.ms_en.available_transformer()
List available transformer models.

malaya.translation.ms_en.transformer(model: str = 'base', quantized: bool = False, **kwargs)
Load Transformer encoder-decoder model to translate MS-to-EN.
Parameters
• model (str, optional (default='base')) – Model architecture supported. Allowed values:
– 'small' - Transformer SMALL parameters.
– 'base' - Transformer BASE parameters.
– 'large' - Transformer LARGE parameters.
– 'bigbird' - BigBird BASE parameters.
– 'small-bigbird' - BigBird SMALL parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:
• if bigbird in model, return malaya.model.bigbird.Translation.
• else, return malaya.model.tf.Translation.
Return type model

9.6.37 malaya.true_case

malaya.true_case.available_transformer()
List available transformer models.

malaya.true_case.transformer(model: str = 'base', quantized: bool = False, **kwargs)
Load transformer encoder-decoder model to True Case.
Parameters
• model (str, optional (default='base')) – Model architecture supported. Allowed values:
– 'small' - Transformer SMALL parameters.
– 'base' - Transformer BASE parameters.


• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result
Return type malaya.model.tf.TrueCase class
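A minimal sketch, reusing the example sentence from malaya.model.tf.TrueCase (see 9.6.47):

    import malaya

    model = malaya.true_case.transformer(model='base')
    model.greedy_decoder(['saya nak makan di us makanan di sana sedap'])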

9.6.38 malaya.word2num

malaya.word2num.word2num(string)
Translate from string to number, eg 'kesepuluh' -> 10.
Parameters string (str) –
Returns result
Return type int / float
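A short sketch; the first output comes from the example above, the second is an illustrative expectation:

    import malaya

    malaya.word2num.word2num('kesepuluh')       # -> 10
    malaya.word2num.word2num('dua puluh lima')  # expected 25 (illustrative)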

9.6.39 malaya.wordvector

malaya.wordvector.available_wordvector()
List available pretrained word vectors.

malaya.wordvector.load(model: str = 'wikipedia', **kwargs)
Return malaya.wordvector.WordVector object.
Parameters model (str, optional (default='wikipedia')) – Model architecture supported. Allowed values:
• 'wikipedia' - pretrained on Malay wikipedia word2vec size 256.
• 'socialmedia' - pretrained on cleaned Malay twitter and Malay instagram size 256.
• 'news' - pretrained on cleaned Malay news size 256.
• 'combine' - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.
Returns
• vocabulary (indices dictionary for vector.)
• vector (np.array, 2D.)

class malaya.wordvector.WordVector

get_vector_by_name(word: str, soft: bool = False, topn_soft: int = 5)
Get vector based on string.
Parameters
• word (str) –
• soft (bool, (default=False)) – if True, a word not in the dictionary will be replaced with the word that has the nearest Jaro-Winkler ratio. If False, an exception is thrown when a word is not in the dictionary.
• topn_soft (int, (default=5)) – if word is not found in the dictionary, will return topn_soft similar words using Jaro-Winkler.
Returns vector


Return type np.array, 1D

tree_plot(labels, figsize: Tuple[int, int] = (7, 7), annotate: bool = True)
Plot a tree plot based on output from calculator / n_closest / analogy.
Parameters
• labels (list) – output from calculator / n_closest / analogy.
• visualize (bool) – if True, it will render plt.show, else return data.
• figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns
• embed (np.array, 2D.)
• labelled (labels for X / Y axis.)

scatter_plot(labels, centre: str = None, figsize: Tuple[int, int] = (7, 7), plus_minus: int = 25, handoff: float = 5e-05)
Plot a scatter plot based on output from calculator / n_closest / analogy.
Parameters
• labels (list) – output from calculator / n_closest / analogy.
• centre (str, (default=None)) – centre label; if a str, it will annotate in a red color.
• figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns tsne
Return type np.array, 2D.

batch_calculator(equations: List[str], num_closest: int = 5, return_similarity: bool = False)
Batch calculator parser for word2vec using Tensorflow.
Parameters
• equations (list of str) – Eg, '[(mahathir + najib) - rosmah]'
• num_closest (int, (default=5)) – number of words closest to the result.
Returns word_list
Return type list of nearest words

calculator(equation: str, num_closest: int = 5, metric: str = 'cosine', return_similarity: bool = True)
Calculator parser for word2vec.
Parameters
• equation (str) – Eg, '(mahathir + najib) - rosmah'
• num_closest (int, (default=5)) – number of words closest to the result.
• metric (str, (default='cosine')) – vector distance algorithm.
• return_similarity (bool, (default=True)) – if True, will return a value between 0-1 representing the distance.
Returns word_list
Return type list of nearest words


batch_n_closest(words: List[str], num_closest: int = 5, return_similarity: bool = False, soft: bool = True)
Find nearest words based on a batch of words using Tensorflow.
Parameters
• words (list) – Eg, ['najib', 'anwar']
• num_closest (int, (default=5)) – number of words closest to the result.
• return_similarity (bool, (default=False)) – if True, will return a value between 0-1 representing the distance.
• soft (bool, (default=True)) – if True, a word not in the dictionary will be replaced with the word that has the nearest Jaro-Winkler ratio. If False, an exception is thrown when a word is not in the dictionary.
Returns word_list
Return type list of nearest words

n_closest(word: str, num_closest: int = 5, metric: str = 'cosine', return_similarity: bool = True)
Find nearest words based on a word.
Parameters
• word (str) – Eg, 'najib'
• num_closest (int, (default=5)) – number of words closest to the result.
• metric (str, (default='cosine')) – vector distance algorithm.
• return_similarity (bool, (default=True)) – if True, will return a value between 0-1 representing the distance.
Returns word_list
Return type list of nearest words

analogy(a: str, b: str, c: str, num: int = 1, metric: str = 'cosine')
Analogy calculation, vb - va + vc.
Parameters
• a (str) –
• b (str) –
• c (str) –
• num (int, (default=1)) –
• metric (str, (default='cosine')) – vector distance algorithm.
Returns word_list
Return type list of nearest words.

project_2d(start: int, end: int)
Project word2vec into 2d dimension.
Parameters
• start (int) –
• end (int) –
Returns


• embed_2d (TSNE decomposition)
• word_list (words in between start and end.)

network(word: str, num_closest: int = 8, depth: int = 4, min_distance: float = 0.5, iteration: int = 300, figsize: Tuple[int, int] = (15, 15), node_color: str = '#72bbd0', node_factor: int = 50)
Plot a social network based on the word given.
Parameters
• word (str) – centre of social network.
• num_closest (int, (default=8)) – number of words closest to the node.
• depth (int, (default=4)) – depth of social network. Deeper is more expensive to calculate, big^O(num_closest ** depth).
• min_distance (float, (default=0.5)) – minimum distance among nodes. Increase the value to increase the distance among nodes.
• iteration (int, (default=300)) – number of loops to train the social network to fit min_distance.
• figsize (tuple, (default=(15, 15))) – figure size for plot.
• node_color (str, (default='#72bbd0')) – color for nodes.
• node_factor (int, (default=50)) – size factor for depth nodes. Increasing this value will increase node sizes based on depth.
Returns G
Return type networkx graph object
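A minimal sketch; the WordVector constructor argument order is an assumption and the query words are illustrative:

    import malaya

    vocab, vector = malaya.wordvector.load(model='wikipedia')
    wv = malaya.wordvector.WordVector(vector, vocab)  # assumed constructor order

    wv.n_closest('najib', num_closest=5)
    wv.analogy('lelaki', 'raja', 'perempuan', num=3)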

9.6.40 malaya.zero_shot.classification

malaya.zero_shot.classification.available_transformer()
List available transformer zero-shot models.

malaya.zero_shot.classification.transformer(model: str = 'bert', quantized: bool = False, **kwargs)
Load Transformer zero-shot model.
Parameters
• model (str, optional (default='bert')) – Model architecture supported. Allowed values:
– 'bert' - Google BERT BASE parameters.
– 'tiny-bert' - Google BERT TINY parameters.
– 'albert' - Google ALBERT BASE parameters.
– 'tiny-albert' - Google ALBERT TINY parameters.
– 'xlnet' - Google XLNET BASE parameters.
– 'alxlnet' - Malaya ALXLNET BASE parameters.
• quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
Returns result – one of the following model classes:


• if bert in model, will return malaya.model.bert.ZeroshotBERT.
• if xlnet in model, will return malaya.model.xlnet.ZeroshotXLNET.
Return type model
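A minimal sketch; the input string and candidate labels are hypothetical and predict_proba comes from malaya.model.bert.ZeroshotBERT (see 9.6.41):

    import malaya

    model = malaya.zero_shot.classification.transformer(model='tiny-bert')
    model.predict_proba(['kerajaan umumkan bantuan RM100'], labels=['politik', 'sukan', 'kewangan'])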

9.6.41 malaya.model.bert

class malaya.model.bert.BinaryBERT

vectorize(strings: List[str], method: str = 'first')
Vectorize list of strings.
Parameters
• strings (List[str]) –
• method (str, optional (default='first')) – Vectorization layer supported. Allowed values:
– 'last' - vector from last sequence.
– 'first' - vector from first sequence.
– 'mean' - average vectors from all sequences.
– 'word' - average vectors based on tokens.
Returns result
Return type np.array

predict(strings: List[str], add_neutral: bool = True)
Classify list of strings.
Parameters
• strings (List[str]) –
• add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns result
Return type List[str]

predict_proba(strings: List[str], add_neutral: bool = True)
Classify list of strings and return probability.
Parameters
• strings (List[str]) –
• add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns result
Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
Classify words.
Parameters
• string (str) –


• method (str, optional (default='last')) – Attention layer supported. Allowed values:
– 'last' - attention from last layer.
– 'first' - attention from first layer.
– 'mean' - average attentions from all layers.
• visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.
Returns result
Return type dict

class malaya.model.bert.MulticlassBERT

vectorize(strings: List[str], method: str = 'first')
Vectorize list of strings.
Parameters
• strings (List[str]) –
• method (str, optional (default='first')) – Vectorization layer supported. Allowed values:
– 'last' - vector from last sequence.
– 'first' - vector from first sequence.
– 'mean' - average vectors from all sequences.
– 'word' - average vectors based on tokens.
Returns result
Return type np.array

predict(strings: List[str])
Classify list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[str]

predict_proba(strings: List[str])
Classify list of strings and return probability.
Parameters strings (List[str]) –
Returns result
Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
Classify words.
Parameters
• string (str) –
• method (str, optional (default='last')) – Attention layer supported. Allowed values:


– 'last' - attention from last layer.
– 'first' - attention from first layer.
– 'mean' - average attentions from all layers.
• visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.
Returns result
Return type dict

class malaya.model.bert.SigmoidBERT

vectorize(strings: List[str], method: str = 'first')
Vectorize list of strings.
Parameters
• strings (List[str]) –
• method (str, optional (default='first')) – Vectorization layer supported. Allowed values:
– 'last' - vector from last sequence.
– 'first' - vector from first sequence.
– 'mean' - average vectors from all sequences.
– 'word' - average vectors based on tokens.
Returns result
Return type np.array

predict(strings: List[str])
Classify list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[List[str]]

predict_proba(strings: List[str])
Classify list of strings and return probability.
Parameters strings (List[str]) –
Returns result
Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
Classify words.
Parameters
• string (str) –
• method (str, optional (default='last')) – Attention layer supported. Allowed values:
– 'last' - attention from last layer.
– 'first' - attention from first layer.


– 'mean' - average attentions from all layers.
• visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.
Returns result
Return type dict

class malaya.model.bert.SiameseBERT

vectorize(strings: List[str])
Vectorize list of strings.
Parameters strings (List[str]) –
Returns result
Return type np.array

predict_proba(strings_left: List[str], strings_right: List[str])
Calculate similarity for two different batches of texts.
Parameters
• strings_left (List[str]) –
• strings_right (List[str]) –
Returns list
Return type list of float

heatmap(strings: List[str], visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))
Plot a heatmap based on output from similarity.
Parameters
• strings (list of str) – list of strings.
• visualize (bool) – if True, it will render plt.show, else return data.
• figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns result – list of results
Return type list

class malaya.model.bert.TaggingBERT

vectorize(string: str)
Vectorize a string.
Parameters string (str) –
Returns result
Return type np.array

analyze(string: str)
Analyze a string.
Parameters string (str) –
Returns result


Return type {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}

predict(string: str)
Tag a string.
Parameters string (str) –
Returns result
Return type Tuple[str, str]

class malaya.model.bert.DependencyBERT

vectorize(string: str)
Vectorize a string.
Parameters string (str) –
Returns result
Return type np.array

predict(string: str)
Tag a string.
Parameters string (str) –
Returns result
Return type Tuple

class malaya.model.bert.ZeroshotBERT

vectorize(strings: List[str], labels: List[str], method: str = 'first')
Vectorize a string.
Parameters
• strings (List[str]) –
• labels (List[str]) –
• method (str, optional (default='first')) – Vectorization layer supported. Allowed values:
– 'last' - vector from last sequence.
– 'first' - vector from first sequence.
– 'mean' - average vectors from all sequences.
– 'word' - average vectors based on tokens.
Returns result
Return type np.array

predict_proba(strings: List[str], labels: List[str])
Classify list of strings and return probability.
Parameters
• strings (List[str]) –
• labels (List[str]) –


Returns list
Return type list of float

9.6.42 malaya.model.bigbird

class malaya.model.bigbird.MulticlassBigBird

vectorize(strings: List[str], method: str = 'first')
Vectorize list of strings.
Parameters
• strings (List[str]) –
• method (str, optional (default='first')) – Vectorization layer supported. Allowed values:
– 'last' - vector from last sequence.
– 'first' - vector from first sequence.
– 'mean' - average vectors from all sequences.
– 'word' - average vectors based on tokens.
Returns result
Return type np.array

predict(strings: List[str])
Classify list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[str]

predict_proba(strings: List[str])
Classify list of strings and return probability.
Parameters strings (List[str]) –
Returns result
Return type List[dict[str, float]]

class malaya.model.bigbird.Translation

greedy_decoder(strings: List[str])
Translate list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[str]

class malaya.model.bigbird.Summarization

greedy_decoder(strings: List[str], temperature: float = 0.0, postprocess: bool = False, **kwargs)
Summarize strings using greedy decoder.


Parameters
• strings (List[str]) –
• temperature (float, (default=0.0)) – logits * -log(random.uniform) * temperature.
• postprocess (bool, optional (default=False)) – if True, will filter sentences generated using ROUGE score and remove international news publishers.
Returns result
Return type List[str]

nucleus_decoder(strings: List[str], top_p: float = 0.7, temperature: float = 0.1, postprocess: bool = False, **kwargs)
Summarize strings using nucleus decoder.
Parameters
• strings (List[str]) –
• top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.
• temperature (float, (default=0.1)) – logits * -log(random.uniform) * temperature.
• postprocess (bool, optional (default=False)) – if True, will filter sentences generated using ROUGE score and remove international news publishers.
Returns result
Return type List[str]

9.6.43 malaya.model.extractive_summarization

class malaya.model.extractive_summarization.SKLearn

word_level(corpus, isi_penting: str = None, window_size: int = 10, important_words: int = 10, **kwargs)
Summarize list of strings / string on word level.
Parameters
• corpus (str / List[str]) –
• isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
• window_size (int, (default=10)) – window size for each word.
• important_words (int, (default=10)) – number of important words.
Returns dict
Return type {'top-words', 'cluster-top-words', 'score'}

sentence_level(corpus, isi_penting: str = None, top_k: int = 3, important_words: int = 10, **kwargs)
Summarize list of strings / string on sentence level.
Parameters
• corpus (str / List[str]) –


• isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
• top_k (int, (default=3)) – number of summarized strings.
• important_words (int, (default=10)) – number of important words.
Returns dict
Return type {'summary', 'top-words', 'cluster-top-words', 'score'}

class malaya.model.extractive_summarization.Doc2Vec

word_level(corpus, isi_penting: str = None, window_size: int = 10, aggregation=numpy.mean, soft: bool = False, **kwargs)
Summarize list of strings / string on word level.
Parameters
• corpus (str / List[str]) –
• isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
• window_size (int, (default=10)) – window size for each word.
• aggregation (Callable, optional (default=numpy.mean)) – aggregation method for Doc2Vec.
• soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the word that has the nearest Jaro-Winkler ratio; if False, it will return an embedding full of zeros.
Returns dict
Return type {'score'}

sentence_level(corpus, isi_penting: str = None, top_k: int = 3, aggregation=numpy.mean, soft: bool = False, **kwargs)
Summarize list of strings / string on sentence level.
Parameters
• corpus (str / List[str]) –
• isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
• top_k (int, (default=3)) – number of summarized strings.
• aggregation (Callable, optional (default=numpy.mean)) – aggregation method for Doc2Vec.
• soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the word that has the nearest Jaro-Winkler ratio; if False, it will return an embedding full of zeros.
Returns dict
Return type {'summary', 'score'}

class malaya.model.extractive_summarization.Encoder


9.6.44 malaya.model.ml

class malaya.model.ml.MulticlassBayes

predict(strings: List[str])
Classify list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[str]

predict_proba(strings: List[str])
Classify list of strings and return probability.
Parameters strings (List[str]) –
Returns result
Return type List[dict[str, float]]

class malaya.model.ml.BinaryBayes

predict(strings: List[str], add_neutral: bool = True)
Classify list of strings.
Parameters
• strings (List[str]) –
• add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns result
Return type List[str]

predict_proba(strings: List[str], add_neutral: bool = True)
Classify list of strings and return probability.
Parameters
• strings (List[str]) –
• add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns result
Return type List[dict[str, float]]

class malaya.model.ml.MultilabelBayes

predict(strings: List[str])
Classify list of strings.
Parameters strings (List[str]) –
Returns result
Return type List[List[str]]


predict_proba(strings: List[str])
Classify list of strings and return probability.
Parameters strings (List[str]) –
Returns result
Return type List[dict[str, float]]

9.6.45 malaya.model.pegasus

class malaya.model.pegasus.Summarization

greedy_decoder(strings: List[str], temperature: float = 0.0, postprocess: bool = False, **kwargs)
Summarize strings using greedy decoder.
Parameters
• strings (List[str]) –
• temperature (float, (default=0.0)) – logits * -log(random.uniform) * temperature.
• postprocess (bool, optional (default=False)) – if True, will filter sentences generated using ROUGE score and remove international news publishers.
Returns result
Return type List[str]

nucleus_decoder(strings: List[str], top_p: float = 0.7, temperature: float = 0.2, postprocess: bool = False, **kwargs)
Summarize strings using nucleus decoder.
Parameters
• strings (List[str]) –
• top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.
• temperature (float, (default=0.2)) – logits * -log(random.uniform) * temperature.
• postprocess (bool, optional (default=False)) – if True, will filter sentences generated using ROUGE score and remove international news publishers.
Returns result
Return type List[str]


9.6.46 malaya.model.t5

class malaya.model.t5.Summarization

greedy_decoder(strings: List[str], postprocess: bool = False, **kwargs)
    Summarize strings.

    Parameters
        • strings (List[str]) –
        • postprocess (bool, optional (default=False)) – if True, will filter sentences generated using ROUGE score and remove international news publishers.

    Returns result
    Return type List[str]

class malaya.model.t5.Generator

greedy_decoder(strings: List[str])
    Generate a long text given an isi penting (list of important points). Decoder is greedy with beam width 1, alpha 0.5.

    Parameters strings (List[str]) –
    Returns result
    Return type str

class malaya.model.t5.Paraphrase

greedy_decoder(strings: List[str])
    Paraphrase strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.t5.KnowledgeGraph

greedy_decoder(strings: List[str], get_networkx: bool = True)
    Generate triples knowledge graph using greedy decoder. Example, “Joseph Enanga juga bermain untuk Union Douala.” -> “Joseph Enanga member of sports team Union Douala”.

    Parameters
        • strings (List[str]) –
        • get_networkx (bool, optional (default=True)) – if True, will generate networkx.MultiDiGraph.

    Returns result
    Return type List[Dict]

class malaya.model.t5.Spell

greedy_decoder(strings: List[str])
    Spelling correction for strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.t5.Segmentation

greedy_decoder(strings: List[str])
    Segment strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]
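All of these T5 classes share the same greedy_decoder entry point. A hedged sketch, assuming loader names that follow the module layout (they may differ by version):

import malaya

# Hypothetical loaders; verify the exact names in your Malaya version.
segmenter = malaya.segmentation.transformer(model='t5')
speller = malaya.spell.transformer(model='t5')

# Word segmentation, e.g. -> ['saya sygkan negara saya']
print(segmenter.greedy_decoder(['sayasygkan negarasaya']))

# Spelling correction, one corrected string per input string.
print(speller.greedy_decoder(['kerajaan patut bagi pencen awal skt kpd warga emas']))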

9.6.47 malaya.model.tf

class malaya.model.tf.DeepLang

predict(strings: List[str])
    Classify list of strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

predict_proba(strings: List[str])
    Classify list of strings and return probability.

    Parameters strings (List[str]) –
    Returns result
    Return type List[dict[str, float]]

class malaya.model.tf.Translation

greedy_decoder(strings: List[str])
    Translate list of strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

beam_decoder(strings: List[str])
    Translate list of strings using beam decoder, beam width 3, alpha 0.5.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.tf.Constituency


vectorize(string: str)
    Vectorize a string.

    Parameters string (str) –
    Returns result
    Return type np.array

parse_nltk_tree(string: str)
    Parse a string into an NLTK Tree; to make it useful, make sure you have tkinter installed.

    Parameters string (str) –
    Returns result
    Return type nltk.Tree object

parse_tree(string)
    Parse a string into tree format.

    Parameters string (str) –
    Returns result
    Return type malaya.text.trees.InternalTreebankNode class

class malaya.model.tf.TrueCase

greedy_decoder(strings: List[str])
    True case strings using greedy decoder. Example, “saya nak makan di us makanan di sana sedap” -> “Saya nak makan di US, makanan di sana sedap.”

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

beam_decoder(strings: List[str])
    True case strings using beam decoder, beam width 3, alpha 0.5. Example, “saya nak makan di us makanan di sana sedap” -> “Saya nak makan di US, makanan di sana sedap.”

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.tf.Segmentation

greedy_decoder(strings: List[str])
    Segment strings using greedy decoder. Example, “sayasygkan negarasaya” -> “saya sygkan negara saya”.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

beam_decoder(strings: List[str])
    Segment strings using beam decoder, beam width 3, alpha 0.5. Example, “sayasygkan negarasaya” -> “saya sygkan negara saya”.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.tf.Paraphrase

greedy_decoder(strings: List[str], **kwargs)
    Paraphrase strings using greedy decoder.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

beam_decoder(strings: List[str], **kwargs)
    Paraphrase strings using beam decoder, beam width 3, alpha 0.5.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

nucleus_decoder(strings: List[str], top_p: float = 0.7, **kwargs)
    Paraphrase strings using nucleus sampling.

    Parameters
        • strings (List[str]) –
        • top_p (float, (default=0.7)) – sort the cumulative distribution and cut off as soon as the CDF exceeds top_p.

    Returns result
    Return type List[str]

class malaya.model.tf.Tatabahasa

greedy_decoder(strings: List[str])
    Fix kesalahan tatabahasa (grammatical errors).

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

class malaya.model.tf.SQUAD

predict(paragraph_text: str, question_texts: List[str], doc_stride: int = 128, max_query_length: int = 64, max_answer_length: int = 64, n_best_size: int = 20)
    Predict spans from questions given a paragraph.

    Parameters
        • paragraph_text (str) –
        • question_texts (List[str]) – list of questions; results really depend on case-sensitive questions.
        • doc_stride (int, optional (default=128)) – striding size to split a paragraph into multiple texts.
        • max_query_length (int, optional (default=64)) – maximum length of question tokens.
        • max_answer_length (int, optional (default=64)) – maximum length of answer tokens.

    Returns result
    Return type List[{'text': 'text', 'start': 0, 'end': 1}]

vectorize(strings: List[str], method: str = 'first')
    Vectorize list of strings.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='first')) – vectorization layer supported. Allowed values:
            – 'last' - vector from last sequence.
            – 'first' - vector from first sequence.
            – 'mean' - average vectors from all sequences.
            – 'word' - average vectors based on tokens.

    Returns result
    Return type np.array

class malaya.model.tf.KnowledgeGraph

greedy_decoder(strings: List[str], get_networkx: bool = True)
    Generate triples knowledge graph using greedy decoder. Example, “Joseph Enanga juga bermain untuk Union Douala.” -> “Joseph Enanga member of sports team Union Douala”.

    Parameters
        • strings (List[str]) –
        • get_networkx (bool, optional (default=True)) – if True, will generate networkx.MultiDiGraph.

    Returns result
    Return type List[Dict]

beam_decoder(strings: List[str], get_networkx: bool = True)
    Generate triples knowledge graph using beam decoder. Example, “Joseph Enanga juga bermain untuk Union Douala.” -> “Joseph Enanga member of sports team Union Douala”.

    Parameters
        • strings (List[str]) –
        • get_networkx (bool, optional (default=True)) – if True, will generate networkx.MultiDiGraph.

    Returns result
    Return type List[Dict]
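A hedged usage sketch for the SQUAD class above; the loader name (malaya.qa.transformer_squad) and model key are assumptions based on section 9.54.

import malaya

# Assumption: this loader returns the malaya.model.tf.SQUAD class documented above.
model = malaya.qa.transformer_squad(model='tiny-bert')  # hypothetical loader / key

paragraph = 'Tun Dr Mahathir bin Mohamad ialah Perdana Menteri Malaysia yang ketujuh.'
questions = ['Siapakah Perdana Menteri Malaysia yang ketujuh?']

# Each answer comes back with its text plus character offsets in the paragraph.
print(model.predict(paragraph, questions, doc_stride=128, max_answer_length=64))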


9.6.48 malaya.model.xlnet

class malaya.model.xlnet.BinaryXLNET

vectorize(strings: List[str], method: str = 'first')
    Vectorize list of strings.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='first')) – vectorization layer supported. Allowed values:
            – 'last' - vector from last sequence.
            – 'first' - vector from first sequence.
            – 'mean' - average vectors from all sequences.
            – 'word' - average vectors based on tokens.

    Returns result
    Return type np.array

predict(strings: List[str], add_neutral: bool = True)
    Classify list of strings.

    Parameters
        • strings (List[str]) –
        • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

    Returns result
    Return type List[str]

predict_proba(strings: List[str], add_neutral: bool = True)
    Classify list of strings and return probability.

    Parameters
        • strings (List[str]) –
        • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

    Returns result
    Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
    Classify words.

    Parameters
        • string (str) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.
        • visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.

    Returns result
    Return type dict

class malaya.model.xlnet.MulticlassXLNET

vectorize(strings: List[str], method: str = 'first')
    Vectorize list of strings.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='first')) – vectorization layer supported. Allowed values:
            – 'last' - vector from last sequence.
            – 'first' - vector from first sequence.
            – 'mean' - average vectors from all sequences.
            – 'word' - average vectors based on tokens.

    Returns result
    Return type np.array

predict(strings: List[str])
    Classify list of strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[str]

predict_proba(strings: List[str])
    Classify list of strings and return probability.

    Parameters strings (List[str]) –
    Returns result
    Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
    Classify words.

    Parameters
        • string (str) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.
        • visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.

    Returns result
    Return type dict

class malaya.model.xlnet.SigmoidXLNET

vectorize(strings: List[str], method: str = 'first')
    Vectorize list of strings.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='first')) – vectorization layer supported. Allowed values:
            – 'last' - vector from last sequence.
            – 'first' - vector from first sequence.
            – 'mean' - average vectors from all sequences.
            – 'word' - average vectors based on tokens.

    Returns result
    Return type np.array

predict(strings: List[str])
    Classify list of strings.

    Parameters strings (List[str]) –
    Returns result
    Return type List[List[str]]

predict_proba(strings: List[str])
    Classify list of strings and return probability.

    Parameters strings (List[str]) –
    Returns result
    Return type List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)
    Classify words.

    Parameters
        • string (str) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.
        • visualization (bool, optional (default=True)) – if True, it will open the visualization dashboard.

    Returns result
    Return type dict

class malaya.model.xlnet.SiameseXLNET

vectorize(strings: List[str])
    Vectorize list of strings.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array

predict_proba(strings_left: List[str], strings_right: List[str])
    Calculate similarity for two different batches of texts.

    Parameters
        • strings_left (List[str]) –
        • strings_right (List[str]) –

    Returns result
    Return type List[float]

heatmap(strings: List[str], visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))
    Plot a heatmap based on output from similarity.

    Parameters
        • strings (list of str) – list of strings.
        • visualize (bool) – if True, it will render plt.show, else return data.
        • figsize (tuple, (default=(7, 7))) – figure size for plot.

    Returns result – list of results
    Return type list

class malaya.model.xlnet.TaggingXLNET

vectorize(string: str)
    Vectorize a string.

    Parameters string (str) –
    Returns result
    Return type np.array

analyze(string: str)
    Analyze a string.

    Parameters string (str) –
    Returns result
    Return type {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}

predict(string: str)
    Tag a string.

    Parameters string (str) –
    Returns result
    Return type Tuple[str, str]

class malaya.model.xlnet.DependencyXLNET

vectorize(string: str)
    Vectorize a string.

    Parameters string (str) –
    Returns result
    Return type np.array

predict(string: str)
    Tag a string.

    Parameters string (str) –
    Returns result
    Return type Tuple

class malaya.model.xlnet.ZeroshotXLNET

vectorize(strings: List[str], labels: List[str], method: str = 'first')
    Vectorize a string.

    Parameters
        • strings (List[str]) –
        • labels (List[str]) –
        • method (str, optional (default='first')) – vectorization layer supported. Allowed values:
            – 'last' - vector from last sequence.
            – 'first' - vector from first sequence.
            – 'mean' - average vectors from all sequences.
            – 'word' - average vectors based on tokens.

    Returns result
    Return type np.array

predict_proba(strings: List[str], labels: List[str])
    Classify list of strings and return probability.

    Parameters
        • strings (List[str]) –
        • labels (List[str]) –

    Returns list
    Return type list of float
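In practice you rarely construct these XLNET classes directly; the task loaders return them. A hedged sketch using the sentiment loader shown elsewhere in these docs; that model='alxlnet' returns a BinaryXLNET-style object is an assumption.

import malaya

model = malaya.sentiment.transformer(model='alxlnet')

strings = ['kerajaan ni teruk betul', 'comelnya kucing ni']
print(model.predict_proba(strings))

# Per-word attribution without opening the visualization dashboard.
print(model.predict_words(strings[0], method='last', visualization=False))

# One fixed-size vector per string, useful for downstream clustering.
print(model.vectorize(strings, method='first').shape)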


9.6.49 malaya.transformers.albert

malaya.transformers.albert.load(model: str = 'albert', **kwargs)
    Load albert model.

    Parameters model (str, optional (default='albert')) – Model architecture supported. Allowed values:
        • 'albert' - base albert-bahasa released by Malaya.
        • 'tiny-albert' - tiny albert-bahasa released by Malaya.

    Returns result
    Return type malaya.transformers.albert.Model class

class malaya.transformers.albert.Model

vectorize(strings: List[str])
    Vectorize string inputs.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array

attention(strings: List[str], method: str = 'last', **kwargs)
    Get attention for string inputs.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.

    Returns result
    Return type List[List[Tuple[str, float]]]

visualize_attention(string: str)
    Visualize attention.

    Parameters string (str) –
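A short usage sketch built directly from the interface above (output shapes are indicative only):

import malaya

albert = malaya.transformers.albert.load(model='albert')

strings = ['tolong hantar laporan itu sebelum esok']

# One row per input string.
print(albert.vectorize(strings).shape)

# Per-token attention weights from the last layer; each inner list holds
# (token, weight) pairs for one input string.
print(albert.attention(strings, method='last'))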

9.6.50 malaya.transformers.alxlnet

malaya.transformers.alxlnet.load(model: str = 'alxlnet', pool_mode: str = 'last', **kwargs)
    Load alxlnet model.

    Parameters
        • model (str, optional (default='alxlnet')) – Model architecture supported. Allowed values:
            – 'alxlnet' - XLNET architecture from Google + Malaya.
        • pool_mode (str, optional (default='last')) – Model logits architecture supported. Allowed values:
            – 'last' - last of the sequence.
            – 'first' - first of the sequence.
            – 'mean' - mean of the sequence.
            – 'attn' - attention of the sequence.

    Returns result
    Return type malaya.transformers.alxlnet.Model class

class malaya.transformers.alxlnet.Model

vectorize(strings: List[str])
    Vectorize string inputs.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array

attention(strings: List[str], method: str = 'last', **kwargs)
    Get attention for string inputs.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.

    Returns result
    Return type List[List[Tuple[str, float]]]

visualize_attention(string: str)
    Visualize attention.

    Parameters string (str) –

9.6.51 malaya.transformers.bert

malaya.transformers.bert.load(model: str = 'base', **kwargs)
    Load bert model.

    Parameters model (str, optional (default='base')) – Model architecture supported. Allowed values:
        • 'bert' - base bert-bahasa released by Malaya.
        • 'tiny-bert' - tiny bert-bahasa released by Malaya.

    Returns result
    Return type malaya.transformers.bert.Model class


class malaya.transformers.bert.Model

vectorize(strings: List[str])
    Vectorize string inputs.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array

attention(strings: List[str], method: str = 'last', **kwargs)
    Get attention for string inputs.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.

    Returns result
    Return type List[List[Tuple[str, float]]]

visualize_attention(string: str)
    Visualize attention.

    Parameters string (str) –

9.6.52 malaya.transformers.electra

malaya.transformers.electra.load(model: str = 'electra', **kwargs)
    Load electra model.

    Parameters model (str, optional (default='electra')) – Model architecture supported. Allowed values:
        • 'electra' - base electra-bahasa released by Malaya.
        • 'small-electra' - small electra-bahasa released by Malaya.

    Returns result
    Return type malaya.transformers.electra.Model class

class malaya.transformers.electra.Model

vectorize(strings: List[str])
    Vectorize string inputs.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array


attention(strings: List[str], method: str = 'last', **kwargs)
    Get attention for string inputs.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.

    Returns result
    Return type List[List[Tuple[str, float]]]

visualize_attention(string: str)
    Visualize attention.

    Parameters string (str) –

9.6.53 malaya.transformers.gpt2

malaya.transformers.gpt2.load(model='345M', generate_length=100, temperature=1.0, top_k=40, **kwargs)
    Load gpt2 model.

    Parameters
        • model (str, optional (default='345M')) – Model architecture supported. Allowed values:
            – '117M' - GPT2 117M parameters.
            – '345M' - GPT2 345M parameters.
        • generate_length (int, optional (default=100)) – length of sentence to generate.
        • temperature (float, optional (default=1.0)) – temperature value; should be between 0 and 1.
        • top_k (int, optional (default=40)) – top-k sampling selection.

    Returns result
    Return type malaya.transformers.gpt2.Model class

class malaya.transformers.gpt2.Model

generate(string: str)
    Generate a text given an initial string.

    Parameters string (str) –
    Returns result
    Return type str


9.6.54 malaya.transformers.xlnet

malaya.transformers.xlnet.load(model: str = 'xlnet', pool_mode: str = 'last', **kwargs)
    Load xlnet model.

    Parameters
        • model (str, optional (default='xlnet')) – Model architecture supported. Allowed values:
            – 'xlnet' - XLNET architecture from Google.
        • pool_mode (str, optional (default='last')) – Model logits architecture supported. Allowed values:
            – 'last' - last of the sequence.
            – 'first' - first of the sequence.
            – 'mean' - mean of the sequence.
            – 'attn' - attention of the sequence.

    Returns result
    Return type malaya.transformers.xlnet.Model class

class malaya.transformers.xlnet.Model

vectorize(strings: List[str])
    Vectorize string inputs.

    Parameters strings (List[str]) –
    Returns result
    Return type np.array

attention(strings: List[str], method: str = 'last', **kwargs)
    Get attention for string inputs.

    Parameters
        • strings (List[str]) –
        • method (str, optional (default='last')) – attention layer supported. Allowed values:
            – 'last' - attention from last layer.
            – 'first' - attention from first layer.
            – 'mean' - average attentions from all layers.

    Returns result
    Return type List[List[Tuple[str, float]]]

visualize_attention(string: str)
    Visualize attention.

    Parameters string (str) –


9.7 Contributing

Contributions are welcome and are greatly appreciated! Every little bit helps, and credit will always be given.

9.7.1 Code Formatting

We use AutoPEP8 for code formatting and standards; check pyproject.toml at the root directory.

9.7.2 Report Bugs

Report bugs through GitHub issues. Please include relevant information and, preferably, code that reproduces the problem. Do not email us about issues; we will not respond to emails, so submit a proper GitHub issue.

9.7.3 Fix Bugs

Look through the GitHub issues for bugs. Any bug is open to whoever wants to fix it.

9.7.4 Implement Features

Look through the GitHub issues or the Malaya project board for features. Any unassigned improvement issue is open to whoever wants to implement it. We use frozen-graph Tensorflow, so you should be able to freeze any Tensorflow (version 1.15 and above) or Keras model.

9.7.5 Dataset

Create a new GitHub issue about your data, including the data link, or attach the data there. If you want to improve a dataset we already have, check Malaya-Dataset. Or, you can simply email your data if you do not want to expose it to the public. Malaya will not expose your data, but we will expose the models trained on your data. Thanks to,

1. Fake news, contributed by syazanihussin
2. Speech voice, contributed by Khalil Nooh
3. Speech voice, contributed by Mas Aisyah Ahmad
4. Singlish text dump, contributed by brytjy
5. Singapore news, contributed by brytjy


9.7.6 Improve Documentation

Malaya could always use better documentation; there might be typos or incorrect object names.

9.7.7 Submit Feedback

The best way to send feedback is to open a GitHub issue.

9.7.8 Unit test

Test every possible program flow! You can check the existing unit tests here. Feel free to help Malaya write unit tests, fork it!
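If you want to contribute a test, a minimal pytest-style sketch might look like this (the loader used and the test layout are assumptions, not necessarily Malaya's actual suite):

# test_sentiment.py - hypothetical unit-test sketch.
import malaya

def test_sentiment_returns_one_label_per_string():
    # Assumption: a lightweight classical model keeps the test fast.
    model = malaya.sentiment.multinomial()
    strings = ['bestnya hari ini', 'teruknya perkhidmatan ni']
    labels = model.predict(strings)
    assert len(labels) == len(strings)
    assert all(isinstance(label, str) for label in labels)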

9.8 GPU Environment

This tutorial is available as an IPython notebook at Malaya/example/gpu-environment.

[1]: %%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)

CPU times: user 6.33 s, sys: 2.5 s, total: 8.83 s
Wall time: 4.65 s

9.8.1 List available GPU

[2]: malaya.utils.available_gpu()
[2]: [('GPU:0', '29.16 GB'),
     ('GPU:1', '29.163 GB'),
     ('GPU:2', '29.163 GB'),
     ('GPU:3', '29.163 GB')]

9.8.2 Limit GPU memory

By default Malaya does not cap GPU memory. To set a cap, override the gpu_limit parameter in any load model API; gpu_limit must satisfy 0 < gpu_limit < 1. If gpu_limit = 0.3, the model will not use more than 30% of GPU memory.

malaya.sentiment.transformer(gpu_limit=0.3)


9.8.3 Not all operations supported by GPU

Yes, some models might be faster on CPU because of the overhead of transitioning between CPU and GPU too frequently, for example, the transformer model from the T2T library.

9.8.4 N Models to N gpus

To allocate a model to another GPU, set device to a different GPU, e.g. GPU:1; the default is GPU:0.

model_sentiment = malaya.sentiment.transformer(model='bert', gpu_limit=0.5, device='GPU:0')
model_subjectivity = malaya.subjectivity.transformer(model='bert', gpu_limit=0.5, device='GPU:1')
model_emotion = malaya.emotion.transformer(model='bert', gpu_limit=0.5, device='GPU:2')
model_translation = malaya.translation.ms_en.transformer(gpu_limit=0.5, device='GPU:3')

9.8.5 GPU Rules

1. Malaya will not consume all available GPU memory, but will slowly grow based on batch size. This growth is only positive (uses more GPU memory) and dynamic; feeding a small batch size will not reduce GPU memory already claimed. 2. Use malaya.utils.close_session to clear the session of unused models, but this will not free GPU memory.
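For example, a hedged sketch of rule 2 (the exact call signature of malaya.utils.close_session is an assumption; check the API reference):

import malaya

model = malaya.sentiment.transformer(model='bert', gpu_limit=0.3, device='GPU:0')
print(model.predict_proba(['comel sangat kucing ni']))

# Clear the session once the model is no longer needed. This releases the
# graph/session, but GPU memory the process already grew into stays claimed.
malaya.utils.close_session(model)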

[5]: anger_text = 'babi la company ni, aku dah la penat datang dari jauh'
     fear_text = 'takut doh tengok cerita hantu tadi'
     happy_text = 'bestnya dapat tidur harini, tak payah pergi kerja'
     love_text = 'aku sayang sgt dia dah doh'
     sadness_text = 'kecewa tengok kerajaan baru ni, janji ape pun tak dapat'
     surprise_text = 'sakit jantung aku, terkejut dengan cerita hantu tadi'

[6]: model_sentiment = malaya.sentiment.transformer(model='bert', gpu_limit=0.5, device='GPU:0')
     model_subjectivity = malaya.subjectivity.transformer(model='bert', gpu_limit=0.5, device='GPU:1')
     model_emotion = malaya.emotion.transformer(model='bert', gpu_limit=0.5, device='GPU:2')
     model_translation = malaya.translation.ms_en.transformer(gpu_limit=0.5, device='GPU:3')

WARNING:tensorflow:From /home/husein/malaya/Malaya/malaya/function/__init__.py:73: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /home/husein/malaya/Malaya/malaya/function/__init__.py:75: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /home/husein/malaya/Malaya/malaya/function/__init__.py:50: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /home/husein/malaya/Malaya/malaya/function/__init__.py:65: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

[7]: %%time

model_sentiment.predict_proba([anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text])
model_subjectivity.predict_proba([anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text])
model_emotion.predict_proba([anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text])
model_translation.translate(['Mahathir buat keputusan terburu-buru'])

CPU times: user 8.61 s, sys: 2.71 s, total: 11.3 s
Wall time: 10.8 s
[7]: ['Mahathir made a hasty decision']

[8]: !nvidia-smi
Sun Jul 12 19:26:18 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.129      Driver Version: 410.129      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0  On |                    0 |
| N/A   45C    P0    54W / 300W |   1101MiB / 32475MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   46C    P0    52W / 300W |   1100MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   45C    P0    52W / 300W |   1100MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   45C    P0    53W / 300W |   1100MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12786      C   /usr/bin/python3                            1089MiB |
|    1     12786      C   /usr/bin/python3                            1089MiB |
|    2     12786      C   /usr/bin/python3                            1089MiB |
|    3     12786      C   /usr/bin/python3                            1089MiB |
+-----------------------------------------------------------------------------+


9.9 Devices

This tutorial is available as an IPython notebook at Malaya/example/devices.

9.9.1 List available devices supported to run Malaya model

[1]: import malaya
     import logging

     logging.basicConfig(level=logging.INFO)

[2]: malaya.utils.available_device()
[2]: [('CPU:0', '0.268 GB'),
     ('XLA_CPU:0', '17.18 GB'),
     ('XLA_GPU:0', '17.18 GB'),
     ('XLA_GPU:1', '17.18 GB'),
     ('XLA_GPU:2', '17.18 GB'),
     ('XLA_GPU:3', '17.18 GB'),
     ('GPU:0', '29.466 GB'),
     ('GPU:1', '29.469 GB'),
     ('GPU:2', '29.469 GB'),
     ('GPU:3', '29.469 GB')]

9.9.2 Use specific device for specific model

To do that, pass the device parameter to any load model function in Malaya; the default is CPU:0.

malaya.sentiment.transformer(model='alxlnet', device='CPU:0')

Or if you want to use XLA,

malaya.sentiment.transformer(model='alxlnet', device='XLA_CPU:0')

By default, device will automatically be set to the GPU with the most free memory if any GPUs are detected.

[3]: alxlnet_cpu = malaya.sentiment.transformer(model='alxlnet', device='CPU:0')
INFO:root:running sentiment/alxlnet using device /device:GPU:1

[4]: alxlnet_cpu = malaya.sentiment.transformer(model='alxlnet', device='GPU:1')
INFO:root:running sentiment/alxlnet using device /device:GPU:1


9.9.3 Disable auto GPU

Let's say you do not want auto-allocation to a GPU; simply set auto_gpu to False, or set,

export CUDA_VISIBLE_DEVICES=''

[5]: alxlnet_cpu = malaya.sentiment.transformer(model='alxlnet', device='CPU:0', auto_gpu=False)
INFO:root:running sentiment/alxlnet using device /device:CPU:0

[6]: alxlnet_xla_cpu = malaya.sentiment.transformer(model='alxlnet', device='XLA_CPU:0', auto_gpu=False)
INFO:root:running sentiment/alxlnet using device /device:XLA_CPU:0

[7]: string = 'saya kentut busuk tapi muka comel'

[8]: %%time

alxlnet_cpu.predict_proba([string])
CPU times: user 4.95 s, sys: 636 ms, total: 5.59 s
Wall time: 5.14 s
[8]: [{'negative': 0.99993134, 'positive': 6.920824e-07, 'neutral': 6.7949295e-05}]

[9]: %%time

alxlnet_xla_cpu.predict_proba([string])
CPU times: user 49.9 s, sys: 818 ms, total: 50.7 s
Wall time: 50.3 s
[9]: [{'negative': 0.99997425, 'positive': 2.5436142e-07, 'neutral': 2.5510788e-05}]

Again, not all Tensorflow operations are supported by XLA.

9.10 Precision Mode

This tutorial is available as an IPython notebook at Malaya/example/precision-mode.

Let's say you want to run the model in FP16 or FP64.

[1]: import malaya
     import logging

     logging.basicConfig(level=logging.INFO)


9.10.1 Use specific precision for specific model

To do that, pass the precision_mode parameter to any load model function in Malaya,

malaya.sentiment.transformer(model='albert', precision_mode='FP16')

Supported precision modes are {'BFLOAT16', 'FP16', 'FP32', 'FP64'}; the default is FP32. Check the code at https://github.com/huseinzol05/malaya-boilerplate/blob/main/malaya_boilerplate/frozen_graph.py

[2]: albert = malaya.sentiment.transformer(model='albert')
     albert_fp16 = malaya.sentiment.transformer(model='albert', precision_mode='FP16')
INFO:root:running sentiment/albert using device /device:CPU:0
INFO:root:running sentiment/albert using device /device:CPU:0
Converting sentiment/albert to FP16.

[3]: string = 'ketiak saya masam tapi saya comel'

[5]: %%time

albert.predict_proba([string])
CPU times: user 166 ms, sys: 15.9 ms, total: 182 ms
Wall time: 47.1 ms
[5]: [{'negative': 0.8387252, 'positive': 0.0016127465, 'neutral': 0.15966207}]

[7]: %%time

albert_fp16.predict_proba([string])
CPU times: user 14.6 s, sys: 53.3 ms, total: 14.6 s
Wall time: 2.21 s
[7]: [{'negative': 0.839, 'positive': 0.001611, 'neutral': 0.1597}]

Running FP16 is not necessarily faster; most CPUs are not optimized for FP16. You might want to look into RTX-class GPUs and above.

9.11 Quantization

This tutorial is available as an IPython notebook at Malaya/example/quantization.

We provide quantized models for all Malaya models, for example, the sentiment transformer models,

[1]: import malaya

     malaya.sentiment.available_transformer()
INFO:root:tested on 20% test set.
[1]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.6               111.00          0.99330       0.99330         0.99329
     tiny-bert         57.4                15.40          0.98774       0.98774         0.98774
     albert            48.6                12.80          0.99227       0.99226         0.99226
     tiny-albert       22.4                 5.98          0.98554       0.98550         0.98551
     xlnet            446.6               118.00          0.99353       0.99353         0.99353
     alxlnet           46.8                13.30          0.99188       0.99188         0.99188

Usually a quantized model compresses to about 4x smaller than the original size. The quantized model converts all possible floating constants to quantized constants, and stores only the mean and standard deviation of the floating constants plus the quantized constants. Again, a quantized model is not necessarily faster, because Tensorflow casts back to FP32 during feed-forward for certain operations.

9.11.1 Use quantized model

Simply set the quantized parameter to True; the default is False.

[2]: albert_quantized = malaya.sentiment.transformer(model='albert', quantized=True)
     albert = malaya.sentiment.transformer(model='albert')
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running sentiment/albert-quantized using device /device:CPU:0
INFO:root:running sentiment/albert using device /device:CPU:0

[3]: string = 'saya masam awak pun masam'

[5]: %%time

albert.predict([string])
CPU times: user 171 ms, sys: 15.9 ms, total: 187 ms
Wall time: 47.2 ms
[5]: ['negative']

[9]: %%time

albert_quantized.predict([string])
CPU times: user 181 ms, sys: 41.1 ms, total: 223 ms
Wall time: 53.8 ms
[9]: ['negative']


9.12 Deployment

This tutorial is available as an IPython notebook at Malaya/example/deployment.

[1]: import malaya

9.12.1 Disable file validation

If you deploy Malaya models on persistent, short-lived (auto-restart to reduce memory consumption) or async / multiprocess workers, you might get errors related to file checking. You can skip this check as long as you are able to persist the Malaya models.

download model

First, you need to download the model into your local machine / environment; run this in a separate script,

[5]: model = malaya.zero_shot.classification.transformer(model='tiny-albert')
INFO:tensorflow:loading sentence piece model

load model

Load the model without file checking; run this on top of FastAPI / Flask / Gunicorn.

[6]: model = malaya.zero_shot.classification.transformer(model='tiny-albert', validate=False)
INFO:tensorflow:loading sentence piece model

This loaded model can be shared among multiple workers / threads.

9.12.2 Disable type checking

Make sure you have installed the latest version of herpetologist,

pip install herpetologist -U

If you check the Malaya source code, you can see we check parameters on function / method definitions, https://github.com/huseinzol05/Malaya/blob/master/malaya/model/bert.py#L232. We use herpetologist to check the passed variables, https://github.com/huseinzol05/herpetologist

@check_type
def predict(self, strings: List[str], add_neutral: bool = True):
    """
    classify a string.

    Parameters
    ----------
    strings: List[str]
    add_neutral: bool, optional (default=True)
        if True, it will add neutral probability.

    Returns
    -------
    result: List[str]
    """

@check_type will check whether strings is a List[str]; if not, it will throw an error. But @check_type becomes expensive if you have a massive list of strings, so you can disable this type checking by setting a bash environment variable. Some of our environments we want to enable it, some we want to disable, without herpetologist constantly checking the variables. To disable it, simply set,

export ENABLE_HERPETOLOGIST=false

Or, using python,

import os
os.environ['ENABLE_HERPETOLOGIST'] = 'false'

You can see the impact on execution time in this example.

9.12.3 Use smaller model

Stacking multiple smaller models is much faster than a single big model, but this does not guarantee the same accuracy as the big model.
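A hedged sketch of stacking two small models; malaya.stack.predict_stack is the stacking helper covered in section 9.58, though the combining behaviour described in the comment is an assumption:

import malaya

tiny_bert = malaya.sentiment.transformer(model='tiny-bert')
tiny_albert = malaya.sentiment.transformer(model='tiny-albert')

strings = ['husein sangat comel dan handsome']

# Assumption: predict_stack combines per-label probabilities from each
# model (e.g. by geometric mean) into a single prediction.
print(malaya.stack.predict_stack([tiny_bert, tiny_albert], strings))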

9.12.4 docker example

You can check some docker examples and benchmarks here, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment. The purpose of these benchmarks is to measure how fast and how many requests a model can serve in perfect minibatch realtime, say live-streaming data from social media to detect sentiment, whether a text is negative or positive. Tested on the ALBERT-BASE sentiment model. These are my machine specifications,

1. Intel(R) Core(TM) i7-8557U CPU @ 1.70GHz
2. 16 GB 2133 MHz LPDDR3

And I use the same wrk command,

wrk -t15 -c600 -d1m --timeout=15s http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi

Some constraints,

1. ALBERT BASE is around 43MB.
2. Memory is limited to 2GB, set by Docker itself.
3. Batch size of 50 strings, duplicating husein sangat comel dan handsome tambahan lagi ketiak wangi 50 times; you can check every deployment in app.py or main.py.
4. No limit on CPU usage.
5. No caching.

fast-api

Workers automatically calculated by fast-api, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/fast-api

Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
  15 threads and 600 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    nan%
    Req/Sec      0.24      1.16     9.00    95.52%
  68 requests in 1.00m, 8.96KB read
  Socket errors: connect 364, read 293, write 0, timeout 68
Requests/sec:      1.13
Transfer/sec:    152.75B

Gunicorn Flask

5 sync workers, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/gunicorn-flask

Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
  15 threads and 600 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.98s     3.25s    12.71s    41.67%
    Req/Sec      0.49      1.51     9.00    90.91%
  59 requests in 1.00m, 9.10KB read
  Socket errors: connect 364, read 39, write 0, timeout 47
Requests/sec:      0.98
Transfer/sec:    155.12B

UWSGI Flask + Auto scaling

Min 2 workers, max 10 workers, spare2 algorithm, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/uwsgi-flask-cheaper

Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
  15 threads and 600 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.80s     4.16s    14.73s    62.50%
    Req/Sec      0.75      2.60     9.00    91.67%
  12 requests in 1.00m, 0.90KB read
  Socket errors: connect 364, read 105, write 0, timeout 4
Requests/sec:      0.20
Transfer/sec:     15.37B


UWSGI Flask

4 Workers, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/uwsgi-flask-fork

Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
  15 threads and 600 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.79s     4.13s    14.87s    53.33%
    Req/Sec      1.06      3.16    20.00    92.59%
  56 requests in 1.00m, 4.21KB read
  Socket errors: connect 364, read 345, write 0, timeout 41
Requests/sec:      0.93
Transfer/sec:     71.74B

9.12.5 Learn different deployment techniques

E.g., change concurrent requests into mini-batch realtime processing to speed up text classification; see this repository. This can reduce the time taken by up to 95%!
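The core idea, in a hedged sketch: buffer concurrent requests for a few milliseconds, run one batched predict_proba call, then fan the results back out. Everything here except predict_proba is illustrative plumbing, not a Malaya API.

import queue
import threading

import malaya

model = malaya.sentiment.transformer(model='tiny-albert')
requests_q = queue.Queue()  # items: (string, result_container, done_event)

def batcher(max_batch=50, wait_seconds=0.01):
    # Collect whatever arrives within a short window, then run one batched call.
    while True:
        batch = [requests_q.get()]  # block until the first request arrives
        try:
            while len(batch) < max_batch:
                batch.append(requests_q.get(timeout=wait_seconds))
        except queue.Empty:
            pass
        results = model.predict_proba([string for string, _, _ in batch])
        for (_, container, done), result in zip(batch, results):
            container.append(result)
            done.set()

threading.Thread(target=batcher, daemon=True).start()

def handle_request(string):
    # What a web handler would do: enqueue, wait, return.
    container, done = [], threading.Event()
    requests_q.put((string, container, done))
    done.wait(timeout=15)
    return container[0]

print(handle_request('husein sangat comel'))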


9.13 Transformer

This tutorial is available as an IPython notebook at Malaya/example/transformer.

Malaya provides a basic interface for pretrained Transformer encoder models, specific to Malay, local social media slang and the Manglish language; we call it Transformer-Bahasa. Below are the datasets we pretrained on,

Standard Bahasa dataset,
1. Malay-dataset/dumping.
2. Malay-dataset/pure-text.

Bahasa social media,
1. Malay-dataset/dumping/instagram.
2. Malay-dataset/dumping/twitter.

Singlish / Manglish,
1. Malay-dataset/dumping/singlish.
2. Malay-dataset/dumping/singapore-news.

This interface does not let you do custom training. If you want to download a pretrained Transformer-Bahasa model and use it for custom transfer learning, you can download it here, https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/, with some notebooks to help you get started. Or you can simply use hugging-face transformers to try transformer models from Malaya; check the available models here, https://huggingface.co/models?filter=ms


[5]: from IPython.core.display import Image, display

     display(Image('huggingface.png', width=500))

[1]: %%time
     import malaya
CPU times: user 4.88 s, sys: 641 ms, total: 5.52 s
Wall time: 4.5 s

9.13.1 List Transformer-Bahasa available

[2]: malaya.transformer.available_transformer()
[2]:                Size (MB)                      Description
     bert               425.6     Google BERT BASE parameters
     tiny-bert           57.4     Google BERT TINY parameters
     albert              48.6     Google ALBERT BASE parameters
     tiny-albert         22.4     Google ALBERT TINY parameters
     xlnet              446.6     Google XLNET BASE parameters
     alxlnet             46.8     Malaya ALXLNET BASE parameters
     electra            443       Google ELECTRA BASE parameters
     small-electra       55       Google ELECTRA SMALL parameters

[3]: strings = ['Kerajaan galakkan rakyat naik public transport tapi parking kat lrt ada 15. Reserved utk staff rapid je dah berpuluh. Park kereta tepi jalan kang kene saman dgn majlis perbandaran. Kereta pulak senang kene curi. Cctv pun tak ada. Naik grab dah 5-10 ringgit tiap hari. Gampang juga',
                'Alaa Tun lek ahhh npe muka masam cmni kn agong kata usaha kerajaan terdahulu sejak selepas merdeka',
                "Orang ramai cakap nurse kerajaan garang. So i tell u this. Most of our local ppl will treat us as hamba abdi and they don't respect us as a nurse"]

9.13.2 Load XLNET-Bahasa

[4]: xlnet = malaya.transformer.load(model='xlnet')
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/xlnet.py:70: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:81: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/xlnet.py:253: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/xlnet.py:253: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/modeling.py:686: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:memory input None
INFO:tensorflow:Use float type
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/modeling.py:693: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/modeling.py:797: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.dropout instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:271: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.__call__` method instead.
WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see: https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md, https://github.com/tensorflow/addons, https://github.com/tensorflow/io (for I/O related ops). If you depend on functionality not listed there, please file an issue.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/modeling.py:99: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.Dense instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:94: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:95: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:96: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:100: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya/transformers/xlnet/__init__.py:103: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/xlnet-model/base/xlnet-base/model.ckpt

I have random sentences copied from Twitter, searched using the keyword kerajaan.

Vectorization

Change a string or batch of strings into its latent space / vector representation.

def vectorize(self, strings: List[str]):
    """
    Vectorize string inputs.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: np.array
    """

[5]: v = xlnet.vectorize(strings)
     v.shape
[5]: (3, 768)


Attention

def attention(self, strings: List[str], method: str = 'last', **kwargs):
    """
    Get attention string inputs from bert attention.

    Parameters
    ----------
    strings : List[str]
    method : str, optional (default='last')
        Attention layer supported. Allowed values:

        * 'last' - attention from last layer.
        * 'first' - attention from first layer.
        * 'mean' - average attentions from all layers.

    Returns
    -------
    result : List[List[Tuple[str, float]]]
    """

You can give a list of strings or a single string to get the attention; in this documentation, I just want to use one string.

[6]: xlnet.attention([strings[1]], method='last')
[6]: [[('Alaa', 0.062061824), ('Tun', 0.051056776), ('lek', 0.13115405), ('ahhh', 0.08195943),
      ('npe', 0.06210695), ('muka', 0.04706182), ('masam', 0.058289353), ('cmni', 0.026094284),
      ('kn', 0.056146827), ('agong', 0.033949938), ('kata', 0.052644122), ('usaha', 0.07063393),
      ('kerajaan', 0.046773836), ('terdahulu', 0.057166394), ('sejak', 0.045712817),
      ('selepas', 0.047048207), ('merdeka', 0.07013944)]]

[7]: xlnet.attention([strings[1]], method='first')
[7]: [[('Alaa', 0.045956098), ('Tun', 0.040094823), ('lek', 0.0611072), ('ahhh', 0.07029096),
      ('npe', 0.048513662), ('muka', 0.056670234), ('masam', 0.04088071), ('cmni', 0.08728454),
      ('kn', 0.047778472), ('agong', 0.081243224), ('kata', 0.03866041), ('usaha', 0.058326427),
      ('kerajaan', 0.055446573), ('terdahulu', 0.077162124), ('sejak', 0.05951431),
      ('selepas', 0.05385498), ('merdeka', 0.07721528)]]

[8]: xlnet.attention([strings[1]], method='mean')
[8]: [[('Alaa', 0.06978634), ('Tun', 0.0517442), ('lek', 0.059642658), ('ahhh', 0.055883657),
      ('npe', 0.05339206), ('muka', 0.06806306), ('masam', 0.0489921), ('cmni', 0.0698193),
      ('kn', 0.057752036), ('agong', 0.065566674), ('kata', 0.059152905), ('usaha', 0.063305095),
      ('kerajaan', 0.050608452), ('terdahulu', 0.05888331), ('sejak', 0.057429556),
      ('selepas', 0.042058233), ('merdeka', 0.067920305)]]

Visualize Attention

Before using attention visualization, we need to load D3 into our jupyter notebook first. This visualization is borrowed from https://github.com/jessevig/bertviz.

def visualize_attention(self, string: str):
    """
    Visualize attention.

    Parameters
    ----------
    string : str
    """

[9]: %%javascript
     require.config({
         paths: {
             d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
             jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
         }
     });

[10]: xlnet.visualize_attention('nak makan ayam dgn husein')


I attached a screenshot; readthedocs cannot render the JavaScript.

[13]: from IPython.core.display import Image, display

      display(Image('xlnet-attention.png', width=300))

All attention models can use these interfaces.

9.13.3 Load ELECTRA-Bahasa

Feel free to use other models.

[4]: electra = malaya.transformer.load(model='electra')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:56: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:240: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.Dense instead.
WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:114: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:117: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:120: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:127: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:129: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt

[14]: electra.attention([strings[1]], method='last')
[14]: [[('Alaa', 0.059817147), ('Tun', 0.075028375), ('lek', 0.057848394), ('ahhh', 0.046973262),
       ('npe', 0.05160833), ('muka', 0.06221234), ('masam', 0.058585588), ('cmni', 0.054711323),
       ('kn', 0.06741887), ('agong', 0.056326747), ('kata', 0.054182768), ('usaha', 0.07986903),
       ('kerajaan', 0.05559596), ('terdahulu', 0.052879248), ('sejak', 0.049992196),
       ('selepas', 0.053916205), ('merdeka', 0.06303418)]]


9.14 Word Vector

This tutorial is available as an IPython notebook at Malaya/example/wordvector.

9.14.1 Pretrained word2vec

You can download Malaya pretrained word2vec without needing to import malaya.

word2vec from local news, size-256

word2vec from wikipedia, size-256

word2vec from local social media, size-256

But if you don't know what to do with Malaya word2vec, Malaya provides some useful functions for you!

[1]: %%time
     import malaya
     %matplotlib inline
CPU times: user 5.15 s, sys: 907 ms, total: 6.05 s
Wall time: 6.25 s

9.14.2 List available pretrained word2vec

[2]: malaya.wordvector.available_wordvector()
[2]:              Size (MB)  Vocab size  lowercase                                        Description
     wikipedia        781.7      763350       True    pretrained on Malay wikipedia word2vec size 256
     socialmedia       1300     1294638       True  pretrained on cleaned Malay twitter and Malay ...
     news             200.2      195466       True          pretrained on cleaned Malay news size 256
     combine           1900     1903143       True  pretrained on cleaned Malay news + Malay socia...


9.14.3 Load pretrained word2vec

def load(model: str = 'wikipedia', **kwargs):
    """
    Return malaya.wordvector.WordVector object.

    Parameters
    ----------
    model : str, optional (default='wikipedia')
        Model architecture supported. Allowed values:

        * 'wikipedia' - pretrained on Malay wikipedia word2vec size 256.
        * 'socialmedia' - pretrained on cleaned Malay twitter and Malay instagram size 256.
        * 'news' - pretrained on cleaned Malay news size 256.
        * 'combine' - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.

    Returns
    -------
    vocabulary: indices dictionary for `vector`.
    vector: np.array, 2D.
    """

[3]: vocab_news, embedded_news = malaya.wordvector.load(model='news')
     vocab_wiki, embedded_wiki = malaya.wordvector.load(model='wikipedia')

9.14.4 Load word vector interface

class WordVector:
    @check_type
    def __init__(self, embed_matrix, dictionary: dict, **kwargs):
        """
        Parameters
        ----------
        embed_matrix: numpy array
        dictionary: dictionary
        """

1. embed_matrix must be a 2D array,

array([[ 0.25      , -0.10816103, -0.19881412, ...,  0.40432587,  0.19388093, -0.07062137],
       [ 0.3231817 , -0.01318745, -0.17950962, ...,  0.25      ,  0.08444146, -0.11705721],
       [ 0.29103908, -0.16274083, -0.20255531, ...,  0.25      ,  0.06253044, -0.16404966],
       ...,
       [ 0.21346697,  0.12686132, -0.4029543 , ...,  0.43466234,  0.20910986, -0.32219803],
       [ 0.2372157 ,  0.32420087, -0.28036436, ...,  0.2894639 ,  0.20745888, -0.30600077],
       [ 0.27907744,  0.35755727, -0.34932107, ...,  0.37472805,  0.42045262, -0.21725406]], dtype=float32)

2. dictionary, a dictionary mapping {'word': index},

{'mengembanfkan': 394623,
 'dipujanya': 234554,
 'comicolor': 182282,
 'immaz': 538660,
 'qabar': 585119,
 'phidippus': 180802,
}

Load custom word vector

Like fast-text; for example, I downloaded from here, https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ms.vec. We need to parse the data to get embed_matrix and dictionary.

[ ]: import io
     import numpy as np

     fin = io.open('wiki.ms.vec', 'r', encoding='utf-8', newline='\n', errors='ignore')
     n, d = map(int, fin.readline().split())

     data, vectors = {}, []
     for no, line in enumerate(fin):
         tokens = line.rstrip().split(' ')
         data[tokens[0]] = no
         vectors.append(list(map(float, tokens[1:])))

     vectors = np.array(vectors)
     fast_text = malaya.wordvector.WordVector(vectors, data)

[5]: word_vector_news = malaya.wordvector.WordVector(embedded_news, vocab_news)
     word_vector_wiki = malaya.wordvector.WordVector(embedded_wiki, vocab_wiki)

9.14.5 Check top-k similar semantics based on a word

def n_closest(
    self,
    word: str,
    num_closest: int = 5,
    metric: str = 'cosine',
    return_similarity: bool = True,
):
    """
    find nearest words based on a word.

    Parameters
    ----------
    word: str
        Eg, 'najib'
    num_closest: int, (default=5)
        number of words closest to the result.
    metric: str, (default='cosine')
        vector distance algorithm.
    return_similarity: bool, (default=True)
        if True, will return a value between 0-1 representing the distance.

    Returns
    -------
    word_list: list of nearest words
    """

[6]: word = 'anwar'
     print("Embedding layer: 8 closest words to: '%s' using malaya news word2vec" % (word))
     print(word_vector_news.n_closest(word=word, num_closest=8, metric='cosine'))
Embedding layer: 8 closest words to: 'anwar' using malaya news word2vec
[['najib', 0.6967672109603882], ['mukhriz', 0.675892174243927], ['azmin', 0.6686884164810181],
 ['rafizi', 0.6465028524398804], ['muhyiddin', 0.6413404941558838], ['daim', 0.6334482431411743],
 ['khairuddin', 0.6300410032272339], ['shahidan', 0.6269811391830444]]

[12]: word = 'anwar'
      print("Embedding layer: 8 closest words to: '%s' using malaya wiki word2vec" % (word))
      print(word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine'))
Embedding layer: 8 closest words to: 'anwar' using malaya wiki word2vec
[['rasulullah', 0.6918460130691528], ['jamal', 0.6604709029197693], ['noraniza', 0.65153968334198],
 ['khalid', 0.6450133323669434], ['mahathir', 0.6447468400001526], ['sukarno', 0.641593337059021],
 ['wahid', 0.6359774470329285], ['pekin', 0.6262176036834717]]

9.14.6 Check batch top-k similar semantics based on a word

def batch_n_closest(
    self,
    words: List[str],
    num_closest: int = 5,
    return_similarity: bool = False,
    soft: bool = True,
):
    """
    find nearest words based on a batch of words using Tensorflow.

    Parameters
    ----------
    words: list
        Eg, ['najib', 'anwar']
    num_closest: int, (default=5)
        number of words closest to the result.
    return_similarity: bool, (default=True)
        if True, will return between 0-1 represents the distance.
    soft: bool, (default=True)
        if True, a word not in the dictionary will be replaced with nearest JaroWinkler ratio.
        if False, it will throw an exception if a word not in the dictionary.

    Returns
    -------
    word_list: list of nearest words
    """

[13]: words = ['anwar', 'mahathir']
      word_vector_news.batch_n_closest(words, num_closest=8, return_similarity=False)
[13]: [['anwar', 'najib', 'mukhriz', 'azmin', 'rafizi', 'muhyiddin', 'daim', 'khairuddin'],
      ['mahathir', 'daim', 'sahruddin', 'streram', 'morsi', 'anifah', 'jokowi', 'ramasamy']]

What happens if a word is not in the dictionary? You can set the parameter soft to True or False; the default is True. If True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio. If False, it will throw an exception if a word is not in the dictionary.
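To make the fallback concrete, below is a rough sketch of the idea using stdlib difflib as a stand-in similarity measure; Malaya actually uses JaroWinkler over its own vocabulary, so treat this as an approximation only.

import difflib

vocabulary = ['husein', 'khairi', 'mahathir', 'anwar']

def soft_lookup(word, vocab):
    # out-of-vocabulary words fall back to the closest in-vocabulary match
    if word in vocab:
        return word
    return difflib.get_close_matches(word, vocab, n=1, cutoff=0.0)[0]

print(soft_lookup('husein-comel', vocabulary))  # 'husein'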

[14]: words = ['anwar', 'mahathir', 'husein-comel']
      word_vector_wiki.batch_n_closest(words, num_closest=8,
                                       return_similarity=False, soft=False)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
      1 words = ['anwar', 'mahathir', 'husein-comel']
      2 word_vector_wiki.batch_n_closest(words, num_closest=8,
----> 3                                  return_similarity=False, soft=False)

~/Documents/Malaya/malaya/wordvector.py in batch_n_closest(self, words, num_closest, return_similarity, soft)
    484                 raise Exception(
    485                     '%s not in dictionary, please use another word or set `soft` = True'
--> 486                     % (words[i])
    487                 )
    488         batches = np.array([self.get_vector_by_name(w) for w in words])

Exception: husein-comel not in dictionary, please use another word or set `soft` = True


[15]: words = ['anwar', 'mahathir', 'husein-comel']
      word_vector_wiki.batch_n_closest(words, num_closest=8,
                                       return_similarity=False, soft=True)
[15]: [['anwar', 'rasulullah', 'jamal', 'noraniza', 'khalid', 'mahathir', 'sukarno', 'wahid'],
      ['mahathir', 'anwar', 'wahid', 'najib', 'khalid', 'sukarno', 'suharto', 'salahuddin'],
      ['husein', 'khairi', 'gccsa', 'jkrte', 'montagny', 'pejudo', 'badriyyin', 'naginatajutsu']]

9.14.7 Word2vec calculator

You can put any equation you want; words are replaced by their vectors before the arithmetic is evaluated.

def calculator(
    self,
    equation: str,
    num_closest: int = 5,
    metric: str = 'cosine',
    return_similarity: bool = True,
):
    """
    calculator parser for word2vec.

    Parameters
    ----------
    equation: str
        Eg, '(mahathir + najib) - rosmah'
    num_closest: int, (default=5)
        number of words closest to the result.
    metric: str, (default='cosine')
        vector distance algorithm.
    return_similarity: bool, (default=True)
        if True, will return between 0-1 represents the distance.

    Returns
    -------
    word_list: list of nearest words
    """
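Under the hood this is plain vector arithmetic followed by a nearest-neighbour search. A minimal numpy sketch of the same idea, reusing vocab_news / embedded_news loaded earlier; Malaya additionally parses full bracketed equations.

import numpy as np

def closest_words(vector, embedded, vocab, k=8):
    # cosine similarity of `vector` against every row of the embedding matrix
    sims = embedded @ vector / (
        np.linalg.norm(embedded, axis=1) * np.linalg.norm(vector)
    )
    reverse = {index: word for word, index in vocab.items()}
    return [reverse[i] for i in np.argsort(sims)[::-1][:k]]

v = (embedded_news[vocab_news['anwar']]
     + embedded_news[vocab_news['amerika']]
     + embedded_news[vocab_news['mahathir']])
print(closest_words(v, embedded_news, vocab_news))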

[18]: word_vector_news.calculator('anwar + amerika + mahathir', num_closest=8,
                                  metric='cosine', return_similarity=False)
[18]: ['mahathir', 'anwar', 'trump', 'duterte', 'netanyahu', 'jokowi', 'rusia', 'kj', 'obama']

[19]: word_vector_wiki.calculator('anwar + amerika + mahathir', num_closest=8,
                                  metric='cosine', return_similarity=False)
[19]: ['mahathir', 'anwar', 'sukarno', 'suharto', 'hamas', 'sparta', 'amerika', 'iraq', 'lubnan']

9.14.8 Visualize scatter-plot

def scatter_plot(
    self,
    labels,
    centre: str = None,
    figsize: Tuple[int, int] = (7, 7),
    plus_minus: int = 25,
    handoff: float = 5e-5,
):
    """
    plot a scatter plot based on output from calculator / n_closest / analogy.

    Parameters
    ----------
    labels : list
        output from calculator / n_closest / analogy.
    centre : str, (default=None)
        centre label, if a str, it will annotate in a red color.
    figsize : tuple, (default=(7, 7))
        figure size for plot.

    Returns
    -------
    tsne: np.array, 2D.
    """

[20]: word = 'anwar'
      result = word_vector_news.n_closest(word=word, num_closest=8, metric='cosine')
      data = word_vector_news.scatter_plot(result, centre=word)

[21]: word = 'anwar'
      result = word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine')
      data = word_vector_wiki.scatter_plot(result, centre=word)


9.14.9 Visualize tree-plot

def tree_plot(
    self, labels, figsize: Tuple[int, int] = (7, 7), annotate: bool = True
):
    """
    plot a tree plot based on output from calculator / n_closest / analogy.

    Parameters
    ----------
    labels : list
        output from calculator / n_closest / analogy.
    annotate : bool, (default=True)
        if True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        figure size for plot.

    Returns
    -------
    embed: np.array, 2D.
    labelled: labels for X / Y axis.
    """

[22]: word = 'anwar'
      result = word_vector_news.n_closest(word=word, num_closest=8, metric='cosine')
      data = word_vector_news.tree_plot(result)


[23]: word = 'anwar'
      result = word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine')
      data = word_vector_wiki.tree_plot(result)


9.14.10 Visualize social-network

def network(
    self,
    word,
    num_closest=8,
    depth=4,
    min_distance=0.5,
    iteration=300,
    figsize=(15, 15),
    node_color='#72bbd0',
    node_factor=50,
):
    """
    plot a social network based on word given

    Parameters
    ----------
    word : str
        centre of social network.
    num_closest: int, (default=8)
        number of words closest to the node.
    depth: int, (default=4)
        depth of social network. Deeper is more expensive to calculate, O(num_closest ** depth).
    min_distance: float, (default=0.5)
        minimum distance among nodes. Increase the value to increase the distance among nodes.
    iteration: int, (default=300)
        number of loops to train the social network to fit min_distance.
    figsize: tuple, (default=(15, 15))
        figure size for plot.
    node_color: str, (default='#72bbd0')
        color for nodes.
    node_factor: int, (default=10)
        size factor for depth nodes. Increasing this value will increase node sizes based on depth.
    """

[24]: g = word_vector_news.network('mahathir', figsize=(10, 10), node_factor=50, depth=3)


[25]: g = word_vector_wiki.network('mahathir', figsize=(10, 10), node_factor=50, depth=3)


9.14.11 Get embedding from a word

def get_vector_by_name(
    self, word: str, soft: bool = False, topn_soft: int = 5
):
    """
    get vector based on string.

    Parameters
    ----------
    word: str
    soft: bool, (default=True)
        if True, a word not in the dictionary will be replaced with nearest JaroWinkler ratio.
        if False, it will throw an exception if a word not in the dictionary.
    topn_soft: int, (default=5)
        if word not found in dictionary, will returned `topn_soft` size of similar size using jarowinkler.

    Returns
    -------
    vector: np.array, 1D
    """

[28]: word_vector_wiki.get_vector_by_name('najib').shape
[28]: (256,)

If a word is not found in the vocabulary, it will throw an exception with the top-5 nearest words,

[26]: word_vector_wiki.get_vector_by_name('husein-comel')
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
----> 1 word_vector_wiki.get_vector_by_name('husein-comel')

~/Documents/Malaya/malaya/wordvector.py in get_vector_by_name(self, word)
    127             raise Exception(
    128                 'input not found in dictionary, here top-5 nearest words [%s]'
--> 129                 % (strings)
    130             )
    131         return self._embed_matrix[self._dictionary[word]]

Exception: input not found in dictionary, here top-5 nearest words [husein, husei, husenil, husen, secomel]

9.15 Word and sentence tokenizer

This tutorial is available as an IPython notebook at Malaya/example/tokenizer.

[1]: %%time
     import malaya
CPU times: user 5.91 s, sys: 1.12 s, total: 7.03 s
Wall time: 7.62 s
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

[2]: string1 = 'xjdi ke, y u xsuke makan HUSEIN kt situ tmpt, i hate it. pelikle, pada'
     string2 = 'i mmg2 xske mknn HUSEIN kampng tmpat, i love them. pelikle saye'
     string3 = 'perdana menteri ke11 sgt suka makn ayam, harganya cuma rm15.50'
     string4 = 'pada 10/4, kementerian mengumumkan, 1/100'
     string5 = 'Husein Zolkepli dapat tempat ke-12 lumba lari hari ni'
     string6 = 'Husein Zolkepli (2011 - 2019) adalah ketua kampng di kedah sekolah King Edward ke-IV'
     string7 = '2jam 30 minit aku tunggu kau, 60.1 kg kau ni, suhu harini 31.2c, aku dahaga minum 600ml'


9.15.1 Load word tokenizer

class Tokenizer:
    def __init__(self, lowercase=False, **kwargs):
        """
        Load Tokenizer object.
        Check supported regex pattern at https://github.com/huseinzol05/Malaya/blob/master/malaya/text/regex.py#L85

        Parameters
        ----------
        lowercase: bool, optional (default=False)
            lowercase tokens.
        emojis: bool, optional (default=True)
            True to keep emojis.
        urls: bool, optional (default=True)
            True to keep urls.
        tags: bool, optional (default=True)
            True to keep tags: <husein>.
        emails: bool, optional (default=True)
            True to keep emails.
        users: bool, optional (default=True)
            True to keep users handles: @cbaziotis.
        hashtags: bool, optional (default=True)
            True to keep hashtags.
        phones: bool, optional (default=True)
            True to keep phones.
        percents: bool, optional (default=True)
            True to keep percents.
        money: bool, optional (default=True)
            True to keep money expressions.
        date: bool, optional (default=True)
            True to keep date expressions.
        time: bool, optional (default=True)
            True to keep time expressions.
        acronyms: bool, optional (default=True)
            True to keep acronyms.
        emoticons: bool, optional (default=True)
            True to keep emoticons.
        censored: bool, optional (default=True)
            True to keep censored words: f**k.
        emphasis: bool, optional (default=True)
            True to keep words with emphasis: *very* good.
        numbers: bool, optional (default=True)
            True to keep numbers.
        temperature: bool, optional (default=True)
            True to keep temperatures.
        distance: bool, optional (default=True)
            True to keep distances.
        volume: bool, optional (default=True)
            True to keep volumes.
        duration: bool, optional (default=True)
            True to keep durations.
        weight: bool, optional (default=True)
            True to keep weights.
        hypen: bool, optional (default=True)
            True to keep hyphens.
        """


[3]: tokenizer = malaya.preprocessing.Tokenizer()

[4]: tokenizer.tokenize(string1)
[4]: ['xjdi', 'ke', ',', 'y', 'u', 'xsuke', 'makan', 'HUSEIN', 'kt', 'situ', 'tmpt', ',', 'i', 'hate', 'it', '.', 'pelikle', ',', 'pada']

[5]: tokenizer.tokenize(string2)
[5]: ['i', 'mmg2', 'xske', 'mknn', 'HUSEIN', 'kampng', 'tmpat', ',', 'i', 'love', 'them', '.', 'pelikle', 'saye']

[6]: tokenizer.tokenize(string3)
[6]: ['perdana', 'menteri', 'ke11', 'sgt', 'suka', 'makn', 'ayam', ',', 'harganya', 'cuma', 'rm15.50']

[7]: tokenizer.tokenize(string4)
[7]: ['pada', '10', '/', '4', ',', 'kementerian', 'mengumumkan', ',', '1', '/', '100']

[8]: tokenizer.tokenize(string6)
[8]: ['Husein', 'Zolkepli', '(', '2011', '-', '2019', ')', 'adalah', 'ketua', 'kampng', 'di', 'kedah', 'sekolah', 'King', 'Edward', 'ke-IV']

[9]: tokenizer.tokenize(string7)
[9]: ['2jam', '30 minit', 'aku', 'tunggu', 'kau', ',', '60.1 kg', 'kau', 'ni', ',', 'suhu', 'harini', '31.2c', ',', 'aku', 'dahaga', 'minum', '600ml']


url

[10]: tokenizer.tokenize('website saya http://huseinhouse.com')
[10]: ['website', 'saya', 'http://huseinhouse.com']

tags

[12]: tokenizer.tokenize('panggil saya <husein>')
[12]: ['panggil', 'saya', '<husein>']

[13]: tokenizer.tokenize('panggil saya < husein >')
[13]: ['panggil', 'saya', '<', 'husein', '>']

emails

[14]: tokenizer.tokenize('email saya husein.zol05@gmail.com')
[14]: ['email', 'saya', 'husein.zol05@gmail.com']

[15]: tokenizer.tokenize('email saya husein.zol05@gmail.com')
[15]: ['email', 'saya', 'husein.zol05@gmail.com']

users

[16]: tokenizer.tokenize('twitter saya @husein123zolkepli')
[16]: ['twitter', 'saya', '@husein123zolkepli']

[17]: tokenizer.tokenize('twitter saya @ husein123zolkepli')
[17]: ['twitter', 'saya', '@', 'husein123zolkepli']

hashtags

[18]: tokenizer.tokenize('panggil saya #huseincomel')
[18]: ['panggil', 'saya', '#huseincomel']

[19]: tokenizer.tokenize('panggil saya # huseincomel')
[19]: ['panggil', 'saya', '#', 'huseincomel']


phones

[20]: tokenizer.tokenize('call sye di 013-1234567')
[20]: ['call', 'sye', 'di', '013-1234567']

[27]: tokenizer.tokenize('call sye di 013- 1234567')
[27]: ['call', 'sye', 'di', '013', '-', '1234567']

percents

[28]: tokenizer.tokenize('saya sokong 100%')
[28]: ['saya', 'sokong', '100%']

[29]: tokenizer.tokenize('saya sokong 100 %')
[29]: ['saya', 'sokong', '100', '%']

money

[30]: tokenizer.tokenize('saya tinggal rm100')
[30]: ['saya', 'tinggal', 'rm100']

[31]: tokenizer.tokenize('saya tinggal rm100k')
[31]: ['saya', 'tinggal', 'rm100k']

[32]: tokenizer.tokenize('saya tinggal rm100M')
[32]: ['saya', 'tinggal', 'rm100M']

[33]: tokenizer.tokenize('saya tinggal rm100.123M')
[33]: ['saya', 'tinggal', 'rm100.123M']

[34]: tokenizer.tokenize('saya tinggal 40 sen')
[34]: ['saya', 'tinggal', '40 sen']

[35]: tokenizer.tokenize('saya tinggal 21 ringgit 50 sen')
[35]: ['saya', 'tinggal', '21 ringgit', '50 sen']


date

[36]: tokenizer.tokenize('tarikh perjumpaan 10/11/2011')
[36]: ['tarikh', 'perjumpaan', '10/11/2011']

[37]: tokenizer.tokenize('tarikh perjumpaan 10-11-2011')
[37]: ['tarikh', 'perjumpaan', '10-11-2011']

[38]: tokenizer.tokenize('tarikh perjumpaan 12 mei 2011')
[38]: ['tarikh', 'perjumpaan', '12 mei 2011']

[39]: tokenizer.tokenize('tarikh perjumpaan mei 12 2011')
[39]: ['tarikh', 'perjumpaan', 'mei 12 2011']

time

[40]: tokenizer.tokenize('jumpa 3 am')
[40]: ['jumpa', '3 am']

[41]: tokenizer.tokenize('jumpa 22:00')
[41]: ['jumpa', '22:00']

censored

[42]: tokenizer.tokenize('f**k lah')
[42]: ['f**k', 'lah']

emphasis

[43]: tokenizer.tokenize('*damn* good weih')
[43]: ['*damn*', 'good', 'weih']

numbers

[44]: tokenizer.tokenize('no saya 123')
[44]: ['no', 'saya', '123']


temperature

[45]: tokenizer.tokenize('sejuk harini, 31.1c')
[45]: ['sejuk', 'harini', ',', '31.1c']

[46]: tokenizer.tokenize('sejuk harini, 31.1C')
[46]: ['sejuk', 'harini', ',', '31.1C']

distance

[47]: tokenizer.tokenize('nak sampai lagi 31km')
[47]: ['nak', 'sampai', 'lagi', '31km']

[48]: tokenizer.tokenize('nak sampai lagi 31 km')
[48]: ['nak', 'sampai', 'lagi', '31 km']

volume

[49]: tokenizer.tokenize('botol ni 400ml')
[49]: ['botol', 'ni', '400ml']

[50]: tokenizer.tokenize('botol ni 400 l')
[50]: ['botol', 'ni', '400 l']

duration

[51]: tokenizer.tokenize('aku dah tunggu kau 2jam kut')
[51]: ['aku', 'dah', 'tunggu', 'kau', '2jam', 'kut']

[52]: tokenizer.tokenize('aku dah tunggu kau 2 jam kut')
[52]: ['aku', 'dah', 'tunggu', 'kau', '2 jam', 'kut']

[53]: tokenizer.tokenize('lagi 10 minit 3 jam')
[53]: ['lagi', '10 minit', '3 jam']


weight

[54]: tokenizer.tokenize('berat kau 60 kg')
[54]: ['berat', 'kau', '60 kg']

[55]: tokenizer.tokenize('berat kau 60kg')
[55]: ['berat', 'kau', '60kg']

hypen

[56]: tokenizer.tokenize('sememang-memangnya kau sakai')
[56]: ['sememang-memangnya', 'kau', 'sakai']

[57]: tokenizer.tokenize('sememang- memangnya kau sakai')
[57]: ['sememang', '-', 'memangnya', 'kau', 'sakai']

9.15.2 Sentence tokenizer

We consider prefixes, suffixes, sentence starters, acronyms, websites, emails, digits and the text before them, time and month patterns when splitting a text into multiple sentences.

def split_into_sentences(text, minimum_length=5):
    """
    Sentence tokenizer.

    Parameters
    ----------
    text: str
    minimum_length: int, optional (default=5)
        minimum length to assume a string is a sentence, default 5 characters.

    Returns
    -------
    result: List[str]
    """
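For intuition, here is a much-simplified splitter in the same spirit, guarding only a couple of the patterns (like no.1) that the real function protects; this is a sketch, not Malaya's implementation.

import re

def naive_split(text, minimum_length=5):
    # protect periods that do not end a sentence, e.g. 'no. 1' or 'ke. 2'
    text = re.sub(r'\b(no|ke)\.\s*(\d)', r'\1<dot>\2', text, flags=re.I)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s.replace('<dot>', '.') for s in sentences]
    return [s for s in sentences if len(s) >= minimum_length]

print(naive_split('no. 1 polis bertemu dengan suspek di ladang getah. polis tembak.'))
# ['no.1 polis bertemu dengan suspek di ladang getah.', 'polis tembak.']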

[58]: s = """
      no.1 polis bertemu dengan suspek di ladang getah. polis tembak pui pui pui bertubi tubi
      """

[59]: malaya.text.function.split_into_sentences(s)
[59]: ['no.1 polis bertemu dengan suspek di ladang getah.',
      'polis tembak pui pui pui bertubi tubi.']

[60]: s = """
      email saya di husein.zol05@gmail.com, nanti jom berkopi
      """

[61]: malaya.text.function.split_into_sentences(s)
[61]: ['email saya di husein.zol05@gmail.com, nanti jom berkopi.']

[62]: s = """
      ke. 2 cerita nya begini. saya berjalan jalan ditepi muara jumpa anak dara.
      """

[63]: malaya.text.function.split_into_sentences(s)
[63]: ['ke.2 cerita nya begini.', 'saya berjalan jalan ditepi muara jumpa anak dara.']


9.16 Spelling Correction

This tutorial is available as an IPython notebook at Malaya/example/spell-correction.

[1]: %%time
     import malaya
CPU times: user 5.37 s, sys: 1.03 s, total: 6.4 s
Wall time: 7.18 s

[2]: # some text examples copied from Twitter

     string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'
     string2 = 'Husein ska mkn aym dkat kampng Jawa'
     string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'
     string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'
     string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'
     string6 = 'blh bntg dlm kls nlp sy, nnti intch'

9.16.1 Load probability speller

The probability speller extends the functionality of Peter Norvig's corrector, http://norvig.com/spell-correct.html, and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews. We also added custom vowel and consonant augmentation to adapt to our local shortforms / typos.

def probability(sentence_piece: bool = False, **kwargs):
    """
    Train a Probability Spell Corrector.

    Parameters
    ----------
    sentence_piece: bool, optional (default=False)
        if True, reduce possible augmentation states using sentence piece.

    Returns
    -------
    result: malaya.spell.Probability class
    """

[3]: prob_corrector = malaya.spell.probability()

To correct a word

def correct(self, word: str, **kwargs):
    """
    Most probable spelling correction for word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: str
    """

[4]: prob_corrector.correct('sy')
[4]: 'saya'

[5]: prob_corrector.correct('mhthir')
[5]: 'mahathir'

[6]: prob_corrector.correct('mknn')
[6]: 'makanan'

List possible generated pool of words

def edit_candidates(self, word):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: List[str]
    """

[7]: prob_corrector.edit_candidates('mhthir')
[7]: ['mahathir']

[8]: prob_corrector.edit_candidates('smbng')
[8]: ['sembang', 'smbg', 'sambung', 'simbang', 'sembung', 'sumbang', 'sambong', 'sambang', 'sumbing', 'sombong', 'sembong']

As you can see, edit_candidates suggested quite a lot of candidates, and some of them are not actual words, like sambang. To reduce that, we can use sentencepiece to check whether a candidate is a legitimate word in the Malaysian context or not.

[10]: prob_corrector_sp = malaya.spell.probability(sentence_piece=True)
      prob_corrector_sp.edit_candidates('smbng')
[10]: ['sumbing', 'sambung', 'smbg', 'sembung', 'sombong', 'sembong', 'sembang', 'sumbang', 'sambong']

So how does the model know which word to pick? The candidate with the highest count in the corpus wins.
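That selection rule can be sketched in a few lines, assuming WORDS is a frequency Counter built over a Malay corpus (the counts here are made up); this mirrors Peter Norvig's corrector, not Malaya's exact internals.

from collections import Counter

# toy corpus counts, purely illustrative
WORDS = Counter({'sembang': 900, 'sambung': 700, 'sombong': 650, 'sambang': 3})

def P(word, N=sum(WORDS.values())):
    # relative frequency of `word` in the corpus
    return WORDS[word] / N

def pick(candidates):
    # the candidate with the highest corpus count wins
    return max(candidates, key=P)

print(pick(['sembang', 'sambang', 'sombong']))  # 'sembang'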

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """

[9]: prob_corrector.correct_text(string1)
[9]: 'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'

[10]: prob_corrector.correct_text(string2)
[10]: 'Husein suka makan ayam dekat kampung Jawa'

[11]: prob_corrector.correct_text(string3)
[11]: 'Melayu malas ini narration dia sama sahaja macam men are trash. True to some, false to some.'

[12]: prob_corrector.correct_text(string4)
[12]: 'Tapi tak fikir ke bahaya perpetuate myths macam itu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah. Your kids will be victims of that too.'

[13]: prob_corrector.correct_text(string5)
[13]: 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as saya am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'

[14]: prob_corrector.correct_text(string6)
[14]: 'boleh bintang dalam kelas nlp saya, nanti intch'

9.16.2 Load JamSpell speller

JamSpell uses Norvig's approach combined with N-gram language models. Before you can use this spelling correction, you need to install jamspell. For Mac,

wget http://prdownloads.sourceforge.net/swig/swig-3.0.12.tar.gz
tar -zxf swig-3.0.12.tar.gz
./swig-3.0.12/configure && make && make install
pip3 install jamspell

For Debian / Ubuntu,

apt install swig3
pip3 install jamspell

def jamspell(model: str = 'wiki+news', **kwargs):
    """
    Load a jamspell Spell Corrector for Malay.

    Parameters
    ----------
    model: str, optional (default='wiki+news')
        Supported models. Allowed values:

        * ``'wiki+news'`` - Wikipedia + News, 337MB.
        * ``'wiki'`` - Wikipedia, 148MB.
        * ``'news'`` - News, 215MB.

    Returns
    -------
    result: malaya.spell.JamSpell class
    """

[4]: model = malaya.spell.jamspell(model='wiki')

To correct a word

def correct(self, word: str, string: str, index: int = -1):
    """
    Correct a word within a text, returning the corrected word.

    Parameters
    ----------
    word: str
    string: str
        Entire string, `word` must be a word inside `string`.
    index: int, optional (default=-1)
        index of word in the string, if -1, will try to use `string.index(word)`.

    Returns
    -------
    result: str
    """

[5]: model.correct('suke', 'saya suke makan iyom')
[5]: 'suka'

List possible generated pool of words

def edit_candidates(self, word: str, string: str, index: int = -1):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str
    string: str
        Entire string, `word` must be a word inside `string`.
    index: int, optional (default=-1)
        index of word in the string, if -1, will try to use `string.index(word)`.

    Returns
    -------
    result: List[str]
    """

[15]: model.edit_candidates('ayem', 'saya suke makan ayem')
[15]: ('ayem', 'ayam', 'ayer', 'aye', 'asem', 'yem', 'adem', 'alem', 'aem', 'ayim', 'oyem', 'ayew', 'azem', 'ajem', 'ayiem')

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """

[17]: model.correct_text('saya suke makan ayom')
[17]: 'saya suka makan ayam'

9.16.3 Load Spylls speller

Spylls is Hunspell ported to Python. Before you can use this spelling correction, you need to install spylls,

pip3 install Spylls

def spylls(model: str = 'libreoffice-pejam', **kwargs):
    """
    Load a spylls Spell Corrector for Malay.

    Parameters
    ----------
    model: str, optional (default='libreoffice-pejam')
        Model spelling correction supported. Allowed values:

        * ``'libreoffice-pejam'`` - from LibreOffice pEJAm, https://extensions.libreoffice.org/en/extensions/show/3868

    Returns
    -------
    result: malaya.spell.Spylls class
    """

[17]: model = malaya.spell.spylls()

To correct a word

def correct(self, word: str):
    """
    Correct a word within a text, returning the corrected word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: str
    """

[18]: model.correct('sy')
[18]: 'st'

[19]: model.correct('mhthir')
[19]: 'Mahathir'

[20]: model.correct('mknn')
[20]: 'knn'

List possible generated pool of words

def edit_candidates(self, word: str):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: List[str]
    """
    return list(self._dictionary.suggest(word))

[21]: model.edit_candidates('mhthir')
[21]: ['Mahathir']

[22]: model.edit_candidates('smbng')
[22]: ['sbng', 'smbang', 'jmbng', 'cmbng']

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """

[23]: model.correct_text(string1)
[23]: 'kerajaan putat baji pencen awal tks dpk warga enas supaya emisi'

[24]: model.correct_text(string2)
[24]: 'Husein sak mkn aum tkad kampang Jawa'

[25]: model.correct_text(string3)
[25]: 'Melayu malas in paration ida asma ja macam man ara tras. True tu som, falsafah tu som.'

[26]: model.correct_text(string4)
[26]: 'Tapi kat fikir ka bahaya terperbuat smythea catu. Nanti kalau ada giring diskriminatif desiliter our food identification becus of our reca tua pukal ramah. Your kias wila ba victoria of rhat oto.'

9.16.4 Load Encoder transformer speller

This spelling correction is a transformer-based, improved version of malaya.spell.probability. The problem with malaya.spell.probability is that it naively picks the highest-probability word based on public sentences (wiki, news and social media) without understanding the actual context, for example,

string = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'
prob_corrector = malaya.spell.probability()
prob_corrector.correct_text(string)
-> 'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'

It should have replaced skt with sikit, a common word people use on social media, to give a little bit of attention to pencen. So, to fix that, we can use a Transformer model! Right now the transformer speller supports ``BERT``, ``ALBERT`` and ``ELECTRA`` only.
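The gist of the encoder approach, as a hedged sketch: substitute each candidate into the sentence and keep the one the language model scores highest in context. mlm_score below is a hypothetical stand-in hook, not Malaya's API.

from typing import Callable, List

def pick_best(tokens: List[str], index: int, candidates: List[str],
              mlm_score: Callable[[List[str], int], float]) -> str:
    # try each candidate in place, keep the contextually most probable one
    best, best_score = candidates[0], float('-inf')
    for candidate in candidates:
        trial = tokens[:index] + [candidate] + tokens[index + 1:]
        score = mlm_score(trial, index)
        if score > best_score:
            best, best_score = candidate, score
    return best

# toy scorer that mimics a model preferring 'sikit' in this context
def toy_score(tokens: List[str], index: int) -> float:
    return 1.0 if tokens[index] == 'sikit' else 0.0

tokens = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'.split()
print(pick_best(tokens, 5, ['sakit', 'sikit', 'sekat'], toy_score))  # 'sikit'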


def transformer_encoder(model, sentence_piece: bool = False, **kwargs):
    """
    Load a Transformer Encoder Spell Corrector. Right now only supported BERT, ALBERT and ELECTRA.

    Parameters
    ----------
    sentence_piece: bool, optional (default=False)
        if True, reduce possible augmentation states using sentence piece.

    Returns
    -------
    result: malaya.spell.Transformer class
    """

[3]: model = malaya.transformer.load(model='electra')
     transformer_corrector = malaya.spell.transformer_encoder(model, sentence_piece=True)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:56: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:240: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.

WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:115: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:119: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:122: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:128: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:130: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt

To correct a sentence

def correct_text(self, text: str, batch_size: int = 20):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str
    batch_size: int, optional (default=20)
        batch size to insert into model.

    Returns
    -------
    result: str
    """

[4]: transformer_corrector.correct_text(string1)
[4]: 'kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi'

Perfect! But again, the transformer model is very expensive. You can compare the wall time with the probability-based corrector.

[6]: %%time
     transformer_corrector.correct_text(string1)
CPU times: user 21.8 s, sys: 1.19 s, total: 23 s
Wall time: 5.15 s
[6]: 'kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi'

[7]: %%time
     prob_corrector.correct_text(string1)
CPU times: user 108 ms, sys: 3.34 ms, total: 112 ms
Wall time: 112 ms
[7]: 'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'

9.16.5 Load symspeller speller

This spelling correction is an improved version of symspeller, adapted to our local shortforms / typos. Before you can use this spelling correction, you need to install symspellpy,

pip install symspellpy

def symspell(
    max_edit_distance_dictionary: int = 2,
    prefix_length: int = 7,
    term_index: int = 0,
    count_index: int = 1,
    top_k: int = 10,
    **kwargs
):
    """
    Train a symspell Spell Corrector.

    Returns
    -------
    result: malaya.spell.Symspell class
    """

[11]: symspell_corrector = malaya.spell.symspell()

To correct a word

def correct(self, word: str, **kwargs):
    """
    Most probable spelling correction for word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: str
    """

[12]: symspell_corrector.correct('bntng')
[12]: 'bintang'

[13]: symspell_corrector.correct('kerajaan')
[13]: 'kerajaan'

[14]: symspell_corrector.correct('mknn')
[14]: 'makanan'

List possible generated words

def edit_step(self, word):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: List[str]
    """

[15]: symspell_corrector.edit_step('mrh')
[15]: {'marah': 12684.0,
      'merah': 21448.5,
      'arah': 15066.5,
      'darah': 10003.0,
      'mara': 7504.5,
      'malah': 7450.0,
      'zarah': 3753.5,
      'murah': 3575.5,
      'barah': 2707.5,
      'march': 2540.5,
      'martha': 390.0,
      'marsha': 389.0,
      'maratha': 88.5,
      'marcha': 22.5,
      'karaha': 13.5,
      'maraba': 13.5,
      'varaha': 11.5,
      'marana': 4.5,
      'marama': 4.5}

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """


[16]: symspell_corrector.correct_text(string1)
[16]: 'kerajaan patut bagi pencen awal saat kepada warga emas supaya emosi'

[17]: symspell_corrector.correct_text(string2)
[17]: 'Husein suka makan ayam dapat kampung Jawa'

[18]: symspell_corrector.correct_text(string3)
[18]: 'Melayu malas ni narration dia sama sahaja macam men are trash. True to some, false to some.'

[19]: symspell_corrector.correct_text(string4)
[19]: 'Tapi tak fikir ke bahaya perpetuate maathai macam itu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah. Your kids will be victims of that too.'

[20]: symspell_corrector.correct_text(string5)
[20]: 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as saya am edging towards retirement in 4-5 aras time after a career of being an Engineer, Project Manager, General Manager'

[21]: symspell_corrector.correct_text(string6)
[21]: 'boleh bintang dalam kelas malaya saya, nanti mintalah'

9.16.6 List available Transformer models

We use custom spelling augmentation (a rough sketch of the idea follows the list),

1. replace_similar_consonants
   • mereka -> nereka
2. replace_similar_vowels
   • suka -> sika
3. socialmedia_form
   • suka -> ska
4. vowel_alternate
   • singapore -> sngpore
   • kampung -> kmpng
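Roughly, the last two families can be sketched as below; these are illustrative approximations of the idea, not Malaya's exact rules.

VOWELS = set('aeiou')

def vowel_alternate(word):
    # drop vowels to mimic shortforms: 'kampung' -> 'kmpng'
    return ''.join(c for c in word if c not in VOWELS)

def socialmedia_form(word):
    # drop vowels except the last character: 'suka' -> 'ska'
    return ''.join(c for c in word[:-1] if c not in VOWELS) + word[-1]

print(vowel_alternate('kampung'))  # kmpng
print(socialmedia_form('suka'))    # ska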

[4]: malaya.spell.available_transformer()
INFO:root:tested on 10k test set.
[4]:                Size (MB)  Quantized Size (MB)       WER  Suggested length
     small-t5           355.6                195.0  0.015625             256.0
     tiny-t5            208.0                103.0  0.023712             256.0
     super-tiny-t5       81.8                 27.1  0.038001             256.0


9.16.7 Load Transformer model

def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
    """
    Load a Transformer Spell Corrector.

    Parameters
    ----------
    model: str, optional (default='small-t5')
        Model architecture supported. Allowed values:

        * ``'small-t5'`` - T5 SMALL parameters.
        * ``'tiny-t5'`` - T5 TINY parameters.
        * ``'super-tiny-t5'`` - T5 SUPER TINY parameters.

    quantized: bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: malaya.model.t5.Spell class
    """

[3]: t5 = malaya.spell.transformer(model='tiny-t5')

Predict using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    spelling correction for strings.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[str]
    """

[6]: t5.greedy_decoder([string1])
[6]: ['kerajaan patut bagi pencen awal skt kpd warga emas supaya emosi']

[8]: t5.greedy_decoder([string2])
[8]: ['Husein suka makan ayam dekat kampung Jawa']

[7]: t5.greedy_decoder([string3])
[7]: ['Melayu malas ni narration dia sama je macam men are trash . True to some , false to some .']


9.17 Coreference Resolution

This tutorial is available as an IPython notebook at Malaya/example/coref.

This module was only trained on standard language structure, so it is not safe to use it on local (non-standard) language structure.

9.17.1 What is Coreference Resolution?

[1]: from IPython.core.display import Image, display

     display(Image('https://nlp.stanford.edu/projects/corefexample.png', width=500))

Kakak mempunyai kucing. Dia menyayanginya. Dia -> Kakak, nya -> kucing
Husein Zolkepli suka makan ayam. Dia pun suka makan daging. Dia -> Husein Zolkepli

[2]: %%time
     import malaya
CPU times: user 5.25 s, sys: 936 ms, total: 6.19 s
Wall time: 6.79 s

9.17.2 Load dependency models

[3]: model = malaya.dependency.transformer(model='albert')
     alxlnet = malaya.dependency.transformer(model='tiny-albert')

9.17.3 Resolve coreference clusters using dependency parsing

def parse_from_dependency(
    models,
    string: str,
    references: List[str] = ['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'],
    rejected_references: List[str] = ['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'],
    acceptable_subjects: List[str] = ['flat', 'subj', 'nsubj', 'csubj', 'obl', 'obj'],
    acceptable_nested_subjects: List[str] = ['compound', 'flat'],
    split_nya: bool = True,
    aggregate: Callable = np.mean,
    top_k: int = 20,
):
    """
    Apply Coreference Resolution using stacks of dependency models.

    Parameters
    ----------
    models: list
        list of dependency models, must has `vectorize` method.
    string: str
    references: List[str], optional (default=['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'])
        list of references.
    rejected_references: List[str], optional (default=['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'])
        list of rejected references during populating subjects.
    acceptable_subjects: List[str], optional
        List of dependency labels for subjects.
    acceptable_nested_subjects: List[str], optional
        List of dependency labels for nested subjects, eg, syarikat (obl) facebook (compound).
    split_nya: bool, optional (default=True)
        split `nya`, eg, `disifatkannya` -> `disifatkan`, `nya`.
    aggregate: Callable, optional (default=numpy.mean)
        Aggregate function to aggregate list of vectors from `model.vectorize`.
    top_k: int, optional (default=20)
        only accept near top_k to assume a coherence.

    Returns
    -------
    result: Dict[text, coref]
        {'text': ['Husein', 'Zolkepli', 'suka', 'makan', 'ayam', '.', 'Dia', 'pun', 'suka', 'makan', 'daging', '.'],
         'coref': {6: {'index': [0, 1], 'text': ['Husein', 'Zolkepli']}}}
    """

[15]: string = 'Husein Zolkepli suka makan ayam. Dia pun suka makan daging.'
      string1 = 'Kakak mempunyai kucing. Dia menyayanginya.'

      # https://www.malaysiakini.com/news/580044
      string2 = 'Pengerusi PKR Terengganu Azan Ismail menyelar pemimpin PAS yang disifatkannya sebagai membisu mengenai gesaan mengadakan sidang Dewan Undangan Negeri (DUN) di negeri yang dipimpin parti mereka.'

      # https://www.sinarharian.com.my/article/146270/EDISI/Tiada-isu-penjualan-vaksin-Covid-19-di-
      string3 = 'Kota Bharu - Polis Kelantan mengesahkan masih belum menerima sebarang laporan berkaitan isu penjualan vaksin tidak sah berlaku di negeri ini. Timbalan Ketua Polis Kelantan, Senior Asisten Komisioner Abdullah Mohammad Piah berkata, bagaimanapun pihaknya sedia menjalankan siasatan lanjut jika menerima laporan berkaitan perkara itu.'

[8]: %%time
     malaya.coref.parse_from_dependency([model], string)
CPU times: user 434 ms, sys: 87.3 ms, total: 521 ms
Wall time: 126 ms
[8]: {'text': ['Husein', 'Zolkepli', 'suka', 'makan', 'ayam', '.', 'Dia', 'pun', 'suka', 'makan', 'daging', '.'],
     'coref': {6: {'index': [0, 1], 'text': ['Husein', 'Zolkepli']}}}

[9]: %%time
     malaya.coref.parse_from_dependency([model], string1)
CPU times: user 373 ms, sys: 76 ms, total: 449 ms
Wall time: 108 ms
[9]: {'text': ['Kakak', 'mempunyai', 'kucing', '.', 'Dia', 'menyayangi', 'nya', '.'],
     'coref': {4: {'index': [0], 'text': ['Kakak']},
     6: {'index': [2], 'text': ['kucing']}}}

[10]: %%time
      malaya.coref.parse_from_dependency([model], string2)
CPU times: user 673 ms, sys: 196 ms, total: 869 ms
Wall time: 197 ms
[10]: {'text': ['Pengerusi', 'PKR', 'Terengganu', 'Azan', 'Ismail', 'menyelar', 'pemimpin', 'PAS', 'yang', 'disifatkan', 'nya', 'sebagai', 'membisu', 'mengenai', 'gesaan', 'mengadakan', 'sidang', 'Dewan', 'Undangan', 'Negeri', '(', 'DUN', ')', 'di', 'negeri', 'yang', 'dipimpin', 'parti', 'mereka', '.'],
      'coref': {10: {'index': [6, 7], 'text': ['pemimpin', 'PAS']},
      28: {'index': [16, 17, 18, 19], 'text': ['sidang', 'Dewan', 'Undangan', 'Negeri']}}}

[16]: %%time
      malaya.coref.parse_from_dependency([model], string3)
CPU times: user 738 ms, sys: 183 ms, total: 922 ms
Wall time: 207 ms
[16]: {'text': ['Kota', 'Bharu', '-', 'Polis', 'Kelantan', 'mengesahkan', 'masih', 'belum', 'menerima', 'sebarang', 'laporan', 'berkaitan', 'isu', 'penjualan', 'vaksin', 'tidak', 'sah', 'berlaku', 'di', 'negeri', 'ini', '.', 'Timbalan', 'Ketua', 'Polis', 'Kelantan', ',', 'Senior', 'Asisten', 'Komisioner', 'Abdullah', 'Mohammad', 'Piah', 'berkata', ',', 'bagaimanapun', 'pihak', 'nya', 'sedia', 'menjalankan', 'siasatan', 'lanjut', 'jika', 'menerima', 'laporan', 'berkaitan', 'perkara', 'itu', '.'],
      'coref': {20: {'index': [12], 'text': ['isu']},
      37: {'index': [28, 29, 30, 31, 32], 'text': ['Asisten', 'Komisioner', 'Abdullah', 'Mohammad', 'Piah']},
      47: {'index': [40], 'text': ['siasatan']}}}


9.18 Normalizer

This tutorial is available as an IPython notebook at Malaya/example/normalizer.

[1]: %%time
     import malaya
CPU times: user 4.85 s, sys: 667 ms, total: 5.51 s
Wall time: 4.51 s

[2]: string1 = 'xjdi ke, y u xsuke makan HUSEIN kt situ tmpt, i hate it. pelikle, pada'
     string2 = 'i mmg2 xske mknn HUSEIN kampng tmpat, i love them. pelikle saye'
     string3 = 'perdana menteri ke11 sgt suka makn ayam, harganya cuma rm15.50'
     string4 = 'pada 10/4, kementerian mengumumkan, 1/100'
     string5 = 'Husein Zolkepli dapat tempat ke-12 lumba lari hari ni'
     string6 = 'Husein Zolkepli (2011 - 2019) adalah ketua kampng di kedah sekolah King Edward ke-IV'
     string7 = '2jam 30 minit aku tunggu kau, 60.1 kg kau ni, suhu harini 31.2c, aku dahaga minum 600ml'


9.18.1 Load normalizer

This normalizer can load any spelling correction model, eg, malaya.spell.probability or malaya.spell.transformer.

def normalizer(speller=None, **kwargs):
    """
    Load a Normalizer using any spelling correction model.

    Parameters
    ----------
    speller: spelling correction object, optional (default=None)

    Returns
    -------
    result: malaya.normalize.Normalizer class
    """

[3]: corrector = malaya.spell.probability()
     normalizer = malaya.normalize.normalizer(corrector)

normalize

def normalize(
    self,
    string: str,
    check_english: bool = True,
    normalize_text: bool = True,
    normalize_entity: bool = True,
    normalize_url: bool = False,
    normalize_email: bool = False,
    normalize_year: bool = True,
    normalize_telephone: bool = True,
):
    """
    Normalize a string.

    Parameters
    ----------
    string: str
    check_english: bool, (default=True)
        check a word in english dictionary.
    normalize_text: bool, (default=True)
        if True, will try to replace shortforms with internal corpus.
    normalize_entity: bool, (default=True)
        normalize entities, only effect `date`, `datetime`, `time` and `money` patterns string only.
    normalize_url: bool, (default=False)
        if True, replace `://` with empty and `.` with `dot`.
        `https://huseinhouse.com` -> `https huseinhouse dot com`.
    normalize_email: bool, (default=False)
        if True, replace `@` with `di`, `.` with `dot`.
        `husein.zol05@gmail.com` -> `husein dot zol kosong lima di gmail dot com`.
    normalize_year: bool, (default=True)
        if True, `tahun 1987` -> `tahun sembilan belas lapan puluh tujuh`.
        if True, `1970-an` -> `sembilan belas tujuh puluh an`.
        if False, `tahun 1987` -> `tahun seribu sembilan ratus lapan puluh tujuh`.
    normalize_telephone: bool, (default=True)
        if True, `no 012-1234567` -> `no kosong satu dua, satu dua tiga empat lima enam tujuh`.

    Returns
    -------
    string: normalized string
    """

[4]: string = 'boleh dtg 8pagi esok tak atau minggu depan? 2 oktober 2019 2pm, tlong bayar rm 3.2k sekali tau'

[5]: normalizer.normalize(string)
[5]: {'normalize': 'boleh datang lapan pagi esok tidak atau minggu depan ? 02/10/2019 14:00:00 , tolong bayar tiga ribu dua ratus ringgit sekali tahu',
     'date': {'8 AM esok': datetime.datetime(2021, 1, 1, 8, 0),
      '2 oktober 2019 2pm': datetime.datetime(2019, 10, 2, 14, 0),
      'minggu depan': datetime.datetime(2021, 1, 7, 19, 33, 47, 65094)},
     'money': {'rm 3.2k': 'RM3200.0'}}

[6]: normalizer.normalize(string, normalize_entity=False)
[6]: {'normalize': 'boleh datang lapan pagi esok tidak atau minggu depan ? 02/10/2019 14:00:00 , tolong bayar tiga ribu dua ratus ringgit sekali tahu',
     'date': {},
     'money': {}}

Here you can see, the Malaya normalizer normalized minggu depan to a datetime object, and 3.2k ringgit to RM3200.
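The rm 3.2k -> RM3200 step boils down to a suffix multiplier; here is a tiny sketch of just that piece, while Malaya's parser covers far more patterns and currencies.

MULTIPLIERS = {'k': 1e3, 'm': 1e6, 'b': 1e9}

def parse_rm(text):
    # strip the currency marker, then apply the shortform multiplier
    value = text.lower().replace('rm', '').strip()
    if value[-1] in MULTIPLIERS:
        return float(value[:-1]) * MULTIPLIERS[value[-1]]
    return float(value)

print(parse_rm('rm 3.2k'))  # 3200.0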

[7]: print(normalizer.normalize(string1))
     print(normalizer.normalize(string2))
     print(normalizer.normalize(string3))
     print(normalizer.normalize(string4))
     print(normalizer.normalize(string5))
     print(normalizer.normalize(string6))
     print(normalizer.normalize(string7))
{'normalize': 'tak jadi ke , kenapa awak tak suka makan HUSEIN kat situ tempat , saya hate it . pelik lah , pada', 'date': {}, 'money': {}}
{'normalize': 'saya memang-memang tak suka makan HUSEIN kampung tempat , saya love them . pelik lah saya', 'date': {}, 'money': {}}
{'normalize': 'perdana menteri kesebelas sangat suka makan ayam , harganya cuma lima belas ringgit lima puluh sen', 'date': {}, 'money': {'rm15.50': 'RM15.50'}}
{'normalize': 'pada sepuluh hari bulan empat , kementerian mengumumkan , satu per seratus', 'date': {}, 'money': {}}
{'normalize': 'Husein Zolkepli dapat tempat kedua belas lumba lari hari ini', 'date': {}, 'money': {}}
{'normalize': 'Husein Zolkepli ( dua ribu sebelas hingga dua ribu sembilan belas ) adalah ketua kampung di kedah sekolah King Edward keempat', 'date': {}, 'money': {}}
{'normalize': 'dua jam tiga puluh minit aku tunggu kamu , enam puluh perpuluhan satu kilogram kamu ini , suhu hari ini tiga puluh satu perpuluhan dua celsius , aku dahaga minum enam ratus milliliter', 'date': {}, 'money': {}}


9.18.2 Skip spelling correction

Simply pass None as the speller to malaya.normalize.normalizer; the default is None.

[8]: normalizer = malaya.normalize.normalizer(corrector)
     without_corrector_normalizer = malaya.normalize.normalizer(None)

[9]: normalizer.normalize(string2)
[9]: {'normalize': 'saya memang-memang tak suka makan HUSEIN kampung tempat , saya love them . pelik lah saya', 'date': {}, 'money': {}}

[10]: without_corrector_normalizer.normalize(string2)
[10]: {'normalize': 'saya memang-memang tak suka mknn HUSEIN kampng tmpat , saya love them . pelik lah saya', 'date': {}, 'money': {}}

9.18.3 Pass kwargs preprocessing

Let's say you want to skip normalizing the date pattern; you can pass kwargs to the normalizer. Check the original tokenizer implementation at https://github.com/huseinzol05/Malaya/blob/master/malaya/preprocessing.py#L103

[11]: normalizer = malaya.normalize.normalizer(corrector)
      skip_date_normalizer = malaya.normalize.normalizer(corrector, date=False)

[12]: normalizer.normalize('tarikh program tersebut 14 mei')
[12]: {'normalize': 'tarikh program tersebut 14/05/2020',
      'date': {'14 mei': datetime.datetime(2020, 5, 14, 0, 0)},
      'money': {}}

[13]: skip_date_normalizer.normalize('tarikh program tersebut 14 mei')
[13]: {'normalize': 'tarikh program tersebut empat belas mei',
      'date': {'14 mei': datetime.datetime(2020, 5, 14, 0, 0)},
      'money': {}}

9.18.4 Normalize url

Let's say you have a URL token, for example https://huseinhouse.com; this parameter is going to,

1. replace :// with an empty string.
2. replace . with dot.
3. replace digits with their string representation.

Simply call normalizer.normalize(string, normalize_url = True); the default is False. A minimal sketch of these steps follows below.
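A minimal sketch of those three steps; the digit lookup here is a hypothetical stand-in for Malaya's Num2Word module.

DIGITS = {'0': 'kosong', '1': 'satu', '2': 'dua', '3': 'tiga', '4': 'empat',
          '5': 'lima', '6': 'enam', '7': 'tujuh', '8': 'lapan', '9': 'sembilan'}

def normalize_url(url):
    url = url.replace('://', ' ').replace('.', ' dot ')
    # spell out each digit, then squeeze repeated spaces
    spelled = ''.join(' %s ' % DIGITS[c] if c in DIGITS else c for c in url)
    return ' '.join(spelled.split())

print(normalize_url('https://huseinhouse02934.com'))
# https huseinhouse kosong dua sembilan tiga empat dot com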

[14]: normalizer = malaya.normalize.normalizer()
      normalizer.normalize('web saya ialah https://huseinhouse.com')
[14]: {'normalize': 'web saya ialah https://huseinhouse.com', 'date': {}, 'money': {}}

[15]: normalizer.normalize('web saya ialah https://huseinhouse.com', normalize_url=True)
[15]: {'normalize': 'web saya ialah https huseinhouse dot com', 'date': {}, 'money': {}}

[16]: normalizer.normalize('web saya ialah https://huseinhouse02934.com', normalize_url=True)
[16]: {'normalize': 'web saya ialah https huseinhouse kosong dua sembilan tiga empat dot com', 'date': {}, 'money': {}}

9.18.5 Normalize email

Let's say you have an email token, for example husein.zol05@gmail.com; this parameter is going to,

1. replace :// with an empty string.
2. replace . with dot.
3. replace @ with di.
4. replace digits with their string representation.

Simply call normalizer.normalize(string, normalize_email = True); the default is False.

[17]: normalizer = malaya.normalize.normalizer()
      normalizer.normalize('email saya ialah husein.zol05@gmail.com')
[17]: {'normalize': 'email saya ialah husein.zol05@gmail.com', 'date': {}, 'money': {}}

[18]: normalizer = malaya.normalize.normalizer()
      normalizer.normalize('email saya ialah husein.zol05@gmail.com', normalize_email=True)
[18]: {'normalize': 'email saya ialah husein dot zol kosong lima di gmail dot com', 'date': {}, 'money': {}}

9.18.6 Normalize year

1. if True, tahun 1987 -> tahun sembilan belas lapan puluh tujuh.
2. if True, 1970-an -> sembilan belas tujuh puluh an.
3. if False, tahun 1987 -> tahun seribu sembilan ratus lapan puluh tujuh.

Simply call normalizer.normalize(string, normalize_year = True); the default is True.


[19]: normalizer = malaya.normalize.normalizer()

[20]: normalizer.normalize('$400 pada tahun 1998 berbanding lebih $1000')
[20]: {'normalize': 'empat ratus dollar pada tahun sembilan belas sembilan puluh lapan berbanding lebih seribu dollar',
      'date': {},
      'money': {'$400': '$400', '$1000': '$1000'}}

[21]: normalizer.normalize('$400 pada 1970-an berbanding lebih $1000')
[21]: {'normalize': 'empat ratus dollar pada sembilan belas tujuh puluhan berbanding lebih seribu dollar',
      'date': {},
      'money': {'$400': '$400', '$1000': '$1000'}}

[22]: normalizer.normalize('$400 pada tahun 1970-an berbanding lebih $1000')
[22]: {'normalize': 'empat ratus dollar pada tahun sembilan belas tujuh puluhan berbanding lebih seribu dollar',
      'date': {},
      'money': {'$400': '$400', '$1000': '$1000'}}

[23]: normalizer.normalize('$400 pada tahun 1998 berbanding lebih $1000', normalize_year=False)
[23]: {'normalize': 'empat ratus dollar pada tahun seribu sembilan ratus sembilan puluh lapan berbanding lebih seribu dollar',
      'date': {},
      'money': {'$400': '$400', '$1000': '$1000'}}

9.18.7 Normalize telephone

1. if True, no 012-1234567 -> no kosong satu dua, satu dua tiga empat lima enam tujuh.

Simply normalizer.normalize(string, normalize_telephone = True), default is True.

[24]: normalizer = malaya.normalize.normalizer()

[25]: normalizer.normalize('no saya 012-1234567')
[25]: {'normalize': 'no saya kosong satu dua, satu dua tiga empat lima enam tujuh',
      'date': {},
      'money': {}}

[26]: normalizer.normalize('no saya 012-1234567', normalize_telephone=False)
[26]: {'normalize': 'no saya 012-1234567', 'date': {}, 'money': {}}


9.18.8 Ignore normalize money

Let's say I have a text that contains RM 77 juta and I wish to keep it like that.

[27]: text = 'Suatu ketika rakyat Malaysia dikejutkan dengan kontrak pelantikan sebanyak hampir RM 77 juta setahun yang hanya terdedah apabila diasak oleh Datuk Seri Anwar Ibrahim.'

[28]: normalizer = malaya.normalize.normalizer()

[29]: normalizer.normalize(text)
[29]: {'normalize': 'Suatu ketika rakyat Malaysia dikejutkan dengan kontrak pelantikan sebanyak hampir tujuh puluh tujuh ringgit juta setahun yang hanya terdedah apabila diasak oleh Datuk Seri .',
      'date': {},
      'money': {'rm 77': 'RM77'}}

[30]: normalizer = malaya.normalize.normalizer(money=False)
      normalizer.normalize(text, normalize_text=False, check_english=False)
[30]: {'normalize': 'Suatu ketika rakyat Malaysia dikejutkan dengan kontrak pelantikan sebanyak hampir RM tujuh puluh tujuh juta setahun yang hanya terdedah apabila diasak oleh Datuk Seri Anwar Ibrahim .',
      'date': {},
      'money': {}}

[31]: normalizer.normalize(text, normalize_text=False, check_english=False)
[31]: {'normalize': 'Suatu ketika rakyat Malaysia dikejutkan dengan kontrak pelantikan sebanyak hampir RM tujuh puluh tujuh juta setahun yang hanya terdedah apabila diasak oleh Datuk Seri Anwar Ibrahim .',
      'date': {},
      'money': {}}

9.18.9 Normalizing rules

All these rules are ignored if the first letter is capitalized, except for normalizing titles.

1. Normalize title,

{
    'dr': 'Doktor',
    'yb': 'Yang Berhormat',
    'hj': 'Haji',
    'ybm': 'Yang Berhormat Mulia',
    'tyt': 'Tuan Yang Terutama',
    'yab': 'Yang Berhormat',
    'ybm': 'Yang Berhormat Mulia',
    'yabhg': 'Yang Amat Berbahagia',
    'ybhg': 'Yang Berbahagia',
    'miss': 'Cik',
}


[32]: normalizer = malaya.normalize.normalizer()

[33]: normalizer.normalize('Dr yahaya')
[33]: {'normalize': 'Doktor yahaya', 'date': {}, 'money': {}}

2. expand x

[34]: normalizer.normalize('xtahu')
[34]: {'normalize': 'tak tahu', 'date': {}, 'money': {}}

3. normalize ke -

[35]: normalizer.normalize('ke-12')
[35]: {'normalize': 'kedua belas', 'date': {}, 'money': {}}

[36]: normalizer.normalize('ke - 12')
[36]: {'normalize': 'kedua belas', 'date': {}, 'money': {}}

4. normalize ke - roman

[37]: normalizer.normalize('ke-XXI')
[37]: {'normalize': 'kedua puluh satu', 'date': {}, 'money': {}}

[38]: normalizer.normalize('ke - XXI')
[38]: {'normalize': 'kedua puluh satu', 'date': {}, 'money': {}}

5. normalize NUM - NUM

[39]: normalizer.normalize('2011 - 2019')
[39]: {'normalize': 'dua ribu sebelas hingga dua ribu sembilan belas',
      'date': {},
      'money': {}}

[40]: normalizer.normalize('2011.01-2019')
[40]: {'normalize': 'dua ribu sebelas perpuluhan kosong satu hingga dua ribu sembilan belas',
      'date': {},
      'money': {}}


6. normalize pada NUM (/ | -) NUM

[41]: normalizer.normalize('pada 10/4')
[41]: {'normalize': 'pada sepuluh hari bulan empat', 'date': {}, 'money': {}}

[42]: normalizer.normalize('PADA 10 -4')
[42]: {'normalize': 'pada sepuluh hari bulan empat', 'date': {}, 'money': {}}

7. normalize NUM / NUM

[43]: normalizer.normalize('10 /4')
[43]: {'normalize': 'sepuluh per empat', 'date': {}, 'money': {}}

8. normalize rm NUM

[44]: normalizer.normalize('RM10.5')
[44]: {'normalize': 'sepuluh ringgit lima puluh sen', 'date': {}, 'money': {'rm10.5': 'RM10.5'}}

9. normalize rm NUM sen

[45]: normalizer.normalize('rm 10.5 sen')
[45]: {'normalize': 'sepuluh ringgit lima puluh sen', 'date': {}, 'money': {'rm 10.5': 'RM10.5'}}

10. normalize NUM sen

[46]: normalizer.normalize('1015 sen')
[46]: {'normalize': 'sepuluh ringgit lima belas sen', 'date': {}, 'money': {'1015 sen': 'RM10.15'}}

11. normalize money

[47]: normalizer.normalize('rm10.4m')
[47]: {'normalize': 'sepuluh juta empat ratus ribu ringgit', 'date': {}, 'money': {'rm10.4m': 'RM10400000.0'}}

[48]: normalizer.normalize('$10.4K')
[48]: {'normalize': 'sepuluh ribu empat ratus dollar', 'date': {}, 'money': {'$10.4k': '$10400.0'}}

12. normalize cardinal

[49]: normalizer.normalize('123')
[49]: {'normalize': 'seratus dua puluh tiga', 'date': {}, 'money': {}}

13. normalize ordinal

[50]: normalizer.normalize('ke123')
[50]: {'normalize': 'keseratus dua puluh tiga', 'date': {}, 'money': {}}

14. normalize date / time / datetime string to datetime.datetime

[51]: normalizer.normalize('2 hari lepas')
[51]: {'normalize': 'dua hari lepas',
      'date': {'2 hari lalu': datetime.datetime(2020, 12, 29, 19, 33, 47, 507552)},
      'money': {}}

[52]: normalizer.normalize('esok')
[52]: {'normalize': 'esok',
      'date': {'esok': datetime.datetime(2021, 1, 1, 19, 33, 47, 514754)},
      'money': {}}

[53]: normalizer.normalize('okt 2019')
[53]: {'normalize': '31/10/2019',
      'date': {'okt 2019': datetime.datetime(2019, 10, 31, 0, 0)},
      'money': {}}

[54]: normalizer.normalize('2pgi')
[54]: {'normalize': 'dua pagi',
      'date': {'2 AM': datetime.datetime(2020, 12, 31, 2, 0)},
      'money': {}}

[55]: normalizer.normalize('pukul 8 malam')
[55]: {'normalize': 'pukul lapan malam',
      'date': {'pukul 8': datetime.datetime(2020, 12, 8, 0, 0)},
      'money': {}}

[56]: normalizer.normalize('jan 2 2019 12:01pm')
[56]: {'normalize': '02/01/2019 12:01:00',
      'date': {'jan 2 2019 12:01pm': datetime.datetime(2019, 1, 2, 12, 1)},
      'money': {}}


[57]: normalizer.normalize('2 ptg jan 2 2019') [57]: {'normalize': 'dua ptg 02/01/2019', 'date': {'2 PM jan 2 2019': datetime.datetime(2019, 1, 2, 14, 0)}, 'money': {}}

15. normalize money string to string number representation

[58]: normalizer.normalize('50 sen') [58]: {'normalize': 'lima puluh sen', 'date': {}, 'money': {'50 sen': 'RM0.5'}}

[59]: normalizer.normalize('20.5 ringgit') [59]: {'normalize': 'dua puluh ringgit lima puluh sen', 'date': {}, 'money': {'20.5 ringgit': 'RM20.5'}}

[60]: normalizer.normalize('20m ringgit') [60]: {'normalize': 'dua puluh juta ringgit', 'date': {}, 'money': {'20m ringgit': 'RM20000000.0'}}

[61]: normalizer.normalize('22.5123334k ringgit') [61]: {'normalize': 'dua puluh dua ribu lima ratus dua belas ringgit tiga ratus tiga puluh empat sen', 'date': {}, 'money': {'22.512334k ringgit': 'RM22512.334'}}

16. normalize date string to %d/%m/%y

[62]: normalizer.normalize('1 nov 2019') [62]: {'normalize': '01/11/2019', 'date': {'1 nov 2019': datetime.datetime(2019, 11, 1, 0, 0)}, 'money': {}}

[63]: normalizer.normalize('januari 1 1996') [63]: {'normalize': '01/01/1996', 'date': {'januari 1 1996': datetime.datetime(1996, 1, 1, 0, 0)}, 'money': {}}

[64]: normalizer.normalize('januari 2019') [64]: {'normalize': '31/01/2019', 'date': {'januari 2019': datetime.datetime(2019, 1, 31, 0, 0)}, 'money': {}}


17. normalize time string to %H:%M:%S

[65]: normalizer.normalize('2pm') [65]: {'normalize': '14:00:00', 'date': {'2pm': datetime.datetime(2020, 12, 31, 14, 0)}, 'money': {}}

[66]: normalizer.normalize('2:01pm') [66]: {'normalize': '14:01:00', 'date': {'2:01pm': datetime.datetime(2020, 12, 31, 14, 1)}, 'money': {}}

[67]: normalizer.normalize('2AM') [67]: {'normalize': '02:00:00', 'date': {'2am': datetime.datetime(2020, 12, 31, 2, 0)}, 'money': {}}

18. expand repetition shortform

[68]: normalizer.normalize('skit2') [68]: {'normalize': 'sakit-sakit', 'date': {}, 'money': {}}

[69]: normalizer.normalize('xskit2') [69]: {'normalize': 'tak sakit-sakit', 'date': {}, 'money': {}}

[70]: normalizer.normalize('xjdi2') [70]: {'normalize': 'tak jadi-jadi', 'date': {}, 'money': {}}

[71]: normalizer.normalize('xjdi4') [71]: {'normalize': 'tak jadi-jadi-jadi-jadi', 'date': {}, 'money': {}}

[72]: normalizer.normalize('xjdi0') [72]: {'normalize': 'tak jadi', 'date': {}, 'money': {}}

[73]: normalizer.normalize('xjdi') [73]: {'normalize': 'tak jadi', 'date': {}, 'money': {}}

19. normalize NUM SI-UNIT

[74]: normalizer.normalize('61.2 kg') [74]: {'normalize': 'enam puluh satu perpuluhan dua kilogram', 'date': {}, 'money': {}}


[75]: normalizer.normalize('61.2kg') [75]: {'normalize': 'enam puluh satu perpuluhan dua kilogram', 'date': {}, 'money': {}}

[76]: normalizer.normalize('61kg') [76]: {'normalize': 'enam puluh satu kilogram', 'date': {}, 'money': {}}

[77]: normalizer.normalize('61ml') [77]: {'normalize': 'enam puluh satu milliliter', 'date': {}, 'money': {}}

[78]: normalizer.normalize('61m') [78]: {'normalize': 'enam puluh satu meter', 'date': {}, 'money': {}}

[79]: normalizer.normalize('61.3434km') [79]: {'normalize': 'enam puluh satu perpuluhan tiga empat tiga empat kilometer', 'date': {}, 'money': {}}

[80]: normalizer.normalize('61.3434c') [80]: {'normalize': 'enam puluh satu perpuluhan tiga empat tiga empat celsius', 'date': {}, 'money': {}}

[81]: normalizer.normalize('61.3434 c') [81]: {'normalize': 'enam puluh satu perpuluhan tiga empat tiga empat celsius', 'date': {}, 'money': {}}

20. normalize laughing pattern

[82]: normalizer.normalize('dia sakai wkwkwkawkw') [82]: {'normalize': 'dia sakai haha', 'date': {}, 'money': {}}

[83]: normalizer.normalize('dia sakai hhihihu') [83]: {'normalize': 'dia sakai haha', 'date': {}, 'money': {}}

21. normalize mengeluh pattern

[84]: normalizer.normalize('Haih apa lah si yusuff ni . Mama cari rupanya celah ni') [84]: {'normalize': 'Aduh apa lah si yusuff ini . Mama cari rupanya celah ini', 'date': {}, 'money': {}}


[85]: normalizer.normalize('hais sorrylah syazzz') [85]: {'normalize': 'aduh maaf lah syazz', 'date': {}, 'money': {}}

9.19 Stemmer and Lemmatization

This tutorial is available as an IPython notebook at Malaya/example/stemmer.

This module is only trained on standard language structure, so it is not safe to use on local (social media) language structure.

[1]: %%time import malaya CPU times: user 4.81 s, sys: 652 ms, total: 5.47 s Wall time: 4.44 s

[2]: string = 'Benda yg SALAH ni, jgn lah didebatkan. Yg SALAH xkan jadi betul. Ingat tu. Mcm mana kesat sekalipun org sampaikan mesej, dan memang benda tu salah, diam je. Xyah nk tunjuk kau open sangat nk tegur cara org lain berdakwah'
another_string = 'melayu bodoh, dah la gay, sokong lgbt lagi, memang tak guna, http://twitter.com'

9.19.1 Use deep learning model

Load the LSTM + Bahdanau Attention stemming model; this also includes lemmatization. If you are using Tensorflow 2, make sure Tensorflow Addons is already installed,

pip install tensorflow-addons==0.12.0

def deep_model(quantized: bool = False, **kwargs):
    """
    Load LSTM + Bahdanau Attention stemming model; this also includes lemmatization.
    Original size 41.6MB, quantized size 10.6MB.

    Parameters
    ----------
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessarily faster, totally depends on the machine.

    Returns
    -------
    result: malaya.stem.DEEP_STEMMER class
    """

[9]: model= malaya.stem.deep_model()


9.19.2 Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[8]: quantized_model= malaya.stem.deep_model(quantized= True) WARNING:root:Load quantized model will cause accuracy drop.

Stem and lemmatization

def stem(self, string: str, beam_search: bool = True):
    """
    Stem a string; this also includes lemmatization.

    Parameters
    ----------
    string : str
    beam_search : bool, (optional=True)
        If True, use beam search decoder, else use greedy decoder.

    Returns
    -------
    result: str
    """

If you want to speed up inference, set beam_search = False.

[4]: %%time

model.stem(string) CPU times: user 1.22 s, sys: 305 ms, total: 1.52 s Wall time: 540 ms
[4]: 'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

[5]: %%time

model.stem(string, beam_search=False) CPU times: user 285 ms, sys: 102 ms, total: 387 ms Wall time: 289 ms
[5]: 'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

[6]: %%time

quantized_model.stem(string) CPU times: user 1.29 s, sys: 230 ms, total: 1.52 s Wall time: 573 ms
[6]: 'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

[7]: %%time

quantized_model.stem(string, beam_search=False) CPU times: user 331 ms, sys: 105 ms, total: 436 ms Wall time: 329 ms
[7]: 'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'

[8]: model.stem(another_string) [8]: 'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com'

[9]: quantized_model.stem(another_string) [9]: 'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com'

[11]: model.stem('saya menyerukanlah') [11]: 'saya seru'

[10]: quantized_model.stem('saya menyerukanlah') [10]: 'saya seru'

9.19.3 Use Sastrawi stemmer

Malaya also includes an interface for the Sastrawi stemmer. We use it for internal purposes. To use it, simply,

def sastrawi():
    """
    Load stemming model using Sastrawi; this also includes lemmatization.

    Returns
    -------
    result: malaya.stem.SASTRAWI class
    """

[3]: sastrawi= malaya.stem.sastrawi()

[4]: sastrawi.stem('saya menyerukanlah') [4]: 'saya seru'

[5]: sastrawi.stem('menarik') [5]: 'tarik'

[6]: sastrawi.stem(another_string)


[6]: 'melayu bodoh dah la gay sokong lgbt lagi memang tak guna http twitter com'

But it is not able to maintain tokens like URLs, hashtags, money, datetimes and user mentions.

9.19.4 Use Naive stemmer

This simply uses regex patterns to do stemming. This method is not able to lemmatize.

def naive():
    """
    Load stemming model using startswith and endswith naively using regex patterns.

    Returns
    -------
    result : malaya.stem.NAIVE class
    """

[7]: naive= malaya.stem.naive()

[8]: naive.stem('saya menyerukanlah') [8]: 'saya yerukan'

[9]: naive.stem('menarik') [9]: 'arik'

[10]: naive.stem(another_string) [10]: 'layu bodoh , dah la gay , sokong lgbt lagi , ang tak guna , http://twitter.com'
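To compare the three stemmers side by side, here is a short sketch reusing the model, sastrawi and naive objects loaded above; the printed results should match the cells above but depend on your installed model version:

for word in ['menarik', 'saya menyerukanlah']:
    # the deep model and Sastrawi lemmatize properly, while the naive
    # regex stemmer tends to over-strip affixes ('menarik' -> 'arik')
    print(word, '|', model.stem(word), '|', sastrawi.stem(word), '|', naive.stem(word))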

9.20 True Case

This tutorial is available as an IPython notebook at Malaya/example/true-case.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time

import malaya CPU times: user 5.52 s, sys: 1.09 s, total: 6.61 s Wall time: 7.26 s


9.20.1 Explanation

Common third-party NLP services like Google Speech-to-Text or PDF-to-Text return case-insensitive output with missing or mistaken punctuation and casing, so True Case can help you:
1. jom makan di us makanan di sana sedap -> jom makan di US, makanan di sana sedap.
2. kuala lumpur menteri di jabatan perdana menteri datuk seri dr mujahid yusof rawa hari ini mengakhiri lawatan kerja lapan hari ke jordan turki dan bosnia herzegovina lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga tiga negara berkenaan -> KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga-tiga negara berkenaan.
True case only:
1. Solves mistaken / missing punctuation.
2. Solves mistaken / insensitive casing.
3. Does not correct any grammar.

9.20.2 List available Transformer model

[2]: malaya.true_case.available_transformer()
[2]:        Size (MB)  Quantized Size (MB)  Sequence Accuracy
     small        42.7                13.1              0.347
     base        234.0                63.8              0.696

9.20.3 Load Transformer model

def transformer(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load transformer encoder-decoder model to True Case.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessarily faster, totally depends on the machine.

    Returns
    -------
    result: malaya.model.tf.TrueCase class
    """

[7]: model= malaya.true_case.transformer()


9.20.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[6]: quantized_model= malaya.true_case.transformer(quantized= True) WARNING:root:Load quantized model will cause accuracy drop.

[8]: string1 = 'jom makan di us makanan di sana sedap'
string2 = 'kuala lumpur menteri di jabatan perdana menteri datuk seri dr mujahid yusof rawa hari ini mengakhiri lawatan kerja lapan hari ke jordan turki dan bosnia herzegovina lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga tiga negara berkenaan'

Predict using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    True case strings using greedy decoder.
    Example, "saya nak makan di us makanan di sana sedap" -> "Saya nak makan di US, makanan di sana sedap."

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
    return self._greedy_decoder(strings)

[9]: from pprint import pprint

[12]: pprint(model.greedy_decoder([string1, string2])) ['Jom makan di US makanan di sana sedap.', 'KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga-tiga negara berkenaan.']

[13]: pprint(quantized_model.greedy_decoder([string1, string2])) ['Jom makan di US makanan di sana sedap.', 'KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga-tiga negara berkenaan.']

[14]: import random



def random_uppercase(string):
    string = [c.upper() if random.randint(0, 1) else c for c in string]
    return ''.join(string)

[15]: r = random_uppercase(string2)
r
[15]: 'KuAlA LUMPUR menTeri di jAbATaN PErDANA MentERI Datuk SERi Dr mujahid YUsof RaWA HARI Ini mEngakhirI LAwaTAN kERJA lApan hARi Ke JOrdaN tURkI dAN bOsnIa heRzEGoVInA lAwaTAN yAng BeRTUjuAN MeNGERAtKaN lAgI HUBUnGAN dua HaLA dENGAn kETiGa TIGA nEgara BerkenAaN'

[16]: pprint(model.greedy_decoder([r])) ['KUALA LUMPUR: Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga tiga negara berkenaan.']

[17]: pprint(quantized_model.greedy_decoder([r])) ['KUALA LUMPUR: Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga tiga negara berkenaan.']

Predict using beam decoder

def beam_decoder(self, strings: List[str]):
    """
    True case strings using beam decoder, beam width size 3, alpha 0.5.
    Example, "saya nak makan di us makanan di sana sedap" -> "Saya nak makan di US, makanan di sana sedap."

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

[18]: pprint(model.beam_decoder([string1, string2])) ['Jom makan di US makanan di sana sedap.', 'KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga-tiga negara berkenaan.']

[19]: pprint(quantized_model.beam_decoder([string1, string2])) ['Jom makan di US makanan di sana sedap.', 'KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid ' 'Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan ' 'Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua ' 'hala dengan ketiga-tiga negara berkenaan.']


9.21 Segmentation

This tutorial is available as an IPython notebook at Malaya/example/segmentation.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[4]: %%time

import malaya CPU times: user 1.95 s, sys: 241 ms, total: 2.19 s Wall time: 2.63 s

A common problem with social media texts is missing spaces, so text segmentation can help you:
1. huseinsukamakan ayam,dia sgtrisaukan -> husein suka makan ayam, dia sgt risaukan.
2. drmahathir sangat menekankan budaya budakzamansekarang -> dr mahathir sangat menekankan budaya budak zaman sekarang.
3. ceritatunnajibrazak -> cerita tun najib razak.
4. TunM sukakan -> Tun M sukakan.
Segmentation only:
1. Solves spacing errors.
2. Does not correct any grammar.

[5]: string1 = 'huseinsukamakan ayam,dia sgtrisaukan'
string2 = 'drmahathir sangat menekankan budaya budakzamansekarang'
string3 = 'ceritatunnajibrazak'
string4 = 'TunM sukakan'
string_hard = 'IPOH-AhliDewanUndangan Negeri(ADUN) HuluKinta, MuhamadArafat Varisai Mahamadmenafikanmesejtularmendakwa beliau akan melompatparti menyokong UMNO membentuk kerajaannegeridiPerak.BeliauyangjugaKetua Penerangan Parti Keadilan Rakyat(PKR) dalam satumesejringkaskepadaSinar Harian menjelaskan perkara itutidakbenarsama sekali.'
string_socialmedia = 'aqxsukalah apeyg tejadidekat mamattu'


9.21.1 Viterbi algorithm

People commonly use the Viterbi algorithm to solve this problem; we also added a Viterbi segmenter using ngrams from bahasa papers and Wikipedia.
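To illustrate the idea (this is not Malaya's internal implementation), here is a minimal dynamic-programming sketch of Viterbi word segmentation with a toy unigram table; the wordlist and probabilities are made up:

import math

# toy unigram log-probabilities; a real segmenter estimates ngram
# probabilities from large corpora (bahasa papers and wikipedia, as above)
log_prob = {'saya': math.log(0.05), 'suka': math.log(0.03), 'makan': math.log(0.02)}
UNKNOWN = math.log(1e-10)  # heavy penalty for out-of-vocabulary chunks

def viterbi_segment(text, max_split_length=20):
    # best[i] holds (score, words) for the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_split_length), i):
            word = text[j:i]
            score = best[j][0] + log_prob.get(word, UNKNOWN)
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [word])
    return ' '.join(best[len(text)][1])

print(viterbi_segment('sayasukamakan'))  # expected: 'saya suka makan'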

def viterbi(max_split_length: int = 20, **kwargs):
    """
    Load Segmenter class using viterbi algorithm.

    Parameters
    ----------
    max_split_length: int, (default=20)
        max length of words in a sentence to segment
    validate: bool, optional (default=True)
        if True, malaya will check model availability and download if not available.

    Returns
    -------
    result : malaya.segmentation.SEGMENTER class
    """

[4]: viterbi= malaya.segmentation.viterbi()

Segmentize

def segment(self, strings: List[str]):
    """
    Segment strings.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

[5]: %%time

viterbi.segment([string1, string2, string3, string4]) CPU times: user 109 ms, sys: 1.04 ms, total: 110 ms Wall time: 110 ms [5]: ['husein suka makan ayam,dia sgt risau kan', 'dr mahathir sangat mene kan kan budaya budak zaman sekarang', 'cerita tu n najib razak', 'Tun M suka kan']

[6]: %%time

viterbi.segment([string_hard, string_socialmedia])


CPU times: user 8.45 ms, sys: 157 µs, total: 8.6 ms Wall time: 8.69 ms
[6]: ['IPOH - Ahli Dewan Undangan Negeri(ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamadmenafikanmesejtularmendakwa belia u akan me lompat part i me nyo ko ng UMNO mem bentuk kerajaannegeridi Perak. Beliauyangjuga Ketua Penerangan Parti Keadilan Rakyat(PKR) Perak dalam satumesejringkaskepada men jel ask an perkara it u tidak benar sama sekali.',
 'aq x suka lah ape yg te jadi dekat mama ttu']

9.21.2 List available Transformer model

[6]: malaya.segmentation.available_transformer()
[6]:                      Size (MB)  Quantized Size (MB)       WER  Suggested length
     small                     42.7                 13.1  0.208520             256.0
     base                     234.0                 63.8  0.177624             256.0
     super-tiny-t5             81.8                 27.1  0.032980             256.0
     super-super-tiny-t5       39.6                 12.0  0.037882             256.0

9.21.3 Load Transformer model

def transformer(model: str = 'small', quantized: bool = False, **kwargs):
    """
    Load transformer encoder-decoder model to Segmentize.

    Parameters
    ----------
    model : str, optional (default='small')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.
        * ``'super-tiny-t5'`` - T5 SUPER TINY parameters.
        * ``'super-super-tiny-t5'`` - T5 SUPER SUPER TINY parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessarily faster, totally depends on the machine.

    Returns
    -------
    result: malaya.model.tf.Segmentation class
    """

[21]: model = malaya.segmentation.transformer(model='small')
quantized_model = malaya.segmentation.transformer(model='small', quantized=True)


WARNING:root:Load quantized model will cause accuracy drop.

[22]: model_base = malaya.segmentation.transformer(model='base')
quantized_model_base = malaya.segmentation.transformer(model='base', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

[11]: super_super_tiny= malaya.segmentation.transformer(model='super-super-tiny-t5')

Predict using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    Segment strings using greedy decoder.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

[10]: %%time

model.greedy_decoder([string1, string2, string3, string4]) CPU times: user 1.12 s, sys: 432 ms, total: 1.55 s Wall time: 959 ms [10]: ['husein suka makan ayam, dia sgt risaukan', 'dr mahathir sangat menekankan budaya budak zaman sekarang', 'cerita tun najib razak', 'Tun M sukakan']

[11]: %%time

quantized_model.greedy_decoder([string1, string2, string3, string4]) CPU times: user 1.12 s, sys: 464 ms, total: 1.58 s Wall time: 888 ms [11]: ['husein suka makan ayam, dia sgt risaukan', 'dr mahathir sangat menekankan budaya budak zaman sekarang', 'cerita tun najib razak', 'Tun M sukakan']

[12]: %%time

model_base.greedy_decoder([string1, string2, string3, string4]) CPU times: user 5.58 s, sys: 2.88 s, total: 8.46 s Wall time: 4.08 s


[12]: ['husein suka makan ayam, dia sgt risaukan', 'dr mahathir sangat menekankan budaya budak zaman sekarang', 'cerita tun najib razak cerita', 'Tun M sukakan Tun M sukakan']

[13]: %%time

quantized_model_base.greedy_decoder([string1, string2, string3, string4]) CPU times: user 5.73 s, sys: 2.96 s, total: 8.69 s Wall time: 3.81 s [13]: ['husein suka makan ayam, dia sgt risaukan', 'dr mahathir sangat menekankan budaya budak zaman sekarang', 'cerita tun najib razak cerita tun', 'Tun M sukakan Tun M sukakan']

[13]: %%time

super_super_tiny.greedy_decoder([string1, string2, string3, string4]) CPU times: user 908 ms, sys: 433 ms, total: 1.34 s Wall time: 288 ms [13]: ['husein suka makan ayam, dia sgt risaukan', 'dr mahathir sangat menekankan budaya budak zaman sekarang', 'cerita tun najib razak', 'Tun M sukakan']

[14]: %%time

model.greedy_decoder([string_hard, string_socialmedia]) CPU times: user 2.52 s, sys: 499 ms, total: 3.02 s Wall time: 768 ms
[14]: ['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg tejadid dekat mamat tu']

[15]: %%time

quantized_model.greedy_decoder([string_hard, string_socialmedia]) CPU times: user 2.62 s, sys: 447 ms, total: 3.07 s Wall time: 756 ms
[15]: ['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg tejadid dekat mamat tu']

[16]: %%time

model_base.greedy_decoder([string_hard, string_socialmedia])


CPU times: user 17.8 s, sys: 10.2 s, total: 28 s Wall time: 5.84 s
[16]: ['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg teja di dekat mamat tu aq xsukalah ape yg teja di dekat mamat tu']

[17]: %%time

quantized_model_base.greedy_decoder([string_hard, string_socialmedia]) CPU times: user 17.6 s, sys: 9.63 s, total: 27.3 s Wall time: 5.85 s
[17]: ['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg teja di dekat mamat tu aq xsukalah ape yg teja di dekat mamat tu']

[14]: %%time

super_super_tiny.greedy_decoder([string_hard, string_socialmedia]) CPU times: user 1.34 s, sys: 527 ms, total: 1.87 s Wall time: 421 ms
[14]: ['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg tejadi dekat mamat tu']

There is a problem with batching strings: a short string might repeat itself in the output. To solve this, give a single string only, as in the examples below; a small wrapper sketch follows them.

[18]: %%time

quantized_model_base.greedy_decoder([string_socialmedia]) CPU times: user 1.37 s, sys: 532 ms, total: 1.9 s Wall time: 652 ms [18]: ['aq xsukalah ape yg teja di dekat mamat tu']

[19]: %%time

quantized_model_base.greedy_decoder([string3]) CPU times: user 648 ms, sys: 228 ms, total: 876 ms Wall time: 289 ms


[19]: ['cerita tun najib razak']

[20]: %%time

quantized_model_base.greedy_decoder([string4]) CPU times: user 495 ms, sys: 202 ms, total: 697 ms Wall time: 225 ms [20]: ['Tun M sukakan']
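If you still want batch-like convenience, a small wrapper (a sketch, not part of Malaya) can feed strings one at a time to avoid the repetition issue:

def segment_one_by_one(model, strings):
    # decode each string alone so short strings cannot repeat themselves
    results = []
    for s in strings:
        results.extend(model.greedy_decoder([s]))
    return results

segment_one_by_one(quantized_model_base, [string3, string4])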

Predict using beam decoder

def beam_decoder(self, strings: List[str]):
    """
    Segment strings using beam decoder, beam width size 3, alpha 0.5.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

[11]: %%time

quantized_model.beam_decoder([string_socialmedia]) CPU times: user 1.38 s, sys: 1.87 s, total: 3.25 s Wall time: 654 ms [11]: ['aq xsukalah ape yg tejadid dekat mamat tu']

[12]: %%time

quantized_model_base.beam_decoder([string_socialmedia]) CPU times: user 6.77 s, sys: 3.71 s, total: 10.5 s Wall time: 2.43 s [12]: ['aq xsukalah ape yg teja di dekat mamat tu']

We can expect the beam decoder to be much slower than the greedy decoder.


9.22 Preprocessing

This tutorial is available as an IPython notebook at Malaya/example/preprocessing.

[1]: %%time import malaya CPU times: user 4.73 s, sys: 664 ms, total: 5.39 s Wall time: 4.38 s

9.22.1 Available rules

We know that social media texts from Twitter, Facebook and Instagram are very noisy, and we want to clean them as much as possible so our machines understand sentence structure much better. In Malaya, we standardize our text preprocessing:
1. Malaya can replace special words with tokens to reduce the dimension curse, e.g. rm10k becomes <money>.
2. Malaya can put tags around special words, e.g. #drmahathir becomes <hashtag> drmahathir </hashtag>.
3. Malaya can expand English contractions.
4. Malaya can translate English words into bahasa Malaysia words. Again, this translation uses a dictionary; it will not understand semantics. The purpose of this translation is just to standardize to bahasa Malaysia.
5. Stemming and lemmatizing, requires a stemmer object.
6. Normalize elongated words, requires a speller object.
7. Expand hashtags, e.g. #drmahathir becomes dr mahathir, requires a segmentation object.

normalize

Supported normalize:
1. hashtag
2. cashtag
3. tag
4. user
5. emphasis
6. censored
7. acronym
8. eastern_emoticons
9. rest_emoticons
10. emoji
11. quotes
12. percent
13. repeat_puncts
14. money
15. email
16. phone
17. number
18. allcaps
19. url
20. date
21. time

You can check the full supported list at malaya.preprocessing.get_normalize(). For example, if you set money and number, and the input string is RM10k, the output is <money>.

annotate

Supported annotate:
1. hashtag
2. allcaps
3. elongated
4. repeated
5. emphasis
6. censored

For example, if you set hashtag, and the input string is #drmahathir, the output is <hashtag> drmahathir </hashtag>.
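As a sketch of how these options combine (the normalize and annotate parameters come from the interface below; the printed output is illustrative, not a recorded run):

import malaya

# keep only money and number normalization, and only hashtag annotation;
# every other rule listed above is disabled
preprocessing = malaya.preprocessing.preprocessing(
    normalize=['money', 'number'],
    annotate=['hashtag'],
)
print(' '.join(preprocessing.process('WASTED RM10k on #badmovies')))
# illustrative output: 'wasted <money> on <hashtag> badmovies </hashtag>'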

[2]: string_1 = 'CANT WAIT for the new season of #mahathirmohamad (^o^)!!! #davidlynch #tvseries :))), TAAAK SAAABAAR!!!'
string_2 = 'kecewanya #johndoe movie and it suuuuucks!!! WASTED RM10... rm10 #badmovies :/'
string_3 = "@husein: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY !!! :-D http://sentimentsymposium.com/."
string_4 = 'aahhh, malasnye nak pegi keje harini #mondayblues'
string_5 = '#drmahathir #najibrazak #1malaysia #mahathirnajib'

9.22.2 Preprocessing Interface

def preprocessing(
    normalize: List[str] = [
        'url', 'email', 'percent', 'money', 'phone',
        'user', 'time', 'date', 'number',
    ],
    annotate: List[str] = [
        'allcaps', 'elongated', 'repeated',
        'emphasis', 'censored', 'hashtag',
    ],
    lowercase: bool = True,
    fix_unidecode: bool = True,
    expand_english_contractions: bool = True,
    translate_english_to_bm: bool = True,
    speller=None,
    segmenter=None,
    stemmer=None,
    **kwargs,
):
    """
    Load Preprocessing class.

    Parameters
    ----------
    normalize: list
        normalizing tokens, can check all supported normalizing at `malaya.preprocessing.get_normalize()`.
    annotate: list
        annotate tokens, only accept ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].
    lowercase: bool
    fix_unidecode: bool
    expand_english_contractions: bool
        expand english contractions
    translate_english_to_bm: bool
        translate english words to bahasa malaysia words
    speller: object
        spelling correction object, need to have a method `correct`
    segmenter: object
        segmentation object, need to have a method `segment`. If provided, it will expand hashtags, #mondayblues == monday blues
    stemmer: object
        stemmer object, need to have a method `stem`. If provided, it will stem or lemmatize the string.

    Returns
    -------
    result : malaya.preprocessing.PREPROCESSING class
    """


9.22.3 Load default parameters

The default parameters are able to translate most English words to bahasa Malaysia.

[3]: %%time preprocessing= malaya.preprocessing.preprocessing() CPU times: user 115 ms, sys: 18.4 ms, total: 134 ms Wall time: 134 ms

[4]: %%time
' '.join(preprocessing.process(string_1))
CPU times: user 2.1 ms, sys: 19 µs, total: 2.12 ms Wall time: 2.12 ms
[4]: ' tak boleh tunggu untuk yang baru musim daripada mahathirmohamad \\(^o^)/ ! davidlynch tvseries , taak saabaar ! '

[5]: %%time
' '.join(preprocessing.process(string_2))
CPU times: user 426 µs, sys: 3 µs, total: 429 µs Wall time: 432 µs
[5]: 'kecewanya johndoe filem dan ia suucks ! dibazirkan . badmovies hashtag> '

[6]: %%time
' '.join(preprocessing.process(string_3))
CPU times: user 413 µs, sys: 0 ns, total: 413 µs Wall time: 416 µs
[6]: ' : boleh tidak tunggu untuk yang sentimen talks ! yaay ! :-d '

[7]: %%time
' '.join(preprocessing.process(string_4))
CPU times: user 391 µs, sys: 0 ns, total: 391 µs Wall time: 398 µs
[7]: 'aahh , malasnye nak pergi kerja hari ini mondayblues '

[8]: %%time
' '.join(preprocessing.process(string_5))
CPU times: user 459 µs, sys: 12 µs, total: 471 µs Wall time: 474 µs
[8]: ' drmahathir najibrazak 1 malaysia mahathirnajib '


9.22.4 Load default parameters with spelling correction to normalize elongated words.

We saw that taak, saabaar and other elongated words are not the original words, so we can use spelling correction to normalize them.

[9]: corrector= malaya.spell.probability()

[10]: %%time preprocessing= malaya.preprocessing.preprocessing(speller= corrector) CPU times: user 85.5 ms, sys: 16.3 ms, total: 102 ms Wall time: 101 ms

[11]: %%time
' '.join(preprocessing.process(string_1))
CPU times: user 630 µs, sys: 7 µs, total: 637 µs Wall time: 640 µs
[11]: ' tak boleh tunggu untuk yang baru musim daripada mahathirmohamad \\(^o^)/ ! davidlynch tvseries , tidak sabar ! '

[12]: %%time
' '.join(preprocessing.process(string_2))
CPU times: user 445 µs, sys: 3 µs, total: 448 µs Wall time: 451 µs
[12]: 'kecewanya johndoe filem dan ia sucks ! dibazirkan . badmovies hashtag> '

[13]: %%time
' '.join(preprocessing.process(string_3))
CPU times: user 640 µs, sys: 12 µs, total: 652 µs Wall time: 665 µs
[13]: ' : boleh tidak tunggu untuk yang sentimen talks ! yay ! :-d '

[14]: %%time
' '.join(preprocessing.process(string_4))
CPU times: user 495 µs, sys: 12 µs, total: 507 µs Wall time: 530 µs
[14]: 'ah , malasnye nak pergi kerja hari ini mondayblues '

[15]: %%time
' '.join(preprocessing.process(string_5))
CPU times: user 327 µs, sys: 6 µs, total: 333 µs Wall time: 346 µs
[15]: ' drmahathir najibrazak 1 malaysia mahathirnajib '


9.22.5 Load default parameters with segmenter to expand hashtags.

We saw drmahathir and najibrazak; we want to expand them to become dr mahathir and najib razak.

[16]: segmenter = malaya.segmentation.transformer(model='small', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

[17]: segmenter= malaya.segmentation.transformer(model='small', quantized= True) WARNING:root:Load quantized model will cause accuracy drop.

[18]: %%time preprocessing= malaya.preprocessing.preprocessing(segmenter= segmenter) CPU times: user 88.3 ms, sys: 18.9 ms, total: 107 ms Wall time: 107 ms

[19]: %%time
' '.join(preprocessing.process(string_1))
CPU times: user 1.61 s, sys: 1.83 s, total: 3.43 s Wall time: 1.06 s
[19]: ' tak boleh tunggu untuk yang baru musim daripada mahathir mohamad \\(^o^)/ ! davidlynch tv series , taak saabaar ! '


[20]: %%time
' '.join(preprocessing.process(string_2))
CPU times: user 726 ms, sys: 375 ms, total: 1.1 s Wall time: 293 ms
[20]: 'kecewanya johndoe filem dan ia suucks ! dibazirkan . bad movies hashtag> '

[21]: %%time
' '.join(preprocessing.process(string_3))
CPU times: user 332 ms, sys: 108 ms, total: 440 ms Wall time: 112 ms
[21]: ' : boleh tidak tunggu untuk yang sentimen talks ! yaay ! :-d '

[22]: %%time
' '.join(preprocessing.process(string_4))
CPU times: user 525 ms, sys: 592 ms, total: 1.12 s Wall time: 237 ms
[22]: 'aahh , malasnye nak pergi kerja hari ini mondayblues '

[23]: %%time
' '.join(preprocessing.process(string_5))
CPU times: user 1.5 s, sys: 575 ms, total: 2.07 s Wall time: 516 ms
[23]: ' dr mahathir najib razak 1 malaysia mahathir najib '

9.22.6 Load default parameters with stemming and lemmatization

[44]: sastrawi= malaya.stem.sastrawi()

[45]: %%time preprocessing= malaya.preprocessing.preprocessing(stemmer= sastrawi) CPU times: user 112 ms, sys: 18.4 ms, total: 130 ms Wall time: 129 ms

[26]: %%time
' '.join(preprocessing.process(string_1))
CPU times: user 11.6 ms, sys: 846 µs, total: 12.5 ms Wall time: 12.2 ms
[26]: ' tak boleh tunggu untuk yang baru musim daripada mahathirmohamad o davidlynch tvseries taak saabaar allcaps> '


[47]: %%time
' '.join(preprocessing.process(string_2))
CPU times: user 5.61 ms, sys: 503 µs, total: 6.11 ms Wall time: 5.71 ms
[47]: 'kecewa johndoe filem dan ia suucks dibazirkan badmovies hashtag> '

[28]: %%time
' '.join(preprocessing.process(string_3))
CPU times: user 2.13 ms, sys: 57 µs, total: 2.19 ms Wall time: 2.25 ms
[28]: ' boleh tidak tunggu untuk yang sentimen talks yaay -d '

[29]: %%time
' '.join(preprocessing.process(string_4))
CPU times: user 1.81 ms, sys: 20 µs, total: 1.83 ms Wall time: 1.91 ms
[29]: 'aahh malasnye nak pergi kerja hari ini mondayblues '

[30]: %%time
' '.join(preprocessing.process(string_5))
CPU times: user 1.91 ms, sys: 13 µs, total: 1.92 ms Wall time: 1.95 ms
[30]: ' drmahathir najibrazak 1 malaysia mahathirnajib '

[46]: %%time
' '.join(preprocessing.process('saya disini berjalan pergi ke putrajaya, #masjidbesi'))
CPU times: user 3.45 ms, sys: 30 µs, total: 3.48 ms Wall time: 3.49 ms
[46]: 'saya sini jalan pergi ke putrajaya masjidbesi '

9.22.7 Disable English translation

But there are basic normalizations that cannot be overridden; for example, for automatically becomes untuk. You can check the entire default normalization mapping at from malaya.texts._tatabahasa import rules_normalizer
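A quick way to peek at that mapping; the import path is taken from the sentence above, and the exact entries printed will depend on your Malaya version:

from malaya.texts._tatabahasa import rules_normalizer

# rules_normalizer maps shortform -> standard word and is always applied
for shortform, standard in list(rules_normalizer.items())[:5]:
    print(shortform, '->', standard)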

[31]: %%time preprocessing= malaya.preprocessing.preprocessing(translate_english_to_bm= False) CPU times: user 96 µs, sys: 1 µs, total: 97 µs Wall time: 101 µs

[32]: %%time
' '.join(preprocessing.process(string_1))


CPU times: user 867 µs, sys: 7 µs, total: 874 µs Wall time: 891 µs
[32]: ' tak boleh wait untuk the new season of mahathirmohamad \\(^o^)/ ! davidlynch tvseries , taak saabaar ! '

[33]: %%time
' '.join(preprocessing.process(string_2))
CPU times: user 509 µs, sys: 9 µs, total: 518 µs Wall time: 538 µs
[33]: 'kecewanya johndoe movie and it suucks ! wasted . badmovies hashtag> '

[34]: %%time
' '.join(preprocessing.process(string_3))
CPU times: user 477 µs, sys: 6 µs, total: 483 µs Wall time: 519 µs
[34]: ' : can not wait untuk the sentiment talks ! yaay ! :-d '

9.22.8 Tokenizer

It is able to tokenize using multiple regex pipelines; you can check the list from malaya.preprocessing.get_normalize()

[35]: tokenizer= malaya.preprocessing.TOKENIZER().tokenize

[36]: tokenizer(string_1) [36]: ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#mahathirmohamad', '(^o^)', '!', '!', '!', '#davidlynch', '#tvseries', ':)))', ',', 'TAAAK', 'SAAABAAR', '!', '!', '!']


[37]: tokenizer(string_2) [37]: ['kecewanya', '#johndoe', 'movie', 'and', 'it', 'suuuuucks', '!', '!', '!', 'WASTED', 'RM10', '.', '.', '.', 'rm10', '#badmovies', ':/']

[38]: tokenizer(string_3) [38]: ['@husein', ':', 'can', "'", 't', 'wait', 'for', 'the', 'Nov 9', '#Sentiment', 'talks', '!', 'YAAAAAAY', '!', '!', '!', ':-D', 'http://sentimentsymposium.com/.']

[39]: tokenizer('saya nak makan ayam harga rm10k') [39]: ['saya', 'nak', 'makan', 'ayam', 'harga', 'rm10k']

9.23 Kesalahan Tatabahasa

This tutorial is available as an IPython notebook at Malaya/example/tatabahasa.

This module is only trained on standard language structure, so it is not safe to use on local (social media) language structure.


[1]: %%time

import malaya from pprint import pprint CPU times: user 4.87 s, sys: 865 ms, total: 5.73 s Wall time: 6.34 s

9.23.1 Model

A common Seq2Seq model is P(yt | X, yt-1): one decoder step generates yt, and this requires the encoder output and the output from the last decoder step, yt-1. So we improved the model to also generate tags, P(yt, zt | X, yt-1, zt-1): one decoder step generates yt and zt, and this requires the encoder output and the outputs from the last decoder step, yt-1 and zt-1. We named this model TransformerTag. There is no paper produced for this model, so feel free to write one about it; check out our implementation at https://github.com/huseinzol05/malaya/tree/master/session/tatabahasa
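Conceptually, one decoder step simply adds a second projection head over the shared decoder state. A minimal numpy sketch; the sizes and weight names are made up for illustration and this is not Malaya's actual graph:

import numpy as np

vocab_size, num_tags, d_model = 32000, 18, 512  # hypothetical sizes

# hypothetical learned projection heads; in the real model these are trained
W_token = np.random.randn(d_model, vocab_size) * 0.01
W_tag = np.random.randn(d_model, num_tags) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(decoder_state):
    # decoder_state summarizes X, yt-1 and zt-1; both heads share it,
    # factorizing P(yt, zt | X, yt-1, zt-1) into two softmaxes
    p_token = softmax(decoder_state @ W_token)  # distribution over subwords yt
    p_tag = softmax(decoder_state @ W_tag)      # distribution over tags zt
    return int(p_token.argmax()), int(p_tag.argmax())

y_t, z_t = decode_step(np.random.randn(d_model))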

9.23.2 List available Transformer Tag models

[2]: malaya.tatabahasa.available_transformer()
INFO:root:tested on 10k kesalahan tatabahasa texts.
[2]:        Size (MB)  Quantized Size (MB)  Sequence Accuracy  Sequence Tagging Accuracy
     small       397.0                100.0           0.860198                   0.963267
     base        875.0                220.0           0.938972                   0.977407

9.23.3 Supported kesalahan tatabahasa

For full description, check out https://tatabahasabm.tripod.com/tata/salahtata.htm

[3]: malaya.tatabahasa.describe()
[3]: class  Description                                        salah                                              betul
     0      PAD
     1      kesambungan subwords
     2      tiada kesalahan
     3      kesalahan frasa nama, Perkara yang diterangkan...  Cili sos                                           sos cili
     4      kesalahan kata jamak                               mereka-mereka                                      mereka
     5      kesalahan kata penguat                             sangat tinggi sekali                               sangat tinggi
     6      kata adjektif dan imbuhan "ter" tanpa penguat.     Sani mendapat markah yang tertinggi sekali.        Sani mendapat markah yang tertinggi.
     7      kesalahan kata hubung                              Sally sedang membaca bila saya tiba di rumahnya.   Sally sedang membaca apabila saya tiba di ruma...
     8      kesalahan kata bilangan                            Beribu peniaga tidak membayar cukai pendapatan.    Beribu-ribu peniaga tidak membayar cukai penda...
     9      kesalahan kata sendi                               Umar telah berpindah daripada sekolah ini bula...  Umar telah berpindah dari sekolah ini bulan lalu.
     10     kesalahan penjodoh bilangan                        Setiap orang pelajar                               Setiap pelajar.
     11     kesalahan kata ganti diri                          Pencuri itu telah ditangkap. Beliau dibawa ke ...  Pencuri itu telah ditangkap. Dia dibawa ke bal...
     12     kesalahan ayat pasif                               Cerpen itu telah dikarang oleh saya.               Cerpen itu telah saya karang.
     13     kesalahan kata tanya                               Kamu berasal dari manakah ?                        Kamu berasal dari mana ?
     14     kesalahan tanda baca                               Kamu berasal dari manakah .                        Kamu berasal dari mana ?
     15     kesalahan kata kerja tak transitif                 Dia kata kepada saya                               Dia berkata kepada saya
     16     kesalahan kata kerja transitif                     Dia suka baca buku                                 Dia suka membaca buku
     17     penggunaan kata yang tidak tepat                   Tembuk Besar negeri Cina dibina oleh Shih Huan...  Tembok Besar negeri Cina dibina oleh Shih Huan...

Right now we are only able to predict up to 15 different kesalahan tatabahasa; hopefully in the future we can scale this up.


9.23.4 Load Transformer Tag model

[4]: model= malaya.tatabahasa.transformer(model='base')

9.23.5 Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[5]: quantized_model= malaya.tatabahasa.transformer(model='base', quantized= True) WARNING:root:Load quantized model will cause accuracy drop.

9.23.6 Predict using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    Fix kesalahan tatabahasa.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

For the TransformerTag model, right now only the greedy_decoder method is supported. Below we use a randomly picked string from the bahasa melayu Wikipedia.

[6]: # https://ms.wikipedia.org/wiki/Bola_sepak
string = 'Pada amnya, hanya penjaga gol sahaja yang dibenarkan menyentuh bola dengan tangan di dalam kawasan golnya'

[7]: model.greedy_decoder([string])
[7]: [[('Pada', 2), ('amnya,', 2), ('hanya', 2), ('penjaga', 2), ('gol', 2), ('sahaja', 2), ('yang', 2), ('dibenarkan', 2), ('menyentuh', 2), ('bola', 2), ('dengan', 2), ('tangan', 2), ('di', 2), ('dalam', 2), ('kawasan', 2), ('golnya', 2)]]

Now assume we have a kesalahan frasa nama, where penjaga gol becomes gol penjaga.

[8]: # https://ms.wikipedia.org/wiki/Bola_sepak
string = 'Pada amnya, hanya gol penjaga sahaja yang dibenarkan menyentuh bola dengan tangan di dalam kawasan golnya'

[9]: model.greedy_decoder([string]) [9]: [[('Pada', 2), ('amnya,', 2), ('hanya', 2), ('penjaga', 3), ('gol', 3), ('sahaja', 2), ('yang', 2), ('dibenarkan', 2), ('menyentuh', 2), ('bola', 2), ('dengan', 2), ('tangan', 2), ('di', 2), ('dalam', 2), ('kawasan', 2), ('golnya', 2)]]

[10]: string='Sani mendapat markah yang tertinggi sekali.' string1='Hassan ialah peserta yang termuda sekali dalam pertandingan itu.' model.greedy_decoder([string, string1]) [10]: [[('Sani', 2), ('mendapat', 2), ('markah', 2), ('yang', 2), ('tertinggi.', 6)], [('Hassan', 2), ('ialah', 2), ('peserta', 2), ('yang', 2), ('termuda', 6), ('dalam', 2), ('pertandingan', 2), ('itu.', 2)]]

[11]: string='Dia kata kepada saya.' model.greedy_decoder([string]) [11]: [[('Dia', 2), ('berkata', 15), ('kepada', 2), ('saya.', 2)]]
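The integer tags map to the classes from malaya.tatabahasa.describe() above. A small sketch to attach descriptions to a prediction, assuming describe() returns the pandas DataFrame printed earlier:

rows = malaya.tatabahasa.describe()
# build a tag -> description lookup from the 'class' and 'Description' columns
tag2desc = dict(zip(rows['class'], rows['Description']))

for word, tag in model.greedy_decoder(['Dia kata kepada saya.'])[0]:
    print(word, '->', tag2desc[tag])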

[12]: import pickle

with open('tests/dataset-tatabahasa.pkl', 'rb') as fopen:
    test_set = pickle.load(fopen)

len(test_set)
[12]: 100


[13]: def get_xy(row):
    x, y, tag = [], [], []

    for i in range(len(row[0])):
        t = [row[0][i][0]]
        y.extend(t)
        t = [row[1][i][0]]
        x.extend(t)
        tag.extend([row[1][i][1]] * len(t))

    return ' '.join(x), ' '.join(y), tag

[14]: x, y, t = get_xy(test_set[0])
x, y, t
[14]: ('Dirk Jan Klaas " Klaas-Jan " Huntelaar ( lahir 12 Ogos 1983 ) merupakan pemain bola sepak Belanda yang bermain seperti posisi penyerang !',
 'Dirk Jan Klaas " Klaas-Jan " Huntelaar ( lahir 12 Ogos 1983 ) merupakan pemain bola sepak Belanda yang bermain di posisi penyerang .',
 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 9, 2, 2, 14])

[15]: model.greedy_decoder([x]) [15]: [[('Dirk', 2), ('Jan', 2), ('Klaas', 2), ('"', 2), ('Klaas-Jan', 2), ('"', 2), ('Huntelaar', 2), ('(', 2), ('lahir', 2), ('12', 2), ('Ogos', 2), ('1983', 2), (')', 2), ('merupakan', 2), ('pemain', 2), ('bola', 2), ('sepak', 2), ('Belanda', 2), ('yang', 2), ('bermain', 2), ('di', 9), ('posisi', 2), ('penyerang', 2), ('.', 14)]]

[16]: quantized_model.greedy_decoder([x])
[16]: [[('Dirk', 2), ('Jan', 2), ('Klaas', 2), ('"', 2), ('Klaas-Jan', 2), ('"', 2), ('Huntelaar', 2), ('(', 2), ('lahir', 2), ('12', 2), ('Ogos', 2), ('1983', 2), (')', 2), ('merupakan', 2), ('pemain', 2), ('bola', 2), ('sepak', 2), ('Belanda', 2), ('yang', 2), ('bermain', 2), ('di', 9), ('posisi', 2), ('penyerang', 2), ('.', 14)]]

[17]: x, y, t = get_xy(test_set[-1])
x, y, t
[17]: ('Pada tahun 2002 , kedua-dua gol beliau menduduki tempat ke-6 dalam 100 Greatest Sporting Moments oleh saluran Channel 4 UK .',
 'Pada tahun 2002 , kedua-dua gol ini menduduki tempat ke-6 dalam 100 Greatest Sporting Moments oleh saluran Channel 4 UK .',
 [2, 2, 2, 2, 2, 2, 11, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

[18]: model.greedy_decoder([x]) [18]: [[('Pada', 2), ('tahun', 2), ('2002', 2), (',', 2), ('kedua-dua', 2), ('gol', 2), ('beliau', 2), ('menduduki', 2), ('tempat', 2), ('ke-6', 2), ('dalam', 2), ('100', 2), ('Greatest', 2), ('Sporting', 2), ('Moments', 2), ('oleh', 2), ('saluran', 2), ('Channel', 2), ('4', 2), ('UK', 2), ('.', 2)]]

[19]: x, y, t = get_xy(test_set[-2])
x, y, t
[19]: ('Gol inilah yang bergelar Goal of the Century dengan undian Internet 2000 sejak FIFA .',
 'Gol inilah yang bergelar Goal of the Century di undian Internet 2000 oleh FIFA .',
 [2, 2, 2, 2, 2, 2, 2, 2, 9, 2, 2, 2, 9, 2, 2])


[20]: model.greedy_decoder([x]) [20]: [[('Gol', 2), ('inilah', 2), ('yang', 2), ('bergelar', 2), ('Goal', 2), ('of', 2), ('the', 2), ('Century', 2), ('dengan', 2), ('undian', 2), ('Internet', 2), ('2000', 2), ('sejak', 2), ('FIFA', 2), ('.', 2)]]

[21]: x, y, t = get_xy(test_set[-3])
x, y, t
[21]: ('Beliau mengambil bola dalam kawasan kepul diri lalu pusing dan luru lebih separuh padang sambil menyentuh bola 11 kali , memintas lima pemain England : ( Glenn Hoddle , Peter Reid , Kenny Sansom , Terry Butcher , dan Terry Fenwick ) serta penjaga gawang Peter Shilton .',
 'Beliau mengambil bola di kawasan pasukan diri lalu berpusing-pusing dan meluru lebih separuh padang sambil menyentuh bola 11 kali , memintas lima pemain England : ( Glenn Hoddle , Peter Reid , Kenny Sansom , Terry Butcher , dan Terry Fenwick ) serta penjaga gawang Peter Shilton .',
 [2, 2, 2, 9, 2, 10, 2, 2, 15, 2, 15, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

[22]: model.greedy_decoder([x])
[22]: [[('Beliau', 2), ('mengambil', 2), ('bola', 2), ('dari', 9), ('kawasan', 2), ('kaki', 10), ('diri', 2), ('lalu', 2), ('berpusing', 15), ('dan', 2), ('meluru', 15), ('lebih', 2), ('separuh', 2), ('padang', 2), ('sambil', 2), ('menyentuh', 2), ('bola', 2), ('11', 2), ('kali', 2), (',', 2), ('memintas', 2), ('lima', 2), ('pemain', 2), ('England', 2), (':', 2), ('(', 2), ('Glenn', 2), ('Hoddle', 2), (',', 2), ('Peter', 2), ('Reid', 2), (',', 2), ('Kenny', 2), ('Sansom', 2), (',', 2), ('Terry', 2), ('Butcher', 2), (',', 2), ('dan', 2), ('Terry', 2), ('Fenwick', 2), (')', 2), ('serta', 2), ('penjaga', 2), ('gawang', 2), ('Peter', 2), ('Shilton', 2), ('.', 2)]]

[24]: quantized_model.greedy_decoder([x])
[24]: [[('Beliau', 2), ('mengambil', 2), ('bola', 2), ('dari', 9), ('kawasan', 2), ('kaki', 10), ('diri', 2), ('lalu', 2), ('berpusing', 15), ('dan', 2), ('meluru', 15), ('lebih', 2), ('separuh', 2), ('padang', 2), ('sambil', 2), ('menyentuh', 2), ('bola', 2), ('11', 2), ('kali', 2), (',', 2), ('memintas', 2), ('lima', 2), ('pemain', 2), ('England', 2), (':', 2), ('(', 2), ('Glenn', 2), ('Hoddle', 2), (',', 2), ('Peter', 2), ('Reid', 2), (',', 2), ('Kenny', 2), ('Sansom', 2), (',', 2), ('Terry', 2), ('Butcher', 2), (',', 2), ('dan', 2), ('Terry', 2), ('Fenwick', 2), (')', 2), ('serta', 2), ('penjaga', 2), ('gawang', 2), ('Peter', 2), ('Shilton', 2), ('.', 2)]]

9.24 Num2Word

This tutorial is available as an IPython notebook at Malaya/example/num2word.

[3]: import malaya
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

9.24.1 To cardinal

def to_cardinal(number):
    """
    Translate from number input to cardinal text representation.

    Parameters
    ----------
    number: real number

    Returns
    -------
    result: str
        cardinal representation
    """

[4]: malaya.num2word.to_cardinal(123456789) [4]: 'seratus dua puluh tiga juta empat ratus lima puluh enam ribu tujuh ratus lapan puluh sembilan'

[5]: malaya.num2word.to_cardinal(10) [5]: 'sepuluh'

[6]: malaya.num2word.to_cardinal(12) [6]: 'dua belas'

[7]: malaya.num2word.to_cardinal(-1234567.89) [7]: 'negatif satu juta dua ratus tiga puluh empat ribu lima ratus enam puluh tujuh perpuluhan lapan sembilan'


9.24.2 To ordinal

def to_ordinal(number):
    """
    Translate from number input to ordinal text representation.

    Parameters
    ----------
    number: real number

    Returns
    -------
    result: str
        ordinal representation
    """

[8]: malaya.num2word.to_ordinal(1) [8]: 'pertama'

[9]: malaya.num2word.to_cardinal(1) [9]: 'satu'

[10]: malaya.num2word.to_ordinal(10) [10]: 'kesepuluh'

[11]: malaya.num2word.to_ordinal(12) [11]: 'kedua belas'

[12]: malaya.num2word.to_cardinal(-123456789) [12]: 'negatif seratus dua puluh tiga juta empat ratus lima puluh enam ribu tujuh ratus lapan puluh sembilan'

9.25 Word2Num

This tutorial is available as an IPython notebook at Malaya/example/word2num.

[1]: import malaya

def word2num(string):
    """
    Translate from string to number, eg 'kesepuluh' -> 10.

    Parameters
    ----------
    string: str

    Returns
    -------
    result: int / float
    """

[3]: malaya.word2num.word2num('dua belas') [3]: 12

[4]: malaya.word2num.word2num('kesebelas') [4]: 11


[5]: malaya.word2num.word2num('negatif kesebelas') [5]: -11

[7]: malaya.word2num.word2num('seratus dua puluh tiga juta empat ratus lima puluh enam ribu tujuh ratus lapan puluh sembilan') [7]: 123456789

[8]: malaya.word2num.word2num('negatif seratus dua puluh tiga juta empat ratus lima puluh enam ribu tujuh ratus lapan puluh sembilan') [8]: -123456789

[9]: malaya.word2num.word2num('negatif satu juta dua ratus tiga puluh empat ribu lima ratus enam puluh tujuh perpuluhan lapan sembilan') [9]: -1234567.89
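For well-formed inputs the two modules behave as inverses; a small round-trip sketch using only the calls shown above (the exact values are taken from the example cells):

import malaya

for n in [12, 123456789, -1234567.89]:
    text = malaya.num2word.to_cardinal(n)
    # word2num should recover the original number from the cardinal text
    assert malaya.word2num.word2num(text) == n, (n, text)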


9.26 Knowledge Graph Triples

Generate MS text -> EN Knowledge Graph Triples format.

This tutorial is available as an IPython notebook at Malaya/example/knowledge-graph-triples.

This module is only trained on standard language structure, so it is not safe to use on local (social media) language structure.

[1]: %%time

import malaya CPU times: user 5.07 s, sys: 867 ms, total: 5.94 s Wall time: 5.26 s


9.26.1 List available Transformer model

[2]: malaya.knowledge_graph.available_transformer()
INFO:root:tested on KELM test set.
[2]:           Size (MB)  Quantized Size (MB)      BLEU  Suggested length
     t5           1250.0                481.0  0.919301             256.0
     small-t5      355.6                195.0  0.910234             512.0
     tiny-t5       208.0                103.0  0.933374             256.0

9.26.2 Load Transformer model

def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
    """
    Load transformer to generate knowledge graphs in triples format from texts,
    MS text -> EN triples format.

    Parameters
    ----------
    model : str, optional (default='small-t5')
        Model architecture supported. Allowed values:

        * ``'t5'`` - T5 BASE parameters.
        * ``'small-t5'`` - T5 SMALL parameters.
        * ``'tiny-t5'`` - T5 TINY parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessarily faster, totally depends on the machine.

    Returns
    -------
    result: malaya.model.t5.KnowledgeGraph class
    """

[3]: model= malaya.knowledge_graph.transformer(model='small-t5') INFO:root:running knowledge-graph-triplet/small-t5 using device /device:CPU:0

9.26.3 Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[4]: quantized_model = malaya.knowledge_graph.transformer(model='small-t5', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running knowledge-graph-triplet/small-t5-quantized using device /device:CPU:0


[5]: string1="Yang Berhormat Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak ialah

˓→ahli politik Malaysia dan merupakan bekas Perdana Menteri Malaysia ke-6 yang mana

˓→beliau menjawat jawatan dari 3 April 2009 hingga 9 Mei 2018. Beliau juga pernah

˓→berkhidmat sebagai bekas Menteri Kewangan dan merupakan Ahli Parlimen Pekan " string2="Pahang ialah negeri yang ketiga terbesar di Malaysia Terletak di lembangan

˓→Sungai Pahang yang amat luas negeri Pahang bersempadan dengan Kelantan di utara

˓→Perak serta di barat di selatan dan Terengganu dan

˓→Laut China Selatan di timur."

Predict using greedy decoder

def greedy_decoder(
    self,
    strings: List[str],
    get_networkx: bool = True,
):
    """
    Generate triples knowledge graph using greedy decoder.
    Example, "Joseph Enanga juga bermain untuk Union Douala." -> "Joseph Enanga member of sports team Union Douala"

    Parameters
    ----------
    strings: List[str]
    get_networkx: bool, optional (default=True)
        If True, will generate networkx.MultiDiGraph.

    Returns
    -------
    result: List[Dict]
    """

[6]: r = model.greedy_decoder([string1, string2])

[7]: r[0]
[7]: {'result': [{'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'occupation',
        'object': 'Politician'},
       {'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'country of citizenship',
        'object': 'Malaysia'},
       {'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'occupation',
        'object': 'Finance minister'},
       {'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'position held',
        'object': 'Member of the Pahang Town Parliament'},
       {'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'occupation',
        'object': 'Prime Minister of Malaysia'},
       {'subject': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
        'relation': 'work period (start)',
        'object': '03 April 2009'}],
      'main_object': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak',
      'triple': 'Dato Sri Haji Mohammad Najib bin Tun Haji Abdul Razak occupation Politician, country of citizenship Malaysia, occupation Finance minister, position held Member of the Pahang Town Parliament, occupation Prime Minister of Malaysia, work period (start) 03 April 2009.',
      'G': <networkx.classes.multidigraph.MultiDiGraph>}
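Since every element of the result follows the schema above, the predicted triples can be flattened into plain (subject, relation, object) tuples for downstream use; a minimal sketch:

# flatten the predicted triples into (subject, relation, object) tuples.
triples = [(t['subject'], t['relation'], t['object']) for t in r[0]['result']]
for subj, rel, obj in triples:
    print(f'{subj} --[{rel}]--> {obj}')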

[10]: import matplotlib.pyplot as plt
      import networkx as nx

      g = r[0]['G']
      plt.figure(figsize=(6, 6))
      pos = nx.spring_layout(g)
      nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
      nx.draw_networkx_edge_labels(g, pos=pos)
      plt.show()

[11]: g = r[1]['G']
      plt.figure(figsize=(6, 6))
      pos = nx.spring_layout(g)
      nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
      nx.draw_networkx_edge_labels(g, pos=pos)
      plt.show()


[12]: # https://ms.wikipedia.org/wiki/Malaysia

      string = """
      Malaysia secara rasminya Persekutuan Malaysia ialah sebuah negara raja berperlembagaan persekutuan di Asia Tenggara yang terdiri daripada tiga belas negeri dan tiga wilayah persekutuan, yang menduduki bumi berkeluasan 330,803 kilometer persegi (127,720 bt2). Malaysia terbahagi kepada dua kawasan yang mengapit Laut China Selatan, iaitu Semenanjung Malaysia dan Borneo Malaysia (juga Malaysia Barat dan Timur). Malaysia berkongsi sempadan darat dengan Thailand, Indonesia, dan Brunei dan juga sempadan laut dengan Singapura dan Filipina. Ibu negara Malaysia ialah Kuala Lumpur, manakala Putrajaya merupakan pusat kerajaan persekutuan. Pada tahun 2009, Malaysia diduduki oleh 28 juta penduduk dan pada tahun 2017 dianggarkan telah mencecah lebih 30 juta orang yang menduduki di Malaysia.

      Malaysia berakar-umbikan Kerajaan-kerajaan Melayu yang wujud di wilayahnya dan menjadi taklukan Empayar British sejak abad ke-18. Wilayah British pertama di sini dikenali sebagai Negeri-Negeri Selat. Semenanjung Malaysia yang ketika itu dikenali sebagai Tanah Melayu atau Malaya, mula-mula disatukan di bawah komanwel pada tahun 1946, sebelum menjadi Persekutuan Tanah Melayu pada tahun 1948. Pada tahun 1957 Semenanjung Malaysia mencapai Kemerdekaan dan bebas daripada penjajah dan sekali gus menjadi catatan sejarah terpenting bagi Malaysia. Pada tahun 1963, Tanah Melayu bersatu bersama dengan negara Sabah, Sarawak, dan Singapura bagi membentuk Malaysia. Pada tahun 1965, Singapura keluar dari persekutuan untuk menjadi negara kota yang bebas. Semenjak itu, Malaysia menikmati antara ekonomi yang terbaik di Asia, dengan purata pertumbuhan keluaran dalam negara kasarnya (KDNK) kira-kira 6.5% selama 50 tahun pertama kemerdekaannya.

      Ekonomi negara yang selama ini dijana oleh sumber alamnya kini juga berkembang dalam sektor-sektor ukur tanah, sains, kejuruteraan, pendidikan, pelancongan, perkapalan, perdagangan dan perubatan.

      Ketua negara Malaysia ialah Yang di-Pertuan Agong, iaitu raja elektif yang terpilih dan diundi dari kalangan sembilan raja negeri Melayu. Ketua kerajaannya pula ialah Perdana Menteri. Sistem kerajaan Malaysia banyak berdasarkan sistem parlimen Westminster, dan sistem perundangannya juga berasaskan undang-undang am Inggeris.

      Malaysia terletak berdekatan dengan khatulistiwa dan beriklim tropika, serta mempunyai kepelbagaian flora dan fauna, sehingga diiktiraf menjadi salah satu daripada 17 negara megadiversiti. Di Malaysia terletaknya Tanjung Piai, titik paling selatan di seluruh tanah besar Eurasia. Malaysia ialah sebuah negara perintis Persatuan Negara-Negara Asia Tenggara dan Pertubuhan Persidangan Islam, dan juga anggota Kerjasama Ekonomi Asia-Pasifik, Negara-Negara Komanwel, dan Pergerakan Negara-Negara Berkecuali.
      """

[13]: def simple_cleaning(string):
          return ''.join([s for s in string if s not in ',.\'";'])

      string = malaya.text.function.split_into_sentences(string)
      string = [simple_cleaning(s) for s in string if len(s) > 50]
      string
[13]: ['Malaysia secara rasminya Persekutuan Malaysia ialah sebuah negara raja berperlembagaan persekutuan di Asia Tenggara yang terdiri daripada tiga belas negeri dan tiga wilayah persekutuan yang menduduki bumi berkeluasan 330803 kilometer persegi (127720 bt2)',
 'Malaysia terbahagi kepada dua kawasan yang mengapit Laut China Selatan iaitu Semenanjung Malaysia dan Borneo Malaysia (juga Malaysia Barat dan Timur)',
 'Malaysia berkongsi sempadan darat dengan Thailand Indonesia dan Brunei dan juga sempadan laut dengan Singapura dan Filipina',
 'Ibu negara Malaysia ialah Kuala Lumpur manakala Putrajaya merupakan pusat kerajaan persekutuan Pada tahun 2009 Malaysia diduduki oleh 28 juta penduduk dan pada tahun 2017 dianggarkan telah mencecah lebih 30 juta orang yang menduduki di Malaysia',
 'Malaysia berakar-umbikan Kerajaan-kerajaan Melayu yang wujud di wilayahnya dan menjadi taklukan Empayar British sejak abad ke-18',
 'Wilayah British pertama di sini dikenali sebagai Negeri-Negeri Selat',
 'Semenanjung Malaysia yang ketika itu dikenali sebagai Tanah Melayu atau Malaya mula-mula disatukan di bawah komanwel pada tahun 1946 sebelum menjadi Persekutuan Tanah Melayu pada tahun 1948',
 'Pada tahun 1957 Semenanjung Malaysia mencapai Kemerdekaan dan bebas daripada penjajah dan sekali gus menjadi catatan sejarah terpenting bagi Malaysia',
 'Pada tahun 1963 Tanah Melayu bersatu bersama dengan negara Sabah Sarawak dan Singapura bagi membentuk Malaysia',
 'Pada tahun 1965 Singapura keluar dari persekutuan untuk menjadi negara kota yang bebas',
 'Semenjak itu Malaysia menikmati antara ekonomi yang terbaik di Asia dengan purata pertumbuhan keluaran dalam negara kasarnya (KDNK) kira-kira 65% selama 50 tahun pertama kemerdekaannya',
 'Ekonomi negara yang selama ini dijana oleh sumber alamnya kini juga berkembang dalam sektor-sektor ukur tanah sains kejuruteraan pendidikan pelancongan perkapalan perdagangan dan perubatan',
 'Ketua negara Malaysia ialah Yang di-Pertuan Agong iaitu raja elektif yang terpilih dan diundi dari kalangan sembilan raja negeri Melayu',
 'Sistem kerajaan Malaysia banyak berdasarkan sistem parlimen Westminster dan sistem perundangannya juga berasaskan undang-undang am Inggeris',
 'Malaysia terletak berdekatan dengan khatulistiwa dan beriklim tropika serta mempunyai kepelbagaian flora dan fauna sehingga diiktiraf menjadi salah satu daripada 17 negara megadiversiti',
 'Di Malaysia terletaknya Tanjung Piai titik paling selatan di seluruh tanah besar Eurasia',
 'Malaysia ialah sebuah negara perintis Persatuan Negara-Negara Asia Tenggara dan Pertubuhan Persidangan Islam dan juga anggota Kerjasama Ekonomi Asia-Pasifik Negara-Negara Komanwel dan Pergerakan Negara-Negara Berkecuali']

[14]: r = model.greedy_decoder(string)

[15]: g = r[0]['G']

      for i in range(1, len(r), 1):
          g.update(r[i]['G'])

[16]: plt.figure(figsize=(17, 17))
      pos = nx.spring_layout(g)
      nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
      nx.draw_networkx_edge_labels(g, pos=pos)
      plt.show()


[17]: # https://www.utusan.com.my/terkini/2021/07/agong-dukacita-ketua-pembangkang-tuntut-pm-letak-jawatan/
      # https://www.hmetro.com.my/mutakhir/2021/07/736206/kediaman-pm-jadi-tumpuan-media

      string = """
      KUALA LUMPUR: Ketua Pembangkang, Datuk Seri Anwar Ibrahim menggesa Perdana Menteri, Tan Sri Muhyiddin Yassin meletak jawatan susulan kenyataan dikeluarkan Istana Negara berhubung isu Proklamasi Darurat.

      “Ini menunjukkan Kabinet yang diketuai Tan Sri Muhyiddin melanggar Perlembagaan menghina instusi raja Perlembagaan termasuk menteri di Jabatan Perdana Menteri mengelirukan Dewan.

      “Oleh yang demikian, kita menuntut Perdana Menteri meletak jawatan,” ujarnya hari ini.

      Terdahulu, Yang di-Pertuan Agong, Al-Sultan Abdullah Ri’ayatuddin Al-Mustafa Billah Shah menzahirkan rasa dukacita dengan pengumuman pembatalan darurat di Parlimen.

      Perkara itu dimaklumkan Datuk Pengelola Bijaya Diraja Istana Negara, Datuk Indera Ahmad Fadil Shamsudddin dalam satu kenyataan hari ini.

      “Sehubungan dengan itu, Seri Paduka Baginda menzahirkan rasa amat dukacita dengan kenyataan yang telah dibuat pada 26 Julai, 2021 lalu bahawa kerajaan telah membatalkan semua Ordinan Darurat yang telah dimasyhurkan oleh Baginda sepanjang tempoh darurat walhal belum lagi diperkenan baginda,” katanya.

      Kuala Lumpur: Kediaman Perdana Menteri Tan Sri Muhyiddin Yassin menjadi tumpuan petugas media susulan kenyataan yang dikeluarkan Istana Negara berhubung isu pembatalan Ordinan Darurat hari ini.

      Petugas media dilihat mula 'berkampung' di rumah Perdana Menteri yang terletak di Bukit Damansara di sini, sejak 1 tengah hari ini.

      Pemerhatian Bernama mendapati beberapa kenderaan dipercayai membawa menteri dan Peguam Negara memasuki pekarangan kediaman Perdana Menteri pada 1.30 tengah hari.

      Dalam kenyataan Istana Negara itu, Yang di-Pertuan Agong Al-Sultan Abdullah Ri'ayatuddin Al-Mustafa Billah Shah menzahirkan rasa amat dukacita dengan kenyataan di Parlimen pada Isnin bahawa kerajaan membatalkan semua Ordinan Darurat walhal ia belum lagi diperkenan Seri Paduka.

      Yang di-Pertuan Agong juga amat dukacita kerana apa yang diperkenan dan dititahkan kepada Menteri di Jabatan Perdana Menteri (Parlimen dan Undang-Undang) Datuk Seri Takiyuddin Hassan serta Peguam Negara Tan Sri Idrus Harun bahawa cadangan pembatalan semua Ordinan Darurat dibentang dan dibahaskan di Parlimen bagi tujuan diungkaikan tidak dilaksanakan.
      """

[18]: def simple_cleaning(string):
          return ''.join([s for s in string if s not in ',.\'";'])

      string = malaya.text.function.split_into_sentences(string)
      string = [simple_cleaning(s) for s in string if len(s) > 50]
      string
[18]: ['KUALA LUMPUR: Ketua Pembangkang Datuk Seri Anwar Ibrahim menggesa Perdana Menteri Tan Sri Muhyiddin Yassin meletak jawatan susulan kenyataan dikeluarkan Istana Negara berhubung isu Proklamasi Darurat',
 'Ini menunjukkan Kabinet yang diketuai Tan Sri Muhyiddin melanggar Perlembagaan menghina instusi raja Perlembagaan termasuk menteri di Jabatan Perdana Menteri mengelirukan Dewan',
 'Oleh yang demikian kita menuntut Perdana Menteri meletak jawatan ujarnya hari ini',
 'Terdahulu Yang di-Pertuan Agong Al-Sultan Abdullah Riayatuddin Al-Mustafa Billah Shah menzahirkan rasa dukacita dengan pengumuman pembatalan darurat di Parlimen',
 'Perkara itu dimaklumkan Datuk Pengelola Bijaya Diraja Istana Negara Datuk Indera Ahmad Fadil Shamsudddin dalam satu kenyataan hari ini',
 'Sehubungan dengan itu Seri Paduka Baginda menzahirkan rasa amat dukacita dengan kenyataan yang telah dibuat pada 26 Julai 2021 lalu bahawa kerajaan telah membatalkan semua Ordinan Darurat yang telah dimasyhurkan oleh Baginda sepanjang tempoh darurat walhal belum lagi diperkenan baginda katanya',
 'Kuala Lumpur: Kediaman Perdana Menteri Tan Sri Muhyiddin Yassin menjadi tumpuan petugas media susulan kenyataan yang dikeluarkan Istana Negara berhubung isu pembatalan Ordinan Darurat hari ini',
 'Petugas media dilihat mula berkampung di rumah Perdana Menteri yang terletak di Bukit Damansara di sini sejak 1 tengah hari ini',
 'Pemerhatian Bernama mendapati beberapa kenderaan dipercayai membawa menteri dan Peguam Negara memasuki pekarangan kediaman Perdana Menteri pada 130 tengah hari',
 'Dalam kenyataan Istana Negara itu Yang di-Pertuan Agong Al-Sultan Abdullah Riayatuddin Al-Mustafa Billah Shah menzahirkan rasa amat dukacita dengan kenyataan di Parlimen pada Isnin bahawa kerajaan membatalkan semua Ordinan Darurat walhal ia belum lagi diperkenan Seri Paduka',
 'Yang di-Pertuan Agong juga amat dukacita kerana apa yang diperkenan dan dititahkan kepada Menteri di Jabatan Perdana Menteri (Parlimen dan Undang-Undang) Datuk Seri Takiyuddin Hassan serta Peguam Negara Tan Sri Idrus Harun bahawa cadangan pembatalan semua Ordinan Darurat dibentang dan dibahaskan di Parlimen bagi tujuan diungkaikan tidak dilaksanakan']

[19]: r = model.greedy_decoder(string)
WARNING:root:1
WARNING:root:1
(... the same WARNING:root:1 line repeated many times; omitted ...)

[20]: g = r[0]['G']

      for i in range(1, len(r), 1):
          g.update(r[i]['G'])

[21]: plt.figure(figsize=(17, 17))
      pos = nx.spring_layout(g)
      nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
      nx.draw_networkx_edge_labels(g, pos=pos)
      plt.show()


9.27 Knowledge Graph from Dependency

Parse knowledge graph from dependency parsing.

This tutorial is available as an IPython notebook at Malaya/example/knowledge-graph-from-dependency.

This module was only trained on standard language structure, so it is not safe to use on local (colloquial) language structure.


[1]: %%time

     import malaya
CPU times: user 5.14 s, sys: 883 ms, total: 6.03 s
Wall time: 6.61 s

9.27.1 Load dependency parsing models

Read more about dependency parsing at https://malaya.readthedocs.io/en/latest/load-dependency.html. In this example, I am going to load a stack of dependency parsing models.

[2]: quantized_model = malaya.dependency.transformer(version='v1', model='xlnet', quantized=True)
     alxlnet = malaya.dependency.transformer(version='v1', model='alxlnet')
WARNING:root:Load quantized model will cause accuracy drop.

9.27.2 Predict dependency parsing

[3]: s = 'Najib yang juga Ahli Parlimen Pekan memuji sikap Ahli Parlimen Langkawi itu yang mengaku bersalah selepas melanggar SOP kerana tidak mengambil suhu badan ketika masuk ke sebuah surau di Langkawi pada Sabtu lalu'
     tagging, indexing = malaya.stack.voting_stack([quantized_model, alxlnet, quantized_model], s)

[4]: d_object = malaya.dependency.dependency_graph(tagging, indexing)
     d_object.to_graphvis()
[4]: <graphviz visualization of the dependency parse>

9.27.3 Parse knowledge graph from dependency

def parse_from_dependency(tagging: List[Tuple[str, str]],
                          indexing: List[Tuple[str, str]],
                          subjects: List[List[str]] = [['flat', 'subj', 'nsubj', 'csubj']],
                          relations: List[List[str]] = [['acl', 'xcomp', 'ccomp', 'obj', 'conj', 'advcl'], ['obj']],
                          objects: List[List[str]] = [['obj', 'compound', 'flat', 'nmod', 'obl']],
                          get_networkx: bool = True):
    """
    Generate knowledge graphs from dependency parsing; we suggest using dependency parsing v1.

    Parameters
    ----------
    tagging: List[Tuple(str, str)]
        `tagging` result from dependency model.
    indexing: List[Tuple(str, str)]
        `indexing` result from dependency model.
    subjects: List[List[str]], optional
        List of dependency labels for subjects.
    relations: List[List[str]], optional
        List of dependency labels for relations.
    objects: List[List[str]], optional
        List of dependency labels for objects.
    get_networkx: bool, optional (default=True)
        If True, will generate networkx.MultiDiGraph.

    Returns
    -------
    result: Dict[result, G]
    """

[5]: r = malaya.knowledge_graph.parse_from_dependency(tagging, indexing)

[6]: r
[6]: {'result': [{'subject': 'Najib',
        'relation': 'memuji sikap mengaku melanggar SOP',
        'object': 'badan'},
       {'subject': 'Najib',
        'relation': 'memuji sikap mengaku melanggar mengambil masuk',
        'object': 'suhu'},
       {'subject': 'Najib',
        'relation': 'memuji sikap',
        'object': 'Ahli Parlimen Langkawi'}],
      'G': <networkx.classes.multidigraph.MultiDiGraph>}
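The label lists are overridable, so extraction can be tuned per domain; a sketch that reuses the defaults from the signature above, with csubj dropped from the subject labels:

# sketch: restrict which dependency labels count as subjects
# (parameter names and defaults are taken from the signature above).
r_custom = malaya.knowledge_graph.parse_from_dependency(
    tagging, indexing,
    subjects=[['flat', 'subj', 'nsubj']],
    relations=[['acl', 'xcomp', 'ccomp', 'obj', 'conj', 'advcl'], ['obj']],
    objects=[['obj', 'compound', 'flat', 'nmod', 'obl']],
)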

[7]: import matplotlib.pyplot as plt
     import networkx as nx

     g = r['G']
     plt.figure(figsize=(6, 6))
     pos = nx.spring_layout(g)
     nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
     nx.draw_networkx_edge_labels(g, pos=pos)
     plt.show()


9.27.4 Larger knowledge graph

Below I copy-pasted text from different news articles found via a Google search on isu israel (the Israel issue).

[8]: string = """
     Kerajaan gabungan baharu Israel memeterai perjanjian termasuk berkaitan had tempoh jawatan Perdana Menteri, semalam.

     Ia sekali gus akan menamatkan tempoh pemerintahan 12 tahun Benjamin Netanyahu sebagai Perdana Menteri, antara pemimpin negara itu yang lama memegang jawatan berkenaan.

     Gabungan parti yang akan memerintah Israel dijangka memberi fokus kepada isu ekonomi dan sosial, berbanding risiko mendedahkan keretakan dalaman dengan cuba menangani isu diplomatik besar seperti konflik Israel-Palestin.

     Netanyahu, 71, pemimpin Israel paling lama berkhidmat, akan digantikan esok oleh pemimpin gabungan yang buat kali pertama turut disertai sebuah parti minoriti Arab Israel.

     Di bawah perjanjian perkongsian kuasa itu, Naftali Bennett daripada parti ultra nasionalis, Yamina dijangka akan dilantik sebagai Perdana Menteri selama dua tahun.

     Bennett semalam berkata, kerajaan gabungan itu mahu 'mengakhiri krisis politik dua setengah tahun' walaupun tidak jelas berapa lama elemen berbeza dalam gabungan itu akan terus bersatu, lapor agensi berita ArabNews.

     Beliau kemudian akan menyerahkan jawatan itu kepada bekas penyiar TV, Yair Lapid daripada parti Yesh Atid.

     Kandungan perjanjian yang digariskan parti, yang disifatkan Lapid sebagai 'kerajaan perpaduan' antara lain mengehadkan tempoh jawatan Perdana Menteri kepada dua penggal atau lapan tahun.

     Selain itu, pembinaan infrastruktur yang turut merangkumi hospital, universiti dan lapangan terbang baharu serta meluluskan belanjawan dua tahun untuk menstabilkan kewangan negara, yang mana kebuntuan politik yang berpanjangan menyebabkan Israel masih menggunakan versi pro-anggaran daripada belanjawan dasar 2019 yang disahkan pada pertengahan 2018.

     Kandungan lain termasuk mempertahankan status-quo dalam isu agama dan negara, dengan parti Yamina memiliki hak veto dan rancangan menyeluruh pengangkutan di Tebing Barat yang diduduki Israel.

     Sebagai pemimpin pembangkang dan ketua parti terbesar di Parlimen, Netanyahu dijangka terus melakukan apa saja dengan kuasanya untuk menjatuhkan kerajaan

     Dakwaan beberapa pihak yang mengatakan bahawa Israel mempunyai hak untuk mempertahankan diri adalah tidak boleh diterima sama sekali.

     Menurut bekas Perdana Menteri, Tun Dr Mahathir Mohamad, dakwaan itu sebaliknya merujuk kepada warga Palestin yang terpaksa mempertahankan hak mereka daripada diserang oleh rejim zionis Israel sejak seminggu lalu.

     “Saya tidak boleh diterima dengan jawapan Israel yang mendakwa mereka ada hak mempertahan diri. Itu bukan mempertahankan diri, kerana sepanjang tempoh pergolakan ini, mereka menimbulkan konfrontasi dengan rakyat Palestin.

     “Ini kerana, mereka ingin menceroboh lebih banyak tanah di Palestin. Saya yakin bahawa selepas ini, mereka akan mengambil alih Sheikh Jarrah misalnya dan mungkin akan mengambil alih tanah di kawasan terletaknya Masjid Al-Aqsa.

     “Ini adalah taktik Israel, mereka tidak mempertahankan diri mereka, tetapi mereka menyerang pihak lain. Jika mereka mempertahankan diri, maka mereka harus berada di negara mereka sendiri.

     “Tetapi mereka berperang di tanah Palestin, jadi alasan orang Israel bahawa mereka berhak untuk mempertahankan diri adalah tidak dapat diterima.

     “Sememangnya, apa yang berlaku ketika ini ialah rakyat Palestin yang sedang berusaha mempertahankan diri dan mereka sangat lemah,” ujarnya.

     Beliau berkata demikian semasa berucap menerusi satu sesi penstriman bertajuk ‘Palestine: Malaysia With Love’ di laman Facebook beliau pada malam Ahad.

     Terdahulu, Presiden Amerika Syarikat (AS), Joe Biden menyuarakan sokongan tidak berbelah bahagi kepada hak Israel mempertahankan diri daripada serangan roket Hamas dan kumpulan pejuang lain.

     Beliau menyuarakan pendirian itu dalam satu panggilan telefon bersama Presiden Palestin, Mahmoud Abbas. Pada masa sama, Setiausaha Pertahanan, Lloyd Austin pula mengulangi hak Israel untuk mempertahankan diri.

     Arab Saudi menyelaras delegasi negara Arab ke Pertubuhan Bangsa-Bangsa Bersatu (PBB) bagi membincangkan usaha menangani kemelut Israel-Palestin.

     Wakil tetap Arab Saudi ke PBB, Abdallah Al-Mouallami, bertemu dengan wakil tetap China ke PBB, yang juga Presiden Majlis Keselamatan PBB (UNSC) bulan ini, bagi membincangkan perkara itu.

     Ia antara usaha Arab Saudi memimpin negara Arab lain memberi maklumat kepada UNSC mengenai serangan Israel terhadap rakyat Palestin, supaya masyarakat antarabangsa dapat menunaikan tanggungjawab melindungi orang awam.

     Mesyuarat turut dihadiri Faisal Al-Haqbani, iaitu pegawai jawatankuasa politik khas delegasi tetap Arab Saudi ke PBB.

     Al-Mouallami turut bertemu dengan Presiden Perhimpunan Agung PBB, Volkan Bozkır, dalam usaha Arab Saudi memimpin gerakan negara Islam untuk Palestin.

     Pertemuan Kumpulan Islam dengan Presiden Perhimpunan Agung PBB itu bertujuan untuk menjelaskan mengenai serangan Israel baru-baru ini termasuk menggesa masyarakat antarabangsa melindungi orang awam.

     Arab Saudi selalu mendahului sokongan kepada perjuangan Palestin di PBB, sebelum masyarakat antarabangsa berbuat demikian.

     Kelmarin, Menteri Luar Arab Saudi, Putera Faisal bin Farhan, bercakap dengan Menteri Luar Palestin, Riyad Al-Maliki, menerusi panggilan telefon.

     Dalam perbualan itu, beliau menegaskan pihaknya mengutuk amalan haram yang dilakukan pihak berkuasa Israel dan perlu tindakan segera untuk menghentikan pencabulan undang-undang dan nilai kemanusiaan antarabangsa.

     Susulan permintaan Arab Saudi, Pertubuhan Kerjasama Islam (OIC) turut mengadakan mesyuarat tergempar hari ini bagi membincang keadaan di Baitulmaqdis dan Gaza

     Mesyuarat antara menteri luar negara anggota OIC itu juga bermatlamat menangani serangan berterusan oleh Israel di wilayah Palestin

     Rakyat Malaysia yang bertungkus-lumus bertindak sebagai ‘tentera siber’ di media sosial demi membantah tindakan kekerasan zionis Israel terhadap Palestin wajar diberikan penghargaan.

     Hal ini kerana mereka tanpa mengira waktu terus memberikan pencerahan kepada semua pihak sehingga membuka mata penduduk seluruh dunia kekejaman militan Israel dalam siri serangan dan keganasan terhadap penduduk Gaza.

     Presiden Barisan Jemaah Islamiah Se-Malaysia (Berjasa), Zamani Ibrahim antara lain berkata, parti itu mendukung tindakan kerajaan Hamas yang penuh komited berjuang bagi mempertahankan tanah air umat Islam di negara tersebut.

     “Isu Palestin perlu dilihat dalam kerangka pertembungan antara haq dan batil yang mana umat Islam keseluruhannya perlu cakna dan berusaha memahami perkara ini dengan tuntas.

     “Tanah Palestin merupakan milik rakyat Palestin secara sejarah dan undang-undang. Kehadiran militan Israel mengaku Palestin milik mereka pada tahun 1948 dan kemudiannya secara strategik menghalau penduduk asal adalah satu tindakan ‘Settler Colonialism’ yang terkutuk.

     “Isu ini juga bukanlah semata-mata isu kemanusiaan tetapi ia juga mengenai soal penjarahan asing yang merampas tanah milik penduduk asal. Bukan sekadar merampas, malah lebih keji lagi penduduk asal dipenjara, diseksa, dihalau dan dibunuh dengan kejam sekali,” katanya dalam satu kenyataan di sini hari ini.

     Kenyataan itu hadir susulan laporan media antarabangsa berhubung konflik antara Israel yang terus melancarkan serangan udara terhadap Gaza, Palestin sehingga hari ini.

     Ketegangan antara kedua-dua negara tersebut yang bermula sejak Isnin minggu lalu, semakin meningkat sehingga Persatuan Bangsa-Bangsa Bersatu (PBB) memberi amaran ia bakal mencetuskan peperangan secara besar-besaran.

     Siri keganasan itu meletus selepas konflik Israel-Palestin meningkat di Baitulmuqaddis Timur berikutan pasukan keselamatan Israel menyerbu Masjid Al-Asqa dan menyerang penduduk Palestin.

     Dalam pada itu, beliau menggesa kerajaan untuk bertegas dan memainkan peranan selain menjadi negara pendesak kepada badan-badan antarabangsa bagi mengambil tindakan undang-undang menyelamatkan umat Islam sekali gus mempertahankan negara Palestin.

     Malaysia, Indonesia dan Brunei menggesa agar satu sidang tergempar Perhimpunan Agung Pertubuhan Bangsa-Bangsa Bersatu (PBB) diadakan segera bagi menangani isu keganasan melampau rejim Israel ke atas rakyat Palestin.

     Gesaan dibuat menerusi satu kenyataan bersama yang dikeluarkan oleh Perdana Menteri, Tan Sri Muhyiddin Yassin; Presiden Indonesia, Joko Widodo serta Sultan dan Yang Di-Pertuan Brunei Darussalam, Sultan Hassanal Bolkiah hari ini.

     Ketiga-tiga pemimpin negara itu mahu sebuah resolusi keamanan dicapai bagi memastikan kekejaman yang menimpa rakyat Palestin ketika ini dapat dihentikan.

     Mereka turut menuntut pihak berkepentingan untuk melaksanakan gencatan senjata serta menerima penglibatan pemantau antarabangsa di bandar Al-Quds, Palestin untuk memastikan gencatan itu dipatuhi.

     "Kami menggesa komuniti antarabangsa untuk kekal dengan komitmen supaya penyelesaian dua negara (two-state solution) dilaksana dalam mencapai negara Palestin yang merdeka, berdasarkan sempadan sebelum 1967, dengan Baitulmaqdis sebagai ibu negara Palestin.

     "Kami mengulangi solidariti dan komitmen kepada rakyat Palestin, termasuk hak mereka untuk menentukan nasib sendiri dan membentuk negara Palestin yang merdeka dan berdaulat," memetik kenyataan tersebut.

     Ketiga-tiga pemimpin turut menyokong segala usaha antarabangsa yang terarah kepada perdamaian yang panjang di Asia Barat berdasarkan Resolusi PBB dan undang-undang antarabangsa serta undang-undang antarabangsa.

     Pada masa sama, para pemimpin tiga negara itu turut membidas tindakan kejam Israel yang secara jelas melanggar hak asasi manusia, undang-undang antarabangsa dan kemanusiaan.

     Mereka menyifatkan tindakan Israel itu adalah tidak berperikemanusiaan, bersifat penjajah dan ibarat pemerintahan aparteid.

     "Maka, pentingnya tindakan segera dan kolektif dilakukan bagi memastikan tindakan sewajarnya terhadap pelaku," memetik kenyataan itu lagi.
     """

     string = malaya.text.function.split_into_sentences(string)
     len(string)
[8]: 58

[9]: results = []
     for s in string:
         try:
             tagging, indexing = malaya.stack.voting_stack([quantized_model, alxlnet, quantized_model], s)
             r = malaya.knowledge_graph.parse_from_dependency(tagging, indexing)
             results.append(r)
         except:
             pass

     len(results), len(string)
[9]: (34, 58)
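The bare except above silently drops the 24 sentences that failed. A small sketch of a variant that records the failures so they can be inspected afterwards:

# sketch: same loop as above, but failures are kept for later inspection.
results, failed = [], []
for s in string:
    try:
        tagging, indexing = malaya.stack.voting_stack([quantized_model, alxlnet, quantized_model], s)
        results.append(malaya.knowledge_graph.parse_from_dependency(tagging, indexing))
    except Exception as e:
        failed.append((s, repr(e)))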

[10]: g = results[0]['G']

      for i in range(1, len(results), 1):
          g.update(results[i]['G'])

[11]: import matplotlib.pyplot as plt
      import networkx as nx

      plt.figure(figsize=(17, 17))
      pos = nx.spring_layout(g)
      nx.draw(g, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
      nx.draw_networkx_edge_labels(g, pos=pos)
      plt.show()




9.28 Text Augmentation

This tutorial is available as an IPython notebook at Malaya/example/augmentation.

[1]: %%time

     import malaya
CPU times: user 4.36 s, sys: 840 ms, total: 5.2 s
Wall time: 4.77 s

9.28.1 Why augmentation

Say you have a very limited labelled corpus and you want to add more examples, but labelling is very costly. That is where text augmentation comes in! We provide a few augmentation interfaces in Malaya.

9.28.2 Load Synonym

Use a dictionary of synonyms to replace words with their synonyms. Synonym data from Malaya-Dataset/90k-synonym.

def synonym(string: str,
            threshold: float = 0.5,
            top_n=5,
            cleaning_function: Callable = augmentation_textcleaning,
            **kwargs):
    """
    augmenting a string using synonym, https://github.com/huseinzol05/Malaya-Dataset#90k-synonym

    Parameters
    ----------
    string: str
    threshold: float, optional (default=0.5)
        random selection for a word.
    top_n: int, (default=5)
        number of nearest neighbors returned. Length of returned result should be the same as top_n.
    cleaning_function: function, (default=malaya.text.function.augmentation_textcleaning)
        function to clean text.

    Returns
    -------
    result: List[str]
    """

[2]: string = 'saya suka makan ayam dan ikan'
     text = 'Perdana Menteri berkata, beliau perlu memperoleh maklumat terperinci berhubung isu berkenaan sebelum kerajaan dapat mengambil sebarang tindakan lanjut. Bagaimanapun, beliau yakin masalah itu dapat diselesaikan dan pentadbiran kerajaan boleh berfungsi dengan baik.'



[3]: malaya.augmentation.synonym(string)
[3]: ['saya suka makan ayam dan ikan',
 'saya mencinta makan ayam jantan dan ikan',
 'saya mencinta makan makan ayam jantan dan ikan',
 'saya suka makan makan ayam dan ikan',
 'saya suka makan ayam dan ikan']
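This is exactly how augmentation is typically used to stretch a small labelled set: each augmented variant inherits the original label. A minimal sketch, where the tiny data list is a made-up stand-in for your own (text, label) pairs:

# sketch: expand a labelled dataset with synonym augmentation; `data` is a
# hypothetical stand-in for real (text, label) pairs.
data = [('saya suka makan ayam dan ikan', 'positive')]
augmented = []
for text_, label in data:
    for aug in malaya.augmentation.synonym(text_, top_n=3):
        augmented.append((aug, label))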

[4]: malaya.augmentation.synonym(text)
[4]: ['Perdana menteri menunjukkan beliau perlu memperoleh maklumat Terperinci menghayati isu berkenaan sebelum kerajaan dapat mengawali sebarang nombor lanjut bagaimanapun beliau beramanah sedih itu dapat diselesaikan dan pengurusannya kerajaan berupaya berfungsi dengan baik',
 'Perdana menteri menunjukkan beliau wajib mengusahakan data Terperinci menghayati penerbitan berkenaan di hadapan jajahan menggunakannya mendapatkan sebarang digit tertua bagaimanapun beliau beramanah suram itu dapatkan diselesaikan dan pengurusannya kabinet boleh mengangkut dengan baik',
 'Ulung uskup merupakan beliau wajib mengusahakan data Terperinci mempunyai penerbitan berkenaan di hadapan kerajaan menggunakannya berkumpul sebarang nombor gelap bagaimanapun beliau beramanah daif itu memperoleh diselesaikan dan pengurusannya kerajaan boleh mencari dengan baik',
 'Ulung menteri merupakan beliau wajib memupuk dokumen Terperinci mempunyai pengeluaran berkenaan sebelum kerajaan menangani berkumpul sebarang nombor gelap masih beliau yakin daif itu tiba diselesaikan dan pengurusannya pemerintah boleh mengesani dengan baik',
 'Perdana uskup menunjukkan beliau wajib pelihara dokumen Terperinci mempunyai keluaran berkenaan sebelumnya kerajaan menangani berkumpul sebarang nombor jahat Bagaimana pun beliau yakin daif itu maju diselesaikan dan pengurusannya komandan boleh mengesani dengan baik']

9.28.3 Load Wordvector

A dictionary of synonyms is quite hard to populate and requires domain experts to help us. Alternatively, we can use a wordvector to find the nearest words.

def wordvector(string: str,
               wordvector,
               threshold: float = 0.5,
               top_n: int = 5,
               soft: bool = False,
               cleaning_function: Callable = augmentation_textcleaning):
    """
    augmenting a string using wordvector.

    Parameters
    ----------
    string: str
    wordvector: object
        wordvector interface object.
    threshold: float, optional (default=0.5)
        random selection for a word.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with nearest jarowrinkler ratio.
        if False, it will throw an exception if a word is not in the dictionary.
    top_n: int, (default=5)
        number of nearest neighbors returned. Length of returned result should be the same as top_n.
    cleaning_function: function, (default=malaya.text.function.augmentation_textcleaning)
        function to clean text.

    Returns
    -------
    result: List[str]
    """

[5]: vocab_wiki, embedded_wiki = malaya.wordvector.load(model='wikipedia')
     word_vector_wiki = malaya.wordvector.WordVector(embedded_wiki, vocab_wiki)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/wordvector.py:114: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/wordvector.py:125: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

[6]: malaya.augmentation.wordvector(string, word_vector_wiki, soft=True)
[6]: ['saya suka makan ayam dan ikan',
 'kamu gemar minum ayam serta ayam',
 'anda pandai tidur ayam atau ular',
 'kami senang mandi ayam mahupun keju',
 'aku ingin berehat ayam tetapi lembu']

[7]: malaya.augmentation.wordvector(text, word_vector_wiki, soft=True)
[7]: ['perdana menteri berkata beliau perlu memperoleh maklumat terperinci berhubung isu berkenaan sebelum kerajaan dapat mengambil sebarang tindakan lanjut bagaimanapun beliau yakin masalah itu dapat diselesaikan dan pentadbiran kerajaan boleh berfungsi dengan baik',
 'perdana kementerian menyatakan beliau perlu memperoleh maklumat terperinci berkaitan persoalan berkaitan selepas kerajaan dapat mendapat sebarang tindakan terperinci walaupun dia sedar gangguan itu boleh dibuktikan serta pentadbiran kerajaan dapat dikelaskan dengan baik',
 'perdana setiausaha mengatakan beliau perlu memperoleh maklumat terperinci berhadapan prosedur tertentu setelah kerajaan dapat menghabiskan sebarang tindakan lanjutan namun baginda bimbang kelemahan itu harus dilaksanakan atau pentadbiran kerajaan harus bertindak dengan baik',
 'perdana jabatan mendapati beliau perlu memperoleh maklumat terperinci sejajar artikel tersebut ketika kerajaan dapat mengubah sebarang tindakan ringkas maka mereka menyangka gejala itu perlu dikesan mahupun pentadbiran kerajaan perlu dirujuk dengan baik',
 'perdana duta mencadangkan beliau perlu memperoleh maklumat terperinci bertentangan kontroversi berlainan sejak kerajaan dapat memakan sebarang tindakan positif tetapi saya takut risiko itu mampu diperhatikan tetapi pentadbiran kerajaan akan dikira dengan baik']
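Note the soft parameter documented above: with soft=False, a word missing from the wordvector vocabulary raises an exception instead of being replaced by the nearest Jaro-Winkler match. A quick sketch, with a deliberately misspelled token:

# sketch: soft=False rejects out-of-vocabulary words instead of guessing.
try:
    malaya.augmentation.wordvector('saya suka makan ayamzz', word_vector_wiki, soft=False)
except Exception as e:
    print('out-of-vocabulary word rejected:', e)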

9.28.4 Load Transformer

The problem with wordvector augmentation is that it replaces a word with a near synonym without understanding the whole sentence context, so Transformer comes to the rescue!

def transformer(string: str,
                model,
                threshold: float = 0.5,
                top_p: float = 0.9,
                top_k: int = 100,
                temperature: float = 1.0,
                top_n: int = 5,
                cleaning_function: Callable = None):
    """
    augmenting a string using transformer + nucleus sampling / top-k sampling.

    Parameters
    ----------
    string: str
    model: object
        transformer interface object. Right now only supported BERT, ALBERT and ELECTRA.
    threshold: float, optional (default=0.5)
        random selection for a word.
    top_p: float, optional (default=0.9)
        cumulative sum of probabilities to sample a word.
        If top_n bigger than 0, the model will use nucleus sampling, else top-k sampling.
    top_k: int, optional (default=100)
        k for top-k sampling.
    temperature: float, optional (default=1.0)
        logits * temperature.
    top_n: int, (default=5)
        number of nearest neighbors returned. Length of returned result should be the same as top_n.
    cleaning_function: function, (default=None)
        function to clean text.

    Returns
    -------
    result: List[str]
    """

[8]: electra = malaya.transformer.load(model='electra')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:240: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:114: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:120: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:127: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:129: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt

[10]: malaya.augmentation.transformer(text, electra)
[10]: ['Perdana Menteri berkata , kerajaan sudah memperoleh maklumat terperinci berhubung masalah berkenaan supaya kerajaan dapat mengambil pelbagai tindakan sewajarnya . Bagaimanapun , beliau yakin masalah itu berjaya diselesaikan dan akhirnya terdahulu boleh diselesaikan dengan baik .',
 'Perdana Menteri berkata , kerajaan perlu memperoleh maklumat terperinci berhubung isu berkenaan supaya kerajaan dapat mengambil serius tindakan segera . Bagaimanapun , beliau berharap masalah itu boleh diselesaikan dan akhirnya kementerian boleh diselesaikan dengan baik .',
 'Perdana Menteri berkata , kerajaan telah memperoleh maklumat terperinci berhubung isu berkenaan supaya kerajaan dapat mengambil beberapa tindakan sewajarnya . Bagaimanapun , beliau berharap masalah itu perlu diselesaikan dan siasatan BN boleh diselesaikan dengan baik .',
 'Perdana Menteri berkata , kerajaan akan memperoleh maklumat terperinci berhubung isu berkenaan supaya kerajaan dapat mengambil sebarang tindakan susulan . Bagaimanapun , beliau mengharapkan masalah itu dapat diselesaikan dan membolehkan tidak boleh ditangani dengan baik .',
 'Perdana Menteri berkata , kerajaan sudah memperoleh maklumat terperinci berhubung isu berkenaan supaya kerajaan dapat mengambil sebarang tindakan lanjut . Bagaimanapun , beliau berharap masalah itu dapat diselesaikan dan hanya masih boleh diselesaikan dengan baik .']
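The sampling knobs documented in the signature above control how adventurous the rewrites are; a small sketch of a more conservative setting:

# sketch: lower temperature and a tighter nucleus give more conservative
# rewrites (parameters as documented in the signature above).
malaya.augmentation.transformer(text, electra, temperature=0.8, top_p=0.8, top_n=3)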


9.29 Prefix Generator

Given an initial sentence, the models will continue generating the text.

This tutorial is available as an IPython notebook at Malaya/example/prefix-generator.

[1]: %%time

     import malaya
     from pprint import pprint
CPU times: user 4.74 s, sys: 901 ms, total: 5.64 s
Wall time: 8.03 s

9.29.1 Load GPT2

Malaya provides a pretrained GPT2 model specific to Malay; we call it GPT2-Bahasa. This interface does not allow custom training. GPT2-Bahasa was pretrained on ~0.9 billion words, from the following datasets:

1. dumping wikipedia (222MB).
2. local news (257MB).
3. local parliament text (45MB).
4. IIUM Confession (74MB).
5. Wattpad (74MB).
6. Academia PDF (42MB).
7. Common-Crawl (3GB).

If you want to download the pretrained GPT2-Bahasa model and use it for custom transfer learning, you can get it at https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/gpt2, together with some notebooks to help you get started. We hope these models are not used to finetune for spreading fake news.


load model

GPT2-Bahasa is only available as 117M and 345M models: 1. 117M is around 442MB. 2. 345M is around 1.2GB.

def gpt2(model: str = '345M',
         generate_length: int = 256,
         temperature: float = 1.0,
         top_k: int = 40,
         **kwargs):
    """
    Load GPT2 model to generate a string given a prefix string.

    Parameters
    ----------
    model : str, optional (default='345M')
        Model architecture supported. Allowed values:

        * ``'117M'`` - GPT2 117M parameters.
        * ``'345M'`` - GPT2 345M parameters.

    generate_length : int, optional (default=256)
        length of sentence to generate.
    temperature : float, optional (default=1.0)
        temperature value, value should between 0 and 1.
    top_k : int, optional (default=40)
        top-k in nucleus sampling selection.

    Returns
    -------
    result: malaya.transformers.gpt2.Model class
    """
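As a small sketch of the knobs above (nothing here beyond the documented parameters), a shorter and less random generator could be loaded like this:

# sketch: shorter, more conservative generation, using the parameters
# documented in the signature above.
model_conservative = malaya.generator.gpt2(model='117M', generate_length=128,
                                           temperature=0.7, top_k=20)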

[3]: model = malaya.generator.gpt2(model='117M')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/gpt2/__init__.py:19: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/gpt2/__init__.py:140: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/gpt2/__init__.py:141: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/gpt2/__init__.py:142: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/gpt2/__init__.py:142: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/gpt2/117M/gpt2-bahasa-117M/model.ckpt

[5]: string = 'ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram,'

generate

model.generate accepts a string.

def generate(self, string: str):
    """
    generate a text given an initial string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: str
    """

[4]: print(model.generate(string))
ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, ara aku yang lain keluar, aku pandang cerita tapi tak ingat, aku takut dan bimbang aku terpaksa marah kerana hati aku yang berada di sekeliling aku tadi tak putus-putus.
Dalam diam, aku juga merasa kagum dan terharu bila aku bangun pagi untuk bangun dan tengok kisah seram ni, masa tu aku terus pandang, bila aku berada dalam bilik yang indah, aku tahu tentang benda yang nak diperkatakan.
“Tu sikit, dengan banyak masa aku nak keluar dan keluar aku dah mula bangun pagi, aku nak keluar lagi, lepas tu nanti terus masuk ke bilik sambil nampak benda yang tak ada yang nak diperkatakan.
Tak tau cerita tu macam benda yang boleh aku buat kalau rasa macam cerita.
Sampai di bilik, aku pun rasa macam, benda yang nak diperkatakan tu bukan benda yang perlu aku buat.
Macam tak percaya apa yang aku buat ni?
Mungkin benda yang nak diperkatakan itu boleh buat aku jugak, cuma benda yang boleh bagi aku kata tak logik atau memang betul.
Cuma yang paling aku nak cakap ni adalah benda pelik yang aku fikir nak nampak yang tak boleh dan kalau tak logik pun tak patut.
So, apa kata dorang mainkan benda yang aku cakap ni.
Rasa pelik dan amat pelik kan?
Macam nak buat orang lain jadi macam benda pelik dan susah sangat nak buat

[6]: model = malaya.generator.gpt2(model='345M')
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/gpt2/345M/gpt2-bahasa-345M/model.ckpt


[7]: string = 'ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram,'
     print(model.generate(string))
ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, omputeh-uteh cerita lama-lama, seram tak boleh bayang
Sebelum kejadian, dalam 2 jam aku buat panggilan polis , lepas tu kira la sendiri nak ke lokasi. Tengok cerita lama..
Sekarang ni, apa yang aku lalui, kita yang jaga diri, kita yang jaga kesihatan dan juga kita yang jaga minda dalam hidup. Maka, inilah jalan penyelesaian terbaiknya.
Jangan lupakan manusia
Orang yang paling ditakuti untuk berjaya dalam hidup, tidak akan jumpa yang tersayang!
Jangan rosakkan masa depannya, ingatlah apa yang kita nak buat, walaupun pahit untuk ditelan.
Jangan lupakan orang lain - masa depan mereka.
Jangan lupakan orang - masa itulah kita yang lebih dicintai.
Jangan lupakan orang - orang yang kita sayang, mereka bukan orang yang tersayang!
Jangan lupakan orang - orang yang kita cinta, mereka cinta pada kita.
Jangan lupakan diri - diri kita - yang kita punya, yang kita tinggal adalah masa lalu kita.
Jangan lupakan orang lain - orang yang kita cinta, lebih indah dari masa lalu kita.
Jangan lupakan semua orang - orang yang tinggal ataupun hidup.
Jangan cuba lupakan diri kita - kerja keras dan selalu ada masa depan kita.
Jangan pernah putus rasa - kecewa kerana kita telah banyak berubah.
Jangan pernah putus putus asa kerana kita

9.29.2 Using Babble method

We can also generate text the GPT2 way using Transformer-Bahasa. Right now only BERT, ALBERT and ELECTRA are supported.

def babble(string: str,
           model,
           generate_length: int = 30,
           leed_out_len: int = 1,
           temperature: float = 1.0,
           top_k: int = 100,
           burnin: int = 15,
           batch_size: int = 5):
    """
    Use pretrained transformer models to generate a string given a prefix string.
    https://github.com/nyu-dl/bert-gen, https://arxiv.org/abs/1902.04094

    Parameters
    ----------
    string: str
    model: object
        transformer interface object. Right now only supported BERT, ALBERT.
    generate_length : int, optional (default=30)
        length of sentence to generate.
    leed_out_len : int, optional (default=1)
        length of extra masks for each iteration.
    temperature: float, optional (default=1.0)
        logits * temperature.
    top_k: int, optional (default=100)
        k for top-k sampling.
    burnin: int, optional (default=15)
        for the first burnin steps, sample from the entire next word distribution, instead of top_k.
    batch_size: int, optional (default=5)
        generate sentences size of batch_size.

    Returns
    -------
    result: List[str]
    """

Make sure you have already installed tensorflow-probability:

pip3 install tensorflow-probability==0.7.0

[10]: # !pip3 install tensorflow-probability==0.7.0

[3]: electra = malaya.transformer.load(model='electra')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:56: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:242: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:115: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:119: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:122: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:128: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:130: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt

[11]: malaya.generator.babble(string, electra)
[11]: ['ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , terseksa juga hidup di sekeliling aku . Diorang tak tahu sebab diorang tahu titik hitam yang mana kita tengok dari mana kita sendiri nampak cerita ke . Haih .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , tengah baca benda besar pasal bumbung bilik . Rasanya sejuk macam pulau harapan . So aku baca cerita seram pelik . Jadi sedih juga dengar cerita seram seram ni .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , lalu ibu ambil pusing bagi buku sejarah . Dah baca marsh pastu aku dah buat thread seram , ada dalam masa terdekat baru bangun . Sedih , hidup lagi',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , mesti seram sampai aku ikut takdir Allah bagi betul2 aib kita kembali menulis mengenai kisah cinta aku ini malam , aku tersedar selepas ada seorang lelaki tersedar .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , sedangkan yang baca pasal negara berpagarism memang patut berterima kasih . Kata ayah , ingatkan boleh mandi atau bilik pun boleh , kena air dan bukannya ikut kemampuan']
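The sampler can be tuned through the parameters documented in the signature above; a small sketch that asks for fewer but longer continuations:

# sketch: fewer, longer babble samples, using the parameters documented
# in the signature above.
malaya.generator.babble(string, electra, generate_length=50, batch_size=2, top_k=50)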


9.29.3 ngrams

You can generate ngrams pretty easily using this interface:

def ngrams(sequence,
           n: int,
           pad_left=False,
           pad_right=False,
           left_pad_symbol=None,
           right_pad_symbol=None):
    """
    generate ngrams.

    Parameters
    ----------
    sequence : List[str]
        list of tokenize words.
    n : int
        ngram size

    Returns
    -------
    ngram: list
    """

[6]: string = 'saya suka makan ayam'

     list(malaya.generator.ngrams(string.split(), n=2))
[6]: [('saya', 'suka'), ('suka', 'makan'), ('makan', 'ayam')]

[7]: list(malaya.generator.ngrams(string.split(), n=2, pad_left=True, pad_right=True))
[7]: [(None, 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', None)]

[8]: list(malaya.generator.ngrams(string.split(), n=2, pad_left=True, pad_right=True,
                                  left_pad_symbol='START'))
[8]: [('START', 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', None)]

[8]: list(malaya.generator.ngrams(string.split(), n=2, pad_left=True, pad_right=True,
                                  left_pad_symbol='START', right_pad_symbol='END'))
[8]: [('START', 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', 'END')]
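A typical use is counting n-gram frequencies over a corpus; a minimal sketch, where the two-line corpus is a hypothetical stand-in for real tokenized text:

# sketch: bigram frequency count with the ngrams helper above; `corpus` is a
# made-up stand-in for your own data.
from collections import Counter

corpus = ['saya suka makan ayam', 'saya suka makan ikan']
bigram_counts = Counter(
    bg for line in corpus for bg in malaya.generator.ngrams(line.split(), n=2)
)
print(bigram_counts.most_common(3))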


9.30 Isi Penting Generator

Generate a long text given isi penting (important facts).

This tutorial is available as an IPython notebook at Malaya/example/isi-penting-generator.

[1]: %%time
     import malaya
     from pprint import pprint
CPU times: user 5.64 s, sys: 1.19 s, total: 6.83 s
Wall time: 7.82 s

9.30.1 List available Transformer

This module was trained heavily on news structure.

[2]: malaya.generator.available_transformer()
[2]:           Size (MB)  Quantized Size (MB)
     t5           1250.0                481.0
     small-t5      355.6                195.0

9.30.2 Load Transformer

Transformer Generator in Malaya is quite unique. Most of the text generative models found on the internet, like GPT2 or Markov chains, simply continue the prefix input from the user, but not Transformer Generator. We want the model to generate an article or karangan (essay), like high school students write, when the user gives isi penting.

def transformer(model: str = 't5', quantized: bool = False, **kwargs):
    """
    Load Transformer model to generate a string given an isi penting.

    Parameters
    ----------
    model : str, optional (default='t5')
        Model architecture supported. Allowed values:

        * ``'t5'`` - T5 BASE parameters.
        * ``'small-t5'`` - T5 SMALL parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster, it totally depends on the machine.

    Returns
    -------
    result: malaya.model.t5.Generator class
    """


[3]: model = malaya.generator.transformer(model='t5', quantized=True)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/generator.py:510: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/generator.py:512: load (from tensorflow.python.saved_model.loader_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.loader.load or tf.compat.v1.saved_model.load. There will be a new function for importing SavedModels in Tensorflow 2.0.
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/generator-sample/t5/base/model/variables/variables

[4]: isi_penting = ['Dr M perlu dikekalkan sebagai perdana menteri',
                    'Muhyiddin perlulah menolong Dr M',
                    'rakyat perlu menolong Muhyiddin']

I just want to test the model given this isi penting; we all know that Dr M and Muhyiddin do not support each other in the real world.

generate

def greedy_decoder(self, strings: List[str]):
    """
    generate a long text given an isi penting.
    The decoder is a greedy decoder with beam width 1 and alpha 0.5.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: str
    """

[5]: pprint(model.greedy_decoder(isi_penting))
(': Presiden Bersatu, Tan Sri Muhyiddin Yassin perlu mengekalkan Tun Dr '
 'Mahathir Mohamad sebagai perdana menteri berbanding Datuk Seri Anwar Ibrahim '
 'yang hanya minta bantuan untuk menyelesaikan kemelut kedudukan '
 'negara.Muhyiddin berkata, ini kerana semua pihak tahu masalah yang dihadapi '
 'oleh Perdana Menteri adalah di luar bidang kuasa beliau sendiri.Katanya, '
 'Muhyiddin perlu membantu beliau kerana beliau percaya rakyat Malaysia tahu '
 'apa yang berlaku di luar bidang kuasa beliau."Apa yang berlaku di luar '
 'bidang kuasa Dr Mahathir... semua tahu bahawa ini berlaku di bawah '
 'kepimpinan Anwar."Muhyiddin dan seluruh rakyat yang tahu apa yang berlaku di '
 'Johor."Ini kerana di Johor ini, majoriti menteri-menteri dalam Pakatan '
 'Harapan banyak sangat ketua-ketua parti."Jadi Muhyiddin perlu bantu Dr '
 'Mahathir sebab rakyat tahu apa yang berlaku di Johor Bahru," katanya dalam '
 'satu kenyataan di sini, pada Jumaat.Dalam pada itu, Muhyiddin berkata, '
 'rakyat juga perlu menolong Muhyiddin untuk menyelesaikan masalah yang '
 'melanda negara ketika ini.Menurutnya, Muhyiddin perlu menggalas tugas dengan '
 'baik dan memastikan keadaan negara berada dalam keadaan baik.')

Pretty good!

[6]: isi_penting = ['Neelofa tetap dengan keputusan untuk berkahwin akhir tahun ini',
                    'Long Tiger sanggup membantu Neelofa',
                    'Tiba-tiba Long Tiger bergaduh dengan Husein']

We can also give any isi penting, even ones that do not make any sense.

[7]: pprint(model.greedy_decoder(isi_penting)) ('Kuala Lumpur: Pelakon, Neelofa tetap dengan keputusan dibuat untuk berkahwin ' 'penutup tahun ini, selepas mengadakan pertemuan dengan Long Tiger. Neelofa ' 'atau nama sebenarnya, Mohd Neelofa Ahmad Noor berkata, dia tidak pernah ' 'merancang untuk berkahwin, namun menegaskan dirinya lebih mengutamakan masa ' 'depan. "Saya seronok bersama keluarga. Kalau kami berkahwin awal tahun ini, ' 'ia mengambil masa yang lama. Itu impian saya tetapi biarlah, selepas setahun ' 'saya berehat, saya akan mula bekerja. "Jadi, apabila sering sesi pertemuan ' 'dengan Long Tiger, saya kena tegas mengenai perkara ini. Bukan soal nak ' 'memalukan diri sendiri tetapi siapa yang boleh menghentam saya," katanya ' 'kepada Bh Online. Dalam sesi pertemuan itu, Neelofa yang juga pengacara ' 'acara Top 5, bergaduh dengan Husein, dalam pergaduhan yang berlaku di ' 'Kompleks Mahkamah Tinggi Syariah di sini, baru-baru ini. Ditanya mengenai ' 'hubungannya dengan wanita itu, Neelofa berkata, mereka masih belum ' 'menyelesaikan perkara itu dengan baik. "Saya tidak tahu pasal semua ini, ' 'tetapi ia akan diselesaikan menerusi cara baik. Tidak kiralah apa yang kami ' 'tidak cakap pun. "Pada mulanya kami hanya mahu membebaskan mereka daripada ' 'sebarang isu, namun selepas beberapa hari bergaduh, kami akhirnya mengambil ' 'keputusan untuk berkahwin dengan Hadiza Aziz. "Jika mereka mahu, kami akan ' 'membendung, namun pada masa yang sama, kami tidak mahu bergaduh dengan ' 'lelaki yang digelar Long Tiger," katanya.')

How about a karangan, like in high school?

[8]: # http://mieadham86.blogspot.com/2016/09/isi-isi-penting-karangan-bahasa-melayu.html
     # KEBAIKAN AMALAN BERGOTONG-ROYONG

     isi_penting = ['Dapat memupuk semangat kerjasama',
                    'Dapat mengeratkan hubungan silaturahim.',
                    'Kebersihan kawasan persekitaran terpelihara.',
                    'Terhindar daripada wabak penyakit seperti Denggi',
                    'Mengisi masa lapang',
                    'Menerapkan nilai-nilai murni dalam kehidupan']

[10]: pprint(model.greedy_decoder(isi_penting)) ('Dewasa ini, kes-kes seumpama denggi semakin menular di kalangan masyarakat. ' 'Justeru, individu yang bertanggungjawab dan berkesan perlu memainkan peranan ' 'penting dalam memastikan persekitaran dalam komuniti terjamin. Persis kata ' 'peribahasa Melayu, melentur buluh biarlah dari rebungnya. Oleh itu, tindakan ' 'yang wajar perlu diambil terutamanya jika kita mengamalkan sikap-sikap di ' 'dalam komuniti supaya kehidupan kita tidak terjejas. Oleh itu, kita perlu ' 'mengamalkan sikap bekerjasama dengan masyarakat dalam memastikan ' 'persekitaran kita selamat. Jika kita sehati, sikap bekerjasama dapat dipupuk ' 'dan dibudayakan dalam masyarakat. Maka, amalan ini secara tidak langsung ' (continues on next page)


(continued from previous page) 'mampu membantu kita supaya tidak hidup lebih sejahtera. Pada masa yang sama, ' 'ia juga dapat mengelakkan berlakunya sebarang masalah kesihatan dan ' 'seterusnya membantu yang mungkin akan berlaku pada masa akan datang. ' 'Masyarakat yang prihatin perlu meluahkan perasaan dan menitik beratkan soal ' 'kebersihan kawasan persekitaran. Bak kata peribahasa Melayu, mencegah lebih ' 'baik daripada merawat. Tamsilnya, pihak kerajaan perlu menjalankan usaha ' 'yang bersungguh-sungguh sebagai tanggungjawab yang diamanahkan. Selain itu, ' 'sikap masyarakat yang mengambil berat tentang kebersihan kawasan ' 'persekitaran dapat membantu mengurangkan masalah kesihatan yang kian ' 'menular. Secara tidak langsung, masyarakat awam akan melahirkan masyarakat ' 'yang peka dan menghargai keberadaan anggota masyarakat di sekeliling mereka. ' 'Bagi memastikan kebersihan kawasan persekitaran terjamin, kita perlu ' 'memastikan komuniti yang berada ditaarapkan dalam keadaan bersih dan terurus ' 'agar keselamatan masyarakat terjamin. Para pekerja dan ahli peniaga perlu ' 'memastikan kebersihan kawasan mereka dijaga dengan baik. Hal ini kerana, ' 'kita akan berhadapan dengan pelbagai masalah kesihatan yang mengakibatkan ' 'Malaysia menjadi negara ketiga yang paling teruk terkena jangkitan demam ' 'denggi pada tahun lepas. Sekiranya kita mempraktikkan amalan berkenaan, kita ' 'akan berhadapan dengan bahaya. Sekiranya aktiviti ini diteruskan, kita akan ' 'terencat daripada jumlah kes penyakit yang menyerang. Secara tidak langsung, ' 'kita akan dapat membendung penularan wabak penyakit di kalangan masyarakat. ' 'Sebagai contoh, wabak denggi di Malaysia berkemungkinan boleh menularkan ' 'jangkitan kepada penduduk di negeri-negeri yang lain. Oleh itu, langkah ini ' 'wajar dan mempunyai sistem pengurusan kebersihan yang terbaik bagi ' 'membolehkan jumlah pesakit yang dirawat di hospital meningkat. Kesannya, ia ' 'dapat membantu kita untuk mengamalkan kaedah yang betul dan matang dalam ' 'kehidupan. Selain itu, sekiranya kita mengamalkan sikap kerja, kita akan ' 'sentiasa berusaha supaya kita terhindar daripada wabak penyakit yang ' 'menyerang penduduk di sekeliling kita. Bak kata peribahasa Melayu, mencegah ' 'lebih baik daripada merawat. Semua pihak perlu berganding bahu bagai aur ' 'dengan tebing untuk menjaga kesihatan dan keselamatan para pekerja dalam ' 'kawasan yang sangat rentan. Kebersihan kawasan persekitaran merupakan elemen ' 'yang penting dalam memastikan persekitaran kita selamat daripada jangkitan ' 'wabak seperti denggi. Kita tentunya tidak mahu ada tempat yang kotor dan ' 'busuk namun kita tidak boleh berbuat demikian kerana ia merupakan elemen ' 'yang tidak boleh dijual beli. Oleh itu, jika kita mengamalkan sikap kerja ' "yang 'membersihkan', kita akan menjadi lebih baik dan selamat daripada wabak " 'penyakit seperti denggi. Jika kita mengamalkan sikap ini, kita akan menjadi ' 'lebih baik dan selamat daripada ancaman penyakit-penyakit yang berbahaya. ' 'Tidak kira apabila kita sudah terbiasa dengan amalan ini, sudah pasti ' 'keselamatan kita akan terjamin. Selain itu, kita perlulah dirikan amalan ' 'seperti rajin mencuci tangan menggunakan sabun atau segala benda lain kerana ' 'kita juga mempunyai tempat yang sesuai untuk membasuh tangan dengan baik. ' 'Perkara ini boleh menjadi perubahan kepada amalan kita dalam kehidupan ' 'apabila kita berusaha untuk membersihkan kawasan yang telah dikenal pasti. 
' 'Secara tidak langsung, kita dapat bertukar-tukar fikiran dan mengamalkan ' 'nilai-nilai murni dalam kehidupan. Hal ini demikian kerana, kita antara ' 'mereka yang merancang untuk melakukan sesuatu bagi mengelakkan berlakunya ' 'kemalangan. Hakikatnya, amalan membasuh tangan menggunakan sabun atau benda ' 'lain adalah berniat buruk kerana akan dapat mengganggu kelancaran proses ' 'pemanduan terutamanya apabila tidur. Kesannya, kita akan mewujudkan ' 'masyarakat yang bertimbang rasa dan bergantung kepada orang lain untuk ' 'melakukan kerja mereka walaupun di mana mereka berada. Selain itu, kita ' 'dapat mengamalkan cara yang betul dalam memastikan kebersihan kawasan ' 'persekitaran adalah terjamin. Kita tidak boleh menyembunyikan diri daripada ' 'pengetahuan umum seperti di tempat awam seperti tempat letak kereta yang ' 'sering digunakan oleh orang ramai. Jika kita menggunakan tandas awam dan ' (continues on next page)


(continued from previous page) 'menggunakan botol air untuk membersihkan kawasan berkenaan, kita akan mudah ' 'terdedah dengan wabak penyakit yang membahayakan kesihatan. Selain itu, kita ' 'juga perlu sentiasa berjaga-jaga dengan memakai penutup mulut dan hidung ' 'jika ada demam. Jika kita tidak mengamalkan kebersihan, besar kemungkinan ia ' 'boleh mengundang kepada penularan wabak penyakit. Bak kata peribahasa ' 'Melayu, mencegah lebih baik daripada merawat. Jika kita membuat keputusan ' 'untuk menutup mulut atau hidung dengan pakaian yang bersih dan bijak, kita ' 'akan menjadi lebih baik daripada menyelamatkan diri sendiri daripada ' 'jangkitan penyakit. Andai kata, pengamal media dapat menggunakan telefon ' 'pintar ketika membuat liputan di media massa, proses ini akan membuatkan ' 'kehidupan mereka lebih mudah dan sukar. Selain itu, proses nyah kuman juga ' 'dapat memastikan kebersihan di kawasan rumah kita terjamin. Contohnya, semua ' 'stesen minyak dan restoran makanan segera perlu memakai penutup mulut dan ' 'hidung secara betul agar penularan wabak penyakit dapat dihentikan. Penonton ' 'yang berada di dalam juga wajar digalakkan untuk menggunakan penutup mulut ' 'dan hidung agar mudah terkena jangkitan kuman. Selain itu, pengisian masa ' 'lapang yang terdapat di kawasan tempat awam dapat mendidik masyarakat untuk ' 'mengamalkan nilai-nilai murni seperti rajin mencuci tangan menggunakan sabun ' 'dan air supaya tidak terdedah kepada virus denggi. Walaupun kita mempunyai ' 'ramai kenalan yang ramai tetapi tidak dapat mengamalkannya kerana kita perlu ' 'adalah rakan yang sedar dan memahami tugas masing-masing. Pelbagai cara yang ' 'boleh kita lakukan bagi memastikan hospital atau klinik-klinik kerajaan ' 'menjadi')

[11]: # http://mieadham86.blogspot.com/2016/09/isi-isi-penting-karangan-bahasa-melayu.html
      # CARA MENJADI MURID CEMERLANG

      isi_penting = ['Rajin berusaha - tidak mudah putus asa',
                     'Menghormati orang yang lebih tua - mendapat keberkatan',
                     'Melibatkan diri secara aktif dalam bidang kokurikulum',
                     'Memberi tumpuan ketika guru mengajar.',
                     'Berdisiplin - menepati jadual yang disediakan.',
                     'Bercita-cita tinggi - mempunyai keazaman yang tinggi untuk berjaya']

[12]: pprint(model.greedy_decoder(isi_penting)) ('Sejak akhir-akhir ini, pelbagai isu yang hangat diperkatakan oleh masyarakat ' 'yang berkait dengan sambutan Hari Raya Aidilfitri. Pelbagai faktor yang ' 'melatari perkara yang berlaku dalam kalangan masyarakat hari ini, khususnya ' 'bagi golongan muda. Dikatakan bahawa kehidupan kita hari ini semakin ' 'mencabar terutamanya kesibukan dalam menjalankan tugas dan mengajar. ' 'Justeru, tidak dinafikan apabila semakin jauh kita, semakin ramai yang ' 'memilih untuk lalai atau tidak mematuhi arahan yang telah ditetapkan. ' 'Mendepani cabaran ini, golongan muda terpaksa menempuhi segala cabaran untuk ' 'menjadi lebih baik dan lebih baik. Minda yang perlu diterapkan, terutama di ' 'dalam kelas untuk mempelajari ilmu pengetahuan. Jika tidak, kita akan ' 'menjadi lebih mudah untuk menilai dan menyelesaikan masalah yang dihadapi. ' 'Oleh itu, kita perlu berfikir untuk menetapkan langkah yang patut atau perlu ' 'dilaksanakan bagi mengatasi masalah yang berlaku. Selain itu, guru-guru juga ' 'harus mendidik peserta-peserta dalam kelas supaya dapat menjalankan kegiatan ' 'dengan lebih serius dan berkesan. Guru-Guru juga seharusnya berusaha untuk ' 'meningkatkan kemahiran mereka dalam kalangan pelajar. Seperti peribahasa ' 'Melayu, melentur buluh biarlah dari rebungnya. Setiap insan mempunyai ' 'peranan masing-masing dan tanggungjawab yang masing-masing. Kesempatan untuk ' 'memberikan nasihat dan teguran adalah lebih penting dan membantu secara ' 'halus dan bijaksana dalam melakukan sesuatu. Selain itu, guru-guru hendaklah ' (continues on next page)


(continued from previous page) 'berani untuk melakukan sesuatu perkara yang memberi manfaat kepada para ' 'pelajar yang lain. Cara ini adalah dengan melakukan aktiviti-aktiviti yang ' 'boleh memberi manfaat kepada para pelajar. Selain itu, guru-guru juga ' 'perlulah menjaga disiplin mereka dengan sebaik-baiknya. Dalam menyampaikan ' 'nasihat dan teguran secara berterusan, pelajar juga boleh melakukan perkara ' 'yang boleh mendatangkan mudarat. Anak-Anak awal pelajar dan rakan-rakan ' 'mereka juga boleh melakukan tugas yang bermanfaat. Keadaan ini membolehkan ' 'mereka untuk lebih berusaha dan memberikan nasihat yang berguna kepada kaum ' 'lain. Oleh itu, mereka perlu sentiasa mengingati dan mendidik pelajar dengan ' 'nilai-nilai yang murni. Setiap orang mempunyai impian yang tinggi untuk ' 'berjaya. Sama ada kita berjaya atau tidak, pencapaian yang diperoleh setelah ' 'tamat belajar akan memberikan kita nilai yang baik dan perlu menjadi contoh ' 'yang baik untuk negara kita.')

9.31 Lexicon Generator

This tutorial is available as an IPython notebook at Malaya/example/lexicon.

[1]: %%time
     import malaya
     import numpy as np
CPU times: user 4.47 s, sys: 1.01 s, total: 5.48 s
Wall time: 5.37 s

9.31.1 Why lexicon

A lexicon is a set of words related to certain domains, for example, words for negative and positive sentiments. The word suka can represent positive sentiment: if suka exists in a sentence, we can say that sentence has positive sentiment. Lexicon-based classification is a common and very fast way people classify a text. It is also pretty naive, because a word can be semantically ambiguous.
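As a concrete illustration of the idea above, a naive lexicon-based classifier can be written in a few lines; lexicon_classify and tiny_lexicon below are hypothetical names for illustration only, not part of Malaya,

def lexicon_classify(text, lexicon):
    # count how many words from each label's word list appear in the text,
    # then pick the label with the highest count
    tokens = text.lower().split()
    scores = {label: sum(token in words for token in tokens)
              for label, words in lexicon.items()}
    return max(scores, key=scores.get)

tiny_lexicon = {'positive': {'suka', 'bagus'}, 'negative': {'benci', 'buruk'}}
print(lexicon_classify('saya suka makan ayam', tiny_lexicon))  # positive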

9.31.2 sentiment lexicon

Malaya provides a small sample sentiment lexicon, simply,

[6]: sentiment_lexicon = malaya.lexicon.sentiment
     sentiment_lexicon.keys()
[6]: dict_keys(['negative', 'positive'])


9.31.3 emotion lexicon

Malaya provides a small sample emotion lexicon, simply,

[3]: emotion_lexicon = malaya.lexicon.emotion
     emotion_lexicon.keys()
[3]: dict_keys(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'])

9.31.4 Lexicon generator

Building a lexicon is time consuming, because it requires domain experts to populate words related to the domains. With the help of a word vector, we can induce words for specific domains given a small annotated lexicon.

Why induce a lexicon from a word vector? Even though a word like suka commonly represents positive sentiment, if the word vector learnt suka in contexts of a different polarity, and its nearest words also represent a different polarity, then suka has a tendency to become negative sentiment.

Malaya provides a lexicon-inducing interface, built on top of Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora. Say you have a lexicon based on the standard language, or bahasa baku, and you want to find a similar lexicon in a social media context; you can use this malaya.lexicon interface. To use this interface, we must initiate malaya.wordvector.load first, and provide at least a small lexicon sample like this,

{'label1': ['word1', 'word2'], 'label2': ['word3', 'word4']}

There can be more than 2 labels; for example, malaya.lexicon.emotion has up to 6 different labels.
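For instance, a custom two-label seed lexicon in standard Malay could look like this; the words are illustrative only,

# hypothetical seed lexicon; any label names and any number of labels work
custom_lexicon = {
    'positive': ['suka', 'gembira', 'bagus', 'cantik'],
    'negative': ['benci', 'sedih', 'buruk', 'marah'],
}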

[5]: vocab, embedded = malaya.wordvector.load(model='socialmedia')
     wordvector = malaya.wordvector.WordVector(embedded, vocab)

9.31.5 random walk

Random walk is the main technique used by the paper; you can read more at section 3.2, Propagating polarities from a seed set.
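Conceptually, polarity mass spreads from the seed words over a word-similarity graph, with a restart probability controlled by beta. The following numpy sketch illustrates that propagation idea under simplified assumptions; it is not Malaya's implementation, and random_walk_scores is a hypothetical name,

import numpy as np

def random_walk_scores(similarity, seed, beta=0.9, iterations=100):
    # row-normalise the similarity matrix into a transition matrix
    transition = similarity / similarity.sum(axis=1, keepdims=True)
    scores = seed.copy()
    for _ in range(iterations):
        # with probability beta walk to a neighbour, otherwise restart at the seeds
        scores = beta * transition.T @ scores + (1 - beta) * seed
    return scores

# 3 words; word 0 is the only positive seed, and words 0 and 1 are similar
similarity = np.array([[1.0, 0.8, 0.1],
                       [0.8, 1.0, 0.1],
                       [0.1, 0.1, 1.0]])
seed = np.array([1.0, 0.0, 0.0])
print(random_walk_scores(similarity, seed))  # word 1 ends up with more positive mass than word 2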

def random_walk(
    lexicon,
    wordvector,
    pool_size=10,
    top_n=20,
    similarity_power=100.0,
    beta=0.9,
    arccos=True,
    normalization=True,
    soft=False,
    silent=False,
):
    """
    Induce lexicon by using the random walk technique, used in the paper
    https://arxiv.org/pdf/1606.02820.pdf

    Parameters
    ----------
    lexicon: dict
        curated lexicon from expert domain, {'label1': [str], 'label2': [str]}.
    wordvector: object
        wordvector interface object.
    pool_size: int, optional (default=10)
        pick top-pool size from each lexicon.
    top_n: int, optional (default=20)
        top_n for each vector will be multiplied with `similarity_power`.
    similarity_power: float, optional (default=100.0)
        extra score for `top_n`; less will generate less induced bias, but with a higher chance of an unbalanced outcome.
    beta: float, optional (default=0.9)
        penalty score; towards 1.0 means less penalty. 0 < beta < 1.
    arccos: bool, optional (default=True)
        covariance distribution for embedded.dot(embedded.T). If false, covariance + 1.
    normalization: bool, optional (default=True)
        normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with the nearest Jaro-Winkler ratio.
        if False, it will throw an exception if a word is not in the dictionary.
    silent: bool, optional (default=False)
        if True, will not print any logs.

    Returns
    -------
    tuple: (labels[argmax(scores), axis = 1], scores, labels)
    """

[5]: %%time

     results, scores, labels = malaya.lexicon.random_walk(sentiment_lexicon, wordvector, pool_size=5)
populating nearest words from wordvector
populating vectors from populated nearest words
random walking from populated vectors
CPU times: user 1min 36s, sys: 16.1 s, total: 1min 52s
Wall time: 28.1 s

[6]: np.unique(list(results.values()), return_counts=True)
[6]: (array(['negative', 'positive'], dtype='

[7]: results [7]: {'serang': 'negative', 'cilegon': 'positive', 'culik': 'negative', 'tanjungpinang': 'positive', 'jenguk': 'negative', (continues on next page)


(continued from previous page) 'luka': 'negative', 'jerawat': 'negative', 'infeksi': 'negative', 'migrain': 'negative', 'penyakit': 'negative', 'penaklukan': 'negative', '4ir': 'positive', 'renjer': 'positive', 'kezhaliman': 'positive', 'proklamator': 'positive', 'kelucahan': 'negative', 'pablisiti': 'positive', 'terjwp': 'positive', '33100': 'positive', 'impos': 'positive', 'kritikan': 'negative', 'mandat': 'negative', 'teguran': 'negative', 'persepsi': 'negative', 'pembelaan': 'negative', 'muflis': 'negative', 'mempelajarinya': 'negative', 'melarat': 'positive', 'dihabisi': 'positive', 'kooperatif': 'positive', 'kelemahan': 'negative', 'keyakinan': 'positive', 'kehendak': 'negative', 'keburukan': 'negative', 'gerombolan': 'negative', 'kelakuan': 'negative', 'antek': 'negative', 'politikus': 'negative', 'ulah': 'negative', 'debu': 'negative', 'kotoran': 'negative', 'polusi': 'negative', 'kuman': 'negative', 'keringat': 'negative', 'sinis': 'negative', 'misterius': 'positive', 'menggemaskan': 'positive', 'emosional': 'negative', 'progresif': 'positive', 'bocor': 'negative', 'pecah': 'negative', 'retak': 'negative', 'rosak': 'negative', 'terbalik': 'negative', 'kekacauan': 'negative', 'penindasan': 'negative', 'perdebatan': 'negative', 'kesombongan': 'negative', 'pengamatan': 'negative', 'permusuhan': 'negative', 'ketidakadilan': 'negative', 'empati': 'negative', (continues on next page)


(continued from previous page) 'perpecahan': 'negative', 'menghasut': 'negative', 'menghukum': 'negative', 'memfitnah': 'negative', 'memaki': 'negative', 'memprovokasi': 'negative', 'bersedih': 'negative', 'mengalah': 'negative', 'terlena': 'negative', 'cemburu': 'negative', 'dikenang': 'negative', 'jatuh': 'negative', 'terjatuh': 'negative', 'putus': 'negative', 'hilang': 'negative', 'hancur': 'negative', 'dipakai': 'negative', 'digunakan': 'negative', 'dikonsumsi': 'negative', 'dipake': 'negative', 'diminum': 'negative', 'harapan': 'negative', 'kebahagiaan': 'positive', 'impian': 'positive', 'cita2': 'negative', 'senyuman': 'positive', 'beban': 'negative', 'resiko': 'negative', 'kerugian': 'negative', 'tekanan': 'negative', 'risiko': 'negative', 'mencaci': 'negative', 'dicaci': 'negative', 'mengejek': 'negative', 'disia': 'negative', 'bengkak': 'negative', 'berair': 'negative', 'lebam': 'negative', 'lenguh': 'negative', 'toksik': 'negative', 'toksin': 'negative', 'pepejal': 'positive', 'kafein': 'negative', 'buih': 'negative', 'terperangkap': 'negative', 'dijumpai': 'negative', 'tersimpan': 'negative', 'tergabung': 'negative', 'bertarung': 'negative', 'rahsia': 'negative', 'cabaran': 'positive', 'petua': 'negative', 'persamaan': 'negative', 'punca': 'negative', 'fail': 'negative', 'failed': 'negative', 'approve': 'negative', (continues on next page)


(continued from previous page) 'consider': 'negative', 'freehair': 'negative', 'munafik': 'negative', 'dungu': 'negative', 'liberal': 'negative', 'rasis': 'negative', 'konservatif': 'negative', 'parasit': 'negative', 'klorofil': 'negative', 'klorin': 'positive', 'fibroid': 'negative', 'antibakteri': 'negative', 'menyesal': 'negative', 'nyesal': 'negative', 'terkejut': 'positive', 'terliur': 'positive', 'sebak': 'positive', 'pemberontakan': 'negative', 'kudeta': 'negative', 'feminisme': 'negative', 'keragaman': 'negative', 'kesangsian': 'negative', 'nelponke': 'positive', 'datebook': 'negative', '4dalzk': 'negative', 'ketidakpentinganku': 'positive', 'fasis': 'negative', 'portugis': 'negative', 'ateisme': 'positive', 'illuminati': 'negative', 'malang': 'negative', 'depok': 'positive', 'kediri': 'positive', 'semarang': 'positive', 'cirebon': 'positive', 'mendatangkan': 'negative', 'menimbulkan': 'negative', 'memupuk': 'negative', 'mengundang': 'negative', 'menghianati': 'negative', 'kejatuhan': 'negative', 'pelemahan': 'negative', 'lonjakan': 'negative', 'ketiadaan': 'negative', 'pengubahan': 'negative', 'memusnahkan': 'negative', 'mengadopsi': 'negative', 'merampas': 'negative', 'mengangkut': 'negative', 'mengarahkan': 'negative', 'kemarahan': 'negative', 'keimanan': 'positive', 'penderitaan': 'negative', 'wabak': 'negative', 'letupan': 'negative', 'jangkitan': 'negative', 'serangan': 'negative', (continues on next page)


(continued from previous page) 'jenayah': 'negative', 'tragedi': 'negative', 'peristiwa': 'negative', 'insiden': 'negative', 'kejadian': 'negative', 'menganggur': 'negative', 'dioptimalkan': 'positive', 'menyakitimu': 'positive', 'bernafsu': 'positive', 'derhaka': 'negative', 'menakan': 'negative', 'sulung': 'positive', 'bongsu': 'negative', 'teruna': 'negative', 'merungut': 'negative', 'komplen': 'negative', 'giveup': 'negative', 'melalak': 'negative', 'melawa': 'negative', 'berdarah': 'negative', 'bengkok': 'negative', 'layu': 'negative', 'ngeri': 'negative', 'serem': 'negative', 'kocak': 'negative', 'mantep': 'positive', 'miris': 'negative', 'menghina': 'negative', 'menuduh': 'negative', 'membenci': 'negative', 'menyalahkan': 'negative', 'menyekat': 'negative', 'menggenjot': 'negative', 'mengevaluasi': 'negative', 'mengalirkan': 'negative', 'melemahkan': 'negative', 'keengganan': 'negative', 'vendon': 'positive', 'koturno': 'positive', 'spesialisasikan': 'positive', "'pembongkaran": 'positive', 'neraka': 'negative', 'surga': 'negative', 'syurga': 'positive', 'kubur': 'negative', 'mesjid': 'negative', 'gerun': 'negative', 'betui2': 'positive', 'bankrup': 'positive', 'gamak': 'positive', 'mendobi': 'negative', 'penghapusan': 'negative', 'proyeksi': 'negative', 'realisasi': 'negative', 'pengendalian': 'negative', 'maraknya': 'negative', 'strike': 'negative', (continues on next page)


(continued from previous page) 'adop': 'positive', 'seats': 'positive', 'sponsored': 'positive', 'script': 'positive', 'pengangguran': 'negative', 'pns': 'negative', 'koruptor': 'negative', 'oposisi': 'negative', 'stunting': 'negative', 'mengamuk': 'negative', 'membebel': 'negative', 'menjerit': 'negative', 'meroyan': 'negative', 'bergaduh': 'negative', 'keruntuhan': 'negative', 'maxxie': 'positive', '081266267925': 'positive', 'evvadiki': 'positive', 'digibdulu': 'positive', 'kekuatan': 'negative', 'kepercayaan': 'positive', 'kesadaran': 'negative', 'hasrat': 'negative', 'radikal': 'negative', 'sekuler': 'negative', 'intoleran': 'negative', 'sosialis': 'negative', 'penagih': 'negative', 'penagihan': 'positive', 'professor': 'negative', 'keldai': 'negative', 'penebar': 'negative', 'menghentam': 'negative', 'jagungbakar': 'positive', 'pembakaram': 'positive', 'bajucoplemurah': 'positive', 'ma3i': 'positive', 'pembakar': 'negative', 'limpahan': 'positive', 'melarutkan': 'positive', 'pencegah': 'negative', 'merendam': 'positive', 'membakar': 'negative', 'mengikat': 'negative', 'membersihkan': 'positive', 'menghancurkan': 'negative', 'pembakaran': 'negative', 'penat': 'negative', 'letih': 'negative', 'stress': 'negative', 'bosan': 'negative', 'mengantuk': 'negative', 'binasa': 'negative', 'membengkak': 'positive', 'terpejam': 'positive', 'menggumpal': 'positive', 'bergoyang': 'negative', (continues on next page)


(continued from previous page) 'diasingkan': 'negative', 'difokuskan': 'negative', 'melindungimu': 'positive', 'terselamatkan': 'positive', 'tertid': 'positive', 'mengelak': 'negative', 'menyiasat': 'negative', 'menghindar': 'negative', 'mengelakkan': 'negative', 'dilepaskan': 'negative', 'tempur': 'negative', 'migas': 'negative', 'nuklir': 'negative', 'manufaktur': 'negative', 'ilegal': 'negative', 'discrimination': 'negative', 'dramaticnyer': 'positive', 'disuwek': 'positive', '6066030438': 'positive', 'fahdy': 'positive', 'merugikan': 'negative', 'meresahkan': 'negative', 'menimpa': 'negative', 'meyakinkan': 'positive', 'membanggakan': 'positive', 'membingungkan': 'negative', 'diperlihatkan': 'negative', 'dilakukannya': 'positive', 'disegani': 'positive', 'dititipkan': 'negative', 'fatal': 'positive', 'provokatif': 'positive', 'memprihatinkan': 'positive', 'ambisius': 'positive', 'mendasar': 'positive', 'peredaran': 'negative', 'sirkulasi': 'negative', 'pembuluh': 'negative', 'murka': 'negative', 'dilaknat': 'negative', 'diijabah': 'negative', 'berkehendak': 'negative', 'terusik': 'positive', 'virus': 'negative', 'hama': 'negative', 'stroke': 'negative', 'perkauman': 'negative', 'lgbt': 'negative', 'icerd': 'negative', 'rasuah': 'negative', 'politik': 'negative', 'kehancuran': 'negative', 'kedewasaan': 'negative', 'penjajahan': 'negative', 'menurun': 'negative', 'meningkat': 'negative', 'berkurang': 'negative', (continues on next page)


(continued from previous page) 'membaik': 'negative', 'meroket': 'negative', 'mengetepikan': 'negative', 'kuimplankan': 'positive', 'mountaineer': 'positive', 'chapalein': 'positive', '40365036': 'positive', 'penjara': 'negative', 'lokap': 'negative', 'mengekori': 'negative', 'c4uf5s': 'positive', '085602974529': 'positive', 'kebiasqan': 'positive', 'teamgoals': 'positive', 'bimbang': 'negative', 'khawatir': 'negative', 'kesal': 'positive', 'sungkan': 'negative', 'pemabuk': 'negative', 'adibrunner': 'positive', 'eppii': 'positive', '3s3bju': 'positive', 'jakwir': 'positive', 'pemukul': 'negative', 'seminaronline7': 'positive', 'gemoksaya': 'positive', 'gabisabisa': 'positive', 'berocorak': 'positive', 'penentangan': 'negative', 'livescreen': 'positive', 'meliriktelegramdan': 'positive', '081334186600': 'positive', 'indox': 'positive', 'terdesak': 'negative', 'desperate': 'negative', 'bebal': 'negative', 'fobia': 'negative', 'nekad': 'positive', 'tahi': 'negative', 'taik': 'negative', 'bangkai': 'negative', 'seekor': 'negative', 'ulat': 'negative', 'kesusahan': 'negative', 'kesedihan': 'negative', 'keraguan': 'negative', 'berdepan': 'negative', 'dikaitkan': 'negative', 'dimulakan': 'negative', 'mengesan': 'negative', 'dikejutkan': 'negative', 'tamak': 'negative', 'biadap': 'negative', 'bongkak': 'negative', 'angkuh': 'negative', 'pendarahan': 'negative', 'alahan': 'negative', (continues on next page)


(continued from previous page) 'pembengkakan': 'negative', 'kegatalan': 'negative', 'komplikasi': 'negative', 'dirosakkan': 'negative', 'sajadahmasjid': 'positive', 'wisatalumajang': 'positive', 'dsmua': 'positive', 'otogod': 'positive', 'kekufuran': 'negative', 'auratnya': 'positive', 'kebhinekaan': 'positive', 'kekuatannya': 'negative', 'maksiat': 'negative', 'zina': 'negative', 'provokasi': 'negative', 'syirik': 'negative', 'dicemari': 'negative', 'bergandingan': 'negative', 'diperankan': 'positive', 'dihalang': 'negative', 'bpuasa': 'positive', 'merobohkan': 'negative', 'wediaraya': 'positive', 'pliharaku': 'positive', 'diinfor': 'positive', 'ivgfood': 'positive', 'mencuri': 'negative', 'pecahkan': 'negative', 'sumbang': 'negative', 'meminjam': 'negative', 'curi': 'negative', 'disembelih': 'negative', 'terobati': 'negative', 'diangetin': 'positive', 'berharta': 'positive', 'dituliskan': 'positive', 'pengepungan': 'negative', 'menyamoaikan': 'positive', 'kihoii': 'positive', 'sukasukanya': 'positive', '085740709892': 'positive', 'menyeleweng': 'negative', 'bukanyah': 'positive', 'terlangkap': 'positive', 'nurulady_sandwich': 'positive', 'spupet': 'positive', 'krisis': 'negative', 'konflik': 'negative', 'kekhawatiran': 'negative', 'keterbatasan': 'negative', 'ancaman': 'negative', 'dipadamkan': 'negative', 'diagungkan': 'positive', 'digunapakai': 'positive', 'dikenalpasti': 'negative', 'digariskan': 'positive', 'sumpahan': 'negative', (continues on next page)


(continued from previous page) 'busuknya': 'negative', 'raklu': 'positive', 'adela': 'negative', 'sgguh': 'positive', 'merebut': 'negative', 'memindahkan': 'negative', 'menyelamatkan': 'negative', 'memperluas': 'negative', 'pembangkang': 'negative', 'ppbm': 'negative', 'bn': 'negative', 'tmj': 'negative', 'pkr': 'negative', 'bercanggah': 'negative', 'berkerjasama': 'negative', 'diberhentikan': 'negative', 'terpalit': 'negative', 'selari': 'negative', 'penalty': 'negative', 'lipliner': 'positive', 'glasses': 'positive', 'kdak': 'positive', 'logbook': 'positive', 'tergantung': 'negative', 'beda': 'negative', 'berbeda': 'positive', 'gatau': 'negative', 'berdasarkan': 'negative', 'longgar': 'negative', 'ketat': 'positive', 'sendat': 'positive', 'ramping': 'positive', 'dijahit': 'negative', 'kontroversi': 'negative', 'kezaliman': 'negative', 'penolakan': 'negative', 'menakutkan': 'negative', 'menyedihkan': 'negative', 'mengerikan': 'negative', 'mendebarkan': 'positive', 'dibenci': 'negative', 'mengusik': 'negative', 'memberkahi': 'positive', 'menyirami': 'negative', 'memantulkan': 'negative', 'menampar': 'negative', 'problem': 'negative', 'prob': 'positive', 'down': 'negative', 'error': 'negative', 'function': 'positive', 'pelarian': 'negative', 'pengemis': 'negative', 'jurnalis': 'negative', 'primadona': 'negative', 'buzzer': 'negative', 'lengkap': 'negative', (continues on next page)


(continued from previous page) 'lengkapnya': 'positive', 'komplit': 'positive', 'pengirim': 'negative', 'simpel': 'positive', 'bencana': 'negative', 'musibah': 'negative', 'tsunami': 'negative', 'kerusuhan': 'negative', 'rompakan': 'negative', 'samun': 'negative', 'lynas': 'negative', 'rusuhan': 'negative', 'penyelewengan': 'negative', 'meletup': 'negative', 'tercabut': 'negative', 'terkencing': 'negative', 'pitam': 'negative', 'letup': 'negative', 'membosankan': 'negative', 'menyebalkan': 'negative', 'rumit': 'negative', 'bantahan': 'negative', 'cenderahati': 'negative', 'instruksi': 'negative', 'ketertarikan': 'negative', 'penghasut': 'negative', 'hasanudin': 'positive', 'astuti': 'positive', 'kurva': 'positive', 'gerd': 'positive', 'ribut': 'negative', 'ngeluh': 'negative', 'rusuh': 'negative', 'berantem': 'negative', 'ngumpul': 'negative', 'bergelut': 'negative', 'disibukkan': 'negative', 'berkolaborasi': 'negative', 'berkutat': 'negative', 'khinzir': 'negative', 'cmnie': 'positive', 'kecikk': 'positive', 'instafemes': 'positive', 'siuk': 'positive', 'gangguan': 'negative', 'kerusakan': 'negative', 'permasalahan': 'negative', 'berisiko': 'negative', 'beresiko': 'positive', 'rentan': 'negative', 'berpotensi': 'negative', 'disyaki': 'negative', 'mengetuk': 'negative', 'membukakan': 'negative', 'bukain': 'negative', 'ngetok': 'negative', 'bukakan': 'negative', (continues on next page)


(continued from previous page) 'memutuskan': 'negative', 'berkomitmen': 'positive', 'berencana': 'negative', 'berniat': 'negative', 'diminta': 'negative', 'penceroboh': 'negative', 'keperpercayaan': 'positive', 'coherence': 'positive', 'lgdnya': 'positive', "deto'x": 'positive', 'sindiran': 'negative', 'heroik': 'positive', 'ceramahnya': 'positive', 'petuah': 'negative', 'ketegasan': 'negative', 'hukuman': 'negative', 'pidana': 'negative', 'sanksi': 'negative', 'najis': 'negative', 'cicak': 'negative', 'iblis': 'negative', 'depresi': 'negative', 'mengharamkan': 'negative', 'memaknai': 'negative', 'meragukan': 'negative', 'mengedepankan': 'negative', 'kelaparan': 'negative', 'kesepian': 'negative', 'tenggelam': 'negative', 'gelisah': 'negative', 'terluka': 'negative', 'korupsi': 'negative', 'makar': 'negative', 'kriminal': 'negative', 'vandalisme': 'negative', 'penipuan': 'negative', 'kebencian': 'negative', 'kebohongan': 'negative', 'hoaks': 'negative', 'dusta': 'negative', 'inflasi': 'negative', 'apbn': 'negative', 'trauma': 'negative', 'mual': 'negative', 'stres': 'negative', 'badmood': 'negative', 'keradangan': 'negative', 'pigmentasi': 'negative', 'peradangan': 'negative', 'keletihan': 'negative', 'selulit': 'negative', 'kesilapan': 'negative', 'kesalahan': 'negative', 'kemusnahan': 'negative', 'perbendeharaan': 'positive', 'romanticist': 'positive', 'deseu2': 'positive', (continues on next page)


(continued from previous page) 'menyjilat': 'positive', 'benci': 'negative', 'menyampah': 'positive', 'jijik': 'negative', 'kagum': 'positive', 'geli': 'positive', 'mendesak': 'negative', 'mengkritik': 'negative', 'menggesa': 'negative', 'menghimbau': 'negative', 'diperintah': 'negative', 'tahap': 'negative', 'level': 'negative', 'fasa': 'negative', 'tingkat': 'negative', 'babak': 'negative', 'praktikal': 'negative', 'kaunseling': 'negative', 'stpm': 'negative', 'pt3': 'negative', 'practical': 'negative', 'dahsyat': 'negative', 'tragis': 'negative', 'dasyat': 'negative', 'kematian': 'negative', 'pembunuhan': 'negative', 'kekalahan': 'negative', 'kebodohan': 'negative', 'pembelotan': 'negative', 'bis2lo': 'negative', 'nepisnya': 'positive', 'stabizernya': 'negative', 'dziewczynka': 'negative', 'mengkhianati': 'negative', 'mengabaikan': 'negative', 'menyembah': 'negative', 'meremehkan': 'negative', 'perbuatannya': 'negative', 'protes': 'negative', 'kritik': 'negative', 'dibela': 'negative', 'rekonsiliasi': 'negative', 'diusir': 'negative', 'tuduhan': 'negative', 'dakwaan': 'negative', 'perbuatan': 'negative', 'tuntutan': 'negative', 'dadah': 'negative', 'hey': 'positive', 'astagfirullah': 'negative', 'heh': 'negative', 'fak': 'positive', 'ditakuti': 'negative', 'diharamkan': 'negative', 'dicintai': 'positive', 'nasionalis': 'negative', 'mengalir': 'negative', (continues on next page)


(continued from previous page) 'tumpah': 'negative', 'merebak': 'negative', 'dimasukkan': 'negative', 'terjun': 'negative', 'mencederakan': 'negative', 'mummuy': 'positive', 'pkdnya': 'positive', 'dilepasi': 'positive', 'tolak': 'negative', 'keluarkan': 'negative', 'tuntut': 'negative', 'pegang': 'negative', 'kutip': 'negative', 'khianat': 'negative', 'bersaksi': 'negative', 'dipersalahkan': 'positive', 'menyeksa': 'negative', 'morah2': 'positive', 'hakimnegara': 'positive', 'princemmed': 'positive', 'bedaken': 'positive', 'kemelesetan': 'negative', 'raauww': 'positive', "'aiyok": 'positive', '15dan': 'positive', 'huina': 'positive', 'melumpuhkan': 'negative', 'dipercayakan': 'positive', 'direbut': 'negative', 'menyasar': 'positive', 'mengetuai': 'negative', 'kesengsaraan': 'negative', 'kebermanfaatan': 'positive', 'kegelisahan': 'negative', 'berkabung': 'negative', 'berbasikal': 'positive', 'berbisnes': 'negative', 'memuncak': 'positive', 'berbahas': 'negative', 'pengakuan': 'negative', 'kesaksian': 'negative', 'pernyataan': 'negative', 'perang': 'negative', 'neraca': 'negative', 'negosiasi': 'negative', 'kebangkitan': 'positive', 'menyerahkan': 'negative', 'menyalurkan': 'negative', 'membagikan': 'negative', 'serahkan': 'negative', 'mengajukan': 'negative', 'hutang': 'negative', 'utang': 'negative', 'pendapatan': 'negative', 'pajak': 'negative', 'cukai': 'negative', 'saingan': 'negative', (continues on next page)


(continued from previous page) 'trofi': 'positive', 'pertarungan': 'negative', 'kompetisi': 'negative', 'klasemen': 'negative', 'mengeruhkan': 'negative', 'zuaini': 'positive', 'sedip': 'positive', '7572687': 'positive', 'sesiapo': 'positive', 'mengemis': 'negative', 'tanyaa': 'negative', 'feeling2': 'positive', 'berdendam': 'negative', 'bermasalah': 'negative', 'sensitif': 'positive', 'terganggu': 'negative', 'berjerawat': 'positive', 'menghitam': 'positive', 'disaster': 'negative', 'ngisahin': 'positive', 'butoset': 'positive', 'stuffed': 'positive', 'kayk': 'positive', 'rapuh': 'negative', 'rebah': 'negative', 'mengering': 'positive', 'kaku': 'negative', 'hti': 'negative', 'syaitan': 'negative', 'pembohong': 'negative', 'opposition': 'negative', 'accord': 'positive', 'hone': 'positive', 'writternya': 'positive', 'memahat': 'positive', 'dikawal': 'negative', 'ditangani': 'negative', 'diselamatkan': 'negative', 'diselesaikan': 'negative', 'dilewati': 'negative', 'beracun': 'negative', 'lazim': 'positive', 'merbahaya': 'positive', 'mengkilap': 'positive', 'berbahaya': 'negative', 'gross': 'negative', 'paint': 'positive', 'bunny': 'positive', 'teriyaki': 'positive', 'panther': 'positive', 'menghantui': 'negative', 'menyiksa': 'negative', 'menuntun': 'negative', 'cintakan': 'negative', 'membohongi': 'negative', 'bodoh': 'negative', 'bangang': 'negative', (continues on next page)


(continued from previous page) 'bodo': 'positive', 'noob': 'negative', 'merenggangkan': 'negative', 'nowel2': 'positive', 'memmpesonahh': 'positive', 'sotoguk': 'positive', 'promotinggal2harilagiburuuaann': 'positive', 'polemik': 'negative', 'penahanan': 'negative', 'usulan': 'negative', 'pertikaian': 'negative', 'sejarahnya': 'negative', 'kejanggalan': 'negative', 'petaka': 'negative', 'tamparan': 'negative', 'takut': 'negative', 'risau': 'negative', 'malu': 'negative', 'segan': 'negative', 'ketinggalan': 'negative', 'kehabisan': 'negative', 'kebagian': 'negative', 'lewatkan': 'negative', 'terlepas': 'negative', 'paksaan': 'negative', 'kejelasan': 'negative', 'batasnya': 'negative', 'halangan': 'negative', 'bingung': 'negative', 'penasaran': 'positive', 'mikir': 'negative', 'kepikiran': 'negative', 'males': 'negative', 'ditinggalkan': 'negative', 'dibunuh': 'negative', 'dihina': 'negative', 'dijalani': 'negative', 'dilanda': 'negative', 'mengidap': 'negative', 'picu': 'negative', 'memicu': 'negative', 'terjangkit': 'negative', 'penyerang': 'negative', 'gelandang': 'negative', 'pembalap': 'negative', 'manajer': 'negative', 'kiper': 'negative', 'mencurigai': 'negative', 'zemwah': 'positive', 'enenenenenene': 'positive', 'destroyers': 'positive', 'norsyida': 'positive', 'memarahi': 'negative', 'dereta': 'positive', 'pengambil': 'positive', 'menjudge': 'positive', 'disodorin': 'positive', (continues on next page)


(continued from previous page) 'disentuh': 'negative', 'memakainya': 'negative', 'membacanya': 'negative', 'dicerna': 'negative', 'dihilangkan': 'negative', 'membimbangkan': 'negative', 'dibaiat': 'positive', 'memenatkan': 'negative', 'diingati': 'positive', 'perosak': 'negative', 'penghianat': 'negative', 'pembela': 'negative', 'perusak': 'negative', 'minoriti': 'negative', 'kemudaratan': 'negative', 'kainavailable': 'positive', 'angesti': 'positive', 'konsta': 'positive', 'togor2': 'positive', 'menangkis': 'negative', 'gobindh': 'positive', "k'sasar": 'positive', 'mgnr': 'positive', 'kemesu': 'positive', 'rugi': 'negative', 'untung': 'negative', 'berdosa': 'negative', 'berbaloi': 'positive', 'terasa': 'negative', 'merasa': 'negative', 'berdebar': 'negative', 'terlihat': 'positive', 'berasa': 'negative', 'tebusan': 'negative', '082257468845': 'positive', 'penghakiman': 'positive', 'dihafal': 'positive', 'kecelaruan': 'negative', 'pakvwi': 'positive', 'mwamuna': 'positive', 'hapepend': 'positive', 'mengekuarkan': 'positive', 'kasar': 'negative', 'kotor': 'negative', 'halus': 'positive', 'kusam': 'positive', 'memaksa': 'negative', 'menyayangi': 'negative', 'menyuruh': 'negative', 'menyakiti': 'negative', 'fanatik': 'negative', 'toleran': 'positive', 'zalim': 'negative', 'atheis': 'negative', 'kemiskinan': 'negative', 'pelampau': 'negative', 'dicekal': 'positive', (continues on next page)


(continued from previous page) 'ysfheartnezia': 'positive', 'photograther': 'positive', 'ntuh': 'positive', 'takot': 'negative', 'teror': 'negative', 'menyerang': 'negative', 'membunuh': 'negative', 'membela': 'negative', 'menolong': 'negative', 'menjatuhkan': 'negative', 'menyamakan': 'negative', 'meninggalkan': 'negative', 'menemui': 'negative', 'tinggalkan': 'negative', 'menemukan': 'negative', 'mengubah': 'negative', 'miskin': 'negative', 'goblok': 'negative', 'jelek': 'negative', 'jomblo': 'negative', 'bego': 'negative', 'siber': 'negative', 'undang2': 'negative', 'menangis': 'negative', 'nangis': 'negative', 'tertidur': 'negative', 'tertunggak': 'negative', 'langsai': 'positive', 'rm2k': 'positive', 'rm450': 'negative', 'xsilap': 'positive', 'lucah': 'negative', 'porno': 'negative', 'semburit': 'negative', 'seks': 'negative', '3gp': 'negative', 'mengalami': 'negative', 'menderita': 'negative', 'merasakan': 'negative', 'menyebabkan': 'negative', 'musnah': 'negative', 'lenyap': 'negative', 'sengsara': 'negative', 'stereotaip': 'negative', 'ahmbs': 'positive', 'radangmembaik': 'positive', 'escapepenang': 'positive', 'f7szfx': 'positive', 'ironinya': 'negative', 'moyez': 'positive', 'mauloee': 'positive', 'ndakanamirana': 'positive', 'skf3013': 'positive', 'pergolakan': 'negative', 'gelembung': 'negative', 'menghadkan': 'negative', 'wardrobenya': 'positive', (continues on next page)


(continued from previous page) 'anrara': 'positive', 'tukaanza': 'positive', 'tersebutnya': 'positive', 'hamba': 'negative', 'hambanya': 'negative', 'firman': 'negative', 'takdir': 'negative', 'rasul': 'negative', 'memburukkan': 'negative', 'tubuhkan': 'negative', 'menggulingkan': 'negative', 'meruntuhkan': 'negative', 'membantai': 'negative', 'haiwan': 'negative', 'dajjal': 'negative', 'penyamun': 'negative', 'sampah': 'negative', 'rumput': 'negative', 'racun': 'negative', 'rokok': 'negative', 'dengki': 'negative', 'jeles': 'positive', 'sombong': 'negative', 'hasutan': 'negative', 'palsu': 'negative', 'negatif': 'negative', ...}

[8]: %%time

     results_emotion, scores_emotion, labels_emotion = malaya.lexicon.random_walk(emotion_lexicon, wordvector, pool_size=10)
populating nearest words from wordvector
populating vectors from populated nearest words
random walking from populated vectors
CPU times: user 5.9 s, sys: 3.13 s, total: 9.03 s
Wall time: 1.5 s

[9]: np.unique(list(results_emotion.values()), return_counts=True)
[9]: (array(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'], dtype='

[10]: results_emotion [10]: {'sebal': 'anger', 'gesture': 'anger', 'se7': 'anger', 'ziraa': 'love', 'mantepp': 'love', 'mesem': 'love', (continues on next page)


(continued from previous page) 'nggapapa': 'love', 'maen2': 'love', 'gacocok': 'anger', 'jeongwoo': 'love', 'bergelora': 'anger', 'mereda': 'anger', 'skeptis': 'anger', 'gebus': 'love', 'tyrion': 'love', 'memuncak': 'anger', 'mewabah': 'love', 'mengenaskan': 'anger', 'kesasar': 'love', 'kepedean': 'love', 'annoying': 'anger', 'awkward': 'fear', 'scary': 'fear', 'handsome': 'fear', 'nervous': 'fear', 'cringe': 'fear', 'menyampah': 'fear', 'kelakar': 'fear', 'cute': 'fear', 'cuak': 'fear', 'bodoh': 'anger', 'bangang': 'anger', 'bebal': 'anger', 'bodo': 'fear', 'noob': 'fear', 'bengap': 'fear', 'celaka': 'fear', 'biadap': 'fear', 'pukimak': 'fear', 'berang': 'anger', 'buru': 'anger', 'nerus': 'anger', 'kangsar': 'anger', 'lipis': 'anger', 'pilah': 'anger', 'besut': 'anger', 'krai': 'anger', 'klawang': 'anger', 'ketil': 'anger', 'amuk': 'anger', 'mbatin': 'love', 'sebarin': 'love', 'sebarisan': 'love', 'ngalami': 'love', 'tikt': 'love', 'diharga': 'love', 'threesome': 'love', 'shizuka': 'love', 'bokondini': 'love', 'mendidih': 'anger', 'mengental': 'anger', 'sebati': 'anger', 'mengembang': 'anger', (continues on next page)


(continued from previous page) 'layu': 'anger', 'kecoklatan': 'anger', 'matang': 'anger', 'meresap': 'anger', 'mengering': 'anger', 'direbus': 'anger', 'pengecut': 'anger', 'bajingan': 'anger', 'pembohong': 'anger', 'pecundang': 'anger', 'dungu': 'anger', 'pemberani': 'anger', 'negarawan': 'anger', 'jahil': 'anger', 'biadab': 'anger', 'provokator': 'anger', 'bengang': 'anger', 'menyirap': 'fear', 'meluat': 'fear', 'frust': 'fear', 'rimas': 'fear', 'annoyed': 'fear', 'lonely': 'fear', 'berdukacita': 'anger', 'menyakitimu': 'anger', 'bersinggungan': 'love', 'bermesra': 'love', 'meridhoi': 'anger', 'menyelubungi': 'love', 'empukk': 'love', 'berserban': 'love', 'diracuni': 'love', 'dibayangi': 'love', 'jengkel': 'anger', 'gugup': 'anger', 'dibiasain': 'love', 'mubazir': 'anger', 'amnesia': 'anger', 'psikopat': 'anger', 'gumoh': 'love', 'diurusin': 'love', 'ngangenin': 'anger', 'purging': 'anger', 'babi': 'anger', 'sial': 'fear', 'kimak': 'fear', 'anjing': 'anger', 'pundek': 'fear', 'cibai': 'fear', 'setan': 'anger', 'lembu': 'anger', 'pedar': 'anger', 'sanwya': 'love', 'qabaya': 'love', '5pac': 'love', 'wa082336409906': 'love', 'mpibg': 'love', (continues on next page)


(continued from previous page) 'honachahthu': 'anger', 'unieleven': 'love', 'mengepilkan': 'anger', 'ciknorzaidi': 'love', 'benci': 'anger', 'jijik': 'fear', 'kagum': 'surprise', 'geli': 'fear', 'insecure': 'fear', 'geram': 'fear', 'respect': 'fear', 'jealous': 'fear', 'marah': 'anger', 'maki': 'fear', 'merajuk': 'fear', 'marah2': 'surprise', 'perli': 'fear', 'jeles': 'fear', 'tegur': 'fear', 'kecam': 'fear', 'cemburu': 'surprise', 'bitter': 'fear', 'ngeri': 'fear', 'serem': 'fear', 'kocak': 'fear', 'mantep': 'fear', 'miris': 'fear', 'ngeselin': 'fear', 'nyesek': 'fear', 'kesel': 'fear', 'sebel': 'fear', 'lebay': 'fear', 'phobia': 'fear', 'mendem': 'love', 'berideologi': 'love', 'niru': 'love', 'nyicip': 'love', 'ngerawat': 'fear', 'riweuh': 'anger', 'nmun': 'love', 'ngancam': 'love', 'bencong': 'love', 'anxiety': 'fear', 'glasses': 'love', 'manners': 'fear', 'satan': 'love', 'popularity': 'love', 'curl': 'love', 'impossible': 'love', 'mayb': 'love', 'sperm': 'love', 'nyumpah': 'love', 'fitnah': 'fear', 'hoax': 'fear', 'provokasi': 'fear', 'kebencian': 'fear', 'dusta': 'fear', (continues on next page)


(continued from previous page) 'hoaks': 'fear', 'kebohongan': 'fear', 'bohong': 'fear', 'rasis': 'anger', 'ngibul': 'fear', 'horror': 'fear', 'horor': 'fear', 'romance': 'fear', 'day6': 'fear', 'dokumenter': 'fear', 'porno': 'fear', 'anime': 'fear', 'sinetron': 'fear', 'drakor': 'fear', 'dangdut': 'fear', 'takut': 'fear', 'risau': 'fear', 'malu': 'fear', 'khawatir': 'sadness', 'segan': 'fear', 'kecewa': 'sadness', 'takot': 'fear', 'bimbang': 'sadness', 'takutnya': 'fear', 'sedih': 'sadness', 'panic': 'fear', 'loud': 'love', 'impressed': 'love', 'expected': 'love', 'dying': 'love', 'rush': 'fear', 'shitty': 'love', 'smoke': 'fear', 'suck': 'fear', 'cheap': 'fear', 'emo': 'fear', 'boring': 'fear', 'gelabah': 'fear', 'ngantok': 'fear', 'syok': 'joy', 'seronok': 'joy', 'busy': 'fear', 'serabut': 'fear', 'syiok': 'fear', 'sendu': 'fear', 'riang': 'joy', 'ceria': 'sadness', 'takbir': 'joy', 'bersuka': 'anger', 'emma': 'love', 'barakah': 'anger', 'telemovie': 'anger', 'riuh': 'anger', 'ria': 'joy', 'khutbah': 'joy', 'sebak': 'fear', 'excited': 'fear', (continues on next page)


(continued from previous page) 'terharu': 'surprise', 'terliur': 'fear', 'girang': 'joy', 'ditikung': 'love', 'ambis': 'anger', 'rafa': 'love', 'digangguin': 'love', 'nyiksa': 'anger', 'maruk': 'love', 'tamvan': 'love', 'pengap': 'love', 'iklas': 'love', 'puas': 'joy', 'muak': 'sadness', 'kenyang': 'fear', 'lega': 'fear', 'bosan': 'fear', 'berbaloi': 'fear', 'berpuas': 'sadness', 'lelah': 'sadness', 'bahagia': 'joy', 'menyenangkan': 'sadness', 'gelisah': 'sadness', 'nyaman': 'sadness', 'indah': 'sadness', 'sukses': 'sadness', 'sehat': 'sadness', 'damai': 'sadness', 'suka': 'joy', 'sukanya': 'fear', 'doyan': 'fear', 'demen': 'fear', 'suke': 'fear', 'gasuka': 'fear', 'gemar': 'fear', 'sukaa': 'fear', 'prefer': 'fear', 'happy': 'joy', 'hepi': 'love', 'wish': 'fear', 'nice': 'fear', 'cerita': 'joy', 'citer': 'fear', 'cite': 'fear', 'crita': 'fear', 'kisah': 'love', 'percakapan': 'joy', 'tweet': 'fear', 'drama': 'fear', 'lagu': 'fear', 'ceramah': 'joy', 'cinta': 'love', 'kebahagiaan': 'love', 'cintanya': 'sadness', 'cintaku': 'sadness', 'persahabatan': 'love', 'cintamu': 'sadness', (continues on next page)


(continued from previous page) 'kesabaran': 'love', 'dendam': 'sadness', 'kesedihan': 'sadness', 'asa': 'love', 'baby': 'love', 'daddy': 'love', 'mira': 'fear', 'princess': 'love', 'bella': 'love', 'farah': 'love', 'mommy': 'love', 'sister': 'love', 'mummy': 'love', 'lisa': 'love', 'love': 'love', 'luv': 'love', 'hate': 'love', 'thought': 'fear', 'mean': 'fear', 'want': 'fear', 'see': 'fear', 'need': 'fear', 'hope': 'fear', 'peace': 'fear', 'syang': 'love', 'noi': 'love', 'bilang2': 'love', 'syng': 'love', 'mut': 'love', 'ribbey': 'love', 'seneng2': 'love', 'butoset': 'love', 'manly': 'love', 'twet': 'love', 'syg': 'love', 'sayangg': 'love', 'sayang': 'love', 'bby': 'love', 'cntik': 'fear', 'knl': 'surprise', 'anon': 'fear', 'sistur': 'love', 'sayang2': 'love', 'bgus': 'fear', 'rindukn': 'love', 'ajeb2an': 'love', 'hshakjsjsbs': 'love', 'miliknyamencatat': 'love', 'p6a': 'love', 'ahsjahhaa': 'love', 'diwajibk': 'love', 'protese': 'love', 'botaqin': 'love', 'kruntel': 'love', 'rindu': 'love', 'sayangku': 'love', 'sayangkan': 'love', (continues on next page)


(continued from previous page) 'sayangnya': 'love', 'disayang': 'anger', 'moody': 'fear', 'rindukan': 'love', 'merindui': 'love', 'takutkan': 'love', 'banggakan': 'love', 'cintakan': 'love', 'perbuat': 'surprise', 'merindukan': 'love', 'ceraikan': 'love', 'jumpai': 'love', 'rindunya': 'fear', 'teringat': 'fear', 'rinduu': 'fear', 'lapar': 'fear', 'kempunan': 'fear', 'teringin': 'fear', 'kangen': 'fear', 'confuse': 'fear', 'stress': 'fear', 'letih': 'fear', 'penat': 'fear', 'stres': 'sadness', 'mengantuk': 'fear', 'tertekan': 'sadness', 'terganggu': 'sadness', 'tertipu': 'surprise', 'keliru': 'surprise', 'mengeluh': 'sadness', 'merosot': 'sadness', 'susut': 'sadness', 'terjebak': 'surprise', 'terpengaruh': 'surprise', 'kesal': 'sadness', 'terkejut': 'surprise', 'bersalah': 'sadness', 'berdosa': 'fear', 'dihargai': 'sadness', 'janggal': 'anger', 'resah': 'sadness', 'kesepian': 'sadness', 'gundah': 'sadness', 'goyah': 'sadness', 'disakiti': 'sadness', 'takjub': 'sadness', 'sengsara': 'sadness', 'seram': 'fear', 'menyebalkan': 'sadness', 'merana': 'fear', 'melarat': 'anger', 'angkuh': 'sadness', 'rakus': 'sadness', 'terpuruk': 'sadness', 'pengsan': 'surprise', 'tertido': 'fear', 'pitam': 'surprise', (continues on next page)


(continued from previous page) 'terlelap': 'surprise', 'terberak': 'surprise', 'nanges': 'fear', 'mengamuk': 'fear', 'tdoq': 'fear', 'termuntah': 'surprise', 'tidor': 'surprise', 'bangga': 'surprise', 'surprise': 'surprise', 'suprise': 'surprise', 'makan2': 'surprise', 'attention': 'fear', 'kejutan': 'surprise', 'assignment': 'fear', 'comeback': 'surprise', 'chance': 'fear', 'homework': 'surprise', 'appointment': 'surprise', 'wtf': 'surprise', 'huh': 'fear', 'seriously': 'fear', 'omg': 'fear', 'aik': 'fear', 'wth': 'fear', 'shit': 'fear', 'apoo': 'fear', 'hah': 'fear', 'damn': 'fear', 'stun': 'surprise', 'pinafsueun': 'love', 'neelehh': 'love', 'rudgard': 'love', '016344981': 'love', 'pramaandika': 'love', 'hamidibahawa': 'love', 'spesialers': 'love', 'superpignan': 'love', '082187486748': 'love', 'tertanya2': 'surprise', 'terperanjat': 'surprise', 'cubaa': 'surprise', 'stuju': 'surprise', 'stayback': 'love', 'cakaplah': 'surprise', 'melebih': 'anger', 'tanyaa': 'surprise', 'ngandung': 'surprise'}


9.31.6 propagate probabilistic

def propagate_probabilistic(
    lexicon,
    wordvector,
    pool_size = 10,
    top_n = 20,
    similarity_power = 10.0,
    arccos = True,
    normalization = True,
    soft = False,
    silent = False,
):
    """
    Learns polarity scores via standard label propagation from lexicon sets.

    Parameters
    ----------
    lexicon: dict
        curated lexicon from expert domain, {'label1': [str], 'label2': [str]}.
    wordvector: object
        wordvector interface object.
    pool_size: int, optional (default=10)
        pick top pool_size words from each lexicon.
    top_n: int, optional (default=20)
        top_n nearest words for each vector will be multiplied with `similarity_power`.
    similarity_power: float, optional (default=10.0)
        extra score for `top_n`; a lower value induces less bias but has a higher chance of an unbalanced outcome.
    arccos: bool, optional (default=True)
        covariance distribution for embedded.dot(embedded.T). If False, covariance + 1.
    normalization: bool, optional (default=True)
        normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with the nearest Jaro-Winkler ratio.
        if False, it will throw an exception if a word is not in the dictionary.
    silent: bool, optional (default=False)
        if True, will not print any logs.

    Returns
    -------
    result: tuple(labels[argmax(scores), axis = 1], scores, labels)
    """
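As a quick orientation before the call below, here is a minimal sketch of the expected inputs. The seed words are illustrative assumptions, not the exact curated lexicon used in this notebook, and the 'socialmedia' wordvector choice is just one option from the malaya.wordvector API:

import malaya

# hypothetical seed lexicon: every label maps to a list of curated seed words
emotion_lexicon = {
    'anger': ['marah', 'geram'],
    'fear': ['takut', 'risau'],
    'joy': ['gembira', 'seronok'],
    'love': ['sayang', 'cinta'],
    'sadness': ['sedih', 'kecewa'],
    'surprise': ['terkejut', 'terperanjat'],
}

# load any Malaya wordvector interface, e.g. the social media vocabulary
vocab, embedded = malaya.wordvector.load(model = 'socialmedia')
wordvector = malaya.wordvector.WordVector(embedded, vocab)

results, scores, labels = malaya.lexicon.propagate_probabilistic(
    emotion_lexicon, wordvector, pool_size = 10
)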

[11]: %%time

      results_emotion, scores_emotion, labels_emotion = malaya.lexicon.propagate_probabilistic(
          emotion_lexicon, wordvector, pool_size = 10
      )


populating nearest words from wordvector
populating vectors from populated nearest words
propagating probabilistic from populated vectors

CPU times: user 5.64 s, sys: 2.05 s, total: 7.68 s
Wall time: 1.29 s

[12]: np.unique(list(results_emotion.values()), return_counts = True)
[12]: (array(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'], dtype='
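If you want to eyeball which words were propagated into each label, a small sketch using only the standard library:

from collections import defaultdict

words_by_label = defaultdict(list)
for word, label in results_emotion.items():
    words_by_label[label].append(word)

# peek at a handful of words propagated into the 'love' label
print(words_by_label['love'][:10])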

[13]: results_emotion [13]: {'sebal': 'anger', 'gesture': 'anger', 'se7': 'anger', 'ziraa': 'anger', 'mantepp': 'anger', 'mesem': 'anger', 'nggapapa': 'anger', 'maen2': 'anger', 'gacocok': 'anger', 'jeongwoo': 'anger', 'bergelora': 'anger', 'mereda': 'anger', 'skeptis': 'anger', 'gebus': 'anger', 'tyrion': 'anger', 'memuncak': 'anger', 'mewabah': 'anger', 'mengenaskan': 'anger', 'kesasar': 'anger', 'kepedean': 'anger', 'annoying': 'anger', 'awkward': 'fear', 'scary': 'fear', 'handsome': 'anger', 'nervous': 'fear', 'cringe': 'fear', 'menyampah': 'fear', 'kelakar': 'anger', 'cute': 'anger', 'cuak': 'fear', 'bodoh': 'anger', 'bangang': 'anger', 'bebal': 'anger', 'bodo': 'anger', 'noob': 'anger', 'bengap': 'anger', 'celaka': 'anger', 'biadap': 'anger', 'pukimak': 'anger', 'berang': 'anger', 'buru': 'anger', 'nerus': 'anger', 'kangsar': 'anger', 'lipis': 'anger', (continues on next page)


(continued from previous page) 'pilah': 'anger', 'besut': 'anger', 'krai': 'anger', 'klawang': 'anger', 'ketil': 'anger', 'amuk': 'anger', 'mbatin': 'anger', 'sebarin': 'anger', 'sebarisan': 'anger', 'ngalami': 'anger', 'tikt': 'anger', 'diharga': 'anger', 'threesome': 'anger', 'shizuka': 'anger', 'bokondini': 'anger', 'mendidih': 'anger', 'mengental': 'anger', 'sebati': 'anger', 'mengembang': 'anger', 'layu': 'anger', 'kecoklatan': 'anger', 'matang': 'anger', 'meresap': 'anger', 'mengering': 'anger', 'direbus': 'anger', 'pengecut': 'anger', 'bajingan': 'anger', 'pembohong': 'anger', 'pecundang': 'anger', 'dungu': 'anger', 'pemberani': 'anger', 'negarawan': 'anger', 'jahil': 'anger', 'biadab': 'anger', 'provokator': 'anger', 'bengang': 'anger', 'menyirap': 'fear', 'meluat': 'anger', 'frust': 'fear', 'rimas': 'fear', 'annoyed': 'fear', 'lonely': 'fear', 'berdukacita': 'anger', 'menyakitimu': 'anger', 'bersinggungan': 'anger', 'bermesra': 'anger', 'meridhoi': 'anger', 'menyelubungi': 'anger', 'empukk': 'anger', 'berserban': 'anger', 'diracuni': 'anger', 'dibayangi': 'anger', 'jengkel': 'anger', 'gugup': 'anger', 'dibiasain': 'anger', 'mubazir': 'anger', 'amnesia': 'anger', (continues on next page)


(continued from previous page) 'psikopat': 'anger', 'gumoh': 'anger', 'diurusin': 'anger', 'ngangenin': 'anger', 'purging': 'anger', 'babi': 'anger', 'sial': 'anger', 'kimak': 'anger', 'anjing': 'anger', 'pundek': 'anger', 'cibai': 'anger', 'setan': 'anger', 'lembu': 'anger', 'pedar': 'anger', 'sanwya': 'anger', 'qabaya': 'anger', '5pac': 'anger', 'wa082336409906': 'anger', 'mpibg': 'anger', 'honachahthu': 'anger', 'unieleven': 'anger', 'mengepilkan': 'anger', 'ciknorzaidi': 'anger', 'benci': 'anger', 'jijik': 'fear', 'kagum': 'surprise', 'geli': 'anger', 'insecure': 'fear', 'geram': 'anger', 'respect': 'anger', 'jealous': 'fear', 'marah': 'anger', 'maki': 'anger', 'merajuk': 'anger', 'marah2': 'anger', 'perli': 'anger', 'jeles': 'fear', 'tegur': 'anger', 'kecam': 'anger', 'cemburu': 'surprise', 'bitter': 'anger', 'ngeri': 'fear', 'serem': 'anger', 'kocak': 'anger', 'mantep': 'anger', 'miris': 'fear', 'ngeselin': 'anger', 'nyesek': 'anger', 'kesel': 'fear', 'sebel': 'fear', 'lebay': 'anger', 'phobia': 'fear', 'mendem': 'anger', 'berideologi': 'anger', 'niru': 'anger', 'nyicip': 'anger', 'ngerawat': 'anger', (continues on next page)


(continued from previous page) 'riweuh': 'anger', 'nmun': 'anger', 'ngancam': 'anger', 'bencong': 'anger', 'anxiety': 'fear', 'glasses': 'anger', 'manners': 'anger', 'satan': 'anger', 'popularity': 'anger', 'curl': 'anger', 'impossible': 'anger', 'mayb': 'anger', 'sperm': 'anger', 'nyumpah': 'anger', 'fitnah': 'fear', 'hoax': 'anger', 'provokasi': 'anger', 'kebencian': 'anger', 'dusta': 'anger', 'hoaks': 'anger', 'kebohongan': 'anger', 'bohong': 'anger', 'rasis': 'anger', 'ngibul': 'anger', 'horror': 'fear', 'horor': 'fear', 'romance': 'anger', 'day6': 'anger', 'dokumenter': 'anger', 'porno': 'anger', 'anime': 'anger', 'sinetron': 'anger', 'drakor': 'anger', 'dangdut': 'anger', 'takut': 'fear', 'risau': 'fear', 'malu': 'fear', 'khawatir': 'sadness', 'segan': 'fear', 'kecewa': 'sadness', 'takot': 'fear', 'bimbang': 'sadness', 'takutnya': 'anger', 'sedih': 'sadness', 'panic': 'fear', 'loud': 'anger', 'impressed': 'anger', 'expected': 'anger', 'dying': 'anger', 'rush': 'anger', 'shitty': 'anger', 'smoke': 'anger', 'suck': 'anger', 'cheap': 'anger', 'emo': 'fear', 'boring': 'fear', 'gelabah': 'fear', (continues on next page)


(continued from previous page) 'ngantok': 'fear', 'syok': 'joy', 'seronok': 'joy', 'busy': 'fear', 'serabut': 'fear', 'syiok': 'anger', 'sendu': 'fear', 'riang': 'joy', 'ceria': 'joy', 'takbir': 'anger', 'bersuka': 'anger', 'emma': 'love', 'barakah': 'anger', 'telemovie': 'anger', 'riuh': 'anger', 'ria': 'anger', 'khutbah': 'anger', 'sebak': 'fear', 'excited': 'fear', 'terharu': 'surprise', 'terliur': 'fear', 'girang': 'joy', 'ditikung': 'anger', 'ambis': 'anger', 'rafa': 'anger', 'digangguin': 'anger', 'nyiksa': 'anger', 'maruk': 'anger', 'tamvan': 'anger', 'pengap': 'anger', 'iklas': 'anger', 'puas': 'joy', 'muak': 'fear', 'kenyang': 'fear', 'lega': 'fear', 'bosan': 'fear', 'berbaloi': 'fear', 'berpuas': 'sadness', 'lelah': 'sadness', 'bahagia': 'joy', 'menyenangkan': 'sadness', 'gelisah': 'sadness', 'nyaman': 'sadness', 'indah': 'sadness', 'sukses': 'anger', 'sehat': 'sadness', 'damai': 'sadness', 'suka': 'joy', 'sukanya': 'anger', 'doyan': 'anger', 'demen': 'anger', 'suke': 'anger', 'gasuka': 'anger', 'gemar': 'anger', 'sukaa': 'anger', 'prefer': 'anger', 'happy': 'joy', (continues on next page)


(continued from previous page) 'hepi': 'anger', 'wish': 'anger', 'nice': 'fear', 'cerita': 'joy', 'citer': 'fear', 'cite': 'fear', 'crita': 'anger', 'kisah': 'love', 'percakapan': 'anger', 'tweet': 'fear', 'drama': 'anger', 'lagu': 'anger', 'ceramah': 'anger', 'cinta': 'love', 'kebahagiaan': 'anger', 'cintanya': 'anger', 'cintaku': 'anger', 'persahabatan': 'anger', 'cintamu': 'anger', 'kesabaran': 'anger', 'dendam': 'sadness', 'kesedihan': 'anger', 'asa': 'sadness', 'baby': 'love', 'daddy': 'love', 'mira': 'love', 'princess': 'anger', 'bella': 'love', 'farah': 'love', 'mommy': 'love', 'sister': 'love', 'mummy': 'love', 'lisa': 'love', 'love': 'love', 'luv': 'love', 'hate': 'anger', 'thought': 'anger', 'mean': 'anger', 'want': 'anger', 'see': 'anger', 'need': 'anger', 'hope': 'anger', 'peace': 'anger', 'syang': 'love', 'noi': 'anger', 'bilang2': 'anger', 'syng': 'anger', 'mut': 'anger', 'ribbey': 'anger', 'seneng2': 'anger', 'butoset': 'anger', 'manly': 'anger', 'twet': 'anger', 'syg': 'love', 'sayangg': 'anger', 'sayang': 'love', 'bby': 'anger', (continues on next page)


(continued from previous page) 'cntik': 'anger', 'knl': 'anger', 'anon': 'anger', 'sistur': 'anger', 'sayang2': 'anger', 'bgus': 'anger', 'rindukn': 'love', 'ajeb2an': 'anger', 'hshakjsjsbs': 'anger', 'miliknyamencatat': 'anger', 'p6a': 'anger', 'ahsjahhaa': 'anger', 'diwajibk': 'anger', 'protese': 'anger', 'botaqin': 'anger', 'kruntel': 'anger', 'rindu': 'love', 'sayangku': 'anger', 'sayangkan': 'anger', 'sayangnya': 'love', 'disayang': 'anger', 'moody': 'fear', 'rindukan': 'love', 'merindui': 'anger', 'takutkan': 'anger', 'banggakan': 'anger', 'cintakan': 'anger', 'perbuat': 'anger', 'merindukan': 'anger', 'ceraikan': 'anger', 'jumpai': 'anger', 'rindunya': 'fear', 'teringat': 'fear', 'rinduu': 'fear', 'lapar': 'fear', 'kempunan': 'fear', 'teringin': 'fear', 'kangen': 'fear', 'confuse': 'fear', 'stress': 'sadness', 'letih': 'fear', 'penat': 'fear', 'stres': 'sadness', 'mengantuk': 'fear', 'tertekan': 'sadness', 'terganggu': 'sadness', 'tertipu': 'surprise', 'keliru': 'sadness', 'mengeluh': 'sadness', 'merosot': 'anger', 'susut': 'anger', 'terjebak': 'sadness', 'terpengaruh': 'surprise', 'kesal': 'sadness', 'terkejut': 'surprise', 'bersalah': 'sadness', 'berdosa': 'fear', (continues on next page)


(continued from previous page) 'dihargai': 'sadness', 'janggal': 'anger', 'resah': 'sadness', 'kesepian': 'sadness', 'gundah': 'anger', 'goyah': 'anger', 'disakiti': 'anger', 'takjub': 'anger', 'sengsara': 'sadness', 'seram': 'fear', 'menyebalkan': 'anger', 'merana': 'sadness', 'melarat': 'anger', 'angkuh': 'anger', 'rakus': 'anger', 'terpuruk': 'anger', 'pengsan': 'surprise', 'tertido': 'anger', 'pitam': 'anger', 'terlelap': 'anger', 'terberak': 'anger', 'nanges': 'anger', 'mengamuk': 'anger', 'tdoq': 'anger', 'termuntah': 'anger', 'tidor': 'anger', 'bangga': 'surprise', 'surprise': 'surprise', 'suprise': 'anger', 'makan2': 'anger', 'attention': 'anger', 'kejutan': 'anger', 'assignment': 'fear', 'comeback': 'anger', 'chance': 'fear', 'homework': 'anger', 'appointment': 'anger', 'wtf': 'surprise', 'huh': 'anger', 'seriously': 'anger', 'omg': 'anger', 'aik': 'anger', 'wth': 'anger', 'shit': 'anger', 'apoo': 'fear', 'hah': 'anger', 'damn': 'anger', 'stun': 'surprise', 'pinafsueun': 'anger', 'neelehh': 'anger', 'rudgard': 'anger', '016344981': 'anger', 'pramaandika': 'anger', 'hamidibahawa': 'anger', 'spesialers': 'anger', 'superpignan': 'anger', '082187486748': 'anger', (continues on next page)


(continued from previous page) 'tertanya2': 'surprise', 'terperanjat': 'anger', 'cubaa': 'anger', 'stuju': 'anger', 'stayback': 'anger', 'cakaplah': 'anger', 'melebih': 'anger', 'tanyaa': 'anger', 'ngandung': 'anger'}

9.31.7 propagate graph

def propagate_graph(
    lexicon,
    wordvector,
    pool_size = 10,
    top_n = 20,
    similarity_power = 10.0,
    normalization = True,
    soft = False,
    silent = False,
):

""" Graph propagation method dapted from Velikovich, Leonid, et al. "The viability of

˓→web-derived polarity lexicons." http://www.aclweb.org/anthology/N10-1119

Parameters ------

lexicon: dict curated lexicon from expert domain, {'label1': [str], 'label2': [str]}. wordvector: object wordvector interface object. pool_size: int, optional (default=10) pick top-pool size from each lexicons. top_n: int, optional (default=20) top_n for each vectors will multiple with `similarity_power`. similarity_power: float, optional (default=10.0) extra score for `top_n`, less will generate less bias induced but high chance

˓→unbalanced outcome. normalization: bool, optional (default=True) normalize word vectors using L2 norm. L2 is good to penalize skewed vectors. soft: bool, optional (default=False) if True, a word not in the dictionary will be replaced with nearest

˓→jarowrinkler ratio. if False, it will throw an exception if a word not in the dictionary. silent: bool, optional (default=False) if True, will not print any logs.

Returns ------tuple: (labels[argmax(scores), axis = 1], scores, labels) """


[14]: %%time

      results_emotion, scores_emotion, labels_emotion = malaya.lexicon.propagate_graph(
          emotion_lexicon, wordvector, pool_size = 10
      )

populating nearest words from wordvector
populating vectors from populated nearest words
propagate graph from populated nearest words
100%|| 452/452 [00:00<00:00, 1830.24it/s]
CPU times: user 16.5 s, sys: 2.2 s, total: 18.7 s
Wall time: 11.8 s

[15]: np.unique(list(results_emotion.values()), return_counts = True)
[15]: (array(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'], dtype='

[16]: results_emotion [16]: {'sebal': 'anger', 'gesture': 'fear', 'se7': 'anger', 'ziraa': 'anger', 'mantepp': 'anger', 'mesem': 'fear', 'nggapapa': 'anger', 'maen2': 'anger', 'gacocok': 'fear', 'jeongwoo': 'anger', 'bergelora': 'anger', 'mereda': 'anger', 'skeptis': 'anger', 'gebus': 'love', 'tyrion': 'fear', 'memuncak': 'anger', 'mewabah': 'anger', 'mengenaskan': 'anger', 'kesasar': 'love', 'kepedean': 'anger', 'annoying': 'anger', 'awkward': 'fear', 'scary': 'fear', 'handsome': 'love', 'nervous': 'fear', 'cringe': 'anger', 'menyampah': 'anger', 'kelakar': 'anger', 'cute': 'love', 'cuak': 'fear', 'bodoh': 'anger', 'bangang': 'anger', (continues on next page)


(continued from previous page) 'bebal': 'anger', 'bodo': 'anger', 'noob': 'anger', 'bengap': 'anger', 'celaka': 'anger', 'biadap': 'anger', 'pukimak': 'anger', 'berang': 'anger', 'buru': 'joy', 'nerus': 'anger', 'kangsar': 'fear', 'lipis': 'anger', 'pilah': 'fear', 'besut': 'anger', 'krai': 'anger', 'klawang': 'anger', 'ketil': 'anger', 'amuk': 'anger', 'mbatin': 'love', 'sebarin': 'anger', 'sebarisan': 'fear', 'ngalami': 'joy', 'tikt': 'anger', 'diharga': 'anger', 'threesome': 'anger', 'shizuka': 'anger', 'bokondini': 'anger', 'mendidih': 'anger', 'mengental': 'anger', 'sebati': 'surprise', 'mengembang': 'anger', 'layu': 'surprise', 'kecoklatan': 'anger', 'matang': 'sadness', 'meresap': 'surprise', 'mengering': 'anger', 'direbus': 'anger', 'pengecut': 'anger', 'bajingan': 'fear', 'pembohong': 'fear', 'pecundang': 'fear', 'dungu': 'fear', 'pemberani': 'anger', 'negarawan': 'anger', 'jahil': 'anger', 'biadab': 'fear', 'provokator': 'fear', 'bengang': 'anger', 'menyirap': 'joy', 'meluat': 'surprise', 'frust': 'surprise', 'rimas': 'sadness', 'annoyed': 'anger', 'lonely': 'love', 'berdukacita': 'anger', 'menyakitimu': 'surprise', 'bersinggungan': 'anger', (continues on next page)


(continued from previous page) 'bermesra': 'anger', 'meridhoi': 'love', 'menyelubungi': 'anger', 'empukk': 'anger', 'berserban': 'anger', 'diracuni': 'anger', 'dibayangi': 'fear', 'jengkel': 'anger', 'gugup': 'anger', 'dibiasain': 'joy', 'mubazir': 'anger', 'amnesia': 'fear', 'psikopat': 'fear', 'gumoh': 'anger', 'diurusin': 'fear', 'ngangenin': 'joy', 'purging': 'joy', 'babi': 'anger', 'sial': 'surprise', 'kimak': 'surprise', 'anjing': 'fear', 'pundek': 'surprise', 'cibai': 'surprise', 'setan': 'fear', 'lembu': 'anger', 'pedar': 'anger', 'sanwya': 'love', 'qabaya': 'love', '5pac': 'love', 'wa082336409906': 'love', 'mpibg': 'love', 'honachahthu': 'love', 'unieleven': 'love', 'mengepilkan': 'surprise', 'ciknorzaidi': 'love', 'benci': 'anger', 'jijik': 'fear', 'kagum': 'surprise', 'geli': 'fear', 'insecure': 'sadness', 'geram': 'sadness', 'respect': 'love', 'jealous': 'anger', 'marah': 'anger', 'maki': 'surprise', 'merajuk': 'surprise', 'marah2': 'surprise', 'perli': 'joy', 'jeles': 'love', 'tegur': 'surprise', 'kecam': 'fear', 'cemburu': 'sadness', 'bitter': 'surprise', 'ngeri': 'fear', 'serem': 'anger', 'kocak': 'fear', 'mantep': 'fear', (continues on next page)


(continued from previous page) 'miris': 'anger', 'ngeselin': 'fear', 'nyesek': 'fear', 'kesel': 'sadness', 'sebel': 'anger', 'lebay': 'fear', 'phobia': 'fear', 'mendem': 'joy', 'berideologi': 'anger', 'niru': 'anger', 'nyicip': 'anger', 'ngerawat': 'love', 'riweuh': 'joy', 'nmun': 'anger', 'ngancam': 'surprise', 'bencong': 'fear', 'anxiety': 'fear', 'glasses': 'love', 'manners': 'fear', 'satan': 'fear', 'popularity': 'love', 'curl': 'surprise', 'impossible': 'fear', 'mayb': 'love', 'sperm': 'anger', 'nyumpah': 'fear', 'fitnah': 'fear', 'hoax': 'fear', 'provokasi': 'anger', 'kebencian': 'love', 'dusta': 'love', 'hoaks': 'anger', 'kebohongan': 'love', 'bohong': 'anger', 'rasis': 'sadness', 'ngibul': 'anger', 'horror': 'fear', 'horor': 'joy', 'romance': 'love', 'day6': 'anger', 'dokumenter': 'anger', 'porno': 'anger', 'anime': 'joy', 'sinetron': 'joy', 'drakor': 'joy', 'dangdut': 'joy', 'takut': 'fear', 'risau': 'surprise', 'malu': 'fear', 'khawatir': 'sadness', 'segan': 'anger', 'kecewa': 'sadness', 'takot': 'surprise', 'bimbang': 'sadness', 'takutnya': 'love', 'sedih': 'sadness', 'panic': 'fear', (continues on next page)


(continued from previous page) 'loud': 'love', 'impressed': 'surprise', 'expected': 'surprise', 'dying': 'joy', 'rush': 'surprise', 'shitty': 'anger', 'smoke': 'surprise', 'suck': 'love', 'cheap': 'fear', 'emo': 'anger', 'boring': 'joy', 'gelabah': 'surprise', 'ngantok': 'surprise', 'syok': 'joy', 'seronok': 'joy', 'busy': 'joy', 'serabut': 'sadness', 'syiok': 'surprise', 'sendu': 'joy', 'riang': 'joy', 'ceria': 'joy', 'takbir': 'joy', 'bersuka': 'love', 'emma': 'love', 'barakah': 'anger', 'telemovie': 'joy', 'riuh': 'surprise', 'ria': 'joy', 'khutbah': 'joy', 'sebak': 'sadness', 'excited': 'joy', 'terharu': 'surprise', 'terliur': 'surprise', 'girang': 'joy', 'ditikung': 'anger', 'ambis': 'anger', 'rafa': 'anger', 'digangguin': 'anger', 'nyiksa': 'fear', 'maruk': 'love', 'tamvan': 'anger', 'pengap': 'anger', 'iklas': 'love', 'puas': 'joy', 'muak': 'sadness', 'kenyang': 'joy', 'lega': 'joy', 'bosan': 'love', 'berbaloi': 'sadness', 'berpuas': 'sadness', 'lelah': 'sadness', 'bahagia': 'joy', 'menyenangkan': 'sadness', 'gelisah': 'sadness', 'nyaman': 'sadness', 'indah': 'sadness', 'sukses': 'sadness', (continues on next page)


(continued from previous page) 'sehat': 'sadness', 'damai': 'sadness', 'suka': 'joy', 'sukanya': 'love', 'doyan': 'fear', 'demen': 'anger', 'suke': 'love', 'gasuka': 'love', 'gemar': 'love', 'sukaa': 'love', 'prefer': 'love', 'happy': 'joy', 'hepi': 'love', 'wish': 'love', 'nice': 'surprise', 'cerita': 'joy', 'citer': 'surprise', 'cite': 'surprise', 'crita': 'surprise', 'kisah': 'love', 'percakapan': 'fear', 'tweet': 'love', 'drama': 'joy', 'lagu': 'joy', 'ceramah': 'surprise', 'cinta': 'love', 'kebahagiaan': 'sadness', 'cintanya': 'anger', 'cintaku': 'sadness', 'persahabatan': 'joy', 'cintamu': 'anger', 'kesabaran': 'fear', 'dendam': 'sadness', 'kesedihan': 'sadness', 'asa': 'sadness', 'baby': 'love', 'daddy': 'love', 'mira': 'love', 'princess': 'love', 'bella': 'joy', 'farah': 'surprise', 'mommy': 'love', 'sister': 'surprise', 'mummy': 'love', 'lisa': 'joy', 'love': 'love', 'luv': 'love', 'hate': 'surprise', 'thought': 'surprise', 'mean': 'surprise', 'want': 'joy', 'see': 'surprise', 'need': 'joy', 'hope': 'surprise', 'peace': 'anger', 'syang': 'love', 'noi': 'fear', (continues on next page)


(continued from previous page) 'bilang2': 'anger', 'syng': 'anger', 'mut': 'fear', 'ribbey': 'anger', 'seneng2': 'anger', 'butoset': 'anger', 'manly': 'anger', 'twet': 'anger', 'syg': 'love', 'sayangg': 'love', 'sayang': 'love', 'bby': 'surprise', 'cntik': 'anger', 'knl': 'surprise', 'anon': 'anger', 'sistur': 'surprise', 'sayang2': 'surprise', 'bgus': 'anger', 'rindukn': 'love', 'ajeb2an': 'surprise', 'hshakjsjsbs': 'anger', 'miliknyamencatat': 'anger', 'p6a': 'anger', 'ahsjahhaa': 'surprise', 'diwajibk': 'anger', 'protese': 'surprise', 'botaqin': 'surprise', 'kruntel': 'anger', 'rindu': 'love', 'sayangku': 'anger', 'sayangkan': 'love', 'sayangnya': 'fear', 'disayang': 'joy', 'moody': 'surprise', 'rindukan': 'love', 'merindui': 'surprise', 'takutkan': 'surprise', 'banggakan': 'surprise', 'cintakan': 'surprise', 'perbuat': 'surprise', 'merindukan': 'joy', 'ceraikan': 'surprise', 'jumpai': 'anger', 'rindunya': 'surprise', 'teringat': 'surprise', 'rinduu': 'surprise', 'lapar': 'sadness', 'kempunan': 'surprise', 'teringin': 'joy', 'kangen': 'joy', 'confuse': 'anger', 'stress': 'sadness', 'letih': 'joy', 'penat': 'joy', 'stres': 'sadness', 'mengantuk': 'joy', 'tertekan': 'sadness', (continues on next page)


(continued from previous page) 'terganggu': 'sadness', 'tertipu': 'sadness', 'keliru': 'sadness', 'mengeluh': 'sadness', 'merosot': 'sadness', 'susut': 'surprise', 'terjebak': 'sadness', 'terpengaruh': 'sadness', 'kesal': 'sadness', 'terkejut': 'surprise', 'bersalah': 'sadness', 'berdosa': 'anger', 'dihargai': 'sadness', 'janggal': 'surprise', 'resah': 'sadness', 'kesepian': 'sadness', 'gundah': 'surprise', 'goyah': 'surprise', 'disakiti': 'anger', 'takjub': 'anger', 'sengsara': 'sadness', 'seram': 'anger', 'menyebalkan': 'fear', 'merana': 'surprise', 'melarat': 'surprise', 'angkuh': 'anger', 'rakus': 'anger', 'terpuruk': 'anger', 'pengsan': 'surprise', 'tertido': 'anger', 'pitam': 'surprise', 'terlelap': 'anger', 'terberak': 'surprise', 'nanges': 'surprise', 'mengamuk': 'surprise', 'tdoq': 'anger', 'termuntah': 'surprise', 'tidor': 'anger', 'bangga': 'anger', 'surprise': 'surprise', 'suprise': 'love', 'makan2': 'fear', 'attention': 'fear', 'kejutan': 'fear', 'assignment': 'fear', 'comeback': 'fear', 'chance': 'love', 'homework': 'fear', 'appointment': 'fear', 'wtf': 'surprise', 'huh': 'love', 'seriously': 'love', 'omg': 'love', 'aik': 'love', 'wth': 'love', 'shit': 'anger', 'apoo': 'anger', (continues on next page)


(continued from previous page) 'hah': 'anger', 'damn': 'love', 'stun': 'surprise', 'pinafsueun': 'anger', 'neelehh': 'anger', 'rudgard': 'anger', '016344981': 'anger', 'pramaandika': 'anger', 'hamidibahawa': 'love', 'spesialers': 'anger', 'superpignan': 'anger', '082187486748': 'anger', 'tertanya2': 'anger', 'terperanjat': 'anger', 'cubaa': 'anger', 'stuju': 'anger', 'stayback': 'anger', 'cakaplah': 'anger', 'melebih': 'anger', 'tanyaa': 'anger', 'ngandung': 'anger'}
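Since both propagation methods label the same vocabulary, a quick hedged sketch to measure how often they agree; the variable names are illustrative, and both calls reuse the lexicon and wordvector from above:

prob_results, _, _ = malaya.lexicon.propagate_probabilistic(
    emotion_lexicon, wordvector, pool_size = 10
)
graph_results, _, _ = malaya.lexicon.propagate_graph(
    emotion_lexicon, wordvector, pool_size = 10
)

# fraction of shared words that received the same label from both methods
shared = set(prob_results) & set(graph_results)
agreement = sum(prob_results[w] == graph_results[w] for w in shared) / len(shared)
print('label agreement on %d shared words: %.2f%%' % (len(shared), agreement * 100))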

9.32 Paraphrase

This tutorial is available as an IPython notebook at Malaya/example/paraphrase.

This module is only trained on standard language structure, so it is not safe to use it for local language structure.

[1]: %%time

     import malaya
     from pprint import pprint

CPU times: user 4.83 s, sys: 714 ms, total: 5.54 s
Wall time: 4.52 s

9.32.1 List available Transformer model

[2]: malaya.paraphrase.available_transformer()
INFO:root:tested on ParaSCI test set.
[2]:           Size (MB)  Quantized Size (MB)      BLEU
     t5           1250.0                481.0  0.608904
     small-t5      355.6                195.0  0.617456
     tiny-t5       208.0                103.0  0.460321


9.32.2 Load Transformer model

def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
    """
    Load Malaya transformer encoder-decoder model to generate a paraphrase given a string.

    Parameters
    ----------
    model : str, optional (default='small-t5')
        Model architecture supported. Allowed values:

        * ``'t5'`` - T5 BASE parameters.
        * ``'small-t5'`` - T5 SMALL parameters.
        * ``'tiny-t5'`` - T5 TINY parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster, it totally depends on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `t5` in model, will return `malaya.model.t5.Paraphrase`.
    """

[6]: t5 = malaya.paraphrase.transformer(model = 'small-t5')
INFO:root:running paraphrase-v2/small-t5 using device /device:CPU:0

9.32.3 Paraphrase simple string

We only provide the greedy_decoder method for T5 models,

def greedy_decoder(self, strings: List[str]):
    """
    paraphrase strings.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[str]
    """

[7]: string = "Beliau yang juga saksi pendakwaan kesembilan berkata, ia bagi mengelak daripada wujud isu digunakan terhadap Najib."
     pprint(string)
('Beliau yang juga saksi pendakwaan kesembilan berkata, ia bagi mengelak '
 'daripada wujud isu digunakan terhadap Najib.')


[8]: pprint(t5.greedy_decoder([string]))
['Beliau yang juga saksi pendakwaan kesembilan berkata, ia bagi mengelak wujud '
 'isu yang digunakan terhadap Najib.']

[11]: string = """
PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu.

Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin.

Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir.

"Jadi ini agak berlawanan dengan keputusan yang kita sudah buat. Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan.

"Semua keputusan mesti dibuat melalui parti. Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti.

"Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM. Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro.

Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah.

Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah.

Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT.

"Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya.

Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional.

"Kenyataan media bukanlah keputusan rasmi. Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat. Kita catat di dalam minit apa yang berlaku di dalam mesyuarat," katanya.
"""

[13]: import re

      # minimum cleaning, just simply to remove newlines.
      def cleaning(string):
          string = string.replace('\n', ' ')
          string = re.sub(r'[ ]+', ' ', string).strip()
          return string



      string = cleaning(string)
      splitted = malaya.text.function.split_into_sentences(string)
      splitted
[13]: ['PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu.',
 'Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin.',
 'Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir.',
 '"Jadi ini agak berlawanan dengan keputusan yang kita sudah buat.',
 'Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan.',
 '"Semua keputusan mesti dibuat melalui parti.',
 'Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti.',
 '"Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM.',
 'Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro.',
 'Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah.',
 'Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah.',
 'Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT.',
 '"Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya.',
 'Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional.',
 '"Kenyataan media bukanlah keputusan rasmi.',
 'Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat.',
 'Kita catat di dalam minit apa yang berlaku di dalam mesyuarat," katanya.']

[20]: t5.greedy_decoder(splitted)
[20]: ['PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak pada mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu.',
 'Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin.',
 'Bekas Setiausaha Agung Bersatu, Datuk Marzuki Yahya, berkata pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir.',
 '"Jadi ini agak berlawanan dengan keputusan yang kita buat.',
 'Saya tidak faham bagaimana Jabatan Pendaftaran Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah membuat keputusan di dalam mesyuarat, bukan seorang dua yang membuat keputusan.',
 '"Semua keputusan mesti dibuat melalui parti.',
 'Walau apa pun perbincangan dibuat di luar keputusan mesyuarat, ini bukan keputusan parti.',
 '"Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM.',
 'Ia sepatutnya dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro.',
 'Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah.',
 'Laporan itu juga menyatakan kedudukan Muhyiddin Yassin memangku jawatan itu juga sah.',
 'Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT.',
 '"Fasal yang disebut itu terpakai jika berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya.',
 'Marzuki juga mempersoalkan kenyataan media yang dibuat oleh beberapa pemimpin parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional.',
 '"Kenyataan media bukanlah keputusan rasmi.',
 'Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak mengubah keputusan yang sudah dibuat di dalam mesyuarat.',
 'Kami mencatat dalam minit apa yang terjadi di dalam pertemuan," katanya.']
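The split-clean-paraphrase-rejoin steps above can be wrapped into one helper. This is only a sketch of the workflow already shown; joining the paraphrased sentences with a single space is an assumption:

def paraphrase_long_text(model, string):
    # reuse the minimal cleaning and sentence splitter from the cells above
    string = cleaning(string)
    splitted = malaya.text.function.split_into_sentences(string)
    # paraphrase sentence by sentence, then join back into one document
    return ' '.join(model.greedy_decoder(splitted))

# paraphrased = paraphrase_long_text(t5, string)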

9.33 Emotion Analysis

This tutorial is available as an IPython notebook at Malaya/example/emotion.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya

CPU times: user 4.7 s, sys: 630 ms, total: 5.33 s
Wall time: 4.3 s
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

[2]: anger_text = 'babi la company ni, aku dah la penat datang dari jauh'
     fear_text = 'takut doh tengok cerita hantu tadi'
     happy_text = 'bestnya dapat tidur harini, tak payah pergi kerja'
     love_text = 'aku sayang sgt dia dah doh'
     sadness_text = 'kecewa tengok kerajaan baru ni, janji ape pun tak dapat'
     surprise_text = 'sakit jantung aku, terkejut dengan cerita hantu tadi'


9.33.1 Get label

[3]: malaya.emotion.label
[3]: ['anger', 'fear', 'happy', 'love', 'sadness', 'surprise']

All models follow the same sklearn-style interface: predict to get a batch of labels, predict_proba to get a batch of probabilities.

9.33.2 Load multinomial model

def multinomial(**kwargs):
    """
    Load multinomial emotion model.

    Returns
    -------
    result : malaya.model.ml.Bayes class
    """

[4]: model = malaya.emotion.multinomial()

Predict batch of strings

def predict(self, strings: List[str]):
    """
    classify list of strings.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[str]
    """

[5]: model.predict([anger_text])
[5]: ['anger']

[6]: model.predict(
         [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text]
     )
[6]: ['anger', 'fear', 'happy', 'love', 'sadness', 'surprise']


Predict batch of strings with probability

def predict_proba(self, strings: List[str]):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[dict[str, float]]
    """

[7]: model.predict_proba( [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text] ) [7]: [{'anger': 0.32948272681734814, 'fear': 0.13959708810717708, 'happy': 0.14671455153216045, 'love': 0.12489192355631354, 'sadness': 0.1285972541671178, 'surprise': 0.13071645581988448}, {'anger': 0.11379406005377896, 'fear': 0.4006934391283133, 'happy': 0.11389665647702245, 'love': 0.12481915233837086, 'sadness': 0.0991261507380643, 'surprise': 0.14767054126445014}, {'anger': 0.14667998117610198, 'fear': 0.1422732633232615, 'happy': 0.29984520430807293, 'love': 0.1409005078277281, 'sadness': 0.13374705318404811, 'surprise': 0.13655399018078768}, {'anger': 0.1590563839629243, 'fear': 0.14687344690114268, 'happy': 0.1419948160674701, 'love': 0.279550441361504, 'sadness': 0.1285927908584157, 'surprise': 0.14393212084854254}, {'anger': 0.13425914937312508, 'fear': 0.12053328146716755, 'happy': 0.14923350911233682, 'love': 0.10289492749919464, 'sadness': 0.36961334597699913, 'surprise': 0.12346578657117815}, {'anger': 0.06724850384395685, 'fear': 0.1283628050361525, 'happy': 0.05801958643852813, 'love': 0.06666524240157067, 'sadness': 0.06537667186293224, 'surprise': 0.6143271904168589}]


9.33.3 List available Transformer models

[8]: malaya.emotion.available_transformer()
INFO:root:tested on 20% test set.
[8]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.6               111.00          0.99786       0.99773         0.99779
     tiny-bert         57.4                15.40          0.99692       0.99696         0.99694
     albert            48.6                12.80          0.99740       0.99773         0.99757
     tiny-albert       22.4                 5.98          0.99325       0.99378         0.99351
     xlnet            446.5               118.00          0.99773       0.99775         0.99774
     alxlnet           46.8                13.30          0.99663       0.99697         0.99680

Make sure you check the accuracy chart here first before selecting a model, https://malaya.readthedocs.io/en/latest/Accuracy.html#emotion-analysis You might want to use Tiny-Albert, a very small model at only 22.4MB, whose accuracy is still top notch.

9.33.4 Load Transformer model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer emotion model.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster, it totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.softmax.transformer function
    """

[11]: model = malaya.emotion.transformer(model = 'albert')
INFO:tensorflow:loading sentence piece model


INFO:tensorflow:loading sentence piece model

9.33.5 Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model, it totally depends on the machine.

[12]: quantized_model = malaya.emotion.transformer(model = 'albert', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model
INFO:tensorflow:loading sentence piece model

Predict batch of strings

def predict(self, strings: List[str]):
    """
    classify list of strings.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[str]
    """

[13]: model.predict(
          [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text]
      )
[13]: ['anger', 'fear', 'happy', 'love', 'sadness', 'surprise']

[14]: quantized_model.predict(
          [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text]
      )
[14]: ['anger', 'fear', 'happy', 'love', 'sadness', 'surprise']
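If you are curious how much quantization changes latency on your own machine, a rough timing sketch; as noted above, the numbers depend entirely on the hardware, so treat this as a measurement recipe rather than a claim:

import time

def average_latency(m, strings, n = 5):
    # average wall time of n predict calls
    start = time.time()
    for _ in range(n):
        m.predict(strings)
    return (time.time() - start) / n

texts = [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text]
print('float32 :', average_latency(model, texts))
print('int8    :', average_latency(quantized_model, texts))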

Predict batch of strings with probability

def predict_proba(self, strings: List[str]):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[dict[str, float]]
    """

[16]: model.predict_proba( [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text] ) [16]: [{'anger': 0.9998901, 'fear': 3.2524113e-05, 'happy': 2.620931e-05, 'love': 2.2871463e-05, 'sadness': 9.782951e-06, 'surprise': 1.8502667e-05}, {'anger': 1.6941378e-05, 'fear': 0.9999205, 'happy': 9.070281e-06, 'love': 2.044179e-05, 'sadness': 6.7731107e-06, 'surprise': 2.6314676e-05}, {'anger': 0.15370166, 'fear': 0.0013852724, 'happy': 0.8268689, 'love': 0.011433229, 'sadness': 0.0011807577, 'surprise': 0.005430276}, {'anger': 1.2597201e-05, 'fear': 1.7600481e-05, 'happy': 9.667115e-06, 'love': 0.9999331, 'sadness': 1.3735416e-05, 'surprise': 1.3399296e-05}, {'anger': 1.9176923e-05, 'fear': 1.1163729e-05, 'happy': 6.353941e-06, 'love': 7.004002e-06, 'sadness': 0.99994576, 'surprise': 1.0511084e-05}, {'anger': 5.8739704e-05, 'fear': 1.9771342e-05, 'happy': 1.8316741e-05, 'love': 2.2319455e-05, 'sadness': 3.646786e-05, 'surprise': 0.9998443}]

[4]: quantized_model.predict_proba( [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text] ) [4]: [{'anger': 0.99988353, 'fear': 3.5938003e-05, 'happy': 2.7778764e-05, 'love': 2.3541537e-05, 'sadness': 9.574292e-06, 'surprise': 1.9607493e-05}, (continues on next page)


(continued from previous page) {'anger': 1.6855265e-05, 'fear': 0.9999219, 'happy': 9.185196e-06, 'love': 2.0216348e-05, 'sadness': 6.6679663e-06, 'surprise': 2.5186611e-05}, {'anger': 0.22842072, 'fear': 0.001628682, 'happy': 0.7477462, 'love': 0.014303649, 'sadness': 0.0013838055, 'surprise': 0.00651699}, {'anger': 1.28296715e-05, 'fear': 1.7833345e-05, 'happy': 9.577061e-06, 'love': 0.9999324, 'sadness': 1.3832815e-05, 'surprise': 1.34745715e-05}, {'anger': 1.9776813e-05, 'fear': 1.1116885e-05, 'happy': 6.3422367e-06, 'love': 6.905633e-06, 'sadness': 0.9999455, 'surprise': 1.0316757e-05}, {'anger': 5.8218586e-05, 'fear': 2.07504e-05, 'happy': 1.8061248e-05, 'love': 2.1852256e-05, 'sadness': 3.5944133e-05, 'surprise': 0.99984515}]

Open emotion visualization dashboard

def predict_words(
    self, string: str, method: str = 'last', visualization: bool = True
):
    """
    classify words.

    Parameters
    ----------
    string : str
    method : str, optional (default='last')
        Attention layer supported. Allowed values:

        * ``'last'`` - attention from last layer.
        * ``'first'`` - attention from first layer.
        * ``'mean'`` - average attentions from all layers.

    visualization: bool, optional (default=True)
        If True, it will open the visualization dashboard.

    Returns
    -------
    result: dict
    """


By default, when you call predict_words it will open a browser with the visualization dashboard; you can disable this with visualization=False.
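A minimal sketch of capturing the result instead of opening the browser; the exact keys of the returned dict are not shown in this notebook, so treat them as version dependent:

# returns the attention result as a dict instead of opening the dashboard
r = model.predict_words(sadness_text, visualization = False)
print(list(r.keys()) if isinstance(r, dict) else r)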

[ ]: model.predict_words(sadness_text)

[18]: from IPython.core.display import Image, display

display(Image('emotion-dashboard.png', width=800))

9.33.6 Vectorize

Let's say you want to visualize sentence / word level in a lower dimension, you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    vectorize list of strings.

    Parameters
    ----------
    strings: List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result: np.array
    """


Sentence level

[5]: texts = [anger_text, fear_text, happy_text, love_text, sadness_text, surprise_text]
     r = quantized_model.vectorize(texts, method = 'first')

[6]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(r)
     tsne.shape
[6]: (6, 2)

[8]: plt.figure(figsize = (7, 7))
     plt.scatter(tsne[:, 0], tsne[:, 1])
     labels = texts
     for label, x, y in zip(
         labels, tsne[:, 0], tsne[:, 1]
     ):
         label = (
             '%s, %.3f' % (label[0], label[1])
             if isinstance(label, list)
             else label
         )
         plt.annotate(
             label,
             xy = (x, y),
             xytext = (0, 0),
             textcoords = 'offset points',
         )
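Besides t-SNE, the sentence vectors in r can be compared directly; a small sketch with sklearn's cosine similarity, under the assumption that r is the (6, dim) array returned by the vectorize call above:

from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarity between the six example sentences
sim = cosine_similarity(r)
# similarity of the anger example against every other example
print(sim[0])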


Word level

[9]: r = quantized_model.vectorize(texts, method = 'word')

[10]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[11]: tsne = TSNE().fit_transform(y)
      tsne.shape
[11]: (49, 2)

[12]: plt.figure(figsize = (7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(
          labels, tsne[:, 0], tsne[:, 1]
      ):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy = (x, y),
              xytext = (0, 0),
              textcoords = 'offset points',
          )


Pretty good, the model is able to recognize the cluster at the top right as the surprise emotion.

9.33.7 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

[4]: multinomial = malaya.emotion.multinomial()

[6]: malaya.stack.predict_stack([multinomial, model], [anger_text])
[6]: [{'anger': 0.5739743139312979,
  'fear': 0.002130791264743306,
  'happy': 0.0019609404077070573,
  'love': 0.0016901068202818533,
  'sadness': 0.001121633002361737,
  'surprise': 0.0015551851123993595}]

[7]: malaya.stack.predict_stack([multinomial, model], [anger_text, sadness_text])
[7]: [{'anger': 0.5739743139312979,
  'fear': 0.002130791264743306,
  'happy': 0.0019609404077070573,
  'love': 0.0016901068202818533,
  'sadness': 0.001121633002361737,
  'surprise': 0.0015551858768478731},
 {'anger': 0.0016541454680912208,
  'fear': 0.0011659984542562358,
  'happy': 0.001014179551389293,
  'love': 0.0008495638318424924,
  'sadness': 0.5854571761989077,
  'surprise': 0.001159149836587787}]


9.34 Language Detection

This tutorial is available as an IPython notebook at Malaya/example/language-detection.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
     import fasttext

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_addons/utils/ensure_tf_install.py:68: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.2.0 and strictly below 2.4.0 (nightly versions are not supported).
The versions of TensorFlow you are currently using is 2.4.1 and is not supported.
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version.
You can find the compatibility matrix in TensorFlow Addon's readme: https://github.com/tensorflow/addons
  UserWarning,
CPU times: user 5.17 s, sys: 990 ms, total: 6.16 s
Wall time: 6.67 s

9.34.1 List available language detected

[2]: malaya.language_detection.label
[2]: ['eng', 'ind', 'malay', 'manglish', 'other', 'rojak']

[4]: chinese_text = 'Muiriel'
     english_text = 'i totally love it man'
     indon_text = 'menjabat saleh perombakan menjabat periode komisi energi fraksi partai pengurus partai periode periode partai terpilih periode menjabat komisi perdagangan investasi persatuan periode'
     malay_text = 'beliau berkata program Inisitif Peduli Rakyat (IPR) yang diperkenalkan oleh kerajaan negeri Selangor lebih besar sumbangannya'
     socialmedia_malay_text = 'nti aku tengok dulu tiket dari kl pukul berapa ada nahh'
     socialmedia_indon_text = 'saking kangen papanya pas vc anakku nangis'
     rojak_text = 'jadi aku tadi bikin ini gengs dan dijual haha salad only k dan haha drinks only k'
     manglish_text = 'power lah even shopback come to edmw riao'

9.34.2 Load Fast-text model

Make sure fasttext is already installed; if not, simply,

pip install fasttext

def fasttext(quantized: bool = True, **kwargs):
    """
    Load Fasttext language detection model.
    Original size is 353MB, quantized size 31.1MB.

    Parameters
    ----------
    quantized: bool, optional (default=True)
        if True, load quantized fasttext model. Else, load original fasttext model.

    Returns
    -------
    result : malaya.model.ml.LanguageDetection class
    """

In this example, I am going to compare with the pretrained fasttext from Facebook, https://fasttext.cc/docs/en/language-identification.html
Simply download the pretrained model,

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz

[4]: model = fasttext.load_model('lid.176.ftz')
     fast_text = malaya.language_detection.fasttext()

[5]: model.predict([''])
[5]: ([['__label__km']], array([[0.99841499]]))

[6]: fast_text.predict([''])
[6]: ['other']

Language detection in Malaya is not trying to tackle every possible language in the world, just hyperlocal languages.
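One hedged pattern for combining both detectors: keep Malaya's hyperlocal labels, and fall back to Facebook's 176-language model whenever Malaya answers 'other'. detect below is a hypothetical helper built on the two models loaded above:

def detect(string):
    label = fast_text.predict([string])[0]
    if label != 'other':
        return label
    # fall back to the Facebook lid.176 model for non-hyperlocal text
    fb_labels, _ = model.predict([string])
    return fb_labels[0][0].replace('__label__', '')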

[7]: model.predict(['suka makan ayam dan daging'])
[7]: ([['__label__id']], array([[0.6334154]]))


[8]: fast_text.predict_proba(['suka makan ayam dan daging']) [8]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.8817721009254456, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}]

[9]: model.predict(malay_text)
[9]: (('__label__ms',), array([0.57101035]))

[10]: fast_text.predict_proba([malay_text]) [10]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.9999504089355469, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}]

[11]: model.predict(socialmedia_malay_text)
[11]: (('__label__id',), array([0.7870034]))

[12]: fast_text.predict_proba([socialmedia_malay_text]) [12]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.9996305704116821, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}]

[13]: model.predict(socialmedia_indon_text)
[13]: (('__label__fr',), array([0.2912012]))

[14]: fast_text.predict_proba([socialmedia_indon_text]) [14]: [{'eng': 0.0, 'ind': 1.0000293254852295, 'malay': 0.0, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}]

[15]: model.predict(rojak_text)
[15]: (('__label__id',), array([0.87948251]))

[16]: fast_text.predict_proba([rojak_text]) [16]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.0, 'manglish': 0.0, (continues on next page)


(continued from previous page) 'other': 0.0, 'rojak': 0.9994134306907654}]

[17]: model.predict(manglish_text)
[17]: (('__label__en',), array([0.89707506]))

[18]: fast_text.predict_proba([manglish_text]) [18]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.0, 'manglish': 1.00004243850708, 'other': 0.0, 'rojak': 0.0}]

[19]: model.predict(chinese_text)
[19]: (('__label__zh',), array([0.97311586]))

[20]: fast_text.predict_proba([chinese_text]) [20]: [{'eng': 0.0, 'ind': 0.0, 'malay': 0.0, 'manglish': 0.0, 'other': 0.9921814203262329, 'rojak': 0.0}]

[21]: fast_text.predict_proba([indon_text,malay_text]) [21]: [{'eng': 0.0, 'ind': 1.0000287294387817, 'malay': 0.0, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}, {'eng': 0.0, 'ind': 0.0, 'malay': 0.9999504089355469, 'manglish': 0.0, 'other': 0.0, 'rojak': 0.0}]

9.34.3 Load Deep learning model

The deep learning model is slightly more accurate than the fast-text model; you can check the accuracy comparison here, https://malaya.readthedocs.io/en/latest/Accuracy.html#language-detection

def deep_model(quantized: bool = False, **kwargs):
    """
    Load deep learning language detection model.
    Original size is 51.2MB, quantized size 12.8MB.

    Parameters
    ----------
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster, it totally depends on the machine.

    Returns
    -------
    result : malaya.model.tf.DeepLang class
    """

[5]: deep = malaya.language_detection.deep_model()
     quantized_deep = malaya.language_detection.deep_model(quantized = True)

[6]: deep.predict_proba([indon_text]) [6]: [{'eng': 3.6145184e-06, 'ind': 0.9998913, 'malay': 5.4685424e-05, 'manglish': 5.768742e-09, 'other': 5.8103424e-06, 'rojak': 4.4987162e-05}]

[7]: quantized_deep.predict_proba([indon_text]) [7]: [{'eng': 3.6145184e-06, 'ind': 0.9998913, 'malay': 5.4685424e-05, 'manglish': 5.768742e-09, 'other': 5.8103424e-06, 'rojak': 4.4987162e-05}]

[24]: deep.predict_proba([malay_text]) [24]: [{'eng': 9.500837e-11, 'ind': 0.0004703698, 'malay': 0.9991295, 'manglish': 1.602048e-13, 'other': 1.9133091e-07, 'rojak': 0.0004000054}]

[8]: quantized_deep.predict_proba([malay_text]) [8]: [{'eng': 9.500829e-11, 'ind': 0.00047036994, 'malay': 0.99912965, 'manglish': 1.6020499e-13, 'other': 1.9133095e-07, 'rojak': 0.00040000546}]

[25]: deep.predict_proba([indon_text,malay_text]) [25]: [{'eng': 3.6145207e-06, 'ind': 0.9998909, 'malay': 5.468535e-05, 'manglish': 5.7687397e-09, 'other': 5.8103406e-06, 'rojak': 4.4987148e-05}, {'eng': 9.500837e-11, (continues on next page)


(continued from previous page) 'ind': 0.0004703698, 'malay': 0.9991295, 'manglish': 1.602048e-13, 'other': 1.9133091e-07, 'rojak': 0.0004000056}]

[9]: quantized_deep.predict_proba([indon_text,malay_text]) [9]: [{'eng': 3.614522e-06, 'ind': 0.9998913, 'malay': 5.4685373e-05, 'manglish': 5.768742e-09, 'other': 5.8103424e-06, 'rojak': 4.4987162e-05}, {'eng': 9.500829e-11, 'ind': 0.00047036994, 'malay': 0.99912965, 'manglish': 1.6020499e-13, 'other': 1.9133095e-07, 'rojak': 0.0004000057}]

[26]: deep.predict_proba([socialmedia_malay_text]) [26]: [{'eng': 1.4520887e-09, 'ind': 0.0064318455, 'malay': 0.9824693, 'manglish': 2.1923141e-13, 'other': 1.06363805e-05, 'rojak': 0.0110881105}]

[10]: quantized_deep.predict_proba([socialmedia_malay_text]) [10]: [{'eng': 1.4520903e-09, 'ind': 0.006431847, 'malay': 0.98246956, 'manglish': 2.1923168e-13, 'other': 1.0636383e-05, 'rojak': 0.011088113}]

[27]: deep.predict_proba([socialmedia_indon_text]) [27]: [{'eng': 4.0632068e-07, 'ind': 0.9999995, 'malay': 6.871639e-10, 'manglish': 7.4285925e-11, 'other': 1.5928721e-07, 'rojak': 4.892652e-10}]

[28]: deep.predict_proba([rojak_text, malay_text]) [28]: [{'eng': 0.0040922514, 'ind': 0.02200061, 'malay': 0.0027574676, 'manglish': 9.336553e-06, 'other': 0.00023811469, 'rojak': 0.97090226}, {'eng': 9.500837e-11, (continues on next page)


(continued from previous page) 'ind': 0.0004703698, 'malay': 0.9991295, 'manglish': 1.602048e-13, 'other': 1.9133091e-07, 'rojak': 0.0004000056}]

9.35 NSFW Detection

This tutorial is available as an IPython notebook at Malaya/example/nsfw.

Pretty simple and straightforward, just to detect whether a text is NSFW or not.

[1]: %%time
     import malaya

CPU times: user 4.05 s, sys: 741 ms, total: 4.79 s
Wall time: 4.59 s

9.35.1 Get label

[2]: malaya.nsfw.label
[2]: ['sex', 'gambling', 'negative']

9.35.2 Load lexicon model

Pretty naive but really effective; the lexicon was gathered at Malay-Dataset/corpus/nsfw.

def lexicon(**kwargs):
    """
    Load Lexicon NSFW model.

    Returns
    -------
    result : malaya.text.lexicon.nsfw.Lexicon class
    """

[3]: lexicon_model = malaya.nsfw.lexicon()

[4]: string1 = 'xxx sgt panas, best weh'
     string2 = 'jmpa dekat kl sentral'
     string3 = 'Rolet Dengan Wang Sebenar'


Predict batch of strings

[5]: lexicon_model.predict([string1, string2, string3])
[5]: ['sex', 'negative', 'gambling']

9.35.3 Load multinomial model

All model interfaces follow the sklearn interface starting from v3.4,

def multinomial(**kwargs):
    """
    Load multinomial NSFW model.

    Returns
    -------
    result : malaya.model.ml.Bayes class
    """

[7]: model = malaya.nsfw.multinomial()

Predict batch of strings

[8]: model.predict([string1, string2, string3])
[8]: ['sex', 'negative', 'gambling']

Predict batch of strings with probability

[9]: model.predict_proba([string1, string2, string3])
[9]: [{'sex': 0.9357058034930408,
  'gambling': 0.02616353532998711,
  'negative': 0.03813066117697173},
 {'sex': 0.027541900360621846,
  'gambling': 0.03522626245360637,
  'negative': 0.9372318371857732},
 {'sex': 0.01865380888750343,
  'gambling': 0.9765340760395791,
  'negative': 0.004812115072918792}]
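A simple hedged way to turn these probabilities into a boolean flag; the 0.7 threshold is an arbitrary assumption you should tune on your own data:

def is_nsfw(string, threshold = 0.7):
    proba = model.predict_proba([string])[0]
    label = max(proba, key = proba.get)
    # anything confidently labelled other than 'negative' is treated as NSFW
    return label != 'negative' and proba[label] >= threshold

print(is_nsfw(string1), is_nsfw(string2), is_nsfw(string3))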

9.36 Relevancy Analysis

This tutorial is available as an IPython notebook at Malaya/example/relevancy.

This module is only trained on standard language structure, so it is not safe to use it for local language structure.


[1]: %%time
     import malaya

CPU times: user 4.19 s, sys: 554 ms, total: 4.74 s
Wall time: 3.84 s

9.36.1 Explanation

Positive relevancy: the article or piece of text is relevant; the tendency is high that it is not fake news. It can carry a positive or negative sentiment.
Negative relevancy: the article or piece of text is not relevant; the tendency is high that it is fake news. It can carry a positive or negative sentiment.
Right now the relevancy module only supports deep learning models.

[2]: negative_text = 'Roti Massimo Mengandungi DNA Babi. Roti produk Massimo keluaran Syarikat The Italian Baker mengandungi DNA babi. Para pengguna dinasihatkan supaya tidak memakan produk massimo. Terdapat pelbagai produk roti keluaran syarikat lain yang boleh dimakan dan halal. Mari kita sebarkan berita ini supaya semua rakyat Malaysia sedar dengan apa yang mereka makna setiap hari. Roti tidak halal ada DNA babi jangan makan ok.'
     positive_text = 'Jabatan Kemajuan Islam Malaysia memperjelaskan dakwaan sebuah mesej yang dikitar semula, yang mendakwa kononnya kod E dikaitkan dengan kandungan lemak babi sepertimana yang tular di media sosial. . Tular: November 2017 . Tular: Mei 2014 JAKIM ingin memaklumkan kepada masyarakat berhubung maklumat yang telah disebarkan secara meluas khasnya melalui media sosial berhubung kod E yang dikaitkan mempunyai lemak babi. Untuk makluman, KOD E ialah kod untuk bahan tambah (aditif) dan ianya selalu digunakan pada label makanan di negara Kesatuan Eropah. Menurut JAKIM, tidak semua nombor E yang digunakan untuk membuat sesuatu produk makanan berasaskan dari sumber yang haram. Sehubungan itu, sekiranya sesuatu produk merupakan produk tempatan dan mendapat sijil Pengesahan Halal Malaysia, maka ia boleh digunakan tanpa was-was sekalipun mempunyai kod E-kod. Tetapi sekiranya produk tersebut bukan produk tempatan serta tidak mendapat sijil pengesahan halal Malaysia walaupun menggunakan e-kod yang sama, pengguna dinasihatkan agar berhati-hati dalam memilih produk tersebut.'

9.36.2 List available Transformer models

[3]: malaya.relevancy.available_transformer()
INFO:root:tested on 20% test set.
[3]:               Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score  max length
     bert              425.6               111.00          0.89320       0.89195         0.89256       512.0
     tiny-bert          57.4                15.40          0.87179       0.86324         0.86695       512.0
     albert             48.6                12.80          0.89798       0.86008         0.87209       512.0
     tiny-albert        22.4                 5.98          0.82157       0.83410         0.82416       512.0
     xlnet             446.6               118.00          0.92707       0.92103         0.92381       512.0
     alxlnet            46.8                13.30          0.91135       0.90446         0.90758       512.0
     bigbird           458.0               116.00          0.88093       0.86832         0.87352      1024.0
     tiny-bigbird       65.0                16.90          0.86558       0.85871         0.86176      1024.0

Make sure you check the accuracy chart at https://malaya.readthedocs.io/en/latest/Accuracy.html#relevancy before selecting a model. You might want to use ALXLNET: it has a very small size, 46.8MB, yet its accuracy is still top notch.
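Since available_transformer returns a pandas DataFrame, you can also pick a model programmatically, for example the smallest model that still clears a macro f1-score threshold. A small sketch, assuming the column names shown above:

df = malaya.relevancy.available_transformer()
# smallest model with macro f1-score above 0.90
candidates = df[df['macro f1-score'] > 0.90]
print(candidates.sort_values('Size (MB)').index[0])  # 'alxlnet'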

9.36.3 Load Transformer model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer relevancy model.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
        * ``'bigbird'`` - Google BigBird BASE parameters.
        * ``'tiny-bigbird'`` - Malaya BigBird BASE parameters.
    quantized : bool, optional (default=False)
        If True, will load an 8-bit quantized model. A quantized model is not
        necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.MulticlassBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.MulticlassXLNET`.
        * if `bigbird` in model, will return `malaya.model.xlnet.MulticlassBigBird`.
    """

[4]: model = malaya.relevancy.transformer(model='tiny-bigbird')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

9.36.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[5]: quantized_model = malaya.relevancy.transformer(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

Predict batch of strings

def predict(self, strings: List[str]):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[str]
    """

[7]: %%time
     model.predict([negative_text, positive_text])
CPU times: user 2.04 s, sys: 520 ms, total: 2.56 s
Wall time: 1.23 s
[7]: ['not relevant', 'relevant']

[8]: %%time
     quantized_model.predict([negative_text, positive_text])
CPU times: user 5.08 s, sys: 823 ms, total: 5.91 s
Wall time: 2.96 s
[8]: ['not relevant', 'relevant']

Predict batch of strings with probability

def predict_proba(self, strings: List[str]):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[dict[str, float]]
    """

[9]: %%time
     model.predict_proba([negative_text, positive_text])
CPU times: user 1.46 s, sys: 403 ms, total: 1.86 s
Wall time: 319 ms
[9]: [{'not relevant': 0.9896912, 'relevant': 0.010308762},
     {'not relevant': 0.007830339, 'relevant': 0.9921697}]

[10]: %%time
      quantized_model.predict_proba([negative_text, positive_text])
CPU times: user 2.98 s, sys: 386 ms, total: 3.37 s
Wall time: 583 ms
[10]: [{'not relevant': 0.9999988, 'relevant': 1.2511766e-06},
      {'not relevant': 9.157779e-06, 'relevant': 0.9999908}]

Open relevancy visualization dashboard

def predict_words(self, string: str, method: str = 'last', visualization: bool = True):
    """
    Classify words.

    Parameters
    ----------
    string : str
    method : str, optional (default='last')
        Attention layer supported. Allowed values:

        * ``'last'`` - attention from last layer.
        * ``'first'`` - attention from first layer.
        * ``'mean'`` - average attentions from all layers.
    visualization : bool, optional (default=True)
        If True, it will open the visualization dashboard.

    Returns
    -------
    result : dict
    """

By default, calling predict_words opens a browser with the visualization dashboard; you can disable this with visualization=False. This method is not available for BigBird models.

[11]: model.predict_words(negative_text)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
----> 1 model.predict_words(negative_text)

~/Documents/Malaya/malaya/model/abstract.py in predict_words(self, string, **kwargs)
     21
     22     def predict_words(self, string, **kwargs):
---> 23         raise NotImplementedError
     24

NotImplementedError:
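If you switch between architectures programmatically, it can be convenient to guard predict_words, since BigBird-based models raise NotImplementedError as shown above. A small defensive sketch:

def safe_predict_words(model, string, **kwargs):
    # BigBird models do not implement attention-based word-level prediction
    try:
        return model.predict_words(string, **kwargs)
    except NotImplementedError:
        print('predict_words is not available for this model')
        return None

safe_predict_words(model, negative_text)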

[9]: quantized_model.predict_words(negative_text)

[10]: from IPython.core.display import Image, display

display(Image('relevancy-dashboard.png', width=800))


9.36.5 Vectorize

Let's say you want to visualize sentence / word level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result : np.array
    """

Sentence level

[10]: texts = [negative_text, positive_text]
      r = model.vectorize(texts, method='first')

[11]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(r)
      tsne.shape
[11]: (2, 2)

[12]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = texts
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Word level

[13]: r = quantized_model.vectorize(texts, method='word')

[14]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[15]: tsne = TSNE().fit_transform(y)
      tsne.shape
[15]: (211, 2)

[16]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )


Pretty good, the model is able to cluster the bottom left as positive relevancy.

9.36.6 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

[ ]: albert = malaya.relevancy.transformer(model='albert')
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:loading sentence piece model

[14]: malaya.stack.predict_stack([albert, model], [positive_text, negative_text])
[14]: [{'not relevant': 3.1056952e-06, 'relevant': 0.9999934},
      {'not relevant': 0.99982065, 'relevant': 3.868528e-05}]
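Conceptually, stacking aggregates the per-class probabilities from each model into a single score; a geometric mean is a common choice because it penalizes disagreement between models. A minimal sketch of that idea (see the Stack page above for the aggregation malaya.stack actually uses):

import numpy as np

def stack_gmean(probas):
    # probas: one {label: probability} dict per model for the same string
    labels = probas[0].keys()
    return {
        label: float(np.exp(np.mean([np.log(p[label]) for p in probas])))
        for label in labels
    }

per_model = [m.predict_proba([positive_text])[0] for m in [albert, model]]
print(stack_gmean(per_model))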


9.37 Sentiment Analysis

This tutorial is available as an IPython notebook at Malaya/example/sentiment.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 4.37 s, sys: 709 ms, total: 5.08 s
Wall time: 5.45 s

[2]: string1 = 'Sis, students from overseas were brought back because they are not in their countries which is if something happens to them, its not the other countries’ responsibility. Student dalam malaysia ni dah dlm tggjawab kerajaan. Mana part yg tak faham?'
     string2 = 'Harap kerajaan tak bukak serentak. Slowly release week by week. Focus on economy related industries dulu'
     string3 = 'Idk if aku salah baca ke apa. Bayaran rm350 utk golongan umur 21 ke bawah shj ? Anyone? If 21 ke atas ok lah. If umur 21 ke bawah? Are you serious? Siapa yg lebih byk komitmen? Aku hrp aku salah baca. Aku tk jumpa artikel tu'
     string4 = 'Jabatan Penjara Malaysia diperuntukkan RM20 juta laksana program pembangunan Insan kepada banduan. Majikan yang menggaji bekas banduan, bekas penagih dadah diberi potongan cukai tambahan sehingga 2025.'

9.37.1 Load multinomial model

def multinomial(**kwargs):
    """
    Load multinomial sentiment model.

    Returns
    -------
    result : malaya.model.ml.Bayes class
    """

[3]: model = malaya.sentiment.multinomial()

Predict batch of strings

def predict(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[str]
    """

[4]: model.predict([string1, string2])
[4]: ['neutral', 'neutral']

Disable neutral probability,

[ ]: model.predict([string1, string2], add_neutral=False)

Predict batch of strings with probability

def predict_proba(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[dict[str, float]]
    """

[5]: model.predict_proba([string1, string2])
[5]: [{'negative': 0.008213267932937583, 'positive': 0.17867320670623799, 'neutral': 0.8131135253608244},
     {'negative': 0.010098264096992408, 'positive': 0.009901735903007554, 'neutral': 0.98}]

Disable neutral probability,

[6]: model.predict_proba([string1, string2], add_neutral=False)
[6]: [{'negative': 0.4106633966468791, 'positive': 0.589336603353119},
     {'negative': 0.5049132048496204, 'positive': 0.49508679515037773}]
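Intuitively, add_neutral rescales the binary positive/negative probabilities so that predictions close to 0.5 fall into a neutral class. A toy illustration of one such scheme; this is for intuition only and is not necessarily Malaya's exact formula:

def add_neutral_toy(proba, alpha=2.0):
    # Illustrative only: stretch each binary probability and treat the
    # leftover mass as 'neutral'. Malaya's exact rescaling may differ.
    positive = max(proba['positive'] * alpha - 1, 0)
    negative = max(proba['negative'] * alpha - 1, 0)
    return {'negative': negative, 'positive': positive,
            'neutral': 1 - positive - negative}

print(add_neutral_toy({'negative': 0.4107, 'positive': 0.5893}))
# neutral dominates, mirroring the predict_proba output above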


9.37.2 List available Transformer models

[8]: malaya.sentiment.available_transformer()
INFO:root:tested on 20% test set.
[8]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.6               111.00          0.99330       0.99330         0.99329
     tiny-bert         57.4                15.40          0.98774       0.98774         0.98774
     albert            48.6                12.80          0.99227       0.99226         0.99226
     tiny-albert       22.4                 5.98          0.98554       0.98550         0.98551
     xlnet            446.6               118.00          0.99353       0.99353         0.99353
     alxlnet           46.8                13.30          0.99188       0.99188         0.99188

Make sure you check the accuracy chart at https://malaya.readthedocs.io/en/latest/Accuracy.html#sentiment-analysis before selecting a model. You might want to use Tiny-Albert: it has a very small size, 22.4MB, yet its accuracy is still top notch.

9.37.3 Load Transformer model

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer sentiment model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
    quantized : bool, optional (default=False)
        If True, will load an 8-bit quantized model. A quantized model is not
        necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.softmax.transformer function
    """

[14]: model = malaya.sentiment.transformer(model='xlnet')


9.37.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[10]: quantized_model = malaya.sentiment.transformer(model='xlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

Predict batch of strings

def predict(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[str]
    """

[12]: %%time
      model.predict([string1, string2])
CPU times: user 4.08 s, sys: 1.44 s, total: 5.51 s
Wall time: 4.67 s
[12]: ['positive', 'negative']

[13]: %%time
      quantized_model.predict([string1, string2])
CPU times: user 3.51 s, sys: 1.33 s, total: 4.84 s
Wall time: 3.8 s
[13]: ['positive', 'positive']

Predict batch of strings with probability

def predict_proba(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[dict[str, float]]
    """

[7]: %%time
     model.predict_proba([string1, string2])
CPU times: user 5.05 s, sys: 2.6 s, total: 7.65 s
Wall time: 10.1 s
[7]: [{'negative': 0.00032528088, 'positive': 0.96747196, 'neutral': 0.03220278},
     {'negative': 0.98301303, 'positive': 0.0001698712, 'neutral': 0.016817093}]

[5]: %%time
     quantized_model.predict_proba([string1, string2])
CPU times: user 1.64 s, sys: 387 ms, total: 2.03 s
Wall time: 1.43 s
[5]: [{'negative': 0.0007685767, 'positive': 0.9231422, 'neutral': 0.0760892},
     {'negative': 8.198959e-06, 'positive': 0.9991802, 'neutral': 0.00081157684}]

[13]: model.predict_proba([string1, string2], add_neutral=False)
[13]: [{'negative': 0.029847767, 'positive': 0.97015226},
      {'negative': 0.1034979, 'positive': 0.89650214}]

[5]: quantized_model.predict_proba([string1, string2], add_neutral=False)
[5]: [{'negative': 0.004556194, 'positive': 0.9954438},
     {'negative': 0.07760632, 'positive': 0.9223937}]
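Since a quantized model is not guaranteed to be faster, it is worth benchmarking both variants on your own machine before committing to one. A minimal timing sketch:

import time

def bench(m, strings, n=3):
    # average wall time of n predict_proba calls
    start = time.time()
    for _ in range(n):
        m.predict_proba(strings)
    return (time.time() - start) / n

print('float32:', bench(model, [string1, string2]))
print('int8   :', bench(quantized_model, [string1, string2]))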

Open sentiment visualization dashboard

By default, calling predict_words opens a browser with the visualization dashboard; you can disable this with visualization=False.

[15]: model.predict_words(string1)

[16]: from IPython.core.display import Image, display

display(Image('sentiment-dashboard.png', width=800))


9.37.5 Vectorize

Let's say you want to visualize sentence / word level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result : np.array
    """

Sentence level

[5]: r = quantized_model.vectorize([string1, string2, string3, string4], method='first')

[6]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(r)
     tsne.shape
[6]: (4, 2)

[7]: plt.figure(figsize=(7, 7))
     plt.scatter(tsne[:, 0], tsne[:, 1])
     labels = [string1, string2, string3, string4]
     for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
         label = (
             '%s, %.3f' % (label[0], label[1])
             if isinstance(label, list)
             else label
         )
         plt.annotate(
             label,
             xy=(x, y),
             xytext=(0, 0),
             textcoords='offset points',
         )

Word level

[8]: r = quantized_model.vectorize([string1, string2, string3, string4], method='word')

[9]: x, y = [], []
     for row in r:
         x.extend([i[0] for i in row])
         y.extend([i[1] for i in row])

[10]: tsne = TSNE().fit_transform(y)
      tsne.shape
[10]: (129, 2)

[11]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Pretty good, the model is able to cluster the top left as positive sentiment and the bottom right as negative sentiment.

9.37.6 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

[5]: multinomial = malaya.sentiment.multinomial()
     alxlnet = malaya.sentiment.transformer(model='alxlnet')

[8]: malaya.stack.predict_stack([multinomial, alxlnet, model], [string1, string2])
[8]: [{'negative': 0.0005453552136673502, 'positive': 0.5603020846001405, 'neutral': 0.05399025419995675},
     {'negative': 0.0002248290781177622, 'positive': 0.21361579430243546, 'neutral': 0.022142383292097452}]

If you do not want neutral in predict_stack, simply override the parameter,

[9]: malaya.stack.predict_stack([multinomial, alxlnet, model], [string1, string2], add_neutral=False)
[9]: [{'negative': 0.05828375571937787, 'positive': 0.8221586003437801},
     {'negative': 0.014352668987571138, 'positive': 0.7835866999009022}]

9.38 Subjectivity Analysis

This tutorial is available as an IPython notebook at Malaya/example/subjectivity.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 4.22 s, sys: 535 ms, total: 4.75 s
Wall time: 3.84 s

9.38.1 Explanation

Positive subjectivity: based on or influenced by personal feelings, tastes, or opinions. Can be a positive or negative sentiment. Negative subjectivity: based on a report or a fact. Can be a positive or negative sentiment.

[2]: negative_text = 'Kerajaan negeri Kelantan mempersoalkan motif kenyataan Menteri Kewangan yang hanya menyebut Kelantan penerima terbesar bantuan kewangan dari Kerajaan Persekutuan. Sedangkan menurut Timbalan Menteri Besarnya, Datuk Mohd Amar Nik Abdullah, negeri lain yang lebih maju dari Kelantan turut mendapat pembiayaan dan pinjaman.'
     positive_text = 'kerajaan sebenarnya sangat bencikan rakyatnya, minyak naik dan segalanya'

     string1 = 'Sis, students from overseas were brought back because they are not in their countries which is if something happens to them, its not the other countries’ responsibility. Student dalam malaysia ni dah dlm tggjawab kerajaan. Mana part yg tak faham?'
     string2 = 'Harap kerajaan tak bukak serentak. Slowly release week by week. Focus on economy related industries dulu'
     string3 = 'Idk if aku salah baca ke apa. Bayaran rm350 utk golongan umur 21 ke bawah shj ? Anyone? If 21 ke atas ok lah. If umur 21 ke bawah? Are you serious? Siapa yg lebih byk komitmen? Aku hrp aku salah baca. Aku tk jumpa artikel tu'
     string4 = 'Jabatan Penjara Malaysia diperuntukkan RM20 juta laksana program pembangunan Insan kepada banduan. Majikan yang menggaji bekas banduan, bekas penagih dadah diberi potongan cukai tambahan sehingga 2025.'


9.38.2 Load multinomial model

def multinomial(**kwargs):
    """
    Load multinomial subjectivity model.

    Returns
    -------
    result : malaya.model.ml.Bayes class
    """

[3]: model = malaya.subjectivity.multinomial()

Predict batch of strings

def predict(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[str]
    """

[4]: model.predict([positive_text, negative_text])
[4]: ['neutral', 'negative']

Disable neutral probability,

[5]: model.predict([positive_text, negative_text], add_neutral=False)
[5]: ['positive', 'negative']

Predict batch of strings with probability

def predict_proba(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[dict[str, float]]
    """

[6]: model.predict_proba([positive_text, negative_text], add_neutral=False)
[6]: [{'negative': 0.420659316666446, 'positive': 0.5793406833335559},
     {'negative': 0.7906212884104161, 'positive': 0.2093787115895868}]

9.38.3 List available Transformer models

[7]: malaya.subjectivity.available_transformer()
INFO:root:tested on 20% test set.
[7]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.6               111.00          0.92004       0.91748         0.91663
     tiny-bert         57.4                15.40          0.91023       0.90228         0.90301
     albert            48.6                12.80          0.90544       0.90299         0.90300
     tiny-albert       22.4                 5.98          0.89457       0.89469         0.89461
     xlnet            446.6               118.00          0.91916       0.91753         0.91761
     alxlnet           46.8                13.30          0.90862       0.90835         0.90817

Make sure you check the accuracy chart at https://malaya.readthedocs.io/en/latest/Accuracy.html#subjectivity-analysis before selecting a model. You might want to use Tiny-Albert: it has a very small size, 22.4MB, yet its accuracy is still top notch.

9.38.4 Load Transformer model

All model interfaces follow the sklearn interface starting from v3.4.

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer subjectivity model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
    quantized : bool, optional (default=False)
        If True, will load an 8-bit quantized model. A quantized model is not
        necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.BinaryBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.BinaryXLNET`.
    """

[12]: model = malaya.subjectivity.transformer(model='albert')
INFO:tensorflow:loading sentence piece model

9.38.5 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[14]: quantized_model = malaya.subjectivity.transformer(model='albert', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model

Predict batch of strings

def predict(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[str]
    """

[15]: model.predict([negative_text, positive_text])
[15]: ['negative', 'negative']

[17]: quantized_model.predict([negative_text, positive_text])
[17]: ['negative', 'negative']

Predict batch of strings with probability

def predict_proba(self, strings: List[str], add_neutral: bool = True):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]
    add_neutral : bool, optional (default=True)
        If True, it will add neutral probability.

    Returns
    -------
    result : List[dict[str, float]]
    """

[18]: model.predict_proba([negative_text, positive_text])
[18]: [{'negative': 0.9956738, 'positive': 4.326162e-05, 'neutral': 0.0042829514},
      {'negative': 0.9615872, 'positive': 0.00038412912, 'neutral': 0.038028657}]

[16]: quantized_model.predict_proba([negative_text, positive_text])
[16]: [{'negative': 0.9954784, 'positive': 4.521673e-05, 'neutral': 0.0044763684},
      {'negative': 0.9612684, 'positive': 0.00038731584, 'neutral': 0.038344264}]

Open subjectivity visualization dashboard

By default, calling predict_words opens a browser with the visualization dashboard; you can disable this with visualization=False.

[12]: model.predict_words(negative_text)

[13]: from IPython.core.display import Image, display

display(Image('subjective-dashboard.png', width=800))


9.38.6 Vectorize

Let's say you want to visualize sentence / word level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result : np.array
    """

Sentence level

[8]: texts = [negative_text, positive_text, string1, string2]
     r = quantized_model.vectorize(texts, method='first')

[9]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(r)
     tsne.shape
[9]: (4, 2)

[11]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = texts
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Word level

[12]: r = quantized_model.vectorize(texts, method='word')

[13]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[14]: tsne = TSNE().fit_transform(y)
      tsne.shape
[14]: (109, 2)

[15]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Pretty good, the model is able to cluster the top side as positive subjectivity and the bottom side as negative subjectivity.

9.38.7 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

[6]: multinomial = malaya.subjectivity.multinomial()
     alxlnet = malaya.subjectivity.transformer(model='alxlnet')

[12]: malaya.stack.predict_stack([multinomial, model, alxlnet], [positive_text])
[12]: [{'negative': 0.19735892950073536, 'positive': 0.003119166818228667, 'neutral': 0.1160071232668102}]

[13]: malaya.stack.predict_stack([multinomial, model, alxlnet], [positive_text], add_neutral=False)
[13]: [{'negative': 0.7424157666636825, 'positive': 0.04498033797670938}]


9.39 Toxicity Analysis

This tutorial is available as an IPython notebook at Malaya/example/toxicity.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 4.18 s, sys: 558 ms, total: 4.74 s
Wall time: 3.85 s

9.39.1 Get labels

[2]: malaya.toxicity.label
[2]: ['severe toxic', 'obscene', 'identity attack', 'insult', 'threat',
     'asian', 'atheist', 'bisexual', 'buddhist', 'christian', 'female',
     'heterosexual', 'indian', 'homosexual, gay or lesbian',
     'intellectual or learning disability', 'male', 'muslim',
     'other disability', 'other gender', 'other race or ethnicity',
     'other religion', 'other sexual orientation', 'physical disability',
     'psychiatric or mental illness', 'transgender', 'malay', 'chinese']

[4]: string = 'Benda yg SALAH ni, jgn lah didebatkan. Yg SALAH xkan jadi betul. Ingat tu. Mcm mana kesat sekalipun org sampaikan mesej, dan memang benda tu salah, diam je. Xyah nk tunjuk kau open sangat nk tegur cara org lain berdakwah.'
     another_string = 'melayu bodoh, dah la gay, sokong lgbt lagi, memang tak guna'
     string1 = 'Sis, students from overseas were brought back because they are not in their countries which is if something happens to them, its not the other countries’ responsibility. Student dalam malaysia ni dah dlm tggjawab kerajaan. Mana part yg tak faham?'
     string2 = 'Harap kerajaan tak bukak serentak. Slowly release week by week. Focus on economy related industries dulu'

9.39.2 Load multinomial model

def multinomial(**kwargs):
    """
    Load multinomial toxicity model.

    Returns
    -------
    result : malaya.model.ml.MultilabelBayes class
    """

[9]: model = malaya.toxicity.multinomial()

Predict batch of strings

def predict(self, strings: List[str]):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[str]
    """

[6]: model.predict([string])
[6]: [['severe toxic', 'obscene', 'identity attack', 'insult', 'indian', 'malay', 'chinese']]

Predict batch of strings with probability

def predict_proba(self, strings: List[str]):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[dict[str, float]]
    """

[7]: model.predict_proba([string])
[7]: [{'severe toxic': 0.997487040981572, 'obscene': 0.9455379277616331, 'identity attack': 0.8274699625500679,
      'insult': 0.5607594945618526, 'threat': 0.024772971511820983, 'asian': 0.0221240002096628,
      'atheist': 0.013774558637508741, 'bisexual': 0.0024495807483865223, 'buddhist': 0.004640372956039871,
      'christian': 0.052795457745171054, 'female': 0.05289744129561423, 'heterosexual': 0.008128507494633362,
      'indian': 0.9023637357823499, 'homosexual, gay or lesbian': 0.04385664232535533,
      'intellectual or learning disability': 0.0014981591337876019, 'male': 0.07976929455558882,
      'muslim': 0.08806420077375651, 'other disability': 0.0, 'other gender': 0.0,
      'other race or ethnicity': 0.0017014040578187566, 'other religion': 0.0017333144620482767,
      'other sexual orientation': 0.00122606681013474, 'physical disability': 0.001489522998169223,
      'psychiatric or mental illness': 0.027125947355667267, 'transgender': 0.012349564445375391,
      'malay': 0.9991900346707605, 'chinese': 0.9886782229459774}]
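For a multilabel model like toxicity, predict is equivalent to thresholding predict_proba: the labels returned are those whose probability crosses a cutoff. A quick check, assuming a 0.5 threshold:

probs = model.predict_proba([string])[0]
labels = [label for label, p in probs.items() if p >= 0.5]
print(labels)
# matches model.predict([string])[0]:
# ['severe toxic', 'obscene', 'identity attack', 'insult', 'indian', 'malay', 'chinese']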

9.39.3 List available Transformer models

[8]: malaya.toxicity.available_transformer()
INFO:root:tested on 20% test set.
[8]:              Size (MB)  Quantized Size (MB)  micro precision  micro recall  micro f1-score
     bert             425.6               111.00          0.86098       0.77313         0.81469
     tiny-bert         57.4                15.40          0.83535       0.79611         0.81526
     albert            48.6                12.80          0.86054       0.76973         0.81261
     tiny-albert       22.4                 5.98          0.83535       0.79611         0.81526
     xlnet            446.6               118.00          0.77904       0.83829         0.80758
     alxlnet           46.8                13.30          0.83376       0.80221         0.81768

9.39.4 Load ALXLNET model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer toxicity model.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
    quantized : bool, optional (default=False)
        If True, will load an 8-bit quantized model. A quantized model is not
        necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya.model.bert.SigmoidBERT class
    """

[16]: model = malaya.toxicity.transformer(model='alxlnet')

9.39.5 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[ ]: quantized_model = malaya.toxicity.transformer(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.


Predict batch of strings

def predict(self, strings: List[str]):
    """
    Classify list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[List[str]]
    """

[12]: model.predict([string, another_string])
[12]: [['obscene'],
      ['severe toxic', 'obscene', 'identity attack', 'insult', 'malay']]

Predict batch of strings with probability

def predict_proba(self, strings: List[str]):
    """
    Classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : List[dict[str, float]]
    """

[14]: model.predict_proba([string, another_string])
[14]: [{'severe toxic': 0.30419078, 'obscene': 0.07300964, 'identity attack': 0.02309686,
       'insult': 0.14792377, 'threat': 0.0043829083, 'asian': 0.00018724799,
       'atheist': 0.0013933778, 'bisexual': 0.0005682409, 'buddhist': 0.0006982982,
       'christian': 0.00010216236, 'female': 0.0062876344, 'heterosexual': 3.6597252e-05,
       'indian': 0.020283729, 'homosexual, gay or lesbian': 0.0008122027,
       'intellectual or learning disability': 0.00015977025, 'male': 0.0007993579,
       'muslim': 0.054483294, 'other disability': 0.00017657876, 'other gender': 0.00018069148,
       'other race or ethnicity': 6.273389e-05, 'other religion': 0.0011053085,
       'other sexual orientation': 0.0013027787, 'physical disability': 0.00010755658,
       'psychiatric or mental illness': 0.00078335404, 'transgender': 0.00080055,
       'malay': 0.0033579469, 'chinese': 0.20889702},
      {'severe toxic': 0.99571323, 'obscene': 0.91805434, 'identity attack': 0.95676684,
       'insult': 0.7667657, 'threat': 0.02582252, 'asian': 0.00074103475,
       'atheist': 0.0012175143, 'bisexual': 0.07754475, 'buddhist': 0.004547477,
       'christian': 0.0019699335, 'female': 0.03404945, 'heterosexual': 0.029964417,
       'indian': 0.021356285, 'homosexual, gay or lesbian': 0.13626209,
       'intellectual or learning disability': 0.021410972, 'male': 0.029543608,
       'muslim': 0.06485465, 'other disability': 0.0006414652, 'other gender': 0.04015115,
       'other race or ethnicity': 0.010606945, 'other religion': 0.001650244,
       'other sexual orientation': 0.04054076, 'physical disability': 0.0025109593,
       'psychiatric or mental illness': 0.0022883855, 'transgender': 0.01127643,
       'malay': 0.9658916, 'chinese': 0.33373892}]

[15]: quantized_model.predict_proba([string, another_string])
[15]: [{'severe toxic': 0.28386846, 'obscene': 0.25873762, 'identity attack': 0.021321118,
       'insult': 0.19023287, 'threat': 0.005617261, 'asian': 0.00022211671,
       'atheist': 0.000109523535, 'bisexual': 0.0019034147, 'buddhist': 0.00038090348,
       'christian': 0.0016773939, 'female': 0.007807076, 'heterosexual': 0.0001899302,
       'indian': 0.049388766, 'homosexual, gay or lesbian': 0.00043603778,
       'intellectual or learning disability': 0.0012571216, 'male': 0.0043218136,
       'muslim': 0.018054605, 'other disability': 0.0011820793, 'other gender': 0.00044164062,
       'other race or ethnicity': 0.00012764335, 'other religion': 0.0009614825,
       'other sexual orientation': 0.0040558875, 'physical disability': 0.0005840957,
       'psychiatric or mental illness': 0.0023525357, 'transgender': 0.003135711,
       'malay': 0.0013717413, 'chinese': 0.0051787198},
      {'severe toxic': 0.9966523, 'obscene': 0.82459927, 'identity attack': 0.97338796,
       'insult': 0.49216133, 'threat': 0.010962069, 'asian': 0.0034621954,
       'atheist': 0.0007635355, 'bisexual': 0.044597328, 'buddhist': 0.0061615705,
       'christian': 0.0029616058, 'female': 0.023250878, 'heterosexual': 0.0038115382,
       'indian': 0.0068957508, 'homosexual, gay or lesbian': 0.084989995,
       'intellectual or learning disability': 0.006228268, 'male': 0.070231974,
       'muslim': 0.055434316, 'other disability': 0.00017631054, 'other gender': 0.02043128,
       'other race or ethnicity': 0.0032926202, 'other religion': 0.0035361946,
       'other sexual orientation': 0.018447628, 'physical disability': 0.0007721717,
       'psychiatric or mental illness': 0.004228982, 'transgender': 0.0046984255,
       'malay': 0.7579823, 'chinese': 0.8585954}]

Open toxicity visualization dashboard

By default, calling predict_words opens a browser with the visualization dashboard; you can disable this with visualization=False.

[13]: model.predict_words(another_string)

[14]: from IPython.core.display import Image, display

display(Image('toxicity-dashboard.png', width=800))


9.39.6 Vectorize

Let's say you want to visualize sentence / word level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result : np.array
    """

Sentence level

[8]: texts = [string, another_string, string1, string2]
     r = quantized_model.vectorize(texts, method='first')

[9]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(r)
     tsne.shape
[9]: (4, 2)

[11]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = texts
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Word level

[17]: r = quantized_model.vectorize(texts, method='word')

[18]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[19]: tsne = TSNE().fit_transform(y)
      tsne.shape
[19]: (107, 2)

[20]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Pretty good; the outliers are toxic words.

9.39.7 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

[16]: albert = malaya.toxicity.transformer(model='albert')
INFO:tensorflow:loading sentence piece model

[18]: malaya.stack.predict_stack([model, albert], [another_string])
[18]: [{'severe toxic': 0.9968317, 'obscene': 0.43022493, 'identity attack': 0.90531594,
       'insult': 0.42289576, 'threat': 0.0058603976, 'asian': 0.000983668,
       'atheist': 0.0005495089, 'bisexual': 0.0009623809, 'buddhist': 0.0003632398,
       'christian': 0.0018632574, 'female': 0.006050684, 'heterosexual': 0.0025569045,
       'indian': 0.0056869243, 'homosexual, gay or lesbian': 0.012232827,
       'intellectual or learning disability': 0.00091394753, 'male': 0.011594971,
       'muslim': 0.0042621437, 'other disability': 0.00027529505, 'other gender': 0.0010361207,
       'other race or ethnicity': 0.0012320877, 'other religion': 0.00091365684,
       'other sexual orientation': 0.0027996385, 'physical disability': 0.00010540871,
       'psychiatric or mental illness': 0.000815311, 'transgender': 0.0016718076,
       'malay': 0.96644485, 'chinese': 0.05199418}]


9.40 Doc2Vec

This tutorial is available as an IPython notebook at Malaya/example/doc2vec.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 4.19 s, sys: 598 ms, total: 4.79 s
Wall time: 4.16 s

[2]: string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
     string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
     string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
     string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'

[3]: news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'
     tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'


9.40.1 Doc2Vec using Word Vector

def doc2vec_wordvector(wordvector):
    """
    Doc2vec interface for text similarity using Word Vector.

    Parameters
    ----------
    wordvector : object
        malaya.wordvector.WordVector object; should have `get_vector_by_name` method.

    Returns
    -------
    result : malaya.similarity.Doc2VecSimilarity
    """

Using Interface

I will use malaya.wordvector.load(model='news'); it is pretty accurate for local issues.

[4]: %%time
     vocab_news, embedded_news = malaya.wordvector.load(model='news')
     w2v = malaya.wordvector.WordVector(embedded_news, vocab_news)
     doc2vec = malaya.similarity.doc2vec_wordvector(w2v)
CPU times: user 178 ms, sys: 118 ms, total: 296 ms
Wall time: 301 ms

predict batch of strings with probability

def predict_proba(
    self,
    left_strings: List[str],
    right_strings: List[str],
    aggregation: Callable = np.mean,
    similarity: str = 'cosine',
    soft: bool = False,
):
    """
    Calculate similarity for two different batches of texts.

    Parameters
    ----------
    left_strings : list of str
    right_strings : list of str
    aggregation : Callable, optional (default=numpy.mean)
    similarity : str, optional (default='cosine')
        Similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.
    soft : bool, optional (default=False)
        If True, a word not inside the word vector will be replaced with the
        nearest word; else, it will be skipped.

    Returns
    -------
    result : List[float]
    """

[5]: %%time
     doc2vec.predict_proba([string1], [string2])
CPU times: user 1.53 ms, sys: 786 µs, total: 2.31 ms
Wall time: 1.68 ms
[5]: array([0.89971105])

[6]: %%time
     doc2vec.predict_proba([string1, string2], [string3, string4])
CPU times: user 2.55 ms, sys: 1.44 ms, total: 3.99 ms
Wall time: 2.73 ms
[6]: array([0.91679387, 0.82348571])

[7]: %%time
     doc2vec.predict_proba([string1, string2], [string3, tweet1])
CPU times: user 1.68 ms, sys: 381 µs, total: 2.06 ms
Wall time: 1.75 ms
[7]: array([0.91679387, 0.78542261])
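Conceptually, doc2vec_wordvector aggregates the word vectors of each string (np.mean by default) and compares the aggregated vectors with the chosen similarity function. A minimal sketch of the cosine case, skipping out-of-vocabulary words as soft=False does; get_vector_by_name is the method the interface requires:

import numpy as np

def doc_vector(wv, string):
    # average the word vectors; skip out-of-vocabulary tokens (soft=False)
    vectors = []
    for token in string.lower().split():
        try:
            vectors.append(wv.get_vector_by_name(token))
        except Exception:
            continue
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_vector(w2v, string1), doc_vector(w2v, string2)))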

visualize heatmap

def heatmap(
    self,
    strings: List[str],
    aggregation: Callable = np.mean,
    similarity: str = 'cosine',
    soft: bool = False,
    visualize: bool = True,
    annotate: bool = True,
    figsize: Tuple[int, int] = (7, 7),
):
    """
    Plot a heatmap based on the similarity output.

    Parameters
    ----------
    strings : list of str
        List of strings.
    aggregation : Callable, optional (default=numpy.mean)
    similarity : str, optional (default='cosine')
        Similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.
    soft : bool, optional (default=False)
        If True, a word not inside the word vector will be replaced with the
        nearest word; else, it will be skipped.
    visualize : bool
        If True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        Figure size for plot.

    Returns
    -------
    result : list
        List of results.
    """

[8]: doc2vec.heatmap([string1, string2, string3, string4])


Different similarity functions will return different percentages.
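To see this in numbers, call predict_proba with each supported similarity value:

for similarity in ['cosine', 'euclidean', 'manhattan']:
    print(similarity, doc2vec.predict_proba([string1], [string2], similarity=similarity))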

9.40.2 Doc2Vec using Vectorizer Model

We can use any Vectorizer model provided by Malaya with the encoder similarity interface, for example BERT or XLNET. Again, these encoder models are not trained to do similarity classification; they just encode the strings into vector representations.

def doc2vec_vectorizer(vectorizer):
    """
    Doc2vec interface for text similarity using Vectorizer model.

    Parameters
    ----------
    vectorizer : object
        Vectorizer interface object, BERT, XLNET; should have `vectorize` method.

    Returns
    -------
    result : malaya.similarity.VectorizerSimilarity
    """

using ALXLNET

[9]: alxlnet = malaya.transformer.load(model='alxlnet')
     doc2vec_vectorizer = malaya.similarity.doc2vec_vectorizer(alxlnet)
INFO:tensorflow:memory input None
INFO:tensorflow:Use float type
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:810: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:271: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:110: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/alxlnet-model/base/alxlnet-base/model.ckpt

predict for 2 strings with probability

def predict_proba(
    self,
    left_strings: List[str],
    right_strings: List[str],
    similarity: str = 'cosine',
):
    """
    Calculate similarity for two different batches of texts.

    Parameters
    ----------
    left_strings : list of str
    right_strings : list of str
    similarity : str, optional (default='cosine')
        Similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.

    Returns
    -------
    result : List[float]
    """

[11]: %%time
      doc2vec_vectorizer.predict_proba([string1], [string2])
CPU times: user 1.49 s, sys: 103 ms, total: 1.59 s
Wall time: 1.34 s
[11]: array([0.89992255], dtype=float32)

[12]: %%time
      doc2vec_vectorizer.predict_proba([string1, string2], [string3, string4])
CPU times: user 504 ms, sys: 118 ms, total: 621 ms
Wall time: 139 ms
[12]: array([0.64460504, 0.63204634], dtype=float32)
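Since doc2vec_vectorizer only wraps the model's vectorize output with a similarity function, you can reproduce the cosine score manually. A sketch, assuming the vectorizer exposes the vectorize method required by the interface:

import numpy as np

v = alxlnet.vectorize([string1, string2])
manual = float(np.dot(v[0], v[1]) / (np.linalg.norm(v[0]) * np.linalg.norm(v[1])))
print(manual)  # should be close to the predict_proba result above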

visualize heatmap

def heatmap(
    self,
    strings: List[str],
    similarity: str = 'cosine',
    visualize: bool = True,
    annotate: bool = True,
    figsize: Tuple[int, int] = (7, 7),
):
    """
    Plot a heatmap based on the similarity output.

    Parameters
    ----------
    strings : list of str
        List of strings.
    similarity : str, optional (default='cosine')
        Similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.
    visualize : bool
        If True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        Figure size for plot.

    Returns
    -------
    result : list
        List of results.
    """

[13]: doc2vec_vectorizer.heatmap([string1, string2, string3, string4])


9.41 Semantic Similarity

This tutorial is available as an IPython notebook at Malaya/example/semantic-similarity.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 5.18 s, sys: 1.07 s, total: 6.25 s
Wall time: 7.5 s

[2]: string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
     string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
     string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
     string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'

[3]: news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'
     tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'

9.41.1 List available Transformer models

[4]: malaya.similarity.available_transformer()
INFO:root:tested on 20% test set.
[4]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             423.4                111.0          0.88315       0.88656         0.88405
     tiny-bert         56.6                 15.0          0.87210       0.87546         0.87292
     albert            48.3                 12.8          0.87164       0.87146         0.87155
     tiny-albert       21.9                  6.0          0.82234       0.82383         0.82295
     xlnet            448.7                119.0          0.80866       0.76775         0.77112
     alxlnet           49.0                 13.9          0.88756       0.88700         0.88727

We trained on Quora Question Pairs, translated SNLI and translated MNLI. Make sure you check the accuracy chart at https://malaya.readthedocs.io/en/latest/Accuracy.html#similarity before selecting a model. You might want to use ALXLNET: it has a very small size, 49MB, yet its accuracy is still top notch.


9.41.2 Load transformer model

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer similarity model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
    quantized : bool, optional (default=False)
        If True, will load an 8-bit quantized model. A quantized model is not
        necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.SiameseBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.SiameseXLNET`.
    """

[5]: model = malaya.similarity.transformer(model='alxlnet')

9.41.3 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[6]: quantized_model = malaya.similarity.transformer(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

predict batch of strings with probability

def predict_proba(self, strings_left: List[str], strings_right: List[str]):
    """
    Calculate similarity for two different batches of texts.

    Parameters
    ----------
    strings_left : List[str]
    strings_right : List[str]

    Returns
    -------
    result : List[float]
    """

You need to give a list of left strings and a list of right strings; the first left string will be compared with the first right string, and so on. The similarity model only supports predict_proba.

[7]: model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])
[7]: array([0.99828064, 0.01076903, 0.9603669 , 0.9075881 ], dtype=float32)

[8]: quantized_model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])
[8]: array([0.9987801 , 0.00554545, 0.8729592 , 0.49839294], dtype=float32)

visualize heatmap

def heatmap(
    self,
    strings: List[str],
    visualize: bool = True,
    annotate: bool = True,
    figsize: Tuple[int, int] = (7, 7),
):
    """
    Plot a heatmap based on the similarity output.

    Parameters
    ----------
    strings : list of str
        List of strings.
    visualize : bool
        If True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        Figure size for plot.

    Returns
    -------
    result : list
        List of results.
    """

[9]: model.heatmap([string1, string2, string3, string4])


9.41.4 Vectorize

Let's say you want to visualize sentences in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str]):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result : np.array
    """

[10]: texts = [string1, string2, string3, string4, news1, tweet1]
      r = quantized_model.vectorize(texts)

[11]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(r)
      tsne.shape
[11]: (6, 2)

[12]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = texts
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )


9.41.5 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html. If you want to stack semantic similarity models, you need to pass the labels using the strings_right parameter,

malaya.stack.predict_stack([model1, model2], List[str], strings_right=List[str])

strings_right will be passed as **kwargs.

[13]: alxlnet = malaya.similarity.transformer(model='alxlnet')
      albert = malaya.similarity.transformer(model='albert')
      tiny_bert = malaya.similarity.transformer(model='tiny-bert')

[14]: malaya.stack.predict_stack([alxlnet, albert, tiny_bert], [string1, string2, news1, news1], strings_right=[string3, string4, tweet1, string1])
[14]: array([0.9968965 , 0.17514098, 0.11507297, 0.01998391], dtype=float32)

9.42 Unsupervised Keyword Extraction

We can use any Vectorizer Model to calculate Top N keywords.

This tutorial is available as an IPython notebook at Malaya/example/unsupervised-keyword-extraction.

[1]: import malaya

[2]: # https://www.bharian.com.my/berita/nasional/2020/06/698386/isu-bersatu-tun-m-6-yang-lain-saman-muhyiddin

     string = """
     Dalam saman itu, plaintif memohon perisytiharan, antaranya mereka adalah ahli BERSATU yang sah, masih lagi memegang jawatan dalam parti (bagi pemegang jawatan) dan layak untuk bertanding pada pemilihan parti.

     Mereka memohon perisytiharan bahawa semua surat pemberhentian yang ditandatangani Muhammad Suhaimi bertarikh 28 Mei lalu dan pengesahan melalui mesyuarat Majlis Pimpinan Tertinggi (MPT) parti bertarikh 4 Jun lalu adalah tidak sah dan terbatal.

     Plaintif juga memohon perisytiharan bahawa keahlian Muhyiddin, Hamzah dan Muhammad Suhaimi di dalam BERSATU adalah terlucut, berkuat kuasa pada 28 Februari 2020 dan/atau 29 Februari 2020, menurut Fasal 10.2.3 perlembagaan parti.

     Yang turut dipohon, perisytiharan bahawa Seksyen 18C Akta Pertubuhan 1966 adalah tidak terpakai untuk menghalang pelupusan pertikaian berkenaan oleh mahkamah.

     Perisytiharan lain ialah Fasal 10.2.6 Perlembagaan BERSATU tidak terpakai di atas hal melucutkan/ memberhentikan keahlian semua plaintif.
     """

[3]: import re

     # minimum cleaning, just simply to remove newlines.
     def cleaning(string):
         string = string.replace('\n', ' ')
         string = re.sub('[^A-Za-z\-() ]+', ' ', string).strip()
         string = re.sub(r'[ ]+', ' ', string).strip()
         return string

     string = cleaning(string)

9.42.1 Use RAKE algorithm

The original implementation is from https://github.com/aneesha/RAKE. Malaya added an attention mechanism into the RAKE algorithm.

def rake(
    string: str,
    model = None,
    vectorizer = None,
    top_k: int = 5,
    atleast: int = 1,
    stopwords = get_stopwords,
    **kwargs
):
    """
    Extract keywords using Rake algorithm.

    Parameters
    ----------
    string: str
    model: Object, optional (default=None)
        Transformer model or any model has `attention` method.
    vectorizer: Object, optional (default=None)
        Prefer `sklearn.feature_extraction.text.CountVectorizer` or,
        `malaya.text.vectorizer.SkipGramCountVectorizer`.
        If None, will generate ngram automatically based on `stopwords`.
    top_k: int, optional (default=5)
        return top-k results.
    ngram: tuple, optional (default=(1,1))
        n-grams size.
    atleast: int, optional (default=1)
        at least count appeared in the string to accept as candidate.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]
        For automatic Ngram generator.

    Returns
    -------
    result: Tuple[float, str]
    """
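For intuition, classic RAKE scores a candidate phrase by summing the degree-to-frequency ratio of each of its words over phrase co-occurrence counts. A minimal sketch of that scoring (without Malaya's attention weighting):

from collections import defaultdict

def rake_scores(phrases):
    # word frequency and word degree (co-occurrence within phrases)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        words = phrase.split()
        for w in words:
            freq[w] += 1
            degree[w] += len(words)  # w co-occurs with every word in the phrase
    # a phrase's score is the sum of its words' degree / frequency ratios
    return {p: sum(degree[w] / freq[w] for w in p.split()) for p in phrases}

print(rake_scores(['mesyuarat majlis pimpinan tertinggi', 'pimpinan parti']))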


auto-ngram

This will auto-generate N-size ngrams for keyword candidates.

[4]: malaya.keyword_extraction.rake(string)
[4]: [(0.11666666666666665, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),
     (0.08888888888888888, 'mesyuarat Majlis Pimpinan Tertinggi'),
     (0.08888888888888888, 'Seksyen C Akta Pertubuhan'),
     (0.05138888888888888, 'parti bertarikh Jun'),
     (0.04999999999999999, 'keahlian Muhyiddin Hamzah')]

auto-ngram with Attention

This will use the attention mechanism as the scores. I will use small-electra in this example.

[5]: electra = malaya.transformer.load(model='small-electra')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:242: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:120: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/small/electra-small/model.ckpt

[6]: malaya.keyword_extraction.rake(string, model=electra)
[6]: [(0.21135464299906287, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),
     (0.1707678937548548, 'terlucut berkuat kuasa'),
     (0.1665075410114966, 'Muhammad Suhaimi'),
     (0.16204322474881924, 'mesyuarat Majlis Pimpinan Tertinggi'),
     (0.08333932270307894, 'Seksyen C Akta Pertubuhan')]


using vectorizer

[7]: from malaya.text.vectorizer import SkipGramCountVectorizer

stopwords = malaya.text.function.get_stopwords()
vectorizer = SkipGramCountVectorizer(
    token_pattern=r'[\S]+',
    ngram_range=(1, 3),
    stop_words=stopwords,
    lowercase=False,
    skip=2,
)

[8]: malaya.keyword_extraction.rake(string, vectorizer=vectorizer)
[8]: [(0.0017052987393271276, 'parti memohon perisytiharan'),
     (0.0017036368782590756, 'memohon perisytiharan BERSATU'),
     (0.0017012023597074357, 'memohon perisytiharan sah'),
     (0.0017012023597074357, 'sah memohon perisytiharan'),
     (0.0016992809994779549, 'perisytiharan BERSATU sah')]

fixed-ngram with Attention

[9]: malaya.keyword_extraction.rake(string, model=electra, vectorizer=vectorizer)
[9]: [(0.011575973734122788, 'Suhaimi terlucut kuasa'),
     (0.011181844743375292, 'Suhaimi terlucut berkuat'),
     (0.011115823052133569, 'Hamzah Suhaimi terlucut'),
     (0.011088263093292463, 'Muhammad Suhaimi terlucut'),
     (0.010932739982610082, 'Suhaimi BERSATU terlucut')]

9.42.2 Use Textrank algorithm

Malaya simply uses the Textrank algorithm.

def textrank(
    string: str,
    model = None,
    vectorizer = None,
    top_k: int = 5,
    atleast: int = 1,
    stopwords = get_stopwords,
    **kwargs
):
    """
    Extract keywords using Textrank algorithm.

    Parameters
    ----------
    string: str
    model: Object, optional (default='None')
        model has `fit_transform` or `vectorize` method.
    vectorizer: Object, optional (default=None)
        Prefer `sklearn.feature_extraction.text.CountVectorizer` or,
        `malaya.text.vectorizer.SkipGramCountVectorizer`.
        If None, will generate ngram automatically based on `stopwords`.
    top_k: int, optional (default=5)
        return top-k results.
    atleast: int, optional (default=1)
        at least count appeared in the string to accept as candidate.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]

    Returns
    -------
    result: Tuple[float, str]
    """
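For intuition, Textrank builds a similarity graph over keyword candidates and runs PageRank on it; candidates central to the graph score highest. A minimal sketch using TF-IDF vectors (an illustration of the idea, not Malaya's internals):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_scores(candidates, damping=0.85, iterations=50):
    # similarity graph between candidate phrases
    vectors = TfidfVectorizer().fit_transform(candidates)
    sim = cosine_similarity(vectors)
    np.fill_diagonal(sim, 0.0)
    # column-normalize so each node distributes its score proportionally
    norm = sim / np.maximum(sim.sum(axis=0, keepdims=True), 1e-10)
    scores = np.full(len(candidates), 1.0 / len(candidates))
    for _ in range(iterations):  # power iteration of PageRank
        scores = (1 - damping) / len(candidates) + damping * norm.dot(scores)
    return sorted(zip(scores, candidates), reverse=True)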

[10]: from sklearn.feature_extraction.text import TfidfVectorizer

      tfidf = TfidfVectorizer()

auto-ngram with TFIDF

This will auto-generate N-size ngrams for keyword candidates.

[11]: malaya.keyword_extraction.textrank(string, model=tfidf)
[11]: [(0.00015733542072521276, 'plaintif memohon perisytiharan'),
      (0.00012558967703709954, 'Fasal perlembagaan parti'),
      (0.00011514137183023093, 'Fasal Perlembagaan BERSATU'),
      (0.00011505528232050447, 'parti'),
      (0.00010763519022276223, 'memohon perisytiharan')]

auto-ngram with Attention

This will auto-generate N-size ngrams for keyword candidates.

[12]: electra = malaya.transformer.load(model='small-electra')
      albert = malaya.transformer.load(model='albert')
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/small/electra-small/model.ckpt
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/albert-model/base/albert-base/model.ckpt

[13]: malaya.keyword_extraction.textrank(string, model=electra)
[13]: [(6.318265869072403e-05, 'dipohon perisytiharan'),
      (6.316746537201306e-05, 'pemegang jawatan'),
      (6.316118885596658e-05, 'parti bertarikh Jun'),
      (6.316104343935219e-05, 'Februari'),
      (6.315818745707347e-05, 'plaintif')]

[14]: malaya.keyword_extraction.textrank(string, model=albert)
[14]: [(7.964654245909712e-05, 'Fasal Perlembagaan BERSATU'),
      (7.746139567779304e-05, 'mesyuarat Majlis Pimpinan Tertinggi'),
      (7.522448275120805e-05, 'Muhammad Suhaimi'),
      (7.520443949997106e-05, 'pengesahan'),
      (7.519602119292121e-05, 'terbatal Plaintif')]

Or you can use any classification model to find keywords sensitive to a specific domain.

[15]: sentiment = malaya.sentiment.transformer(model='xlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

[16]: malaya.keyword_extraction.textrank(string, model=sentiment)
[16]: [(6.592349998684001e-05, 'pengesahan'),
      (6.522374046273496e-05, 'parti'),
      (6.519787313586387e-05, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),
      (6.50355056789609e-05, 'memegang jawatan'),
      (6.48614030622403e-05, 'pemilihan parti')]

fixed-ngram with Attention

[17]: stopwords = malaya.text.function.get_stopwords()
      vectorizer = SkipGramCountVectorizer(
          token_pattern=r'[\S]+',
          ngram_range=(1, 3),
          stop_words=stopwords,
          lowercase=False,
          skip=2,
      )

[18]: malaya.keyword_extraction.textrank(string, model=electra, vectorizer=vectorizer)
[18]: [(5.652169440330057e-09, 'plaintif perisytiharan'),
      (5.652075728462069e-09, 'perisytiharan ahli sah'),
      (5.651996176263403e-09, 'Plaintif perisytiharan keahlian'),
      (5.651931485635611e-09, 'Perisytiharan'),
      (5.651703407437562e-09, 'plaintif memohon perisytiharan')]

[19]: malaya.keyword_extraction.textrank(string, model=albert, vectorizer=vectorizer)
[19]: [(7.237609487831676e-09, 'Perisytiharan Fasal Perlembagaan'),
      (7.237148398598793e-09, 'Fasal Perlembagaan melucutkan'),
      (7.234637484224076e-09, 'Pimpinan Tertinggi (MPT)'),
      (7.2318264874552195e-09, 'Majlis Pimpinan (MPT)'),
      (7.231510832160389e-09, 'Perisytiharan Fasal BERSATU')]

9.42.3 Use Attention mechanism

Use the attention mechanism from a transformer model to get important keywords.

def attention(
    string: str,
    model,
    vectorizer = None,
    top_k: int = 5,
    atleast: int = 1,
    stopwords = get_stopwords,
    **kwargs
):
    """
    Extract keywords using Attention mechanism.

    Parameters
    ----------
    string: str
    model: Object
        Transformer model or any model has `attention` method.
    vectorizer: Object, optional (default=None)
        Prefer `sklearn.feature_extraction.text.CountVectorizer` or,
        `malaya.text.vectorizer.SkipGramCountVectorizer`.
        If None, will generate ngram automatically based on `stopwords`.
    top_k: int, optional (default=5)
        return top-k results.
    atleast: int, optional (default=1)
        at least count appeared in the string to accept as candidate.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]

    Returns
    -------
    result: Tuple[float, str]
    """
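Conceptually, this pools the transformer's attention weight on each word, then scores a candidate phrase by its words' pooled weights. A minimal sketch, assuming `model.attention([text])` returns `[[(word, weight), ...]]` (the shape Malaya transformer models expose); an illustration, not Malaya's exact code:

def attention_scores(model, text, candidates):
    # pool attention weight per word across the string
    weights = {}
    for word, weight in model.attention([text])[0]:
        weights[word] = weights.get(word, 0.0) + weight
    # score each candidate phrase by the summed weight of its words
    scores = {c: sum(weights.get(w, 0.0) for w in c.split()) for c in candidates}
    total = sum(scores.values()) or 1.0
    return {c: s / total for c, s in scores.items()}  # normalize to sum to 1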

auto-ngram

This will auto-generate N-size ngrams for keyword candidates.

[20]: malaya.keyword_extraction.attention(string, model=electra)
[20]: [(0.9452064615567696, 'menghalang pelupusan pertikaian'),
      (0.00748668920928296, 'Fasal Perlembagaan BERSATU'),
      (0.005130746086467051, 'ahli BERSATU'),
      (0.005036596770673816, 'melucutkan memberhentikan keahlian'),
      (0.004883705096775167, 'BERSATU')]

[21]: malaya.keyword_extraction.attention(string, model=albert)
[21]: [(0.16196376947988833, 'plaintif memohon perisytiharan'),
      (0.09294069270557498, 'memohon perisytiharan'),
      (0.06902307677431335, 'plaintif'),
      (0.05584833292678144, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),
      (0.05206227265177878, 'dipohon perisytiharan')]


fixed-ngram

[22]: malaya.keyword_extraction.attention(string, model=electra, vectorizer=vectorizer)
[22]: [(0.037611192232396125, 'pertikaian mahkamah Perlembagaan'),
      (0.03757121639209162, 'pertikaian mahkamah Fasal'),
      (0.037563414917813766, 'terpakai pertikaian mahkamah'),
      (0.03756289871618318, 'menghalang pertikaian mahkamah'),
      (0.037561437116523086, 'pelupusan pertikaian mahkamah')]

[23]: malaya.keyword_extraction.attention(string, model=albert, vectorizer=vectorizer)
[23]: [(0.0073900373302097505, 'saman plaintif memohon'),
      (0.006895211361267655, 'Dalam plaintif memohon'),
      (0.006638399608830277, 'plaintif memohon BERSATU'),
      (0.0062231449129606375, 'Dalam saman memohon'),
      (0.006196574312595335, 'plaintif memohon perisytiharan')]

9.42.4 Use similarity mechanism

def similarity(
    string: str,
    model,
    vectorizer = None,
    top_k: int = 5,
    atleast: int = 1,
    stopwords = get_stopwords,
    **kwargs,
):
    """
    Extract keywords using Sentence embedding VS keyword embedding similarity.

    Parameters
    ----------
    string: str
    model: Object
        Transformer model or any model has `vectorize` method.
    vectorizer: Object, optional (default=None)
        Prefer `sklearn.feature_extraction.text.CountVectorizer` or,
        `malaya.text.vectorizer.SkipGramCountVectorizer`.
        If None, will generate ngram automatically based on `stopwords`.
    top_k: int, optional (default=5)
        return top-k results.
    atleast: int, optional (default=1)
        at least count appeared in the string to accept as candidate.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]

    Returns
    -------
    result: Tuple[float, str]
    """

It is best to use with malaya.similarity.transformer(model = 'alxlnet').
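Under the hood the idea is to embed the whole string and every candidate phrase, then rank candidates by cosine similarity against the sentence embedding. A minimal sketch assuming any model with a `vectorize` method (an illustration, not Malaya's internals):

from sklearn.metrics.pairwise import cosine_similarity

def similarity_scores(model, text, candidates, top_k=5):
    # first row is the full sentence, the rest are candidate phrases
    vectors = model.vectorize([text] + candidates)
    sims = cosine_similarity(vectors[:1], vectors[1:])[0]
    return sorted(zip(sims, candidates), reverse=True)[:top_k]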


[4]: alxlnet = malaya.similarity.transformer(model='alxlnet')

[5]: malaya.keyword_extraction.similarity(string, model=alxlnet)
[5]: [(0.817958, 'terbatal Plaintif'),
     (0.79831344, 'memohon perisytiharan'),
     (0.7925713, 'melucutkan memberhentikan keahlian'),
     (0.7921115, 'plaintif memohon perisytiharan'),
     (0.76372087, 'Seksyen C Akta Pertubuhan')]

9.43 Keyphrase similarity

Finetuning transformers to calculate similarity between sentences and keyphrases.

This tutorial is available as an IPython notebook at Malaya/example/keyphrase-similarity.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: import malaya
     import numpy as np

9.43.1 List available Transformer models

[2]: malaya.keyword_extraction.available_transformer()
INFO:root:tested on 20% test set.
[2]:            Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert           443.0                112.0          0.99403       0.99568         0.99485
     tiny-bert       59.5                 15.1          0.99494       0.99707         0.99600
     alxlnet         53.0                 14.0          0.98170       0.99182         0.98663
     xlnet          472.0                120.0          0.99667       0.99819         0.99742

We trained on Twitter Keyphrase Bahasa and Malaysia Entities. Example training set,

[3]: # !wget https://raw.githubusercontent.com/huseinzol05/Malay-Dataset/master/keyphrase/twitter-bahasa/topics.json

     import json

     with open('topics.json') as fopen:
         topics = set(json.load(fopen).keys())

     list_topics = list(topics)
     len(list_topics)
[3]: 949

[4]: import random

     def get_data(data):
         if len(set(data[1]) & topics) and random.random() > 0.2:
             t = random.choice(data[1])
             label = 1
         else:
             s = set(data[1]) | set()
             t = random.choice(list(topics - s))
             label = 0
         return data[0], t, label

[14]: data = ('Peguam dikuarantin, kes 1MDB ditangguh', ['najib razak'])

[23]: get_data(data)
[23]: ('Peguam dikuarantin, kes 1MDB ditangguh', 'najib razak', 1)

Sometimes it will return a random topic from the corpus and give label 0.

[22]: get_data(data)
[22]: ('Peguam dikuarantin, kes 1MDB ditangguh', 'car camera', 0)

9.43.2 Load transformer model

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer keyword similarity model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.KeyphraseBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.KeyphraseXLNET`.
    """

[5]: tiny_bert = malaya.keyword_extraction.transformer(model='tiny-bert')
     alxlnet = malaya.keyword_extraction.transformer(model='alxlnet')

[6]: # !wget https://raw.githubusercontent.com/huseinzol05/Malay-Dataset/master/keyphrase/twitter-bahasa/testset-keyphrase.json

     with open('testset-keyphrase.json') as fopen:
         testset = json.load(fopen)

[7]: testset[:10]
[7]: [['Takdak gambar raya ', 'myburgerlab restaurant', 0],
     ['Menyampah aku tngk cerita ayda jebat pukul 7 ni, mcm bodoh je', 'samsung smartphone', 0],
     ['Alhamdulillah ala kulli hal.', 'fifa', 0],
     ['@mubimalaysia @sharifahamani 018 9828689 no fon saya kak amani', 'jabatan parlimen malaysia', 0],
     ['Ya ga seru lah, seruan lawan Burnley sama Brighton ', 'asus smartphone', 0],
     ['Senyumlah. Allah tahu kau sedih tapi senyumlah. Dont give up now. Yelah life is not always rainbow and flowers kan. Kadang-kadang you stuck on cloudy day and step on thorn.', 'shop property', 0],
     ['Serabut fikir pasal kerja.. Pastu memalukan diri kat lrt! Kau ingat kau anak menteri ke nak naik free? Confident je https://t.co/Qe3XgyUllt', 'pengangkutan awam', 1],
     ['Melaka peeps !!! I jumpe satu restaurant yg best sangat kalau nk lepak2 or makan with whole family !! Tmpt selesa ! Makanan pun sedap ! Harga pun not bad !!! \n\nrestaurant markisar https://t.co/dnwkqqlt5z', 'gejala sosial', 1],
     ['@farzanamahmud7 haha eh air nira kan sedap minum sejuk time panas camnie haha', 'beverage', 1],
     ['Ayah...kenapa kita gak punya mobil? Pertanyaan Bintang siang ini saat hujan turun dgn derasnya.\nSemua orang yg punya mobil itu cuman titipan nak, nah kebetulan ayah gak kebagian dititipin, kebagian dititipan bawa spd motor. Enak naik motor, adem. Hehe.\nBintang ketawa https://t.co/ToYBTEHJoA', 'motorcycle type', 1]]


predict batch of strings with probability

def predict_proba(self, strings_left: List[str], strings_right: List[str]):
    """
    calculate similarity for two different batch of texts.

    Parameters
    ----------
    strings_left : List[str]
    strings_right : List[str]

    Returns
    -------
    result : List[float]
    """

You need to give a list of left strings and a list of right strings; the first left string will be compared with the first right string, and so on. The similarity model only supports predict_proba.

[8]: texts, keyphrases, labels = [], [], []
     for i in range(10):
         texts.append(testset[i][0])
         keyphrases.append(testset[i][1])
         labels.append(testset[i][2])

[9]: np.around(tiny_bert.predict_proba(texts, keyphrases))
[9]: array([0., 0., 0., 0., 0., 0., 1., 1., 1., 1.], dtype=float32)

[10]: np.around(alxlnet.predict_proba(texts, keyphrases))
[10]: array([1., 1., 1., 1., 0., 1., 1., 1., 1., 1.], dtype=float32)

[12]: np.around(tiny_bert.predict_proba(texts, keyphrases)) == np.array(labels)
[12]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True])

[13]: np.around(alxlnet.predict_proba(texts, keyphrases)) == np.array(labels)
[13]: array([False,  True, False, False, False, False,  True,  True,  True,  True])

9.43.3 Vectorize

Let's say you want to visualize sentences in lower dimension, you can use model.vectorize,

def vectorize(self, strings: List[str]):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: np.array
    """

[16]: v_texts = tiny_bert.vectorize(texts)
      v_keyphrases = tiny_bert.vectorize(keyphrases)
      v_texts.shape, v_keyphrases.shape
[16]: ((10, 312), (10, 312))

[25]: from sklearn.metrics.pairwise import cosine_similarity

      similarities = cosine_similarity(v_keyphrases, v_texts)

[22]: import matplotlib.pyplot as plt
      import seaborn as sns

      sns.set()

[28]: plt.figure(figsize=(7, 7))
      g = sns.heatmap(
          similarities,
          cmap='Blues',
          xticklabels=keyphrases,
          yticklabels=texts,
          annot=True,
      )
      plt.show()

[29]: v_texts = alxlnet.vectorize(texts)
      v_keyphrases = alxlnet.vectorize(keyphrases)
      v_texts.shape, v_keyphrases.shape
[29]: ((10, 768), (10, 768))

[30]: similarities = cosine_similarity(v_keyphrases, v_texts)

[31]: plt.figure(figsize=(7, 7))
      g = sns.heatmap(
          similarities,
          cmap='Blues',
          xticklabels=keyphrases,
          yticklabels=texts,
          annot=True,
      )
      plt.show()

[32]: text = 'Peguam dikuarantin, kes 1MDB ditangguh'
      label = 'najib razak'

[33]: v = tiny_bert.vectorize([text, label])

[34]: cosine_similarity(v)
[34]: array([[0.99999994, 0.48644015],
             [0.48644015, 1.0000002 ]], dtype=float32)

[35]: v = alxlnet.vectorize([text, label])

[36]: cosine_similarity(v)
[36]: array([[1.0000002, 0.3488139],
             [0.3488139, 1.0000001]], dtype=float32)


9.44 Entities Recognition

This tutorial is available as an IPython notebook at Malaya/example/entities.

This module is only trained on standard language structure, so it is not safe to use it for local language structure.

[1]: %%time
     import malaya
CPU times: user 4.79 s, sys: 950 ms, total: 5.74 s
Wall time: 6.79 s

9.44.1 Describe supported entities

[2]: import pandas as pd

     pd.set_option('display.max_colwidth', -1)
     malaya.entity.describe()
[2]:            Tag                                                             Description
     0         OTHER  other
     1           law  law, regulation, related law documents, documents, etc
     2      location  location, place
     3  organization  organization, company, government, facilities, etc
     4        person  person, group of people, believes, unique arts (eg; food, drink), etc
     5      quantity  numbers, quantity
     6          time  date, day, time, etc
     7         event  unique event happened, etc

9.44.2 Describe supported Ontonotes 5 entities

[3]: malaya.entity.describe_ontonotes5()
[3]:            Tag                                            Description
     0        OTHER  other
     1      ADDRESS  Address of physical location.
     2       PERSON  People, including fictional.
     3         NORP  Nationalities or religious or political groups.
     4          FAC  Buildings, airports, highways, bridges, etc.
     5          ORG  Companies, agencies, institutions, etc.
     6          GPE  Countries, cities, states.
     7          LOC  Non-GPE locations, mountain ranges, bodies of water.
     8      PRODUCT  Objects, vehicles, foods, etc. (Not services.)
     9        EVENT  Named hurricanes, battles, wars, sports events, etc.
     10 WORK_OF_ART  Titles of books, songs, etc.
     11         LAW  Named documents made into laws.
     12    LANGUAGE  Any named language.
     13        DATE  Absolute or relative dates or periods.
     14        TIME  Times smaller than a day.
     15     PERCENT  Percentage, including "%".
     16       MONEY  Monetary values, including unit.
     17    QUANTITY  Measurements, as of weight or distance.
     18     ORDINAL  "first", "second", etc.
     19    CARDINAL  Numerals that do not fall under another type.

9.44.3 List available Transformer NER models

[4]: malaya.entity.available_transformer()
INFO:root:tested on 20% test set.
[4]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.4               111.00          0.99291       0.97864         0.98537
     tiny-bert         57.7                15.40          0.98151       0.94754         0.96134
     albert            48.6                12.80          0.98026       0.95332         0.96492
     tiny-albert       22.4                 5.98          0.96100       0.90363         0.92374
     xlnet            446.6               118.00          0.99344       0.98154         0.98725
     alxlnet           46.8                13.30          0.99215       0.97575         0.98337

Make sure you check the accuracy chart here first before selecting a model: https://malaya.readthedocs.io/en/latest/models-accuracy.html#Entities-Recognition

9.44.4 List available Transformer NER Ontonotes 5 models

[5]: malaya.entity.available_transformer_ontonotes5()
INFO:root:tested on 20% test set.
[5]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             425.4               111.00          0.94460       0.93244         0.93822
     tiny-bert         57.7                15.40          0.91908       0.91635         0.91704
     albert            48.6                12.80          0.93010       0.92341         0.92636
     tiny-albert       22.4                 5.98          0.90298       0.88251         0.89145
     xlnet            446.6               118.00          0.93814       0.95021         0.94388
     alxlnet           46.8                13.30          0.93244       0.92942         0.93047

Make sure you check the accuracy chart here first before selecting a model: https://malaya.readthedocs.io/en/latest/models-accuracy.html#Entities-Recognition-Ontonotes5

[36]: string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'
      string1 = 'memperkenalkan Husein, dia sangat comel, berumur 25 tahun, bangsa melayu, agama islam, tinggal di cyberjaya malaysia, bercakap bahasa melayu, semua membaca buku undang-undang kewangan, dengar laju Siti Nurhaliza - Seluruh Cinta sambil makan ayam goreng KFC'

9.44.5 Load Transformer model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer Entity Tagging model, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.tag.transformer function
    """

[7]: model = malaya.entity.transformer(model='alxlnet')
INFO:root:running entity/alxlnet using device /device:CPU:0


Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[8]: quantized_model = malaya.entity.transformer(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity/alxlnet-quantized using device /device:CPU:0

Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: Tuple[str, str]
    """

[9]: model.predict(string)
[9]: [('KUALA', 'location'), ('LUMPUR', 'location'), (':', 'OTHER'), ('Sempena', 'OTHER'), ('sambutan', 'OTHER'),
     ('Aidilfitri', 'event'), ('minggu', 'time'), ('depan', 'time'), (',', 'OTHER'), ('Perdana', 'person'),
     ('Menteri', 'person'), ('Tun', 'person'), ('Dr', 'person'), ('Mahathir', 'person'), ('Mohamad', 'person'),
     ('dan', 'OTHER'), ('Menteri', 'organization'), ('Pengangkutan', 'organization'), ('Anthony', 'person'),
     ('Loke', 'person'), ('Siew', 'person'), ('Fook', 'person'), ('menitipkan', 'OTHER'), ('pesanan', 'OTHER'),
     ('khas', 'OTHER'), ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'),
     ('mahu', 'OTHER'), ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'OTHER'), ('halaman', 'location'),
     ('masing-masing', 'OTHER'), ('.', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'),
     ('terbitan', 'OTHER'), ('Jabatan', 'organization'), ('Keselamatan', 'organization'), ('Jalan', 'organization'),
     ('Raya', 'organization'), ('(', 'organization'), ('JKJR', 'organization'), (')', 'organization'),
     ('itu', 'OTHER'), (',', 'OTHER'), ('Dr', 'person'), ('Mahathir', 'person'), ('menasihati', 'OTHER'),
     ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'), ('dan', 'OTHER'),
     ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'OTHER'), ('ketika', 'OTHER'),
     ('memandu', 'OTHER'), ('.', 'OTHER')]

[37]: model.predict(string1)
[37]: [('memperkenalkan', 'OTHER'), ('Husein', 'person'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'OTHER'), ('25', 'OTHER'), ('tahun', 'OTHER'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'person'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'person'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'location'),
      ('malaysia', 'location'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'OTHER'), ('melayu', 'person'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'person'),
      ('Nurhaliza', 'person'), ('-', 'OTHER'), ('Seluruh', 'OTHER'), ('Cinta', 'OTHER'), ('sambil', 'OTHER'),
      ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'location')]

[11]: quantized_model.predict(string)
[11]: [('KUALA', 'location'), ('LUMPUR', 'location'), (':', 'OTHER'), ('Sempena', 'OTHER'), ('sambutan', 'OTHER'),
      ('Aidilfitri', 'event'), ('minggu', 'time'), ('depan', 'time'), (',', 'OTHER'), ('Perdana', 'person'),
      ('Menteri', 'person'), ('Tun', 'person'), ('Dr', 'person'), ('Mahathir', 'person'), ('Mohamad', 'person'),
      ('dan', 'OTHER'), ('Menteri', 'person'), ('Pengangkutan', 'person'), ('Anthony', 'person'),
      ('Loke', 'person'), ('Siew', 'person'), ('Fook', 'person'), ('menitipkan', 'OTHER'), ('pesanan', 'OTHER'),
      ('khas', 'OTHER'), ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'),
      ('mahu', 'OTHER'), ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'OTHER'), ('halaman', 'OTHER'),
      ('masing-masing', 'OTHER'), ('.', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'),
      ('terbitan', 'OTHER'), ('Jabatan', 'organization'), ('Keselamatan', 'organization'), ('Jalan', 'organization'),
      ('Raya', 'organization'), ('(', 'organization'), ('JKJR', 'organization'), (')', 'organization'),
      ('itu', 'OTHER'), (',', 'OTHER'), ('Dr', 'person'), ('Mahathir', 'person'), ('menasihati', 'OTHER'),
      ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'), ('dan', 'OTHER'),
      ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'OTHER'), ('ketika', 'OTHER'),
      ('memandu', 'OTHER'), ('.', 'OTHER')]

[38]: quantized_model.predict(string1)
[38]: [('memperkenalkan', 'OTHER'), ('Husein', 'person'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'OTHER'), ('25', 'OTHER'), ('tahun', 'OTHER'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'person'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'person'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'location'),
      ('malaysia', 'location'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'OTHER'), ('melayu', 'person'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'person'),
      ('Nurhaliza', 'person'), ('-', 'OTHER'), ('Seluruh', 'OTHER'), ('Cinta', 'OTHER'), ('sambil', 'OTHER'),
      ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'organization')]

Group similar tags

def analyze(self, string: str):
    """
    Analyze a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
    """
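For intuition, analyze is essentially predict followed by grouping contiguous tokens that share a tag into spans with offsets. A minimal sketch of that grouping (an illustration, not Malaya's exact code):

def group_tags(tagged):
    # merge runs of identical tags into span dictionaries with offsets
    spans = []
    for offset, (word, tag) in enumerate(tagged):
        if spans and spans[-1]['type'] == tag:
            spans[-1]['text'].append(word)
            spans[-1]['endOffset'] = offset + 1
        else:
            spans.append({'text': [word], 'type': tag, 'score': 1.0,
                          'beginOffset': offset, 'endOffset': offset + 1})
    return spans

group_tags([('KUALA', 'location'), ('LUMPUR', 'location'), (':', 'OTHER')])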

[13]: model.analyze(string)
[13]: [{'text': ['KUALA', 'LUMPUR'], 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 2},
      {'text': [':', 'Sempena', 'sambutan'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 2, 'endOffset': 5},
      {'text': ['Aidilfitri'], 'type': 'event', 'score': 1.0, 'beginOffset': 5, 'endOffset': 6},
      {'text': ['minggu'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 6, 'endOffset': 7},
      {'text': ['depan'], 'type': 'time', 'score': 1.0, 'beginOffset': 7, 'endOffset': 8},
      {'text': [','], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 8, 'endOffset': 9},
      {'text': ['Perdana', 'Menteri', 'Tun', 'Dr', 'Mahathir', 'Mohamad'], 'type': 'person', 'score': 1.0, 'beginOffset': 9, 'endOffset': 15},
      {'text': ['dan'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 15, 'endOffset': 16},
      {'text': ['Menteri', 'Pengangkutan'], 'type': 'organization', 'score': 1.0, 'beginOffset': 16, 'endOffset': 18},
      {'text': ['Anthony', 'Loke', 'Siew', 'Fook'], 'type': 'person', 'score': 1.0, 'beginOffset': 18, 'endOffset': 22},
      {'text': ['menitipkan', 'pesanan', 'khas', 'kepada', 'orang', 'ramai', 'yang', 'mahu', 'pulang', 'ke', 'kampung', 'halaman', 'masing-masing', '.', 'Dalam', 'video', 'pendek', 'terbitan'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 22, 'endOffset': 40},
      {'text': ['Jabatan', 'Keselamatan', 'Jalan', 'Raya', '(', 'JKJR', ')'], 'type': 'organization', 'score': 1.0, 'beginOffset': 40, 'endOffset': 47},
      {'text': ['itu', ','], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 47, 'endOffset': 49},
      {'text': ['Dr', 'Mahathir'], 'type': 'person', 'score': 1.0, 'beginOffset': 49, 'endOffset': 51},
      {'text': ['menasihati', 'mereka', 'supaya', 'berhenti', 'berehat', 'dan', 'tidur', 'sebentar', 'sekiranya', 'mengantuk', 'ketika', 'memandu', '.'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 51, 'endOffset': 64}]

[39]: model.analyze(string1)
[39]: [{'text': ['memperkenalkan'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1},
      {'text': ['Husein'], 'type': 'person', 'score': 1.0, 'beginOffset': 1, 'endOffset': 2},
      {'text': [',', 'dia', 'sangat', 'comel', ',', 'berumur', '25', 'tahun', ',', 'bangsa'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 2, 'endOffset': 12},
      {'text': ['melayu'], 'type': 'person', 'score': 1.0, 'beginOffset': 12, 'endOffset': 13},
      {'text': [',', 'agama'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 13, 'endOffset': 15},
      {'text': ['islam'], 'type': 'person', 'score': 1.0, 'beginOffset': 15, 'endOffset': 16},
      {'text': [',', 'tinggal', 'di'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 16, 'endOffset': 19},
      {'text': ['cyberjaya', 'malaysia'], 'type': 'location', 'score': 1.0, 'beginOffset': 19, 'endOffset': 21},
      {'text': [',', 'bercakap', 'bahasa'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 21, 'endOffset': 24},
      {'text': ['melayu'], 'type': 'person', 'score': 1.0, 'beginOffset': 24, 'endOffset': 25},
      {'text': [',', 'semua', 'membaca', 'buku', 'undang-undang', 'kewangan', ',', 'dengar', 'laju'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 25, 'endOffset': 34},
      {'text': ['Siti', 'Nurhaliza'], 'type': 'person', 'score': 1.0, 'beginOffset': 34, 'endOffset': 36},
      {'text': ['-', 'Seluruh', 'Cinta', 'sambil', 'makan', 'ayam', 'goreng'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 36, 'endOffset': 43},
      {'text': ['KFC'], 'type': 'organization', 'score': 1.0, 'beginOffset': 43, 'endOffset': 44}]

Vectorize

Let's say you want to visualize word level in lower dimension, you can use model.vectorize,

def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string: List[str]

    Returns
    -------
    result: np.array
    """

[15]: strings = [string,
                 'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',
                 'contact Husein at [email protected]',
                 'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek']

[16]: r = [quantized_model.vectorize(string) for string in strings]

[17]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[18]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(y)
      tsne.shape
[18]: (124, 2)


[19]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')

Pretty good, the model is able to cluster similar entities.


9.44.6 Load Transformer Ontonotes 5 model

def transformer_ontonotes5(
    model: str = 'xlnet', quantized: bool = False, **kwargs
):
    """
    Load Transformer Entity Tagging model trained on Ontonotes 5 Bahasa, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.tag.transformer function
    """

[20]: albert = malaya.entity.transformer_ontonotes5(model='albert')
INFO:root:running entity-ontonotes5/albert using device /device:CPU:0

[21]: alxlnet = malaya.entity.transformer_ontonotes5(model='alxlnet')
INFO:root:running entity-ontonotes5/alxlnet using device /device:CPU:0

Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False. We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[22]: quantized_albert = malaya.entity.transformer_ontonotes5(model='albert', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity-ontonotes5/albert-quantized using device /device:CPU:0

[23]: quantized_alxlnet = malaya.entity.transformer_ontonotes5(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity-ontonotes5/alxlnet-quantized using device /device:CPU:0


Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: Tuple[str, str]
    """

[24]: albert.predict(string)
[24]: [('KUALA', 'GPE'), ('LUMPUR', 'GPE'), (':', 'OTHER'), ('Sempena', 'OTHER'), ('sambutan', 'OTHER'),
      ('Aidilfitri', 'DATE'), ('minggu', 'OTHER'), ('depan', 'OTHER'), (',', 'OTHER'), ('Perdana', 'OTHER'),
      ('Menteri', 'OTHER'), ('Tun', 'PERSON'), ('Dr', 'PERSON'), ('Mahathir', 'PERSON'), ('Mohamad', 'PERSON'),
      ('dan', 'OTHER'), ('Menteri', 'OTHER'), ('Pengangkutan', 'OTHER'), ('Anthony', 'PERSON'), ('Loke', 'PERSON'),
      ('Siew', 'PERSON'), ('Fook', 'PERSON'), ('menitipkan', 'OTHER'), ('pesanan', 'OTHER'), ('khas', 'OTHER'),
      ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'), ('mahu', 'OTHER'),
      ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'OTHER'), ('halaman', 'OTHER'), ('masing-masing', 'OTHER'),
      ('.', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'), ('terbitan', 'OTHER'),
      ('Jabatan', 'ORG'), ('Keselamatan', 'ORG'), ('Jalan', 'ORG'), ('Raya', 'ORG'), ('(', 'ORG'),
      ('JKJR', 'ORG'), (')', 'ORG'), ('itu', 'OTHER'), (',', 'OTHER'), ('Dr', 'PERSON'), ('Mahathir', 'PERSON'),
      ('menasihati', 'OTHER'), ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'),
      ('dan', 'OTHER'), ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'OTHER'),
      ('ketika', 'OTHER'), ('memandu', 'OTHER'), ('.', 'OTHER')]

[25]: alxlnet.predict(string)
[25]: [('KUALA', 'EVENT'), ('LUMPUR', 'EVENT'), (':', 'OTHER'), ('Sempena', 'OTHER'), ('sambutan', 'DATE'),
      ('Aidilfitri', 'DATE'), ('minggu', 'DATE'), ('depan', 'DATE'), (',', 'OTHER'), ('Perdana', 'OTHER'),
      ('Menteri', 'OTHER'), ('Tun', 'PERSON'), ('Dr', 'PERSON'), ('Mahathir', 'PERSON'), ('Mohamad', 'PERSON'),
      ('dan', 'OTHER'), ('Menteri', 'OTHER'), ('Pengangkutan', 'OTHER'), ('Anthony', 'PERSON'), ('Loke', 'PERSON'),
      ('Siew', 'PERSON'), ('Fook', 'PERSON'), ('menitipkan', 'OTHER'), ('pesanan', 'OTHER'), ('khas', 'OTHER'),
      ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'), ('mahu', 'OTHER'),
      ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'OTHER'), ('halaman', 'OTHER'), ('masing-masing', 'OTHER'),
      ('.', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'), ('terbitan', 'OTHER'),
      ('Jabatan', 'ORG'), ('Keselamatan', 'ORG'), ('Jalan', 'ORG'), ('Raya', 'ORG'), ('(', 'ORG'),
      ('JKJR', 'ORG'), (')', 'ORG'), ('itu', 'OTHER'), (',', 'OTHER'), ('Dr', 'OTHER'), ('Mahathir', 'PERSON'),
      ('menasihati', 'OTHER'), ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'),
      ('dan', 'OTHER'), ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'OTHER'),
      ('ketika', 'OTHER'), ('memandu', 'OTHER'), ('.', 'OTHER')]

[40]: albert.predict(string1)
[40]: [('memperkenalkan', 'OTHER'), ('Husein', 'PERSON'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'DATE'), ('25', 'DATE'), ('tahun', 'DATE'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'OTHER'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'OTHER'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'GPE'),
      ('malaysia', 'GPE'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'OTHER'), ('melayu', 'OTHER'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'WORK_OF_ART'),
      ('Nurhaliza', 'WORK_OF_ART'), ('-', 'WORK_OF_ART'), ('Seluruh', 'WORK_OF_ART'), ('Cinta', 'WORK_OF_ART'),
      ('sambil', 'OTHER'), ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'ORG')]

[41]: alxlnet.predict(string1)
[41]: [('memperkenalkan', 'OTHER'), ('Husein', 'PERSON'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'OTHER'), ('25', 'DATE'), ('tahun', 'DATE'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'OTHER'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'NORP'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'GPE'),
      ('malaysia', 'GPE'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'LANGUAGE'), ('melayu', 'LANGUAGE'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'WORK_OF_ART'),
      ('Nurhaliza', 'WORK_OF_ART'), ('-', 'WORK_OF_ART'), ('Seluruh', 'WORK_OF_ART'), ('Cinta', 'WORK_OF_ART'),
      ('sambil', 'OTHER'), ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'OTHER')]

[28]: quantized_albert.predict(string)
[28]: [('KUALA', 'GPE'), ('LUMPUR', 'GPE'), (':', 'OTHER'), ('Sempena', 'OTHER'), ('sambutan', 'OTHER'),
      ('Aidilfitri', 'DATE'), ('minggu', 'OTHER'), ('depan', 'OTHER'), (',', 'OTHER'), ('Perdana', 'OTHER'),
      ('Menteri', 'OTHER'), ('Tun', 'PERSON'), ('Dr', 'PERSON'), ('Mahathir', 'PERSON'), ('Mohamad', 'PERSON'),
      ('dan', 'OTHER'), ('Menteri', 'OTHER'), ('Pengangkutan', 'OTHER'), ('Anthony', 'PERSON'), ('Loke', 'PERSON'),
      ('Siew', 'PERSON'), ('Fook', 'PERSON'), ('menitipkan', 'OTHER'), ('pesanan', 'OTHER'), ('khas', 'OTHER'),
      ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'), ('mahu', 'OTHER'),
      ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'OTHER'), ('halaman', 'OTHER'), ('masing-masing', 'OTHER'),
      ('.', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'), ('terbitan', 'OTHER'),
      ('Jabatan', 'ORG'), ('Keselamatan', 'ORG'), ('Jalan', 'ORG'), ('Raya', 'ORG'), ('(', 'ORG'),
      ('JKJR', 'ORG'), (')', 'ORG'), ('itu', 'OTHER'), (',', 'OTHER'), ('Dr', 'PERSON'), ('Mahathir', 'PERSON'),
      ('menasihati', 'OTHER'), ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'),
      ('dan', 'OTHER'), ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'OTHER'),
      ('ketika', 'OTHER'), ('memandu', 'OTHER'), ('.', 'OTHER')]

[42]: quantized_alxlnet.predict(string1)
[42]: [('memperkenalkan', 'OTHER'), ('Husein', 'PERSON'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'DATE'), ('25', 'DATE'), ('tahun', 'DATE'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'OTHER'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'OTHER'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'GPE'),
      ('malaysia', 'GPE'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'OTHER'), ('melayu', 'OTHER'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'WORK_OF_ART'),
      ('Nurhaliza', 'WORK_OF_ART'), ('-', 'X'), ('Seluruh', 'WORK_OF_ART'), ('Cinta', 'WORK_OF_ART'),
      ('sambil', 'OTHER'), ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'OTHER')]

Group similar tags

def analyze(self, string: str):
    """
    Analyze a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
    """

[30]: alxlnet.analyze(string1)
[30]: [{'text': ['memperkenalkan', 'Husein', ',', 'dia', 'sangat', 'comel', ','], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 0, 'endOffset': 7},
      {'text': ['berumur', '25', 'tahun'], 'type': 'DATE', 'score': 1.0, 'beginOffset': 7, 'endOffset': 10},
      {'text': [',', 'bangsa', 'melayu', ',', 'agama'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 10, 'endOffset': 15},
      {'text': ['islam'], 'type': 'NORP', 'score': 1.0, 'beginOffset': 15, 'endOffset': 16},
      {'text': [',', 'tinggal', 'di'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 16, 'endOffset': 19},
      {'text': ['cyberjaya'], 'type': 'GPE', 'score': 1.0, 'beginOffset': 19, 'endOffset': 20},
      {'text': ['malaysia', ',', 'bercakap', 'bahasa', 'melayu', ',', 'semua', 'membaca', 'buku', 'undang-undang', 'kewangan', ',', 'dengar', 'laju'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 20, 'endOffset': 34},
      {'text': ['Justin', 'Bieber'], 'type': 'ORG', 'score': 1.0, 'beginOffset': 34, 'endOffset': 36},
      {'text': ['-', 'Baby'], 'type': 'X', 'score': 1.0, 'beginOffset': 36, 'endOffset': 38},
      {'text': ['sambil', 'makan', 'ayam', 'goreng'], 'type': 'OTHER', 'score': 1.0, 'beginOffset': 38, 'endOffset': 42},
      {'text': ['KFC'], 'type': 'ORG', 'score': 1.0, 'beginOffset': 42, 'endOffset': 43}]


Vectorize

Let's say you want to visualize word level in lower dimension, you can use model.vectorize,

def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string: List[str]

    Returns
    -------
    result: np.array
    """

[31]: strings = [string, string1]
      r = [quantized_model.vectorize(string) for string in strings]

[32]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[33]: tsne = TSNE().fit_transform(y)
      tsne.shape
[33]: (107, 2)

[48]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')


Pretty good, the model is able to cluster similar entities.

9.44.7 Load general Malaya entity model

This model is able to classify:

1. date
2. money
3. temperature
4. distance
5. volume
6. duration
7. phone
8. email
9. url
10. time
11. datetime
12. local and generic foods, can check available rules in malaya.texts._food
13. local and generic drinks, can check available rules in malaya.texts._food


We can insert BERT or any deep learning model by passing malaya.entity.general_entity(model = model), as long as the model has a predict method and returns [(string, label), (string, label)]. This is optional; see the sketch below.
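Any object with that interface works. A minimal sketch with a hypothetical stand-in tagger (DummyTagger is not part of Malaya):

class DummyTagger:
    # general_entity only requires a `predict` method that returns
    # [(token, label), ...]; this hypothetical tagger labels everything OTHER.
    def predict(self, string: str):
        return [(token, 'OTHER') for token in string.split()]

dummy_entity = malaya.entity.general_entity(model=DummyTagger())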

[32]: entity = malaya.entity.general_entity(model=model)

[33]: entity.predict('Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais')
[33]: {'PERSON': ['Husein'],
      'OTHER': ['baca buku Perlembagaan yang berharga', 'ringgit dekat kfc sungai petani', ', suhu 32 celcius, sambil makan ayam goreng dan milo o ais'],
      'CARDINAL': ['3k'],
      'DATE': ['minggu lepas,', '2019'],
      'TIME': ['2 ptg'],
      'MONEY': ['2 oktober'],
      'date': {'2 oktober 2019': datetime.datetime(2019, 10, 2, 0, 0), 'minggu lalu': datetime.datetime(2021, 2, 11, 13, 27, 58, 82807)},
      'money': {'3k ringgit': 'RM3000.0'},
      'temperature': ['32 celcius'],
      'distance': [],
      'volume': [],
      'duration': [],
      'phone': [],
      'email': [],
      'url': [],
      'time': {'2 PM': datetime.datetime(2021, 2, 18, 14, 0)},
      'datetime': {'2 ptg 2 oktober 2019': datetime.datetime(2019, 10, 2, 14, 0)},
      'food': ['ayam goreng'],
      'drink': ['milo o ais'],
      'weight': []}

[34]: entity.predict('contact Husein at [email protected]')
[34]: {'OTHER': ['contact Husein at [email protected]'],
      'date': {},
      'money': {},
      'temperature': [],
      'distance': [],
      'volume': [],
      'duration': [],
      'phone': [],
      'email': ['[email protected]'],
      'url': [],
      'time': {},
      'datetime': {},
      'food': [],
      'drink': [],
      'weight': []}

[35]: entity.predict('tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek')
[35]: {'OTHER': ['tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik', 'dekat'],
      'DATE': ['esok'],
      'ORG': ['Restoran Sebulek'],
      'date': {'esok': datetime.datetime(2021, 2, 19, 13, 27, 58, 505853)},
      'money': {},
      'temperature': [],
      'distance': [],
      'volume': [],
      'duration': [],
      'phone': [],
      'email': [],
      'url': [],
      'time': {},
      'datetime': {},
      'food': ['nasi dagang'],
      'drink': ['milo tarik', 'jus apple'],
      'weight': []}

9.44.8 Voting stack model
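malaya.stack.voting_stack tags the string with every model and keeps, for each token, the most common tag across models. A minimal sketch of that majority vote (an illustration, not Malaya's exact code):

from collections import Counter

def majority_vote(models, string):
    # tag with every model, then keep the most frequent tag per token
    tagged = [m.predict(string) for m in models]
    results = []
    for position, (word, _) in enumerate(tagged[0]):
        tags = [t[position][1] for t in tagged]
        results.append((word, Counter(tags).most_common(1)[0][0]))
    return results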

[43]: malaya.stack.voting_stack([albert, alxlnet, alxlnet], string1)
[43]: [('memperkenalkan', 'OTHER'), ('Husein', 'PERSON'), (',', 'OTHER'), ('dia', 'OTHER'), ('sangat', 'OTHER'),
      ('comel', 'OTHER'), (',', 'OTHER'), ('berumur', 'DATE'), ('25', 'DATE'), ('tahun', 'DATE'),
      (',', 'OTHER'), ('bangsa', 'OTHER'), ('melayu', 'OTHER'), (',', 'OTHER'), ('agama', 'OTHER'),
      ('islam', 'OTHER'), (',', 'OTHER'), ('tinggal', 'OTHER'), ('di', 'OTHER'), ('cyberjaya', 'GPE'),
      ('malaysia', 'GPE'), (',', 'OTHER'), ('bercakap', 'OTHER'), ('bahasa', 'OTHER'), ('melayu', 'LANGUAGE'),
      (',', 'OTHER'), ('semua', 'OTHER'), ('membaca', 'OTHER'), ('buku', 'OTHER'), ('undang-undang', 'OTHER'),
      ('kewangan', 'OTHER'), (',', 'OTHER'), ('dengar', 'OTHER'), ('laju', 'OTHER'), ('Siti', 'WORK_OF_ART'),
      ('Nurhaliza', 'WORK_OF_ART'), ('-', 'WORK_OF_ART'), ('Seluruh', 'WORK_OF_ART'), ('Cinta', 'WORK_OF_ART'),
      ('sambil', 'OTHER'), ('makan', 'OTHER'), ('ayam', 'OTHER'), ('goreng', 'OTHER'), ('KFC', 'ORG')]

9.45 Part-of-Speech Recognition

This tutorial is available as an IPython notebook at Malaya/example/part-of-speech.

This module is only trained on standard language structure, so it is not safe to use it for local language structure.

[1]: %%time
     import malaya
CPU times: user 4.09 s, sys: 556 ms, total: 4.65 s
Wall time: 3.75 s

9.45.1 Describe supported POS

[2]: malaya.pos.describe()
[2]:      Tag                           Description
     0     ADJ                 Adjective, kata sifat
     1     ADP                            Adposition
     2     ADV               Adverb, kata keterangan
     3     ADX   Auxiliary verb, kata kerja tambahan
     4   CCONJ  Coordinating conjuction, kata hubung
     5     DET              Determiner, kata penentu
     6    NOUN                      Noun, kata nama
     7     NUM                        Number, nombor
     8    PART                              Particle
     9    PRON                   Pronoun, kata ganti
     10  PROPN     Proper noun, kata ganti nama khas
     11  SCONJ             Subordinating conjunction
     12    SYM                                Symbol
     13   VERB                      Verb, kata kerja
     14      X                                 Other


9.45.2 List available Transformer POS models

[3]: malaya.pos.available_transformer()
INFO:root:tested on 20% test set.
[3]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             426.4               111.00          0.93280       0.93129         0.93181
     tiny-bert         57.7                15.40          0.92810       0.92649         0.92704
     albert            48.7                12.80          0.93199       0.91948         0.92547
     tiny-albert       22.4                 5.98          0.90579       0.89501         0.90002
     xlnet            446.6               118.00          0.93303       0.93222         0.93236
     alxlnet           46.8                13.30          0.92732       0.93046         0.92819

Make sure you check the accuracy chart here first before selecting a model: https://malaya.readthedocs.io/en/latest/Accuracy.html#pos-recognition. You might want to use Tiny-Albert; it is very small (22.4MB), but its accuracy is still top notch.

[4]: string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

9.45.3 Load Transformer model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer POS Tagging model, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.tag.transformer function
    """

[5]: model = malaya.pos.transformer(model='albert')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:loading sentence piece model
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.


9.45.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[6]: quantized_model = malaya.pos.transformer(model='albert', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model

Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : List[Tuple[str, str]]
    """

[7]: model.predict(string)
[7]: [('KUALA', 'PROPN'), ('LUMPUR:', 'PROPN'), ('Sempena', 'ADP'), ('sambutan', 'NOUN'),
     ('Aidilfitri', 'NOUN'), ('minggu', 'NOUN'), ('depan,', 'ADJ'), ('Perdana', 'PROPN'),
     ('Menteri', 'PROPN'), ('Tun', 'PROPN'), ('Dr', 'PROPN'), ('Mahathir', 'PROPN'),
     ('Mohamad', 'PROPN'), ('dan', 'CCONJ'), ('Menteri', 'PROPN'), ('Pengangkutan', 'PROPN'),
     ('Anthony', 'PROPN'), ('Loke', 'PROPN'), ('Siew', 'PROPN'), ('Fook', 'PROPN'),
     ('menitipkan', 'VERB'), ('pesanan', 'NOUN'), ('khas', 'ADJ'), ('kepada', 'ADP'),
     ('orang', 'NOUN'), ('ramai', 'ADJ'), ('yang', 'PRON'), ('mahu', 'ADV'),
     ('pulang', 'VERB'), ('ke', 'ADP'), ('kampung', 'NOUN'), ('halaman', 'NOUN'),
     ('masing-masing.', 'DET'), ('Dalam', 'ADP'), ('video', 'NOUN'), ('pendek', 'ADJ'),
     ('terbitan', 'NOUN'), ('Jabatan', 'PROPN'), ('Keselamatan', 'PROPN'), ('Jalan', 'PROPN'),
     ('Raya', 'PROPN'), ('(JKJR)', 'PUNCT'), ('itu,', 'DET'), ('Dr', 'PROPN'),
     ('Mahathir', 'PROPN'), ('menasihati', 'VERB'), ('mereka', 'PRON'), ('supaya', 'SCONJ'),
     ('berhenti', 'VERB'), ('berehat', 'VERB'), ('dan', 'CCONJ'), ('tidur', 'VERB'),
     ('sebentar', 'NOUN'), ('sekiranya', 'SCONJ'), ('mengantuk', 'ADJ'), ('ketika', 'SCONJ'),
     ('memandu.', 'VERB')]

[8]: quantized_model.predict(string)
[8]: [('KUALA', 'PROPN'), ('LUMPUR:', 'PROPN'), ('Sempena', 'ADP'), ('sambutan', 'NOUN'),
     ('Aidilfitri', 'NOUN'), ('minggu', 'NOUN'), ('depan,', 'ADJ'), ('Perdana', 'PROPN'),
     ('Menteri', 'PROPN'), ('Tun', 'PROPN'), ('Dr', 'PROPN'), ('Mahathir', 'PROPN'),
     ('Mohamad', 'PROPN'), ('dan', 'CCONJ'), ('Menteri', 'PROPN'), ('Pengangkutan', 'PROPN'),
     ('Anthony', 'PROPN'), ('Loke', 'PROPN'), ('Siew', 'PROPN'), ('Fook', 'PROPN'),
     ('menitipkan', 'VERB'), ('pesanan', 'NOUN'), ('khas', 'ADJ'), ('kepada', 'ADP'),
     ('orang', 'NOUN'), ('ramai', 'ADJ'), ('yang', 'PRON'), ('mahu', 'ADV'),
     ('pulang', 'VERB'), ('ke', 'ADP'), ('kampung', 'NOUN'), ('halaman', 'NOUN'),
     ('masing-masing.', 'DET'), ('Dalam', 'ADP'), ('video', 'NOUN'), ('pendek', 'ADJ'),
     ('terbitan', 'NOUN'), ('Jabatan', 'PROPN'), ('Keselamatan', 'PROPN'), ('Jalan', 'PROPN'),
     ('Raya', 'PROPN'), ('(JKJR)', 'PUNCT'), ('itu,', 'DET'), ('Dr', 'PROPN'),
     ('Mahathir', 'PROPN'), ('menasihati', 'VERB'), ('mereka', 'PRON'), ('supaya', 'SCONJ'),
     ('berhenti', 'VERB'), ('berehat', 'VERB'), ('dan', 'CCONJ'), ('tidur', 'VERB'),
     ('sebentar', 'NOUN'), ('sekiranya', 'SCONJ'), ('mengantuk', 'ADJ'), ('ketika', 'SCONJ'),
     ('memandu.', 'VERB')]
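Because predict returns plain (word, tag) tuples, post-processing is ordinary Python. A small sketch (not a cell from the original notebook) that keeps only the proper nouns:

# filter tokens by tag; 'PROPN' comes from the tag set in describe() above
propn = [word for word, tag in model.predict(string) if tag == 'PROPN']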

Group similar tags

def analyze(self, string: str):
    """
    Analyze a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : {'words': List[str],
              'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
    """

[9]: model.analyze(string)
[9]: {'words': ['KUALA', 'LUMPUR:', 'Sempena', 'sambutan', 'Aidilfitri', 'minggu', 'depan,',
               'Perdana', 'Menteri', 'Tun', 'Dr', 'Mahathir', 'Mohamad', 'dan', 'Menteri',
               'Pengangkutan', 'Anthony', 'Loke', 'Siew', 'Fook', 'menitipkan', 'pesanan',
               'khas', 'kepada', 'orang', 'ramai', 'yang', 'mahu', 'pulang', 'ke', 'kampung',
               'halaman', 'masing-masing.', 'Dalam', 'video', 'pendek', 'terbitan', 'Jabatan',
               'Keselamatan', 'Jalan', 'Raya', '(JKJR)', 'itu,', 'Dr', 'Mahathir',
               'menasihati', 'mereka', 'supaya', 'berhenti', 'berehat', 'dan', 'tidur',
               'sebentar', 'sekiranya', 'mengantuk', 'ketika', 'memandu.'],
     'tags': [{'text': 'KUALA LUMPUR:', 'type': 'PROPN', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1},
              {'text': 'Sempena', 'type': 'ADP', 'score': 1.0, 'beginOffset': 2, 'endOffset': 2},
              {'text': 'sambutan Aidilfitri minggu', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 3, 'endOffset': 5},
              {'text': 'depan,', 'type': 'ADJ', 'score': 1.0, 'beginOffset': 6, 'endOffset': 6},
              {'text': 'Perdana Menteri Tun Dr Mahathir Mohamad', 'type': 'PROPN', 'score': 1.0, 'beginOffset': 7, 'endOffset': 12},
              {'text': 'dan', 'type': 'CCONJ', 'score': 1.0, 'beginOffset': 13, 'endOffset': 13},
              {'text': 'Menteri Pengangkutan Anthony Loke Siew Fook', 'type': 'PROPN', 'score': 1.0, 'beginOffset': 14, 'endOffset': 19},
              {'text': 'menitipkan', 'type': 'VERB', 'score': 1.0, 'beginOffset': 20, 'endOffset': 20},
              {'text': 'pesanan', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 21, 'endOffset': 21},
              {'text': 'khas', 'type': 'ADJ', 'score': 1.0, 'beginOffset': 22, 'endOffset': 22},
              {'text': 'kepada', 'type': 'ADP', 'score': 1.0, 'beginOffset': 23, 'endOffset': 23},
              {'text': 'orang', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 24, 'endOffset': 24},
              {'text': 'ramai', 'type': 'ADJ', 'score': 1.0, 'beginOffset': 25, 'endOffset': 25},
              {'text': 'yang', 'type': 'PRON', 'score': 1.0, 'beginOffset': 26, 'endOffset': 26},
              {'text': 'mahu', 'type': 'ADV', 'score': 1.0, 'beginOffset': 27, 'endOffset': 27},
              {'text': 'pulang', 'type': 'VERB', 'score': 1.0, 'beginOffset': 28, 'endOffset': 28},
              {'text': 'ke', 'type': 'ADP', 'score': 1.0, 'beginOffset': 29, 'endOffset': 29},
              {'text': 'kampung halaman', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 30, 'endOffset': 31},
              {'text': 'masing-masing.', 'type': 'DET', 'score': 1.0, 'beginOffset': 32, 'endOffset': 32},
              {'text': 'Dalam', 'type': 'ADP', 'score': 1.0, 'beginOffset': 33, 'endOffset': 33},
              {'text': 'video', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 34, 'endOffset': 34},
              {'text': 'pendek', 'type': 'ADJ', 'score': 1.0, 'beginOffset': 35, 'endOffset': 35},
              {'text': 'terbitan', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 36, 'endOffset': 36},
              {'text': 'Jabatan Keselamatan Jalan Raya', 'type': 'PROPN', 'score': 1.0, 'beginOffset': 37, 'endOffset': 40},
              {'text': '(JKJR)', 'type': 'PUNCT', 'score': 1.0, 'beginOffset': 41, 'endOffset': 41},
              {'text': 'itu,', 'type': 'DET', 'score': 1.0, 'beginOffset': 42, 'endOffset': 42},
              {'text': 'Dr Mahathir', 'type': 'PROPN', 'score': 1.0, 'beginOffset': 43, 'endOffset': 44},
              {'text': 'menasihati', 'type': 'VERB', 'score': 1.0, 'beginOffset': 45, 'endOffset': 45},
              {'text': 'mereka', 'type': 'PRON', 'score': 1.0, 'beginOffset': 46, 'endOffset': 46},
              {'text': 'supaya', 'type': 'SCONJ', 'score': 1.0, 'beginOffset': 47, 'endOffset': 47},
              {'text': 'berhenti berehat', 'type': 'VERB', 'score': 1.0, 'beginOffset': 48, 'endOffset': 49},
              {'text': 'dan', 'type': 'CCONJ', 'score': 1.0, 'beginOffset': 50, 'endOffset': 50},
              {'text': 'tidur', 'type': 'VERB', 'score': 1.0, 'beginOffset': 51, 'endOffset': 51},
              {'text': 'sebentar', 'type': 'NOUN', 'score': 1.0, 'beginOffset': 52, 'endOffset': 52},
              {'text': 'sekiranya', 'type': 'SCONJ', 'score': 1.0, 'beginOffset': 53, 'endOffset': 53},
              {'text': 'mengantuk', 'type': 'ADJ', 'score': 1.0, 'beginOffset': 54, 'endOffset': 54},
              {'text': 'ketika', 'type': 'SCONJ', 'score': 1.0, 'beginOffset': 55, 'endOffset': 55}]}
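The beginOffset / endOffset fields index into the words list, so each grouped chunk can be reconstructed by hand. A minimal sketch, assuming only the output shape shown above:

result = model.analyze(string)
# join words[beginOffset:endOffset + 1] to recover each chunk,
# e.g. ('Perdana Menteri Tun Dr Mahathir Mohamad', 'PROPN')
chunks = [
    (' '.join(result['words'][t['beginOffset']:t['endOffset'] + 1]), t['type'])
    for t in result['tags']
]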

9.45.5 Vectorize

Let's say you want to visualize word-level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, string: str):
    """
    Vectorize a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : np.array
    """

[10]: strings = [string,
                 'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',
                 'contact Husein at [email protected]',
                 'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek']

[11]: r = [quantized_model.vectorize(string) for string in strings]

[12]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[13]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(y)
      tsne.shape
[13]: (108, 2)


[14]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

Pretty good, the model is able to cluster similar parts of speech together.


9.45.6 Voting stack model

[16]: alxlnet = malaya.pos.transformer(model='alxlnet')
      malaya.stack.voting_stack([model, alxlnet, alxlnet], string)
[16]: [('KUALA', 'PROPN'), ('LUMPUR:', 'PROPN'), ('Sempena', 'ADP'), ('sambutan', 'NOUN'),
      ('Aidilfitri', 'PROPN'), ('minggu', 'NOUN'), ('depan,', 'ADJ'), ('Perdana', 'PROPN'),
      ('Menteri', 'PROPN'), ('Tun', 'PROPN'), ('Dr', 'PROPN'), ('Mahathir', 'PROPN'),
      ('Mohamad', 'PROPN'), ('dan', 'CCONJ'), ('Menteri', 'PROPN'), ('Pengangkutan', 'PROPN'),
      ('Anthony', 'PROPN'), ('Loke', 'PROPN'), ('Siew', 'PROPN'), ('Fook', 'PROPN'),
      ('menitipkan', 'VERB'), ('pesanan', 'NOUN'), ('khas', 'ADJ'), ('kepada', 'ADP'),
      ('orang', 'NOUN'), ('ramai', 'ADJ'), ('yang', 'PRON'), ('mahu', 'ADV'),
      ('pulang', 'VERB'), ('ke', 'ADP'), ('kampung', 'NOUN'), ('halaman', 'NOUN'),
      ('masing-masing.', 'ADV'), ('Dalam', 'ADP'), ('video', 'NOUN'), ('pendek', 'ADJ'),
      ('terbitan', 'NOUN'), ('Jabatan', 'NOUN'), ('Keselamatan', 'PROPN'), ('Jalan', 'PROPN'),
      ('Raya', 'PROPN'), ('(JKJR)', 'PUNCT'), ('itu,', 'DET'), ('Dr', 'PROPN'),
      ('Mahathir', 'PROPN'), ('menasihati', 'VERB'), ('mereka', 'PRON'), ('supaya', 'SCONJ'),
      ('berhenti', 'VERB'), ('berehat', 'VERB'), ('dan', 'CCONJ'), ('tidur', 'VERB'),
      ('sebentar', 'ADV'), ('sekiranya', 'SCONJ'), ('mengantuk', 'ADJ'), ('ketika', 'SCONJ'),
      ('memandu.', 'VERB')]
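Conceptually, voting_stack takes the majority tag across models for each token; passing alxlnet twice gives it two votes. A rough sketch of the idea (the real implementation lives in malaya.stack and may differ in detail):

from collections import Counter

def majority_vote(per_model_predictions):
    # per_model_predictions: one List[Tuple[word, tag]] per model, same tokenization
    words = [word for word, _ in per_model_predictions[0]]
    tag_columns = zip(*[[tag for _, tag in pred] for pred in per_model_predictions])
    return [
        (word, Counter(tags).most_common(1)[0][0])
        for word, tags in zip(words, tag_columns)
    ]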


9.46 Dependency Parsing

This tutorial is available as an IPython notebook at Malaya/example/dependency.

This module is only trained on standard language structure, so it is not safe to use it for local (colloquial) language structure.

[1]: %%time
     import malaya
CPU times: user 5.15 s, sys: 925 ms, total: 6.07 s
Wall time: 6.8 s

9.46.1 Describe supported dependencies

[2]: malaya.dependency.describe()
INFO:root:you can read more from https://universaldependencies.org/treebanks/id_pud/index.html
[2]:              Tag                Description
     0            acl   clausal modifier of noun
     1          advcl  adverbial clause modifier
     2         advmod         adverbial modifier
     3           amod        adjectival modifier
     4          appos      appositional modifier
     5            aux                  auxiliary
     6           case               case marking
     7          ccomp         clausal complement
     8       compound                   compound
     9  compound:plur            plural compound
     10          conj                   conjunct
     11           cop                     copula
     12         csubj            clausal subject
     13           dep                  dependent
     14           det                 determiner
     15         fixed      multi-word expression
     16          flat                       name
     17          iobj            indirect object
     18          mark                     marker
     19          nmod           nominal modifier
     20         nsubj            nominal subject
     21           obj              direct object
     22     parataxis                  parataxis
     23          root                       root
     24         xcomp    open clausal complement

9.46.2 List available transformer Dependency models

def available_transformer(version: str = 'v2'):
    """
    List available transformer dependency parsing models.

    Parameters
    ----------
    version : str, optional (default='v2')
        Version supported. Allowed values:

        * ``'v1'`` - version 1, maintained for knowledge graph purposes.
        * ``'v2'`` - trained on a bigger dataset, better version.
    """

[5]: malaya.dependency.available_transformer()
INFO:root:tested on 20% test set.
[5]:              Size (MB)  Quantized Size (MB)  Arc Accuracy  Types Accuracy  Root Accuracy
     bert             455.0               114.00      0.820450         0.79970        0.98936
     tiny-bert         69.7                17.50      0.795252         0.72470        0.98939
     albert            60.8                15.30      0.821895         0.79752        1.00000
     tiny-albert       33.4                 8.51      0.786500         0.75870        1.00000
     xlnet            480.2               121.00      0.848110         0.82741        0.92101
     alxlnet           61.2                16.40      0.849290         0.82810        0.92099
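The older v1 models, kept for the knowledge-graph pipeline, can be listed the same way; a small sketch based on the docstring above (output omitted here):

malaya.dependency.available_transformer(version='v1')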

9.46.3 Load xlnet dependency model

def transformer(version: str = 'v2', model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer Dependency Parsing model, transfer learning Transformer + biaffine attention.

    Parameters
    ----------
    version : str, optional (default='v2')
        Version supported. Allowed values:

        * ``'v1'`` - version 1, maintained for knowledge graph purposes.
        * ``'v2'`` - trained on a bigger dataset, better version.

    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.DependencyBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.DependencyXLNET`.
    """

[4]: model = malaya.dependency.transformer(model='albert')
INFO:root:running dependency-v2/albert using device /device:CPU:0

9.46.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[5]: quantized_model = malaya.dependency.transformer(model='albert', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running dependency-v2/albert-quantized using device /device:CPU:0

9.46.5 Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : Tuple
    """

[6]: string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

[7]: d_object, tagging, indexing = model.predict(string)
     d_object.to_graphvis()
[7]: (graphviz render not captured in this export)

[8]: d_object, tagging, indexing = quantized_model.predict(string)
     d_object.to_graphvis()
[8]: (graphviz render not captured in this export)
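Besides the graph object, predict also returns the raw tagging (dependency type per word) and indexing (head position per word), which is exactly what malaya.dependency.dependency_graph consumes later in this tutorial. A quick sketch (not a cell from the original notebook) to peek at them:

print(tagging[:5])
print(indexing[:5])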

9.46.6 Voting stack model

[10]: alxlnet = malaya.dependency.transformer(model='alxlnet')
      tagging, indexing = malaya.stack.voting_stack([model, model, alxlnet], string)
      malaya.dependency.dependency_graph(tagging, indexing).to_graphvis()
INFO:root:running dependency-v2/alxlnet using device /device:CPU:0
[10]: (graphviz render not captured in this export)

9.46.7 Harder example

[13]: # https://www.astroawani.com/berita-malaysia/terbaik-tun-kita-geng-najib-razak-puji-tun-m-297884

      s = """
      KUALA LUMPUR: Dalam hal politik, jarang sekali untuk melihat dua figura ini - bekas Perdana Menteri, Datuk Seri Najib Razak dan Tun Dr Mahathir Mohamad mempunyai 'pandangan yang sama' atau sekapal. Namun, situasi itu berbeza apabila melibatkan isu ketidakpatuhan terhadap prosedur operasi standard (SOP). Najib, yang juga Ahli Parlimen Pekan memuji sikap Ahli Parlimen Langkawi itu yang mengaku bersalah selepas melanggar SOP kerana tidak mengambil suhu badan ketika masuk ke sebuah surau di Langkawi pada Sabtu lalu.
      """

[14]: d_object, tagging, indexing = model.predict(s)
      d_object.to_graphvis()
[14]: (graphviz render not captured in this export)

[15]: tagging, indexing = malaya.stack.voting_stack([model, model, alxlnet], s)
      malaya.dependency.dependency_graph(tagging, indexing).to_graphvis()
[15]: (graphviz render not captured in this export)

9.46.8 Dependency graph object

To initiate a dependency graph from dependency models, you need to call malaya.dependency.dependency_graph.

[16]: graph = malaya.dependency.dependency_graph(tagging, indexing)
      graph
[16]: (object repr not captured in this export)

Generate graphvis

[17]: graph.to_graphvis()
[17]: (graphviz render not captured in this export)

Get nodes

[17]: graph.nodes
[17]: defaultdict(

˓→.()>, {0: {'address': 0, 'word': None, 'lemma': None, 'ctag': 'TOP', 'tag': 'TOP', 'feats': None, 'head': None, 'deps': defaultdict(list, {'root': [11]}), 'rel': None}, 1: {'address': 1, 'word': 'KUALA', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, {'flat': [2], 'obl': [5], 'punct': [7]}), 'rel': 'nsubj'}, 11: {'address': 11, 'word': 'melihat', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 0, 'deps': defaultdict(list, {'nsubj': [1], 'advmod': [8, 9], 'case': [10], 'advcl': [29], (continues on next page)


(continued from previous page) 'dep': [42]}), 'rel': 'root'}, 2: {'address': 2, 'word': 'LUMPUR', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 1, 'deps': defaultdict(list, {}), 'rel': 'flat'}, 3: {'address': 3, 'word': ':', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 5, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 5: {'address': 5, 'word': 'hal', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 1, 'deps': defaultdict(list, {'punct': [3], 'case': [4], 'compound': [6]}), 'rel': 'obl'}, 4: {'address': 4, 'word': 'Dalam', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 5, 'deps': defaultdict(list, {}), 'rel': 'case'}, 6: {'address': 6, 'word': 'politik', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 5, 'deps': defaultdict(list, {}), 'rel': 'compound'}, 7: {'address': 7, 'word': ',', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 1, 'deps': defaultdict(list, {}), 'rel': 'punct'}, (continues on next page)


(continued from previous page) 8: {'address': 8, 'word': 'jarang', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, {}), 'rel': 'advmod'}, 9: {'address': 9, 'word': 'sekali', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, {}), 'rel': 'advmod'}, 10: {'address': 10, 'word': 'untuk', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, {}), 'rel': 'case'}, 12: {'address': 12, 'word': 'dua', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 13, 'deps': defaultdict(list, {}), 'rel': 'nummod'}, 13: {'address': 13, 'word': 'figura', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 29, 'deps': defaultdict(list, {'nummod': [12], 'punct': [15], 'compound:plur': [16], 'flat': [17]}), 'rel': 'obj'}, 29: {'address': 29, 'word': 'mempunyai', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, (continues on next page)


(continued from previous page) {'obj': [13, 31], 'punct': [37], 'mark': [38]}), 'rel': 'advcl'}, 14: {'address': 14, 'word': 'ini', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 17, 'deps': defaultdict(list, {}), 'rel': 'det'}, 17: {'address': 17, 'word': 'Perdana', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 13, 'deps': defaultdict(list, {'det': [14], 'flat': [18], 'punct': [19], 'appos': [20], 'conj': [25]}), 'rel': 'flat'}, 15: {'address': 15, 'word': '-', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 13, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 16: {'address': 16, 'word': 'bekas', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 13, 'deps': defaultdict(list, {}), 'rel': 'compound:plur'}, 18: {'address': 18, 'word': 'Menteri', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 17, 'deps': defaultdict(list, {}), 'rel': 'flat'}, 19: {'address': 19, 'word': ',', 'lemma': '_', 'ctag': '_', 'tag': '_', (continues on next page)


(continued from previous page) 'feats': '_', 'head': 17, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 20: {'address': 20, 'word': 'Datuk', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 17, 'deps': defaultdict(list, {'flat': [21]}), 'rel': 'appos'}, 21: {'address': 21, 'word': 'Seri', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 20, 'deps': defaultdict(list, {'flat': [22]}), 'rel': 'flat'}, 22: {'address': 22, 'word': 'Najib', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 21, 'deps': defaultdict(list, {'flat': [23]}), 'rel': 'flat'}, 23: {'address': 23, 'word': 'Razak', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 22, 'deps': defaultdict(list, {}), 'rel': 'flat'}, 24: {'address': 24, 'word': 'dan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 25, 'deps': defaultdict(list, {}), 'rel': 'cc'}, 25: {'address': 25, 'word': 'Tun', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 17, 'deps': defaultdict(list, {'cc': [24], 'flat': [26]}), (continues on next page)


(continued from previous page) 'rel': 'conj'}, 26: {'address': 26, 'word': 'Dr', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 25, 'deps': defaultdict(list, {'flat': [27]}), 'rel': 'flat'}, 27: {'address': 27, 'word': 'Mahathir', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 26, 'deps': defaultdict(list, {'flat': [28]}), 'rel': 'flat'}, 28: {'address': 28, 'word': 'Mohamad', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 27, 'deps': defaultdict(list, {}), 'rel': 'flat'}, 30: {'address': 30, 'word': "'", 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 31, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 31: {'address': 31, 'word': 'pandangan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 29, 'deps': defaultdict(list, {'punct': [30], 'amod': [33]}), 'rel': 'obj'}, 32: {'address': 32, 'word': 'yang', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 36, 'deps': defaultdict(list, {}), 'rel': 'nsubj'}, 36: {'address': 36, 'word': 'sekapal', (continues on next page)


(continued from previous page) 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 33, 'deps': defaultdict(list, {'nsubj': [32], 'punct': [34], 'cc': [35]}), 'rel': 'conj'}, 33: {'address': 33, 'word': 'sama', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 31, 'deps': defaultdict(list, {'conj': [36]}), 'rel': 'amod'}, 34: {'address': 34, 'word': "'", 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 36, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 35: {'address': 35, 'word': 'atau', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 36, 'deps': defaultdict(list, {}), 'rel': 'cc'}, 37: {'address': 37, 'word': '.', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 29, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 38: {'address': 38, 'word': 'Namun', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 29, 'deps': defaultdict(list, {}), 'rel': 'mark'}, 39: {'address': 39, 'word': ',', 'lemma': '_', 'ctag': '_', (continues on next page)


(continued from previous page) 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 42: {'address': 42, 'word': 'berbeza', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 11, 'deps': defaultdict(list, {'punct': [39, 54, 89], 'nsubj': [40], 'advcl': [44], 'dep': [55]}), 'rel': 'dep'}, 40: {'address': 40, 'word': 'situasi', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {'det': [41]}), 'rel': 'nsubj'}, 41: {'address': 41, 'word': 'itu', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 40, 'deps': defaultdict(list, {}), 'rel': 'det'}, 43: {'address': 43, 'word': 'apabila', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 44, 'deps': defaultdict(list, {}), 'rel': 'mark'}, 44: {'address': 44, 'word': 'melibatkan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {'mark': [43], 'obj': [45]}), 'rel': 'advcl'}, 45: {'address': 45, 'word': 'isu', 'lemma': '_', (continues on next page)


(continued from previous page) 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 44, 'deps': defaultdict(list, {'compound': [46], 'nmod': [48]}), 'rel': 'obj'}, 46: {'address': 46, 'word': 'ketidakpatuhan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 45, 'deps': defaultdict(list, {}), 'rel': 'compound'}, 47: {'address': 47, 'word': 'terhadap', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 48, 'deps': defaultdict(list, {}), 'rel': 'case'}, 48: {'address': 48, 'word': 'prosedur', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 45, 'deps': defaultdict(list, {'case': [47], 'compound': [49], 'amod': [50], 'appos': [52]}), 'rel': 'nmod'}, 49: {'address': 49, 'word': 'operasi', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 48, 'deps': defaultdict(list, {}), 'rel': 'compound'}, 50: {'address': 50, 'word': 'standard', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 48, 'deps': defaultdict(list, {}), 'rel': 'amod'}, 51: {'address': 51, 'word': '(', (continues on next page)


(continued from previous page) 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 52, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 52: {'address': 52, 'word': 'SOP', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 48, 'deps': defaultdict(list, {'punct': [51, 53]}), 'rel': 'appos'}, 53: {'address': 53, 'word': ')', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 52, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 54: {'address': 54, 'word': '.', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 55: {'address': 55, 'word': 'Najib', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {'punct': [56], 'nsubj': [59], 'acl': [62]}), 'rel': 'dep'}, 56: {'address': 56, 'word': ',', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 55, 'deps': defaultdict(list, {}), 'rel': 'punct'}, 57: {'address': 57, 'word': 'yang', 'lemma': '_', 'ctag': '_', (continues on next page)


(continued from previous page) 'tag': '_', 'feats': '_', 'head': 59, 'deps': defaultdict(list, {}), 'rel': 'nsubj'}, 59: {'address': 59, 'word': 'Ahli', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 55, 'deps': defaultdict(list, {'nsubj': [57], 'advmod': [58], 'flat': [60]}), 'rel': 'nsubj'}, 58: {'address': 58, 'word': 'juga', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 59, 'deps': defaultdict(list, {}), 'rel': 'advmod'}, 60: {'address': 60, 'word': 'Parlimen', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 59, 'deps': defaultdict(list, {'flat': [61]}), 'rel': 'flat'}, 61: {'address': 61, 'word': 'Pekan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 60, 'deps': defaultdict(list, {}), 'rel': 'flat'}, 62: {'address': 62, 'word': 'memuji', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 55, 'deps': defaultdict(list, {'obj': [63]}), 'rel': 'acl'}, 63: {'address': 63, 'word': 'sikap', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', (continues on next page)


(continued from previous page) 'head': 62, 'deps': defaultdict(list, {'flat': [64], 'acl': [69]}), 'rel': 'obj'}, 64: {'address': 64, 'word': 'Ahli', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 63, 'deps': defaultdict(list, {'flat': [65]}), 'rel': 'flat'}, 65: {'address': 65, 'word': 'Parlimen', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 64, 'deps': defaultdict(list, {'flat': [66]}), 'rel': 'flat'}, 66: {'address': 66, 'word': 'Langkawi', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 65, 'deps': defaultdict(list, {'det': [67]}), 'rel': 'flat'}, 67: {'address': 67, 'word': 'itu', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 66, 'deps': defaultdict(list, {}), 'rel': 'det'}, 68: {'address': 68, 'word': 'yang', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 69, 'deps': defaultdict(list, {}), 'rel': 'nsubj'}, 69: {'address': 69, 'word': 'mengaku', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 63, 'deps': defaultdict(list, {'nsubj': [68], 'xcomp': [70]}), 'rel': 'acl'}, (continues on next page)


(continued from previous page) 70: {'address': 70, 'word': 'bersalah', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 69, 'deps': defaultdict(list, {'xcomp': [72]}), 'rel': 'xcomp'}, 71: {'address': 71, 'word': 'selepas', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 72, 'deps': defaultdict(list, {}), 'rel': 'case'}, 72: {'address': 72, 'word': 'melanggar', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 70, 'deps': defaultdict(list, {'case': [71], 'obj': [73], 'advcl': [76]}), 'rel': 'xcomp'}, 73: {'address': 73, 'word': 'SOP', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 72, 'deps': defaultdict(list, {}), 'rel': 'obj'}, 74: {'address': 74, 'word': 'kerana', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 76, 'deps': defaultdict(list, {}), 'rel': 'mark'}, 76: {'address': 76, 'word': 'mengambil', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 72, 'deps': defaultdict(list, {'mark': [74], 'advmod': [75], 'obj': [77], (continues on next page)


(continued from previous page) 'advcl': [80]}), 'rel': 'advcl'}, 75: {'address': 75, 'word': 'tidak', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 76, 'deps': defaultdict(list, {}), 'rel': 'advmod'}, 77: {'address': 77, 'word': 'suhu', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 76, 'deps': defaultdict(list, {'compound': [78]}), 'rel': 'obj'}, 78: {'address': 78, 'word': 'badan', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 77, 'deps': defaultdict(list, {}), 'rel': 'compound'}, 79: {'address': 79, 'word': 'ketika', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 80, 'deps': defaultdict(list, {}), 'rel': 'mark'}, 80: {'address': 80, 'word': 'masuk', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 76, 'deps': defaultdict(list, {'mark': [79], 'obl': [83, 85, 87]}), 'rel': 'advcl'}, 81: {'address': 81, 'word': 'ke', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 83, 'deps': defaultdict(list, {}), 'rel': 'case'}, 83: {'address': 83, (continues on next page)


(continued from previous page) 'word': 'surau', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 80, 'deps': defaultdict(list, {'case': [81], 'det': [82]}), 'rel': 'obl'}, 82: {'address': 82, 'word': 'sebuah', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 83, 'deps': defaultdict(list, {}), 'rel': 'det'}, 84: {'address': 84, 'word': 'di', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 85, 'deps': defaultdict(list, {}), 'rel': 'case'}, 85: {'address': 85, 'word': 'Langkawi', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 80, 'deps': defaultdict(list, {'case': [84]}), 'rel': 'obl'}, 86: {'address': 86, 'word': 'pada', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 87, 'deps': defaultdict(list, {}), 'rel': 'case'}, 87: {'address': 87, 'word': 'Sabtu', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 80, 'deps': defaultdict(list, {'case': [86], 'amod': [88]}), 'rel': 'obl'}, 88: {'address': 88, 'word': 'lalu', 'lemma': '_', 'ctag': '_', (continues on next page)


(continued from previous page) 'tag': '_', 'feats': '_', 'head': 87, 'deps': defaultdict(list, {}), 'rel': 'amod'}, 89: {'address': 89, 'word': '.', 'lemma': '_', 'ctag': '_', 'tag': '_', 'feats': '_', 'head': 42, 'deps': defaultdict(list, {}), 'rel': 'punct'}})
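Each node is keyed by its address and carries the word, its head, and its relation, so individual tokens can be looked up directly. A small sketch using the accessor this tutorial uses later:

node = graph.get_by_address(42)
node['word'], node['rel'], node['head']
# ('berbeza', 'dep', 11), per the dump above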

Flatten the graph

[20]: list(graph.triples())
[20]: [(('melihat', '_'), 'nsubj', ('KUALA', '_')), (('KUALA', '_'), 'flat', ('LUMPUR', '_')),
      (('KUALA', '_'), 'obl', ('hal', '_')), (('hal', '_'), 'punct', (':', '_')),
      (('hal', '_'), 'case', ('Dalam', '_')), (('hal', '_'), 'compound', ('politik', '_')),
      (('KUALA', '_'), 'punct', (',', '_')), (('melihat', '_'), 'advmod', ('jarang', '_')),
      (('melihat', '_'), 'advmod', ('sekali', '_')), (('melihat', '_'), 'case', ('untuk', '_')),
      (('melihat', '_'), 'advcl', ('mempunyai', '_')), (('mempunyai', '_'), 'obj', ('figura', '_')),
      (('figura', '_'), 'nummod', ('dua', '_')), (('figura', '_'), 'punct', ('-', '_')),
      (('figura', '_'), 'compound:plur', ('bekas', '_')), (('figura', '_'), 'flat', ('Perdana', '_')),
      (('Perdana', '_'), 'det', ('ini', '_')), (('Perdana', '_'), 'flat', ('Menteri', '_')),
      (('Perdana', '_'), 'punct', (',', '_')), (('Perdana', '_'), 'appos', ('Datuk', '_')),
      (('Datuk', '_'), 'flat', ('Seri', '_')), (('Seri', '_'), 'flat', ('Najib', '_')),
      (('Najib', '_'), 'flat', ('Razak', '_')), (('Perdana', '_'), 'conj', ('Tun', '_')),
      (('Tun', '_'), 'cc', ('dan', '_')), (('Tun', '_'), 'flat', ('Dr', '_')),
      (('Dr', '_'), 'flat', ('Mahathir', '_')), (('Mahathir', '_'), 'flat', ('Mohamad', '_')),
      (('mempunyai', '_'), 'obj', ('pandangan', '_')), (('pandangan', '_'), 'punct', ("'", '_')),
      (('pandangan', '_'), 'amod', ('sama', '_')), (('sama', '_'), 'conj', ('sekapal', '_')),
      (('sekapal', '_'), 'nsubj', ('yang', '_')), (('sekapal', '_'), 'punct', ("'", '_')),
      (('sekapal', '_'), 'cc', ('atau', '_')), (('mempunyai', '_'), 'punct', ('.', '_')),
      (('mempunyai', '_'), 'mark', ('Namun', '_')), (('melihat', '_'), 'dep', ('berbeza', '_')),
      (('berbeza', '_'), 'punct', (',', '_')), (('berbeza', '_'), 'nsubj', ('situasi', '_')),
      (('situasi', '_'), 'det', ('itu', '_')), (('berbeza', '_'), 'advcl', ('melibatkan', '_')),
      (('melibatkan', '_'), 'mark', ('apabila', '_')), (('melibatkan', '_'), 'obj', ('isu', '_')),
      (('isu', '_'), 'compound', ('ketidakpatuhan', '_')), (('isu', '_'), 'nmod', ('prosedur', '_')),
      (('prosedur', '_'), 'case', ('terhadap', '_')), (('prosedur', '_'), 'compound', ('operasi', '_')),
      (('prosedur', '_'), 'amod', ('standard', '_')), (('prosedur', '_'), 'appos', ('SOP', '_')),
      (('SOP', '_'), 'punct', ('(', '_')), (('SOP', '_'), 'punct', (')', '_')),
      (('berbeza', '_'), 'punct', ('.', '_')), (('berbeza', '_'), 'dep', ('Najib', '_')),
      (('Najib', '_'), 'punct', (',', '_')), (('Najib', '_'), 'nsubj', ('Ahli', '_')),
      (('Ahli', '_'), 'nsubj', ('yang', '_')), (('Ahli', '_'), 'advmod', ('juga', '_')),
      (('Ahli', '_'), 'flat', ('Parlimen', '_')), (('Parlimen', '_'), 'flat', ('Pekan', '_')),
      (('Najib', '_'), 'acl', ('memuji', '_')), (('memuji', '_'), 'obj', ('sikap', '_')),
      (('sikap', '_'), 'flat', ('Ahli', '_')), (('Ahli', '_'), 'flat', ('Parlimen', '_')),
      (('Parlimen', '_'), 'flat', ('Langkawi', '_')), (('Langkawi', '_'), 'det', ('itu', '_')),
      (('sikap', '_'), 'acl', ('mengaku', '_')), (('mengaku', '_'), 'nsubj', ('yang', '_')),
      (('mengaku', '_'), 'xcomp', ('bersalah', '_')), (('bersalah', '_'), 'xcomp', ('melanggar', '_')),
      (('melanggar', '_'), 'case', ('selepas', '_')), (('melanggar', '_'), 'obj', ('SOP', '_')),
      (('melanggar', '_'), 'advcl', ('mengambil', '_')), (('mengambil', '_'), 'mark', ('kerana', '_')),
      (('mengambil', '_'), 'advmod', ('tidak', '_')), (('mengambil', '_'), 'obj', ('suhu', '_')),
      (('suhu', '_'), 'compound', ('badan', '_')), (('mengambil', '_'), 'advcl', ('masuk', '_')),
      (('masuk', '_'), 'mark', ('ketika', '_')), (('masuk', '_'), 'obl', ('surau', '_')),
      (('surau', '_'), 'case', ('ke', '_')), (('surau', '_'), 'det', ('sebuah', '_')),
      (('masuk', '_'), 'obl', ('Langkawi', '_')), (('Langkawi', '_'), 'case', ('di', '_')),
      (('masuk', '_'), 'obl', ('Sabtu', '_')), (('Sabtu', '_'), 'case', ('pada', '_')),
      (('Sabtu', '_'), 'amod', ('lalu', '_')), (('berbeza', '_'), 'punct', ('.', '_'))]


Check whether the graph contains cycles

[21]: graph.contains_cycle()
[21]: False

Generate networkx

Make sure you have networkx installed; if not, simply run:

pip install networkx

[22]: digraph = graph.to_networkx()
      digraph
[22]: (object repr not captured in this export)

[23]: import networkx as nx
      import matplotlib.pyplot as plt

      nx.draw_networkx(digraph)
      plt.show()
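Since contains_cycle() returned False earlier, the networkx view should agree; a quick cross-check sketch using a standard networkx call:

nx.is_directed_acyclic_graph(digraph)  # expect True for a well-formed parse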

[24]: digraph.edges()
[24]: OutMultiEdgeDataView([(1, 11), (2, 1), (3, 5), (4, 5), (5, 1), (6, 5), (7, 1), (8, 11),
      (9, 11), (10, 11), (12, 13), (13, 29), (14, 17), (15, 13), (16, 13), (17, 13), (18, 17),
      (19, 17), (20, 17), (21, 20), (22, 21), (23, 22), (24, 25), (25, 17), (26, 25), (27, 26),
      (28, 27), (29, 11), (30, 31), (31, 29), (32, 36), (33, 31), (34, 36), (35, 36), (36, 33),
      (37, 29), (38, 29), (39, 42), (40, 42), (41, 40), (42, 11), (43, 44), (44, 42), (45, 44),
      (46, 45), (47, 48), (48, 45), (49, 48), (50, 48), (51, 52), (52, 48), (53, 52), (54, 42),
      (55, 42), (56, 55), (57, 59), (58, 59), (59, 55), (60, 59), (61, 60), (62, 55), (63, 62),
      (64, 63), (65, 64), (66, 65), (67, 66), (68, 69), (69, 63), (70, 69), (71, 72), (72, 70),
      (73, 72), (74, 76), (75, 76), (76, 72), (77, 76), (78, 77), (79, 80), (80, 76), (81, 83),
      (82, 83), (83, 80), (84, 85), (85, 80), (86, 87), (87, 80), (88, 87), (89, 42)])

[25]: digraph.nodes()
[25]: NodeView((1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
      23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
      45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
      67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88,
      89))


[26]: labels = {i: graph.get_by_address(i)['word'] for i in digraph.nodes()}
      labels
[26]: {1: 'KUALA', 2: 'LUMPUR', 3: ':', 4: 'Dalam', 5: 'hal', 6: 'politik', 7: ',',
      8: 'jarang', 9: 'sekali', 10: 'untuk', 11: 'melihat', 12: 'dua', 13: 'figura',
      14: 'ini', 15: '-', 16: 'bekas', 17: 'Perdana', 18: 'Menteri', 19: ',', 20: 'Datuk',
      21: 'Seri', 22: 'Najib', 23: 'Razak', 24: 'dan', 25: 'Tun', 26: 'Dr', 27: 'Mahathir',
      28: 'Mohamad', 29: 'mempunyai', 30: "'", 31: 'pandangan', 32: 'yang', 33: 'sama',
      34: "'", 35: 'atau', 36: 'sekapal', 37: '.', 38: 'Namun', 39: ',', 40: 'situasi',
      41: 'itu', 42: 'berbeza', 43: 'apabila', 44: 'melibatkan', 45: 'isu',
      46: 'ketidakpatuhan', 47: 'terhadap', 48: 'prosedur', 49: 'operasi', 50: 'standard',
      51: '(', 52: 'SOP', 53: ')', 54: '.', 55: 'Najib', 56: ',', 57: 'yang', 58: 'juga',
      59: 'Ahli', 60: 'Parlimen', 61: 'Pekan', 62: 'memuji', 63: 'sikap', 64: 'Ahli',
      65: 'Parlimen', 66: 'Langkawi', 67: 'itu', 68: 'yang', 69: 'mengaku', 70: 'bersalah',
      71: 'selepas', 72: 'melanggar', 73: 'SOP', 74: 'kerana', 75: 'tidak', 76: 'mengambil',
      77: 'suhu', 78: 'badan', 79: 'ketika', 80: 'masuk', 81: 'ke', 82: 'sebuah',
      83: 'surau', 84: 'di', 85: 'Langkawi', 86: 'pada', 87: 'Sabtu', 88: 'lalu', 89: '.'}

[27]: plt.figure(figsize=(15, 5))
      nx.draw_networkx(digraph, labels=labels)
      plt.show()


9.46.9 Vectorize

Let's say you want to visualize word-level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, string: str):
    """
    Vectorize a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : np.array
    """

[28]: r = quantized_model.vectorize(s)

[29]: x = [i[0] for i in r]
      y = [i[1] for i in r]

[30]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(y)
      tsne.shape
[30]: (89, 2)

[31]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )


9.47 Constituency Parsing

This tutorial is available as an IPython notebook at Malaya/example/constituency.

This module is only trained on standard language structure, so it is not safe to use it for local (colloquial) language structure.

[1]: %%time

     import malaya
CPU times: user 5.92 s, sys: 1.6 s, total: 7.52 s
Wall time: 11.5 s


9.47.1 What is constituency parsing

Constituency parsing assigns a sentence its own syntactic structure, defined by a certain standard. For example,

[2]: from IPython.core.display import Image, display

     display(Image('constituency.png', width=500))

Read more at the Stanford notes: https://web.stanford.edu/~jurafsky/slp3/13.pdf. The context-free grammar depends entirely on the language, so for Bahasa we follow https://github.com/famrashel/idn-treebank.


9.47.2 List available transformer Constituency models

[2]: malaya.constituency.available_transformer()
INFO:root:tested on 20% test set.
[2]:              Size (MB)  Quantized Size (MB)  Recall  Precision  FScore  CompleteMatch  TaggingAccuracy
     bert             470.0                118.0   78.96      81.78   80.35          10.37            91.59
     tiny-bert        125.0                 31.8   74.89      78.79   76.79           9.01            91.17
     albert           180.0                 45.7   77.57      80.50   79.01           5.77            90.30
     tiny-albert       56.7                 14.5   67.21      74.89   70.84           2.11            87.75
     xlnet            498.0                126.0   81.52      85.18   83.31          11.71            91.71

Make sure you check the accuracy chart here first before selecting a model: https://malaya.readthedocs.io/en/latest/models-accuracy.html#Constituency-Parsing. The best model in terms of accuracy is XLNET.

[3]: string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

9.47.3 Load xlnet constituency model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer Constituency Parsing model, transfer learning Transformer + self-attentive parsing.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya.model.tf.Constituency class
    """

[5]: model = malaya.constituency.transformer(model='xlnet')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:73: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:75: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:68: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

9.47.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[4]: quantized_model = malaya.constituency.transformer(model='xlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.


9.47.5 Parse into NLTK Tree

Make sure you have nltk installed; if not, simply run:

pip install nltk

We prefer to parse into an NLTK tree, so we can play around with children / subtrees.

def parse_nltk_tree(self, string: str):
    """
    Parse a string into an NLTK Tree; to make it useful, make sure you already installed tkinter.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : nltk.Tree object
    """

[10]: tree = model.parse_nltk_tree(string)

[11]: tree
[11]: (NLTK tree render not captured in this export)

[5]: tree = quantized_model.parse_nltk_tree(string)

[6]: tree
[6]: (NLTK tree render not captured in this export)
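Because the result is a standard nltk.Tree, the usual tree API applies. A minimal sketch (the 'NP' label is an assumption; the actual label inventory follows the idn-treebank grammar mentioned above):

# collect the phrase text of every NP subtree, if that label exists in the parse
noun_phrases = [' '.join(st.leaves()) for st in tree.subtrees(lambda t: t.label() == 'NP')]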

9.47.6 Parse into Tree

This is a simple Tree object defined at malaya.text.trees.

def parse_tree(self, string):
    """
    Parse a string into string treebank format.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : malaya.text.trees.InternalTreebankNode class
    """

[6]: tree = model.parse_tree(string)

9.47.7 Vectorize

Let's say you want to visualize word-level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, string: str):
    """
    Vectorize a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result : np.array
    """


[5]: r = quantized_model.vectorize(string)

[7]: x = [i[0] for i in r]
     y = [i[1] for i in r]

[9]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(y)
     tsne.shape
[9]: (14, 2)

[10]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )



9.48 Abstractive

This tutorial is available as an IPython notebook at Malaya/example/abstractive-summarization.

This module is only trained on standard language structure, so it is not safe to use it for local (colloquial) language structure.

This module is trained heavily on news structure.

[1]: %%time
     import malaya
     from pprint import pprint
CPU times: user 4.79 s, sys: 711 ms, total: 5.51 s
Wall time: 4.46 s

[2]: import re

     # minimum cleaning, just simply to remove newlines.
     def cleaning(string):
         string = string.replace('\n', ' ')
         string = re.sub(r'[ ]+', ' ', string).strip()
         return string

I am going to simply copy-paste some local news into this notebook. I searched for isu mahathir on Google News, link here.

link: https://www.hmetro.com.my/mutakhir/2020/05/580438/peletakan-jawatan-tun-m-ditolak-bukan-lagi-isu

Title: Peletakan jawatan Tun M ditolak, bukan lagi isu.

Body: PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu.

Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin.

Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir.

“Jadi ini agak berlawanan dengan keputusan yang kita sudah buat. Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan.

“Semua keputusan mesti dibuat melalui parti. Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti.

“Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM. Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti,” katanya kepada Harian Metro.

Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah.

Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah.

Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT.

“Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak,” katanya.

Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional.

“Kenyataan media bukanlah keputusan rasmi. Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat. Kita catat di dalam minit apa yang berlaku di dalam mesyuarat,” katanya.

[3]: string = """
     PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu.

     Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin.

     Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir.

     "Jadi ini agak berlawanan dengan keputusan yang kita sudah buat. Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan.

     "Semua keputusan mesti dibuat melalui parti. Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti.

     "Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM. Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro.

     Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah.

     Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah.

     Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT.

     "Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya.

     Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional.

     "Kenyataan media bukanlah keputusan rasmi. Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat. Kita catat di dalam minit apa yang berlaku di dalam mesyuarat," katanya.
     """
     string = cleaning(string)

Link: https://www.malaysiakini.com/news/525953

Title: Mahathir jangan hipokrit isu kes mahkamah Riza, kata Takiyuddin

Body: Menteri undang-undang Takiyuddin Hassan berkata kerajaan berharap Dr Mahathir Mohamad tidak bersikap hipokrit dengan mengatakan beliau tertanya-tanya dan tidak faham dengan keputusan mahkamah melepas tanpa membebaskan (DNAA) Riza Aziz, anak tiri bekas perdana menteri Najib Razak, dalam kes pengubahan wang haram membabitkan dana 1MDB.

Pemimpin PAS itu berkata ini kerana keputusan itu dibuat oleh peguam negara dan dilaksanakan oleh timbalan pendakwa raya yang mengendalikan kes tersebut pada akhir 2019.

“Saya merujuk kepada kenyataan Dr Mahathir tentang tindakan Mahkamah Sesyen memberikan pelepasan tanpa pembebasan (discharge not amounting to acquittal) kepada Riza Aziz baru-baru ini.

“Kerajaan berharap Dr Mahathir tidak bersikap hipokrit dengan mengatakan beliau ‘tertanya-tanya’, keliru dan tidak faham terhadap suatu keputusan yang dibuat oleh Peguam Negara dan dilaksanakan oleh Timbalan Pendakwa Raya yang mengendalikan kes ini pada akhir tahun 2019,” katanya dalam satu kenyataan hari ini.

Riza pada Khamis dilepas tanpa dibebaskan daripada lima pertuduhan pengubahan wang berjumlah AS$248 juta (RM1.08 bilion).

Dalam persetujuan yang dicapai antara pihak Riza dan pendakwaan, beliau dilepas tanpa dibebaskan atas pertuduhan itu dengan syarat memulangkan semula aset dari luar negara dengan nilai anggaran AS$107.3 juta (RM465.3 juta).

Ekoran itu, Mahathir antara lain menyuarakan kekhuatirannya berkenaan persetujuan itu dan mempersoalkan jika pihak yang didakwa atas tuduhan mencuri boleh terlepas daripada tindakan jika memulangkan semula apa yang dicurinya.

“Dia curi berbilion-bilion... Dia bagi balik kepada kerajaan. Dia kata kepada kerajaan, ‘Nah, duit yang aku curi. Sekarang ini, jangan ambil tindakan terhadap aku.’ Kita pun kata, ‘Sudah bagi balik duit okey lah’,” katanya.

Menjelaskan bahawa beliau tidak mempersoalkan keputusan mahkamah, Mahathir pada masa sama berkata ia menunjukkan undang-undang mungkin perlu dipinda.

Mengulas lanjut, Takiyuddin yang juga setiausaha agung PAS berkata kenyataan Mahathir tidak munasabah sebagai bekas perdana menteri.

“Kerajaan berharap Dr Mahathir tidak terus bertindak mengelirukan rakyat dengan mengatakan beliau ‘keliru’.

“Kerajaan PN akan terus bertindak mengikut undang-undang dan berpegang kepada prinsip kebebasan badan kehakiman dan proses perundangan yang sah,” katanya.

[4]: string2 = """
     Menteri undang-undang Takiyuddin Hassan berkata kerajaan berharap Dr Mahathir Mohamad tidak bersikap hipokrit dengan mengatakan beliau tertanya-tanya dan tidak faham dengan keputusan mahkamah melepas tanpa membebaskan (DNAA) Riza Aziz, anak tiri bekas perdana menteri Najib Razak, dalam kes pengubahan wang haram membabitkan dana 1MDB.

     Pemimpin PAS itu berkata ini kerana keputusan itu dibuat oleh peguam negara dan dilaksanakan oleh timbalan pendakwa raya yang mengendalikan kes tersebut pada akhir 2019.

     “Saya merujuk kepada kenyataan Dr Mahathir tentang tindakan Mahkamah Sesyen memberikan pelepasan tanpa pembebasan (discharge not amounting to acquittal) kepada Riza Aziz baru-baru ini.

     “Kerajaan berharap Dr Mahathir tidak bersikap hipokrit dengan mengatakan beliau ‘tertanya-tanya’, keliru dan tidak faham terhadap suatu keputusan yang dibuat oleh Peguam Negara dan dilaksanakan oleh Timbalan Pendakwa Raya yang mengendalikan kes ini pada akhir tahun 2019,” katanya dalam satu kenyataan hari ini.

     Riza pada Khamis dilepas tanpa dibebaskan daripada lima pertuduhan pengubahan wang berjumlah AS$248 juta (RM1.08 bilion).

     Dalam persetujuan yang dicapai antara pihak Riza dan pendakwaan, beliau dilepas tanpa dibebaskan atas pertuduhan itu dengan syarat memulangkan semula aset dari luar negara dengan nilai anggaran AS$107.3 juta (RM465.3 juta).

     Ekoran itu, Mahathir antara lain menyuarakan kekhuatirannya berkenaan persetujuan itu dan mempersoalkan jika pihak yang didakwa atas tuduhan mencuri boleh terlepas daripada tindakan jika memulangkan semula apa yang dicurinya.

     "Dia curi berbilion-bilion...Dia bagi balik kepada kerajaan. Dia kata kepada kerajaan, 'Nah, duit yang aku curi. Sekarang ini, jangan ambil tindakan terhadap aku.' Kita pun kata, 'Sudah bagi balik duit okey lah'," katanya.

     Menjelaskan bahawa beliau tidak mempersoalkan keputusan mahkamah, Mahathir pada masa sama berkata ia menunjukkan undang-undang mungkin perlu dipinda.

     Mengulas lanjut, Takiyuddin yang juga setiausaha agung PAS berkata kenyataan Mahathir tidak munasabah sebagai bekas perdana menteri.

     "Kerajaan berharap Dr Mahathir tidak terus bertindak mengelirukan rakyat dengan mengatakan beliau ‘keliru’.

     “Kerajaan PN akan terus bertindak mengikut undang-undang dan berpegang kepada prinsip kebebasan badan kehakiman dan proses perundangan yang sah,” katanya.
     """

     string2 = cleaning(string2)

9.48.1 List available Transformer models

[2]: malaya.summarization.abstractive.available_transformer()
INFO:root:tested on 12k CNN + DailyNews test set.
[2]:                Size (MB)  Quantized Size (MB)   ROUGE-1   ROUGE-2   ROUGE-L  Suggested length
     t5                1250.0                481.0  0.371740  0.184714  0.258272             512.0
     small-t5           355.6                195.0  0.366970  0.177330  0.254670             512.0
     tiny-t5            208.0                103.0  0.302676  0.119321  0.202918             512.0
     pegasus            894.0                225.0  0.251093  0.066789  0.155907             512.0
     small-pegasus      293.0                 74.2  0.290123  0.118788  0.192322             512.0
     bigbird            910.0                230.0  0.267346  0.072391  0.161326            1536.0
     small-bigbird      303.0                 77.3  0.246203  0.058961  0.151590            1536.0

1. t5 is the multitask Transformer from the T5 paper.
2. pegasus is a Pegasus model pretrained with the sentence-gap objective.
3. bigbird is a BigBird encoder finetuned with the sentence-gap Pegasus objective.

9.48.2 Load Transformer model

def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
    """
    Load Malaya transformer encoder-decoder model to generate a summary given a string.

    Parameters
    ----------
    model : str, optional (default='small-t5')
        Model architecture supported. Allowed values:

        * ``'t5'`` - T5 BASE parameters.
        * ``'small-t5'`` - T5 SMALL parameters.
        * ``'tiny-t5'`` - T5 TINY parameters.
        * ``'pegasus'`` - Pegasus BASE parameters.
        * ``'small-pegasus'`` - Pegasus SMALL parameters.
        * ``'bigbird'`` - BigBird + Pegasus BASE parameters.
        * ``'small-bigbird'`` - BigBird + Pegasus SMALL parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `t5` in model, will return `malaya.model.t5.Summarization`.
        * if `bigbird` in model, will return `malaya.model.bigbird.Summarization`.
        * if `pegasus` in model, will return `malaya.model.pegasus.Summarization`.
    """

For T5 models, you need to install tensorflow-text and make sure the tensorflow-text version matches your tensorflow version:

# if your TF version is 1.15.4
pip3 install tensorflow-text==1.15.1

# if your TF version is 2.5
pip3 install tensorflow-text==2.5.0
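A quick way to check which release you need before installing (a minimal sketch, not part of Malaya):

import tensorflow as tf

# Print the installed tensorflow version, then install the matching
# tensorflow-text release, e.g. '2.5.0' -> pip3 install tensorflow-text==2.5.0
print(tf.__version__)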

[6]: # t5 = malaya.summarization.abstractive.transformer(model = 't5')

[3]: model = malaya.summarization.abstractive.transformer(model='pegasus')
small_model = malaya.summarization.abstractive.transformer(model='small-pegasus')

Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from a quantized model, and it is not necessarily faster than the normal 32-bit float model; speed depends entirely on the machine.

[8]: t5 = malaya.summarization.abstractive.transformer(model='t5', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running abstractive-summarization-v2/pegasus-quantized using device /device:CPU:0

Predict using greedy decoder

def greedy_decoder(
    self,
    strings: List[str],
    postprocess: bool = False,
    **kwargs,
):
    """
    Summarize strings using greedy decoder.

    Parameters
    ----------
    strings: List[str]
    postprocess: bool, optional (default=False)
        If True, will filter sentence generated using ROUGE score and removed international news publisher.

    Returns
    -------
    result: List[str]
    """

[17]: pprint(t5.greedy_decoder([string]))
['Kenyataan media yang dibuat beberapa pimpinan parti tidak mengubah keputusan '
 'mesyuarat. Kenyataan media tidak mengubah keputusan mesyuarat. Marzuki '
 'berkata peletakan jawatan Dr Mahathir adalah sah. Beliau berkata peletakan '
 'jawatan itu sudah diputuskan oleh semua pihak.']

[20]: pprint(t5.greedy_decoder([string2]))
['"Kerajaan berharap Dr Mahathir tidak bersikap hipokrit," kata menteri '
 'undang-undang. Riza Aziz, anak tiri Najib Razak, dilepas tanpa dibebaskan '
 'daripada lima pertuduhan pengubahan wang haram. Mahathir mengatakan Riza '
 'adalah "tertanya-tanya" dan tidak faham. Mahathir juga mempersoalkan jika '
 'pihak yang didakwa mencuri boleh terlepas daripada tindakan jika memulangkan '
 'semula aset.']

[9]: pprint(model.greedy_decoder([string]))
['Bekas Pengerusi JPT , Datuk Dr . Ismail Mustafa mempersoalkan keputusan '
 'Mesyuarat Agung Bersatu . Mustafar dilantik sebagai Ketua Parti Bersatu '
 'Sabah dan akan digantikan dengan kepimpinan parti . Mesyuarat ini bukan '
 'untuk Mesyuarat Agung Bersatu tetapi untuk Mesyuarat Parti Bersatu yang akan '
 'diadakan pada 1 Februari ini .']

[7]: pprint(small_model.greedy_decoder([string]))
['Dengan adanya sebarang bayaran balik , pihak - pihak bersetuju untuk membuat '
 'perubahan . Dr Mahathir Mohamad ditolak oleh MPT . Ketua Parti Pribumi '
 'Bersatu Malaysia ( USDP ) menyatakan kedudukan Yassin tidak lagi sah .']

Predict using nucleus decoder

def nucleus_decoder(
    self,
    strings: List[str],
    top_p: float = 0.7,
    temperature: float = 0.2,
    postprocess: bool = False,
    **kwargs,
):
    """
    Summarize strings using nucleus decoder.

    Parameters
    ----------
    strings: List[str]
    top_p: float, (default=0.7)
        cumulative distribution and cut off as soon as the CDF exceeds `top_p`.
    temperature: float, (default=0.2)
        logits * -log(random.uniform) * temperature.
    postprocess: bool, optional (default=False)
        If True, will filter sentence generated using ROUGE score and removed international news publisher.

    Returns
    -------
    result: List[str]
    """

[8]: pprint(model.nucleus_decoder([string]))
['Bekas Pengerusi JPT , Datuk Dr . Ismail Mustafa mempersoalkan keputusan '
 'Mesyuarat Agung Bersatu . Mustafar dilantik sebagai Ketua Parti Bersatu '
 'Sabah dan akan digantikan dengan kepimpinan parti . Memetik laporan media '
 'yang menyebut Dr . Mahathir sebagai seorang lagi ahli bukan Melayu .']

[9]: pprint(small_model.nucleus_decoder([string]))
['Dengan adanya sebarang bayaran balik , pihak - pihak bersetuju untuk membuat '
 'semua keputusan di dalam sebuah mesyuarat . Dr Mahathir Mohamad ditolak oleh '
 'MPT dalam mesyuarat khas . Ketua Parti Pribumi Bersatu Malaysia ( USDP ) '
 'menyatakan pihak tidak pernah mengubah keputusan .']

9.49 Long Text Abstractive Summarization

This tutorial is available as an IPython notebook at Malaya/example/long-text-abstractive.

This module is only trained on standard language structure, so it is not safe to use it on local (colloquial) language structure.

This module is trained heavily on news structure.

[1]: %%time
import malaya
from pprint import pprint
CPU times: user 5.08 s, sys: 861 ms, total: 5.94 s
Wall time: 5.23 s

[2]: import re

# minimum cleaning, just simply to remove newlines.
def cleaning(string):
    string = string.replace('\n', ' ')
    string = re.sub(r'[ ]+', ' ', string).strip()
    return string
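A quick sanity check of cleaning on a hypothetical input, just to illustrate the behaviour:

# Newlines become spaces, then repeated spaces collapse into one.
print(cleaning('baris pertama\nbaris   kedua'))
# -> 'baris pertama baris kedua'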

9.49.1 List available Transformer models

[2]: malaya.summarization.abstractive.available_transformer()
INFO:root:tested on 12k CNN + DailyNews test set.
[2]:                Size (MB)  Quantized Size (MB)   ROUGE-1   ROUGE-2   ROUGE-L  Suggested length
     t5                1250.0                481.0  0.341030  0.149940  0.236550             512.0
     small-t5           355.6                195.0  0.366970  0.177330  0.254670             512.0
     tiny-t5            208.0                103.0  0.341030  0.149940  0.236550             512.0
     pegasus            894.0                225.0  0.251093  0.066789  0.155907             512.0
     small-pegasus      293.0                 74.2  0.290123  0.118788  0.192322             512.0
     bigbird            910.0                230.0  0.267346  0.072391  0.161326            1536.0
     small-bigbird      303.0                 77.3  0.246203  0.058961  0.151590            1536.0

If you look at Suggested length, bigbird accepts inputs 3x longer than the T5 and PEGASUS models. Below is an example where I combined 3 news articles into 1 string; a chunking sketch for the shorter models follows.
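For models with a 512-token suggested length, one simple workaround is to chunk a long input first. This is a minimal sketch, not a Malaya API; chunk_words and its word-based budget are assumptions, since the real limit is measured in subword tokens:

def chunk_words(string, max_words=512):
    # Naive word-based chunking; a subword tokenizer would be more accurate.
    words = string.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

Each chunk can then be summarized independently and the partial summaries concatenated, at the cost of losing cross-chunk context that bigbird would keep.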

[4]: # https://www.astroawani.com/berita-malaysia/lagipun-kalau-isu-yang-berlaku-198788-najib-boleh-ingat-dia-nak-mandikan-keris-dengan-darah-tun-m-282301
# https://www.bharian.com.my/berita/nasional/2021/02/785400/kabinet-tidak-pernah-bangkitkan-isu-konflik-kepentingan-najib
# https://www.astroawani.com/berita-malaysia/laporan-audit-1mdb-najib-gagal-gugurkan-sri-ram-daripada-pasukan-pendakwaan-283003
# https://www.astroawani.com/berita-politik/tun-m-menyesal-lantik-tommy-thomas-sebagai-peguam-negara-281460
# https://www.astroawani.com/berita-malaysia/tun-mahathir-mahu-saya-letak-jawatan-kerana-tentangan-orang-melayu-tommy-thomas-280066

string= """ Perseteruan antara dua bekas Perdana Menteri, Tun Dr Mahathir Mohamad dan Datuk Seri

˓→Najib Tun Razak belum ada penghujungnya dengan masing-masing berbalas kenyataan di

˓→media sosial. Selepas Najib menyanggah kenyataan tidak campur tangan dalam badan kehakiman negara

˓→dan mentertawakannya, Dr Mahathir membalasnya dengan meminta Ahli Parlimen Pekan

˓→itu memberi perhatian kepada kes 1Malaysia Development Berhad (1MDB). Dr Mahathir juga secara sinis berkata, jika Najib boleh mengingati isu yang berlaku

˓→pada 1987 dan 1988 - isu pemilihan UMNO dan pengharaman parti itu, Najib juga boleh

˓→mengingati peristiwa dia ingin mandikan keris dengan darah. "Saya rasa Najib tak payah campur tangan dengan tuduhan terhadap saya. Dia harus

˓→fokus kes curi duit rakyat berbilion-bilion dalam 1MDB. "Dia juga perlu bagi perhatian saman Tommy Thomas yang kait dia dengan pembunuhan ˓→Altantuya. Lagipun, kalau isu yang berlaku 1987/88, Najib boleh ingat (peristiwa)(continues on next page) ˓→yang dia'hunus' keris," kata Dr Mahathir.

9.49. Long Text Abstractive Summarization 453 malaya Documentation

(continued from previous page) Sebelum ini, Najib menyanggah dakwaan Dr Mahathir yang mendakwa pembatalan

˓→pendaftaran UMNO pada 1998 sebagai bukti tidak campur tangan dalam badan kehakiman

˓→negara. Menyokong hujahnya, Najib berkongsi apa yang berlaku pada tahun tersebut hingga

˓→menyebabkan UMNO diharamkan dan tertubuhnya UMNO baharu.

Mahkamah Tinggi di sini, hari ini, diberitahu Kabinet tidak pernah membangkitkan isu

˓→mengenai konflik kepentingan Datuk Seri Najib Razak dalam 1Malaysia Development

˓→Berhad (1MDB).

Menurut Kod Etika Bagi Anggota Pentadbiran dan Ahli Parlimen adalah menjadi amalan

˓→ahli Kabinet untuk mengisytiharkan konflik kepentingan sekiranya mempunyai

˓→pembabitan di dalam hal yang dibincangkan dalam Mesyuarat Jemaah Menteri.

Perkara itu dimaklumkan bekas Timbalan Ketua Setiausaha (Kabinet) Bahagian Kabinet

˓→Perlembagaan dan Perhubungan Jabatan Perdana Menteri (JPM), Tan Sri Mazidah Abdul

˓→Majid, dalam keterangannya pada perbicaraan kes 1MDB yang dihadapi Najib.

Kod etika itu antara lain turut menyatakan bahawa ahli Kabinet yang mempunyai

˓→kepentingan peribadi dan bercanggah dengan kepentingan kerajaan, atau membabitkan

˓→ahli keluarga, perlu meninggalkan mesyuarat dan merekodkan ketidakhadiran mereka.

Di dalam 1MDB, Najib memegang tiga jawatan iaitu Perdana Menteri, Menteri Kewangan

˓→dan Pengerusi Pengerusi Badan Penasihat 1MDB

Menjawab soalan peguam Tan Sri Muhammad Shafee Abdullah, sama ada terdapat ahli

˓→Kabinet yang membangkitkan isu bahawa Menteri Kewangan tidak seharusnya membabitkan

˓→diri dalam perbincangan itu kerana konflik kepentingan, Maziah menjawab:"Tiada."

Muhammad Shafee: Ada sesiapa yang membangkitkan perkara berhubung hal 1MBD dengan

˓→Najib?

Maziah: Tidak

Muhammad Shafee: Timbalan Perdana Menteri (ketika itu) adalah Tan Sri Muhyiddin

˓→Yassin, manakala Datuk Seri adalah bekas Menteri Kewangan II.

˓→Mereka juga tidak pernah membangkitkan hal konflik kepentingan?

Mazidah: Ya, tidak pernah.

Sementara itu, menjawab soalan Timbalan Pendakwa Raya, Ahmad Akram Gharib sama ada

˓→beliau mengetahui bahawa Najib mempunyai kepentingan peribadi dalam 1MDB, Mazidah

˓→berkata:"Tidak"

Ahmad Akram: Adakah anda mengetahui bahawa Najib secara peribadi menerima duit

˓→daripada 1MDB?

Maziah: Tidak

Ahmad Akram: Sekiranya Najib secara peribadi menerima wang daripada 1MDB adakah itu

˓→konflik kepentingan dan melanggar Kod Etika Bagi Anggota Pentadbiran dan Ahli

˓→Parlimen.

Maziah: Menurut pandangan peribadi saya, ya, namun kerana ia membabitkan menteri dan

˓→perdana menteri, saya cadangkan untuk dapatkan pandangan Peguam Negara.

Terdahulu, di awal prosiding, Maziah turut memberitahu mahkamah bahawa Najib tidak

˓→pernah menyebut nama ahli perniagaan dalam buruan, Low Taek Jho atau Jho(continues Low onpada next page)

˓→mesyuarat Kabinet sebagai individu yang membantu beliau mendapatkan sumbangan

˓→daripada kerabat diraja Arab Saudi. 454 Chapter 9. Contents: malaya Documentation

(continued from previous page)

"Sekiranya perkara itu dimaklumkan kepada Kabinet, maka, ia akan dicatat dalam minit

˓→mesyuarat," katanya.

Tambah Maziah, beliau hanya mendengar dan mengetahui mengenai nama Jho Low selepas

˓→timbul isu membabitkan 1MDB.

Najib, 68, menghadapi empat pertuduhan menggunakan kedudukannya untuk memperoleh

˓→suapan berjumlah RM2.3 bilion daripada dana 1MDB dan 21 pertuduhan pengubahan wang

˓→haram membabitkan jumlah wang yang sama.

Perbicaraan di hadapan Hakim Collin Lawrence Sequerah bersambung Isnin ini.

Datuk Seri Najib Tun Razak hari ini gagal dalam satu lagi cubaannya untuk

˓→menggugurkan bekas Hakim Mahkamah Persekutuan Datuk Seri Gopal Sri Ram daripada

˓→mengetuai pasukan pendakwaan dalam kes meminda laporan audit 1Malaysia Development

˓→Berhad (1MDB) melibatkan bekas perdana menteri itu. Ini merupakan cubaan kali ketiga Najib untuk menggugurkan Sri Ram sebagai pendakwa

˓→dalam kes jenayah berkaitan 1MDB itu. Sebelum ini, satu permohonan difailkan dalam

˓→satu lagi kes 1MDB di hadapan hakim berbeza dan cubaan kedua menerusi prosiding

˓→sivil. Ketika menolak permohonan Ahli Parlimen Pekan itu, Hakim Mahkamah Tinggi Mohamed

˓→Zaini Mazlan berkata dakwaan Najib bahawa Sri Ram terlibat dalam siasatan

˓→terhadapnya sebagai tidak ada merit. “Tidak ada bukti kukuh untuk menyokong dakwaan pemohon dan kekal sebagai hipotesis

˓→semata-mata. Isu ini telah disiasat oleh pihak pendakwaan selaku responden semasa

˓→permohonan pemohon (Najib) yang terdahulu. “Isu ini sudah dibincangkan dan diputuskan. Keputusan oleh mahkamah lain sebelum ini

˓→kekal dan tidak boleh diterbalikkan,” kata hakim. Mohamed Zaini ketika memberi alasan penolakan berkata pemohon, antara lain,

˓→menjadikan komunikasi di antara bekas Peguam Negara Tan Sri Mohamed Apandi Ali dan

˓→Sri Ram sebagai bukti berat sebelah Sri Ram terhadap pemohon dan kebimbangan

˓→pemohon berhubung perkara itu adalah tidak berasas. “Seperti juga individu lain, Sri Ram berhak mempunyai pandangan peribadi. Itu sahaja.

˓→Namun, pertimbangan berbeza akan dibuat jika beliau menunjukkan sikap berat sebelah

˓→semasa melaksanakan tugas sebagai pendakwa raya kanan. Pandangan peribadi beliau

˓→tidak boleh dianggap akan menghalang tanggungjawab beliau sebagai pendakwa raya

˓→kanan. “Tambahan pula, kejadian itu, seperti yang dikemukakan oleh responden, berlaku ketika

˓→sebelum pelantikan Sri Ram sebagai pendakwa raya kanan. Turut penting ialah pemohon

˓→tidak membuat sebarang aduan mengenai tindak-tanduk Sri Ram ketika menjalankan

˓→perbicaraan melibatkan pemohon. Ini mengukuhkan hujah responden bahawa Sri Ram

˓→bersikap terbuka semasa menjalankan tugas sebagai pendakwa raya kanan,” kata hakim. Mohamed Zaini seterusnya berkata pada akhirnya, mahkamah bertanggungjawab memastikan

˓→sesebuah perbicaraan dilaksanakan secara adil demi mendapatkan keadilan. “Mahkamah akan membantu mana-mana pihak yang dilayan secara tidak adil, jika perkara

˓→tersebut berlaku. Sehubunggan itu, permohonan pemohon ditolak,” katanya. Perbicaraan kes laporan audit 1MDB itu akan bersambung pada 22 Feb ini. Pada prosiding hari ini, Timbalan Pendakwa Raya Ahmad Akram Gharib bertindak bagi

˓→pihak pendakwaan, manakala Najib diwakili peguam Nur Syahirah Hanapiah. Najib, 67, dan bekas Ketua Pegawai Eksekutif 1MDB Arul Kanda Kandasamy, 45,

˓→dibicarakan atas tuduhan meminda laporan audit 1MDB. Ahli Parlimen Pekan itu dituduh menggunakan kedudukannya untuk mengarahkan pindaan ke

˓→atas laporan audit akhir 1MDB sebelum dibentangkan kepada Jawatankuasa Kira-Kira

˓→Wang Negara bagi mengelakkan sebarang tindakan diambil terhadapnya, sementara Arul

˓→Kanda didakwa bersubahat dengan Najib dalam membuat pindaan ke atas laporan

˓→tersebut bagi melindungi Najib daripada dikenakan tindakan. (continues on next page)

9.49. Long Text Abstractive Summarization 455 malaya Documentation

(continued from previous page) """

string= cleaning(string)

[5]: len(string.split())
[5]: 1020

9.49.2 Load Transformer model

def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
    """
    Load Malaya transformer encoder-decoder model to generate a summary given a string.

    Parameters
    ----------
    model : str, optional (default='small-t5')
        Model architecture supported. Allowed values:

        * ``'t5'`` - T5 BASE parameters.
        * ``'small-t5'`` - T5 SMALL parameters.
        * ``'tiny-t5'`` - T5 TINY parameters.
        * ``'pegasus'`` - Pegasus BASE parameters.
        * ``'small-pegasus'`` - Pegasus SMALL parameters.
        * ``'bigbird'`` - BigBird + Pegasus BASE parameters.
        * ``'small-bigbird'`` - BigBird + Pegasus SMALL parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `t5` in model, will return `malaya.model.t5.Summarization`.
        * if `bigbird` in model, will return `malaya.model.bigbird.Summarization`.
        * if `pegasus` in model, will return `malaya.model.pegasus.Summarization`.
    """

[4]: bigbird = malaya.summarization.abstractive.transformer(model='bigbird')
bigbird_small = malaya.summarization.abstractive.transformer(model='small-bigbird')
INFO:root:running abstractive-summarization-v2/bigbird using device /device:CPU:0
INFO:root:running abstractive-summarization-v2/small-bigbird using device /device:CPU:0


Predict using greedy decoder

def greedy_decoder(
    self,
    strings: List[str],
    postprocess: bool = False,
    **kwargs,
):
    """
    Summarize strings using greedy decoder.

    Parameters
    ----------
    strings: List[str]
    postprocess: bool, optional (default=False)
        If True, will filter sentence generated using ROUGE score and removed international news publisher.

    Returns
    -------
    result: List[str]
    """

[8]: pprint(t5.greedy_decoder([string], mode='ringkasan'))
['Perdana Menteri, Dr Mahathir membalas dengan meminta Najib untuk memberi '
 'perhatian kepada kes 1MDB. Mahkamah Tinggi diberitahu Kabinet tidak pernah '
 'membangkitkan isu konflik kepentingan Najib. Hakim menolak permintaan Najib '
 'untuk menggugurkan Hakim Mahkamah Persekutuan Gopal Sri Ram. Ini adalah '
 'percubaan ketiga Najib untuk menggugurkan Sri Ram sebagai pendakwa dalam kes '
 '1MDB.']

[7]: pprint(bigbird.greedy_decoder([string]))
['Bekas Perdana Menteri , Datuk Seri Najib Razak tidak akan memberi keterangan '
 'di hadapan mahkamah berhubung kes berkenaan . Seorang lagi responden '
 'menyatakan jika perbicaraan kes terhadap Najib Razak telah diadakan , beliau '
 'akan didakwa secara peribadi . Pendakwa raya menyatakan Najib Razak tidak '
 'mempunyai bukti seberat sebelah dalam kes tersebut . Bekas Perdana Menteri , '
 'Najib Razak tidak mempunyai pandangan mengenai perkara ini .']

[8]: pprint(bigbird_small.greedy_decoder([string]))
['BARU : Seri Najib berkata , beliau tidak pernah secara peribadi meminta '
 'untuk menggugurkan nama Serie - A dalam kes tersebut . Dia berkata bahawa '
 'dia tidak pernah secara peribadi menyentuh perkara itu sebelum ini . Russell '
 'Maziah berkata , beliau tidak pernah secara peribadi meminta untuk '
 'menggugurkan nama Serie A . S .']


Predict using nucleus decoder

def nucleus_decoder(
    self,
    strings: List[str],
    top_p: float = 0.7,
    temperature: float = 0.2,
    postprocess: bool = False,
    **kwargs,
):
    """
    Summarize strings using nucleus decoder.

    Parameters
    ----------
    strings: List[str]
    top_p: float, (default=0.7)
        cumulative distribution and cut off as soon as the CDF exceeds `top_p`.
    temperature: float, (default=0.2)
        logits * -log(random.uniform) * temperature.
    postprocess: bool, optional (default=False)
        If True, will filter sentence generated using ROUGE score and removed international news publisher.

    Returns
    -------
    result: List[str]
    """

[9]: pprint(bigbird.nucleus_decoder([string]))
['Bekas Perdana Menteri , Datuk Seri Najib Razak tidak akan memberi keterangan '
 'di hadapan mahkamah berhubung kes berkenaan . Seorang lagi responden '
 'menyatakan sama ada Najib Razak menyertai parti politik yang dikaitkan '
 'dengan pembunuhan tokoh - tokoh penting negara . Pendakwa raya menyatakan '
 'Peguam Negara tidak akan membuat pindaan terhadap kes yang sedang dihadapi .']

[10]: pprint(bigbird_small.nucleus_decoder([string]))
['BARU : Seri Najib berkata , beliau masih belum mendedahkan pandangan ahli '
 'Kabinet . Mahathir Mohamad awas dakwaan konflik interestsome tidak pernah '
 'dibincangkan sebelum ini . Beliau berkata , pegawai istana yang dinaikinya '
 'itu berkata , mereka tidak pernah secara peribadi merupakan perkara yang '
 'betul untuk dilakukan .']

9.50 Extractive

This tutorial is available as an IPython notebook at Malaya/example/extractive-summarization.

[1]: %%time
import malaya
from pprint import pprint
CPU times: user 4.74 s, sys: 636 ms, total: 5.37 s
Wall time: 4.35 s

9.50.1 List available skip-thought models

[2]: malaya.summarization.extractive.available_skipthought()
[2]:                   Size (MB)
     lstm                   55.4
     residual-network       99.7

• 'lstm' - LSTM skip-thought deep learning model trained on news dataset.
• 'residual-network' - CNN residual network with Bahdanau Attention skip-thought deep learning model trained on wikipedia dataset.

9.50.2 Algorithm

We use TextRank from networkx as the scoring algorithm; a sketch of the idea follows.
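Conceptually, TextRank scores each sentence (or word) by running PageRank over a similarity graph. A minimal sketch of that idea, to illustrate the algorithm rather than Malaya's internal code; textrank_scores is a name made up here:

import networkx as nx
import numpy as np

def textrank_scores(vectors):
    # Cosine-similarity matrix between sentence vectors.
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = normed @ normed.T
    # Build a weighted graph and score nodes with PageRank
    # (older networkx versions use nx.from_numpy_matrix instead).
    graph = nx.from_numpy_array(similarity)
    return nx.pagerank(graph)

Higher-scoring nodes correspond to sentences that are most similar to many other sentences, which is why they make good extractive summary candidates.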

[3]: isu_kerajaan = [
    'Kenyataan kontroversi Setiausaha Agung Barisan Nasional (BN), Datuk Seri Mohamed Nazri Aziz berhubung sekolah vernakular merupakan pandangan peribadi beliau',
    'Timbalan Presiden UMNO, Datuk Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian serta pandangan UMNO \n\nkerana parti itu menghormati serta memahami keperluan sekolah vernakular dalam negara',
    '"Saya ingin menegaskan dua perkara penting',
    'Pertama pendirian beliau tersebut adalah pandangan peribadi yang tidak mewakili pendirian dan pandangan UMNO',
    '"Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan sekolah vernakular di Malaysia',
    'UMNO berpendirian sekolah jenis ini perlu terus wujud di negara kita," katanya dalam satu kenyataan akhbar malam ini',
    'Mohamed Nazri semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan Tamil baru-baru ini disalah petik pihak media',
    'Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak orang Melayu dan bumiputera',
    'Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik termasuk hak untuk beragama serta mendapat pendidikan',
    'Menurut beliau, persefahaman dan keupayaan meraikan kepelbagaian itu menjadi kelebihan dan kekuatan UMNO dan BN selama ini',
    'Kata beliau, komitmen UMNO dan BN berhubung perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, pengiktirafan dan pemberian peruntukan yang diperlukan',
    '"Saya berharap isu ini tidak dipolitikkan secara tidak bertanggungjawab oleh mana-mana pihak terutama dengan cara yang tidak menggambarkan pendirian sebenar UMNO dan BN," katanya',
    'Beliau turut menegaskan Mohamed Nazri telah mengambil pertanggungjawaban dengan membuat penjelasan maksud sebenarnya ucapanny di Semenyih, Selangor tersebut',
]

[4]: isu_string = '\n\n\n\nDUA legenda hebat dan ‘The living legend’ ini sudah memartabatkan bidang muzik sejak lebih tiga dekad lalu. Jika Datuk Zainal Abidin, 59, dikenali sebagai penyanyi yang memperjuangkan konsep ‘world music’, Datuk Sheila Majid, 55, pula lebih dikenali dengan irama jazz dan R&B.\n\nNamun, ada satu persamaan yang mengeratkan hubungan mereka kerana sama-sama mencintai bidang muzik sejak dulu.\n\nKetika ditemui dalam sesi fotografi yang diatur di Balai Berita, baru-baru ini, Zainal berkata, dia lebih ‘senior’ daripada Sheila kerana bermula dengan kumpulan Headwind sebelum menempa nama sebagai penyanyi solo.\n\n“Saya mula berkawan rapat dengan Sheila ketika sama-sama bernaung di bawah pengurusan Roslan Aziz Productions (RAP) selepas membina karier sebagai artis solo.\n\n“Namun, selepas tidak lagi bernaung di bawah RAP, kami juga membawa haluan karier seni masing-masing selepas itu,” katanya.\n\nJusteru katanya, dia memang menanti peluang berganding dengan Sheila dalam satu konsert.\n\nPenyanyi yang popular dengan lagu Hijau dan Ikhlas Tapi Jauh itu mengakui mereka memang ada keserasian ketika bergandingan kerana membesar pada era muzik yang sama.\n\n“Kami memang meminati bidang muzik dan saling memahami antara satu sama lain. Mungkin kerana kami berdua sudah berada pada tahap di puncak karier muzik masing-masing.\n\n“Saya bersama Sheila serta Datuk Afdlin Shauki akan terbabit dalam satu segmen yang ditetapkan.\n\n“Selain persembahan solo, saya juga berduet dengan Sheila dan Afdlin dalam segmen interaktif ini. Setiap penyanyi akan menyampaikan enam hingga tujuh lagu setiap seorang sepanjang konsert yang berlangsung tiga hari ini,” katanya.\n\nBagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan bersama Zainal cuma tiada publisiti ketika itu.\n\n“Kami pernah terbabit dengan showcase dan majlis korporat sebelum ini. Selain itu, Zainal juga terbabit dengan Konsert Legenda yang membabitkan jelajah empat lokasi sebelum ini.\n\n“Sebab itu, saya sukar menolak untuk bekerjasama dengannya dalam Festival KL Jamm yang dianjurkan buat julung kali dan berkongsi pentas dalam satu konsert bertaraf antarabangsa,” katanya.\n\n\n\nFESTIVAL KL Jamm bakal menggabungkan pelbagai genre muzik seperti rock, hip hop, jazz dan pop dengan lebih 100 persembahan, 20 ‘showcase’ dan pameran.\n\nKonsert berbayar\n\n\n\nMewakili golongan anak seni, Sheila menaruh harapan semoga Festival KL Jamm akan menjadi platform buat artis yang sudah ada nama dan artis muda untuk membuat persembahan, sekali gus sama-sama memartabatkan industri muzik tempatan.\n\nMenurut Sheila, dia juga mencadangkan lebih banyak tempat diwujudkan untuk menggalakkan artis muda membuat persembahan, sekali gus menggilap bakat mereka.\n\n“Berbanding pada zaman saya dulu, artis muda sekarang tidak banyak tempat khusus untuk mereka menyanyi dan menonjolkan bakat di tempat awam.\n\n“Rata-rata hanya sekadar menyanyi di laman Instagram dan cuma dikenali menerusi satu lagu. Justeru, bagaimana mereka mahu buat showcase kalau hanya dikenali dengan satu lagu?” katanya.\n\nPada masa sama, Sheila juga merayu peminat tempatan untuk sama-sama memberi sokongan pada penganjuran festival KL Jamm sekali gus mencapai objektifnya.\n\n“Peminat perlu ubah persepsi negatif mereka dengan menganggap persembahan artis tempatan tidak bagus.\n\n“Kemasukan artis luar juga perlu dilihat dari sudut yang positif kerana kita perlu belajar bagaimana untuk menjadi bagus seperti mereka,” katanya.\n\nSementara itu, Zainal pula berharap festival itu akan mendidik orang ramai untuk menonton konsert berbayar serta memberi sokongan pada artis tempatan.\n\n“Ramai yang hanya meminati artis tempatan tetapi tidak mahu mengeluarkan sedikit wang untuk membeli tiket konsert mereka.\n\n“Sedangkan artis juga menyanyi untuk kerjaya dan ia juga punca pendapatan bagi menyara hidup,” katanya.\n\nFestival KL Jamm bakal menghimpunkan barisan artis tempatan baru dan nama besar dalam konsert iaitu Datuk Ramli Sarip, Datuk Afdlin Shauki, Zamani, Amelina, Radhi OAG, Dr Burn, Santesh, Rabbit Mac, Sheezy, kumpulan Bunkface, Ruffedge, Pot Innuendo, artis dari Kartel (Joe Flizzow, Sona One, Ila Damia, Yung Raja, Faris Jabba dan Abu Bakarxli) dan Malaysia Pasangge (artis India tempatan).\n\nManakala, artis antarabangsa pula membabitkan J Arie (Hong Kong), NCT Dream (Korea Selatan) dan DJ Sura (Korea Selatan).\n\nKL Jamm dianjurkan Music Unlimited International Sdn Bhd dan bakal menggabungkan pelbagai genre muzik seperti rock, hip hop, jazz dan pop dengan lebih 100 persembahan, 20 ‘showcase’, pameran dan perdagangan berkaitan.\n\nFestival tiga hari itu bakal berlangsung di Pusat Pameran dan Perdagangan Antarabangsa Malaysia (MITEC), Kuala Lumpur pada 26 hingga 28 April ini.\n\nMaklumat mengenai pembelian tiket dan keterangan lanjut boleh melayari www.kljamm.com.'

9.50.3 Load SKLearn Interface

Load a decomposition model and a text vectorizer from sklearn,

def sklearn(model, vectorizer):
    """
    sklearn interface for summarization.

    Parameters
    ----------
    model : object
        Should have `fit_transform` method. Commonly:

        * ``sklearn.decomposition.TruncatedSVD`` - LSA algorithm.
        * ``sklearn.decomposition.LatentDirichletAllocation`` - LDA algorithm.

    vectorizer : object
        Should have `fit_transform` method. Commonly:

        * ``sklearn.feature_extraction.text.TfidfVectorizer`` - TFIDF algorithm.
        * ``sklearn.feature_extraction.text.CountVectorizer`` - Bag-of-Word algorithm.
        * ``malaya.text.vectorizer.SkipGramCountVectorizer`` - Skip Gram Bag-of-Word algorithm.
        * ``malaya.text.vectorizer.SkipGramTfidfVectorizer`` - Skip Gram TFIDF algorithm.

    Returns
    -------
    result: malaya.model.extractive_summarization.SKLearn
    """

[5]: from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from malaya.text.vectorizer import SkipGramCountVectorizer, SkipGramTfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

stopwords = malaya.text.function.get_stopwords()

[6]: vectorizer = SkipGramCountVectorizer(
    max_df=0.95,
    min_df=1,
    ngram_range=(1, 3),
    stop_words=stopwords,
    skip=2,
)

[7]: svd = TruncatedSVD(n_components=30)

[8]: model = malaya.summarization.extractive.sklearn(svd, vectorizer)
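The same interface accepts the other combinations listed in the docstring above; for example, a sketch pairing LDA with a plain TF-IDF vectorizer (the hyperparameters here are illustrative, not tuned):

# Same Malaya interface, different decomposition + vectorizer pair.
lda = LatentDirichletAllocation(n_components=30)
tfidf = TfidfVectorizer(max_df=0.95, min_df=1, stop_words=stopwords)
model_lda = malaya.summarization.extractive.sklearn(lda, tfidf)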


Sentence level

This will predict scores for each sentence,

def sentence_level(
    self,
    corpus,
    isi_penting: str = None,
    top_k: int = 3,
    important_words: int = 10,
    **kwargs
):
    """
    Summarize list of strings / string on sentence level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    top_k: int, (default=3)
        number of summarized strings.
    important_words: int, (default=10)
        number of important words.

    Returns
    -------
    dict: {'summary', 'top-words', 'cluster-top-words', 'score'}
    """

[9]: r = model.sentence_level(isu_kerajaan)
r.keys()
[9]: dict_keys(['summary', 'top-words', 'cluster-top-words', 'score'])

[10]: pprint(r['summary'])
('"Kedua UMNO sebagai sebuah parti sangat menghormati dan memahami keperluan '
 'sekolah vernakular di Malaysia. kerana parti itu menghormati serta memahami '
 'keperluan sekolah vernakular dalam negara. Timbalan Presiden UMNO, Datuk '
 'Seri Mohamad Hasan berkata, kenyataan tersebut tidak mewakili pendirian '
 'serta pandangan UMNO .')

[11]: r['cluster-top-words']
[11]: ['tugas presiden umno',
 'nazri',
 'vernakular',
 'menghormati',
 'tugas umno',
 'pendirian pandangan',
 'sekolah']

[12]: r['score'][:20]
[12]: [('Kenyataan', 0.07025715941848468),
 ('kontroversi', 0.07025715941848468),
 ('Setiausaha', 0.07025715941848468),
 ('Agung', 0.07025715941848468),
 ('Barisan', 0.07025715941848468),
 ('Nasional', 0.07025715941848468),
 ('(BN),', 0.07025715941848468),
 ('Datuk', 0.07025715941848468),
 ('Seri', 0.07025715941848468),
 ('Mohamed', 0.07025715941848468),
 ('Nazri', 0.07025715941848468),
 ('Aziz', 0.07025715941848468),
 ('berhubung', 0.07025715941848468),
 ('sekolah', 0.07025715941848468),
 ('vernakular', 0.07025715941848468),
 ('merupakan', 0.07025715941848468),
 ('pandangan', 0.07025715941848468),
 ('peribadi', 0.07025715941848468),
 ('beliau.', 0.07025715941848468),
 ('Timbalan', 0.11720420897327886)]

[13]: r = model.sentence_level(isu_kerajaan, isi_penting='Mohamed Nazri')
pprint(r['summary'])
('Beliau turut menegaskan Mohamed Nazri telah mengambil pertanggungjawaban '
 'dengan membuat penjelasan maksud sebenarnya ucapanny di Semenyih, Selangor '
 'tersebut. Kata Nazri dalam kenyataannya itu, beliau menekankan bahawa semua '
 'pihak perlu menghormati hak orang Melayu dan bumiputera. Mohamed Nazri '
 'semalam menjelaskan, kenyataannya mengenai sekolah jenis kebangsaan Cina dan '
 'Tamil baru-baru ini disalah petik pihak media.')

Word level

This will predict scores for each word. This interface will not return a summary, just a score for each word.

def word_level(
    self,
    corpus,
    isi_penting: str = None,
    window_size: int = 10,
    important_words: int = 10,
    **kwargs
):
    """
    Summarize list of strings / string on word level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    window_size: int, (default=10)
        window size for each word.
    important_words: int, (default=10)
        number of important words.

    Returns
    -------
    dict: {'top-words', 'cluster-top-words', 'score'}
    """

[14]: r = model.word_level(isu_kerajaan, isi_penting='Mohamed Nazri')
r.keys()
[14]: dict_keys(['top-words', 'cluster-top-words', 'score'])

[15]: r['score'][:20]
[15]: [('Kenyataan', 0.16809629126117476),
 ('kontroversi', 0.2133122956188781),
 ('Setiausaha', 0.2946152484378591),
 ('Agung', 0.3335993239450466),
 ('Barisan', 0.3779873115316621),
 ('Nasional', 0.4254942849807996),
 ('(BN),', 0.4402120384348302),
 ('Datuk', 0.4402120384348302),
 ('Seri', 0.4398985388456278),
 ('Mohamed', 0.42780575539932747),
 ('Nazri', 0.42780575539932747),
 ('Aziz', 0.41245035042151057),
 ('berhubung', 0.3879585108632802),
 ('sekolah', 0.37430313847686075),
 ('vernakular', 0.3727791362231256),
 ('merupakan', 0.3683025784505585),
 ('pandangan', 0.34671436914445564),
 ('peribadi', 0.32911022738895535),
 ('beliau.', 0.32911022738895535),
 ('Timbalan', 0.31232539921078645)]
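Since word level only returns raw scores, you can rank them yourself; a small follow-up sketch in plain Python, nothing Malaya-specific:

# Sort (word, score) pairs by score to inspect the highest-ranked words.
top10 = sorted(r['score'], key=lambda pair: pair[1], reverse=True)[:10]
print(top10)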

9.50.4 Load Doc2Vec interface

Doc2Vec interface using Malaya WordVector.

def doc2vec(wordvector):
    """
    Doc2Vec interface for summarization.

    Parameters
    ----------
    wordvector : object
        malaya.wordvector.WordVector object.
        should have `get_vector_by_name` method.

    Returns
    -------
    result: malaya.model.extractive_summarization.Doc2Vec
    """

[16]: vocab_news, embedded_news = malaya.wordvector.load_news()
w2v = malaya.wordvector.load(embedded_news, vocab_news)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/wordvector.py:120: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/wordvector.py:131: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

[17]: model = malaya.summarization.extractive.doc2vec(w2v)

Sentence level

This will predict scores for each sentence,

def sentence_level(
    self,
    corpus,
    isi_penting: str = None,
    top_k: int = 3,
    aggregation = np.mean,
    soft: bool = False,
    **kwargs
):
    """
    Summarize list of strings / string on sentence level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    top_k: int, (default=3)
        number of summarized strings.
    aggregation: Callable, optional (default=numpy.mean)
        Aggregation method for Doc2Vec.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with the nearest JaroWinkler ratio.
        if False, it will return an embedding full of zeros.

    Returns
    -------
    dict: {'summary', 'score'}
    """

[18]: %%time

r = model.sentence_level(isu_kerajaan)
pprint(r['summary'])
('Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten '
 'dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik '
 'termasuk hak untuk beragama serta mendapat pendidikan. Kata Nazri dalam '
 'kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak '
 'orang Melayu dan bumiputera. Kata beliau, komitmen UMNO dan BN berhubung '
 'perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, '
 'pengiktirafan dan pemberian peruntukan yang diperlukan.')
CPU times: user 9.18 ms, sys: 1.64 ms, total: 10.8 ms
Wall time: 10.3 ms

[19]: r['score'][:20]
[19]: [('Kenyataan', 0.07027066163158477),
 ('kontroversi', 0.07027066163158477),
 ('Setiausaha', 0.07027066163158477),
 ('Agung', 0.07027066163158477),
 ('Barisan', 0.07027066163158477),
 ('Nasional', 0.07027066163158477),
 ('(BN),', 0.07027066163158477),
 ('Datuk', 0.07027066163158477),
 ('Seri', 0.07027066163158477),
 ('Mohamed', 0.07027066163158477),
 ('Nazri', 0.07027066163158477),
 ('Aziz', 0.07027066163158477),
 ('berhubung', 0.07027066163158477),
 ('sekolah', 0.07027066163158477),
 ('vernakular', 0.07027066163158477),
 ('merupakan', 0.07027066163158477),
 ('pandangan', 0.07027066163158477),
 ('peribadi', 0.07027066163158477),
 ('beliau.', 0.07027066163158477),
 ('Timbalan', 0.07037457887922696)]

[20]: r = model.sentence_level(isu_kerajaan, isi_penting='Mohamed Nazri')
pprint(r['summary'])
('Mohamad yang menjalankan tugas-tugas Presiden UMNO berkata, UMNO konsisten '
 'dengan pendirian itu dalam mengiktiraf kepelbagaian bangsa dan etnik '
 'termasuk hak untuk beragama serta mendapat pendidikan. Kata Nazri dalam '
 'kenyataannya itu, beliau menekankan bahawa semua pihak perlu menghormati hak '
 'orang Melayu dan bumiputera. Kata beliau, komitmen UMNO dan BN berhubung '
 'perkara itu dapat dilihat dengan jelas dalam bentuk sokongan infrastruktur, '
 'pengiktirafan dan pemberian peruntukan yang diperlukan.')

Word level

This will predict scores for each word. This interface will not return a summary, just a score for each word.

def word_level(
    self,
    corpus,
    isi_penting: str = None,
    window_size: int = 10,
    aggregation = np.mean,
    soft: bool = False,
    **kwargs
):
    """
    Summarize list of strings / string on word level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    window_size: int, (default=10)
        window size for each word.
    aggregation: Callable, optional (default=numpy.mean)
        Aggregation method for Doc2Vec.
    soft: bool, optional (default=False)
        if True, a word not in the dictionary will be replaced with the nearest JaroWinkler ratio.
        if False, it will return an embedding full of zeros.

    Returns
    -------
    dict: {'score'}
    """

[21]: %%time

r = model.word_level(isu_string, window_size=5)
r['score'][:20]
CPU times: user 71.7 ms, sys: 3.52 ms, total: 75.2 ms
Wall time: 72 ms
[21]: [('DUA', 0.10217850225566492),
 ('legenda', 0.10256503728273535),
 ('hebat', 0.10135587890905765),
 ('dan', 0.10158261783201081),
 ("'The", 0.10185488894991031),
 ('living', 0.10207170990828254),
 ("legend'", 0.10205730536399951),
 ('ini', 0.10202961616004474),
 ('sudah', 0.10210986074917726),
 ('memartabatkan', 0.10203142215244121),
 ('bidang', 0.10433985839774515),
 ('muzik', 0.10437382317184496),
 ('sejak', 0.10437382707131633),
 ('lebih', 0.10444438297719007),
 ('tiga', 0.1044684279620556),
 ('dekad', 0.10444221573136775),
 ('lalu.', 0.10206875552003145),
 ('Jika', 0.10127214013691332),
 ('Datuk', 0.10122756975172667),
 ('Zainal', 0.10127912371730628)]


9.50.5 Load Encoder summarization

We leverage the power of deep encoder models like skip-thought or Transformer to do extractive summarization for us.

def encoder(vectorizer):
    """
    Encoder interface for summarization.

    Parameters
    ----------
    vectorizer : object
        encoder interface object, eg, BERT, skip-thought, XLNET, ALBERT, ALXLNET.
        should have `vectorize` method.

    Returns
    -------
    result: malaya.model.extractive_summarization.Encoder
    """

[22]: lstm = malaya.summarization.extractive.deep_skipthought(model='lstm')
encoder = malaya.summarization.extractive.encoder(lstm)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:root:vectorizer model does not have `attention` method, `top-words` will not work

If we load an encoder model that does not have an ``attention`` method, it will not return ``top-words``.

[23]: alxlnet = malaya.transformer.load(model='alxlnet')
encoder_alxlnet = malaya.summarization.extractive.encoder(alxlnet)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/xlnet.py:70: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/xlnet.py:253: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/xlnet.py:253: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:697: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:memory input None
INFO:tensorflow:Use float type
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:704: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:809: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.

WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:271: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/modeling.py:109: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/__init__.py:95: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/__init__.py:96: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/__init__.py:100: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/alxlnet/__init__.py:103: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/alxlnet-model/base/alxlnet-base/model.ckpt

We can also load a domain-specific transformer, eg, sentiment, so our extractive summary is more sensitive towards sentiment.

[24]: alxlnet = malaya.sentiment.transformer(model='alxlnet', quantized=True)
encoder_sentiment = malaya.summarization.extractive.encoder(alxlnet)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:root:vectorizer model does not have `attention` method, `top-words` will not work

Sentence level

This will predict scores for each sentence,

def sentence_level(
    self,
    corpus,
    isi_penting: str = None,
    top_k: int = 3,
    important_words: int = 10,
    batch_size: int = 16,
    **kwargs
):
    """
    Summarize list of strings / string on sentence level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    top_k: int, (default=3)
        number of summarized strings.
    important_words: int, (default=10)
        number of important words.
    batch_size: int, (default=16)
        for each feed-forward, we only feed N size of texts for each batch.
        This to prevent OOM.

    Returns
    -------
    dict: {'summary', 'top-words', 'cluster-top-words', 'score'}
    """

[25]: %%time

r = encoder.sentence_level(isu_string, isi_penting='antarabangsa')
pprint(r['summary'])
('Bagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan '
 'bersama Zainal cuma tiada publisiti ketika itu. "Sebab itu, saya sukar '
 'menolak untuk bekerjasama dengannya dalam Festival KL Jamm yang dianjurkan '
 'buat julung kali dan berkongsi pentas dalam satu konsert bertaraf '
 'antarabangsa," katanya. Mewakili golongan anak seni, Sheila menaruh harapan '
 'semoga Festival KL Jamm akan menjadi platform buat artis yang sudah ada nama '
 'dan artis muda untuk membuat persembahan, sekali gus sama-sama memartabatkan '
 'industri muzik tempatan.')
CPU times: user 400 ms, sys: 107 ms, total: 507 ms
Wall time: 347 ms

[26]: %%time

r = encoder_alxlnet.sentence_level(isu_string, isi_penting='antarabangsa')
pprint(r['summary'])
('Bagi Sheila pula, dia memang ada terbabit dengan beberapa persembahan '
 'bersama Zainal cuma tiada publisiti ketika itu. "Kami pernah terbabit dengan '
 'showcase dan majlis korporat sebelum ini. Selain itu, Zainal juga terbabit '
 'dengan Konsert Legenda yang membabitkan jelajah empat lokasi sebelum ini.')
CPU times: user 45.9 s, sys: 3.49 s, total: 49.4 s
Wall time: 9.69 s

[27]: %%time

r = encoder_sentiment.sentence_level(isu_string, isi_penting='antarabangsa')
pprint(r['summary'])
('"Rata-rata hanya sekadar menyanyi di laman Instagram dan cuma dikenali '
 'menerusi satu lagu. Namun, ada satu persamaan yang mengeratkan hubungan '
 'mereka kerana sama-sama mencintai bidang muzik sejak dulu. Justeru, '
 'bagaimana mereka mahu buat showcase kalau hanya dikenali dengan satu lagu"?')
CPU times: user 23.5 s, sys: 2.54 s, total: 26 s
Wall time: 6.65 s

Word level

This will predict scores for each word. This interface will not return a summary, just a score for each word.

def word_level(
    self,
    corpus,
    isi_penting: str = None,
    window_size: int = 10,
    important_words: int = 10,
    batch_size: int = 16,
    **kwargs
):
    """
    Summarize list of strings / string on word level.

    Parameters
    ----------
    corpus: str / List[str]
    isi_penting: str, optional (default=None)
        if not None, will put priority based on `isi_penting`.
    window_size: int, (default=10)
        window size for each word.
    important_words: int, (default=10)
        number of important words.
    batch_size: int, (default=16)
        for each feed-forward, we only feed N size of texts for each batch.
        This to prevent OOM.

    Returns
    -------
    dict: {'summary', 'top-words', 'cluster-top-words', 'score'}
    """

[29]: %%time

r = encoder.word_level(isu_string)
r['score'][:20]
CPU times: user 2.21 s, sys: 125 ms, total: 2.33 s
Wall time: 540 ms
[29]: [('DUA', 0.6887927),
 ('legenda', 0.66629106),
 ('hebat', 0.68231773),
 ('dan', 0.7088285),
 ("'The", 0.7100761),
 ('living', 0.7477336),
 ("legend'", 0.7506831),
 ('ini', 0.7550354),
 ('sudah', 0.7749422),
 ('memartabatkan', 0.68789077),
 ('bidang', 0.75334096),
 ('muzik', 0.77209735),
 ('sejak', 0.77883065),
 ('lebih', 0.75666404),
 ('tiga', 0.7756305),
 ('dekad', 0.77535397),
 ('lalu.', 0.7996583),
 ('Jika', 0.7532508),
 ('Datuk', 0.73633057),
 ('Zainal', 0.7388078)]

[32]: %%time

r = encoder_alxlnet.word_level(isu_string)
r['score'][:20]
CPU times: user 3min 20s, sys: 21.3 s, total: 3min 41s
Wall time: 39.9 s
[32]: [('DUA', 0.65859354),
 ('legenda', 0.67562085),
 ('hebat', 0.6850947),
 ('dan', 0.6763989),
 ("'The", 0.67380124),
 ('living', 0.68667483),
 ("legend'", 0.65132356),
 ('ini', 0.92398083),
 ('sudah', 0.90694207),
 ('memartabatkan', 0.9024261),
 ('bidang', 0.8732655),
 ('muzik', 0.56828856),
 ('sejak', 0.24973005),
 ('lebih', 0.27954298),
 ('tiga', 0.27266973),
 ('dekad', 0.83495057),
 ('lalu.', 0.9397317),
 ('Jika', 0.9280187),
 ('Datuk', 0.92466754),
 ('Zainal', 0.93014336)]

[33]: %%time

r = encoder_sentiment.word_level(isu_string)
r['score'][:20]
CPU times: user 3min 33s, sys: 23.9 s, total: 3min 57s
Wall time: 42.6 s
[33]: [('DUA', 0.7614572),
 ('legenda', 0.7729454),
 ('hebat', 0.7851631),
 ('dan', 0.838418),
 ("'The", 0.88925576),
 ('living', 0.5610372),
 ("legend'", 0.89473295),
 ('ini', 0.8995718),
 ('sudah', 0.87563884),
 ('memartabatkan', 0.8340343),
 ('bidang', 0.87825197),
 ('muzik', 0.867957),
 ('sejak', 0.8548101),
 ('lebih', 0.740687),
 ('tiga', 0.8771104),
 ('dekad', 0.89566004),
 ('lalu.', 0.7821031),
 ('Jika', 0.9377091),
 ('Datuk', 0.9417584),
 ('Zainal', 0.8979969)]


9.51 MS to EN

This tutorial is available as an IPython notebook at Malaya/example/ms-en-translation.

This module is only trained on standard language structure, so it is not safe to use it on local (colloquial) language structure.

[1]: %%time

import malaya
CPU times: user 5.11 s, sys: 1.06 s, total: 6.17 s
Wall time: 7.45 s

9.51.1 List available Transformer models

[2]: malaya.translation.ms_en.available_transformer()
INFO:root:tested on 100k MS-EN sentences.
[2]:               Size (MB)  Quantized Size (MB)   BLEU  Suggested length
     small              42.7                 13.4  0.626             256.0
     base              234.0                 82.7  0.792             256.0
     large             815.0                244.0  0.714             256.0
     bigbird           246.0                 63.7  0.678            1024.0
     small-bigbird      50.4                 13.1  0.586            1024.0

We tested on 100k MS-EN sentences.

9.51.2 Load Transformer models

def transformer(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load Transformer encoder-decoder model to translate MS-to-EN.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.
        * ``'large'`` - Transformer LARGE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: malaya.model.tf.Translation class
    """


[1]: transformer = malaya.translation.ms_en.transformer()
transformer_small = malaya.translation.ms_en.transformer(model='small')
transformer_large = malaya.translation.ms_en.transformer(model='large')

9.51.3 Load Quantized model

To load the 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[4]: quantized_transformer = malaya.translation.ms_en.transformer(quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

9.51.4 Translate

Using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    translate list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

Using beam decoder

def beam_decoder(self, strings: List[str]):
    """
    translate list of strings using beam decoder, beam width 3, alpha 0.5.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
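beam_decoder shares the same call signature as greedy_decoder; a minimal sketch (the example sentence is our own, output omitted since beam search is slower than greedy decoding):

# hedged sketch: beam search decoding with the same interface as greedy_decoder
translated = transformer.beam_decoder(['Perdana Menteri menjelaskan perkara itu hari ini.'])
print(translated[0])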

For better results, always split by end of sentences.
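For instance, a minimal sketch of a naive sentence splitter feeding greedy_decoder (the splitting regex is our own assumption, not part of the library):

import re

# hedged sketch: split on sentence-ending punctuation so every element
# stays well under the model's suggested length of 256
def split_sentences(long_text):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', long_text) if s.strip()]

sentences = split_sentences('Ayat pertama. Ayat kedua! Ayat ketiga?')
# then feed the pieces to the model, e.g. transformer.greedy_decoder(sentences)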

[5]: from pprint import pprint


[6]: # https://www.sinarharian.com.my/article/89678/BERITA/Politik/Saya-tidak-mahu-sentuh-isu-politik-Muhyiddin

string_news1 = 'TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.'
pprint(string_news1)
('TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh '
 'mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal '
 'kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas '
 'berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika '
 'berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan '
 'Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.')

[7]: # https://www.sinarharian.com.my/article/90021/BERITA/Politik/Tun-Mahathir-Anwar-disaran-bersara-untuk-selesai-kemelut-politik

string_news2 = 'ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.'
pprint(string_news2)
('ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila '
 'masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. '
 'Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya '
 'mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun '
 'Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri '
 'Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.')

[8]: string_news3 = 'Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan lanjutan tempoh.'
pprint(string_news3)
('Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, '
 'kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi '
 'mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat '
 'asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) '
 'pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan '
 'lanjutan tempoh.')

[9]: # https://qcikgubm.blogspot.com/2018/02/contoh-soalan-dan-jawapan-karangan.html

string_karangan = 'Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. Setelah menyedari hakikat ini, para pelajar akan lebih berminat untuk menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat membantu memberikan pengetahuan am tentang kerjaya ini'
pprint(string_karangan)
('Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang '
 'akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di '
 'Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang '
 'masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar '
 'berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi '
 'masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar '
 'disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki '
 'sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. '
 'Setelah menyedari hakikat ini, para pelajar akan lebih berminat untuk '
 'menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat '
 'membantu memberikan pengetahuan am tentang kerjaya ini')

[10]: # https://www.parlimen.gov.my/bills-dewan-rakyat.html?uweb=dr#, RUU Kumpulan Wang Simpanan Pekerja (Pindaan) 2019

string_parlimen = 'Subfasal 6(b) bertujuan untuk memasukkan subseksyen baharu 39(3) dan (4) ke dalam Akta 452. Subseksyen (3) yang dicadangkan bertujuan untuk menjadikan suatu kesalahan bagi mana-mana orang yang meninggalkan Malaysia tanpa membayar caruman yang tertunggak dan kena dibayar atau mengemukakan jaminan bagi pembayarannya. Subseksyen (4) yang dicadangkan memperuntukkan bahawa bagi maksud seksyen 39 Akta 452, “caruman” termasuklah apa-apa dividen atau caj lewat bayar yang kena dibayar ke atas mana-mana caruman.'
pprint(string_parlimen)
('Subfasal 6(b) bertujuan untuk memasukkan subseksyen baharu 39(3) dan (4) ke '
 'dalam Akta 452. Subseksyen (3) yang dicadangkan bertujuan untuk menjadikan '
 'suatu kesalahan bagi mana-mana orang yang meninggalkan Malaysia tanpa '
 'membayar caruman yang tertunggak dan kena dibayar atau mengemukakan jaminan '
 'bagi pembayarannya. Subseksyen (4) yang dicadangkan memperuntukkan bahawa '
 'bagi maksud seksyen 39 Akta 452, “caruman” termasuklah apa-apa dividen atau '
 'caj lewat bayar yang kena dibayar ke atas mana-mana caruman.')

[11]: string_random1 = 'saya menikmati filem mengenai makhluk asing yang menyerang bumi. <> Saya fikir fiksyen sains adalah genre yang luar biasa untuk apa sahaja. Sains masa depan, teknologi, perjalanan masa, perjalanan FTL, semuanya adalah konsep yang menarik. <> Saya sendiri peminat fiksyen sains!'
pprint(string_random1)
('saya menikmati filem mengenai makhluk asing yang menyerang bumi. <> Saya '
 'fikir fiksyen sains adalah genre yang luar biasa untuk apa sahaja. Sains '
 'masa depan, teknologi, perjalanan masa, perjalanan FTL, semuanya adalah '
 'konsep yang menarik. <> Saya sendiri peminat fiksyen sains!')

[12]: string_random2 = 'Fiksyen sains <> saya menikmati filem mengenai makhluk asing yang menyerang bumi. <> Fiksyen sains (sering dipendekkan menjadi SF atau sci-fi) adalah genre fiksyen spekulatif, biasanya berurusan dengan konsep khayalan seperti sains dan teknologi futuristik, perjalanan angkasa, perjalanan waktu, lebih cepat daripada perjalanan ringan, alam semesta selari, dan kehidupan di luar bumi .'
pprint(string_random2)
('Fiksyen sains <> saya menikmati filem mengenai makhluk asing yang menyerang '
 'bumi. <> Fiksyen sains (sering dipendekkan menjadi SF atau sci-fi) adalah '
 'genre fiksyen spekulatif, biasanya berurusan dengan konsep khayalan seperti '
 'sains dan teknologi futuristik, perjalanan angkasa, perjalanan waktu, lebih '
 'cepat daripada perjalanan ringan, alam semesta selari, dan kehidupan di luar '
 'bumi .')


Comparing with Google Translate

These screenshots were taken on 4th July 2020; Google constantly updates its models, so Google Translate may improve in the future.

string_news1

[13]: from IPython.core.display import Image, display

display(Image('string1.png', width=450))

Tan Sri Muhyiddin Yassin said he did not want to touch on current political issues, instead focusing on the welfare of the people and revitalizing the country’s economy following the Covid-19 pandemic. The prime minister explained this when speaking at a Leaders’ Meetings with leaders of the Gambir State Legislative Assembly (assembly) at the Bukit Gambir Multipurpose Hall today.

string_karangan

[14]: display(Image('string2.png', width=450))


Additionally, career fairs help students determine which careers they will pursue. As we know, the job market in Malaysia is very broad and many of the jobs in the country are still vacant because it is difficult to find a truly qualified workforce. For example, the medical sector in Malaysia is facing a significant shortage of labor force, in particular by specialists due to the resignation of doctors and medical professionals to enter the private sector as well as expanding health and medical services. Upon realizing this fact, students will be more interested in the field of medicine as the career exhibitions help provide a wealth of knowledge about this profession.

string_parlimen

[15]: display(Image('string3.png', width=450))


Subsection 6 (b) seeks to introduce new subsections 39 (3) and (4) into Act 452. Subsection (3) is intended to make it an offense for any person to leave Malaysia without paying any outstanding and payable contribution or submit a guarantee for payment. The proposed subsection (4) provides that for the purposes of section 39 of Act 452, “contribution” includes any dividend or late payment chargeable on any contribution.

Translate transformer base

[13]: %%time

pprint(transformer.greedy_decoder([string_news1, string_news2, string_news3]))
['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues at the moment, instead focusing on the welfare of the '
 "people and efforts to revitalize the affected country's economy following "
 'the Covid-19 pandemic. The prime minister explained the matter when speaking '
 'at a Leadership Meeting with Gambir State Assembly (DUN) leaders at the '
 'Bukit Gambir Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political turmoil has not ended when it '
 "has failed to finalize the Prime Minister's candidate agreed upon. Sik MP "
 'Ahmad Tarmizi Sulaiman said he had suggested former United Nations (UN) '
 "Indigenous Party chairman Tun Dr Mahathir Mohamad and People's Justice Party "
 '(PKR) president Datuk Seri Anwar Ibrahim resign from politics as a solution.',
 'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'relaxation was given as the government was aware of the problems they had to '
 'renew the document. He added that for foreigners who had passed the social '
 'visit during the Movement Control Order (CPP) they could go to the nearest '
 'Immigration Department office for further extension.']
CPU times: user 23.9 s, sys: 14.5 s, total: 38.4 s
Wall time: 9.95 s

[14]: %%time

pprint(quantized_transformer.greedy_decoder([string_news1, string_news2, string_news3]))
['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues at the moment, instead focusing on the welfare of the '
 "people and efforts to revitalize the affected country's economy following "
 'the Covid-19 pandemic. The prime minister explained the matter when speaking '
 'at a Leadership Meeting with Gambir State Assembly (DUN) leaders at the '
 'Bukit Gambir Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political turmoil has not ended when it '
 "has failed to finalize the Prime Minister's candidate agreed upon. Sik MP "
 'Ahmad Tarmizi Sulaiman said he had suggested former United Nations (UN) '
 "Indigenous Party chairman Tun Dr Mahathir Mohamad and People's Justice Party "
 '(PKR) president Datuk Seri Anwar Ibrahim resign from politics as a solution.',
 'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'relaxation was given as the government was aware of the problems they had to '
 'renew the document. He added that for foreigners who had passed the social '
 'visit during the Movement Control Order (CPP) they could go to the nearest '
 'Immigration Department office for an extension.']
CPU times: user 23.5 s, sys: 13.9 s, total: 37.5 s
Wall time: 9.58 s

[15]: %%time

pprint(transformer.greedy_decoder([string_karangan, string_parlimen]))
['In addition, career exhibitions help students determine their careers. As we '
 'know, the career market in Malaysia is very broad and there are still many '
 'job sectors in the country that are still vacant because it is difficult to '
 'find a truly qualified workforce. For example, the medical sector in '
 'Malaysia is facing a critical shortage of labor, especially specialists due '
 'to the resignation of doctors and physicians to enter the private sector and '
 'develop health and medical services. Upon realizing this fact, students will '
 'be more interested in medicine because the exhibition careers are very '
 'helpful in providing general knowledge of this career.',
 'Subclause 6 (b) seeks to introduce new subsections 39 (3) and (4) into Act '
 '452. Subsection (3) proposed to make an offense for any person leaving '
 'Malaysia without paying a deferred and payable contribution or filing a '
 'guarantee for payment. Subsection (4) proposed provides that for the purpose '
 'of section 39 of Act 452, the "contribution" includes any dividend or late '
 'payment charge payable on any contribution.']
CPU times: user 30.8 s, sys: 17 s, total: 47.7 s
Wall time: 10.6 s

[16]: %%time

pprint(quantized_transformer.greedy_decoder([string_karangan, string_parlimen]))
['In addition, career exhibitions help students determine their careers. As we '
 'know, the career market in Malaysia is very broad and there are still many '
 'job sectors in the country that are still vacant because it is difficult to '
 'find a truly qualified workforce. For example, the medical sector in '
 'Malaysia is facing a critical shortage of labor, especially specialists due '
 'to the resignation of doctors and physicians to enter the private sector and '
 'develop health and medical services. Upon realizing this fact, students will '
 'be more interested in the medical field as the career exhibitions are very '
 'helpful to provide general knowledge of this career.',
 'Subclause 6 (b) seeks to introduce new subsections 39 (3) and (4) into Act '
 '452. Subsection (3) proposed to make an offense for any person leaving '
 'Malaysia without paying a deferred and payable contribution or to submit a '
 'guarantee for his payment. Subsection (4) proposed provides that for the '
 'purpose of section 39 of Act 452, the "contribution" includes any dividend '
 'or late payment charge payable on any contribution.']
CPU times: user 30.9 s, sys: 16.8 s, total: 47.7 s
Wall time: 10.8 s

[17]: %%time

result = transformer.greedy_decoder([string_random1, string_random2])
pprint(result)
['I enjoy movies about aliens attacking the earth. <> I think science fiction '
 'is an incredible genre for anything. Future science, technology, time '
 "travel, FTL travel, everything is an exciting concept. <> I'm a science "
 'fiction fan!',
 'Science fiction <> I enjoy movies about aliens invading the earth. <> '
 'Science fiction (often shortened to SF or sci-fi) is a genre of speculative '
 'fiction, usually dealing with imaginary concepts such as science and '
 'futuristic technology, space travel, time travel, faster than light travel, '
 'parallel universe, and life abroad.']
CPU times: user 19.1 s, sys: 10.3 s, total: 29.5 s
Wall time: 6.77 s

[18]: %%time

result = quantized_transformer.greedy_decoder([string_random1, string_random2])
pprint(result)
['I enjoy movies about aliens attacking the earth. <> I think science fiction '
 'is an incredible genre for anything. Future science, technology, time '
 "travel, FTL travel, everything is an exciting concept. <> I'm a science "
 'fiction fan!',
 'Science fiction <> I enjoy movies about aliens invading the earth. <> '
 'Science fiction (often shortened to SF or sci-fi) is a genre of speculative '
 'fiction, usually dealing with imaginary concepts such as science and '
 'futuristic technology, space travel, time travel, faster than light travel, '
 'parallel universe, and life abroad.']
CPU times: user 19.2 s, sys: 11.7 s, total: 30.9 s
Wall time: 6.66 s

Translate transformer small

[19]: %%time

pprint(transformer_small.greedy_decoder([string_news1, string_news2, string_news3]))
['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues at this time, instead focusing on the welfare of the people '
 "and efforts to revitalize the country's economy affected following the "
 'Covid-19 pandemic. The Prime Minister explained the matter when speaking at '
 'the Leaders Meeting with the leaders of the Gambir State Assembly (DUN) '
 'community at the Bukit Gambir Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political turmoil has not been expected '
 "when it still fails to finalize the Prime Minister's candidate agreed "
 'together. Sik MP Ahmad Tarmizi Sulaiman said the party had suggested former '
 'United Nations Indigenous Party (UN) chairman Tun Dr Mahathir Mohamad and '
 "President of the People's Justice Party (PKR), Datuk Seri Anwar Ibrahim "
 'resigned from politics as a solution.',
 'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'relaxation was given as the government was aware of the problems they faced '
 'to renew the document. He said in addition, for foreigners who had passed '
 'the social visit expired during the Movement Control Order (PKP) could go to '
 'the nearest Immigration Department office for further time.']
CPU times: user 3.46 s, sys: 871 ms, total: 4.33 s
Wall time: 1.5 s

[20]: %%time

pprint(transformer_small.greedy_decoder([string_karangan, string_parlimen]))
['In addition, career exhibitions help students determine their careers. As we '
 'know, the career market in Malaysia is very broad and many employment '
 'sectors in the country are still vacant because it is difficult to find a '
 'truly qualified workforce. For example, the medical sector in Malaysia is '
 'facing critical labor shortages, especially specialists as specialists '
 'resign as well as medical professionals to enter the private sector and the '
 'development of health and medical services. After realizing this fact that '
 'students will be more interested in medicine as the exhibition of careers is '
 'helping to provide general knowledge of this career in this career as this '
 'career is especially in providing general knowledge of this career as this '
 'career as this career is difficult to gain knowledge of this career as it is '
 'difficult to gain knowledge of this career is difficult to find the career '
 'as it is difficult to find the career as it is difficult to enter this '
 'career as it is difficult to enter this career as it is difficult to gain '
 'through this career is difficult to enter the private sector and medical '
 'career as it is difficult to find the field of career as it is difficult to '
 'gain knowledge of this career as it is difficult to find the field of career '
 'as it is difficult to find the field of career as it is especially in the '
 'field of career as it is difficult to find the field of this career as it is '
 'difficult to find the field of career as it is difficult to find that it is '
 'difficult to find that it is difficult to',
 'Subclaal 6 (b) aims to include a new subsection of 39 (3) and (4) into Act '
 '452. Subsection (3) proposed aimed at making a mistake for any person who '
 'leaves Malaysia without paying for outstanding contributions and paid or to '
 'provide guarantees for his payment. Subsection (4) proposed provides that '
 'for the purpose of section 39 of Act 452, "contributions" including any '
 'dividend or late payment charge to be paid to any contribution.']
CPU times: user 9.94 s, sys: 1.8 s, total: 11.7 s
Wall time: 3.14 s

[21]: %%time

result = transformer_small.greedy_decoder([string_random1, string_random2])
pprint(result)
['I enjoy movies about aliens attacking the earth. <> I think science fiction '
 'is a great genre for whatever future science, technology, travel, FTL '
 'travel, all of which is an interesting concept. <> I personally love science '
 'fiction!',
 'science fiction <> I enjoy movies about aliens who attack the earth. <> The '
 'science fiction (often shortened to SF or sci-fi) is a speculative fiction '
 'genre, usually dealing with the concept of imaginary science and futuristic '
 'technology, space travel, travel, faster than light travel, parallel '
 'universe, and outer life.']
CPU times: user 2.64 s, sys: 402 ms, total: 3.04 s
Wall time: 773 ms

Translate transformer large

[ ]: %%time

pprint(transformer_large.greedy_decoder([string_news1, string_news2, string_news3]))

[ ]: %%time

pprint(transformer_large.greedy_decoder([string_karangan, string_parlimen]))

[ ]: %%time

result = transformer_large.greedy_decoder([string_random1, string_random2])
pprint(result)

9.52 EN to MS

This tutorial is available as an IPython notebook at Malaya/example/en-ms-translation.

This module is only trained on standard language structure, so it is not safe to use on local (colloquial) language structure.

[1]: %%time

import malaya


CPU times: user 5.24 s, sys: 1.15 s, total: 6.4 s
Wall time: 10 s

9.52.1 List available Transformer models

[2]: malaya.translation.en_ms.available_transformer()
INFO:root:tested on 77k EN-MS sentences.
[2]:                Size (MB)  Quantized Size (MB)   BLEU  Suggested length
     small               42.7                 13.4  0.512             256.0
     base               234.0                 82.7  0.696             256.0
     large              817.0                244.0  0.699             256.0
     bigbird            246.0                 63.7  0.551            1024.0
     small-bigbird       50.4                 13.1  0.522            1024.0

We tested on 77k EN-MS sentences.

9.52.2 Load Transformer models

def transformer(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load Transformer encoder-decoder model to translate EN-to-MS.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.
        * ``'large'`` - Transformer LARGE parameters.
        * ``'bigbird'`` - BigBird BASE parameters.
        * ``'small-bigbird'`` - BigBird SMALL parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        if `bigbird` in model, return malaya.model.bigbird.Translation
        else, return malaya.model.tf.Translation
    """

[3]: transformer = malaya.translation.en_ms.transformer()
transformer_small = malaya.translation.en_ms.transformer(model='small')
transformer_large = malaya.translation.en_ms.transformer(model='large')


9.52.3 Load Quantized model

To load the 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[4]: quantized_transformer = malaya.translation.en_ms.transformer(quantized=True)

9.52.4 Translate

Using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    translate list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

Using beam decoder

def beam_decoder(self, strings: List[str]):
    """
    translate list of strings using beam decoder, beam width 3, alpha 0.5.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
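As on the MS-EN side, beam decoding keeps the same call shape as greedy decoding; a minimal sketch with the short string defined below (output omitted):

# hedged sketch: beam decoder for EN-to-MS, same interface as greedy_decoder
print(transformer.beam_decoder(['i am in medical school.'])[0])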

[5]: from pprint import pprint

[6]: # https://www.malaymail.com/news/malaysia/2020/07/01/dr-mahathir-again-claims-anwar-lacks-popularity-with-malays-to-be-pakatans/1880420

string_news1 = 'KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the prime minister candidate as he is allegedly not "popular" among the Malays, Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said the PKR president needs someone like himself in order to acquire support from the Malays and win the election.'
pprint(string_news1)
('KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the '
 'prime minister candidate as he is allegedly not "popular" among the Malays, '
 'Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said '
 'the PKR president needs someone like himself in order to acquire support '
 'from the Malays and win the election.')

[7]: # https://edition.cnn.com/2020/07/06/politics/new-york-attorney-general-blm/index.html

string_news2 = '(CNN)New York Attorney General Letitia James on Monday ordered the Black Lives Matter Foundation -- which she said is not affiliated with the larger Black Lives Matter movement -- to stop collecting donations in New York. "I ordered the Black Lives Matter Foundation to stop illegally accepting donations that were intended for the #BlackLivesMatter movement. This foundation is not affiliated with the movement, yet it accepted countless donations and deceived goodwill," James tweeted.'
pprint(string_news2)
('(CNN)New York Attorney General Letitia James on Monday ordered the Black '
 'Lives Matter Foundation -- which she said is not affiliated with the larger '
 'Black Lives Matter movement -- to stop collecting donations in New York. "I '
 'ordered the Black Lives Matter Foundation to stop illegally accepting '
 'donations that were intended for the #BlackLivesMatter movement. This '
 'foundation is not affiliated with the movement, yet it accepted countless '
 'donations and deceived goodwill," James tweeted.')

[8]: # https://www.thestar.com.my/business/business-news/2020/07/04/malaysia-worries-new-eu-food-rules-could-hurt-palm-oil-exports

string_news3 = 'Amongst the wide-ranging initiatives proposed are a sustainable food labelling framework, a reformulation of processed foods, and a sustainability chapter in all EU bilateral trade agreements. The EU also plans to publish a proposal for a legislative framework for sustainable food systems by 2023 to ensure all foods on the EU market become increasingly sustainable.'
pprint(string_news3)
('Amongst the wide-ranging initiatives proposed are a sustainable food '
 'labelling framework, a reformulation of processed foods, and a '
 'sustainability chapter in all EU bilateral trade agreements. The EU also '
 'plans to publish a proposal for a legislative framework for sustainable food '
 'systems by 2023 to ensure all foods on the EU market become increasingly '
 'sustainable.')

[9]: # https://jamesclear.com/articles

string_article1 = 'This page shares my best articles to read on topics like health, happiness, creativity, productivity and more. The central question that drives my work is, “How can we live better?” To answer that question, I like to write about science-based ways to solve practical problems.'
pprint(string_article1)
('This page shares my best articles to read on topics like health, happiness, '
 'creativity, productivity and more. The central question that drives my work '
 'is, “How can we live better?” To answer that question, I like to write about '
 'science-based ways to solve practical problems.')

[10]: # https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536

string_article2 = 'Fuzzy matching at scale. From 3.7 hours to 0.2 seconds. How to perform intelligent string matching in a way that can scale to even the biggest data sets. Data in the real world is messy. Dealing with messy data sets is painful and burns through time which could be spent analysing the data itself.'
pprint(string_article2)
('Fuzzy matching at scale. From 3.7 hours to 0.2 seconds. How to perform '
 'intelligent string matching in a way that can scale to even the biggest data '
 'sets. Data in the real world is messy. Dealing with messy data sets is '
 'painful and burns through time which could be spent analysing the data '
 'itself.')

[11]: random_string1 = 'i am in medical school.'
random_string2 = 'Emmerdale is the debut studio album,songs were not released in the U.S <> These songs were not released in the U.S. edition of said album and were previously unavailable on any U.S. release.'
pprint(random_string2)
('Emmerdale is the debut studio album,songs were not released in the U.S <> '
 'These songs were not released in the U.S. edition of said album and were '
 'previously unavailable on any U.S. release.')

Comparing with Google Translate

These screenshots were taken on 7th July 2020; the results may improve in the future.

string_news1

[11]: from IPython.core.display import Image, display

display(Image('en-string1.png', width=450))


KUALA LUMPUR, 1 Julai - Anwar Ibrahim tidak sesuai menjadi calon perdana menteri kerana dia dikatakan tidak “popular” di kalangan orang Melayu, kata Tun Dr Mahathir Mohamad. Bekas perdana menteri itu dilaporkan mengatakan bahawa presiden PKR memerlukan seseorang seperti dirinya untuk mendapatkan sokongan orang Melayu dan memenangi pilihan raya.

string_news2

[12]: display(Image('en-string2.png', width=450))


(CNN) Peguam Negara New York, Letitia James pada hari Isnin memerintahkan Yayasan Black Lives Matter - yang menurutnya tidak berafiliasi dengan gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan derma di New York. “Saya memerintahkan Black Lives Matter Foundation untuk berhenti secara haram menerima sumbangan yang ditujukan untuk gerakan #BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun ia menerima banyak sumbangan dan menipu muhibah,” tweet James.

string_news3

[13]: display(Image('en-string3.png', width=450))


Di antara inisiatif luas yang dicadangkan adalah kerangka pelabelan makanan yang berkelanjutan, penyusunan semula makanan yang diproses, dan bab keberlanjutan dalam semua perjanjian perdagangan dua hala EU. EU juga berencana untuk menerbitkan proposal untuk kerangka perundangan untuk sistem makanan lestari pada tahun 2023 untuk memastikan semua makanan di pasar EU menjadi semakin

random_string2

[14]: display(Image('en-string4.png', width=450))


Emmerdale adalah album studio sulung, lagu-lagu tidak dirilis di A.S.

Translate transformer base

[12]: %%time

pprint(transformer.greedy_decoder([string_news1, string_news2, string_news3]))
['KUALA LUMPUR 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai menjadi calon '
 'Perdana Menteri kerana beliau didakwa tidak "popular" dalam kalangan orang '
 'Melayu, Tun Dr Mahathir Mohamad mendakwa, bekas Perdana Menteri itu '
 'dilaporkan berkata Presiden PKR itu memerlukan seseorang seperti dirinya '
 'bagi mendapatkan sokongan daripada orang Melayu dan memenangi pilihan raya.',
 '(CNN) Peguam Negara New York Letitia James pada hari Isnin memerintahkan '
 'Black Lives Matter Foundation - yang menurutnya tidak berafiliasi dengan '
 'gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan '
 'sumbangan di New York. "Saya memerintahkan Black Lives Matter Foundation '
 'untuk berhenti menerima sumbangan secara haram yang bertujuan untuk gerakan '
 '#BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun '
 'ia menerima banyak sumbangan dan muhibah yang ditipu," tweet James.',
 'Di antara inisiatif luas yang diusulkan adalah kerangka pelabelan makanan '
 'yang berkelanjutan, reformulasi makanan yang diproses, dan bab keberlanjutan '
 'dalam semua perjanjian perdagangan dua hala EU. EU juga berencana untuk '
 'menerbitkan proposal untuk kerangka perundangan untuk sistem makanan lestari '
 'pada tahun 2023 untuk memastikan semua makanan di pasar EU menjadi semakin '
 'lestari.']
CPU times: user 24.3 s, sys: 14 s, total: 38.3 s
Wall time: 11.6 s

[13]: %%time

pprint(transformer.greedy_decoder([string_article1, string_article2]))
['Halaman ini berkongsi artikel terbaik saya untuk dibaca mengenai topik '
 'seperti kesihatan, kebahagiaan, kreativiti, produktiviti dan banyak lagi. '
 'Soalan utama yang mendorong kerja saya adalah, "Bagaimana kita dapat hidup '
 'lebih baik?" Untuk menjawab soalan itu, saya suka menulis mengenai kaedah '
 'berasaskan sains untuk menyelesaikan masalah praktikal.',
 'Pemadanan kabur pada skala. Dari 3.7 jam hingga 0.2 saat. Cara melakukan '
 'pemadanan rentetan pintar dengan cara yang dapat meningkatkan bahkan set '
 'data terbesar. Data di dunia nyata tidak kemas. Berurusan dengan set data '
 'yang tidak kemas menyakitkan dan terbakar sepanjang masa yang dapat '
 'dihabiskan untuk menganalisis data itu sendiri.']
CPU times: user 15.9 s, sys: 9.21 s, total: 25.2 s
Wall time: 6.32 s

[14]: %%time

pprint(transformer.greedy_decoder([random_string1, random_string2]))
['saya di sekolah perubatan.',
 'Emmerdale adalah album studio debut, lagu-lagu tidak dikeluarkan di A.S <> '
 'Lagu-lagu ini tidak dikeluarkan dalam edisi A.S. album tersebut dan '
 'sebelumnya tidak tersedia pada sebarang pelepasan A.S.']
CPU times: user 9.98 s, sys: 5.52 s, total: 15.5 s
Wall time: 4.23 s

Translate transformer small

[15]: %%time

pprint(transformer_small.greedy_decoder([string_news1, string_news2, string_news3]))
['KUALA LUMPUR 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai kerana calon '
 'perdana menteri kerana didakwa tidak "popular" dalam kalangan orang Melayu, '
 'Tun Dr Mahathir Mohamad mendakwa. Bekas perdana menteri itu dilaporkan '
 'berkata, presiden PKR itu memerlukan seseorang seperti dirinya sendiri untuk '
 'memperoleh sokongan daripada orang Melayu dan memenangi pilihan raya.hari '
 'ini, Datuk Seri Anwar Ibrahim tidak sesuai untuk menjadi calon',
 '(CNN) Peguam Negara New York Letitia James pada hari Isnin memerintahkan '
 'Yayasan Black Lives Matter - yang menurutnya tidak berafiliasi dengan '
 'gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan '
 'sumbangan di New York. "Saya memerintahkan Yayasan Black Lives Matter untuk '
 'berhenti menerima sumbangan secara haram yang bertujuan untuk gerakan '
 '#BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun '
 'ia menerima banyak sumbangan dan muhibah yang menipu," tweet James.',
 'Amongst inisiatif luas yang dicadangkan adalah kerangka kerja kerja kerja '
 'makanan yang berkelanjutan, penyusunan semula makanan yang diproses, dan bab '
 'kelestarian dalam semua perjanjian perdagangan dua hala EU. EU juga '
 'merancang untuk menerbitkan cadangan kerangka perundangan untuk sistem '
 'makanan lestari pada tahun 2023 untuk memastikan semua makanan di pasaran EU '
 'semakin lestari.']
CPU times: user 3.69 s, sys: 773 ms, total: 4.46 s
Wall time: 1.61 s

[16]: %%time

pprint(transformer_small.greedy_decoder([string_article1, string_article2]))
['Halaman ini berkongsi artikel terbaik saya untuk membaca topik seperti '
 'kesihatan, kebahagiaan, kreativiti, produktiviti dan banyak lagi. Soalan '
 'pusat yang mendorong karya saya adalah, "Bagaimana kita dapat hidup lebih '
 'baik?" Untuk menjawab soalan itu, saya suka menulis mengenai cara berasaskan '
 'sains untuk menyelesaikan masalah praktikal.',
 'Pemadanan Fuzzy pada skala. Dari 3.7 jam hingga 0.2 saat. Cara melakukan '
 'pemadanan rentetan pintar dengan cara yang dapat meningkatkan set data '
 'terbesar bahkan. Data di dunia nyata tidak kemas. Berurusan dengan set data '
 'yang tidak kemas menyakitkan dan terbakar melalui masa yang dapat dihabiskan '
 'untuk menganalisis data itu sendiri.']
CPU times: user 2.45 s, sys: 384 ms, total: 2.84 s
Wall time: 738 ms

[17]: %%time

pprint(transformer_small.greedy_decoder([random_string1, random_string2]))
['saya berada di sekolah perubatan.',
 'Emmerdale adalah album studio sulung, lagu-lagu tidak dikeluarkan di A.S <> '
 'Lagu-lagu ini tidak dikeluarkan di edisi A.S. yang dikatakan album dan '
 'sebelumnya tidak tersedia di mana-mana pelepasan A.S.']
CPU times: user 1.7 s, sys: 291 ms, total: 1.99 s
Wall time: 535 ms

Translate transformer large

[18]: %%time

pprint(transformer_large.greedy_decoder([string_news1, string_news2, string_news3]))
['KUALA LUMPUR 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai menjadi calon '
 'Perdana Menteri kerana beliau didakwa tidak "popular" dalam kalangan orang '
 'Melayu, kata Tun Dr Mahathir Mohamad. Bekas Perdana Menteri itu dilaporkan '
 'berkata, Presiden PKR memerlukan seseorang seperti dirinya bagi mendapatkan '
 'sokongan daripada orang Melayu dan memenangi pilihan raya.',
 '(CNN) Peguam Negara New York Letitia James pada hari Isnin memerintahkan '
 'Black Lives Matter Foundation - yang menurutnya tidak berafiliasi dengan '
 'gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan '
 'sumbangan di New York. "Saya memerintahkan Black Lives Matter Foundation '
 'untuk berhenti menerima sumbangan secara haram yang bertujuan untuk gerakan '
 '#BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun '
 'ia menerima banyak sumbangan dan muhibah yang ditipu," tweet James.',
 'Di antara inisiatif luas yang diusulkan adalah kerangka pelabelan makanan '
 'berkelanjutan, penyusunan semula makanan yang diproses, dan bab '
 'keberlanjutan dalam semua perjanjian perdagangan dua hala EU. EU juga '
 'berencana untuk menerbitkan proposal untuk kerangka perundangan untuk sistem '
 'makanan berkelanjutan pada tahun 2023 untuk memastikan semua makanan di '
 'pasar EU menjadi semakin berkelanjutan.']
CPU times: user 1min 3s, sys: 26.5 s, total: 1min 30s
Wall time: 26.1 s

[19]: %%time

pprint(transformer_large.greedy_decoder([string_article1, string_article2]))
['Halaman ini berkongsi artikel terbaik saya untuk membaca topik seperti '
 'kesihatan, kebahagiaan, kreativiti, produktiviti dan banyak lagi. Soalan '
 'utama yang mendorong karya saya adalah, "Bagaimana kita dapat hidup lebih '
 'baik?" Untuk menjawab soalan itu, saya suka menulis mengenai kaedah '
 'berasaskan sains untuk menyelesaikan masalah praktikal.',
 'Pemadanan kabur pada skala. Dari 3.7 jam hingga 0.2 saat. Cara melakukan '
 'pemadanan rentetan pintar dengan cara yang dapat meningkatkan skala ke set '
 'data terbesar. Data di dunia nyata tidak kemas. Berurusan dengan set data '
 'yang tidak kemas menyakitkan dan terbakar sepanjang masa yang dapat '
 'dihabiskan untuk menganalisis data itu sendiri.']
CPU times: user 48.9 s, sys: 19.5 s, total: 1min 8s
Wall time: 12.5 s

[20]: %%time

pprint(transformer_large.greedy_decoder([random_string1, random_string2]))
['saya di sekolah perubatan.',
 'Emmerdale adalah album studio debut, lagu-lagu tidak dikeluarkan di AS <> '
 'Lagu-lagu ini tidak dikeluarkan dalam edisi A.S. album tersebut dan '
 'sebelumnya tidak tersedia untuk sebarang pelepasan A.S.']
CPU times: user 29.9 s, sys: 12.3 s, total: 42.2 s
Wall time: 7.36 s


9.53 Long Text Translation

This tutorial is available as an IPython notebook at Malaya/example/long-text-translation.

This module is only trained on standard language structure, so it is not safe to use on local (colloquial) language structure.

[1]: %%time

import malaya
CPU times: user 4.71 s, sys: 916 ms, total: 5.63 s
Wall time: 6.2 s


9.53.1 List available Transformer models

[2]: malaya.translation.ms_en.available_transformer()
INFO:root:tested on 100k MS-EN sentences.
[2]:                Size (MB)  Quantized Size (MB)   BLEU  Suggested length
     small               42.7                 13.4  0.626             256.0
     base               234.0                 82.7  0.792             256.0
     large              815.0                244.0  0.714             256.0
     bigbird            246.0                 63.7  0.678            1024.0
     small-bigbird       50.4                 13.1  0.586            1024.0

If you look at Suggested length, bigbird is 4x longer than the normal transformer models, so it can infer longer text without needing to partition it first. Let's do some examples: we are going to compare the small and base models with the small-bigbird and bigbird models on the MS-EN translation task. Feel free to test EN-MS translation too.
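If you still want the small or base models on text longer than their suggested length, one workaround is to partition the input yourself; a minimal sketch (the chunking heuristic and max_words threshold are our own assumptions, not a Malaya API):

import re

# hedged sketch: pack sentences into chunks below the suggested length,
# translate each chunk with the greedy decoder, then join the chunk translations
def translate_long(model, text, max_words=150):
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s]
    chunks, current = [], []
    for sentence in sentences:
        if current and len(' '.join(current + [sentence]).split()) > max_words:
            chunks.append(' '.join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(' '.join(current))
    return ' '.join(model.greedy_decoder(chunks))

# usage: translate_long(transformer, string), once `string` is defined below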

9.53.2 Load Transformer models

def transformer(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load Transformer encoder-decoder model to translate MS-to-EN.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.
        * ``'large'`` - Transformer LARGE parameters.
        * ``'bigbird'`` - BigBird BASE parameters.
        * ``'small-bigbird'`` - BigBird SMALL parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        if `bigbird` in model, return malaya.model.bigbird.Translation
        else, return malaya.model.tf.Translation
    """

[3]: transformer = malaya.translation.ms_en.transformer()
transformer_small = malaya.translation.ms_en.transformer(model='small')

[4]: bigbird = malaya.translation.ms_en.transformer(model='bigbird')
quantized_bigbird = malaya.translation.ms_en.transformer(model='bigbird', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.


[5]: bigbird_small = malaya.translation.ms_en.transformer(model='small-bigbird')
quantized_bigbird_small = malaya.translation.ms_en.transformer(model='small-bigbird', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

9.53.3 Long text examples

[6]: # https://www.bharian.com.my/berita/nasional/2021/02/785400/kabinet-tidak-pernah-bangkitkan-isu-konflik-kepentingan-najib
string = """
KUALA LUMPUR: Mahkamah Tinggi di sini, hari ini, diberitahu Kabinet tidak pernah membangkitkan isu mengenai konflik kepentingan Datuk Seri Najib Razak dalam 1Malaysia Development Berhad (1MDB).

Menurut Kod Etika Bagi Anggota Pentadbiran dan Ahli Parlimen adalah menjadi amalan ahli Kabinet untuk mengisytiharkan konflik kepentingan sekiranya mempunyai pembabitan di dalam hal yang dibincangkan dalam Mesyuarat Jemaah Menteri.

Perkara itu dimaklumkan bekas Timbalan Ketua Setiausaha (Kabinet) Bahagian Kabinet Perlembagaan dan Perhubungan Jabatan Perdana Menteri (JPM), Tan Sri Mazidah Abdul Majid, dalam keterangannya pada perbicaraan kes 1MDB yang dihadapi Najib.

Kod etika itu antara lain turut menyatakan bahawa ahli Kabinet yang mempunyai kepentingan peribadi dan bercanggah dengan kepentingan kerajaan, atau membabitkan ahli keluarga, perlu meninggalkan mesyuarat dan merekodkan ketidakhadiran mereka.

Di dalam 1MDB, Najib memegang tiga jawatan iaitu Perdana Menteri, Menteri Kewangan dan Pengerusi Badan Penasihat 1MDB

Menjawab soalan peguam Tan Sri Muhammad Shafee Abdullah, sama ada terdapat ahli Kabinet yang membangkitkan isu bahawa Menteri Kewangan tidak seharusnya membabitkan diri dalam perbincangan itu kerana konflik kepentingan, Maziah menjawab: "Tiada."

Muhammad Shafee: Ada sesiapa yang membangkitkan perkara berhubung hal 1MBD dengan Najib?

Maziah: Tidak

Muhammad Shafee: Timbalan Perdana Menteri (ketika itu) adalah Tan Sri Muhyiddin Yassin, manakala Datuk Seri Ahmad Husni Hanadzlah adalah bekas Menteri Kewangan II. Mereka juga tidak pernah membangkitkan hal konflik kepentingan?

Mazidah: Ya, tidak pernah.

Sementara itu, menjawab soalan Timbalan Pendakwa Raya, Ahmad Akram Gharib sama ada beliau mengetahui bahawa Najib mempunyai kepentingan peribadi dalam 1MDB, Mazidah berkata: "Tidak"

Ahmad Akram: Adakah anda mengetahui bahawa Najib secara peribadi menerima duit daripada 1MDB?

Maziah: Tidak

Ahmad Akram: Sekiranya Najib secara peribadi menerima wang daripada 1MDB adakah itu konflik kepentingan dan melanggar Kod Etika Bagi Anggota Pentadbiran dan Ahli Parlimen.

Maziah: Menurut pandangan peribadi saya, ya, namun kerana ia membabitkan menteri dan perdana menteri, saya cadangkan untuk dapatkan pandangan Peguam Negara.

Terdahulu, di awal prosiding, Maziah turut memberitahu mahkamah bahawa Najib tidak pernah menyebut nama ahli perniagaan dalam buruan, Low Taek Jho atau Jho Low pada mesyuarat Kabinet sebagai individu yang membantu beliau mendapatkan sumbangan daripada kerabat diraja Arab Saudi.

"Sekiranya perkara itu dimaklumkan kepada Kabinet, maka, ia akan dicatat dalam minit mesyuarat," katanya.

Tambah Maziah, beliau hanya mendengar dan mengetahui mengenai nama Jho Low selepas timbul isu membabitkan 1MDB.

Najib, 68, menghadapi empat pertuduhan menggunakan kedudukannya untuk memperoleh suapan berjumlah RM2.3 bilion daripada dana 1MDB dan 21 pertuduhan pengubahan wang haram membabitkan jumlah wang yang sama.

Perbicaraan di hadapan Hakim Collin Lawrence Sequerah bersambung Isnin ini.
"""

[7]: # https://www.astroawani.com/berita-malaysia/laporan-audit-1mdb-najib-gagal-gugurkan-sri-ram-daripada-pasukan-pendakwaan-283003
string2 = """
KUALA LUMPUR: Datuk Seri Najib Tun Razak hari ini gagal dalam satu lagi cubaannya untuk menggugurkan bekas Hakim Mahkamah Persekutuan Datuk Seri Gopal Sri Ram daripada mengetuai pasukan pendakwaan dalam kes meminda laporan audit 1Malaysia Development Berhad (1MDB) melibatkan bekas perdana menteri itu.
Ini merupakan cubaan kali ketiga Najib untuk menggugurkan Sri Ram sebagai pendakwa dalam kes jenayah berkaitan 1MDB itu. Sebelum ini, satu permohonan difailkan dalam satu lagi kes 1MDB di hadapan hakim berbeza dan cubaan kedua menerusi prosiding sivil.
Ketika menolak permohonan Ahli Parlimen Pekan itu, Hakim Mahkamah Tinggi Mohamed Zaini Mazlan berkata dakwaan Najib bahawa Sri Ram terlibat dalam siasatan terhadapnya sebagai tidak ada merit.
“Tidak ada bukti kukuh untuk menyokong dakwaan pemohon dan kekal sebagai hipotesis semata-mata. Isu ini telah disiasat oleh pihak pendakwaan selaku responden semasa permohonan pemohon (Najib) yang terdahulu.
“Isu ini sudah dibincangkan dan diputuskan. Keputusan oleh mahkamah lain sebelum ini kekal dan tidak boleh diterbalikkan,” kata hakim.
Mohamed Zaini ketika memberi alasan penolakan berkata pemohon, antara lain, menjadikan komunikasi di antara bekas Peguam Negara Tan Sri Mohamed Apandi Ali dan Sri Ram sebagai bukti berat sebelah Sri Ram terhadap pemohon dan kebimbangan pemohon berhubung perkara itu adalah tidak berasas.
“Seperti juga individu lain, Sri Ram berhak mempunyai pandangan peribadi. Itu sahaja. Namun, pertimbangan berbeza akan dibuat jika beliau menunjukkan sikap berat sebelah semasa melaksanakan tugas sebagai pendakwa raya kanan. Pandangan peribadi beliau tidak boleh dianggap akan menghalang tanggungjawab beliau sebagai pendakwa raya kanan.
“Tambahan pula, kejadian itu, seperti yang dikemukakan oleh responden, berlaku ketika sebelum pelantikan Sri Ram sebagai pendakwa raya kanan. Turut penting ialah pemohon tidak membuat sebarang aduan mengenai tindak-tanduk Sri Ram ketika menjalankan perbicaraan melibatkan pemohon. Ini mengukuhkan hujah responden bahawa Sri Ram bersikap terbuka semasa menjalankan tugas sebagai pendakwa raya kanan,” kata hakim.
Mohamed Zaini seterusnya berkata pada akhirnya, mahkamah bertanggungjawab memastikan sesebuah perbicaraan dilaksanakan secara adil demi mendapatkan keadilan.
“Mahkamah akan membantu mana-mana pihak yang dilayan secara tidak adil, jika perkara tersebut berlaku. Sehubungan itu, permohonan pemohon ditolak,” katanya.
Perbicaraan kes laporan audit 1MDB itu akan bersambung pada 22 Feb ini.
Pada prosiding hari ini, Timbalan Pendakwa Raya Ahmad Akram Gharib bertindak bagi pihak pendakwaan, manakala Najib diwakili peguam Nur Syahirah Hanapiah.
Najib, 67, dan bekas Ketua Pegawai Eksekutif 1MDB Arul Kanda Kandasamy, 45, dibicarakan atas tuduhan meminda laporan audit 1MDB.
Ahli Parlimen Pekan itu dituduh menggunakan kedudukannya untuk mengarahkan pindaan ke atas laporan audit akhir 1MDB sebelum dibentangkan kepada Jawatankuasa Kira-Kira Wang Negara bagi mengelakkan sebarang tindakan diambil terhadapnya, sementara Arul Kanda didakwa bersubahat dengan Najib dalam membuat pindaan ke atas laporan tersebut bagi melindungi Najib daripada dikenakan tindakan.
"""

[8]: # https://miasa.org.my/blog/2018/08/artikel-sokongan-rakan-senasib-dapat-mengurangkan-kemasukan-semula/
string3 = """
3 Ogos 2018 - Kajian terbaharu yang dijalankan oleh University College London (UCL) mendapati khidmat penjagaan kesihatan mental daripada pekerja sokongan (support worker) yang pernah mengharungi sendiri penyakit ini mungkin boleh membantu mengurangkan kebarangkalian pesakit yang baru keluar dari penjagaan kesihatan mental akut (acute mental health care) daripada dimasukkan semula ke unit berkenaan.

Kertas penyelidikan tersebut yang diterbitkan pada hari ini dalam jurnal The Lancet mendapati bilangan pesakit mental yang dimasukkan semula ke unit penjagaan akut dalam tempoh setahun ialah 24% jauh lebih rendah bagi kumpulan pesakit yang ditawarkan pekerja sokongan berbentuk rakan senasib, berbanding dengan kumpulan pesakit yang hanya diberi buku kerja pemulihan peribadi sahaja.

“Pesakit yang dibenarkan keluar (discharge) dari perkhidmatan krisis komuniti sering dimasukkan semula ke unit penjagaan akut. Ini bukan sahaja membantutkan pemulihan, malah menggunapakai sumber yang sepatutnya dikhususkan untuk penambahbaikan jangka panjang fungsi dan kualiti hidup pesakit. Pekerja sokongan daripada golongan rakan senasib berupaya memberikan sokongan dan dorongan yang benar-benar mesra dan berempati kerana ianya datang daripada pengalaman peribadi mereka sendiri, di samping menjadi contoh tauladan (role model) yang baik untuk pemulihan pesakit,” kata pengarang utama, Profesor Sonia Johnson (Psikiatri, UCL).

Di United Kingdom (UK), lebih separuh daripada jumlah pesakit yang dimasukkan ke unit penjagaan akut dimasukkan semula ke unit berkenaan dalam tempoh setahun, namun tiada bukti kukuh yang menjelaskan kaedah untuk mengurangkan jumlah ini.

Sebenarnya, sokongan daripada individu yang pernah mengalami masalah kesihatan mental telah dipraktikkan dalam program seperti Pelaksanaan Pemulihan melalui Perubahan Organisasi NHS (UK) dan juga Pelan Tindakan Pemulihan Kesihatan Amerika Syarikat. Penyelidikan ini merupakan ujian terkawal rawak pertama yang seumpamanya untuk menilai keberkesanan program sokongan rakan senasib. Walaupun begitu, masih banyak penyelidikan yang perlu dijalankan sebelum strategi ini boleh dilaksanakan secara menyeluruh di UK, contohnya untuk memahami sebab-sebab mengapa program ini berkesan.

Pastinya, campur tangan atau intervensi dalam pengurusan kendiri (self-management) mungkin boleh membantu pesakit menguruskan kesihatan mental mereka dengan lebih baik. Dalam kajian ini, penyelidik menyediakan peserta sama ada dengan buku kerja pemulihan peribadi sahaja (220 orang) atau buku kerja berserta bantuan pekerja sokongan rakan senasib (221 orang). Peserta juga dibenarkan meneruskan penjagaan biasa mereka.

Kajian ini dijalankan menerusi enam buah pasukan penyelesaian krisis kesihatan mental (crisis resolution team) di England dan peserta dipilih hanya selepas mereka dibenarkan keluar dari unit krisis oleh pasukan penyelesaian krisis. Peserta kajian terdiri daripada pelbagai diagnosis antaranya termasuk skizofrenia, bipolar, psikosis, kemurungan, kecelaruan keresahan, kecelaruan stres pascatrauma (PTSD), dan kecelaruan personaliti.

Peserta yang menerima sokongan rakan senasib ditawarkan 10 sesi pertemuan mingguan selama satu jam. Dalam sesi pertemuan ini, pekerja sokongan mendengar masalah peserta dan bermatlamat menyemaikan harapan dengan berkongsi kemahiran dan strategi pengurusan penyakit (coping strategies) yang dikuasai sewaktu mereka sendiri berada dalam proses pemulihan. Kesemua pekerja sokongan diberikan latihan terlebih dahulu dalam kemahiran mendengar, kesedaran budaya, pendedahan kendiri, dan kerahsiaan, serta cara menggunakan buku kerja.

Rekod kesihatan peserta dipantau oleh penyelidik selama satu tahun untuk menentukan sama ada peserta dimasukkan semula ke unit penjagaan akut (seperti wad pesakit akut, pasukan penyelesaian krisis, rumah krisis, dan perkhidmatan penjagaan harian akut) atau tidak.

Selepas tamat tempoh satu tahun, kajian mendapati peratusan kemasukan semula ke unit penjagaan akut adalah lebih rendah bagi kumpulan peserta yang menerima intervensi berbanding dengan kelompok kawalan -- dengan 29% peserta dimasukkan semula setelah menerima sokongan rakan senasib, berbanding dengan 38% peserta yang hanya menerima buku kerja.

Kadar penyerapan intervensi juga baik -- 72% daripada peserta yang ditawarkan khidmat pekerja sokongan rakan senasib dan buku kerja menghadiri sekurang-kurangnya tiga sesi pertemuan, manakala satu pertiga menghadiri kesemua 10 sesi pertemuan.

“Kami sedia maklum bahawa banyak pengguna perkhidmatan penjagaan krisis kesihatan mental merasakan perkhidmatan ini terhenti secara mendadak kerana kurangnya penjagaan susulan yang siap sedia. Kajian kami menunjukkan bahawa pekerja sokongan rakan senasib dapat membantu mengisi kelompangan ini dengan membantu pengguna perkhidmatan (pesakit) membangunkan strategi pengurusan kendiri dan pelan pemulihan dengan cara tersendiri dan lebih bermakna, seterusnya dapat membantu pesakit terus bertahan sesudah berlakunya sesuatu krisis,” kata pengarang bersama, Dr Brynmor Lloyd-Evans (Psikiatri, UCL).
"""


Do simple cleaning: remove newlines and weird characters.

[9]: import re
from unidecode import unidecode

def cleaning(string):
    # replace newlines with spaces, transliterate unusual characters to ASCII,
    # then collapse repeated spaces and strip the ends
    return re.sub(r'[ ]+', ' ', unidecode(string.replace('\n', ' '))).strip()

[10]: string = cleaning(string)
string2 = cleaning(string2)
string3 = cleaning(string3)

[11]: len(string.split()), len(string2.split()), len(string3.split())
[11]: (406, 433, 649)

All three passages are far beyond the 256 suggested length of the small and base models, but within the 1024 suggested length of the bigbird models, so expect the shorter models to degrade on these inputs.

9.53.4 Translate using greedy decoder

[12]: from pprint import pprint

[13]: pprint(transformer.greedy_decoder([string]))
['KUALA LUMPUR: The High Court here today said the Cabinet had not raised the '
 "issue of Datuk Seri Najib Razak's conflict in 1Malaysia Development Berhad "
 '(1MDB). According to the Code of Ethics for Administrators and Members of '
 'Parliament it is customary practice for Cabinet members to declare conflicts '
 'of interest if they have been involved in matters discussed at the 1MDB '
 'Cabinet. The matter was announced by the former Deputy Chief Secretary '
 '(Cabinet) of the Constitutional and Liaison Division of the Prime Minister, '
 'Tan Sri Mazidah Abdul Majid, in his statement at the 1MDB hearing. The code '
 'of ethics is also stated that the Cabinet members have personal interests '
 'and conflicts with the government, or involving family members, had to leave '
 'meetings and record their absence. In 1MDB, Najib held three positions, '
 "namely the Prime Minister's Office of Finance and the Prime Minister's "
 "Chairman of the Prime Minister's Office of the Prime Minister, namely Tan "
 'Sri Mazram Abdullah Abdullah Abdullah Shafee Mazidah Mazghani Mazram, who '
 'had been informed in the former cabinet of the matter of the 1MDB.']

[14]: pprint(transformer_small.greedy_decoder([string])) ['KUALA LUMPUR: The High Court here today, said that the Cabinet had never ' "raised the issue of Datuk Seri Najib Razak's interest in 1Malaysia " 'Development Berhad (1MDB). According to the Code of Ethics For Members of ' 'the Administration and Member of Parliament is the practice of Cabinet ' 'members to declare conflict of interest if he had been discussed in matters ' 'discussed at the Cabinet. Article. The matter was informed that former ' 'Secretary-General (Cabinet) of the Cabinet, he said. Tan Sri Mazidah Abdul ' 'Majid, in his statement at the 1MDB trial, in his statement at the 1MDB ' 'case, Najib, the case of the 1MDB, he said. Najib, he had not been informed ' 'that the case of the case, Najib, he said. Najib, he said. Najib, he had not ' 'been informed that the case, he had informed that the case, Najib had ' 'informed that the 1MDB, he had not been informed that the case, he had ' 'informed that the 1MDB, he had informed that the case, he had given that he ' 'had informed that the case, he had informed that the case, he had informed ' 'of the case, Najib, he had given that the case, he had informed that the ' 'case, Najib, he had not been informed that the 1MDB, he had not been ' (continues on next page)


(continued from previous page) 'informed that he had informed that the case, he had given the case, he had ' 'given him that the case, he had not been informed that the case, Najib, he ' 'had informed that he had informed that he had informed that he had been ' 'informed that he had informed that the 1MDB, he had informed of the 1MDB, he ' "had informed that the 1MDB, he had informed that the 1MDB's case, Najib, he " 'had informed of the case. Najib, he had informed that the case. Najib, he ' 'had been that the case, he had not been informed that he had been informed ' 'that he had informed of the case, he had informed that the 1MDB, he had been ' 'informed that the case, he had been informed that the 1MDB, Najib, he had ' 'been that he had informed of the case. Najib, he had informed of the case, ' 'he had been informed of the 1MDB, he had not been informed of the case. ' 'Najib, he had not been informed of the case. Najib, he had not been informed ' 'of the case. Najib, he had been informed of the case, he had been that the ' 'case, he had informed that the 1MDB, he had informed of the case, he had ' 'been informed of the case. Najib, he had been informed of the case. Najib, ' 'he had been informed that the case. Najib, he had not been informed that the ' 'case. Najib, he had informed of the case. Najib, he had informed that the ' 'case. Najib, he had informed that the case. Najib, he had been informed that ' 'he had been informed that the 1MDB, he had informed that the case, in his ' 'case. Najib, in the case. Najib, and Najib, he had informed that the case. ' 'Najib, he had informed that the case, he had informed that the case, he had ' 'informed that the case. Najib, he had informed that the case. Najib, he had ' 'given him that the case, he had not been informed that the case, he had been ' 'that the case, he had not been informed that he had not been informed that ' 'the case, he had informed that he had not been informed that he had informed ' 'that he had been informed that the case, he had been that the case. Najib, ' 'he had informed that he had informed that the case. Najib, he had informed ' 'that he had been informed that he had been informed of the case. Najib, he ' 'had been informed of the case. Najib, he had been informed that he had been ' 'informed that he had been informed that the case. Najib, he had been ' 'informed that the case. Najib, he had been informed that he had repeatedly ' 'informed that the 1MDB, he had been informed that the 1MDB, he had been ' 'informed of the case. Najib, he had informed that the case. Najib, and ' 'Najib, he had informed that the 1MDB, he had informed that the case. Najib, ' 'and Najib, he had not been that the case. Najib, and that he had informed ' 'that the case. Najib, he had informed that he had informed that he had ' 'repeatedly informed that he had informed that the case, he had informed that ' 'the 1MDB, he had been informed that the case of the 1MDB, he had been ' 'informed that he had informed that he had informed that he had informed that ' 'he had not been informed that he had been informed that the case. Najib, and ' "that the 1MDB's case. Najib, he had informed that he had not informed that " 'the case. Najib, and that the case. Najib, he had not been informed that the ' 'case. Najib, he had informed of the case of the case. Najib, he had informed ' 'that the case. Najib, he had not been informed that the case. 
Najib, he had ' 'informed that he had not been informed that he had repeatedly informed of ' 'the 1MDB, he had not been informed that he had informed of the case. Najib, ' 'he had been that the case, he had informed that the case. Najib, and Najib, ' 'he had been informed that the case, he had repeatedly informed that the ' 'case. Najib, he had given the case. Najib, and that the case. Najib, he had ' 'repeatedly, he']

Normal transformer models are not able to infer long texts, so we need to partition the text first, for example by splitting it into sentences.

[15]: partition_string = malaya.text.function.split_into_sentences(string)
      partition_string


[15]: ['KUALA LUMPUR: Mahkamah Tinggi di sini, hari ini, diberitahu Kabinet tidak pernah membangkitkan isu mengenai konflik kepentingan Datuk Seri Najib Razak dalam 1Malaysia Development Berhad (1MDB).',
 'Menurut Kod Etika Bagi Anggota Pentadbiran dan Ahli Parlimen adalah menjadi amalan ahli Kabinet untuk mengisytiharkan konflik kepentingan sekiranya mempunyai pembabitan di dalam hal yang dibincangkan dalam Mesyuarat Jemaah Menteri.',
 'Perkara itu dimaklumkan bekas Timbalan Ketua Setiausaha (Kabinet) Bahagian Kabinet Perlembagaan dan Perhubungan Jabatan Perdana Menteri (JPM), Tan Sri Mazidah Abdul Majid, dalam keterangannya pada perbicaraan kes 1MDB yang dihadapi Najib.',
 'Kod etika itu antara lain turut menyatakan bahawa ahli Kabinet yang mempunyai kepentingan peribadi dan bercanggah dengan kepentingan kerajaan, atau membabitkan ahli keluarga, perlu meninggalkan mesyuarat dan merekodkan ketidakhadiran mereka.',
 'Di dalam 1MDB, Najib memegang tiga jawatan iaitu Perdana Menteri, Menteri Kewangan dan Pengerusi Pengerusi Badan Penasihat 1MDB Menjawab soalan peguam Tan Sri Muhammad Shafee Abdullah, sama ada terdapat ahli Kabinet yang membangkitkan isu bahawa Menteri Kewangan tidak seharusnya membabitkan diri dalam perbincangan itu kerana konflik kepentingan, Maziah menjawab: "Tiada".',
 'Muhammad Shafee: Ada sesiapa yang membangkitkan perkara berhubung hal 1MBD dengan Najib?',
 'Maziah: Tidak Muhammad Shafee: Timbalan Perdana Menteri (ketika itu) adalah Tan Sri Muhyiddin Yassin, manakala Datuk Seri Ahmad Husni Hanadzlah adalah bekas Menteri Kewangan II.',
 'Mereka juga tidak pernah membangkitkan hal konflik kepentingan?',
 'Mazidah: Ya, tidak pernah.',
 'Sementara itu, menjawab soalan Timbalan Pendakwa Raya, Ahmad Akram Gharib sama ada beliau mengetahui bahawa Najib mempunyai kepentingan peribadi dalam 1MDB, Mazidah berkata: "Tidak" Ahmad Akram: Adakah anda mengetahui bahawa Najib secara peribadi menerima duit daripada 1MDB?',
 'Maziah: Tidak Ahmad Akram: Sekiranya Najib secara peribadi menerima wang daripada 1MDB adakah itu konflik kepentingan dan melanggar Kod Etika Bagi Anggota Pentadbiran dan Ahli Parlimen.',
 'Maziah: Menurut pandangan peribadi saya, ya, namun kerana ia membabitkan menteri dan perdana menteri, saya cadangkan untuk dapatkan pandangan Peguam Negara.',
 'Terdahulu, di awal prosiding, Maziah turut memberitahu mahkamah bahawa Najib tidak pernah menyebut nama ahli perniagaan dalam buruan, Low Taek Jho atau Jho Low pada mesyuarat Kabinet sebagai individu yang membantu beliau mendapatkan sumbangan daripada kerabat diraja Arab Saudi.',
 '"Sekiranya perkara itu dimaklumkan kepada Kabinet, maka, ia akan dicatat dalam minit mesyuarat," katanya.',
 'Tambah Maziah, beliau hanya mendengar dan mengetahui mengenai nama Jho Low selepas timbul isu membabitkan 1MDB.',
 'Najib, 68, menghadapi empat pertuduhan menggunakan kedudukannya untuk memperoleh suapan berjumlah RM2.3 bilion daripada dana 1MDB dan 21 pertuduhan pengubahan wang haram membabitkan jumlah wang yang sama.',
 'Perbicaraan di hadapan Hakim Collin Lawrence Sequerah bersambung Isnin ini.']

[16]: pprint(transformer.greedy_decoder(partition_string)) ['KUALA LUMPUR: The High Court here today said the Cabinet had never raised ' "the issue of Datuk Seri Najib Razak's conflict of interest in 1Malaysia " 'Development Berhad (1MDB).', 'According to the Code of Ethics for Administrative Members and Members of ' 'Parliament it is the practice of Cabinet members to declare conflicts of ' 'interest if they have any involvement in the matter discussed at the Cabinet ' 'Meeting.', 'The matter was informed by former Deputy Secretary-General (Cabinet) of the ' "Constitutional and Liaison Division of the Prime Minister's Department " (continues on next page)


(continued from previous page) '(JPM), Tan Sri Mazidah Abdul Majid, in his testimony at the 1MDB trial ' 'hearing facing Najib.', 'The code of ethics, among others, states that Cabinet members who have ' 'personal interests and conflict with the interests of the government, or ' 'involve family members, must leave meetings and record their absence.', 'In 1MDB, Najib held three positions namely Prime Minister, Finance Minister ' 'and Chairman of 1MDB Advisory Board Answering the question of lawyer Tan Sri ' 'Muhammad Shafee Abdullah, whether there were Cabinet members who raised the ' 'issue that the Minister of Finance should not be involved in the discussion ' 'because of conflict of interest, Maziah replied: "None".', 'Muhammad Shafee: Is there anyone raising questions about 1MBD with Najib?', 'Maziah: Muhammad Shafee: Deputy Prime Minister (then) was Tan Sri Muhyiddin ' 'Yassin, while Datuk Seri Ahmad Husni Hanadzlah was the former Minister of ' 'Finance II.', 'They have never raised any issues of conflict of interest?', 'Mazidah: Yes, never.', "Meanwhile, responding to Deputy Public Prosecutor Ahmad Akram Gharib's " 'question as to whether he knew Najib had personal interests in 1MDB, Mazidah ' 'said: "No" Ahmad Akram: Do you know that Najib personally received money ' 'from 1MDB?', 'Maziah: No Ahmad Akram: If Najib personally receives money from 1MDB whether ' 'it is a conflict of interest and violates the Code of Ethics for ' 'Administrative Members and Members of Parliament.', 'Maziah: In my personal opinion, yes, but because it involves ministers and ' "prime ministers, I suggest to get the Attorney General's views.", 'Earlier, at the beginning of the proceedings, Maziah also told the court ' 'that Najib had never mentioned the name of the businessman in the game, Low ' 'Taek Jho or Jho Low at a Cabinet meeting as an individual who helped him get ' "donations from Saudi Arabia's royal family.", '"If the matter is communicated to the Cabinet, then it will be recorded at ' 'the minutes of the meeting," he said.', "Maziah added that he only heard and knew about Jho Low's name after the 1MDB " 'issue.', 'Najib, 68, faces four charges of using his position to obtain RM2.3 billion ' 'in bribes from 1MDB funds and 21 counts of money laundering involving the ' 'same amount.', 'The trial before Judge Collin Lawrence Sequerah continues this Monday.']

[17]: pprint(transformer_small.greedy_decoder(partition_string)) ['KUALA LUMPUR: The High Court here today said the Cabinet had never raised ' 'the issue of conflict of interest between Datuk Seri Najib Razak in ' '1Malaysia Development Berhad (1MDB).', 'According to the Code of Ethics For Members of the Administration and MPs is ' 'the practice of Cabinet members to declare conflicts of interest if they ' 'have involvement in matters discussed at the Cabinet Meeting.', 'The matter was informed by former Deputy Secretary-General (Cabinet) of the ' "Cabinet of the Constitution and the Relations of the Prime Minister's " 'Department (JPM), Tan Sri Mazidah Abdul Majid, in his statement at the 1MDB ' 'case trial against Najib.', 'The code of ethics also states that Cabinet members with personal interests ' 'and conflict with the interests of the government, or involving family ' 'members, need to leave the meeting and record their absence.', 'In 1MDB, Najib held three positions, Prime Minister, Finance Minister and ' 'Chairman of the 1MDB Advisory Agency Responding to lawyer Tan Sri Muhammad ' 'Shafee Abdullah, whether Cabinet members raised the issue that the Minister ' 'of Finance should not involve the discussion because of the conflict of ' (continues on next page)


(continued from previous page) 'interest, Maziah replied: "No".', 'Muhammad Shafee: Someone who raises things about 1MBD with Najib?', 'Maziah: Muhammad Shafee: Deputy Prime Minister (then) was Tan Sri Muhyiddin ' 'Yassin, while Datuk Seri Ahmad Husni Hanadzlah was the former Minister of ' 'Finance II.', 'They have never raised conflicts of interest?', 'Mazidah: Yes, never.', "Meanwhile, responding to Deputy Public Prosecutor Ahmad Akram Gharib's " 'question whether he knew that Najib had personal interests in 1MDB, Mazidah ' 'said: "No" Ahmad Akram: Do you know that Najib personally receives money ' 'from 1MDB?', 'Maziah: Ahmad Akram: If Najib personally receives money from 1MDB whether it ' 'is a conflict of interest and violates the Code of Ethics for Members of the ' 'Administration and Member of Parliament.', 'Maziah: According to my personal view, yes, but because it involves ' 'ministers and prime ministers, I propose to get the views of the Attorney ' 'General.', 'Earlier, in the early proceedings, Maziah also told the court that Najib had ' 'never mentioned the name of a businessman in the hunt, Low Taek Jho or Jho ' 'Low at a Cabinet meeting as an individual who helped him get a donation from ' "Saudi Arabia's royal family.", '"If the matter is communicated to the Cabinet, then it will be recorded in ' 'the minutes of the meeting," he said.', "Maziah added that he only heard and learned about Jho Low's name after the " 'issue of involving 1MDB.', 'Najib, 68, faces four charges of using his position to obtain a RM2.3 ' 'billion bribe from 1MDB funds and 21 counts of money laundering involving ' 'the same amount of money.', 'The trial before Judge Collin Lawrence Sequerah continues this Monday.']
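To get a single translated paragraph back, the per-sentence outputs can be joined again. Below is a minimal sketch reusing the split_into_sentences helper and greedy_decoder shown above; the translate_long_text wrapper is our own illustration, not part of the Malaya API:

[ ]: def translate_long_text(text, model):
         # partition the long text into sentences the model can handle
         sentences = malaya.text.function.split_into_sentences(text)
         # translate all sentences in one batched call
         translated = model.greedy_decoder(sentences)
         # join the per-sentence translations back into a single paragraph
         return ' '.join(translated)

     translate_long_text(string, transformer)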

9.53.5 Problem with partitioning

The problem with partitioning is that the model assumes element N + 1 has no relationship with element N, and vice versa: the attention mechanism cannot attend across partitions, so any context that spans sentence boundaries is lost. We introduced BigBird, whose sparse attention scales to much longer inputs, to solve this problem.
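For reference, the BigBird models used in the next cells were loaded earlier in this notebook. A minimal sketch of that loading step, assuming the malaya.translation.ms_en.transformer API with 'bigbird' and 'small-bigbird' model names (verify against malaya.translation.ms_en.available_transformer()):

[ ]: # load the BigBird MS-to-EN models; model names here are assumptions,
     # check malaya.translation.ms_en.available_transformer() for the exact list
     bigbird_small = malaya.translation.ms_en.transformer(model = 'small-bigbird')
     bigbird = malaya.translation.ms_en.transformer(model = 'bigbird')
     # quantized variants trade a little accuracy for smaller size and faster inference
     quantized_bigbird_small = malaya.translation.ms_en.transformer(model = 'small-bigbird', quantized = True)
     quantized_bigbird = malaya.translation.ms_en.transformer(model = 'bigbird', quantized = True)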

[18]: pprint(bigbird_small.greedy_decoder([string])) ['KUALA LUMPUR: The High Court here today, was told the Cabinet never raised ' 'issues about the conflict of interest of Datuk Seri Najib Razak in 1Malaysia ' 'Development Berhad (1MDB). According to the Code of Ethics for Members of ' 'Administration and Member of Parliament is the practice of Cabinet members ' 'to declare conflicts of interest if they have involvement in matters ' 'discussed in the Cabinet Meeting. The matter was informed by the former ' 'Deputy Secretary General (Cabinet) of the Cabinet of the Constitution and ' "Relations of the Prime Minister's Department (JPM), Tan Sri Mazidah Abdul " 'Majid, in his statement at the 1MDB case trial faced by Najib. The code of ' 'ethics, among others, also stated that Cabinet members who have personal ' 'interests and contradict the interests of the government, or involving ' 'family members, should leave the meeting and record their absence. In 1MDB, ' 'Najib holds three positions, namely the Prime Minister, Finance Minister and ' 'Chairman of the 1MDB Advisory Body Answering a question from lawyer Tan Sri ' 'Muhammad Shafee Abdullah, whether there are Cabinet members who raise the ' 'issue that the Minister of Finance should not engage in the discussion ' 'because conflicts of interest, Maziah replied: "No." Muhammad Shafee: ' (continues on next page)


(continued from previous page) 'Someone raised matters regarding 1MBD with Najib? Maziah: Not Muhammad ' 'Shafee: Deputy Prime Minister (then) is Tan Sri Muhyiddin Yassin, while ' 'Datuk Seri Ahmad Husni Hanadzlah is a former Finance Minister II. They also ' 'never raised matters of interest conflict? Mazidah: Yes, never. Meanwhile, ' 'answering a question from Deputy Public Prosecutor Ahmad Akram Gharib ' 'whether he knew that Najib had personal interest in 1MDB, Mazidah said: "No" ' 'Ahmad Akram: Do you know that Najib personally received money from 1MDB? ' 'Maziah: No Ahmad Akram: If Najib personally receives money from 1MDB is it a ' 'conflict of interest and violates the Code of Ethics for Members of ' 'Administration and Member of Parliament. Maziah: According to my personal ' 'opinion, yes, but because it involves ministers and prime minister, I ' 'suggest to get the views of the Attorney General. Earlier, at the beginning ' 'of the proceedings, Maziah also told the court that Najib had never ' 'mentioned the name of a businessman in the hunt, Low Taek Jho or Jho Low at ' 'a Cabinet meeting as an individual who helped him get donations from the ' 'royal family of Saudi Arabia. "If the matter was informed to the Cabinet, ' 'then, it will be recorded in the minutes of the meeting," he said. Maziah ' "added that he only heard and found out about Jho Low's name after the issue " 'involving 1MDB. Najib, 68, faced four charges of using his position to ' 'obtain bribes amounting to RM2.3 billion from 1MDB funds and 21 charges of ' 'money laundering involving the same amount of money. The trial before Judge ' 'Collin Lawrence Sequerah continues this Monday.']

[19]: pprint(quantized_bigbird_small.greedy_decoder([string])) ['KUALA LUMPUR: The High Court here today, was told the Cabinet never raised ' 'issues about the conflict of interest of Datuk Seri Najib Razak in 1Malaysia ' 'Development Berhad (1MDB). According to the Code of Ethics for Members of ' 'Administration and Member of Parliament is the practice of Cabinet members ' 'to declare conflicts of interest if they have involvement in matters ' 'discussed in the Cabinet Meeting. The matter was informed by the former ' 'Deputy Secretary General (Cabinet) of the Cabinet of the Constitution and ' "Relations of the Prime Minister's Department (JPM), Tan Sri Mazidah Abdul " 'Majid, in his statement at the 1MDB case trial faced by Najib. The code of ' 'ethics, among others, also stated that Cabinet members who have personal ' 'interests and contradict the interests of the government, or involve family ' 'members, should leave the meeting and record their absence. In 1MDB, Najib ' 'holds three positions, namely the Prime Minister, Finance Minister and ' 'Chairman of the 1MDB Advisory Body Answering a question from lawyer Tan Sri ' 'Muhammad Shafee Abdullah, whether there are Cabinet members who raise the ' 'issue that the Minister of Finance should not engage in the discussion ' 'because conflicts of interest, Maziah replied: "No." Muhammad Shafee: ' 'Someone raised matters regarding 1MBD with Najib? Maziah: Not Muhammad ' 'Shafee: Deputy Prime Minister (then) is Tan Sri Muhyiddin Yassin, while ' 'Datuk Seri Ahmad Husni Hanadzlah is a former Finance Minister II. They also ' 'never raised matters of interest conflict? Mazidah: Yes, never. Meanwhile, ' 'answering the question of Deputy Public Prosecutor, Ahmad Akram Gharib ' 'whether he knew that Najib had personal interest in 1MDB, Mazidah said: "No" ' 'Ahmad Akram: Do you know that Najib personally received money from 1MDB? ' 'Maziah: Not Ahmad Akram: If Najib personally receives money from 1MDB is it ' 'a conflict of interest and violates the Code of Ethics for Members of ' 'Administration and Member of Parliament. Maziah: According to my personal ' 'opinion, yes, but because it involves ministers and prime minister, I ' 'suggest to get the views of the Attorney General. Earlier, at the beginning ' 'of the proceedings, Maziah also told the court that Najib never mentioned ' 'the name of a businessman in the hunt, Low Taek Jho or Jho Low at a Cabinet ' 'meeting as an individual who helped him get donations from the royal family ' (continues on next page)


(continued from previous page) 'of Saudi Arabia. "If the matter was informed to the Cabinet, then, it will ' 'be recorded in the minutes of the meeting," he said. Maziah added that he ' "only heard and found out about Jho Low's name after the issue involving " '1MDB. Najib, 68, faced four charges of using his position to obtain bribes ' 'amounting to RM2.3 billion from 1MDB funds and 21 charges of money ' 'laundering involving the same amount of money. The trial before Judge Collin ' 'Lawrence Sequerah continues this Monday.']

[20]: pprint(bigbird.greedy_decoder([string])) ['KUALA LUMPUR: The High Court here today was told that the Cabinet had never ' 'raised the issue of the conflict of interest of Datuk Seri Najib Razak in ' '1Malaysia Development Berhad (1MDB). According to the Code of Ethics for ' 'Administrative Members and Members of Parliament, it is the practice of ' 'Cabinet members to declare conflicts of interest if they have involvement in ' 'matters discussed in the Cabinet Meeting. The matter was informed by the ' 'former Deputy Secretary General (Cabinet) of the Cabinet Division of the ' "Constitution and the Relations of the Prime Minister's Department (JPM), Tan " 'Sri Mazidah Abdul Majid, in his testimony at the 1MDB case trial faced by ' 'Najib. The code of ethics, among others, also states that Cabinet members ' 'who have personal interests and are contrary to the interests of the ' 'government, or involving family members, have to leave the meeting and ' 'record their absence. In 1MDB, Najib held three positions, namely the Prime ' 'Minister, Finance Minister and Chairman of the Chairman of the 1MDB Advisory ' 'Body, answered a question from lawyer Tan Sri Muhammad Shafee Abdullah, ' 'whether there were Cabinet members who raised the issue that the Minister of ' 'Finance should not be involved in the discussion due to conflicts of ' 'interest, Maziah replied: "No." Muhammad Shafee: There are anyone who raised ' 'the matter of 1MBD with Najib? Maziah: No Muhammad Shafee: The Deputy Prime ' 'Minister (then) is Tan Sri Muhyiddin Yassin, while Datuk Seri Ahmad Husni ' 'Hanadzlah is the former Minister of Finance II. They have never raised the ' 'conflict of interest? Mazidah: Yes, never. Meanwhile, answering a question ' 'from Deputy Public Prosecutor Ahmad Akram Gharib whether he knew that Najib ' 'had a personal interest in 1MDB, Mazidah said: "No" Ahmad Akram: Do you know ' 'that Najib personally receives money from 1MDB? Maziah: No Ahmad Akram: If ' 'Najib personally receives money from 1MDB is that a conflict of interest and ' 'violates the Code of Ethics for Administrative Members and Members of ' 'Parliament. Maziah: According to my personal view, yes, but because it ' 'involves ministers and prime ministers, I propose to get the views of the ' 'Attorney General. Earlier, at the beginning of the proceedings, Maziah also ' 'told the court that Najib had never mentioned the name of a businessman in ' 'the hunt, Low Taek Jho or Jho Low at a Cabinet meeting as an individual who ' 'helped him get donations from the royal family of Saudi Arabia. "If the ' 'matter is informed to the Cabinet, then, it will be recorded in the minutes ' 'of the meeting," he said. Maziah added that he only heard and knew about Jho ' "Low's name after an issue involving 1MDB. Najib, 68, faces four charges of " 'using his position to obtain bribes amounting to RM2.3 billion from 1MDB ' 'funds and 21 charges of money laundering involving the same amount of money. ' 'The trial before Judge Collin Lawrence Sequerah continues this Monday.']

[21]: pprint(quantized_bigbird.greedy_decoder([string])) ['KUALA LUMPUR: The High Court here today was told that the Cabinet had never ' 'raised the issue of the conflict of interest of Datuk Seri Najib Razak in ' '1Malaysia Development Berhad (1MDB). According to the Code of Ethics for ' 'Administrative Members and Members of Parliament, it is the practice of ' 'Cabinet members to declare conflicts of interest if they have involvement in ' (continues on next page)


(continued from previous page) 'matters discussed in the Cabinet Meeting. The matter was informed by the ' 'former Deputy Secretary General (Cabinet) of the Cabinet Division of the ' "Constitution and the Relations of the Prime Minister's Department (JPM), Tan " 'Sri Mazidah Abdul Majid, in his testimony at the 1MDB case trial faced by ' 'Najib. The code of ethics, among others, also states that Cabinet members ' 'who have personal interests and are contrary to the interests of the ' 'government, or involving family members, have to leave the meeting and ' 'record their absence. In 1MDB, Najib held three positions, namely the Prime ' 'Minister, Finance Minister and Chairman of the Chairman of the 1MDB Advisory ' 'Body, answered a question from lawyer Tan Sri Muhammad Shafee Abdullah, ' 'whether there were Cabinet members who raised the issue that the Minister of ' 'Finance should not be involved in the discussion due to conflicts of ' 'interest, Maziah replied: "No." Muhammad Shafee: There are anyone who raised ' 'the matter of 1MBD with Najib? Maziah: No Muhammad Shafee: The Deputy Prime ' 'Minister (then) is Tan Sri Muhyiddin Yassin, while Datuk Seri Ahmad Husni ' 'Hanadzlah is the former Minister of Finance II. They have never raised the ' 'conflict of interest? Mazidah: Yes, never. Meanwhile, answering a question ' 'from the Deputy Public Prosecutor, Ahmad Akram Gharib whether he knew that ' 'Najib had a personal interest in 1MDB, Mazidah said: "No" Ahmad Akram: Do ' 'you know that Najib personally receives money from 1MDB? Maziah: No Ahmad ' 'Akram: If Najib personally receives money from 1MDB is that there is a ' 'conflict of interest and violates the Code of Ethics for Administrative ' 'Members and Members of Parliament. Maziah: According to my personal view, ' 'yes, but because it involves ministers and prime ministers, I propose to get ' 'the views of the Attorney General. Earlier, at the beginning of the ' 'proceedings, Maziah also told the court that Najib had never mentioned the ' 'name of a businessman in the hunt, Low Taek Jho or Jho Low at a Cabinet ' 'meeting as an individual who helped him get donations from the royal family ' 'of Saudi Arabia. "If the matter is informed to the Cabinet, then, it will be ' 'recorded in the minutes of the meeting," he said. Maziah added that he only ' "heard and knew about Jho Low's name after an issue involving 1MDB. Najib, " '68, faces four charges of using his position to obtain bribes amounting to ' 'RM2.3 billion from 1MDB funds and 21 charges of money laundering involving ' 'the same amount of money. The trial before Judge Collin Lawrence Sequerah ' 'continues this Monday.']

[22]: pprint(bigbird_small.greedy_decoder([string2])) ['KUALA LUMPUR: Datuk Seri Najib Tun Razak today failed in another attempt to ' 'drop former Federal Court Judge Datuk Seri Gopal Sri Ram from leading the ' 'prosecution team in the case of amending the 1Malaysia Development Berhad ' "(1MDB) audit report involving the former prime minister. This is Najib's " 'third attempt to drop Sri Ram as prosecutor in the 1MDB-related criminal ' 'case. Earlier, an application was filed in another 1MDB case before a ' 'different judge and a second attempt through civil proceedings. While ' 'rejecting the application of the Pekan Member of Parliament, High Court ' "Judge Mohamed Zaini Mazlan said Najib's allegation that Sri Ram was involved " 'in the investigation into him as there was no merit. "There is no solid ' "evidence to support the applicant's allegation and remain a mere hypothesis. " 'This issue has been investigated by the prosecution as the previous ' 'respondent during the application of the applicant (Najib)." This issue has ' 'been discussed and decided. The decision by other courts previously remained ' 'and could not be reversed, "said the judge. Mohamed Zaini when giving ' 'reasons for rejection said applicants, among others, made communication ' 'between former Attorney General Tan Sri Mohamed Apandi Ali and Sri Ram as ' "evidence of Sri Ram's biased side of the applicant and the applicant's " 'concern regarding the matter was baseless. "As well as other individuals, ' (continues on next page)


(continued from previous page) "Sri Ram has the right to have personal views. That's all. However, different " 'considerations will be made if he shows a biased attitude while performing ' 'his duties as a senior prosecutor. His personal view cannot be considered to ' 'prevent his responsibilities as a senior prosecutor." Furthermore, the ' 'incident, as presented by the respondent, happened when before the ' 'appointment of Sri Ram as a senior prosecutor. Also important is that the ' "applicant did not make any complaint about Sri Ram's actions while " 'conducting a trial involving the applicant. This strengthened the ' "respondents' argument that Sri Ram was open while carrying out his duties as " 'a senior prosecutor, "said the judge. Mohamed Zaini further said that in the ' 'end, the court was responsible for ensuring that a trial was implemented ' 'fairly for justice." The court will help any party treated unfairly, if the ' "matter happened. The relationship, the applicant's application was rejected, " '"he said. The trial of the 1MDB audit report case will continue on Feb 22. ' "At today's proceedings, Deputy Public Prosecutor Ahmad Akram Gharib acted on " 'behalf of the prosecution, while Najib was represented by lawyer Nur ' 'Syahirah Hanapiah. Najib, 67, and former 1MDB Chief Executive Officer Arul ' 'Kanda Kandasamy, 45, were tried on charges of amending the 1MDB audit ' 'report. The Pekan Member of Parliament was accused of using his position to ' 'direct amendments to the 1MDB final audit report before being tabled to the ' 'Public Accounts Committee to prevent any action from being taken against ' 'him, while Arul Kanda was charged with abetting Najib in making amendments ' 'to the report to protect Najib from being prosecuted.']

[23]: pprint(quantized_bigbird_small.greedy_decoder([string2])) ['KUALA LUMPUR: Datuk Seri Najib Tun Razak today failed in another attempt to ' 'drop former Federal Court Judge Datuk Seri Gopal Sri Ram from leading the ' 'prosecution team in the case of amending the 1Malaysia Development Berhad ' "(1MDB) audit report involving the former prime minister. This is Najib's " 'third attempt to drop Sri Ram as prosecutor in the 1MDB-related criminal ' 'case. Earlier, an application was filed in another 1MDB case before a ' 'different judge and a second attempt through civil proceedings. While ' 'rejecting the application of the Pekan Member of Parliament, High Court ' "Judge Mohamed Zaini Mazlan said Najib's allegation that Sri Ram was involved " 'in the investigation into him as there is no merit. "There is no solid ' "evidence to support the applicant's allegation and remain a mere hypothesis. " 'This issue has been investigated by the prosecution as the previous ' 'respondent during the application of the applicant (Najib)." This issue has ' 'been discussed and decided. The decision by other courts previously remained ' 'and could not be reversed, "said the judge. Mohamed Zaini when giving ' 'reasons for rejection said applicants, among others, made communication ' 'between former Attorney General Tan Sri Mohamed Apandi Ali and Sri Ram as ' "evidence of Sri Ram's biased side of the applicant and the applicant's " 'concern regarding the matter was baseless. "As well as other individuals, ' 'Sri Ram has the right to have a personal view. That alone. However, ' 'different considerations will be made if he shows a biased attitude while ' 'performing his duties as a senior prosecutor. His personal view cannot be ' 'considered to prevent his responsibility as a senior prosecutor." ' 'Furthermore, the incident, as presented by the respondent, happened when ' 'before the appointment of Sri Ram as a senior prosecutor. Also important is ' "that the applicant did not make any complaint about Sri Ram's actions while " "conducting trial involving the applicant. This strengthened the respondents' " 'argument that Sri Ram was open while carrying out his duties as a senior ' 'prosecutor, "said the judge. Mohamed Zaini further said that in the end, the ' 'court was responsible for ensuring that a trial was implemented fairly for ' 'justice." The court will help any party treated unfairly, if the matter ' (continues on next page)


(continued from previous page) 'happened. The connection, the applicant\'s application was rejected, "he ' 'said. The trial of the 1MDB audit report case will continue on Feb 22. At ' "today's proceedings, Deputy Public Prosecutor Ahmad Akram Gharib acted on " 'behalf of the prosecution, while Najib was represented by lawyer Nur ' 'Syahirah Hanapiah. Najib, 67, and former 1MDB Chief Executive Officer Arul ' 'Kanda Kandasamy, 45, were tried on charges of amending the 1MDB audit ' 'report. The Pekan Member of Parliament was accused of using his position to ' 'direct amendments to the 1MDB final audit report before being tabled to the ' 'Public Accounts Committee to prevent any action from being taken against ' 'him, while Arul Kanda was charged with abetting Najib in making amendments ' 'to the report to protect Najib from being prosecuted.']

[24]: pprint(bigbird.greedy_decoder([string2])) ['KUALA LUMPUR: Datuk Seri Najib Tun Razak today failed in another attempt to ' 'drop former Federal Court Judge Datuk Seri Gopal Sri Ram from leading the ' 'prosecution team in the case of amending the 1Malaysia Development Berhad ' "(1MDB) audit report involving the former prime minister. This is Najib's " 'third attempt to drop Sri Ram as a prosecution in the 1MDB-related criminal ' 'case. Previously, an application was filed in another 1MDB case before a ' 'different judge and a second attempt through civil proceedings. While ' "rejecting the Pekan Member of Parliament's application, High Court Judge " "Mohamed Zaini Mazlan said Najib's allegation that Sri Ram was involved in " 'his investigation into him as no merit. "There is no solid evidence to ' "support the applicant's allegations and remain a mere hypothesis. This issue " 'has been investigated by the prosecution as the respondent during the ' 'application of the applicant (Najib) earlier." This issue has been discussed ' 'and decided. The decision by other courts previously remains and cannot be ' 'reversed, "said the judge. Mohamed Zaini when giving the reason for refusing ' 'said the applicant, among others, made communication between former ' 'Attorney-General Tan Sri Mohamed Apandi Ali and Sri Ram as proof of Sri ' "Ram's bias towards the applicant and the applicant's concern over the matter " 'was baseless. "Like other individuals, Sri Ram has the right to have a ' "personal view. That's all. However, different considerations will be made if " 'he shows a biased attitude while performing his duties as a senior ' 'prosecutor. His personal views should not be considered to hinder his ' 'responsibility as a senior prosecutor. "Furthermore, the incident, as ' 'submitted by the respondent, occurred before the appointment of Sri Ram as ' 'senior prosecutor. It is also important that the applicant did not make any ' "complaints about Sri Ram's actions during the trial involving the applicant. " "This strengthens the respondents' argument that Sri Ram was open while " 'carrying out his duties as senior prosecutor," said the judge. Mohamed Zaini ' 'further said that in the end, the court was responsible for ensuring that a ' 'trial was held fairly in order to obtain justice. "The court will assist any ' 'party treated unfairly, if the matter happens. In relation to that, the ' 'applicant\'s application is rejected," he said. The trial of the 1MDB audit ' "report case will continue on Feb 22. In today's proceedings, Deputy Public " 'Prosecutor Ahmad Akram Gharib acted on behalf of the prosecution, while ' 'Najib was represented by lawyer Nur Syahirah Hanapiah. Najib, 67, and former ' '1MDB Chief Executive Officer Arul Kanda Kandasamy, 45, were tried on charges ' 'of amending the 1MDB audit report. The Pekan Member of Parliament was ' 'accused of using his position to order amendments to the 1MDB final audit ' 'report before being tabled to the Public Accounts Committee to prevent any ' 'action being taken against him, while Arul Kanda was accused of abetting ' 'Najib in amending the report to protect Najib from being prosecuted.']


[25]: pprint(quantized_bigbird.greedy_decoder([string2])) ['KUALA LUMPUR: Datuk Seri Najib Tun Razak today failed in another attempt to ' 'drop former Federal Court Judge Datuk Seri Gopal Sri Ram from leading the ' 'prosecution team in the case of amending the 1Malaysia Development Berhad ' "(1MDB) audit report involving the former prime minister. This is Najib's " 'third attempt to drop Sri Ram as a prosecution in the 1MDB-related criminal ' 'case. Previously, an application was filed in another 1MDB case before a ' 'different judge and a second attempt through civil proceedings. While ' "rejecting the Pekan Member of Parliament's application, High Court Judge " "Mohamed Zaini Mazlan said Najib's allegation that Sri Ram was involved in an " 'investigation into him as no merit. "There is no solid evidence to support ' "the applicant's allegations and remain a mere hypothesis. This issue has " 'been investigated by the prosecution as the respondent during the applicant ' '(Najib) application." This issue has been discussed and decided. The ' 'decision by other courts has previously remained and cannot be reversed, ' '"said the judge. Mohamed Zaini when giving the reason for his rejection said ' 'the applicant, among others, made communication between former ' 'Attorney-General Tan Sri Mohamed Apandi Ali and Sri Ram as proof of Sri ' "Ram's bias towards the applicant and the applicant's concern over the matter " 'was baseless. "Like other individuals, Sri Ram has the right to have a ' "personal view. That's all. However, different considerations will be made if " 'he shows a biased attitude while performing his duties as a senior ' 'prosecutor. His personal views should not be considered to hinder his ' 'responsibility as a senior prosecutor. "Furthermore, the incident, as ' 'submitted by the respondent, occurred before the appointment of Sri Ram as ' 'senior prosecutor. It is also important that the applicant did not make any ' "complaints about Sri Ram's actions during the trial involving the applicant. " "This strengthens the respondents' argument that Sri Ram is open while " 'carrying out his duties as senior prosecutor," said the judge. Mohamed Zaini ' 'further said that in the end, the court was responsible for ensuring that a ' 'trial was held fairly in order to obtain justice. "The court will assist any ' 'party treated unfairly, if the matter happens. In relation to that, the ' 'applicant\'s application is rejected," he said. The trial of the 1MDB audit ' "report case will continue on Feb 22. At today's proceedings, Deputy Public " 'Prosecutor Ahmad Akram Gharib acted on behalf of the prosecution, while ' 'Najib was represented by lawyer Nur Syahirah Hanapiah. Najib, 67, and former ' '1MDB Chief Executive Officer Arul Kanda Kandasamy, 45, were tried on charges ' 'of amending the 1MDB audit report. The Pekan Member of Parliament was ' 'accused of using his position to order amendments to the 1MDB final audit ' 'report before being tabled to the Public Accounts Committee to prevent any ' 'action being taken against him, while Arul Kanda was charged with abetting ' "Najib in amending Najib's report to protect Najib from being prosecuted."]

[21]: pprint(bigbird_small.greedy_decoder([string3])) ['3 August 2018 - The latest study conducted by University College London ' '(UCL) found that mental healthcare services from support workers who have ' 'been facing the disease themselves may help reduce the probability of ' 'patients who have just left out of acute mental health care from ' 're-introduced to the unit. The research paper published today in the journal ' 'The Lancet found that the number of mental patients re-induced into acute ' 'care units within a year was 24% much lower for the group of patients ' 'offered by support workers in the form of senasib friends, compared to ' 'groups of patients who are only given personal rehabilitation work books. ' '"The pain allowed to go out (discharge) from community crisis services is ' 'often re-entered into acute care units. This not only helps recovery, but ' 'also uses resources that should be reserved for long-term improvement of ' (continues on next page)


(continued from previous page) "patients' functions and quality of life. Support workers from the group of " 'senasib friends are able to provide truly friendly and empatth support and ' 'encouragement because they come from their own personal experience, in ' 'addition to being a good example of role model for patient rehabilitation, ' '"said the main author, Professor Sonia Johnson (Psychiiatri, UCL). In the ' 'United Kingdom (UK), more than half of the number of patients included in ' 'the acute care unit is re-entered into the unit within a year, but there is ' 'no solid evidence to explain the method to reduce this number. In fact, ' 'support from individuals who have experienced mental health problems has ' 'been practiced in programs such as the Implementation of Rehabilitation ' 'through NHS Organizational Change (UK) as well as the United States Health ' 'Rehabilitation Action Plan. This research is the first random controlled ' 'test to evaluate the effectiveness of senasib partner support programs. ' 'However, there are still many research that needs to be carried out before ' 'this strategy can be implemented comprehensively in the UK, for example to ' 'understand the reasons why this program is effective. Certainly, ' 'intervention or intervention in self-management (self-management) may help ' 'patients manage their mental health better. In this study, researchers ' 'provide participants either with personal rehabilitation work books only ' '(220 people) or working books with help with senasib partner support workers ' '(221 people). Participants are also allowed to continue their regular care. ' 'This study was conducted through six mental health crisis solution teams in ' 'England and participants were selected only after they were allowed to leave ' 'the crisis unit by crisis resolution team. Study participants consisted of ' 'various diagnosiss including schizophrenia, bipolar, psychosis, depression, ' 'anxiety confusion, post-trauma stress disorders (PTSD), and personality ' 'confusion. Participants who received support from senasib friends were ' 'offered 10 weekly meeting sessions for an hour. During this meeting session, ' "support workers heard participants' problems and aim to instill hope by " 'sharing the skills and strategies of disease management (coping strategies) ' 'which were mastered when they themselves were in the process of recovery. ' 'All support workers are given prior training in listening skills, cultural ' 'awareness, self-disclosure, and confidentiality, as well as how to use work ' "books. Participants' health records were monitored by researchers for one " 'year to determine whether participants were re-entered to acute care units ' '(such as acute patient wards, crisis solutions, crisis homes, and acute ' 'daily care services) or not. After completing a year, the study found that ' 'the percentage of re-entry into acute care units is lower for the group of ' 'participants who receive intervention compared to the control group - with ' '29% of participants re-in-reduced after receiving support from senasib ' 'friends, compared to 38% participants who only received working books. The ' 'intervention absorption rate also well - 72% of participants offered the ' 'services of senasib support employees and working books attended at least ' 'three meeting sessions, while one third attended all 10 meeting sessions. ' '"We are aware that many users of mental health crisis care services felt ' 'this service stopped sharply due to lack of ready-up care. 
Our study shows ' 'that senasib partner support workers can help fill this gap by helping ' 'service users (accuraration) to develop self-management strategies and ' 'recovery plans in their own way and more meaningful, in turn can help ' 'patients continue to survive after a crisis, "said the joint author, Dr ' 'Brynmor Lloyd-Evans (Psikiatrialt) to help in turn can help patients ' 'continue to survive after a crisis after a crisis after a crisis," said ' 'after a crisis," said. together," said that. (Piatri, (s after receiving job ' 'management services of the services, Dr imized service users with at least ' 'three-inforceditionary support services (Philly, and more meaningfully and ' 'more meaningfully, Dr (Planties, in turn help patients with a crisis, in ' 'turn can help patients continue to help']


[22]: pprint(quantized_bigbird_small.greedy_decoder([string3])) ['3 August 2018 - The latest study conducted by University College London ' '(UCL) found that mental healthcare services from support workers who have ' 'been going through the disease themselves may help reduce the probability of ' 'patients who have just left from acute mental health care from re-increased ' 'to the unit. The research paper published today in the journal The Lancet ' 'found that the number of mental patients re-induced into acute care units ' 'within a year is 24% much lower for the group of patients offered by support ' 'workers in the form of dustry partners, compared to groups of patients who ' 'are only given personal rehabilitation work books. "The pain allowed to go ' 'out (discharge) from community crisis services is often re-entered to acute ' 'care units. This not only hinders recovery, but also uses resources that ' "should be reserved for long-term improvement of patients' functions and " 'quality of life. Support workers from the group of senasib friends are able ' 'to provide truly friendly and empath support and encouragement because they ' 'come from their own personal experience, in addition to being a good example ' 'of role model for patient rehabilitation, "said the main author, Professor ' 'Sonia Johnson (Psychiiatri, UCL). In the United Kingdom (UK), more than half ' 'of the number of patients included in the acute care unit is re-entered into ' 'the unit within a year, but there is no solid evidence to explain the method ' 'to reduce this number. In fact, support from individuals who have ' 'experienced mental health problems has been practiced in programs such as ' 'the Implementation of Rehabilitation through NHS Organizational Change (UK) ' 'as well as the United States Health Recovery Action Plan. This research is ' 'the first random controlled test to evaluate the effectiveness of senasib ' 'support programs. Even so, there are still many research that needs to be ' 'carried out before this strategy can be implemented comprehensively in the ' 'UK, for example to understand the reasons why this program is effective. ' 'Certainly, intervention or intervention in self-management may help patients ' 'manage their mental health better. In this study, researchers provide ' 'participants either with personal rehabilitation work books (220 people) or ' 'working books with help from senasib partner support workers (221 people). ' 'Participants are also allowed to continue their regular care. This study was ' 'conducted through six mental health crisis solution teams in England and ' 'participants were selected only after they were allowed to leave the crisis ' 'unit by crisis resolution team. Study participants consisted of various ' 'diagnosiss including schizophrenia, bipolar, psychosis, depression, anxiety ' 'confusion, post-trauma stress disorders (PTSD), and personality confusion. ' 'Participants who received support from senasib friends were offered 10 ' 'weekly meeting sessions for an hour. During this meeting session, support ' "workers heard participants' problems and aim to instill hope by sharing the " 'skills and strategies of disease management when they themselves were in the ' 'process of recovery. All support workers are given prior training in ' 'listening skills, cultural awareness, self-disclosure, and confidentiality, ' "as well as how to use work books. 
Participants' health records were " 'monitored by researchers for one year to determine whether participants were ' 're-entered to acute care units (such as acute patient wards, crisis ' 'solutions, crisis homes, and acute daily care services) or not. After the ' 'end of one year, the study found that the percentage of re-entry to acute ' 'care units is lower for the group of participants who receive intervention ' 'compared to the control group - with 29% of participants re-increased after ' 'receiving the support of the senasib partner, compared to 38% participants ' 'who only received work books. The intervention absorption rate also - 72% of ' 'participants offered the services of senasib support employees and working ' 'books attended at least three meeting sessions, while one third attended all ' '10 meeting sessions. "We are aware that many users of mental health crisis ' 'care services feel that this service stopped dramatically due to lack of ' (continues on next page)


(continued from previous page) 'follow-up care ready. Our study shows that senasib peer support workers can ' 'help fill this gap by helping service users (accuratic) develop ' 'self-management strategies and recovery plans in their own way and more ' 'meaningfully, in turn can help patients continue to survive after a crisis, ' '"said the joint author, Dr Brynmor Lloyd-Evans (Pcyclist, UCL).']

[26]: pprint(bigbird.greedy_decoder([string3])) ['3 August 2018 - A recent study conducted by University College London (UCL) ' 'found that mental healthcare services from support workers who have ' 'undergone the disease themselves may help reduce the probability of patients ' 'who have just left acute mental health care (acute mental health care) from ' 'being re-entered to the unit. The research paper published today in the ' 'journal The Lancet found that the number of mental patients re-entered to ' 'acute care units within a year was 24% much lower for the group of patients ' 'offered by support workers in the form of senib friends, compared to the ' 'group of patients who were only given personal rehabilitation workbooks ' 'only. "Patients allowed to leave (discharge) from community crisis services ' 'are often re-entered to acute care units. This not only hinders recovery, ' 'but also uses resources that should be reserved for long-term improvement of ' 'patient function and quality of life. Support workers from the senasib ' 'friends are able to provide support and encouragement that is truly friendly ' 'and empathic because it comes from their own personal experience, in ' 'addition to being a good example of models (role models) for patient ' 'recovery, "said lead author Professor Sonia Johnson (Psychiatric, UCL). In ' 'the United Kingdom (UK), more than half of the total number of patients ' 'admitted to acute care units are re-entered to the unit within a year, but ' 'no solid evidence explains the method to reduce this number.In fact, support ' 'from individuals who have had mental health problems has been practiced in ' 'programs such as Rehabilitation Implementation through NHS Organization ' 'Change (UK) and also the United States Health Rehabilitation Action ' 'Plan.This research is the first such randomized test to evaluate the ' 'effectiveness of senmitarian support programs.However, there is still a lot ' 'of research that needs to be done before this strategy can be implemented ' 'comprehensively in the UK, for example to understand the reasons why the ' 'program is effective. Certainly, intervention or intervention in ' 'self-management may help patients manage their mental health better.In this ' 'study, the researcher provided participants either with personal ' 'rehabilitation workbooks only (220 people) or workbooks with the help of ' 'support employees of senasib friends (221 people). Participants are also ' 'allowed to continue their normal care. This study was conducted through six ' 'teams to resolve the mental health crisis (crisis resolution team) in ' 'England and participants were selected only after they were allowed to leave ' 'the crisis unit by the crisis-solving team. Study participants consisted of ' 'various diagnosiss including schizophrenia, bipolar, psychosis, depression, ' 'anxiety disorder, post-trauma stress disorder (PTSD), and personality ' 'disorder. Participants who received the support of senmiline friends were ' 'offered 10 weekly meeting sessions for an hour. During this meeting session, ' "support workers listened to participants' problems and aimed at instilling " 'hope by sharing skills and strategies of disease management (coping ' 'strategies) controlled while they themselves were in the process of ' 'recovery. All support workers are given prior training in listening skills, ' 'cultural awareness, self-disclosure, and confidentiality, as well as how to ' "use workbooks. 
Participants' health records are monitored by researchers for " 'a year to determine whether participants are re-entered into acute care ' 'units (such as acute patient wards, crisis-solving teams, crisis houses, ' 'crisis houses, and acute daily care services) or not. After a year, the ' (continues on next page)


(continued from previous page) 'study found that the percentage of re-entry to acute care units was lower ' 'for groups of participants who received the intervention compared to the ' 'control group - with 29% of participants re-entered after receiving support ' 'from their peers, compared to 38% of participants who only received ' 'workbooks. The intervention absorption rate is also good - 72% of ' 'participants offered the services of support partner support workers and ' 'workbooks attended at least three meeting sessions, while one-third attended ' 'all 10 meeting sessions. "We are aware that many users of mental health ' 'crisis care services feel this service ceases sharply due to lack of ' 'ready-made follow-up care. Our study shows that support employees of senied ' 'partners can help fill out this gap by helping service users (pesakit) ' 'develop self-management strategies and recovery and more meaningfully, ' 'further helping patients survive after a crisis, " said the co-winnary ' 'author, " said the co-star, Dr Brynmorynmor Lloyd-E).']

[27]: pprint(quantized_bigbird.greedy_decoder([string3]))
['August 3, 2018 - A recent study conducted by University College London (UCL) '
 'found that mental healthcare services from support workers who have '
 'undergone the disease themselves may help reduce the probability of patients '
 'who have just left acute mental health care (acute mental health care) from '
 'being re-entered to the unit. The research paper published today in the '
 'journal The Lancet found that the number of mental patients re-entered into '
 'acute care units within a year was 24% much lower for the group of patients '
 'offered by support workers in the form of senib partners, compared to the '
 'group of patients who were only given personal rehabilitation workbooks '
 'only. "Puncharge allowed out of community crisis services are often '
 're-entered to acute care units. This not only hinders recovery, but also '
 'uses resources that should be reserved for long-term improvement of patient '
 'function and quality of life. Support workers from the senasib partners are '
 'able to provide support and encouragement that is truly friendly and '
 'empathistic because it comes from their own personal experience, in addition '
 'to being a good example of a good role model for patient recovery, "said '
 'lead author Professor Sonia Johnson (Psychiatrist, UCL). In the United '
 'Kingdom (UK), more than half of the total number of patients admitted to '
 'acute care units are re-entered to the unit within a year, but no solid '
 'evidence explains the method to reduce this amount. In fact, support from '
 'individuals who have had mental health problems has been practiced in '
 'programs such as Rehabilitation Implementation through NHS Organization '
 'Change (UK) and also the United States Health Rehabilitation Action '
 'Plan. This research is the first such random controlled test to evaluate the '
 'effectiveness of the senasib support program. However, there is still a lot '
 'of research that needs to be done before this strategy can be implemented '
 'comprehensively in the UK, for example to understand the reasons why the '
 'program is effective. Certainly, intervention or intervention in '
 'self-management may help patients manage their mental health better. In this '
 'study, the researcher provided participants either with personal '
 'rehabilitation workbooks only (220 people) or workbooks with the help of '
 'support staff of senasib friends (221 people). Participants are also allowed '
 'to continue their normal care. This study was conducted through six teams '
 'solving the mental health crisis (crisis resolution team) in England and '
 'participants were selected only after they were allowed to leave the crisis '
 'unit by the crisis-solving team. Study participants consisted of various '
 'diagnosis including schizophrenia, bipolar, psychosis, depression, anxiety '
 'disorder, post-trauma stress disorder (PTSD), and personality disorder. '
 'Participants who received the support of senasib friends were offered 10 '
 'weekly meeting sessions for an hour. During this meeting session, support '
 "workers listened to participants' problems and aimed at instilling hope by "
 'sharing skills and strategies of disease management (coping strategies) '
 'controlled while they were in the process of recovery. All support workers '
 'are given prior training in listening skills, cultural awareness, '
 'self-disclosure, and confidentiality, as well as how to use workbooks. '
 "Participants' health records are monitored by researchers for a year to "
 'determine whether participants are re-entered into acute care units (such as '
 'acute patient wards, crisis-solving teams, crisis houses, crisis houses, and '
 'acute daily care services) or not. After a year, the study found that the '
 'percentage of re-entry to acute care units is lower for groups of '
 'participants who receive interventions compared to control groups - with 29% '
 'of participants re-entered after receiving support from their peers, '
 'compared to 38% of participants who only received workbooks. The '
 'intervention absorption rate is also good - 72% of participants offered the '
 'services of support partner support and workbooks attend at least three '
 'meeting sessions, while one-thirds attend all 10 meeting sessions. "We are '
 'aware that many users of mental health crisis care services feel this '
 'service is abruptly due to lack of ready-made follow-up care. Our study '
 'shows that support staff support workers can help fill this gap by helping '
 'service users (pesakit) develop self-management strategies and recovery '
 'plans in their own way, further helping patients survive after a crisis," '
 'said the occurrence, " said the co-started author Dr Bryn of the author, Dr '
 'Brynousa.']


9.54 SQUAD

This tutorial is available as an IPython notebook at Malaya/example/qa-squad.

This module is only trained on standard language structure, so it is not safe to use for local language structure.

[1]: %%time

     import malaya
     from pprint import pprint
CPU times: user 5.14 s, sys: 1 s, total: 6.14 s
Wall time: 7.01 s


9.54.1 What is SQUAD

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, e.g.,

{
    'title': 'Normans',
    'paragraphs': [
        {
            'context': 'Orang Norman (Norman: Nourmands; Perancis: Normands; Latin: Normanni) ialah orang-orang yang pada abad ke-10 dan ke-11 memberikan nama mereka kepada Normandy, sebuah wilayah di Perancis. Mereka diturunkan daripada Norse ("Norman" berasal daripada penyerang "Norseman") dan lanun dari Denmark, Iceland dan Norway yang, di bawah pimpinan mereka Rollo, bersetuju untuk bersumpah fealty kepada Raja Charles III dari Francia Barat. Melalui generasi asimilasi dan percampuran dengan penduduk asli Frankish dan Roman-Gaulish, keturunan mereka akan beransur-ansur bergabung dengan budaya Carolingian yang berpusat di Francia Barat. Identiti budaya dan etnik yang berbeza dari orang Norman muncul pada mulanya pada separuh pertama abad ke-10, dan ia terus berkembang pada abad-abad yang berjaya.',
            'qas': [
                {
                    'question': 'Di negara manakah Normandy berada?',
                    'answers': [
                        {'text': 'Perancis', 'answer_start': 177},
                        {'text': 'Perancis', 'answer_start': 177},
                        {'text': 'Perancis', 'answer_start': 177},
                        {'text': 'Perancis', 'answer_start': 177},
                    ],
                    'id': '56ddde6b9a695914005b9628',
                    'is_impossible': False,
                }
            ],
        }
    ],
}

So we need to give a long paragraph and multiple questions, and the model will return answers based on that paragraph, with start and end spans. Read more about the SQuAD dataset at https://rajpurkar.github.io/SQuAD-explorer/.
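To make the span convention concrete, below is a minimal sketch (toy data, not part of Malaya) showing that answer_start is a character offset into context, which is also how the start / end values returned by the models later in this section should be read:

# toy SQuAD-style record: `answer_start` is a character offset into `context`
context = 'Normandy ialah sebuah wilayah di Perancis.'
answer = {'text': 'Perancis', 'answer_start': context.index('Perancis')}

span = context[answer['answer_start']:answer['answer_start'] + len(answer['text'])]
assert span == answer['text']  # offsets are character-based, not token-based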

9.54.2 List available Transformer models

[2]: malaya.qa.available_transformer_squad()
INFO:root:tested on SQUAD V2 Dev set.
[2]:              Size (MB)  Quantized Size (MB)     exact        f1    total
     tiny-bert         60.9                15.30  53.45758  56.79821  11858.0
     bert             452.0               113.00  57.16810  61.48740  11858.0
     albert            58.1                14.60  58.97284  63.12757  11858.0
     tiny-albert       24.8                 6.35  50.00843  50.00843  11858.0
     xlnet            478.0               120.00  62.74245  66.56101  11858.0
     alxlnet           58.4                15.60  61.97503  65.89765  11858.0


9.54.3 Load Transformer model

[3]: xlnet_model = malaya.qa.transformer_squad(model='xlnet')
     albert_model = malaya.qa.transformer_squad(model='albert')

9.54.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from a quantized model, and it is not necessarily faster than the normal 32-bit float model; it depends entirely on the machine.

[4]: quantized_xlnet_model = malaya.qa.transformer_squad(model='xlnet', quantized=True)
     quantized_albert_model = malaya.qa.transformer_squad(model='albert', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:root:Load quantized model will cause accuracy drop.

9.54.5 Copy from Wikipedia and news

[5]: # https://ms.wikipedia.org/wiki/Mohd_Najib_bin_Abdul_Razak

     p_wikipedia = """
     Najib razak telah dipilih untuk Parlimen Malaysia pada tahun 1976,
     pada usia 23 tahun, menggantikan bapanya duduk di kerusi Pekan yang berpangkalan di Pahang.
     Dari tahun 1982 hingga 1986 beliau menjadi Menteri Besar (Ketua Menteri) Pahang,
     sebelum memasuki persekutuan Kabinet Tun Dr Mahathir Mohamad pada tahun 1986 sebagai Menteri Kebudayaan, Belia dan Sukan.
     Beliau telah berkhidmat dalam pelbagai jawatan Kabinet sepanjang baki tahun 1980-an dan 1990-an,
     termasuk sebagai Menteri Pertahanan dan Menteri Pelajaran.
     Beliau menjadi Timbalan Perdana Menteri pada 7 Januari 2004,
     berkhidmat di bawah Perdana Menteri Tun Dato' Seri Abdullah Ahmad Badawi,
     sebelum menggantikan Badawi setahun selepas Barisan Nasional mengalami kerugian besar dalam pilihan raya 2008.
     Di bawah kepimpinan beliau, Barisan Nasional memenangi pilihan raya 2013,
     walaupun buat kali pertama dalam sejarah Malaysia pembangkang memenangi majoriti undi popular.
     """
     q_wikipedia = ['Siapakah Menteri Besar Pahang',
                    'Apakah jawatan yang pernah dipegang oleh Najib Razak']

[6]: # https://www.malaysiakini.com/news/574914

     p_news = """
     Bekas perdana menteri Najib Razak mempersoalkan tindakan polis yang menurutnya tidak
     serta-merta mengeluarkan kenyataan berhubung dakwaan Adun Perikatan Nasional (PN) "merancang" insiden rogol.
     Sedangkan, kata ahli parlimen Pekan itu, polis pantas mengeluarkan kenyataan apabila
     dia dilapor terlupa mengimbas MySejahtera sebelum masuk restoran.
     "Berita Najib lupa scan MySejahtera tular, kenyataan polis terus keluar.
     Berita Dr Mahathir Mohamad lupa scan, kenyataan, polis serta-merta keluar.
     "Sebab itu saya pelik kenapa pihak polis belum sempat keluar apa-apa kenyataan
     berhubung kes seorang gadis membuat laporan polis untuk dakwa Adun PN rancang
     insiden rogolnya," katanya di Facebook hari ini.
     Najib merujuk dakwaan seorang wanita yang mendakwa dirogol kenalan kepada Adun Gombak Setia, Hilman Idham.
     Wanita itu mendakwa ahli politik dari Bersatu berkenaan merancang insiden yang berlaku pada 5 Dis lalu.
     Menurut laporan polis pada 8 Mei, mangsa mendakwa kejadian itu berlaku di sebuah
     hotel di Selangor, yang pada masa itu berada di bawah perintah kawalan pergerakan bersyarat (PKPB).
     """

     q_news = ['siapakah yang mempersoalkan tindakan polis', 'siapakah Adun Gombak Setia']

9.54.6 Predict

def predict(
    self,
    paragraph_text: str,
    question_texts: List[str],
    doc_stride: int = 128,
    max_query_length: int = 64,
    max_answer_length: int = 64,
    n_best_size: int = 20,
):
    """
    Predict span from questions given a paragraph.

    Parameters
    ----------
    paragraph_text: str
    question_texts: List[str]
        List of questions; results really depend on case-sensitive questions.
    doc_stride: int, optional (default=128)
        striding size to split a paragraph into multiple texts.
    max_query_length: int, optional (default=64)
        Maximum length of question tokens.
    max_answer_length: int, optional (default=64)
        Maximum length of answer tokens.

    Returns
    -------
    result: List[{'text': 'text', 'start': 0, 'end': 1}]
    """
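An illustrative sketch of what doc_stride does conceptually: a long paragraph is split into overlapping windows so every answer span falls fully inside at least one window. This mirrors the usual BERT-style sliding-window trick and is an assumption for illustration, not Malaya's exact internals:

def sliding_windows(tokens, window=8, doc_stride=4):
    # each window starts `doc_stride` tokens after the previous one,
    # so neighbouring windows overlap by `window - doc_stride` tokens
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += doc_stride
    return windows

print(sliding_windows(list(range(14)), window=8, doc_stride=4))
# [[0, ..., 7], [4, ..., 11], [8, ..., 13]]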

[7]: xlnet_model.predict(p_wikipedia, q_wikipedia)
[7]: [{'text': 'Najib razak', 'start': 0, 'end': 11},
     {'text': 'Pekan yang berpangkalan di Pahang', 'start': 123, 'end': 157}]

[9]: albert_model.predict(p_wikipedia, q_wikipedia)
[9]: [{'text': 'Najib razak', 'start': 0, 'end': 11},
     {'text': 'Menteri Pertahanan dan Menteri Pelajaran', 'start': 475, 'end': 516}]


[10]: quantized_xlnet_model.predict(p_wikipedia, q_wikipedia)
[10]: [{'text': 'Najib razak', 'start': 0, 'end': 11},
      {'text': 'Pekan yang berpangkalan di Pahang', 'start': 123, 'end': 157}]

[11]: quantized_albert_model.predict(p_wikipedia, q_wikipedia)
[11]: [{'text': 'Najib razak', 'start': 0, 'end': 11},
      {'text': 'Menteri Pertahanan dan Menteri Pelajaran', 'start': 475, 'end': 516}]

[12]: xlnet_model.predict(p_news, q_news)
[12]: [{'text': 'Bekas perdana menteri Najib Razak', 'start': 0, 'end': 33},
      {'text': 'Hilman Idham', 'start': 791, 'end': 804}]

[13]: albert_model.predict(p_news, q_news)
[13]: [{'text': 'Bekas perdana menteri Najib Razak', 'start': 0, 'end': 33},
      {'text': 'Hilman Idham', 'start': 791, 'end': 804}]

[14]: quantized_xlnet_model.predict(p_news, q_news)
[14]: [{'text': 'Bekas perdana menteri Najib Razak', 'start': 0, 'end': 33},
      {'text': 'Hilman Idham', 'start': 791, 'end': 804}]

[15]: quantized_albert_model.predict(p_news, q_news)
[15]: [{'text': 'Bekas perdana menteri Najib Razak', 'start': 0, 'end': 33},
      {'text': 'Hilman Idham', 'start': 791, 'end': 804}]
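Since a quantized model is not guaranteed to be faster (as noted above), a rough single-run timing sketch like the one below can check the trade-off on your own machine; repeat and average for anything rigorous:

import time

for name, m in [('float32 xlnet', xlnet_model),
                ('int8 xlnet', quantized_xlnet_model)]:
    start = time.time()
    m.predict(p_news, q_news)
    print(name, '%.2f seconds' % (time.time() - start))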

9.54.7 Vectorize

Let's say you want to visualize sentence / word level in a lower dimension; you can use model.vectorize,

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    vectorize list of strings.

    Parameters
    ----------
    strings: List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result: np.array
    """
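As a toy illustration of what these methods mean (numpy only; this is not Malaya's code, and the hidden size is just an assumption for the example):

import numpy as np

hidden = np.random.rand(12, 768)    # pretend: 12 tokens, 768-dim hidden states
v_first = hidden[0]                 # method='first' (first token, CLS-style)
v_last = hidden[-1]                 # method='last'  (last token)
v_mean = hidden.mean(axis=0)        # method='mean'  (average over all tokens)
# method='word' returns one vector per token instead of a single vector per
# string, which is what the "Word level" example below relies on.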


Sentence level

[25]: strings = ['Siapakah Menteri Besar Pahang',
                 'Apakah jawatan yang pernah dipegang oleh Najib Razak',
                 'Najib razak',
                 'Menteri Pertahanan dan Menteri Pelajaran',
                 'Bekas perdana menteri Najib Razak',
                 'Hilman Idham',
                 'Tun Dr Mahathir Mohamad pada tahun 1986',
                 'Berita Najib lupa scan MySejahtera tular, kenyataan polis terus keluar']

[26]: r = quantized_xlnet_model.vectorize(strings, method='first')

[27]: from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt

      tsne = TSNE().fit_transform(r)
      tsne.shape
[27]: (8, 2)

[28]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = strings
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )


Word level

[33]: r = quantized_xlnet_model.vectorize(strings, method='word')

[34]: x, y = [], []
      for row in r:
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])

[35]: tsne = TSNE().fit_transform(y)
      tsne.shape
[35]: (43, 2)

[36]: plt.figure(figsize=(7, 7))
      plt.scatter(tsne[:, 0], tsne[:, 1])
      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )

9.55 Classification

This tutorial is available as an IPython notebook at Malaya/example/zeroshot-classification.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]: %%time
     import malaya
CPU times: user 4.86 s, sys: 709 ms, total: 5.57 s
Wall time: 5 s
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))


9.55.1 What is zero-shot classification

Commonly we supervise a machine learning model on specific labels: negative / positive for sentiment, anger / happy / sadness for emotion, and so on. The model cannot give an output for a label it has never seen; we cannot ask how much 'jealous' is in a text when the emotion model's supported labels are only {anger, happy, sadness}. Imagine trying to identify a text without ever having seen one 'jealous' label before: impossible. Zero-shot learning tries to solve this problem. It refers to the process by which a machine learns to recognize objects (image, text, any features) without any labeled training data for those classes. Yin et al. (2019) stated in their paper that any pretrained language model finetuned on text similarity can act as an out-of-the-box zero-shot text classifier. So, we are going to use transformer models from malaya.similarity.transformer with a few tweaks.
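To make the idea concrete, here is a minimal conceptual sketch of similarity-based zero-shot classification; the similarity function below is hypothetical and simplified, and the exact tweak Malaya applies to its similarity models may differ:

def zero_shot_sketch(text, labels, similarity):
    # `similarity` is a hypothetical callable returning P(text matches label);
    # every label is scored independently against the same text, so any label
    # string can be supplied at inference time and the probabilities
    # need not sum to 1.
    return {label: similarity(text, label) for label in labels}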

9.55.2 List available Transformer models

[2]: malaya.zero_shot.classification.available_transformer()
[2]:              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
     bert             423.4                111.0          0.88315       0.88656         0.88405
     tiny-bert         56.6                 15.0          0.87210       0.87546         0.87292
     albert            48.3                 12.8          0.87164       0.87146         0.87155
     tiny-albert       21.9                  6.0          0.82234       0.82383         0.82295
     xlnet            448.7                119.0          0.80866       0.76775         0.77112
     alxlnet           49.0                 13.9          0.88756       0.88700         0.88727

We trained on Quora Question Pairs, translated SNLI, and translated MNLI. Make sure you check the accuracy chart here before selecting a model: https://malaya.readthedocs.io/en/latest/Accuracy.html#similarity. You might want to use ALXLNET, a very small size (49MB), but the accuracy is still top notch.

9.55.3 Load transformer model

In this example, I am going to load alxlnet; feel free to use any of the available models above.

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer zero-shot model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya.model.bert.ZeroshotBERT class
    """

[3]: model = malaya.zero_shot.classification.transformer(model='alxlnet')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

9.55.4 Load Quantized model

To load an 8-bit quantized model, simply pass quantized=True; the default is False. Expect a slight accuracy drop from a quantized model, and it is not necessarily faster than the normal 32-bit float model; it depends entirely on the machine.

[4]: quantized_model = malaya.zero_shot.classification.transformer(model='alxlnet', quantized=True)
WARNING:root:Load quantized model will cause accuracy drop.

predict batch

def predict_proba(self, strings: List[str], labels: List[str]):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]
    labels : List[str]

    Returns
    -------
    list: list of float
    """

Because it is zero-shot, we need to give labels to the model.

[5]: # copy from twitter

     string = 'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai'

[6]: model.predict_proba([string], labels=['najib razak', 'mahathir', 'kerajaan', 'PRU', 'anarki'])
[6]: [{'najib razak': 0.02823881,
      'mahathir': 0.059464306,
      'kerajaan': 0.0032106405,
      'PRU': 0.9422462,
      'anarki': 0.9644167}]

[7]: quantized_model.predict_proba([string], labels=['najib razak', 'mahathir', 'kerajaan', 'PRU', 'anarki'])
[7]: [{'najib razak': 0.004405794,
      'mahathir': 0.015691597,
      'kerajaan': 0.0154573675,
      'PRU': 0.8233098,
      'anarki': 0.34632725}]

Quite good.

[8]: string = 'tolong order foodpanda jab, lapar'

[9]: model.predict_proba([string], labels=['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery'])
[9]: [{'makan': 0.54341537,
      'makanan': 0.9774909,
      'novel': 0.00090197776,
      'buku': 0.00044378178,
      'kerajaan': 0.0028080132,
      'food delivery': 0.8143844}]

The model understood that order foodpanda has a close relationship with makan, makanan and food delivery.

[10]: string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'

[11]: model.predict_proba([string], labels=['makan', 'makanan', 'novel', 'buku', 'kerajaan',
                                            'food delivery', 'kerajaan jahat', 'kerajaan prihatin',
                                            'bantuan rakyat'])
[11]: [{'makan': 0.008046242,
       'makanan': 0.0016310408,
       'novel': 0.00044678123,
       'buku': 0.00071050954,
       'kerajaan': 0.98634493,
       'food delivery': 0.0009665733,
       'kerajaan jahat': 0.1006222,
       'kerajaan prihatin': 0.9954796,
       'bantuan rakyat': 0.35266426}]

9.55.5 Vectorize

Let's say you want to visualize sentence / word level in a lower dimension; you can use model.vectorize,

def vectorize(
    self, strings: List[str], labels: List[str], method: str = 'first'
):
    """
    vectorize a string.

    Parameters
    ----------
    strings: List[str]
    labels : List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result: np.array
    """

Sentence level

[4]: texts = ['kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan',
              'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai',
              'tolong order foodpanda jab, lapar',
              'Hapuskan vernacular school first, only then we can talk about UiTM']
     labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
               'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat']
     r = quantized_model.vectorize(texts, labels, method='first')

The vectorize method of the zero-shot classification model returns 2 values, (combined, vector).

[5]: r[0][:5]
[5]: [('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan', 'makan'),
     ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan', 'makanan'),
     ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan', 'novel'),
     ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan', 'buku'),
     ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan', 'kerajaan')]

[6]: r[1]
[6]: array([[-0.00587193, -0.7214614 , -0.7524409 , ...,  0.31107777,
              1.022762  ,  0.28308758],
            [ 0.63863456,  0.12698255,  0.67567766, ...,  0.7627216 ,
              0.56795114, -0.37056473],
            [-0.90291303,  0.93581504,  0.05650915, ...,  0.5578094 ,
              1.1304276 ,  0.5470246 ],
            ...,
            [-2.1161728 , -1.4592253 ,  0.5284856 , ...,  0.28636536,
             -0.36558965, -0.8226106 ],
            [-2.2050292 , -0.14624506,  0.19812807, ...,  0.1307496 ,
             -0.20792441,  0.18430969],
            [-2.5969799 ,  0.4205628 ,  0.18376699, ...,  0.124988  ,
             -0.9915105 , -0.10085672]], dtype=float32)

[7]: from sklearn.manifold import TSNE
     import matplotlib.pyplot as plt

     tsne = TSNE().fit_transform(r[1])
     tsne.shape
[7]: (36, 2)
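The 36 rows are expected: each of the 4 texts is paired with each of the 9 labels, and 4 × 9 = 36, so every row is the vector of one (text, label) pair.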

[8]: unique_labels = list(set([i[1] for i in r[0]]))
     palette = plt.cm.get_cmap('hsv', len(unique_labels))

[9]: plt.figure(figsize=(7, 7))

     for label in unique_labels:
         indices = [i for i in range(len(r[0])) if r[0][i][1] == label]
         plt.scatter(tsne[indices, 0], tsne[indices, 1],
                     cmap=palette(unique_labels.index(label)), label=label)

     labels = [i[0] for i in r[0]]
     for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
         label = (
             '%s, %.3f' % (label[0], label[1])
             if isinstance(label, list)
             else label
         )
         plt.annotate(
             label,
             xy=(x, y),
             xytext=(0, 0),
             textcoords='offset points',
         )
     plt.legend()


[9]:

Word level

[28]: texts = ['kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan',
               'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai',
               'tolong order foodpanda jab, lapar',
               'Hapuskan vernacular school first, only then we can talk about UiTM']
      labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat']
      r = quantized_model.vectorize(texts, labels, method='word')

[29]: x, y, labels = [], [], []
      for no, row in enumerate(r[1]):
          x.extend([i[0] for i in row])
          y.extend([i[1] for i in row])
          labels.extend([r[0][no][1]] * len(row))

[30]: tsne = TSNE().fit_transform(y)
      tsne.shape
[30]: (315, 2)

[31]: unique_labels = list(set(labels))
      palette = plt.cm.get_cmap('hsv', len(unique_labels))

[32]: plt.figure(figsize=(7, 7))

      for label in unique_labels:
          indices = [i for i in range(len(labels)) if labels[i] == label]
          plt.scatter(tsne[indices, 0], tsne[indices, 1],
                      cmap=palette(unique_labels.index(label)), label=label)

      labels = x
      for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
          label = (
              '%s, %.3f' % (label[0], label[1])
              if isinstance(label, list)
              else label
          )
          plt.annotate(
              label,
              xy=(x, y),
              xytext=(0, 0),
              textcoords='offset points',
          )
      plt.legend()
[32]:


9.55.6 Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html. If you want to stack zero-shot classification models, you need to pass labels using a keyword parameter,

malaya.stack.predict_stack([model1, model2], List[str], labels=List[str])

We will pass labels as **kwargs.
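As a rough mental model, stacking scores each string with every model and then combines the per-label probabilities; the sketch below assumes a geometric mean as the combining function, which may differ from what malaya.stack.predict_stack actually uses:

import numpy as np

def predict_stack_sketch(models, strings, labels):
    # one list of per-string label->probability dicts per model
    all_probs = [m.predict_proba(strings, labels=labels) for m in models]
    stacked = []
    for i in range(len(strings)):
        combined = {
            label: float(np.exp(np.mean([np.log(p[i][label]) for p in all_probs])))
            for label in labels
        }
        stacked.append(combined)
    return stacked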

[10]: alxlnet = malaya.zero_shot.classification.transformer(model='alxlnet')
      albert = malaya.zero_shot.classification.transformer(model='albert')
      tiny_bert = malaya.zero_shot.classification.transformer(model='tiny-bert')
WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:loading sentence piece model

[11]: string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'
      labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat']
      malaya.stack.predict_stack([alxlnet, albert, tiny_bert], [string], labels=labels)
[11]: [{'makan': 0.0044827852,
       'makanan': 0.0027062024,
       'novel': 0.0020867025,
       'buku': 0.013082165,
       'kerajaan': 0.8859287,
       'food delivery': 0.0028363755,
       'kerajaan jahat': 0.018133936,
       'kerajaan prihatin': 0.9922408,
       'bantuan rakyat': 0.909674}]


9.56 Topic Modeling

This tutorial is available as an IPython notebook at Malaya/example/topic-modeling.

[1]: from IPython.core.display import display, HTML
     display(HTML(""))

[2]: import pandas as pd
     import malaya
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))


[3]: df = pd.read_csv('tests/02032018.csv', sep=';')
     df = df.iloc[3:, 1:]
     df.columns = ['text', 'label']
     corpus = df.text.tolist()

You can get this file from Malaya/tests. This csv is already stemmed.

9.56.1 Load Transformer

We can use a Transformer model to build topic modeling for the corpus we have; the power of attention!

def attention(
    corpus: List[str],
    n_topics: int,
    vectorizer,
    cleaning = simple_textcleaning,
    stopwords = get_stopwords,
    ngram: Tuple[int, int] = (1, 3),
    batch_size: int = 10,
):
    """
    Use attention from a transformer model to do topic modelling based on corpus / list of strings given.

    Parameters
    ----------
    corpus: list
    n_topics: int, (default=10)
        size of decomposition column.
    vectorizer: object
    cleaning: function, (default=malaya.text.function.simple_textcleaning)
        function to clean the corpus.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]
    ngram: tuple, (default=(1,3))
        n-grams size to train a corpus.
    batch_size: int, (default=10)
        size of strings for each vectorization and attention.

    Returns
    -------
    result: malaya.topic_modelling.AttentionTopic class
    """

[4]: electra = malaya.transformer.load(model='electra')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:56: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:240: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:115: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:119: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:122: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:128: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:130: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt

[5]: attention = malaya.topic_model.attention(corpus, n_topics=10, vectorizer=electra)


Get topics

def top_topics(
    self, len_topic: int, top_n: int = 10, return_df: bool = True
):
    """
    Print important topics based on decomposition.

    Parameters
    ----------
    len_topic: int
        size of topics.
    top_n: int, optional (default=10)
        top n of each topic.
    return_df: bool, optional (default=True)
        return as pandas.DataFrame, else JSON.
    """

[6]: attention.top_topics(5, top_n=10, return_df=True)
[6]:      topic 0          topic 1       topic 2          topic 3         topic 4
     0       kwsp           negara          umno          menteri          projek
     1   mahkamah         malaysia         parti          perdana          hutang
     2       dana           rakyat           pas           bahasa        malaysia
     3   syarikat       pengalaman      kerajaan  perdana menteri             mdb
     4        bon        berkongsi           ros         kerajaan     kementerian
     5    dakwaan         kerajaan  perlembagaan          laporan           rumah
     6  kelulusan       berkembang     keputusan              isu        kerajaan
     7       bank            parti       menteri            pelan         gembira
     8       jppm        kemudahan       bersatu        pemilihan      pendekatan
     9  kenyataan  rakyat malaysia           isu       penjelasan  gembira projek

Get topics as string

def get_topics(self, len_topic: int):
    """
    Return important topics based on decomposition.

    Parameters
    ----------
    len_topic: int
        size of topics.

    Returns
    -------
    result: List[str]
    """

[7]: attention.get_topics(10)
[7]: [(0, 'kwsp mahkamah dana syarikat bon dakwaan kelulusan bank jppm kenyataan'),
     (1, 'negara malaysia rakyat pengalaman berkongsi kerajaan berkembang parti kemudahan rakyat malaysia'),
     (2, 'umno parti pas kerajaan ros perlembagaan keputusan menteri bersatu isu'),
     (3, 'menteri perdana bahasa perdana menteri kerajaan laporan isu pelan pemilihan penjelasan'),
     (4, 'projek hutang malaysia mdb kementerian rumah kerajaan gembira pendekatan gembira projek'),
     (5, 'bayar rakyat selesaikan raya pilihan raya ppsmi bincang bayar tutup mca jppm'),
     (6, 'kapal malaysia asli low jho jho low negara wang berita islam'),
     (7, 'undi parti pimpinan pakatan sokong pucuk suara pucuk suara bertanding suara pucuk pimpinan'),
     (8, 'pertumbuhan hutang harga pendapatan produk malaysia kaya kenaikan kumpulan peningkatan'),
     (9, 'lancar rakyat teknikal berjalan lancar kerja buku bahasa berjalan catatan berlaku')]

9.56.2 Train LDA2Vec model

def lda2vec(
    corpus: List[str],
    vectorizer,
    n_topics: int = 10,
    cleaning = simple_textcleaning,
    stopwords = get_stopwords,
    window_size: int = 2,
    embedding_size: int = 128,
    epoch: int = 10,
    switch_loss: int = 3,
    **kwargs,
):
    """
    Train a LDA2Vec model to do topic modelling based on corpus / list of strings given.

    Parameters
    ----------
    corpus: list
    vectorizer : object
        Should have `fit_transform` method. Commonly:

        * ``sklearn.feature_extraction.text.TfidfVectorizer`` - TFIDF algorithm.
        * ``sklearn.feature_extraction.text.CountVectorizer`` - Bag-of-Word algorithm.
        * ``malaya.text.vectorizer.SkipGramCountVectorizer`` - Skip Gram Bag-of-Word algorithm.
        * ``malaya.text.vectorizer.SkipGramTfidfVectorizer`` - Skip Gram TFIDF algorithm.
    n_topics: int, (default=10)
        size of decomposition column.
    cleaning: function, (default=malaya.text.function.simple_textcleaning)
        function to clean the corpus.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]
    embedding_size: int, (default=128)
        embedding size of lda2vec tensors.
    epoch: int, (default=10)
        one complete iteration.
    switch_loss: int, (default=3)
        baseline to switch from document based loss to document + word based loss.

    Returns
    -------
    result: malaya.topic_modelling.DeepTopic class
    """

[8]: from malaya.text.vectorizer import SkipGramCountVectorizer

     stopwords = malaya.text.function.get_stopwords()
     vectorizer = SkipGramCountVectorizer(
         max_df=0.95,
         min_df=1,
         ngram_range=(1, 3),
         stop_words=stopwords,
         skip=2,
     )

[9]: lda2vec = malaya.topic_model.lda2vec(corpus, vectorizer, n_topics=10, switch_loss=5000, epoch=5)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/model/lda2vec.py:43: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/model/lda2vec.py:46: The name tf.truncated_normal is deprecated. Please use tf.random.truncated_normal instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/model/lda2vec.py:54: The name tf.random_normal is deprecated. Please use tf.random.normal instead.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/model/lda2vec.py:117: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

minibatch loop: 100%|| 2187/2187 [00:22<00:00, 95.41it/s, cost=40.5, epoch=1]
minibatch loop: 100%|| 2187/2187 [00:24<00:00, 88.48it/s, cost=12.9, epoch=2]
minibatch loop: 100%|| 2187/2187 [00:23<00:00, 93.20it/s, cost=591, epoch=3]
minibatch loop: 100%|| 2187/2187 [00:23<00:00, 91.28it/s, cost=479, epoch=4]
minibatch loop: 100%|| 2187/2187 [00:24<00:00, 89.11it/s, cost=449, epoch=5]


Get topics

def top_topics(
    self, len_topic: int, top_n: int = 10, return_df: bool = True
):
    """
    Print important topics based on decomposition.

    Parameters
    ----------
    len_topic: int
        size of topics.
    top_n: int, optional (default=10)
        top n of each topic.
    return_df: bool, optional (default=True)
        return as pandas.DataFrame, else JSON.
    """

[10]: lda2vec.top_topics(5, top_n=10, return_df=True)
[10]:                                topic 0                       topic 1
      0                    bank dakwaan wang           bank negara dakwaan
      1             dakwaan pemindahan akaun             bank dakwaan wang
      2                 bank pemindahan wang             bank rhb syarikat
      3  menangguhkan menangguhkan kebenaran            kerajaan berkaitan
      4            subjek menjadikan ranking  penilaian tahunan ditawarkan
      5      persendirian dibenarkan lingkup           bank milik syarikat
      6                    menjadikan subjek     subjek menjadikan ranking
      7                            wang bank          bank pemindahan wang
      8                     dolar bank milik         mendedahkan ruang had
      9                      luas menjadikan              dolar bank milik

                          topic 2                              topic 3
      0                        ros                    bank dakwaan wang
      1               perlembagaan            subjek menjadikan ranking
      2                     lancar  menangguhkan menangguhkan kebenaran
      3                     status                    menjadikan subjek
      4                 dihentikan      mencadangkan pembangkang azizah
      5                   berjalan                    bank rhb syarikat
      6            sedar mengambil             dakwaan pemindahan akaun
      7  sahkan perolehi keputusan                           wang dolar
      8            berjalan lancar                  bank negara dakwaan
      9                    pilihan                     jabatan malaysia

                           topic 4
      0                        ros
      1               perlembagaan
      2                   berjalan
      3                     lancar
      4                    pilihan
      5                 dihentikan
      6                     status
      7            sedar mengambil
      8  sahkan perolehi keputusan
      9            berjalan lancar


Important sentences based on topics

def get_sentences(self, len_sentence: int, k: int = 0):
    """
    Return important sentences related to selected column based on decomposition.

    Parameters
    ----------
    len_sentence: int
    k: int, (default=0)
        index of decomposition matrix.

    Returns
    -------
    result: List[str]
    """

[11]: lda2vec.get_sentences(5)
[11]: ['bank negara dakwaan pemindahan wang akaun dolar bank rhb milik syarikat persendirian mendedahkan dibenarkan ruang lingkup had perundangan',
      'jho low anak kapal ditahan perairan indonesia',
      'tumpuan pekan najib tumpuan langkawi',
      'april berbangkit status memegang jawatan umno',
      'membantu negara negara maju bidang perancangan ekonomi kewangan perdagangan pertanian pendidikan latihan teknikal industri diplomasi']

Get topics as string

def get_topics(self, len_topic: int):
    """
    Return important topics based on decomposition.

    Parameters
    ----------
    len_topic: int
        size of topics.

    Returns
    -------
    result: List[str]
    """

[12]: lda2vec.get_topics(10)
[12]: [(0, 'bank dakwaan wang dakwaan pemindahan akaun bank pemindahan wang menangguhkan menangguhkan kebenaran subjek menjadikan ranking persendirian dibenarkan lingkup menjadikan subjek wang bank dolar bank milik luas menjadikan'),
      (1, 'bank negara dakwaan bank dakwaan wang bank rhb syarikat kerajaan berkaitan penilaian tahunan ditawarkan bank milik syarikat subjek menjadikan ranking bank pemindahan wang mendedahkan ruang had dolar bank milik'),
      (2, 'ros perlembagaan lancar status dihentikan berjalan sedar mengambil sahkan perolehi keputusan berjalan lancar pilihan'),
      (3, 'bank dakwaan wang subjek menjadikan ranking menangguhkan menangguhkan kebenaran menjadikan subjek mencadangkan pembangkang azizah bank rhb syarikat dakwaan pemindahan akaun wang dolar bank negara dakwaan jabatan malaysia'),
      (4, 'ros perlembagaan berjalan lancar pilihan dihentikan status sedar mengambil sahkan perolehi keputusan berjalan lancar'),
      (5, 'ros perlembagaan dihentikan lancar sedar mengambil rakyat berjalan status sahkan perolehi keputusan berjalan lancar'),
      (6, 'ros perlembagaan lancar sedar mengambil status rakyat berjalan lancar berjalan pilihan dihentikan'),
      (7, 'bank dakwaan wang menjadikan subjek bank rhb syarikat subjek menjadikan ranking menangguhkan menangguhkan kebenaran bank milik syarikat dakwaan pemindahan akaun kerajaan berkaitan dolar bank milik jabatan malaysia'),
      (8, 'ros perlembagaan lancar sedar mengambil pilihan berjalan lancar kebenaran sahkan perolehi keputusan berjalan rakyat'),
      (9, 'ros perlembagaan lancar berjalan sedar mengambil status dihentikan sahkan perolehi keputusan pilihan rakyat')]

Visualize topics

This will initiate a pyLDAvis object; to understand pyLDAvis more, read https://github.com/bmabey/pyLDAvis.

def visualize_topics(self, notebook_mode: bool = False, mds: str = 'pcoa'):
    """
    Print important topics based on decomposition.

    Parameters
    ----------
    mds : str, optional (default='pcoa')
        2D Decomposition. Allowed values:

        * ``'pcoa'`` - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
        * ``'mmds'`` - Dimension reduction via Multidimensional scaling
        * ``'tsne'`` - Dimension reduction via t-distributed stochastic neighbor embedding
    """

[15]: lda2vec.visualize_topics(notebook_mode=True)
[15]: PreparedData(topic_coordinates=              x         y  topics  cluster       Freq
      topic
      2      0.199065  0.003735       1        1  11.141104
      5      0.148336 -0.006479       2        1  11.104436
      9      0.158898 -0.003680       3        1  11.097365
      4      0.115346  0.008548       4        1  10.745593
      6      0.073904 -0.001631       5        1  10.744272
      8      0.022029 -0.000809       6        1  10.435264
      1     -0.141205  0.009112       7        1   8.966058
      0     -0.154737 -0.010546       8        1   8.874525
      7     -0.194209  0.000224       9        1   8.538020
      3     -0.227427  0.001526      10        1   8.353364, topic_info=
                                            Term       Freq      Total Category  logprob  loglift
      13313                                  ros  25.000000  25.000000  Default  30.0000  30.0000
      11954                         perlembagaan  21.000000  21.000000  Default  29.0000  29.0000
      6611                                lancar  14.000000  14.000000  Default  28.0000  28.0000
      9101                     menjadikan subjek   7.000000   7.000000  Default  27.0000  27.0000
      14440            subjek menjadikan ranking   7.000000   7.000000  Default  26.0000  26.0000
      ...                                    ...        ...        ...      ...      ...      ...
      11888       perjanjian keselamatan lawatan   1.280732   3.392128  Topic10  -7.0652   1.5085
      6914                        luas subjek qs   1.368553   3.673588  Topic10  -6.9989   1.4951
      1061                   bank milik syarikat   1.615340   4.865867  Topic10  -6.8331   1.3798
      8449                 mendedahkan ruang had   1.695390   5.254410  Topic10  -6.7847   1.3514
      12556                      positif membina   1.328130   3.781593  Topic10  -7.0288   1.4361

      [762 rows x 6 columns], token_table=       Topic      Freq                    Term
      term
      69         1  0.362784                   adnan
      69         2  0.181392                   adnan
      69         3  0.181392                   adnan
      69         4  0.181392                   adnan
      69         5  0.181392                   adnan
      ...      ...       ...                     ...
      15956      1  0.346625  wang keluarga putihnya
      15956      2  0.346625  wang keluarga putihnya
      15956      3  0.346625  wang keluarga putihnya
      16017      1  0.368357                   wujud
      16017      3  0.368357                   wujud

      [1123 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[3, 6, 10, 5, 7, 9, 2, 1, 8, 4])
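Since the returned object is a standard pyLDAvis PreparedData (as shown above), you should also be able to export it to a standalone HTML file with the usual pyLDAvis API; a minimal sketch (the output file name is just an example):

import pyLDAvis

prepared = lda2vec.visualize_topics(notebook_mode=True)
pyLDAvis.save_html(prepared, 'lda2vec-topics.html')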


9.56.3 Train SKLearn LDA model

[16]: from sklearn.decomposition import LatentDirichletAllocation

      lda = malaya.topic_model.sklearn(
          corpus,
          LatentDirichletAllocation,
          vectorizer=vectorizer,
          n_topics=10,
      )
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)

Print topics

def top_topics(
    self, len_topic: int, top_n: int = 10, return_df: bool = True
):
    """
    Print important topics based on decomposition.

    Parameters
    ----------
    len_topic: int
        size of topics.
    top_n: int, optional (default=10)
        top n of each topic.
    return_df: bool, optional (default=True)
        return as pandas.DataFrame, else JSON.
    """

[17]: lda.top_topics(5, top_n=10, return_df=True)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)
[17]:                  topic 0     topic 1        topic 2        topic 3
      0                 negara    malaysia         rakyat          sukan
      1                    mdb      negara         negara          saham
      2               malaysia       parti           umno     sukan suka
      3             perniagaan    kerajaan      keputusan        berlaku
      4                   ahli         mdb         hutang       kerajaan
      5          negara bidang     menteri       tindakan         rendah
      6            negara maju      bidang            isu    kepentingan
      7 membantu negara bidang        kuok            air         sumber
      8        membantu negara     berlaku  hutang hutang   meningkatkan
      9     negara maju bidang  pendidikan         negeri  diterjemahkan

                    topic 4
      0             menteri
      1              rakyat
      2            kerajaan
      3             perdana
      4     perdana menteri
      5            malaysia
      6                anak
      7               nilai
      8                  ph
      9               beban

Important sentences based on topics

def get_sentences(self, len_sentence: int, k: int = 0):
    """
    Return important sentences related to selected column based on decomposition.

    Parameters
    ----------
    len_sentence: int
    k: int, (default=0)
        index of decomposition matrix.

    Returns
    -------
    result: List[str]
    """

[18]: lda.get_sentences(5)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)
[18]: ['catatan itu menunjukkan exco pkr selangor elizabeth wong adun pkr chua yee ling dan ahli majlis selayang daripada pkr fok wai mun di sebuah acara perayaan cina tetapi membabitkan saya dalam gambar itu',
      'rakyat malaysia yang berfikiran waras akan ingat bagaimana mahathir memulakan serangan berita palsu terhadap mdb apabila menyatakan dengan salah bahawa rm bilion telah hilang hanya untuk dibuktikan berulang kali bahawa kenyataan itu adalah salah',
      'sehingga hari ini selain dakwaan asas yang terkandung dalam tuntutan sivil itu doj belum mengemukakan sebarang bukti kukuh bahawa jho low ialah pemilik sebenar kapal mewah itu ataupun ia dibeli menggunakan dana daripada mdb',
      'sebagai negara yang menandatangani who pertubuhan kesihatan sedunia kami juga komited untuk mencapai strategi sektor kesihatan global dengan matlamat untuk menghapuskan viral hepatitis menjelang tahun',
      'mdb berulang kali menjelaskan bahawa walaupun ia mempunyai urusan perniagaan dengan aabar bvi mdb tidak mempunyai sebarang urusan perniagaan dengan jho low dan yang lebih penting mdb bukanlah pihak dalam tuntutan sivil doj']


Get topics

def get_topics(self, len_topic: int):
    """
    Return important topics based on decomposition.

    Parameters
    ----------
    len_topic: int

    Returns
    -------
    result: List[str]
    """

[19]: lda.get_topics(10)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)
[19]: [(0, 'negara mdb malaysia perniagaan ahli negara bidang negara maju membantu negara bidang membantu negara negara maju bidang'),
      (1, 'malaysia negara parti kerajaan mdb menteri bidang kuok berlaku pendidikan'),
      (2, 'rakyat negara umno keputusan hutang tindakan isu air hutang hutang negeri'),
      (3, 'sukan saham sukan suka berlaku kerajaan rendah kepentingan sumber meningkatkan diterjemahkan'),
      (4, 'menteri rakyat kerajaan perdana perdana menteri malaysia anak nilai ph beban'),
      (5, 'kerajaan malaysia dana negara pendapatan asli peningkatan awam usaha tertinggi'),
      (6, 'projek masyarakat harga isu rm malaysia rakyat hutang dijual sokongan'),
      (7, 'pembangunannya negara selatan negara selatan negara malaysia berkongsi pengalaman berkongsi pengalaman pengalaman pembangunannya negara berkongsi pengalaman negara pembangunannya negara'),
      (8, 'projek negara parti syarikat kerajaan harapan undi malaysia berjalan asli'),
      (9, 'parti bahasa faktor berita perlembagaan umno kelulusan amanah pas islam')]


Visualize topics

def visualize_topics(self, notebook_mode: bool = False, mds: str = 'pcoa'):
    """
    Print important topics based on decomposition.

    Parameters
    ----------
    mds : str, optional (default='pcoa')
        2D Decomposition. Allowed values:

        * ``'pcoa'`` - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
        * ``'mmds'`` - Dimension reduction via Multidimensional scaling
        * ``'tsne'`` - Dimension reduction via t-distributed stochastic neighbor embedding
    """

[20]: lda.visualize_topics(notebook_mode=True)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)
[20]: PreparedData(topic_coordinates=              x         y  topics  cluster       Freq
      topic
      0     -0.121414  0.120838       1        1  12.155887
      8      0.121421  0.008191       2        1  12.127153
      1     -0.094837 -0.150771       3        1  12.120930
      6      0.036948 -0.021168       4        1  10.627565
      7      0.012586  0.017853       5        1  10.233161
      4      0.018758 -0.007332       6        1   9.582676
      3      0.002527  0.011784       7        1   9.234506
      2      0.001481  0.019098       8        1   9.193199
      5      0.009064  0.001161       9        1   7.791288
      9      0.013466  0.000347      10        1   6.933636, topic_info=
                                       Term      Freq     Total Category  logprob  loglift
      10784  pembangunannya negara selatan  4.000000  4.000000  Default   30.000  30.0000
      9966                  negara selatan  4.000000  4.000000  Default   29.000  29.0000
      12683                         projek  8.000000  8.000000  Default   28.000  28.0000
      15683                           umno  6.000000  6.000000  Default   27.000  27.0000
      9232                         menteri  9.000000  9.000000  Default   26.000  26.0000
      ...                              ...       ...       ...      ...      ...      ...
      12520                        politik  0.933800  3.714995  Topic10   -7.224   1.2879
      10490                      pelaburan  0.933799  2.001124  Topic10   -7.224   1.9066
      4266                            ilmu  0.933799  2.449582  Topic10   -7.224   1.7044
      10378                      pas parti  0.933799  2.001433  Topic10   -7.224   1.9064
      2262                            buku  0.933799  1.943558  Topic10   -7.224   1.9358

      [766 rows x 6 columns], token_table=       Topic      Freq                    Term
      term
      108        1  0.333772                    ahli
      108        3  0.166886                    ahli
      108        4  0.166886                    ahli
      108        5  0.166886                    ahli
      108        9  0.166886                    ahli
      ...      ...       ...                     ...
      15820      8  0.410521                   usaha
      15820      9  0.410521                   usaha
      15842      9  0.995617    usaha penambahbaikan
      15860      6  0.633754              usia beban
      15863      6  0.633754   usia beban pemikiran

      [1044 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[1, 9, 2, 7, 8, 5, 4, 3, 6, 10])

9.57 Clustering

This tutorial is available as an IPython notebook at Malaya/example/clustering.

This module is visualized using matplotlib as a static image, which can produce a saturated graph. Use the returned values to visualize with a dynamic plotting library instead.

[1]: %%time
     import malaya
CPU times: user 4.83 s, sys: 722 ms, total: 5.55 s
Wall time: 4.92 s

9.57.1 Cluster same word structure based on POS and Entities

[2]: string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

[3]: entity = malaya.entity.transformer(model='albert', quantized=True)
     pos = malaya.pos.transformer(model='albert', quantized=True)


WARNING:root:Load quantized model will cause accuracy drop.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:loading sentence piece model
INFO:tensorflow:loading sentence piece model
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model
INFO:tensorflow:loading sentence piece model

[4]: result_entities = entity.predict(string)
     result_pos = pos.predict(string)


Generate Ngram using POS and Entity tagging

def pos_entities_ngram(
    result_pos: List[Tuple[str, str]],
    result_entities: List[Tuple[str, str]],
    ngram: Tuple[int, int] = (1, 3),
    accept_pos: List[str] = ['NOUN', 'PROPN', 'VERB'],
    accept_entities: List[str] = [
        'law',
        'location',
        'organization',
        'person',
        'time',
    ],
):
    """
    generate ngrams.

    Parameters
    ----------
    result_pos : List[Tuple[str, str]]
        result from POS recognition.
    result_entities : List[Tuple[str, str]]
        result of Entities recognition.
    ngram : Tuple[int, int]
        ngram sizes.
    accept_pos : List[str]
        accepted POS elements.
    accept_entities : List[str]
        accept entities elements.

    Returns
    -------
    result: list
    """

[5]: generated_grams = malaya.generator.pos_entities_ngram(
         result_pos,
         result_entities,
         ngram=(1, 3),
         accept_pos=['NOUN', 'PROPN', 'VERB'],
         accept_entities=['law', 'location', 'organization', 'person', 'time'],
     )
     generated_grams
[5]: ['tidur sebentar', 'Jabatan Keselamatan', 'menitipkan', 'pesanan', '(JKJR)',
     'Menteri Tun', 'Jalan Raya', 'Raya', 'Mahathir Mohamad Menteri',
     'Mahathir menasihati berhenti', 'berehat', 'pesanan orang pulang',
     'kampung halaman', 'halaman video terbitan', 'Tun Dr Mahathir',
     'terbitan Jabatan Keselamatan', 'video terbitan', '(JKJR) Dr Mahathir', 'video',
     'Siew Fook menitipkan', 'Loke', 'LUMPUR: sambutan Aidilfitri', 'Siew Fook',
     'Menteri', 'berehat tidur sebentar', 'Raya (JKJR) Dr', 'sebentar', 'Tun',
     'Menteri Pengangkutan Anthony', 'Jabatan', 'orang pulang kampung',
     'Jabatan Keselamatan Jalan', 'memandu.', 'Loke Siew Fook', 'Mohamad Menteri',
     'Menteri Pengangkutan', 'terbitan', 'Keselamatan Jalan Raya', 'KUALA',
     'Pengangkutan Anthony', 'minggu Perdana', 'berhenti berehat tidur', 'Fook',
     'Tun Dr', 'Dr Mahathir', 'halaman', 'Dr Mahathir menasihati', 'Aidilfitri minggu',
     'Anthony Loke Siew', 'LUMPUR:', 'minggu Perdana Menteri', 'Dr Mahathir Mohamad',
     'menasihati', 'Perdana Menteri', 'Pengangkutan Anthony Loke', 'Jalan Raya (JKJR)',
     'Raya (JKJR)', 'sebentar memandu.', 'KUALA LUMPUR: sambutan',
     'video terbitan Jabatan', 'Perdana Menteri Tun', 'KUALA LUMPUR:', 'kampung',
     'Mahathir', '(JKJR) Dr', 'orang', 'Keselamatan Jalan', 'halaman video',
     'sambutan', 'Mohamad', 'Anthony Loke', 'pulang kampung halaman',
     'Mahathir Mohamad', 'Pengangkutan', 'Anthony', 'tidur', 'Fook menitipkan',
     'menitipkan pesanan orang', 'sambutan Aidilfitri', 'sambutan Aidilfitri minggu',
     'pulang', 'Mahathir menasihati', 'pulang kampung', 'berhenti', 'pesanan orang',
     'Keselamatan', 'Jalan', 'Aidilfitri', 'Siew', 'menitipkan pesanan',
     'Mohamad Menteri Pengangkutan', 'menasihati berhenti', 'kampung halaman video',
     'Perdana', 'Fook menitipkan pesanan', 'berehat tidur', 'Aidilfitri minggu Perdana',
     'minggu', 'orang pulang', 'Menteri Tun Dr', 'tidur sebentar memandu.',
     'berhenti berehat', 'menasihati berhenti berehat', 'Loke Siew', 'terbitan Jabatan',
     'Dr', 'LUMPUR: sambutan']

Cluster similar sentences based on Unigram

def cluster_words(list_words: List[str], lowercase: bool = False):
    """
    cluster similar words based on structure, eg,
    ['mahathir mohamad', 'mahathir'] = ['mahathir mohamad'].
    big O = n^2

    Parameters
    ----------
    list_words : List[str]
    lowercase: bool, optional (default=False)
        if True, will group using lowercase but maintain the original form.

    Returns
    -------
    string: List[str]
    """

[6]: malaya.cluster.cluster_words(generated_grams)
[6]: ['berehat tidur sebentar', 'Raya (JKJR) Dr', 'menitipkan pesanan orang',
     'Anthony Loke Siew', 'sambutan Aidilfitri minggu', 'minggu Perdana Menteri',
     'Dr Mahathir Mohamad', 'Menteri Pengangkutan Anthony', 'Pengangkutan Anthony Loke',
     'Mahathir Mohamad Menteri', 'Jalan Raya (JKJR)', 'Mahathir menasihati berhenti',
     'pesanan orang pulang', 'orang pulang kampung', 'KUALA LUMPUR: sambutan',
     'Jabatan Keselamatan Jalan', 'video terbitan Jabatan', 'Loke Siew Fook',
     'halaman video terbitan', 'Keselamatan Jalan Raya', 'Perdana Menteri Tun',
     'Mohamad Menteri Pengangkutan', 'kampung halaman video', 'berhenti berehat tidur',
     'Fook menitipkan pesanan', 'Tun Dr Mahathir', 'Aidilfitri minggu Perdana',
     'terbitan Jabatan Keselamatan', 'tidur sebentar memandu.', 'Menteri Tun Dr',
     '(JKJR) Dr Mahathir', 'menasihati berhenti berehat', 'Siew Fook menitipkan',
     'pulang kampung halaman', 'Dr Mahathir menasihati', 'LUMPUR: sambutan Aidilfitri']

9.57.2 Cluster Part-Of-Speech

def cluster_pos(result: List[Tuple[str, str]]):
    """
    cluster similar POS.

    Parameters
    ----------
    result: List[Tuple[str, str]]

    Returns
    -------
    result: Dict[str, List[str]]
    """


[7]: malaya.cluster.cluster_pos(result_pos) [7]: {'ADJ': ['depan,', 'khas', 'ramai', 'pendek', 'mengantuk'], 'ADP': ['Sempena', 'kepada', 'ke', 'Dalam'], 'ADV': ['mahu'], 'ADX': [], 'CCONJ': ['dan'], 'DET': ['masing-masing.', 'itu,'], 'NOUN': ['sambutan Aidilfitri minggu', 'pesanan', 'orang', 'kampung halaman', 'video', 'terbitan', 'sebentar'], 'NUM': [], 'PART': [], 'PRON': ['yang', 'mereka'], 'PROPN': ['KUALA LUMPUR:', 'Perdana Menteri Tun Dr Mahathir Mohamad', 'Menteri Pengangkutan Anthony Loke Siew Fook', 'Jabatan Keselamatan Jalan Raya', 'Dr Mahathir'], 'PUNCT': ['(JKJR)'], 'SCONJ': ['supaya', 'sekiranya', 'ketika'], 'SYM': [], 'VERB': ['menitipkan', 'pulang', 'menasihati', 'berhenti berehat', 'tidur', 'memandu.'], 'X': []}
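The grouping behaviour is easy to emulate: merge consecutive tokens that share a tag into one phrase, then bucket the phrases by tag. A hedged sketch; cluster_tags_sketch is hypothetical and the real cluster_pos may differ in details:

from collections import defaultdict

def cluster_tags_sketch(tagged):
    # tagged: List[Tuple[word, tag]]; consecutive tokens with the same tag
    # are merged into one phrase, then bucketed by tag
    clusters = defaultdict(list)
    current_words, current_tag = [], None
    for word, tag in tagged:
        if current_words and tag != current_tag:
            clusters[current_tag].append(' '.join(current_words))
            current_words = []
        current_words.append(word)
        current_tag = tag
    if current_words:
        clusters[current_tag].append(' '.join(current_words))
    return dict(clusters)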

9.57.3 Cluster Entities

def cluster_entities(result: List[Tuple[str, str]]):
    """
    cluster similar Entities.

    Parameters
    ----------
    result: List[Tuple[str, str]]

    Returns
    -------
    result: Dict[str, List[str]]
    """

[8]: malaya.cluster.cluster_entities(result_entities)
[8]: {'OTHER': ['Sempena sambutan',
  'minggu depan,',
  'dan',
  'menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan',
  'itu,',
  'menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'],
 'law': [],
 'location': ['KUALA LUMPUR:'],
 'organization': ['Jabatan Keselamatan Jalan Raya (JKJR)'],
 'person': ['Perdana Menteri Tun Dr Mahathir Mohamad',
  'Menteri Pengangkutan Anthony Loke Siew Fook',
  'Dr Mahathir'],
 'quantity': [],
 'time': [],
 'event': ['Aidilfitri'],
 'X': []}

9.57.4 Load example data

[4]: %matplotlib inline

     import pandas as pd
     df = pd.read_csv('tests/02032018.csv', sep = ';')
     df = df.iloc[3:, 1:]
     df.columns = ['text', 'label']
     corpus = df.text.tolist()

You can get this file at Malaya/tests. This CSV is already stemmed.

[5]: model = malaya.sentiment.transformer(model = 'alxlnet', quantized = True)
     similarity_model = malaya.similarity.transformer(model = 'alxlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:root:Load quantized model will cause accuracy drop.

9.57.5 Generate scatter plot for unsupervised clustering

def cluster_scatter(
    corpus: List[str],
    vectorizer,
    num_clusters: int = 5,
    titles: List[str] = None,
    colors: List[str] = None,
    stopwords = get_stopwords,
    cleaning = simple_textcleaning,
    clustering = KMeans,
    decomposition = MDS,
    ngram: Tuple[int, int] = (1, 3),
    figsize: Tuple[int, int] = (17, 9),
    batch_size: int = 20,
):
    """
    plot scatter plot on similar text clusters.

    Parameters
    ----------
    corpus: List[str]
    vectorizer: class
        vectorizer class.
    num_clusters: int, (default=5)
        size of unsupervised clusters.
    titles: List[str], (default=None)
        list of titles, length must same with corpus.
    colors: List[str], (default=None)
        list of colors, length must same with num_clusters.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]
    ngram: Tuple[int, int], (default=(1,3))
        n-grams size to train a corpus.
    cleaning: function, (default=malaya.texts.function.simple_textcleaning)
        function to clean the corpus.
    batch_size: int, (default=20)
        size of strings for each vectorization and attention. Only useful if use transformer vectorizer.

    Returns
    -------
    dictionary: {'X': X, 'Y': Y, 'labels': clusters, 'vector': transformed_text_clean, 'titles': titles}
    """

[11]: result_scatter = malaya.cluster.cluster_scatter(corpus, vectorizer = model)
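Since batch_size is documented as useful only for transformer vectorizers, a plain scikit-learn vectorizer should also be accepted. A hedged sketch, assuming any object exposing the usual fit/transform interface works (unverified against every Malaya version):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range = (1, 3), max_df = 0.95, min_df = 2)
result_tfidf = malaya.cluster.cluster_scatter(corpus, vectorizer = tfidf, num_clusters = 5)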


9.57.6 Generate dendrogram plot for unsupervised clustering

def cluster_dendogram(
    corpus: List[str],
    vectorizer,
    titles: List[str] = None,
    stopwords = get_stopwords,
    cleaning = simple_textcleaning,
    random_samples: float = 0.3,
    ngram: Tuple[int, int] = (1, 3),
    figsize: Tuple[int, int] = (17, 9),
    batch_size: int = 20,
):
    """
    plot hierarchical dendogram with similar texts.

    Parameters
    ----------
    corpus: List[str]
    vectorizer: class
        vectorizer class.
    titles: List[str], (default=None)
        list of titles, length must same with corpus.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str], or a List[str], or a Tuple[str]
    cleaning: function, (default=malaya.text.function.simple_textcleaning)
        function to clean the corpus.
    random_samples: float, (default=0.3)
        random samples from the corpus, 0.3 means 30%.
    ngram: Tuple[int, int], (default=(1,3))
        n-grams size to train a corpus.
    batch_size: int, (default=20)
        size of strings for each vectorization and attention. Only useful if use transformer vectorizer.

    Returns
    -------
    dictionary: {'linkage_matrix': linkage_matrix, 'titles': titles}
    """

[12]: result_scatter = malaya.cluster.cluster_dendogram(corpus, vectorizer = model)
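Because the function returns the raw linkage matrix, the dendrogram can be re-plotted or customised outside Malaya with scipy. A minimal sketch, assuming the 'linkage_matrix' and 'titles' keys documented above:

import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

result = malaya.cluster.cluster_dendogram(corpus, vectorizer = model)
plt.figure(figsize = (17, 9))
hierarchy.dendrogram(result['linkage_matrix'], labels = result['titles'])
plt.show()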


9.57.7 Generate undirected graph for unsupervised clustering

def cluster_graph(
    corpus: List[str],
    vectorizer,
    threshold: float = 0.9,
    num_clusters: int = 5,
    titles: List[str] = None,
    colors: List[str] = None,
    stopwords = get_stopwords,
    ngram: Tuple[int, int] = (1, 3),
    cleaning = simple_textcleaning,
    clustering = KMeans,
    figsize: Tuple[int, int] = (17, 9),
    with_labels: bool = True,
    batch_size: int = 20,
):
    """
    plot undirected graph with similar texts.

    Parameters
    ----------
    corpus: List[str]
    vectorizer: class
        vectorizer class.
    threshold: float, (default=0.9)
        0.9 means, 90% above absolute pearson correlation.
    num_clusters: int, (default=5)
        size of unsupervised clusters.
    titles: List[str], (default=None)
        list of titles, length must same with corpus.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str] or List[str] or Tuple[str].
    cleaning: function, (default=malaya.texts.function.simple_textcleaning)
        function to clean the corpus.
    ngram: Tuple[int, int], (default=(1,3))
        n-grams size to train a corpus.
    batch_size: int, (default=20)
        size of strings for each vectorization and attention. Only useful if use transformer vectorizer.

    Returns
    -------
    dictionary: {'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}
    """

[15]: result_scatter = malaya.cluster.cluster_graph(corpus, vectorizer = similarity_model, threshold = 0.9)
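The threshold parameter decides which pairs of texts get an edge: two texts are connected only when the absolute pearson correlation of their vectors is at least the threshold. A small numpy sketch of that rule, where the vectors are random stand-ins for real embeddings:

import numpy as np

vectors = np.random.rand(5, 768)                # stand-in text embeddings
correlation = np.abs(np.corrcoef(vectors))      # 5 x 5 pairwise correlation matrix
adjacency = (correlation >= 0.9) & ~np.eye(5, dtype = bool)
edges = list(zip(*np.where(adjacency)))         # pairs of texts that share an edge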

9.57.8 Generate undirected graph for Entities and topics relationship

def cluster_entity_linking(
    corpus: List[str],
    vectorizer,
    entity_model,
    topic_modeling_model,
    threshold: float = 0.3,
    topic_decomposition: int = 2,
    topic_length: int = 10,
    fuzzy_ratio: int = 70,
    accepted_entities: List[str] = [
        'law',
        'location',
        'organization',
        'person',
        'event',
    ],
    cleaning = simple_textcleaning,
    colors: List[str] = None,
    stopwords = get_stopwords,
    max_df: float = 1.0,
    min_df: int = 1,
    ngram: Tuple[int, int] = (2, 3),
    figsize: Tuple[int, int] = (17, 9),
    batch_size: int = 20,
):
    """
    plot undirected graph for Entities and topics relationship.

    Parameters
    ----------
    corpus: list or str
    vectorizer: class
    titles: list
        list of titles, length must same with corpus.
    colors: list
        list of colors, length must same with num_clusters.
    threshold: float, (default=0.3)
        0.3 means, 30% above absolute pearson correlation.
    topic_decomposition: int, (default=2)
        size of decomposition.
    topic_length: int, (default=10)
        size of topic models.
    fuzzy_ratio: int, (default=70)
        size of ratio for fuzzywuzzy.
    max_df: float, (default=1.0)
        maximum of a word selected based on document frequency.
    min_df: int, (default=1)
        minimum of a word selected on based on document frequency.
    ngram: tuple, (default=(2,3))
        n-grams size to train a corpus.
    cleaning: function, (default=simple_textcleaning)
        function to clean the corpus.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returned a List[str] or List[str] or Tuple[str]

    Returns
    -------
    dictionary: {'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}
    """

[6]: from sklearn.feature_extraction.text import TfidfVectorizer

     tf_vectorizer = TfidfVectorizer(
         ngram_range = (1, 3),
         min_df = 2,
         max_df = 0.95,
     )
     topic_model = malaya.topic_model.lda

     result_linking = malaya.cluster.cluster_entity_linking(
         corpus,
         tf_vectorizer,
         entity,
         topic_model,
     )
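fuzzy_ratio controls how aggressively entities are matched against topic keywords: a link is kept only when the fuzzywuzzy similarity score reaches the given value. A quick illustration; the exact scores are approximate:

from fuzzywuzzy import fuzz

# with fuzzy_ratio = 70, a pair is linked only when fuzz.ratio(a, b) >= 70
fuzz.ratio('Dr Mahathir', 'Mahathir')      # roughly 84, linked
fuzz.ratio('Dr Mahathir', 'Anthony Loke')  # much lower, not linked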

9.58 Stacking

This tutorial is available as an IPython notebook at Malaya/example/stacking.

9.58.1 Why Stacking?

Sometimes a single model is not good enough, so you need to use multiple models to get a better result! This is called stacking.

[1]: %%time
     import malaya
CPU times: user 5.67 s, sys: 1.34 s, total: 7 s
Wall time: 7.97 s
/Users/huseinzolkepli/Documents/Malaya/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))


[3]: albert = malaya.sentiment.transformer('albert', quantized = True)
     alxlnet = malaya.sentiment.transformer('alxlnet', quantized = True)
     multinomial = malaya.sentiment.multinomial()
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model
INFO:tensorflow:loading sentence piece model
WARNING:root:Load quantized model will cause accuracy drop.

9.58.2 Stack multiple sentiment models

malaya.stack.predict_stack provides an easy stacking solution for Malaya models. And not just for sentiment models, any classification model can use malaya.stack.predict_stack.

def predict_stack(models, strings: List[str], aggregate: Callable = gmean, **kwargs):
    """
    Stacking for predictive models.

    Parameters
    ----------
    models: List[Callable]
        list of models.
    strings: List[str]
    aggregate : Callable, optional (default=scipy.stats.mstats.gmean)
        Aggregate function.

    Returns
    -------
    result: dict
    """

[4]: malaya.stack.predict_stack([albert, multinomial, alxlnet], ['harga minyak tak menentu']) [4]: [{'negative': 0.5016266912464752, 'positive': 4.4445397894955644e-05, 'neutral': 0.004399656207132555}]
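The default aggregate is the geometric mean, so the stacked score for each label is simply gmean over the per-model probabilities. A tiny sketch with made-up scores:

import numpy as np
from scipy.stats.mstats import gmean

# hypothetical 'negative' probabilities from three models
negative_scores = np.array([0.7, 0.3, 0.9])
gmean(negative_scores)  # single stacked 'negative' score, roughly 0.57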

To disable neutral, simply pass add_neutral = False.

[5]: malaya.stack.predict_stack([albert, multinomial, alxlnet], ['harga minyak tak menentu'], add_neutral= False) [5]: [{'negative': 0.8257116478969977, 'positive': 0.0016922961136002735}]
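Since aggregate is documented as any Callable, the geometric mean can also be swapped out, for example for scipy's harmonic mean, assuming the callable is applied the same way as the default gmean:

from scipy.stats.mstats import hmean

malaya.stack.predict_stack(
    [albert, multinomial, alxlnet],
    ['harga minyak tak menentu'],
    aggregate = hmean,
)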


9.58.3 Stack tagging models

For tagging models, we use majority voting stacking, so you need more than 2 models for it to work well; otherwise it will pick randomly between 2 models. malaya.stack.voting_stack provides an easy interface for this kind of stacking, but it can only be used for Entities, POS and Dependency Parsing recognition. A minimal sketch of the voting idea follows the docstring below.

def voting_stack(models, text):
    """
    Stacking for POS and Entities Recognition models.

    Parameters
    ----------
    models: list
        list of models
    text: str
        string to predict

    Returns
    -------
    result: list
    """
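Majority voting itself is simple: align the tag sequences token by token and pick the most common tag. A minimal sketch of the idea; voting_sketch is hypothetical, not Malaya's exact code:

from collections import Counter

def voting_sketch(predictions):
    # predictions: one tag sequence per model, all over the same tokens
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*predictions)]

voting_sketch([
    ['PROPN', 'NOUN', 'VERB'],
    ['PROPN', 'PROPN', 'VERB'],
    ['PROPN', 'PROPN', 'NOUN'],
])  # ['PROPN', 'PROPN', 'VERB']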

[9]: string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

     albert = malaya.pos.transformer('albert')
     bert = malaya.pos.transformer('bert')
     malaya.stack.voting_stack([albert, bert], string)
[9]: [('Kuala', 'PROPN'),
 ('Lumpur:', 'PROPN'),
 ('Sempena', 'ADP'),
 ('sambutan', 'NOUN'),
 ('Aidilfitri', 'PROPN'),
 ('minggu', 'NOUN'),
 ('depan,', 'ADJ'),
 ('Perdana', 'PROPN'),
 ('Menteri', 'PROPN'),
 ('Tun', 'PROPN'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('Mohamad', 'PROPN'),
 ('dan', 'CCONJ'),
 ('Menteri', 'PROPN'),
 ('Pengangkutan', 'PROPN'),
 ('Anthony', 'PROPN'),
 ('Loke', 'PROPN'),
 ('Siew', 'PROPN'),
 ('Fook', 'PROPN'),
 ('menitipkan', 'VERB'),
 ('pesanan', 'NOUN'),
 ('khas', 'ADJ'),
 ('kepada', 'ADP'),
 ('orang', 'NOUN'),
 ('ramai', 'ADJ'),
 ('yang', 'PRON'),
 ('mahu', 'ADV'),
 ('pulang', 'VERB'),
 ('ke', 'ADP'),
 ('kampung', 'NOUN'),
 ('halaman', 'NOUN'),
 ('masing-masing.', 'DET'),
 ('Dalam', 'ADP'),
 ('video', 'NOUN'),
 ('pendek', 'ADJ'),
 ('terbitan', 'NOUN'),
 ('Jabatan', 'PROPN'),
 ('Keselamatan', 'PROPN'),
 ('Jalan', 'PROPN'),
 ('Raya', 'PROPN'),
 ('(JKJR)', 'PUNCT'),
 ('itu,', 'DET'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('menasihati', 'VERB'),
 ('mereka', 'PRON'),
 ('supaya', 'SCONJ'),
 ('berhenti', 'VERB'),
 ('berehat', 'VERB'),
 ('dan', 'CCONJ'),
 ('tidur', 'VERB'),
 ('sebentar', 'ADV'),
 ('sekiranya', 'SCONJ'),
 ('mengantuk', 'NOUN'),
 ('ketika', 'SCONJ'),
 ('memandu.', 'VERB')]

[10]: string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

      xlnet = malaya.dependency.transformer(model = 'xlnet')
      alxlnet = malaya.dependency.transformer(model = 'alxlnet')

[11]: tagging, indexing = malaya.stack.voting_stack([xlnet, xlnet, alxlnet], string)
      malaya.dependency.dependency_graph(tagging, indexing).to_graphvis()


9.59 Finetune ALXLNET-Bahasa

This tutorial is available as an IPython notebook at Malaya/finetune/alxlnet.

In this notebook, I am going to show how to finetune pretrained ALXLNET-Bahasa using Tensorflow Estimator. TF-Estimator is a great module created by the Tensorflow team to train a model over a very long period.

[1]: # !pip3 install tensorflow==1.15

9.59.1 Download pretrained model

https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/alxlnet#download. In this example, we are going to try the BASE size. Just uncomment below to download the pretrained model and tokenizer.

[2]: # !wget https://f000.backblazeb2.com/file/malaya-model/bert-bahasa/alxlnet-base-500k-20-10-2020.gz
     # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/preprocess/sp10m.cased.v9.model
     # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/alxlnet/config/alxlnet-base_config.json
     # !tar -zxf alxlnet-base-500k-20-10-2020.gz
     !ls
__pycache__                          modeling.py
alxlnet-base                         prepro_utils.py
alxlnet-base-500k-20-10-2020.gz      sp10m.cased.v9.model
alxlnet-base_config.json             tf-estimator-text-classification.ipynb
custom_modeling.py                   xlnet.py
model_utils.py

[3]: !ls alxlnet-base
model.ckpt-500000.data-00000-of-00001  model.ckpt-500000.meta
model.ckpt-500000.index

There is a helper function in malaya/finetune/utils.py to help us train the model on a single GPU or multiple GPUs.

[4]: import sys

     sys.path.insert(0, '../')
     import utils

9.59.2 Load dataset

We are just going to train on a very small bahasa news sentiment dataset.

[5]: import pandas as pd

     df = pd.read_csv('../sentiment-data-v2.csv')
     df.head()


[5]:       label                                                text
     0  Negative   Lebih-lebih lagi dengan kemudahan internet da...
     1  Positive  boleh memberi teguran kepada parti tetapi perl...
     2  Negative  Adalah membingungkan mengapa masyarakat Cina b...
     3  Positive  Kami menurunkan defisit daripada 6.7 peratus p...
     4  Negative         Ini masalahnya. Bukan rakyat, tetapi sistem

[6]: labels = df['label'].values.tolist()
     texts = df['text'].values.tolist()
     unique_labels = sorted(list(set(labels)))
     unique_labels
[6]: ['Negative', 'Positive']

[7]: import xlnet
     import numpy as np
     import tensorflow as tf
     import model_utils
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/model_utils.py:334: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

[8]: import sentencepiece as spm
     from prepro_utils import preprocess_text, encode_ids

     sp_model = spm.SentencePieceProcessor()
     sp_model.Load('sp10m.cased.v9.model')

     SEG_ID_A = 0
     SEG_ID_B = 1
     SEG_ID_CLS = 2
     SEG_ID_SEP = 3
     SEG_ID_PAD = 4

     special_symbols = {
         '<unk>': 0,
         '<s>': 1,
         '</s>': 2,
         '<cls>': 3,
         '<sep>': 4,
         '<pad>': 5,
         '<mask>': 6,
         '<eod>': 7,
         '<eop>': 8,
     }

     VOCAB_SIZE = 32000
     UNK_ID = special_symbols['<unk>']
     CLS_ID = special_symbols['<cls>']
     SEP_ID = special_symbols['<sep>']
     MASK_ID = special_symbols['<mask>']
     EOD_ID = special_symbols['<eod>']

def tokenize_fn(text):
    text = preprocess_text(text, lower = False)
    return encode_ids(sp_model, text)

def token_to_ids(text, maxlen = 512):
    tokens_a = tokenize_fn(text)
    if len(tokens_a) > maxlen - 2:
        tokens_a = tokens_a[: (maxlen - 2)]
    segment_id = [SEG_ID_A] * len(tokens_a)
    tokens_a.append(SEP_ID)
    tokens_a.append(CLS_ID)
    segment_id.append(SEG_ID_A)
    segment_id.append(SEG_ID_CLS)
    input_mask = [0.0] * len(tokens_a)
    assert len(tokens_a) == len(input_mask) == len(segment_id)
    return {
        'input_id': tokens_a,
        'input_mask': input_mask,
        'segment_id': segment_id,
    }

1. input_id, integer representation of tokenized words, sorted based on sentencepiece weightage. 2. input_mask, attention masking. During training, shorter sequences will be padded with 1, and we do not want the model to learn padded values as part of the context. https://github.com/zihangdai/xlnet/blob/master/classifier_utils.py#L113 3. segment_id, used for text pair classification; in this case, we can simply put 0.

[9]: token_to_ids(texts[0])
[9]: {'input_id': [1620, 13, 5177, 53, 33, 2808, 3168, 24, 3400, 807, 21, 16179, 31, 742, 578, 17153, 9, 4, 3],
 'input_mask': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]}

9.59.3 TF-Estimator

TF-Estimator requires 2 parts: 1. Input pipeline, https://www.tensorflow.org/api_docs/python/tf/data/Dataset 2. Model definition, https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator

9.59.4 Data pipeline

[10]: def generate():
          while True:
              for i in range(len(texts)):
                  if len(texts[i]) > 5:
                      d = token_to_ids(texts[i])
                      d['label'] = [unique_labels.index(labels[i])]
                      d.pop('tokens', None)
                      yield d

[11]: g = generate()
      next(g)
[11]: {'input_id': [1620, 13, 5177, 53, 33, 2808, 3168, 24, 3400, 807, 21, 16179, 31, 742, 578, 17153, 9, 4, 3],
 'input_mask': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
 'label': [0]}

It must be a function that returns a function.

def get_dataset(batch_size = 32, shuffle_size = 32):
    def get():
        return dataset
    return get

[12]: def get_dataset(batch_size = 32, shuffle_size = 32):
          def get():
              dataset = tf.data.Dataset.from_generator(
                  generate,
                  {'input_id': tf.int32, 'input_mask': tf.float32, 'segment_id': tf.int32, 'label': tf.int32},
                  output_shapes = {
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
              )
              dataset = dataset.shuffle(shuffle_size)
              dataset = dataset.padded_batch(
                  batch_size,
                  padded_shapes = {
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
                  padding_values = {
                      'input_id': tf.constant(0, dtype = tf.int32),
                      'input_mask': tf.constant(1.0, dtype = tf.float32),
                      'segment_id': tf.constant(4, dtype = tf.int32),
                      'label': tf.constant(0, dtype = tf.int32),
                  },
              )
              return dataset
          return get

Test data pipeline using tf.Session

[13]: tf.reset_default_graph()
      sess = tf.InteractiveSession()
      iterator = get_dataset()()
      iterator = iterator.make_one_shot_iterator().get_next()
WARNING:tensorflow:From :4: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.

[14]: iterator [14]: {'input_id': , 'input_mask': , 'segment_id': , 'label': }

[15]: sess.run(iterator)
[15]: {'input_id': array([[ 19, 4084, 1500, ..., 0, 0, 0],
        [2740, 9369, 31, ..., 0, 0, 0],
        [1084, 791, 835, ..., 0, 0, 0],
        ...,
        [ 767, 250, 51, ..., 0, 0, 0],
        [3593, 21, 7901, ..., 0, 0, 0],
        [8097, 2519, 271, ..., 0, 0, 0]], dtype=int32),
 'input_mask': array([[0., 0., 0., ..., 1., 1., 1.],
        [0., 0., 0., ..., 1., 1., 1.],
        [0., 0., 0., ..., 1., 1., 1.],
        ...,
        [0., 0., 0., ..., 1., 1., 1.],
        [0., 0., 0., ..., 1., 1., 1.],
        [0., 0., 0., ..., 1., 1., 1.]], dtype=float32),
 'segment_id': array([[0, 0, 0, ..., 4, 4, 4],
        [0, 0, 0, ..., 4, 4, 4],
        [0, 0, 0, ..., 4, 4, 4],
        ...,
        [0, 0, 0, ..., 4, 4, 4],
        [0, 0, 0, ..., 4, 4, 4],
        [0, 0, 0, ..., 4, 4, 4]], dtype=int32),
 'label': array([[0], [1], [0], [0], [1], [1], [1], [1], [1], [1], [0], [0], [0], [1], [1], [0], [1], [1], [0], [1], [0], [1], [1], [1], [1], [0], [1], [0], [1], [1], [1], [1]], dtype=int32)}

9.59.5 Model definition

It must be a function that accepts 4 parameters.

def model_fn(features, labels, mode, params):

[16]: kwargs = dict(
          is_training = True,
          use_tpu = False,
          use_bfloat16 = False,
          dropout = 0.1,
          dropatt = 0.1,
          init = 'normal',
          init_range = 0.1,
          init_std = 0.05,
          clamp_len = -1,
      )

xlnet_parameters = xlnet.RunConfig(**kwargs)
xlnet_config = xlnet.XLNetConfig(json_path = 'alxlnet-base_config.json')
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/xlnet.py:70: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

[17]: epoch = 10
      batch_size = 32
      warmup_proportion = 0.1
      num_train_steps = 10
      num_warmup_steps = int(num_train_steps * warmup_proportion)
      learning_rate = 2e-5

      training_parameters = dict(
          decay_method = 'poly',
          train_steps = num_train_steps,
          learning_rate = learning_rate,
          warmup_steps = num_warmup_steps,
          min_lr_ratio = 0.0,
          weight_decay = 0.00,
          adam_epsilon = 1e-8,
          num_core_per_host = 1,
          lr_layer_decay_rate = 1,
          use_tpu = False,
          use_bfloat16 = False,
          dropout = 0.0,
          dropatt = 0.0,
          init = 'normal',
          init_range = 0.1,
          init_std = 0.05,
          clip = 1.0,
          clamp_len = -1,
      )
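With num_train_steps = 10 and warmup_proportion = 0.1, num_warmup_steps = int(10 * 0.1) = 1, so the learning rate ramps up for the first step only and then follows the polynomial ('poly') decay schedule down towards min_lr_ratio.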

[18]: class Parameter:
          def __init__(
              self,
              decay_method,
              warmup_steps,
              weight_decay,
              adam_epsilon,
              num_core_per_host,
              lr_layer_decay_rate,
              use_tpu,
              learning_rate,
              train_steps,
              min_lr_ratio,
              clip,
              **kwargs
          ):
              self.decay_method = decay_method
              self.warmup_steps = warmup_steps
              self.weight_decay = weight_decay
              self.adam_epsilon = adam_epsilon
              self.num_core_per_host = num_core_per_host
              self.lr_layer_decay_rate = lr_layer_decay_rate
              self.use_tpu = use_tpu
              self.learning_rate = learning_rate
              self.train_steps = train_steps
              self.min_lr_ratio = min_lr_ratio
              self.clip = clip

      training_parameters = Parameter(**training_parameters)
      init_checkpoint = 'alxlnet-base/model.ckpt-500000'


[19]: def model_fn(features, labels, mode, params):
          Y = tf.cast(features['label'][:, 0], tf.int32)

          xlnet_model = xlnet.XLNetModel(
              xlnet_config = xlnet_config,
              run_config = xlnet_parameters,
              input_ids = tf.transpose(features['input_id'], [1, 0]),
              seg_ids = tf.transpose(features['segment_id'], [1, 0]),
              input_mask = tf.transpose(features['input_mask'], [1, 0]),
          )

          output_layer = xlnet_model.get_sequence_output()
          output_layer = tf.transpose(output_layer, [1, 0, 2])

          logits_seq = tf.layers.dense(output_layer, 2)
          logits = logits_seq[:, 0]

          loss = tf.reduce_mean(
              tf.nn.sparse_softmax_cross_entropy_with_logits(
                  logits = logits, labels = Y
              )
          )

          tf.identity(loss, 'train_loss')

          accuracy = tf.metrics.accuracy(
              labels = Y, predictions = tf.argmax(logits, axis = 1)
          )
          tf.identity(accuracy[1], name = 'train_accuracy')

          variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

          assignment_map, initialized_variable_names = utils.get_assignment_map_from_checkpoint(
              variables, init_checkpoint
          )

          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

          if mode == tf.estimator.ModeKeys.TRAIN:
              train_op, _, _ = model_utils.get_train_op(training_parameters, loss)
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode = mode, loss = loss, train_op = train_op
              )

          elif mode == tf.estimator.ModeKeys.EVAL:
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode = tf.estimator.ModeKeys.EVAL,
                  loss = loss,
                  eval_metric_ops = {'accuracy': accuracy},
              )

          return estimator_spec
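Note the classification head here: xlnet_model.get_sequence_output() returns a representation for every token, and logits_seq[:, 0] slices a single time step out as the sentence representation. The BERT example in the next section uses model.get_pooled_output() instead.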


9.59.6 Initiate training session

[20]: train_dataset = get_dataset()

[21]: train_hooks = [
          tf.train.LoggingTensorHook(
              ['train_accuracy', 'train_loss'], every_n_iter = 1
          )
      ]
      utils.run_training(
          train_fn = train_dataset,
          model_fn = model_fn,
          model_dir = 'finetuned-alxlnet-base',
          num_gpus = 1,
          log_step = 1,
          save_checkpoint_step = epoch,
          max_steps = epoch,
          train_hooks = train_hooks,
      )
WARNING:tensorflow:From ../utils.py:62: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From ../utils.py:62: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

INFO:tensorflow:Using config: {'_model_dir': 'finetuned-alxlnet-base', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 1, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7a5c1b2e48>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/xlnet.py:253: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/xlnet.py:253: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/custom_modeling.py:696: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.


INFO:tensorflow:memory input None
INFO:tensorflow:Use float type
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/custom_modeling.py:703: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/custom_modeling.py:808: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/tensorflow_core/python/layers/core.py:271: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/custom_modeling.py:109: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow: name = model/transformer/r_w_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/r_r_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/word_embedding/lookup_table:0, shape = (32000, 128), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/word_embedding/lookup_table_2:0, shape = (128, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/r_s_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/seg_embed:0, shape = (12, 2, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/q/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/k/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/v/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/r/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/o/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/rel_attn/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*


INFO:tensorflow: name = model/transformer/layer_shared/ff/layer_1/kernel:0, shape = (768, 3072), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/ff/layer_1/bias:0, shape = (3072,), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/ff/layer_2/kernel:0, shape = (3072, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/ff/layer_2/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/ff/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = model/transformer/layer_shared/ff/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = dense/kernel:0, shape = (768, 2)
INFO:tensorflow: name = dense/bias:0, shape = (2,)
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/model_utils.py:105: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/model_utils.py:119: The name tf.train.polynomial_decay is deprecated. Please use tf.compat.v1.train.polynomial_decay instead.

WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/model_utils.py:136: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /home/ubuntu/malay/Malaya/finetune/alxlnet/model_utils.py:150: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into finetuned-alxlnet-base/model.ckpt.
INFO:tensorflow:train_accuracy = 0.34375, train_loss = 1.0174254
INFO:tensorflow:loss = 1.0174254, step = 1
INFO:tensorflow:global_step/sec: 0.0483137
INFO:tensorflow:train_accuracy = 0.4375, train_loss = 0.7347818 (20.699 sec)
INFO:tensorflow:loss = 0.7347818, step = 2 (20.698 sec)
INFO:tensorflow:global_step/sec: 0.0969523
INFO:tensorflow:train_accuracy = 0.4375, train_loss = 0.9868502 (10.314 sec)
INFO:tensorflow:loss = 0.9868502, step = 3 (10.315 sec)
INFO:tensorflow:global_step/sec: 0.139668
INFO:tensorflow:train_accuracy = 0.453125, train_loss = 0.7655573 (7.159 sec)
INFO:tensorflow:loss = 0.7655573, step = 4 (7.159 sec)
INFO:tensorflow:global_step/sec: 0.16895
INFO:tensorflow:train_accuracy = 0.48125, train_loss = 0.7466663 (5.919 sec)
INFO:tensorflow:loss = 0.7466663, step = 5 (5.920 sec)
INFO:tensorflow:global_step/sec: 0.131785
INFO:tensorflow:train_accuracy = 0.44270834, train_loss = 0.94823694 (7.588 sec)
INFO:tensorflow:loss = 0.94823694, step = 6 (7.588 sec)
INFO:tensorflow:global_step/sec: 0.15756
INFO:tensorflow:train_accuracy = 0.42410713, train_loss = 0.8999996 (6.347 sec)
INFO:tensorflow:loss = 0.8999996, step = 7 (6.346 sec)


INFO:tensorflow:global_step/sec: 0.144356
INFO:tensorflow:train_accuracy = 0.41796875, train_loss = 0.92889994 (6.927 sec)
INFO:tensorflow:loss = 0.92889994, step = 8 (6.927 sec)
INFO:tensorflow:global_step/sec: 0.121323
INFO:tensorflow:train_accuracy = 0.41666666, train_loss = 1.0723866 (8.242 sec)
INFO:tensorflow:loss = 1.0723866, step = 9 (8.242 sec)
INFO:tensorflow:Saving checkpoints for 10 into finetuned-alxlnet-base/model.ckpt.
INFO:tensorflow:global_step/sec: 0.130973
INFO:tensorflow:train_accuracy = 0.409375, train_loss = 1.0876663 (7.635 sec)
INFO:tensorflow:loss = 1.0876663, step = 10 (7.636 sec)
INFO:tensorflow:Loss for final step: 1.0876663.


9.60 Finetune BERT-Bahasa

This tutorial is available as an IPython notebook at Malaya/finetune/bert.

In this notebook, I am going to show how to finetune pretrained BERT-Bahasa using Tensorflow Estimator. TF-Estimator is a great module created by the Tensorflow team to train a model over a very long period.

[1]: # !pip3 install bert-tensorflow==1.0.1 tensorflow==1.15

9.60.1 Download pretrained model

https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/bert#download. In this example, we are going to try the BASE size. Just uncomment below to download the pretrained model and tokenizer.

[19]: # !wget https://f000.backblazeb2.com/file/malaya-model/bert-bahasa/bert-base-2020-10-08.tar.gz
      # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/bert/BERT.wordpiece
      # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/bert/config/BASE_config.json
      # !tar -zxf bert-base-2020-10-08.tar.gz
      !ls
BASE_config.json    bert-base-2020-10-08.tar.gz
BERT.wordpiece      tf-estimator-text-classification.ipynb
bert-base

[3]: !ls bert-base
model.ckpt-1000000.data-00000-of-00001  model.ckpt-1000000.meta
model.ckpt-1000000.index

There is a helper function in malaya/finetune/utils.py to help us train the model on a single GPU or multiple GPUs.


[4]: import sys

     sys.path.insert(0, '../')
     import utils

9.60.2 Load dataset

We are just going to train on a very small bahasa news sentiment dataset.

[5]: import pandas as pd

     df = pd.read_csv('../sentiment-data-v2.csv')
     df.head()
[5]:       label                                                text
     0  Negative   Lebih-lebih lagi dengan kemudahan internet da...
     1  Positive  boleh memberi teguran kepada parti tetapi perl...
     2  Negative  Adalah membingungkan mengapa masyarakat Cina b...
     3  Positive  Kami menurunkan defisit daripada 6.7 peratus p...
     4  Negative         Ini masalahnya. Bukan rakyat, tetapi sistem

[6]: labels = df['label'].values.tolist()
     texts = df['text'].values.tolist()
     unique_labels = sorted(list(set(labels)))
     unique_labels
[6]: ['Negative', 'Positive']

[7]: import tensorflow as tf
     import bert
     from bert import run_classifier
     from bert import optimization
     from bert import tokenization
     from bert import modeling
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/bert/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

[8]: tokenizer = tokenization.FullTokenizer(vocab_file = 'BERT.wordpiece', do_lower_case = False)
     tokens = tokenizer.tokenize('Husein Comel tersangat sangatlah')
     tokens
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/bert/tokenization.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

[8]: ['Husein', 'Comel', 'tersangat', 'sangatlah']

[9]: tokenizer.convert_tokens_to_ids(tokens) [9]: [31560, 17094, 26759, 30559]


[10]: def token_to_ids(text, maxlen = 512):
          tokens_a = tokenizer.tokenize(text)
          if len(tokens_a) > maxlen - 2:
              tokens_a = tokens_a[:(maxlen - 2)]
          tokens = ['[CLS]'] + tokens_a + ['[SEP]']
          segment_id = [0] * len(tokens)
          input_mask = [1] * len(tokens)
          input_id = tokenizer.convert_tokens_to_ids(tokens)
          return {'tokens': tokens, 'input_id': input_id,
                  'input_mask': input_mask, 'segment_id': segment_id}

1. tokens, tokenized words. 2. input_id, integer representation of tokenized words, sorted based on wordpiece weightage. 3. input_mask, attention masking. During training, shorter sequences will be padded with 0, and we do not want the model to learn padded values as part of the context. 4. segment_id, used for text pair classification; in this case, we can simply put 0.

[11]: token_to_ids(texts[0])
[11]: {'tokens': ['[CLS]', 'Lebih', '-', 'lebih', 'lagi', 'dengan', 'kemudahan', 'internet', 'dan', 'laman', 'sosial', ',', 'taktik', 'ini', 'semakin', 'mudah', 'dikembangkan', '.', '[SEP]'],
 'input_id': [2, 4015, 17, 2009, 2088, 1822, 5714, 6332, 1766, 3062, 3558, 16, 20153, 1828, 3718, 2766, 20018, 18, 3],
 'input_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

9.60.3 TF-Estimator

TF-Estimator requires 2 parts: 1. Input pipeline, https://www.tensorflow.org/api_docs/python/tf/data/Dataset 2. Model definition, https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator

9.60.4 Data pipeline

[12]: def generate():
          while True:
              for i in range(len(texts)):
                  if len(texts[i]) > 5:
                      d = token_to_ids(texts[i])
                      d['label'] = [unique_labels.index(labels[i])]
                      d.pop('tokens', None)
                      yield d

[13]: g = generate()
      next(g)
[13]: {'input_id': [2, 4015, 17, 2009, 2088, 1822, 5714, 6332, 1766, 3062, 3558, 16, 20153, 1828, 3718, 2766, 20018, 18, 3],
 'input_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'label': [0]}

It must be a function that returns a function.

def get_dataset(batch_size = 32, shuffle_size = 32):
    def get():
        return dataset
    return get


[14]: def get_dataset(batch_size = 32, shuffle_size = 32):
          def get():
              dataset = tf.data.Dataset.from_generator(
                  generate,
                  {'input_id': tf.int32, 'input_mask': tf.int32, 'segment_id': tf.int32, 'label': tf.int32},
                  output_shapes = {
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
              )
              dataset = dataset.shuffle(shuffle_size)
              dataset = dataset.padded_batch(
                  batch_size,
                  padded_shapes = {
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
                  padding_values = {
                      'input_id': tf.constant(0, dtype = tf.int32),
                      'input_mask': tf.constant(0, dtype = tf.int32),
                      'segment_id': tf.constant(0, dtype = tf.int32),
                      'label': tf.constant(0, dtype = tf.int32),
                  },
              )
              return dataset
          return get
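Note the padding values differ from the ALXLNET pipeline: here input_mask pads with 0 and segment_id with 0, following BERT's convention where 1 marks a real token, while the XLNet-style pipeline padded input_mask with 1.0 and segment_id with SEG_ID_PAD = 4.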

Test data pipeline using tf.Session

[15]: tf.reset_default_graph()
      sess = tf.InteractiveSession()
      iterator = get_dataset()()
      iterator = iterator.make_one_shot_iterator().get_next()
WARNING:tensorflow:From :4: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.

[16]: iterator [16]: {'input_id': , 'input_mask': , 'segment_id': , 'label': }

[17]: sess.run(iterator)


[17]: {'input_id': array([[ 2, 2009, 12237, ..., 0, 0, 0], [ 2, 3543, 7554, ..., 0, 0, 0], [ 2, 2007, 8065, ..., 0, 0, 0], ..., [ 2, 3566, 3841, ..., 0, 0, 0], [ 2, 3217, 1011, ..., 0, 0, 0], [ 2, 6009, 4177, ..., 0, 0, 0]], dtype=int32), 'input_mask': array([[1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0], ..., [1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0]], dtype=int32), 'segment_id': array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=int32), 'label': array([[0], [1], [1], [0], [0], [0], [1], [1], [0], [1], [0], [0], [1], [1], [1], [1], [1], [1], [0], [1], [1], [0], [1], [1], [0], [1], [1], [0], [1], [0], [0], [1]], dtype=int32)}


9.60.5 Model definition

It must be a function that accepts 4 parameters.

def model_fn(features, labels, mode, params):

[21]: bert_config = modeling.BertConfig.from_json_file('BASE_config.json')
      bert_config.__dict__
[21]: {'vocab_size': 32000,
 'hidden_size': 768,
 'num_hidden_layers': 12,
 'num_attention_heads': 12,
 'hidden_act': 'gelu',
 'intermediate_size': 3072,
 'hidden_dropout_prob': 0.1,
 'attention_probs_dropout_prob': 0.1,
 'max_position_embeddings': 512,
 'type_vocab_size': 2,
 'initializer_range': 0.02,
 'directionality': 'bidi',
 'pooler_fc_size': 768,
 'pooler_num_attention_heads': 12,
 'pooler_num_fc_layers': 3,
 'pooler_size_per_head': 128,
 'pooler_type': 'first_token_transform'}

[29]: epoch = 10
      warmup_proportion = 0.1
      num_warmup_steps = int(epoch * warmup_proportion)
      learning_rate = 2e-5
      init_checkpoint = 'bert-base/model.ckpt-1000000'

[33]: def model_fn(features, labels, mode, params):
          Y = tf.cast(features['label'][:, 0], tf.int32)

          model = modeling.BertModel(
              config = bert_config,
              is_training = True,
              input_ids = features['input_id'],
              input_mask = features['input_mask'],
              token_type_ids = features['segment_id'],
              use_one_hot_embeddings = False,
          )
          output_layer = model.get_pooled_output()
          logits = tf.layers.dense(output_layer, 2)
          loss = tf.reduce_mean(
              tf.nn.sparse_softmax_cross_entropy_with_logits(
                  logits = logits, labels = Y
              )
          )

          tf.identity(loss, 'train_loss')

          accuracy = tf.metrics.accuracy(
              labels = Y, predictions = tf.argmax(logits, axis = 1)
          )
          tf.identity(accuracy[1], name = 'train_accuracy')

          variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

          assignment_map, initialized_variable_names = utils.get_assignment_map_from_checkpoint(
              variables, init_checkpoint
          )

          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

          if mode == tf.estimator.ModeKeys.TRAIN:
              train_op = optimization.create_optimizer(loss, learning_rate, epoch, num_warmup_steps, False)
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode = mode, loss = loss, train_op = train_op
              )

          elif mode == tf.estimator.ModeKeys.EVAL:
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode = tf.estimator.ModeKeys.EVAL,
                  loss = loss,
                  eval_metric_ops = {'accuracy': accuracy},
              )

          return estimator_spec

9.60.6 Initiate training session

[35]: train_dataset = get_dataset()

[36]: train_hooks = [
          tf.train.LoggingTensorHook(
              ['train_accuracy', 'train_loss'], every_n_iter = 1
          )
      ]
      utils.run_training(
          train_fn = train_dataset,
          model_fn = model_fn,
          model_dir = 'finetuned-bert-base',
          num_gpus = 1,
          log_step = 1,
          save_checkpoint_step = epoch,
          max_steps = epoch,
          train_hooks = train_hooks,
      )
INFO:tensorflow:Using config: {'_model_dir': 'finetuned-bert-base', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 1, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fca3eb97080>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow: name = bert/embeddings/word_embeddings:0, shape = (32000, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/embeddings/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/query/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/query/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/key/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/key/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/value/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/self/value/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/output/dense/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/attention/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/intermediate/dense/kernel:0, shape = (768, 3072), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/intermediate/dense/bias:0, shape = (3072,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/output/dense/kernel:0, shape = (3072, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_0/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_1/attention/self/query/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_1/attention/self/query/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: name = bert/encoder/layer_1/attention/self/key/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*


(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_1/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_1/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_2/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* (continues on next page)


(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_3/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_3/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* (continues on next page)


(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_4/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_4/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_5/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* (continues on next page)

9.60. Finetune BERT-Bahasa 585 malaya Documentation

(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_6/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_6/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_7/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* (continues on next page)

586 Chapter 9. Contents: malaya Documentation

(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_8/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_8/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/key/bias:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/output/LayerNorm/beta:0, shape = (768,), ˓→ *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_9/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/self/key/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* (continues on next page)

9.60. Finetune BERT-Bahasa 587 malaya Documentation

(continued from previous page) INFO:tensorflow: name = bert/encoder/layer_10/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/output/LayerNorm/beta:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_10/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/query/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/query/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/key/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/key/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/value/kernel:0, shape = ˓→(768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/self/value/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/output/dense/kernel:0, shape ˓→= (768, 768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/output/dense/bias:0, shape = ˓→(768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/output/LayerNorm/beta:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/attention/output/LayerNorm/gamma:0, ˓→shape = (768,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/intermediate/dense/kernel:0, shape = ˓→(768, 3072), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/intermediate/dense/bias:0, shape = ˓→(3072,), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/output/dense/kernel:0, shape = (3072, ˓→768), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/output/dense/bias:0, shape = (768,), ˓→*INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/output/LayerNorm/beta:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/encoder/layer_11/output/LayerNorm/gamma:0, shape = (768, ˓→), *INIT_FROM_CKPT* INFO:tensorflow: name = bert/pooler/dense/kernel:0, shape = (768, 768), *INIT_FROM_ ˓→CKPT* (continues on next page)

588 Chapter 9. Contents: malaya Documentation

INFO:tensorflow:  name = bert/pooler/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = dense/kernel:0, shape = (768, 2)
INFO:tensorflow:  name = dense/bias:0, shape = (2,)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into finetuned-bert-base/model.ckpt.
INFO:tensorflow:train_accuracy = 0.34375, train_loss = 0.7432811
INFO:tensorflow:loss = 0.7432811, step = 1
INFO:tensorflow:global_step/sec: 0.0707289
INFO:tensorflow:train_accuracy = 0.4375, train_loss = 1.6084869 (14.139 sec)
INFO:tensorflow:loss = 1.6084869, step = 2 (14.138 sec)
INFO:tensorflow:global_step/sec: 0.17299
INFO:tensorflow:train_accuracy = 0.5416667, train_loss = 0.71116924 (5.781 sec)
INFO:tensorflow:loss = 0.71116924, step = 3 (5.781 sec)
INFO:tensorflow:global_step/sec: 0.181334
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 3 vs previous value: 3. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
INFO:tensorflow:train_accuracy = 0.546875, train_loss = 0.6678002 (5.516 sec)
INFO:tensorflow:loss = 0.6678002, step = 4 (5.515 sec)
INFO:tensorflow:global_step/sec: 0.0801607
INFO:tensorflow:train_accuracy = 0.5125, train_loss = 1.4128941 (12.474 sec)
INFO:tensorflow:loss = 1.4128941, step = 5 (12.475 sec)
INFO:tensorflow:global_step/sec: 0.185281
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 5 vs previous value: 5. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
INFO:tensorflow:train_accuracy = 0.49479166, train_loss = 1.22251 (5.398 sec)
INFO:tensorflow:loss = 1.22251, step = 6 (5.398 sec)
INFO:tensorflow:global_step/sec: 0.14771
INFO:tensorflow:train_accuracy = 0.4955357, train_loss = 0.75944936 (6.769 sec)
INFO:tensorflow:loss = 0.75944936, step = 7 (6.769 sec)
INFO:tensorflow:global_step/sec: 0.129142
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 7 vs previous value: 7. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
INFO:tensorflow:train_accuracy = 0.52734375, train_loss = 0.4374127 (7.745 sec)
INFO:tensorflow:loss = 0.4374127, step = 8 (7.745 sec)
INFO:tensorflow:global_step/sec: 0.185809
INFO:tensorflow:train_accuracy = 0.5590278, train_loss = 0.47080472 (5.380 sec)
INFO:tensorflow:loss = 0.47080472, step = 9 (5.380 sec)
INFO:tensorflow:Saving checkpoints for 10 into finetuned-bert-base/model.ckpt.
INFO:tensorflow:global_step/sec: 0.122564
INFO:tensorflow:train_accuracy = 0.5625, train_loss = 0.6999684 (8.159 sec)
INFO:tensorflow:loss = 0.6999684, step = 10 (8.160 sec)
INFO:tensorflow:Loss for final step: 0.6999684.

[ ]:


9.61 Finetune XLNET-Bahasa

This tutorial is available as an IPython notebook at Malaya/finetune/xlnet.

In this notebook, I am going to show how to finetune a pretrained XLNET-Bahasa using Tensorflow Estimator. TF-Estimator is a great module created by the Tensorflow team to train a model over a very long period.

[2]: # !pip3 install tensorflow==1.15 xlnet-tensorflow

9.61.1 Download pretrained model

https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/xlnet#download. In this example, we are going to try the BASE size. Just uncomment below to download the pretrained model and tokenizer.

[4]: # !wget https://f000.backblazeb2.com/file/malaya-model/bert-bahasa/xlnet-base-500k-20-10-2020.gz
     # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/preprocess/sp10m.cased.v9.model
     # !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/xlnet/config/xlnet-base_config.json
     # !tar -zxf xlnet-base-500k-20-10-2020.gz
     !ls
sp10m.cased.v9.model                     xlnet-base-500k-20-10-2020.gz
tf-estimator-text-classification.ipynb   xlnet-base_config.json
xlnet-base

[5]: !ls xlnet-base
model.ckpt-500000.data-00000-of-00001  model.ckpt-500000.meta
model.ckpt-500000.index                xlnet-base_config.json

There is a helper module, malaya/finetune/utils.py, to help us train the model on a single GPU or multiple GPUs.

[6]: import sys

     sys.path.insert(0, '../')
     import utils

9.61.2 Load dataset

We are just going to train on a very small bahasa news sentiment dataset.

[7]: import pandas as pd

     df = pd.read_csv('../sentiment-data-v2.csv')
     df.head()
[7]:       label                                               text
     0  Negative  Lebih-lebih lagi dengan kemudahan internet da...
     1  Positive  boleh memberi teguran kepada parti tetapi perl...
     2  Negative  Adalah membingungkan mengapa masyarakat Cina b...
     3  Positive  Kami menurunkan defisit daripada 6.7 peratus p...
     4  Negative  Ini masalahnya. Bukan rakyat, tetapi sistem

[8]: labels = df['label'].values.tolist()
     texts = df['text'].values.tolist()
     unique_labels = sorted(list(set(labels)))
     unique_labels
[8]: ['Negative', 'Positive']

[10]: import numpy as np
      import tensorflow as tf
      from xlnet import model_utils
      from xlnet import xlnet
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/model_utils.py:295: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

[11]: import sentencepiece as spm
      from xlnet.prepro_utils import preprocess_text, encode_ids

      sp_model = spm.SentencePieceProcessor()
      sp_model.Load('sp10m.cased.v9.model')

      SEG_ID_A = 0
      SEG_ID_B = 1
      SEG_ID_CLS = 2
      SEG_ID_SEP = 3
      SEG_ID_PAD = 4

      # the special token strings follow the original XLNet vocabulary
      special_symbols = {
          '<unk>': 0,
          '<s>': 1,
          '</s>': 2,
          '<cls>': 3,
          '<sep>': 4,
          '<pad>': 5,
          '<mask>': 6,
          '<eod>': 7,
          '<eop>': 8,
      }

      VOCAB_SIZE = 32000
      UNK_ID = special_symbols['<unk>']
      CLS_ID = special_symbols['<cls>']
      SEP_ID = special_symbols['<sep>']
      MASK_ID = special_symbols['<mask>']
      EOD_ID = special_symbols['<eod>']

      def tokenize_fn(text):
          text = preprocess_text(text, lower=False)
          return encode_ids(sp_model, text)


      def token_to_ids(text, maxlen=512):
          tokens_a = tokenize_fn(text)
          if len(tokens_a) > maxlen - 2:
              tokens_a = tokens_a[: maxlen - 2]
          segment_id = [SEG_ID_A] * len(tokens_a)
          tokens_a.append(SEP_ID)
          tokens_a.append(CLS_ID)
          segment_id.append(SEG_ID_A)
          segment_id.append(SEG_ID_CLS)
          input_mask = [0.0] * len(tokens_a)
          assert len(tokens_a) == len(input_mask) == len(segment_id)
          return {
              'input_id': tokens_a,
              'input_mask': input_mask,
              'segment_id': segment_id,
          }

1. input_id, integer representation of tokenized words, sorted based on sentencepiece weightage.
2. input_mask, attention masking. Short sentences will be padded with 1, because we do not want the model to learn the padded values as part of the context. https://github.com/zihangdai/xlnet/blob/master/classifier_utils.py#L113
3. segment_id, used for text pair classification; in this case, we can simply put 0.
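To make the masking concrete, here is a small plain-Python sketch (not part of the original notebook) that pads a short token_to_ids() output to a fixed length, using the same padding values as the padded_batch call defined later (input_id padded with 0, input_mask with 1.0, segment_id with SEG_ID_PAD = 4):

def pad_example(d, maxlen):
    # pad every field of a token_to_ids()-style dict up to maxlen
    pad = maxlen - len(d['input_id'])
    return {
        'input_id': d['input_id'] + [0] * pad,
        'input_mask': d['input_mask'] + [1.0] * pad,
        'segment_id': d['segment_id'] + [4] * pad,
    }

pad_example({'input_id': [1620, 13, 4, 3],
             'input_mask': [0.0, 0.0, 0.0, 0.0],
             'segment_id': [0, 0, 0, 2]}, maxlen=6)
# -> {'input_id': [1620, 13, 4, 3, 0, 0],
#     'input_mask': [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
#     'segment_id': [0, 0, 0, 2, 4, 4]}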

[12]: token_to_ids(texts[0])
[12]: {'input_id': [1620, 13, 5177, 53, 33, 2808, 3168, 24, 3400, 807, 21, 16179, 31, 742, 578, 17153, 9, 4, 3],
      'input_mask': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]}

9.61.3 TF-Estimator

TF-Estimator requires 2 parts:

1. an input pipeline, https://www.tensorflow.org/api_docs/python/tf/data/Dataset
2. a model definition, https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator

[13]: def generate():
          while True:
              for i in range(len(texts)):
                  if len(texts[i]) > 5:
                      d = token_to_ids(texts[i])
                      d['label'] = [unique_labels.index(labels[i])]
                      d.pop('tokens', None)
                      yield d

[14]: g = generate()
      next(g)
[14]: {'input_id': [1620, 13, 5177, 53, 33, 2808, 3168, 24, 3400, 807, 21, 16179, 31, 742, 578, 17153, 9, 4, 3],
      'input_mask': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
      'label': [0]}

It must be a function that returns a function:

def get_dataset(batch_size=32, shuffle_size=32):
    def get():
        return dataset
    return get

[15]: def get_dataset(batch_size=32, shuffle_size=32):
          def get():
              dataset = tf.data.Dataset.from_generator(
                  generate,
                  {'input_id': tf.int32, 'input_mask': tf.float32,
                   'segment_id': tf.int32, 'label': tf.int32},
                  output_shapes={
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
              )
              dataset = dataset.shuffle(shuffle_size)
              dataset = dataset.padded_batch(
                  batch_size,
                  padded_shapes={
                      'input_id': tf.TensorShape([None]),
                      'input_mask': tf.TensorShape([None]),
                      'segment_id': tf.TensorShape([None]),
                      'label': tf.TensorShape([None]),
                  },
                  padding_values={
                      'input_id': tf.constant(0, dtype=tf.int32),
                      'input_mask': tf.constant(1.0, dtype=tf.float32),
                      'segment_id': tf.constant(4, dtype=tf.int32),
                      'label': tf.constant(0, dtype=tf.int32),
                  },
              )
              return dataset
          return get


Test the data pipeline using tf.Session.

[17]: tf.reset_default_graph()
      sess = tf.InteractiveSession()
      iterator = get_dataset()()
      iterator = iterator.make_one_shot_iterator().get_next()
WARNING:tensorflow:From <ipython-input>:4: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.

[18]: iterator
[18]: {'input_id': <tf.Tensor 'IteratorGetNext:0' shape=(?, ?) dtype=int32>,
      'input_mask': <tf.Tensor 'IteratorGetNext:1' shape=(?, ?) dtype=float32>,
      'segment_id': <tf.Tensor 'IteratorGetNext:3' shape=(?, ?) dtype=int32>,
      'label': <tf.Tensor 'IteratorGetNext:2' shape=(?, ?) dtype=int32>}

[19]: sess.run(iterator)
[19]: {'input_id': array([[1084,  791,  835, ...,    0,    0,    0],
             [ 256, 8993,    9, ...,    0,    0,    0],
             [8110,   87, 1743, ...,    0,    0,    0],
             ...,
             [ 767,  250,   51, ...,    0,    0,    0],
             [ 398, 8269,  742, ...,    9,    4,    3],
             [3593,   21, 7901, ...,    0,    0,    0]], dtype=int32),
      'input_mask': array([[0., 0., 0., ..., 1., 1., 1.],
             [0., 0., 0., ..., 1., 1., 1.],
             [0., 0., 0., ..., 1., 1., 1.],
             ...,
             [0., 0., 0., ..., 1., 1., 1.],
             [0., 0., 0., ..., 0., 0., 0.],
             [0., 0., 0., ..., 1., 1., 1.]], dtype=float32),
      'segment_id': array([[0, 0, 0, ..., 4, 4, 4],
             [0, 0, 0, ..., 4, 4, 4],
             [0, 0, 0, ..., 4, 4, 4],
             ...,
             [0, 0, 0, ..., 4, 4, 4],
             [0, 0, 0, ..., 0, 0, 2],
             [0, 0, 0, ..., 4, 4, 4]], dtype=int32),
      'label': array([[0], [0], [0], [1], [0], [1], [0], [1], [0], [1], [1], [0], [1],
             [1], [1], [0], [1], [0], [1], [1], [1], [1], [1], [1], [1], [0], [0],
             [1], [0], [1], [0], [1]], dtype=int32)}

9.61.4 Model definition

It must be a function that accepts 4 parameters:

def model_fn(features, labels, mode, params):

[22]: kwargs = dict(
          is_training=True,
          use_tpu=False,
          use_bfloat16=False,
          dropout=0.1,
          dropatt=0.1,
          init='normal',
          init_range=0.1,
          init_std=0.05,
          clamp_len=-1,
      )

      xlnet_parameters = xlnet.RunConfig(**kwargs)
      xlnet_config = xlnet.XLNetConfig(json_path='xlnet-base_config.json')
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/xlnet.py:64: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

[26]: epoch = 10
      batch_size = 32
      warmup_proportion = 0.1
      num_train_steps = 10
      num_warmup_steps = int(num_train_steps * warmup_proportion)
      learning_rate = 2e-5

      training_parameters = dict(
          decay_method='poly',
          train_steps=num_train_steps,
          learning_rate=learning_rate,
          warmup_steps=num_warmup_steps,
          min_lr_ratio=0.0,
          weight_decay=0.00,
          adam_epsilon=1e-8,
          num_core_per_host=1,
          lr_layer_decay_rate=1,
          use_tpu=False,
          use_bfloat16=False,
          dropout=0.0,
          dropatt=0.0,
          init='normal',
          init_range=0.1,
          init_std=0.05,
          clip=1.0,
          clamp_len=-1,
      )

[27]: class Parameter:
          def __init__(
              self,
              decay_method,
              warmup_steps,
              weight_decay,
              adam_epsilon,
              num_core_per_host,
              lr_layer_decay_rate,
              use_tpu,
              learning_rate,
              train_steps,
              min_lr_ratio,
              clip,
              **kwargs
          ):
              self.decay_method = decay_method
              self.warmup_steps = warmup_steps
              self.weight_decay = weight_decay
              self.adam_epsilon = adam_epsilon
              self.num_core_per_host = num_core_per_host
              self.lr_layer_decay_rate = lr_layer_decay_rate
              self.use_tpu = use_tpu
              self.learning_rate = learning_rate
              self.train_steps = train_steps
              self.min_lr_ratio = min_lr_ratio
              self.clip = clip

      training_parameters = Parameter(**training_parameters)
      init_checkpoint = 'xlnet-base/model.ckpt-500000'

[28]: def model_fn(features, labels, mode, params):
          Y = tf.cast(features['label'][:, 0], tf.int32)

          xlnet_model = xlnet.XLNetModel(
              xlnet_config=xlnet_config,
              run_config=xlnet_parameters,
              input_ids=tf.transpose(features['input_id'], [1, 0]),
              seg_ids=tf.transpose(features['segment_id'], [1, 0]),
              input_mask=tf.transpose(features['input_mask'], [1, 0]),
          )

          output_layer = xlnet_model.get_sequence_output()
          output_layer = tf.transpose(output_layer, [1, 0, 2])

          logits_seq = tf.layers.dense(output_layer, 2)
          logits = logits_seq[:, 0]

          loss = tf.reduce_mean(
              tf.nn.sparse_softmax_cross_entropy_with_logits(
                  logits=logits, labels=Y
              )
          )

          tf.identity(loss, 'train_loss')

          accuracy = tf.metrics.accuracy(
              labels=Y, predictions=tf.argmax(logits, axis=1)
          )
          tf.identity(accuracy[1], name='train_accuracy')

          variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

          assignment_map, initialized_variable_names = utils.get_assignment_map_from_checkpoint(
              variables, init_checkpoint
          )

          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

          if mode == tf.estimator.ModeKeys.TRAIN:
              train_op, _, _ = model_utils.get_train_op(training_parameters, loss)
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode=mode, loss=loss, train_op=train_op
              )

          elif mode == tf.estimator.ModeKeys.EVAL:
              estimator_spec = tf.estimator.EstimatorSpec(
                  mode=tf.estimator.ModeKeys.EVAL,
                  loss=loss,
                  eval_metric_ops={'accuracy': accuracy},
              )

          return estimator_spec


9.61.5 Initiate training session

[29]: train_dataset = get_dataset()

[ ]: train_hooks = [
         tf.train.LoggingTensorHook(
             ['train_accuracy', 'train_loss'], every_n_iter=1
         )
     ]
     utils.run_training(
         train_fn=train_dataset,
         model_fn=model_fn,
         model_dir='finetuned-xlnet-base',
         num_gpus=1,
         log_step=1,
         save_checkpoint_step=epoch,
         max_steps=epoch,
         train_hooks=train_hooks,
     )
WARNING:tensorflow:From ../utils.py:62: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
WARNING:tensorflow:From ../utils.py:62: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
INFO:tensorflow:Using config: {'_model_dir': 'finetuned-xlnet-base', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options { rewrite_options { meta_optimizer_iterations: ONE } }
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 1, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f31fb236fd0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/xlnet.py:221: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/xlnet.py:221: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/modeling.py:453: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:memory input None
INFO:tensorflow:Use float type <dtype: 'float32'>
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/modeling.py:460: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/modeling.py:535: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/tensorflow_core/python/layers/core.py:271: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/modeling.py:67: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = model/transformer/r_w_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/r_r_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/word_embedding/lookup_table:0, shape = (32000, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/r_s_bias:0, shape = (12, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/seg_embed:0, shape = (12, 2, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/q/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/k/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/v/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/r/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/o/kernel:0, shape = (768, 12, 64), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/rel_attn/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/layer_1/kernel:0, shape = (768, 3072), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/layer_1/bias:0, shape = (3072,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/layer_2/kernel:0, shape = (3072, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/layer_2/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = model/transformer/layer_0/ff/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
[... the same *INIT_FROM_CKPT* entries repeat for model/transformer/layer_1 through model/transformer/layer_11 ...]
INFO:tensorflow:  name = dense/kernel:0, shape = (768, 2)
INFO:tensorflow:  name = dense/bias:0, shape = (2,)
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/model_utils.py:96: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/model_utils.py:108: The name tf.train.polynomial_decay is deprecated. Please use tf.compat.v1.train.polynomial_decay instead.
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/model_utils.py:123: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/xlnet/model_utils.py:131: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into finetuned-xlnet-base/model.ckpt.
INFO:tensorflow:train_accuracy = 0.5, train_loss = 0.8626036
INFO:tensorflow:loss = 0.8626036, step = 1

[ ]:

9.62 Crawler

There is no official compiled package for the Crawler inside Malaya, but it’s included in the repository.


9.62.1 From Source

The crawler is actively developed on Github. You need to clone the public repo:

git clone https://github.com/huseinzol05/malaya

You need to install dependencies before you are able to use the crawler.

For ubuntu / debian based:

pip3 install bs4 newspaper3k fake_useragent unidecode
apt-get install libxml2-dev libxslt-dev libjpeg-dev zlib1g-dev libpng12-dev -y
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

For Mac OS:

brew install libxml2 libxslt
brew install libtiff libjpeg webp little-cms2
pip3 install bs4 newspaper3k fake_useragent unidecode
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

And start the crawler:

python3 crawl/main.py -i "isu mahathir" -s 2009 -e 2019 -l 10

9.62.2 Get Help

You can check the help for the crawler:

python3 crawl/main.py --help

usage: main.py [-h] -i ISSUE -s START -e END -l LIMIT [-p SLEEP] [-m MALAYA]

optional arguments:
  -h, --help            show this help message and exit
  -i ISSUE, --issue ISSUE
                        issue to search
  -s START, --start START
                        year start to crawl
  -e END, --end END     year end to crawl
  -l LIMIT, --limit LIMIT
                        limit of articles to crawl
  -p SLEEP, --sleep SLEEP
                        seconds to sleep for every 10 articles
  -m MALAYA, --malaya MALAYA
                        boolean to use Malaya


9.62.3 How to start

python3 crawl/main.py -i "isu mahathir" -s 2009 -e 2019 -l 10

The data will be saved inside crawl/ as .json. An example of a returned result:

{ "title":"Mahathir alergik isu agama", "url":"http://www.utusan.com.my/berita/politik/mahathir-alergik-isu-agama-1.

˓→583393", "authors":[ "Muhammad Hasif Idris" ], "top-image":"http://www.utusan.com.my/polopoly_fs/1.557597!/image/image.jpg_gen/

˓→derivatives/landscape_650/image.jpg", "text":"KOTA BHARU 2 Jan. \u2013 Pas menyifatkan tindakan Pengerusi Parti

˓→Pribumi Bersatu Malaysia (PPBM), Tun Dr. Mahathir Mohamad seolah-olah alergik

˓→dengan isu agama kerana sering mengeluarkan kenyataan yang menyerlahkan

˓→kejahilannya sendiri.\n\nNaib Presiden Pas, Datuk Mohd. Amar Nik Abdullah berkata,

˓→pandangan yang diberikan oleh Dr. Mahathir menunjukkan beliau tidak boleh menerima

˓→hakikat sebenar yang berlaku.\n\n\u201cSejak dahulu lagi, bila beliau (Dr.

˓→Mahathir) cakap bab agama, tak layak pun, bukan saya hendak merendah-rendahkannya.

˓→Namun, beliau tiada kelayakan untuk bercakap, lagi baik diam, apabila bercakap

˓→nampak kejahilan diri sendiri.\n\n\u201cMalah, beliau seolah-olah alergik dengan

˓→isu agama, apabila memberikan respon nampak keras, macam tidak boleh terima. Saya

˓→tidak tahu apa perasaan sebenar beliau sebab sejak dari dahulu lagi dia tidak suka

˓→Pas, orang UMNO mana hendak suka Pas,\u201d katanya.\n\nBeliau berkata demikian

˓→ketika ditemui pemberita selepas Majlis Amanat Khas Tahun Baharu 2018 dan

˓→Perhimpunan Penjawat Awam Kelantan di Kompleks Kota Darul Naim di sini hari ini.\n\

˓→nYang turut hadir Menteri Besar, Datuk Ahmad Yakob. - UTUSAN ONLINE", "keyword":[ "kota", "nampak", "pas", "mahathir", "sebenar", "alergik isu agama" ], "summary":"beliau ditemui pemberita majlis amanat khas tahun baharu perhimpunan

˓→penjawat awam kelantan kompleks kota darul naim. yang hadir menteri besar datuk

˓→ahmad yakob. utusan online", "news":"", "date":"01-01-2018", "language":"MALAY" }


9.62.4 Parameters

issue: (string) An issue or search term you want to crawl; if your search is a sentence, you need to include double quotes, "isu terkini".
start: (int) Year of news to start from, eg, 2009.
end: (int) Year of news to end at, eg, 2020.
limit: (int) Limit of news you want to crawl, eg, if you put 100, you will get more or less 100 articles.
sleep: (int) Seconds to let the crawler sleep to prevent an IP block, eg, 10 represents 10 seconds.
malaya: (bool) Boolean to use Malaya; if False, summary and language will not be returned, but Malaya is not required to be installed on the local machine.
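Putting the parameters together, a full invocation might look like the command below; the issue, year range, limit, and the boolean string for -m are purely illustrative:

python3 crawl/main.py -i "isu terkini" -s 2015 -e 2020 -l 100 -p 10 -m true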

9.63 Donation

9.63.1 Special shoutout for donators

1. Ahmad Syazwan, since Jul 6, 2019.
2. Ang Chin Han, since Jul 7, 2019.
3. Hafiz Azmi, since Jun 4, 2020.
4. Lee Fai, since Jun 4, 2020.
5. Norhidayah Azman, since Jul 6, 2019.
6. Rashidee Mohd Rashid, since Jun 4, 2020.

9.63.2 Patreon

Support Malaya and Malay-Dataset development on Patreon.

9.63. Donation 609 malaya Documentation

9.63.3 BuyMeACoffee

Support Malaya and Malay-Dataset development on BuyMeACoffee.

9.63.4 Initiatives

1. Pay for B2 storage and egress! I store checkpoints and big datasets inside B2.
2. Pay linguists to validate my dataset and corpus, improving active learning.
3. Maintaining and developing new features for all these projects takes a considerable amount of time, and I am currently exploring the possibility of working on Malaya full time.

9.64 How Malaya gathered corpus?

Note: This tutorial is available as an IPython notebook here.

We use a translator to translate from a validated English dataset to a Bahasa dataset. Everyone agrees that Google Translate is the best online translator in this world, but the problem is that subscribing to the API from Google Cloud is insanely expensive. The good thing about https://translate.google.com/ is that it is open to the public internet! So we just coded a headless browser using Selenium with PhantomJS as the backbone, that's all! You can check the source code here, translator/.

from translate_selenium import Translate, Translate_Concurrent
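For intuition, here is a minimal sketch of the headless-browser idea; this is not the actual code in translator/, it assumes an older Selenium release that still ships the PhantomJS driver, and the CSS selector is a hypothetical placeholder:

from selenium import webdriver

# Sketch only: drive the public translate.google.com page without a screen.
# webdriver.PhantomJS existed in older Selenium releases; it was removed in Selenium 4.
driver = webdriver.PhantomJS()
driver.get('https://translate.google.com/#en/ms/how%20are%20you')
# The translated string is then scraped out of the rendered DOM; the selector
# depends on the page markup at the time and is a placeholder here.
result = driver.find_element_by_css_selector('.translated-text').text
driver.quit()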

9.64.1 Translate a sentence

with open('sample-joy') as fopen:
    dataset = list(filter(None, fopen.read().split('\n')))

len(dataset)

18

translator = Translate(from_lang='en', to_lang='ms')

You can get the list of supported languages here: https://cloud.google.com/translate/docs/languages

%%time
translator.translate(dataset[0])

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.23 s

'seorang lelaki yang saya mengagumi begitu banyak meminta saya untuk pergi bersamanya'

1.23 seconds; that is a very long time to translate a single sentence. What if you have 100k sentences? It will cost you around 123,000 seconds! Far too long to wait! So, we provide a multithreaded translator to concurrently translate multiple sentences.
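A quick back-of-envelope check of that number, assuming ~1.23 seconds per sentence for a single translator and, as an illustration, 10 translators running in parallel:

# Rough throughput estimate, not a benchmark.
seconds_per_sentence = 1.23
n_sentences = 100_000

single = seconds_per_sentence * n_sentences   # 123,000 s, roughly 34 hours
parallel = single / 10                        # ~12,300 s with 10 translators
print(f'1 translator: {single:,.0f} s, 10 translators: {parallel:,.0f} s')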

9.64.2 Translate batch of strings

translators = Translate_Concurrent(batch_size=3, from_lang='en', to_lang='ms')

%%time
translators.translate_batch(dataset[:3])

100%|| 1/1 [00:01<00:00, 1.44s/it]

CPU times: user 8 ms, sys: 12 ms, total: 20 ms
Wall time: 1.44 s

['kawan yang sudah berkahwin rapat hanya mempunyai anak pertamanya',
 'pengenalan rapat menangis untuk saya saya merasa gembira kerana ada yang peduli',
 'seorang lelaki yang saya mengagumi begitu banyak meminta saya untuk pergi bersamanya']

See, we translated 3 sentences in almost the same wall time as a single sentence. You can increase batch_size to any size you want; the limit is now your machine spec, and this method should not get your IP blocked by Google. Malaya has already tested it on more than 300k sentences. Remember, 1 translator takes quite a toll; here I spawned 10 translators, as seen from my top,

  PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
14628 husein   20   0 3175700 398980 43036 S  33.6  2.4  5:38.05 phantomjs
14652 husein   20   0 3188824 408880 43084 S  29.9  2.5  5:34.62 phantomjs
14489 husein   20   0 3204708 411520 43064 S  28.6  2.5  5:35.29 phantomjs
14466 husein   20   0 3171668 400304 43008 S  24.6  2.5  5:26.74 phantomjs
14443 husein   20   0 3181056 403228 42916 S  21.9  2.5  5:26.24 phantomjs
14512 husein   20   0 3187592 416036 42956 S  20.3  2.6  5:30.03 phantomjs
14558 husein   20   0 3206104 419800 43640 S  19.9  2.6  5:30.76 phantomjs
14535 husein   20   0 3179416 405508 43196 S  18.3  2.5  5:27.54 phantomjs
14420 husein   20   0 3202472 422448 43064 S  17.6  2.6  5:26.78 phantomjs
14581 husein   20   0 3181132 401892 43056 S  16.3  2.5  5:33.48 phantomjs

A single translator costs me around,

  PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
14628 husein   20   0 3175700 398980 43036 S  33.6  2.4  5:38.05 phantomjs

My machine specifications,

H/W path Device Class Description ======system G1.Sniper H6 (To be filled by O.E.M.) /0 bus G1.Sniper H6 /0/3d processor Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz (continues on next page)


/0/42                 memory      16GiB System Memory
/0/42/0               memory      DIMM [empty]
/0/42/1               memory      8GiB DIMM DDR3 Synchronous 1600 MHz (0.6ns)
/0/42/2               memory      DIMM [empty]
/0/42/3               memory      8GiB DIMM DDR3 Synchronous 1600 MHz (0.6ns)
/0/100                bridge      4th Gen Core Processor DRAM Controller
/0/100/1              bridge      Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller
/0/100/1/0            display     GM206 [GeForce GTX 960]
/0/100/1/0.1          multimedia  NVIDIA Corporation

So, beware of your machine!
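
As a rough capacity check before deciding how many translators to spawn, you can size the pool from the top output above, where each PhantomJS instance holds roughly 400 MB resident:

# back-of-envelope sizing, assuming ~400 MB RES per PhantomJS instance,
# the figure observed in the top output above
ram_gb = 16                    # this machine's RAM
per_instance_mb = 400          # approximate RES per phantomjs process
max_translators = (ram_gb * 1024) // per_instance_mb
print(max_translators)         # ~40, before leaving headroom for anything else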

9.65 References

9.65.1 Recurrent neural network

Malaya uses Long Short-Term Memory (LSTM) cells for all RNN gates.

LSTM References:

1. Hochreiter, Sepp; Schmidhuber, Jürgen (1997-11-01). “Long Short-Term Memory”. Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735.

Malaya uses the recurrent neural network architecture in some models.

Sentiment Analysis

1. malaya.deep_sentiment('luong')
2. malaya.deep_sentiment('bahdanau')
3. malaya.deep_sentiment('hierarchical')

Toxicity Analysis

1. malaya.deep_toxic('luong')
2. malaya.deep_toxic('bahdanau')
3. malaya.deep_toxic('hierarchical')


Entities Recognition

1. malaya.deep_entities('entity-network')

POS Recognition

1. malaya.deep_pos('entity-network')

Stemmer

1. malaya.deep_stemmer()

You can read more about Recurrent Neural Network here.
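
For context, these legacy factory calls followed the same load-then-call pattern as the rest of the Malaya API. A hedged sketch; the predict interface shown here is an assumption, not a documented guarantee for these old models:

import malaya

# load a legacy attention-based RNN sentiment model; 'bahdanau' is one of
# the variants listed above (the predict method shown is an assumption)
model = malaya.deep_sentiment('bahdanau')
print(model.predict('kerajaan sebenarnya sangat sayangkan rakyatnya'))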

References

1. Li, Xiangang; Wu, Xihong (2014-10-15). “Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition”. arXiv:1410.4281 [cs.CL].
2. Hochreiter, Sepp; Schmidhuber, Jürgen (1997-11-01). “Long Short-Term Memory”. Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735.
3. Schmidhuber, Jürgen (January 2015). “Deep Learning in Neural Networks: An Overview”. Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637.

9.65.2 Bidirectional recurrent neural network

Malaya uses Long Short-Term Memory (LSTM) cells for all BiRNN gates.

LSTM References:

1. Hochreiter, Sepp; Schmidhuber, Jürgen (1997-11-01). “Long Short-Term Memory”. Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735.

Malaya uses bidirectional recurrent neural networks in some models.

Sentiment Analysis

1. malaya.deep_sentiment('bidirectional')

Entities Recognition

1. malaya.deep_entities('concat')
2. malaya.deep_entities('bahdanau')
3. malaya.deep_entities('luong')


POS Recognition

1. malaya.deep_pos('concat')
2. malaya.deep_pos('bahdanau')
3. malaya.deep_pos('luong')

Normalizer

1. malaya.deep_normalizer()

Topics & Influencers Analysis

1. malaya.deep_siamese_get_topics()
2. malaya.deep_siamese_get_influencers()
3. malaya.deep_get_topics()
4. malaya.deep_get_influencers()

Summarization

1. malaya.summarize_deep_learning()

You can read more about Bidirectional Recurrent Neural Network here.

References

1. M. Schuster, K.K. Paliwal: “Bidirectional recurrent neural networks”, November 1997; https://ieeexplore.ieee.org/document/650093

9.65.3 Seq2Seq

Malaya uses Seq2Seq in some models.

Normalizer

1. malaya.deep_normalizer()

Stemmer

1. malaya.deep_stemmer()

You can read more about Seq2Seq here.


References

1. Ilya Sutskever, Oriol Vinyals: “Sequence to Sequence Learning with Neural Networks”, 2014; arXiv:1409.3215, http://arxiv.org/abs/1409.3215.

9.65.4 Conditional Random Field

Malaya uses Conditional Random Fields (CRF) in some models.

Entities Recognition

1. malaya.deep_entities('concat')
2. malaya.deep_entities('bahdanau')
3. malaya.deep_entities('luong')
4. malaya.deep_entities('entity-network')

POS Recognition

1. malaya.deep_pos('concat')
2. malaya.deep_pos('bahdanau')
3. malaya.deep_pos('luong')
4. malaya.deep_pos('entity-network')

You can read more about CRF here.

References

1. Zhiheng Huang, Wei Xu: “Bidirectional LSTM-CRF Models for Sequence Tagging”, 2015; arXiv:1508.01991, http://arxiv.org/abs/1508.01991.

9.65.5 BERT (Deep Bidirectional Transformers)

Malaya uses BERT in some models.

Sentiment Analysis

1. malaya.deep_sentiment('bert')


References

1. Jacob Devlin, Ming-Wei Chang, Kenton Lee: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018; arXiv:1810.04805, http://arxiv.org/abs/1810.04805.

9.65.6 Entity-Network

Malaya uses Entity-Network in some models.

Sentiment Analysis

1. malaya.deep_sentiment('entity-network')

Toxicity Analysis

1. malaya.deep_toxic('entity-network')

Entities Recognition

1. malaya.deep_entities('entity-network')

POS Recognition

1. malaya.deep_pos('entity-network')

References

1. Andrea Madotto: “Question Dependent Recurrent Entity Network for Question Answering”, 2017; arXiv:1707.07922, http://arxiv.org/abs/1707.07922.

9.65.7 Skip-thought Vector

Malaya uses skip-thought vectors in some models.

Summarization

1. malaya.summarize_deep_learning()


Topics & Influencers Analysis

1. malaya.deep_get_topics()
2. malaya.deep_get_influencers()

References

1. Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun: “Skip-Thought Vectors”, 2015; arXiv:1506.06726, http://arxiv.org/abs/1506.06726.

9.65.8 Siamese Network

Malaya uses siamese networks in some models.

Topics & Influencers Analysis

1. malaya.deep_siamese_get_topics()
2. malaya.deep_siamese_get_influencers()

References

1. Anfeng He, Chong Luo, Xinmei Tian: “A Twofold Siamese Network for Real-Time Object Tracking”, 2018; arXiv:1802.08817, http://arxiv.org/abs/1802.08817.

9.65.9 Normalizer

References

1. N. Samsudin, Mazidah Puteh, Abdul Razak Hamdan, Mohd Zakree Ahmad Nazri: “Normalization of noisy texts in Malaysian online reviews”; https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews

9.65.10 XGBoost

Malaya uses XGBoost in some models.

Sentiment Analysis

1. malaya.sentiment.pretrained_xgb_sentiment()


Language Detection

1. malaya.xgb_detect_languages()

References

1. Tianqi Chen: “XGBoost: A Scalable Tree Boosting System”, 2016; arXiv:1603.02754, http://arxiv.org/abs/1603.02754. DOI: 10.1145/2939672.2939785.

9.65.11 Multinomial

Malaya uses multinomial Naive Bayes in some models.

Sentiment Analysis

1. malaya.sentiment.pretrained_bayes_sentiment()

Language Detection

1. malaya.multinomial_detect_languages()

Toxicity Analysis

1. malaya.multinomial_detect_toxic()

References

1. https://medium.com/@johnm.kovachi/implementing-a-multinomial-naive-bayes-classifier-from-scratch-with-python-e70de6a3b92e

9.65.12 Logistic Regression

Malaya uses logistic regression in some models.

Toxicity Analysis

1. malaya.logistics_detect_toxic()

References

1. https://itnext.io/machine-learning-sentiment-analysis-of-movie-reviews-using-logisticregression-62e9622b4532
