Semi-Supervised Learning with HALFADO: Two Case Studies

IT 20 042
Degree project (Examensarbete), 30 credits
July 2020
Moustafa Aboushady
Department of Information Technology (Institutionen för informationsteknologi)

Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH-enheten
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Supervisor: Kristiaan Pelckmans
Subject reviewer: Niklas Wahlström
Examiner: Mats Daniels
Printed by: Reprocentralen ITC

Abstract

This thesis studies the HALFADO algorithm [1], a semi-supervised learning algorithm designed for detecting anomalies in complex information flows. The report assesses HALFADO's performance in terms of detection capabilities (precision and recall) and computational requirements. We compare the results of HALFADO with a standard supervised and a standard unsupervised learning approach. The results of two case studies are reported: (1) HALFADO applied to a FinTech example with a flow of financial transactions, and (2) HALFADO applied to detecting hate speech in a social media feed. These results point to the benefits of using HALFADO in environments where only modest computational resources are available.

Popular Scientific Summary

Almost everyone nowadays has a smartphone, a laptop, a tablet, or even an Internet of Things (IoT) device. People expect their requests on these devices to be handled and processed instantaneously, especially if they work in a time-critical domain such as the stock market.
At the same time, these devices generate continuous data streams that usually represent a behaviour or a change in behaviour. The need to address real-time information and decision-making constraints on mobile/ubiquitous data stream analysis has led to an increased demand for efficient and scalable online learning algorithms, which process and analyze data in real time and adapt to changes in behaviour.

In this thesis we introduce HALFADO, an algorithm for processing data and detecting anomalies/faults in it in real time. Our goal is to investigate HALFADO's implementations in terms of detection capabilities, computational requirements, and scalability across different domains. We therefore demonstrate it in two applications, both implemented on modest hardware: (1) detecting fraudulent transactions in a flow of financial transactions, and (2) detecting hate speech in a social media feed.

HALFADO could be of interest to anyone who wants to discover anomalies in a data flow in real time, or who works in a field where anomalies must be detected in real time. One example is a healthcare monitoring system, where an anomaly would be a patient with a serious change in heart rate that requires immediate handling. Implementations of HALFADO can benefit industry, for any entity interested in getting insights from data in real time, as well as individuals who in some cases need help as fast as possible, as in the healthcare system mentioned above.

Contents

1 Introduction
  1.1 Online Learning
  1.2 Anomaly/Fault detection
2 Theory
  2.1 Online learning vs Offline/Batch learning
  2.2 Supervised Learning
  2.3 Unsupervised Learning
  2.4 Semi-Supervised Learning
  2.5 Evaluation Metrics
3 Methods
  3.1 HALVING for online supervised learning
    3.1.1 HALVING for Prediction From Expert Advice
    3.1.2 The HALVING Algorithm: Soft implementation using constant factor Alpha
  3.2 FADO for online unsupervised learning
  3.3 HALFADO for online semi-supervised learning
4 HALFADO in TWO Case Studies
  4.1 Case study in Fin-Tech
    4.1.1 Dataset
    4.1.2 Setup
    4.1.3 Performance
  4.2 Case Study in Social Media
    4.2.1 Dataset
    4.2.2 Setup
    4.2.3 Performances
    4.2.4 Performance comparison
5 Discussion
  5.1 Performance
  5.2 Existence Condition
6 Conclusion
  6.1 Conclusion of the work
  6.2 Open Problems
    6.2.1 Feature selection in HALFADO
    6.2.2 Unsupervised Online Learning
7 Further Work

1 Introduction

1.1 Online Learning

In Machine Learning (ML), data is a key component. It largely determines how well an ML model will perform on unseen data: more data generally leads to better generalization (predictive power on unseen data). As described in [1], there are two design options, depending on how the modeling pipeline receives its data. The first is to build the learning model while the data is at rest (batch learning); the second is to build it while the data flows in streams into the model (online learning).
Batch learning is the more traditional approach. It splits the dataset into two subsets for training and testing, with the underlying assumption that the test data have statistics similar to those of the training data [2]. It also assumes that the data is stored and can be accessed several times. That assumption imposes several resource constraints, including storage and computational power.

More than ever, the volume of data streams has increased exponentially, due among other things to advances in hardware technology. Applications such as financial processing [3], sensor networks [4], web logs, and sentiment analysis continuously generate fresh data. These datasets often become so large that it may be infeasible to store them [5]. Moreover, some critical applications, such as healthcare monitoring, require these data flows to be analyzed in real time. As a result, online learning has gathered more attention in recent years as a solution for the analysis of continuous data streams.

Online learning is concerned with learning a pattern incrementally by processing examples one at a time, as defined in the overview by Widmer et al. [6]. It is performed in a sequence of consecutive rounds and can be thought of as answering a sequence of questions [7]. In online classification, the Yes and No answers correspond to the target classes, and each question is assigned to either the Yes class or the No class. The goal of online learning for classification is to make as few mistakes as possible, that is, to minimize the total number of erroneous classifications [6].

In contrast to batch learning, online learning has the flexibility to scale and adapt to changes in the data properties, to process data in real time with limited resources, and to discover and learn new patterns in continuous data streams.
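The round-by-round protocol described above (receive a question, answer Yes or No, see the true answer, count the mistakes) can be illustrated with a small sketch. The learner used here is a plain perceptron, chosen only as a familiar stand-in; it is not HALVING, FADO, or HALFADO, and the function name `online_perceptron` and the toy stream are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def online_perceptron(
    stream: Iterable[Tuple[List[float], int]], dim: int
) -> Tuple[List[float], int]:
    """Process (features, label) pairs one at a time.

    Labels are +1 (the Yes class) or -1 (the No class). The perceptron
    update is a stand-in learner, not the method studied in this thesis.
    """
    weights = [0.0] * dim
    mistakes = 0
    for features, label in stream:
        # Answer the current question before the true label is used.
        score = sum(w * x for w, x in zip(weights, features))
        prediction = 1 if score > 0 else -1
        if prediction != label:
            # Learn only from erroneous classifications.
            mistakes += 1
            weights = [w + label * x for w, x in zip(weights, features)]
    return weights, mistakes


# Hypothetical toy stream; each record is seen once and never stored.
stream = [
    ([1.0, 0.0], 1),
    ([0.0, 1.0], -1),
    ([1.0, 0.2], 1),
    ([0.1, 1.0], -1),
]
weights, mistakes = online_perceptron(stream, dim=2)
```

The quantity of interest is `mistakes`, the total number of erroneous classifications over the stream. Memory use depends only on the number of features, not on how many records have been processed.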
However, batch learning can also be used for the analysis of continuous data streams, as suggested in [2], by keeping a buffered dataset of past data records. The model is then re-trained at regular intervals on each new batch, and so adapts to the changes reflected in that batch. Although this paradigm makes batch learning more flexible to changes, it does not necessarily reduce the computing and storage resources required.

As presented in [8], online learning is also needed to address the real-time information and decision-making constraints on mobile/ubiquitous data stream analysis. The wide range of tasks that now carry such temporal constraints increases the need for efficient and scalable online learning algorithms. In many application areas it is also impossible to make assumptions regarding the distribution of the data, as the sequence of the data can be deterministic [7]. Examples of online prediction problems include online regression, prediction with expert advice, online detection, and the multi-armed bandit problem in a limited feedback setting [7].

Online learning is a common technique in areas of machine learning where it is computationally infeasible to train on the entire dataset. It therefore processes incoming real-time data sequentially.

1.2 Anomaly/Fault detection

Anomaly/fault detection is a method of detecting outliers that do not conform to the normal behavior (or normal pattern). The task is to recognize the presence of these outliers with respect to a model definition of "normal" [9]. Anomalies are defined not by their own characteristics, but in contrast to what is normal. You may not know what the anomalies will look like, but you can build a system that detects them in contrast to what you have discovered and defined as the normal pattern; see [10] for an example.
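The idea of defining anomalies only in contrast to a learned notion of "normal" can be sketched in a few lines. The example below keeps a running mean and variance of the values seen so far (Welford's method) and flags a value that lies too many standard deviations away. This is a generic illustration and not FADO; the class name `RunningNormalModel`, the threshold of 3 standard deviations, the warm-up length, and the sample values are all assumptions.

```python
import math
from typing import List


class RunningNormalModel:
    """Running mean and variance of values considered normal (Welford's method)."""

    def __init__(self, threshold: float = 3.0, warmup: int = 5) -> None:
        self.threshold = threshold  # allowed distance from the mean, in std units
        self.warmup = warmup        # records to observe before flagging (at least 2)
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # sum of squared deviations from the mean

    def is_anomaly(self, x: float) -> bool:
        # During warm-up there is no usable notion of "normal" yet.
        if self.n < self.warmup:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0.0:
            return x != self.mean
        return abs(x - self.mean) / std > self.threshold

    def update(self, x: float) -> None:
        # Fold one more normal value into the running statistics.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


model = RunningNormalModel()
flagged: List[int] = []
for i, value in enumerate([10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 25.0, 10.1]):
    if model.is_anomaly(value):
        flagged.append(i)    # deviates from the normal pattern
    else:
        model.update(value)  # only unflagged values shape the model
```

Nothing in the model describes what an anomaly looks like. The value 25.0 is flagged purely because it deviates from the pattern formed by the earlier values.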
In our financial transactions case (Section 4.1), where fault detection is called fraud detection [11], a supervised approach can be used when a dataset is available in which the records (transactions) are labeled as, e.g., "normal" or "fraudulent". The labels help the algorithm form an idea of what a normal or a fraudulent transaction looks like. However, we focus on online fault/fraud learning using a semi-supervised approach, where the records (transactions) arrive at high frequency and are not labeled in advance. After a while, the algorithm has formed a pattern for a "normal" transaction, and it treats every transaction that deviates enough from the normal pattern as a possibly "fraudulent" transaction. Those potentially fraudulent transactions are then submitted for follow-up analysis, and hence acquire a "label". This in turn allows the detection algorithm to learn.

The detection of anomalies in high-frequency streams of high-dimensional (financial, social media) data poses challenges beyond the reach of many existing approaches. This thesis builds on an algorithm dubbed FADO (Online Fault Detection), described in [12] and elaborated in [11].
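The feedback loop described above (flag a deviating record, submit it for follow-up analysis, receive a label, learn from it) can be sketched as follows. This is a deliberately simplified illustration under stated assumptions, not the HALFADO or FADO update rule: the one-dimensional `amounts`, the `review` callback standing in for the follow-up analysis, the fixed starting threshold, and the multiplicative `step` are all hypothetical. The sketch also assumes that the first record is normal.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def semi_supervised_detection(
    amounts: Iterable[float],
    review: Callable[[int, float], bool],
    threshold: float = 50.0,
    step: float = 1.1,
) -> Tuple[List[int], List[int], float]:
    """Flag deviating records and learn from the labels of flagged records only."""
    center: Optional[float] = None  # running mean of records treated as normal
    count = 0
    confirmed: List[int] = []
    false_alarms: List[int] = []
    for index, amount in enumerate(amounts):
        if center is None:
            # Assumption: the first record is normal and seeds the pattern.
            center, count = amount, 1
            continue
        if abs(amount - center) > threshold:
            # A label is acquired only for records submitted for follow-up.
            if review(index, amount):
                confirmed.append(index)
                continue  # confirmed fraud never enters the normal pattern
            false_alarms.append(index)
            threshold *= step  # a false alarm loosens the detector slightly
        count += 1
        center += (amount - center) / count
    return confirmed, false_alarms, threshold


# Hypothetical stream in which only record 3 is truly fraudulent.
fraud_indices = {3}
amounts = [100.0, 110.0, 90.0, 900.0, 170.0, 105.0]
confirmed, false_alarms, final_threshold = semi_supervised_detection(
    amounts, review=lambda index, amount: index in fraud_indices
)
```

Only two of the six records are ever sent to `review`, which is the point of the semi-supervised setting: labels are requested for the small set of suspicious records rather than for the whole stream.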
