
Event Mining Over Multiple Text Streams

John Calvo Martinez

A thesis in fulfilment of the requirements for the degree of Doctor of Philosophy

School of Computer Science and Engineering
Faculty of Engineering
December 2019

Thesis/Dissertation Sheet

Surname/Family Name: Calvo Martinez
Given Name/s: John Steven
Abbreviation for degree as given in the University calendar: Ph.D. – Research in Computer Science and Eng.
Faculty: Engineering
School: Computer Science
Thesis Title: Event Mining over Multiple Text Streams

Abstract (350 words maximum):

Event Mining is the set of information extraction tasks that aim to extract events from text, identifying the what (action or event category), the who (actors and targets), the when (date), and the where (location). Extraction of events requires a number of automated steps for recognising all of these components. Current state-of-the-art event extraction systems rely on batch learning, but analysts need near-real time socio-political conflict understanding. Therefore, a key research question is how to deal with Event Mining in near-real-time scenarios.

In this thesis, a novel framework was developed to deal with Event Extraction, Event Detection, Event Classification, and Argument Classification, using online learning and prequential testing to work in near-real time scenarios. The framework was tested using three different social science datasets of the Afghanistan-Pakistan conflict using events reported by news, social media and local experts (ACLED). A novel method called SPLICER was built to tackle these tasks in real time, using stream mining models in a multi-layered constraint learning approach. The feasibility of SPLICER is shown against state-of-the-art event extraction systems. Results show that SPLICER outperformed baselines by more than 10% F1 in both the ACLED and AfPak datasets. Knowledge represented in domain-specific ontologies is used, in conjunction with constraint learning, along with base stream mining algorithms.

In addition to improvements over stream mining algorithms, this thesis addresses the question of how to automatically combine multiple sources of information for stream classification. We propose SLICER, a stream mining ensemble to handle stream partitioning automatically. It assesses when and how it is better to “horizontally” split a stream dataset to build multiple local models to boost global models. SLICER was tested under single layer event mining tasks, showing better results than single stream mining baselines and classic stream mining ensembles.

Finally, SPLICER and SLICER were jointly used, improving results by 3 to 5% for event extraction tasks. In conclusion, stream mining algorithms can be efficiently used for event mining and other information extraction tasks, if horizontal partitions are carefully made by using Information Gain or Gini measures to split the source into multiple streams.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Signature    Witness Signature    Date

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

FOR OFFICE USE ONLY
Date of completion of requirements for Award:

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed ……………………………………………......

Date ……………………………………………......

INCLUSION OF PUBLICATIONS STATEMENT

UNSW is supportive of candidates publishing their research results during their candidature as detailed in the UNSW Thesis Examination Procedure.

Publications can be used in their thesis in lieu of a Chapter if:
• The student contributed greater than 50% of the content in the publication and is the "primary author", i.e. the student was responsible primarily for the planning, execution and preparation of the work for publication
• The student has approval to include the publication in their thesis in lieu of a Chapter from their supervisor and Postgraduate Coordinator.
• The publication is not subject to any obligations or contractual agreements with a third party that would constrain its inclusion in the thesis

Please indicate whether this thesis contains published material or not.

☐ This thesis contains no publications, either published or submitted for publication

☐ Some of the work described in this thesis has been published and it has been documented in the relevant Chapters with acknowledgement

☐ This thesis has publications (either published or submitted for publication) incorporated into it in lieu of a chapter and the details are presented below

CANDIDATE'S DECLARATION

I declare that:
• I have complied with the Thesis Examination Procedure
• where I have used a publication in lieu of a Chapter, the listed publication(s) below meet(s) the requirements to be included in the thesis.

Signature    Date (dd/mm/yy)

Postgraduate Coordinator's Declaration

I declare that:
• the information below is accurate
• where listed publication(s) have been used in lieu of Chapter(s), their use complies with the Thesis Examination Procedure
• the minimum requirements for the format of the thesis have been met.

PGC's Name    PGC's Signature    Date (dd/mm/yy)

COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.'

Signed ……………………………………………......

Date ……………………………………………......

AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ……………………………………………......

Date ……………………………………………......

Dedication

My love for you has brought hope.
My love for you has brought disappointment.
My love for you has evolved.
My love for you goes beyond what is human.

For my love, remembered today and always, for Olga.

Acknowledgements

Real world applications are demanding more complex process automation techniques from a variety of data sources at unprecedented speed, variety and structure. Real world datasets are noisy, imperfect and imbalanced. The world needs more efforts to exploit our capabilities as scientists to deal with such complexity, in an information age marked by imperfection. Our main lesson learned is that research is better done under real-life scenarios, and we hope our contribution can grow towards finding that balance of research with an impact on the real world.

I would like to thank all the people who were around me during this major step in my life: the family, friends and co-workers who helped me achieve this thesis in one way or another.

• I would like to thank my sponsors, Data to Decisions Cooperative Research Centre, and Colciencias for giving me the opportunity to be involved in this endeavour.

• I would like to thank my supervisor, Associate Professor Wayne Wobcke. You have been such an inspiration and guidance during these years. Thanks for teaching me not only how to be a good researcher but how to be an impactful and ethical professional. I really appreciate and enjoyed the opportunity to work with you during these years.

• Thanks to Dr Alfred Krzywicki for his helpful guidance and comments on my research area. I really appreciate the opportunity to work with you and to build a meaningful project. It was always useful to have your expert eye on my project.

• Thanks to Dr Mike Bain for his guidance as my co-supervisor. I hugely enjoyed and learnt from you during classes and mostly during our weekly meetings.

• Thanks to Dr Bradford Heap for his invaluable comments. I gained an enormous amount of knowledge from you when we worked together.

• Special thanks to Dr Susanne Schmeidl, for whom I have special admiration. Thank you for helping us construct our research datasets, and thank you for allowing me to learn much more than machine learning. My thesis is the result of your efforts in socio-political conflict analysis too.

• To Olga, for being present in the good and not so good moments, and for always being on my side. I love you.

• To my mom Carmen and my dad Jhon Jairo, this is also for you. Thank you for all that I’ve learnt from you. Without your guidance it would have been impossible to reach what I have reached in my life. Dad, thank you for teaching me maths. Mom, thank you for believing in me all the time. To both of you, thank you for teaching me how to be a good person and to persist in achieving what we want.

• To my brother Alejandro, I love you so much. I have the privilege of being your brother. You raised yourself, I am proud of you. I hope I make you proud too.

• To my sister Diana, thanks for keeping an eye on me and for inspiring me. You are such an amazing sister, mom and researcher.

• Special thanks to Sandeepa, for helping me and being present during all these years. I hope to keep your friendship for the rest of our lives.

• Dear Yesenia, thanks for tolerating me and for being present during this phase of my life. It is very nice to have your friendship which I appreciate very much.

• My friend Kimie, thanks for being my “mom” overseas. Your comments and advice were always helpful for achieving this milestone.

• Dear Shayan, Dianne, Angelo and Pamela, it is a privilege to know you. Thanks for those delicious Persian and Chilean dishes.

• My friend Carlos, compita, thanks for making me laugh during these years. I am glad that I knew you in my Ph.D. journey.

• Dear Astrid, you are amazing. Thank you for helping me with my English. I enjoyed all our conversations and coffees, and I hope to keep enjoying them for much more time!

• Dear Matt, thank you for being present towards the end of my Ph.D. I am delighted to have conversations with you, and to engage with good food.

• To all my WiseTech friends, Ji, Raul, Juan, Axle, Stephen, Igor and the Data Science Team, thank you for being supportive and cheerful during my comments phase.

• To all my relatives, don Jose, doña Gladys, Juan, Juan Andres, Isabella, High School friends, Sydney friends, Colombian friends, cousins, uncles, aunties, grandmother, my “madrina”, thank you for being present in spirit all the time.

Abstract

Event Mining is the set of information extraction tasks that aim to extract events from text, identifying the what (action or event category), the who (actors and targets), the when (date), and the where (location). Extraction of events requires a number of automated steps for recognising all of these components. Current state-of-the-art Event Extraction systems rely on batch learning, but analysts need near-real time socio-political conflict understanding. Therefore, a key research question is how to deal with Event Mining in near-real-time scenarios.

In this thesis, a novel framework was developed to deal with Event Extraction, Event Detection, Event Classification, and Argument Classification, using online learning and prequential testing to work in near-real time scenarios. The framework was tested using three different social science datasets of the Afghanistan-Pakistan conflict using events reported by news, social media and local experts (ACLED). A novel method called SPLICER was built to tackle these tasks in real time, using Stream Mining models in a multi-layered constraint learning approach. The feasibility of SPLICER is shown against state-of-the-art Event Extraction systems. Results show that SPLICER outperformed baselines by more than 10% F1 in both the ACLED and AfPak datasets. Knowledge represented in domain-specific ontologies is used, in conjunction with constraint learning, along with base Stream Mining algorithms.

In addition to improvements over Stream Mining algorithms, this thesis addresses the question of how to automatically combine multiple sources of information for stream classification. We propose SLICER, a Stream Mining ensemble to handle stream partitioning automatically. It assesses when and how it is better to “horizontally” split a stream dataset to build multiple local models to boost global models. SLICER was tested under single layer Event Mining tasks, showing better results than single Stream Mining baselines and classic Stream Mining ensembles.

Finally, SPLICER and SLICER were jointly used, improving results by 3 to 5% for Event Extraction tasks. In conclusion, Stream Mining algorithms can be efficiently used for Event Mining and other information extraction tasks, if horizontal partitions are carefully made by using Information Gain or Gini measures to split the source into multiple streams.

Key words: Computational Social Science, Stream Text Mining, Multi Stream Mining, Event Mining, Event Extraction, Event Coding, Information Extraction, Natural Language Processing.

Contents

Acknowledgements i

Abstract iv

Contents vi

List of Figures xiii

List of Tables xv

1 Introduction 1

1.1 What is an Event? A General Overview ...... 3

1.1.1 Structuring and Coding an Event in a Knowledge Base ...... 6

1.2 Event Mining ...... 7

1.2.1 Topic Detection ...... 7

1.2.2 Event Detection and Classification ...... 8

1.2.3 Argument Detection and Classification ...... 8

1.2.4 Event Extraction ...... 9

1.2.5 Event Coreference Resolution and Synthesis ...... 10

1.3 Current Event Mining Methods ...... 10

1.4 Near Real Time Event Extraction and Stream Mining Methods ...... 12

1.5 Summary ...... 14

2 Event Mining and Text Stream Mining 17

2.1 Socio-Political Conflict Analysis ...... 17

2.2 Event Mining Systems and Pipelines ...... 20

2.3 Topic Detection and Tracking ...... 21

2.4 Event Extraction Systems ...... 23

2.4.1 Manual Event Extraction ...... 24

2.4.2 Event Detection ...... 25

2.4.3 Real-Time Event Detection ...... 27

2.4.4 Automated Event Extraction and Coding ...... 28

2.4.5 Event Coreference Resolution ...... 32

2.4.6 Event Synthesis and Other Tasks ...... 33

2.5 Stream Mining Algorithms ...... 34

2.5.1 Drift Detection and Stationarity ...... 35

2.5.2 Incremental Stream Classifiers ...... 38

2.5.3 Windowing Approaches ...... 39

2.5.4 Rule Learning ...... 41

2.5.5 Time Series Based Approaches ...... 41

2.5.6 Ensembles and Meta-Algorithms ...... 42

2.5.7 Distributed Stream Mining ...... 44

2.6 Stream Mining Approaches Applied to Text Mining ...... 46

2.7 Summary ...... 47

3 The Stream Event Mining Framework - SEMF 49

3.1 Principles and Definitions ...... 50

3.1.1 Input: Text Stream ...... 50

3.1.2 Event as a Knowledge Base Structure ...... 50

3.1.3 Ontology Structure ...... 51

3.2 SEMF Architecture ...... 52

3.2.1 Online Text Pre-Processing ...... 54

3.2.2 Online Event Detection and Classification ...... 56

3.2.3 Online Argument Detection and Classification ...... 57

3.2.4 Online Event Extraction ...... 57

3.2.5 Event Coreference Resolution ...... 57

3.3 Domain Specific Datasets ...... 58

3.3.1 The Afghanistan-Pakistan Socio-Political Conflict Datasets ...... 58

3.3.2 ACLED Dataset Sample ...... 69

3.3.3 Stream Datasets in Other Domains ...... 71

3.4 Evaluation and Metrics ...... 73

3.4.1 Prequential Evaluation ...... 73

3.4.2 Precision, Recall and F1 ...... 74

3.4.3 Event Detection and Classification Measures ...... 75

3.4.4 Argument Detection and Classification Measures ...... 75

3.4.5 Event Extraction Metrics ...... 76

3.5 Summary ...... 77

4 Event Extraction From Text Streams Using SPLICER 79

4.1 Introduction ...... 80

4.2 SEMF Framework and SPLICER ...... 81

4.2.1 Ontology as Base Knowledge ...... 81

4.2.2 Event Logical Form (ELF) ...... 84

4.3 SPLICER – Stream Processing with Logical Integrity Constraints for Event extRaction ...... 86

4.3.1 Text Pre-Processing ...... 87

4.3.1.1 Incremental Pre-Processing for Event Type Classification . . . 88

4.3.1.2 Text Pre-Processing for Event Argument Classification . . . . 89

4.3.2 First Layer: Component Detection and Classification ...... 90

4.3.3 Second Layer: Integrity Constraint Learning ...... 91

4.3.4 Horn Integrity Constraint ...... 91

4.3.5 Integrity Constraint Consistency and Evolvability ...... 92

4.3.6 Plausible Candidate Selection ...... 93

4.4 SPLICER’s ELF Selection Algorithm ...... 94

4.5 Experiments ...... 96

4.5.1 Datasets ...... 96

4.5.2 Event Extraction Baselines ...... 96

4.6 Analysis of the Results ...... 98

4.6.1 AfPak Twitter and News ...... 99

4.6.2 ACLED ...... 101

4.6.3 The Impact of Pruning Rules ...... 103

4.6.4 Runtime Analysis ...... 104

4.7 Conclusions ...... 105

5 Multi-stream Event Components Classification Using SLICER 107

5.1 Problem Definition ...... 109

5.2 SLICER ...... 110

5.2.1 Text Pre-Processing ...... 111

5.2.1.1 Event Classification Text Pre-Processing ...... 111

5.2.1.2 Argument Classification Text Pre-Processing ...... 112

5.3 SLICER - Splitting to Locally Indexed Classifiers Ensemble boosteR ...... 112

5.3.1 Classification Task ...... 113

5.3.2 Subspacing Procedure ...... 114

5.3.2.1 Information Gain (IG) and Minimum Sampling ...... 115

5.4 SLICER Training and Testing Algorithms ...... 116

5.4.1 Horizontal Splitting ...... 117

5.4.2 Vertical Splitting ...... 119

5.5 Experiments ...... 119

5.5.1 Datasets ...... 120

5.5.2 Stream Mining Baselines ...... 121

5.5.3 Drift Detection Methods ...... 123

5.6 Analysis of the Results ...... 123

5.6.1 Event Detection and Classification ...... 123

5.6.2 Argument Detection and Classification ...... 124

5.6.3 The Effect of Drift in SLICER ...... 125

5.6.4 SLICER in Other Domains ...... 127

5.7 Multi-stream Event Extraction Using SPLICER and SLICER ...... 128

5.7.1 Methodology ...... 129

5.7.2 Experiments ...... 130

5.7.3 Event Extraction Baselines ...... 132

5.7.4 Analysis of the Results ...... 132

5.7.5 Runtime Analysis ...... 134

5.8 Summary ...... 135

6 Conclusion 137

6.1 Summary ...... 138

6.1.1 SPLICER ...... 138

6.1.2 SLICER ...... 139

6.2 Limitations ...... 141

6.3 Future Work ...... 142

BIBLIOGRAPHY 144

APPENDICES 162

A Table of Equivalences Between JET and AfPak Categories ...... 162

B Table of Equivalences Between JET and AfPak Argument Roles ...... 164

C Table of Equivalences Between JET and ACLED Categories ...... 165

D AfPak Ontology values ...... 166

List of Figures

1.1 Description of a reported killing event in Colombia ...... 5

1.2 Major events affecting Afghan displacements [141, p. 173] ...... 9

2.1 General events extraction and coding pipelines [147, p. 24.] ...... 29

3.1 SEMF framework architecture ...... 53

3.2 The Event Coding Assistant annotation tool [85] ...... 60

3.3 AfPak events annotation pipeline ...... 61

3.4 Events in News articles times series ...... 62

3.5 Events by sources in AfPak News ...... 63

3.6 Top 12 Actors in AfPak News ...... 63

3.7 Top 12 Targets in AfPak News ...... 64

3.8 Top 12 Locations in AfPak News ...... 64

3.9 Top 20 event types distribution in AfPak News dataset ...... 65

3.10 Events by source (author) in AfPak Twitter ...... 67

3.11 Top 20 event types in AfPak Twitter dataset ...... 67

3.12 Top 12 actors in AfPak Twitter ...... 68

3.13 Top 12 targets in AfPak Twitter ...... 68

3.14 Top 12 locations in AfPak Twitter ...... 68

3.15 Actors distribution in ACLED dataset ...... 71

3.16 Locations distribution in ACLED dataset ...... 72

4.1 AfPak ontology fragment used for logic translation ...... 83

4.2 Representation and extraction of an event in Twitter. In this example the tweet is represented as an ELF in KB ...... 85

4.3 SPLICER pipeline to extract events and integrity constraints ...... 87

5.1 Representation and extraction of an event in Twitter. In this example the tweet is represented as three separate events containing some common components . . . . 109

5.2 Joint, single classifier and multiple classifier MSM approaches ...... 113

5.3 The SLICER process. The ensemble chooses when to split the incoming stream according to a gain metric, and it is trained at every drift using (E)DDM [57, 11] or any other drift detection approach ...... 114

5.4 EE pipeline using SLICER in the first layer and SPLICER in the second to reach a full MSM approach ...... 130

List of Tables

2.1 Main Event Mining Systems and Pipelines ...... 20

3.1 Event Mining Tasks vs Multi-layer Approach ...... 53

3.2 Text Pre-processing Techniques in SEMF Framework ...... 55

3.3 # Tweets by Author ...... 59

3.4 # News by Source ...... 60

3.5 News Articles ρ Correlation Matrix ...... 65

3.6 AfPak Twitter ρ Correlation Matrix ...... 68

3.7 ACLED Afghan Sample Statistics ...... 70

3.8 ACLED Top 3 Values per Component ...... 70

4.1 Event Component Features ...... 90

4.2 Prequential Event Extraction Results ...... 98

4.3 Prequential Event Detection and Classification Results ...... 99

4.4 Prequential Argument Detection and Classification Results ...... 99

4.5 Prequential ACLED Event Extraction Results ...... 101

4.6 ACLED Event Detection and Classification Results ...... 102

4.7 ACLED Argument Detection and Classification Results ...... 102

4.8 AfPak Pruning vs Non-pruning ...... 103

4.9 Event Extraction AfPak and ACLED Runtimes ...... 104

5.1 Drift Detection Methods Parameters Configuration ...... 123

5.2 Prequential Event Detection and Classification Results (MICRO Precision, Recall and F1) ...... 124

5.3 Prequential Argument Detection and Classification Results ...... 125

5.4 SLICER’s F1 and Calculated Lift-per-drift for Event Detection (ED), Event Classification (EC), Argument Detection (AD) and Argument Classification (AC) in AfPak Datasets ...... 126

5.5 SLICER’s Accuracy in other Domain Datasets (best accuracy rates are in bold) . 128

5.6 Twitter Authors vs Types Correlation Matrix ...... 131

5.7 News Authors vs Types Correlation Matrix ...... 132

5.8 Prequential Event Extraction Results ...... 132

5.9 Prequential Event Detection and Event Classification Results ...... 133

5.10 Prequential Argument Detection and Argument Classification Results ...... 133

5.11 SLICER Runtime Results ...... 135


Chapter 1

Introduction

The well-known term “Big Data” has played a key role during the last decade of computing. The volume and velocity of data from vast sources of information have exploded over the last two years: it has been calculated that 90% of existing internet data was produced during this time period.¹ In particular, huge amounts of text are crawled day by day by search engines such as Google, and text applications such as Twitter generate data at unprecedented rates in real time [158]. Text streams have been shown to be useful when reporting terrorist attacks, environmental catastrophes and disease outbreaks [37], and text automation can significantly improve people’s lives by predicting and quickly reacting to such life-threatening conditions.

Text automation techniques to extract meaningful patterns over time have been successfully developed for detecting such patterns in text streams, and in particular to find trends in Twitter [5, 14, 105, 109, 121, 123, 136, 170, 176, 33]. Most existing real-time text mining applications have been used to detect events or patterns, e.g. to classify whether or not there is a life-threatening event. Nonetheless, text data has a complex nature and vast granularity, and extracting richer information from it can be incredibly beneficial to societies. For example, we can extend such Event Detection techniques to find categories of life-threatening events, actors, and the locations of such events.

¹http://www.baselinemag.com/analytics-big-data/slideshows/surprising-statistics-about-big-data.html

Event Mining is the set of complementary techniques, methods and models to deal with more complex information extraction than mere event identification. However, there is a gap that can be filled when extracting such complex patterns using Stream Mining. Recent explorations show that the field of Stream Mining is one of the most promising fields for dealing with data in real time using online learning, which aims to update a learning model at any time using data in sequential order, as opposed to batch learning, which trains a model using all data at once [28]. This gives the system the facility to both train the model and predict from a given data point almost instantly after the data is collected. First, Stream Mining models use the labelled data instances given at time t_i, which can be obtained from either an automated process or from feedback between the system and the user. Second, the prediction is performed using the learned model at the time of use [24, 54]. In addition, Stream Mining approaches adapt and update the machine learning model over time when the data distribution changes (concept drift) [59], making Stream Mining models feasible to use when new labels are defined, which is usually the case in Event Extraction tasks, in contrast to batch learning, where the learning model needs to be re-trained from scratch to identify the new labels.

The work in this thesis concerns stream mining in near-real time, that is, when the label(s) of a data item are not ready when the instance arrives, but are available before the next instance arrives, so they can be used for retraining after the prediction in a test-then-train fashion. The model trained on the available training data can be retrieved at any time. As a result, the time required to build a model is considerably reduced. To give an example, a stochastic gradient descent algorithm [28] reduces the computation time of classification learning with Naïve Bayes. This is due to the approximation technique used in these methods, which can reduce learning complexity utilising statistical measures. Although research results and algorithms present promising theoretical foundations, the majority of them have been tested over numerical and categorical datasets, and text-based Stream Mining has been tested over single layer classification tasks. Thus, more research is needed to fill the gap between real applicability and theory, especially for more complex text classification tasks.
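
To make the test-then-train protocol concrete, the following is a minimal sketch of prequential evaluation over a text stream, using a scikit-learn-style incremental classifier; the toy stream, feature size and labels are illustrative placeholders, not the setup used in this thesis.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy labelled stream: (document, label) pairs arriving one at a time.
stream = [
    ("militants attacked a convoy near the border", 1),
    ("the festival drew large crowds downtown", 0),
    ("airstrike killed three fighters overnight", 1),
    ("local markets reopened after the holidays", 0),
]

# Hashing avoids a batch vocabulary pass, so each document can be
# vectorised the moment it arrives (single pass, bounded memory).
vectoriser = HashingVectorizer(n_features=2**18)
model = SGDClassifier()
classes = [0, 1]

correct = tested = 0
for i, (text, label) in enumerate(stream):
    x = vectoriser.transform([text])
    if i > 0:  # test first: predict before the label is revealed
        correct += int(model.predict(x)[0] == label)
        tested += 1
    # then train: one incremental update with the revealed label
    model.partial_fit(x, [label], classes=classes)

print(f"prequential accuracy: {correct / tested:.2f}")
```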

Although Stream Mining can be used to detect complex events from text streams, more research can be done in this direction. In particular, we focus attention on Multi-Stream Mining (MSM) for unstructured text data, a new area of research where Stream Mining algorithms are used to work with a multiplicity of information sources. Drift techniques were expected to be applied to detect complex events across different sources of information by looking at the change of stationarity in each source [59]. The initial idea was that a multiplicity of sources should enhance the accuracy of any machine learning model.

More specifically, this research seeks to explore the field of Event Mining using data stream processing techniques and models, to improve the feasibility of applying data mining in near-real time over text streams such as Twitter or newswires, with the aim of giving quick responses to emergencies, applied to socio-political situations. Moreover, this dissertation explores ways to deal with complex multi-task learning as found in other information extraction tasks, initially tested under batch, non-real time scenarios [143, 89, 10, 65], in which researchers looked at extracting events at the sentence level, extracting detailed information such as the who, the what, the where and the when in each sentence, to populate knowledge bases from unstructured text data. In this scenario, events are different from simple topics or categories of information; a topic can even be considered one additional feature of an event.

The motivation of this research is therefore to extend the field of Stream Mining to unstructured multiple sources of information, specifically for detecting and extracting events from text. The idea is to cope with complex Event Detection under multiple data stream scenarios. This is illustrated in the socio-political science domain. This chapter motivates the need for mining events from text in near-real time, to make quick responses to major crises. First, we define what an event is, and what we want to extract from text streams. Second, we briefly review how to mine events from text and the existing techniques to deal with Event Extraction. Finally, we summarise the research questions to be solved in this thesis to deal with Event Extraction over multiple text streams.

1.1 What is an Event? A General Overview

Several definitions of an event have been given over the last decade in a number of areas including machine learning, psychology, and computational linguistics. Philosophical definitions include efforts in formal logic, and the mathematical formulation of language semantics for “situations”, defined by Barwise and Perry [13]. One recent discussion of events in computational linguistics is given by Monahan [108], drawing on the definition given by Quine [128] of an event as a physical fact that happened at some point in history. Therefore, we could argue that an event is detected when an entity changes its state at some point in time, triggered by, or even affecting, other entities.

In Computer Science, the term “event” has been extensively used to refer to digital, structured knowledge extracted from a report of an event at some point in time. Such reports can come in many different forms, from text to video and images. All this data can be represented in Knowledge Graphs (KG), which can be stored in Knowledge Bases [84]. From there, we can extract and store the affected entities, their relationships, and changes of state, or actions.

In our case, an event is usually reported in a text fragment, in the form of a group of sentences, or parts of sentences, e.g. a sentence appearing in a news article. For example (Figure 1.1), the news article “Colombia Farc: Dissident leader Rodrigo Cadete killed in military operation”¹ reports a real-life event. This text contains the named entity “Rodrigo Cadete” as the target of the event, who changes state when killed in a military operation. The reported event also has Colombia as its location, which can also be represented as another named entity. Figure 1.1 shows this example as it may appear in a news article, which might report more related events in the story, found in multiple sentences.

Whether a sentence describes an event is often indicated by the presence of a trigger word, which may also determine the main category of the event. In this case, the trigger is the word “killed”. In addition to the trigger and affected entities, an event has boundaries in time and space. In this example, the Colombian military is the actor of the killing action (the entity executing the event), the target(s) – Rodrigo Cadete in this case – are those affected by the action. The boundaries in time and space are given by the reported dates of the news article, but can also be reported within the text, and can differ from those of the text being reported. Additional sentences can contain further relevant events of interest, or can contain more details of the event reported in the headline. Formally speaking, an event in computational linguistics is defined as follows.

Definition 1.1. Given a text instance x_i, an event e from x_i is a component-tuple ⟨C, A, T, L, D⟩, in which each component is a set interpreted as follows:

• Event types: Event types C is a set representing the labelled classes, or labelled categories, in e, defined in a domain-specific ontology. It might contain more than one class.

¹https://www.bbc.com/news/world-latin-america-47106730

Figure 1.1: Description of a reported killing event in Colombia

• Actors: The actors set A is a set of zero or more named entities which are executing the main action of the event. An actor could be represented as an ontology entity or a named entity contained in x_i.

• Targets: Targets T is a set of named entities representing the targets on which the action is being executed. Similarly to actors, a target is represented either as an ontology entity or a named entity contained in x_i.

• Locations: Locations are given by the set L containing the geographic locations (geographic entities) in which the event is taking place. A location can also be represented as an ontology entity or as a named entity found in x_i.

• Time: This component is a set that represents the date-times at which the event was executed. Event dates can be obtained from metadata or lexical inference, or a combination of both, and the set is represented by D.

Hence, the main idea of Event Extraction is to convert textual, unstructured information into structured form (events) stored in data structures such as knowledge bases [84], in order to be analysed further by human experts, supplying factual information to complement highly detailed and qualitative information.
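
Definition 1.1 maps directly onto a record type in a knowledge base. The following is a minimal sketch using Python dataclasses; the field names and the populated example are illustrative, not a fixed schema from this thesis.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """An event tuple <C, A, T, L, D> following Definition 1.1."""
    types: set[str] = field(default_factory=set)      # C: event categories
    actors: set[str] = field(default_factory=set)     # A: entities executing the action
    targets: set[str] = field(default_factory=set)    # T: entities affected by the action
    locations: set[str] = field(default_factory=set)  # L: geographic entities
    dates: set[str] = field(default_factory=set)      # D: reported or inferred date-times

# The killing event of Figure 1.1, as it could be stored in a knowledge base.
event = Event(types={"killing"},
              actors={"Colombian military"},
              targets={"Rodrigo Cadete"},
              locations={"Colombia"},
              dates={"2019-02"})  # placeholder: date of the news report
```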

1.1.1 Structuring and Coding an Event in a Knowledge Base

Using the structure defined above, events can be obtained and stored in databases as tuples ⟨C, A, T, L, D⟩, in which each component is stored in the event data record. Additionally, any event component can be mapped against a canonical name for disambiguation purposes. For example, we might not know that the killed person “Rodrigo Cadete” refers to a FARC guerrilla member in Colombia. Therefore, by disambiguating the named entity, mapping it to an entity value, we can obtain extra information from the extracted event, making more sense of the data and at the same time facilitating analysis. This disambiguation and mapping task is called event coding.

Event coding is usually done by mapping entities against an ontology concept. Ontologies are powerful mechanisms to describe domain-specific knowledge. There are many examples of domains of interest, such as social conflict analysis as described in CAMEO [145], emergency control topics [37, pp. 1–17], and general news corpus understanding [65]. Domains of interest play a pivotal role in understanding and summarising events, and they are of huge help to analysts looking to identify trends and quickly process large amounts of information.

1.2 Event Mining

Event Mining comprises all the methods and techniques to categorise, extract, reference and synthesise events from diverse text data sources. Useful techniques include Named Entity Recognition (NER), text categorisation, topic modelling, clustering, sentiment analysis and deep learning. More broadly, Event Mining covers all forms of mining information from text data in order to digitise a physical event into manageable forms for handling large amounts of information. Specifically, we focus on the English language, bearing in mind that nouns and verbs are key factors for detecting and extracting an event from text.

We will refer to Definition 1.1 in all Event Mining tasks, including Event Detection, Event Categorisation, Event Extraction, Event Coreference and Event Synthesis. Similar definitions can be found in the Event Data project [145] and the ACE competition [65]. In this way, Event Mining is a set of Information Extraction (IE) tasks to computationally extract structured, useful information from Big Data sources taking into account volume, velocity, variety and veracity, as expressed by Castillo [37, p. 15]. Therefore, there are several Event Mining tasks that can be performed to extract such structured information, as an event is a complex data structure whose extraction comprises many steps to detect and extract each component of the event structure defined above.

1.2.1 Topic Detection

A topic is understood as a category of information to which an event belongs. A topic can contain one or more events, and vice versa. There are a number of publications in which events are treated as topics, differently from our event definition [6, 72]. In this work, Topic Detection and Tracking is the problem of detecting the underlying topic of a document. Several methods have been used for this purpose, including supervised and unsupervised machine learning techniques. Nonetheless, our work differs in that we consider events a richer, more complex structure.

1.2.2 Event Detection and Classification

Event Detection is the most basic sub-task of Event Mining. Similar to topic detection, Event Detection focuses attention on classifying text into a positive or negative class based on whether a text instance contains an event of interest. The main difference with topic detection is that Event Detection goes beyond document-level classification, i.e. it recognises events at the sentence level.

Similarly, Event Classification is the task of classifying a recognised event into a set of defined event types. For example, the Colombian news article can contain multiple events, which can be classified as a “battle”, “military operation” or “killing”. In this way, Event Classification supports the classification of events in a certain domain of interest. These methods are discussed in detail in Chapter 2.
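
A crude baseline for both detection and classification is trigger-word matching over sentences, as in the “killed” example of Section 1.1. A toy sketch follows; the trigger lexicon is hypothetical rather than a published coding scheme.

```python
# Hypothetical trigger lexicon mapping trigger words to event categories.
TRIGGERS = {"killed": "killing", "bombed": "bombing", "attacked": "battle"}

def detect_triggers(sentence: str) -> list[tuple[str, str]]:
    """Return (trigger word, event category) pairs found in a sentence."""
    tokens = sentence.lower().split()
    return [(tok, TRIGGERS[tok]) for tok in tokens if tok in TRIGGERS]

print(detect_triggers("Dissident leader Rodrigo Cadete killed in military operation"))
# -> [('killed', 'killing')]
```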

1.2.3 Argument Detection and Classification

Argument detection and argument classification (sometimes known as role classification) involve the extraction of entities that are part of an event of interest and their underlying relationships. Argument detection in particular is the task of extracting such entities (persons, locations, organisations, dates). Argument detection may use Named Entity Recognition (NER) [37, pp. 43–45]. Different uses of NER are also discussed in other related text mining literature [2].
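
As an illustration, argument detection can be sketched with an off-the-shelf NER tagger such as spaCy, assuming the en_core_web_sm model is installed; argument classification would then assign actor, target or location roles to the extracted spans.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # off-the-shelf English pipeline with NER

doc = nlp("Dissident leader Rodrigo Cadete was killed by the Colombian "
          "military in an operation in Colombia on Saturday.")

# Candidate event arguments: persons, organisations, places and dates.
for ent in doc.ents:
    if ent.label_ in {"PERSON", "ORG", "GPE", "LOC", "DATE"}:
        print(ent.text, ent.label_)
```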

In addition, Argument Classification refers to the entity classification task of recognising which entity performs the role of an actor, target or location in an event. Argument and role classification are a set of key Information Extraction (IE) tasks of Event Mining, given that such techniques extract more fine-grained information from the data that is further used in Event Extraction.

1.2.4 Event Extraction

Event Extraction is the Event Mining task of extracting a whole event tuple as defined in Definition 1.1; that is, Event Extraction deals with all components of an event in any given text, including actors, targets, locations, dates and classes of a particular event of interest. Event Extraction, therefore, is one of the ultimate goals of Event Mining, as it produces a complex data structure that represents the whole event at once [3].

Event Extraction is useful when dealing with human conflict analysis, as information flowing from news and social media contains key insights and specific details that can be analysed thoroughly by experts in order to understand critical situations [37, pp. 5–10]. In these cases, very detailed analysis is needed. This is the case in the Afghanistan conflict, in which internal displacement is being analysed by social scientists using very detailed data such as the percentage of displaced population by year, region, city and ethnicity, and its underlying causes, gathered from multiple sources of information [141].

Figure 1.2: Major events affecting Afghan displacements [141, p. 173]

For instance, Figure 1.2 presents the number of displaced people in Afghanistan per year. The timeline shows a high correlation between major battles, political events and people movement over time. As Figure 1.2 shows, timeline analysis is useful for understanding conflicts; it is therefore important to analyse the temporal aspects of events, which motivates the use of Stream Mining.

1.2.5 Event Coreference Resolution and Synthesis

Event coreference resolution is the task of recognising the same event across multiple sources of textual information. Coreference resolution is useful to de-duplicate events in order to avoid problems as in the GDELT data project [90], in which many duplicates have been found [73, p. 50].

Moreover, apart from duplicates, the problem of data veracity arises when dealing with multiple sources of information. There are problems in the reports, given that there are some discrepancies in reporting the same event. Even so, analysts need to curate and compare event conflict datasets in order to know their level of accuracy with respect to the real event [134, 166]. Hence event synthesis is the Event Mining task of helping the analyst by recommending, or automatically curating, such events to represent a single, unified and consistent version of reality, enhanced from different text data sources.

So far, we have discussed the main Event Mining tasks, in order to correctly map unstructured text data and store a structured event into a Knowledge Base, but not how all these tasks are done. These questions are briefly introduced in the next section, and discussed in detail in Chapter 2.

1.3 Current Event Mining Methods

There are numerous projects to manually annotate events of interest. Humans read, identify and classify events in a set of text sources, usually news articles, in order to extract such information manually. Examples in social science are the ICEWS (Integrated Crisis Early Warning System) dataset [89], in which a group of highly specialised experts map social conflict related events from news articles. Other examples include the KEDS (Kansas Event Data System) dataset [143] and the ACLED (Armed Conflict Location and Event Data) dataset [130].

Information is gathered from a set of specific sources, then curated and released to the public every couple of months. The advantage of using manually annotated event datasets is the accuracy of the gathered events, but the main drawback of these methods is the high cost, in terms of time and readiness, of extracting such information with human annotators. On the other hand, the efficiency of fully automated Event Extraction systems is questionable, as such systems have reported low accuracies on real-world datasets [147, 166].

Computational systems have been developed to automatically extract events from news articles. These systems gather raw text information from news articles on the web, to be further processed by a set of semantic parsers. The PETRARCH system [146] finds tuples of information to identify subjects, verbs and linguistic patterns relating verbs to events and subjects to arguments.

More specialised systems extract events using processing pipelines [65, 93, 90]. First, they identify “trigger words” to be associated with an event category. Second, systems identify entities via an entity recogniser and then classify such entities based on their contextual syntactic and semantic role within the sentence and the trigger word. Finally, pipelines join the components together, recognising and reporting deduced events. Classification approaches usually include MaxEnt classifiers and other statistical models to classify events and to find argument roles.

In general, the following stages have been used to automatically extract events from text (a schematic sketch follows the list):

1. Data ingestion

2. Text pre-processing

3. Event Detection

4. Argument detection via NER

5. Argument role classification

6. Event Extraction
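
As announced above, a schematic sketch of how the stages chain together; every helper here is a toy stand-in for the corresponding component, not any specific published system.

```python
def preprocess(doc: str) -> str:
    return " ".join(doc.split())  # stand-in: a real step tokenises, tags, parses

def detect_event(text: str) -> bool:
    return any(t in text.lower() for t in ("killed", "attacked", "bombed"))

def recognise_entities(text: str) -> list[str]:
    # Toy stand-in for NER: capitalised tokens past the sentence start.
    tokens = text.replace(".", "").split()
    return [tok for tok in tokens[1:] if tok.istitle()]

def classify_roles(text: str, entities: list[str]) -> dict:
    return {"arguments": entities}  # stand-in: a real step assigns actor/target/location

def extract_events(raw_documents):
    for doc in raw_documents:                    # 1. data ingestion
        text = preprocess(doc)                   # 2. text pre-processing
        if not detect_event(text):               # 3. event detection
            continue
        entities = recognise_entities(text)      # 4. argument detection via NER
        roles = classify_roles(text, entities)   # 5. argument role classification
        yield {"text": text, **roles}            # 6. event extraction (assembled record)

print(list(extract_events(["Rebels attacked a convoy near Kabul."])))
```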

Automatic Event Extraction has been found highly valuable by subject matter experts, and over the last five years recent explorations have been directed towards the exploitation of deep learning (DL) mechanisms. For instance, Nguyen [110] developed several information extraction techniques using deep neural network methods, showing significant improvements in the accuracy of different IE tasks, such as argument and role classification, event classification and detection, and Event Extraction. Moreover, Deep Learning models have the advantage of using the latest state-of-the-art machine learning models to deal with the classification stages of Event Mining tasks. However, the neural network models used rely heavily on featurisation steps (or matrix representations), transformed using embedding techniques to represent information as numeric vectors; these have shown interesting results with neural network models, but are computationally expensive to calculate, as they require a large corpus to be analysed by the pipeline during the pre-processing stage [107], making them infeasible to use in near-real time scenarios, in which training is needed when the data arrives.

Recent research has found human feedback useful in the learning process. Human-in-the-loop machine learning is recognised as an important advance in the extraction of events [67], as it joins the computing velocity of event pipelines with the accuracy of human annotators. Heap et al. created such a system [67], which updates a set of rules that can be changed by the user to produce more accurate, faster results.

This research aims to fill the gap between Event Mining tasks and online machine learning suitable for data streams. We define a multi-layered, rule-based Stream Mining model to perform learning tasks in near-real time. The high-level Event Extraction model is explainable (i.e. readable by a non-expert user as a set of logical clauses), which is highly valuable for domain experts (social scientists in particular). The model performs pre-processing in real time, tackling two issues: having a model able to predict at any time without being limited by batch processing, and being efficient in terms of time and memory.

1.4 Near Real Time Event Extraction and Stream Mining Methods

There are many efforts to increase factual information in the form of events to support socio-political conflict analysis and understanding [62], including manual annotation and automatic data gathering. However, manual annotation poses a big challenge, and Subrahmanian notes that data gathering should consider:

“The time required to gather data about a terrorist group, its actions, and contextual variables surrounding those actions so that the data gathered is near-real-time data, not data that is manually collected and is several years out of date by the time the collection is completed; Fine temporal granularity so that the data gathered can be at as fine a temporal resolution as desired (day, week or month) rather than being aggregated to yearly data as in many past studies; Fine-grained quantitative granularity so that instead of merely coding events as having happened (1) or not (0) during a given time frame, we can accurately state how many of a given type of event occurred (e.g. estimated number of bombings during a given time frame, estimated number of fatalities during a given time frame, etc.).” [160, pp. v–vi]

Stream Mining [22, 91] is a framework in which models can be trained as the data arrives, i.e. in real time, allowing practitioners to work with models that can be interactively trained by the user. To perform such tasks, Stream Mining implies the following characteristics (a minimal illustration follows the list):

• Be ready to predict at any time.

• Process infinite instances, while training the model as the data arrives.

• Train the model with a single pass, i.e. if a labelled instance arrives, the model should be updated without being re-trained with the whole dataset.

• Work with a limited amount of memory.
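
As a minimal illustration of these four requirements (not an algorithm proposed in this thesis), the toy incremental Naive Bayes below predicts at any time, updates in a single pass per instance, and keeps only count statistics in memory.

```python
import math
from collections import defaultdict

class OnlineNaiveBayes:
    """Toy incremental multinomial Naive Bayes: anytime prediction,
    single-pass updates, and state limited to count statistics."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def learn_one(self, tokens, label):
        """Single-pass update from one labelled instance."""
        self.class_counts[label] += 1
        for tok in tokens:
            self.word_counts[label][tok] += 1
            self.vocab.add(tok)

    def predict_one(self, tokens):
        """Ready to predict at any time (None before any training)."""
        if not self.class_counts:
            return None
        total = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for c, n in self.class_counts.items():
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            lp = math.log(n / total)
            for tok in tokens:  # Laplace-smoothed token likelihoods
                lp += math.log((self.word_counts[c].get(tok, 0) + 1) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best

nb = OnlineNaiveBayes()
nb.learn_one("militants attacked convoy".split(), "event")
nb.learn_one("sunny day in sydney".split(), "no_event")
print(nb.predict_one("convoy attacked".split()))  # -> 'event'
```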

We apply Stream Mining in a computational social science scenario, addressing a real, complex problem, Event Extraction, as these tasks are highly challenging due to the evolving nature of the data in terms of feature distribution and shifts in class distributions, i.e. evolving conflicts and shifting actors. In addition, these machine learning tasks are highly imbalanced and deal with high-dimensional datasets.

Consequently, an interesting research gap to fill is to test Stream Mining for Event Extraction and Event Mining tasks in general, with a joint framework that can help to explore these challenges in detail using Stream Mining in multi-task learning scenarios, as in the case of Event Extraction. In addition, this thesis will help to build a better picture of socio-political conflicts, their evolving nature, and report bias in different sources of information.

Therefore, more discussion is needed on how stream, or online, machine learning mechanisms might help to achieve Event Mining tasks in general under near-real time conditions. Online machine learning methods are useful when dealing with evolving information. This is the case for text data sources (Twitter and news article websites), as information arrives at varying rates of variety, volume and velocity. In particular, Twitter has been reported useful for understanding mass emergencies [5]. Twitter and similar social media platforms provide real-time APIs that can be exploited for real-time monitoring [158].

Current techniques ingest information in real time (the first stage of Event Mining tasks), but they perform information extraction in batches. If Event Mining tasks were performed in an online mode, then a faster and more interactive way of annotating events could be proposed, exploiting the evolving nature of big data text streams and the human capability for analysis and detailed annotation [144].

1.5 Summary

This research work will therefore address Event Mining tasks under near-real time conditions, addressing online text pre-processing and online learning for text streams. The following research questions are derived from the challenges discussed so far:

1. How can Event Mining tasks be run under Stream Mining scenarios with higher accuracy than batch learning techniques?

2. How can machine learning models for text classification be run over different sources of information using online learning?

3. How does drift affect the accuracy of Stream Mining models in Event Mining tasks?

In summary, this research work proposes a novel framework to perform Event Mining in near-real time. The main application is intended to be used in the social sciences, in socio-political analysis domains. The body of this thesis is organised as follows:

Literature Review: A review of current Event Extraction techniques is given in Chapter 2, in which a number of manual and batch learning techniques, methodologies and systems for Event Extraction are described. In addition, it was found that more research was needed to deal with multiple sources of information under Stream Mining conditions. An initial survey of Stream Mining techniques is also given in Chapter 2; there, a vertical partition of streams is proposed by the Vertical Hoeffding Tree (VHT) [81] algorithm, which suggests not using horizontal partitions. Contrary to other authors, this research suggests the use of horizontal stream partitions to boost accuracy, provided the partition is correctly made by using a statistical metric that reaches a certain confidence in the partition and suggests features to split on.

Stream Event Mining Framework: Chapter 3 presents a novel Event Mining framework for text streams – the SEMF framework – to give mathematical and theoretical grounding to this research work. The SEMF framework defines the main Event Mining tasks to be performed, including Event Extraction and coding, Event Detection, Event Classification, Argument Detection and Argument Classification. Hence, an event is considered a constituent part of an Event Knowledge Graph that uses base knowledge in the form of ontologies [84]. In addition, the framework defines evaluation metrics for the performed Information Extraction tasks.

In addition, it presents an exploratory data analysis of all the main event datasets used during experimentation: the Afghanistan-Pakistan (AfPak) datasets and the ACLED dataset. Moreover, this thesis work developed a novel social conflict dataset for Event Extraction, the new AfPak Twitter dataset, constructed under the same characteristics, procedures and considerations as the Afghanistan-Pakistan News dataset initially used by Heap et al. [67]. All datasets are used for testing our novel Stream Event Mining tasks.

SPLICER: Chapter 4 presents SPLICER, a novel multi-task stream learning algorithm capable of efficient Event Extraction under near-real time constraints. The algorithm uses a rule-based learning procedure not only to extract events, but also to give the user a set of rules to better understand the decisions made by the algorithm. Contrary to Deep Learning methods, SPLICER is explainable, and the user can interact further with the algorithm by deciding which rules can be used or not. Empirical research on both AfPak datasets and ACLED shows the efficiency and reliability of our results on the whole Event Extraction and coding task.

SLICER: Chapter 5 presents a novel stream ensemble to automatically detect when it is better to use multi-streams over single streams for real-time classification tasks. The idea behind SLICER relies on the fact that statistical measures can guide how an ensemble partitions a dataset to obtain higher accuracy; the underlying intuition is sketched below. Our Stream Mining algorithm is capable of automatically identifying when it is worth splitting the dataset into subspaces, and of training and classifying with a boosted ensemble that uses local models to gain better accuracy on the whole learning task.
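
The sketch referred to above is not SLICER itself, only its underlying intuition: compare the class entropy of the pooled stream against the weighted entropy of a candidate partition (e.g. by source), and split only when the information gain clears a confidence threshold. The data and threshold below are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction from partitioning `labels` into `groups`
    (one sublist of labels per candidate sub-stream)."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# Toy decision: split a stream by source only when the gain is large enough.
pooled    = ["battle", "battle", "battle", "protest", "protest", "protest"]
by_source = [["battle", "battle", "battle"], ["protest", "protest", "protest"]]
GAIN_THRESHOLD = 0.1  # hypothetical confidence threshold
print(information_gain(pooled, by_source) > GAIN_THRESHOLD)  # -> True: worth splitting
```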

In addition to single Stream Mining tasks, results are improved by using SPLICER and SLICER together, to make full use of multi-Stream Mining ensembles over Event Extraction tasks in real time. To summarise, the main contributions of this dissertation are the following:

• A novel Event Mining framework to deal with multiple text streams in near-real time (SEMF).

• A new integrity constraint learning model to improve Event Extraction efficiency in near- real time (SPLICER).

• A novel drift-based ensemble to deal with multiple text streams (SLICER).

Chapter 2

Event Mining and Text Stream Mining

The aim of this chapter is to give background on the topic of Event Mining, and at the same time to identify gaps in the literature related to techniques for Event Mining over text streams under near-real time constraints. In summary, we will show that Event Extraction has mainly been done using batch processing techniques in both the pre-processing phase and the learning phase of existing pipelines, and that Stream Mining has been used for detecting and classifying events in near-real time. Also, other related Event Mining tasks such as event coreference resolution and event synthesis have not been addressed using Stream Mining models, and there is a need for more research into how to use online learning techniques in this field effectively.

Initially, socio-political conflict analysis will be discussed, arguing the need for more research on online automation of Event Extraction tasks. Secondly, a review of Event Detection and classification will be given, emphasising important issues to tackle in the area. Subsequently, other Event Mining tasks will be critically analysed, with special consideration of argument detection and classification, and Event Extraction. Afterwards, Stream Mining techniques will be explained and analysed, and their main challenges described.

2.1 Socio-Political Conflict Analysis

Social Conflict Analysis finds its roots in a diverse range of social theories for understanding sociological interactions between individuals and groups of individuals [101], in order to find answers to questions of a sociological nature. Some of these questions are mentioned by Mack et al.:

“Why do serious situations sometimes not develop into violent conflicts while not so serious ones do? ... Why do some conflicts rather quickly run a natural course while others do not? What kinds of group attachments to which men are susceptible (in particular situations) are closely related to well-delineated lines of cleavage in society? What is the effect of size of groups on intergroup conflict?” [101, p 213]

Digging into the definition, social conflict is defined as a social relationship of struggle between two or more (human) actors to gain control or power to achieve each actor's own will, even if this requires diminishing other actors' interests, control and power [127, pp. 7–8]. As a result of these conflicting interacting forces, the actors involved can exert unwanted actions that can lead to violent events, which can potentially escalate to human deaths and life-threatening situations [127, pp. 87–101]. Social conflict analysts therefore strive to understand a variety of social conflicts, usually in the form of inter-state conflicts, with the purpose of mitigating, avoiding or solving them to reduce their negative impact on society. For instance, conflict analysis has helped to understand current mass migrations as a result of social conflicts that impeded people from thriving in certain locations (such as Syria) under unfavourable political, economic and social conditions that have led to human rights violations, making mass migration a consequence of the over-exertion of power by certain actors in the conflict [47, pp. 103–132].

Social theorists strive to create a clear understanding of human society and its social interactions, modelling such power relationships to answer the above-mentioned questions using a range of diverse theories that can lead to the understanding of social conflicts [148, pp. 1–7]. For example, the Afghan conflict has evolved into an inter-state conflict with major government states as main actors, which led to more than 580,000 displaced people between January and December 2016 [141]. Methodologies addressing conflict resolution include the following steps, mentioned in Pruitt et al. [127]:

• Understanding the conflict.

• Analysing actors and their motivations.

• Exploring each actor’s tactics and possibilities.

• Analysing outcomes and probability of escalations.

• Problem solving.

• Mediation.

Some authors went further and proposed early warning detection models and early indicators of mass migration events. For instance, the early warning model proposed in [140] identified the main root causes of mass migration as inequality, poverty and population pressure. There are some caveats in the model, especially regarding how to handle coverage and reliability in sources of information, which is a barrier to quantitative analysis [140, pp. 117–118]; therefore more research effort is needed to increase such source coverage. There are many efforts to increase factual information in the form of events to support socio-political conflict analysis and understanding [62], including manual annotation and automatic data gathering. Automation has thus opened new research avenues by helping researchers to quickly gather and pre-compute large quantities of textual information from a variety of sources [41], including automated methods to summarise conflict information [90]. Nonetheless, information quality is seen as a drawback of fully automated Event Extraction systems [41], and from the social science point of view, selection of event data is a major problem to be addressed when analysing a social conflict [73].

Semi-automatic data collection has recently been explored [67], in which a joint human/machine process enabled researchers to gather event information on the Afghanistan-Pakistan conflict by combining machine learning and human annotation. In addition, comparisons of manual and automated Event Extraction processes [166] were made to assess the relative accuracy of manual and automated collection of events from news articles. Lastly, event prediction has been done on top of automated Event Extraction using Auto-Regressive Integrated Moving Average (ARIMA) time series models, showing better forecasting with more fine-grained spatiotemporal data and external univariates such as drug prices [173].

In conclusion, despite the efforts to analyse and understand social conflicts, there is an opportunity to advance conflict understanding by gathering large amounts of information from diverse data sources, for which fast and highly accurate automated data extraction is highly desired, as social science analysts are looking to make better quantitative analyses from the data observed in a social conflict. In addition, data quality and data reliability are highlighted as a research avenue, given the discrepancies between sources of information.

2.2 Event Mining Systems and Pipelines

There is a large body of knowledge relating to Event Mining tasks. In general, there is a huge amount of work on Event Detection over near-real time text streams, but much more research remains to be done on Event Extraction, coreference resolution, and event synthesis from text streams.

Table 2.1 shows a summary of Event Mining approaches, ranging from Event Detection to Event Extraction and coreference resolution.

Table 2.1: Main Event Mining Systems and Pipelines

Reference | Event Mining Task | Datasets | Main Method
----------|-------------------|----------|------------
Azar [10] | Event Detection | News | Manual
Boschee et al. [89] | Event Extraction | News | Manual
Schrodt et al. [143] | Event Extraction | News | Manual
Petrovic et al. [121] | Event Detection | Twitter | LSH clustering
Becker et al. [14] | Event Detection | Twitter | Clustering and statistical aggregates
McCreadie et al. [105] | Event Detection | Twitter | Scalable LSH
Olariu [113] | Event Detection | Microblogs | Word graph summarisation
Nguyen and Jung [109] | Event classification | Twitter | Signal clustering
Lin et al. [97] | Event classification | ACE Newspapers | Nugget embedding for Chinese characters
Buntain [33] | Event Detection | Twitter multi-streams | Bursty features
Yao et al. [172] | Relation extraction | ACE Newspapers | Narrative extraction via NLP models
Xing et al. [170] | Sub-Event Detection | Twitter | Mutually generative LDA
Zhou et al. [176] | Event Detection | Twitter | Probabilistic modelling
Akbari et al. [5] | Event Detection | Twitter | Multi-task learning
Nguyen and Grishman [111] | Event Detection | ACE Newspapers | Graph CN + arguments
Lowe et al. [78] | Event Extraction | BBC newspapers | Syntactic pipeline
Grishman et al. [65] | Event Extraction | ACE Newspapers | Computing subtasking pipeline
Ritter et al. [136] | Event Extraction | Twitter | LDA for topic generation and Bayesian for categorisation
Leetaru et al. [90] | Event Extraction | Newspapers | Supervised classification
Schrodt et al. [146] | Event Extraction | Newspapers | System syntactic pipeline
Li Q. et al. [93] | Event Extraction | ACE Newspapers | Joint Event Extraction task via MaxEnt
Li J. et al. [92] | Event Extraction | Twitter | Clustering LDA for detection and CRF for component categorisation
Sha et al. [149] | Event Extraction | ACE Newspapers | Regularisation and embeddings classification
Nguyen [110] | Event Extraction | ACE Newspapers | Deep Learning for Event Extraction
Norris et al. [112] | Event Extraction | CAMEO coding | Syntactic NLP pipeline
Huang et al. [70] | Event Extraction | ACE Newspapers | Zero-Shot transfer learning
Bejan and Harabagiu [15] | Event coreference resolution | ACE Newspapers | Dirichlet process
Wei et al. [167] | Event coreference resolution | Newspapers | Basic event clustering
Chen and Ng [39] | Event coreference resolution | ACE Newspapers | Semi-supervised Linear programming
Heap et al. [67] | Event categorisation | News articles, text snippets | Human-in-the-loop ML
Ji et al. [74] | Event synthesis | Newspapers | Position paper

Over the last decades of research on text mining, several techniques dealing with text categorisation were developed. Consequently, two related areas developed at the same time: Topic Detection and Tracking, and Event Detection. We initially describe these two text mining tasks, and lastly review the areas of Event Extraction and resolution.

2.3 Topic Detection and Tracking

Topic Detection and Tracking (TDT) is an area of research in which the goal is to detect topics in a text dataset where, usually, the sources of information are news streams. The research was initially assessed by CMU and UMass in a DARPA initiative to analyse information from several sources [6]. TDT can be described as the set of techniques dealing with topic and Event Detection, obtaining a topic category by relating it to related stories. For example, TDT deals with tagging news articles about a shopping mall attack in Kenya1. In this case, the related news article will probably be tagged as "Nairobi" or "Africa", and perhaps with more categories related to the attack. TDT techniques can also capture the stories of interest mentioned in the news article as a time-based sequence of events. In TDT, it is assumed that text entities are already extracted using an NLP technique and that a dictionary of terms is also given. The main TDT methods are based on unsupervised techniques, since labelling data is costly and sometimes impractical. As a consequence, most of these methods are based on clustering approaches.

Early Event Detection clustering techniques utilised k-means approaches to find the distance between a sample and a selected cluster measure, such as the mean or the centroid of the cluster [6, 87]. Soft techniques have been included so that the method can fit overlapping clusters. Text Stream Mining includes techniques to detect new or changing clusters. This is the case in the work of Aggarwal [1], where clusters are built with k-means methods on a sliding window. A decay factor is used as a forgetting mechanism: the algorithm detects a cluster if more than one example with a similar mean or centroid is observed in a time window, and deletes a cluster if no examples are added after a period of time, where this period usually depends on how many instances are in the cluster. Recently, this method has been rewritten on top of a Stream Processing Engine, Spark2 in this case, and it is – to our knowledge – the first to be implemented in a streaming fashion using a distributed parallel approach. Other algorithms such as BDMO [91, pp. 269–274] use buckets to represent sliding windows, each sliding window being twice the size of the previous one; in addition, BDMO maintains a data structure summarising the main statistical features of clusters. Other clustering algorithms have been developed to analyse information in batch mode, such as Gaussian mixtures or LDA approaches [4, 26], but none of these methods responds to the challenges of either stationarity or dealing with multiple text sources.
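To make the forgetting mechanism concrete, the following is a minimal sketch (not Aggarwal's actual implementation; the distance threshold, decay rate and minimum weight are illustrative parameters) of an online centroid-based clusterer in which cluster weights decay exponentially and stale clusters are deleted:

```python
import numpy as np

class DecayingOnlineClusterer:
    """Sketch of an online clusterer with an exponential decay factor
    acting as a forgetting mechanism; parameter values are illustrative."""

    def __init__(self, threshold=0.5, decay=0.99, min_weight=0.01):
        self.threshold = threshold    # max distance to join an existing cluster
        self.decay = decay            # per-instance forgetting rate
        self.min_weight = min_weight  # clusters below this weight are deleted
        self.centroids, self.weights = [], []

    def update(self, x):
        x = np.asarray(x, dtype=float)
        # Decay all cluster weights, then drop stale clusters.
        self.weights = [w * self.decay for w in self.weights]
        keep = [i for i, w in enumerate(self.weights) if w >= self.min_weight]
        self.centroids = [self.centroids[i] for i in keep]
        self.weights = [self.weights[i] for i in keep]
        # Assign to the nearest centroid or open a new cluster.
        if self.centroids:
            dists = [np.linalg.norm(x - c) for c in self.centroids]
            j = int(np.argmin(dists))
            if dists[j] <= self.threshold:
                self.weights[j] += 1.0
                # Move the centroid towards x (incremental mean update).
                self.centroids[j] += (x - self.centroids[j]) / self.weights[j]
                return j
        self.centroids.append(x.copy())
        self.weights.append(1.0)
        return len(self.centroids) - 1
```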

1https://www.theguardian.com/world/2013/oct/04/westgate-mall-attacks-kenya
2http://spark.apache.org/

On the other hand, the mainstream computing engine for big data is Hadoop [156, pp. 33–35], a relatively new distributed processing engine capable of processing terabyte-sized sources of information in a distributed fashion. Recently, however, Stream Processing Engines (SPE) have been receiving more industrial and academic attention, since SPEs are more suitable for analysing streamed data, understood as terabyte-calibre datasets arriving at thousands of rows per second. Apache Spark [156, pp. 73–81] and Apache Flink3 show terabyte-scale computational processing.

An enhancement going beyond the topic-as-an-event assumption is the work done by Bejan et al. [16]. In their work, an event is detected and an event structure is inferred to link events in a graph structure. Documents are represented as a mixture of Dirichlet probabilities in which, from each document, a finite set of events can be extracted and represented as hidden random variables. Although the model improved classification accuracy, the detection and selection of the values of each hidden parameter is complex and not fully explained, and high dimensionality in large text sets could lead to time-consuming tasks. Finally, the scope of this work assumes that all documents are from the same text corpus; therefore the method may lead to misclassifications due to noise in the training corpus.

With respect to classification methods, trees such as the Very Fast Decision Tree (VFDT) [71] can be used to model categories in real time, since the algorithm operates in a streaming fashion. More widely used techniques to categorise information are supervised novelty detection techniques such as in [171]. Krzywicki and Wobcke proposed a supervised method to categorise e-mails exploiting concept clumping [83]. In their work, they used methods to detect change in the categorisation of text corpora, relating drift and non-stationary techniques to the categorisation of information into closely related groups of documents (concept clumping). The techniques of local term boosting and simple term statistics showed results comparable to the benchmarked batch learning models, but dramatically reduced the running time (from 5 hours to 3 minutes) in the case of e-mail classification [82]. The clumping technique can be extended to address stationarity in complex event series.

The most important challenge in TDT is to distinguish the concept of an event from the concept of a topic or category. Some recent techniques link events together as groups of events, but these are still handled as topics enhanced with link structures. Events should improve the accuracy of topic detection tasks if they are better explained and used further

3https://flink.apache.org

as feature inputs. In addition, recent TDT systems such as ESA from CSIRO intuitively show a trend line obtained by counting and grouping topic-related tweets, the relation being easily made via the hashtag term. Nonetheless, more research is needed to distinguish the two concepts and relate them to each other. A topic evolves over time, but an event does not, since it is something that happened in the past; an evolving graph timeline can be used to link the evolution of topics and their relationships. In addition, complex events can be detected effectively with drift techniques.

Although k-means clustering has shown good results, the main problem is to efficiently choose a good value of k. Interestingness and weighted measures related to the features of related events can lead to a better choice of the number of clusters. In addition, efficient techniques to detect links, or hierarchical clustering, should enhance inference beyond bag-of-words techniques.

In addition, variants in the parallelisation technique would lead to better performance in both time and effectiveness. More research can be done on when it is useful to implement a vertical or a horizontal parallelisation approach, i.e. whether parallelisation is best done by distributing a subset of examples to each node, by distributing a whole dimension to a node and then combining the answers, or by a mix of both approaches.

To summarise, TDT is a relevant area for topic detection from multiple document corpora. Nevertheless, there are several challenges to tackle, including handling the resolution of duplicates and complementary information from multiple sources, addressing scalability and fast processing, handling non-stationarity of information, and the enhancement and extraction of whole events as given in the definition in Section 1.2.

2.4 Event Extraction Systems

With respect to Event Extraction, there is less research work on real time text streams. There are also some discrepancies in the research community on the term "Event Extraction" [128, 6, 13, 108] and in cognitive science [129], as some of the literature (mentioned in Section 2.3) belongs to the category of TDT, being mainly focused on identifying events rather than extracting complex structures. This matters given Section 1.2, where the main idea is to help human analysts efficiently extract event components from text sources, especially text streams. In this research,

Event Extraction from text streams is defined as:

Given an infinite text stream X = {x_1, ..., x_t, ...}, discretised at a text fragment level (i.e. a set of sentences or parts of sentences), in which each instance x_i arrives at time index i, extract a set of events E_i = {e_i1, e_i2, ..., e_im} from x_i. Each event e_ij is an instance of the form e_ij = ⟨C, A, T, L, D⟩, as presented in Section 1.1.
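To make this definition concrete, a minimal sketch of the event structure follows; the field names are illustrative choices, and `extract_events` is only a placeholder signature for the task defined above, not an implementation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    """One extracted event e_ij = <C, A, T, L, D> (illustrative field names)."""
    category: str      # C: action or event category (the "what")
    actors: List[str]  # A: acting entities (the "who")
    targets: List[str] # T: affected entities
    location: str      # L: the "where"
    date: str          # D: the "when"

def extract_events(text_fragment: str) -> List[Event]:
    """Placeholder: map one stream instance x_i to E_i = {e_i1, ..., e_im}."""
    raise NotImplementedError
```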

To solve such a complex information extraction problem, several techniques have been developed over the last decade, mostly in the form of system pipelines in which the Event Extraction problem is divided into several sub-tasks. Table 2.1 above shows a summary of the surveyed techniques.

2.4.1 Manual Event Extraction

Initial event coding attempts were made manually in projects such as WEIS [104], a joint collaboration under the ICPSR (Inter-university Consortium for Political and Social Research). This work, the World Event/Interaction Survey (WEIS) project, manually recorded daily events reported in the press as interactions between countries of interest, in order to examine their responses to political violence events and counteractions. The project consisted of a manual procedure for annotating times, locations, actors, targets and classifications from news articles. Similar manual annotations were made in [10], in which the authors looked for similar patterns in the data. More recently, similar approaches are found in ACLED1 and the Uppsala conflict dataset2, as mentioned in Chapter 1. All of these datasets extract detailed event-based information. Differences between the reported datasets are driven by additional attributes such as latitude, longitude and additional geographic information. They can also differ in reported sources, as some datasets are focused on certain areas; this is the case for ACLED, which has more detail on Latin American conflicts than other datasets.

Going into the details, the WEIS dataset includes events from 1966 to 1978 reported in the New York Times, comprising nearly 100,000 annotated events. The main drawback of the dataset is its inherent bias towards certain annotation patterns found in the reporting source (NYT), as this newspaper tends to publish certain types of events with greater frequency

1https://www.acleddata.com
2http://ucdp.uu.se

than others. In contrast, this is not the case for the ACLED or Uppsala datasets, for which analysts gathered information from different news sources for validity and completeness.

Although current manual annotations are reliable and accurate, there are some discrepancies between datasets. Recently, a Colombian dataset – the Database of the Armed Conflict in Colombia (CERAC) – was compared against the UCDP (Uppsala) dataset, and it was found that CERAC shows more richness in event granularity, making such datasets more expressive than global ones, as the annotation captures details and local sources of information not captured by UCDP, reporting more events (measured by killings) than its counterpart [134]. In addition, more targeted quantitative event datasets have been created, such as the Global Terrorism Database (GTD). GTD is much more detailed in terms of event categories and ontology items, particularly for terrorism analysis [88].

Analysts are currently looking for more granular annotations from local text sources, but this task is extremely labour-intensive and slow to perform. For this reason, some researchers have been looking for faster and more accurate automation methods for Event Extraction, to ease and support the annotation process [41, 142].

2.4.2 Event Detection

Event Detection can be seen as the evolution of Topic Detection and Tracking, in the sense that rather than detecting topic categories from text instances, Event Detection focuses on detecting whether an event occurred or not. Work related to Event Detection can be found in Becker et al. [14], in which a technique for detecting events from Twitter was devised using an online clustering technique together with Twitter metadata used as additional features. The clustering technique uses a threshold τ and a similarity function to map each message against a set of clusters 1, ..., k. If the similarity σ(m_i, c_j) is greater than τ, the message is assigned to the cluster c_j with maximum similarity; otherwise a new cluster c_{k+1} is created. Event Detection is useful when dealing with data streams: as stated in Chapter 1, event monitoring in near-real time may enable quick response to violent events, preventing loss of life or further violent escalations, and it has been heavily researched on social media datasets [72].
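A minimal sketch of this assignment step follows; σ and τ are the notation above, while the use of cosine similarity over message vectors is an assumption made for illustration:

```python
import numpy as np

def assign_message(m, clusters, tau):
    """Assign message vector m to the most similar cluster centroid,
    or create a new cluster when no similarity exceeds tau."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    if clusters:
        sims = [cosine(m, c) for c in clusters]
        j = int(np.argmax(sims))
        if sims[j] > tau:
            return j            # assigned to existing cluster c_j
    clusters.append(np.asarray(m, dtype=float))
    return len(clusters) - 1    # new cluster c_{k+1}
```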

Real time Event Detection on Twitter is found in [105], in which a distributed event detection pipeline was built to perform parallel local key group hashing, which was then used to

build a global cosine-distance document calculation for k-means clustering, detecting event categories based on a parameter θ used to create new categories. A recent review of Event Detection for critical events in real time is given in the survey by Imran et al. [72]. The review gives a broad overview of the most relevant systems for detecting and tracking events in text streams. In general, it is noticeable that research in this area was mainly driven by domain-specific applications, which enhance prediction accuracy because the models have background knowledge represented as additional features [153] or embedded representations, such as concept inference featurisation using ontology patterns as in [161], particularly in the case of Twitter events. The surveyed systems identify burst keywords or "event mentions" using clustering techniques, and there was also a need for geotagging, which was mainly provided by Twitter metadata in the form of JSON files containing features such as coordinates, country of origin, and city [158]. For the machine learning techniques, data pre-processing was done using standard NLP procedures such as tokenisation, TF-IDF, part-of-speech (POS) tagging, semantic role labelling, dependency parsing and named entity recognition (NER). In the case of tweets, hashtags were also identified, and they played a major role in the Event Detection task.

Another practical system, Emergency Situation Awareness (ESA), was developed by the Commonwealth Scientific and Industrial Research Organisation (CSIRO) of Australia to detect bursty topics on Twitter and enable fast response in emergency situations. A timeframe for a given topic is provided by the platform, and real time analysis is given with an intuitive GUI3. The contribution could be extended to improve the hierarchical representation of topics, and it is worth noting that the platform includes only Twitter data. Another important contribution is the system of Petrovic et al. [122, 121, 123], directed towards detecting topics on Twitter and relating them to real events. In their work, techniques such as locality-sensitive hashing were used to perform this operation, with scalability achieved by using a stream processing engine.

There is a large research community working on Event Detection, and this research is the most closely related to our work; there is an extensive list of literature in the area, including whole Event Detection systems and some datasets, as found in [145, 65, 14, 136, 93, 90, 166, 92, 170, 5, 149, 176, 119]. Much of this research concerns domain specific Event Detection, and Twitter has been used more for the identification of large events than for fine-grained Event Extraction. In addition, current research evaluates models using either cross-validation or train-test validation, but not prequential

3More details at https://esa.csiro.au/

test-then-train evaluation, a common Stream Mining evaluation method [21, pp. 12–15]; therefore more work is needed on performing experimentation using prequential metrics, which are designed to evaluate Stream Mining models.
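For reference, a minimal prequential (test-then-train) loop is sketched below; the `predict`/`learn` interface of the incremental model is a hypothetical one, not a specific library's API:

```python
def prequential_accuracy(stream, model):
    """Prequential evaluation: predict each instance first, then train on it.
    `stream` yields (x, y) pairs; `model` is assumed to expose predict(x)
    and learn(x, y) (hypothetical interface)."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:   # test first...
            correct += 1
        model.learn(x, y)           # ...then train on the true label
        total += 1
    return correct / max(total, 1)
```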

In addition, more recent Event Detection work includes the use and creation of multi-relational semantic graphs for storyline Event Detection [167], the use of Hadoop for Event Detection [109], and domain specific Event Detection [5] via loss function optimisation and regularisation. Other related tasks have arisen, such as sub-event discovery via LDA [170] and sub-event discovery and credibility on Twitter [33, 109].

2.4.3 Real-Time Event Detection

Regarding real time event detection systems, a domain specific approach by Li et al. [92] used a system pipeline with unsupervised and supervised methods to extract event categories and an event property, namely the event entity related to the event. Their system initially extracted the main keywords for topic extraction using Latent Dirichlet Allocation (LDA), and then performed Conditional Random Field (CRF) training. Although testing and training were done on Twitter data, the models were not run in an online setting with prequential evaluation. Another approach is that described by Ritter et al. [136], in which the authors used a similar feature representation on Twitter data [92]; this work also used batch training-testing scenarios in which Stream Mining was not used.

Perhaps the first online Event Detection pipeline system was that of Olariu [113], efficiently retrieving event keywords from tweets. More recent research can be found in [5], but the authors focused on detecting and categorising events rather than extracting the more fine-grained structure required by the discussion in Section 1.1. Therefore, most of the methods extract only what we call the event type, whereas we are interested in extracting more detail: the event components, which are the named entities selected as actors, targets, locations and dates.

Twitter datasets offer a more efficient approach in both processing time and memory consumption, have been shown to be useful for detecting early event trends [69, 176], and the quality of the information can be highly accurate if the sources are carefully chosen. On the other hand, and as seen in previous chapters, news articles provide more granularity in the information.

2.4.4 Automated Event Extraction and Coding

Event Extraction and coding methods are essentially system pipelines defined as a set of tasks that first extract entities of interest, and then classify and build an event based on the extracted entities and their relationships. The Kansas Event Data System (KEDS) [143] and TABARI [146] were among the first automated Event Extraction systems; they used shallow parsing to identify actors and targets. The Global Data on Events, Location, and Tone (GDELT) [90] is the successor of TABARI, and it extracts events to be coded with the Conflict and Mediation Event Observations (CAMEO) coding schema [145]. Event coding schemas were therefore important in the identification and definition of the coded events and their annotation and validation methodology. Other annotation guidelines, such as IDEA [27], were also made for Event Extraction and coding.

The Integrated Crisis Early Warning System (ICEWS) system and dataset [89] were made in order to generate an event early warning system. The system was supported by DARPA and is operational at Lockheed Martin facilities4, with very similar features: it performs pre-processing via NLP techniques, and uses Bayesian models and statistical analysis for event coding and categorisation [144]. Recent versions include the Python Engine for Text Resolution And Related Coding Hierarchy (PETRARCH) event coder [146], written in Python.

One of the main drawbacks of using a single version of coding events is the issue of the event schema definition [134]. Depending on what the analyst wants to measure, the system will allow such event coding but will miss some local characteristics of the dataset. For example, ICEWS and GDELT were compared by Ward et al. [166], finding similar tendencies between both datasets. In the study, ICEWS was recognised as more accurate than GDELT, but the latter offers more reported events, which are sometimes missed in the ICEWS dataset. In [147], claims are made for more accuracy in the methods. The main drawback of these automated methods is their availability for reproducing experiments, as they publish data but the methods are not publicly available.

Nonetheless, current Event Detection methods are motivated by general news text corpora, in tasks for which a more detailed quantitative analysis should be made. This is the case for the ACE

4https://www.lockheedmartin.com/en-us/capabilities/research-labs/advanced-technology-labs/icews.html

dataset, which was built to extract detailed events from general news corpora [3]. In contrast, social scientists are looking for more fine-grained explanations of events in order to predict new violence outbreaks, and also to find causal relationships between them. However, a more fine-grained set of event categories naturally increases the class imbalance of the data, making it difficult to classify accurately with current state-of-the-art techniques. For instance, Pavlick et al. [119] benchmarked recent Event Detection techniques on the gun violence dataset, a domain specific dataset for gun violence in the US, and low precision and recall were found using the best techniques reported in the Automatic Content Extraction (ACE) competition, reaching only 30.2% precision and approximately 20.1% recall on target identification.

One of the earliest attempts at Event Extraction, very similar to event coding, can be found in the work of Grishman et al. [65] in the ACE competition and the JET system presented by Ji and Grishman [74]. Their task was to recognise a set of events from a defined set of news articles. Generally speaking, the JET, GDELT and ICEWS systems use a multi-task pipeline in which the following computing tasks are performed (Figure 2.1):

Figure 2.1: General event extraction and coding pipelines [147, p. 24]

• Split each article into sentences, and tokenise them using a featurisation method via NLP processing such as TF-IDF.

• Identify a trigger action word based on hand-made pattern recognition.

• Identify all the possible arguments of a likely event via MaxEntropy [74] or another statistical classification technique.

• Identify argument role labelling via a statistical approach or MaxEntropy (JET).

• Identify the event using the previous features via a statistical approach or MaxEntropy (JET).

Other similar systems were developed using a similar pipeline approach. Li et al. [93] made a joint Event Extraction system whose idea was to identify all event components at the same time, using a structured perceptron with a decoding algorithm. The decoding algorithm is based on global and local features; local features are used in sub-task classification through additional functions (trigger and argument detection and labelling). Reported F1 values were around 67.5% for trigger classification, 56.8% for argument identification and 52.7% for argument role classification. Although the authors claimed this to be the first joint Event Extraction system, no specific event metrics were used to validate whether or not the system recognises a whole event; instead, sub-task classification tables were presented.

The Regularization-Based Pattern Balancing Method (RBPB) system [149], on the other hand, balanced the pattern recognition task by using a Support Vector Machine (SVM) classifier. The authors used pattern, sentence and trigger embeddings as features and, in addition, embedded positive or negative argument correlations in the final step, regularising arguments and role relationships. This approach is one of the best known event extraction methods for the ACE competition dataset, with 68.9%, 61.2% and 53.8% F1 on trigger classification, argument identification and role classification respectively. However, it is unclear how RBPB would work for detecting whole events, and also how it might work on an imbalanced or domain specific event dataset.

Regarding domain specific event extraction systems, the GDELT system5 [90] and PETRARCH6 [112] were used to identify conflict related events. One of the most emblematic Event Detection systems is GDELT7, which has contributed to several social science analyses [173]. Nonetheless, the quality of the extracted events is still far from perfect, and duplicated information was found to be a relevant issue. Duplicated events have the same actors or affected entities, yet there are cases in which the information varies from event to event: for instance, an event taken from Al Jazeera and at the same time taken from Dawn.com could differ in the number of casualties or even the perpetrators of a bombing attack. Moreover, GDELT's computing techniques are not public, making it difficult to benchmark against other published methods and techniques; only the results or outputs can be obtained from the system. In addition, specific local violence events

5https://www.gdeltproject.org
6http://github.com/openeventdata/petrarch2
7http://gdeltproject.org

are barely contained in the list of sources scraped by GDELT. On the other hand, a purely human annotated dataset is very resource-expensive, and the news sources are costly to obtain in order to replicate results [41, 73]. These approaches were tested on domain specific datasets such as the gun violence dataset provided in [119], obtaining approximately 39% precision and 23% recall on entity role classification tasks, compared with the previously reported 64.7% precision and 53.5% recall on the Automatic Content Extraction (ACE) competition dataset.

On the other hand, domain specific Event Extraction has been found particularly interesting for analysts, as they require a greater level of detail in order, for instance, to understand the very details of a socio-political conflict, a scenario in which ontologies are useful for representing information. For instance, the CAMEO (Conflict and Mediation Event Observations) [61, 145] ontology represents events from news sources, classifying information into hundreds of conflict-related categories and sub-categories. Moreover, stream-based approaches, although efficient, have been tested in less complex learning scenarios, mainly for Event Detection and text classification as in [14, 136, 105, 123, 72, 109, 5, 176]. Therefore, it is imperative to explore the area further and create novel online event extraction algorithms and alternative system pipelines. Moreover, reported results were presented on separate event subtasks, such as event type classification and entity role classification (actor, target, location), rather than as a single metric able to capture how well the procedures perform on the whole event extraction task. More recent explorations consist of human-assisted Artificial Intelligence, using a joint human-machine coding process to categorise events; for example, Heap et al. [67] devised a method to perform event classification (which can be extended to text classification) using human annotation feedback to the model after the learning task, in order to reduce error propagation. Results outperformed current coding systems by reducing misclassified sentences. Although the human-machine coding process outperforms current categorisation systems, the whole event extraction task was not assessed in that research.

Lastly, the most recent works have used the power of neural networks for Event Mining tasks [110]. In Nguyen's work [110], exhaustive experimentation with neural network methods for relation extraction, entity linking/coreference resolution and Event Extraction was undertaken. Experiments were done with a number of different mechanisms, such as recurrent neural networks and convolutional neural networks, suggesting better results with the use of deep learning for such Event Mining tasks. However, the work was done under batch learning scenarios, leaving open

the question of Event Mining under stream conditions. Hence, research opportunities arise in dealing with humanly understandable rules, as proposed in this thesis, and under online learning constraints.

Regarding online event extraction techniques, relevant challenges arise in dealing with structured Event Extraction, especially online Event Extraction. From the investigated literature, few techniques deal with Event Extraction for fast datasets [136, 92]. However, these systems do not extract all the components defined in Section 1.1, and their learning models are trained in batch mode. Research contributions on Twitter such as [135, 136, 92], and other contributions such as [5, 33, 72, 170, 176], extract the event defined as a category, which is very similar to a topic detection task.

In addition, there is less research on domain specific online Event Extraction, requiring more research on class-imbalanced datasets, which are especially common in event analysis. Finally, more research could be done on adding external information to the incoming instances from multiple sources of information, e.g. an ontology of common domain-specific knowledge such as locations, entities and event-specific actions.

2.4.5 Event Coreference Resolution

Event coreference can be understood as the task of finding duplicates, repeated references, or event reports extracted from text that refer to the same physical event. The learning task is: given one or more streams of events, match equivalent events, i.e. determine whether two separate event reports describe the same physical, real event. This task can be formally defined as a numeric task returning a similarity matching number in [0, 1], with 1 a full match and 0 a nil match between two particular event reports.
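A minimal sketch of such a [0, 1] matcher over the event structure ⟨C, A, T, L, D⟩ (reusing the Event sketch shown earlier) is given below; the component weights and similarity choices are illustrative assumptions, not a method from the literature:

```python
def coreference_score(e1, e2, weights=(0.3, 0.25, 0.25, 0.1, 0.1)):
    """Return a similarity in [0, 1]: 1 = full match, 0 = nil match.
    Component weights are illustrative, not taken from the thesis."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 1.0
    parts = [
        1.0 if e1.category == e2.category else 0.0,  # C: event category
        jaccard(e1.actors, e2.actors),               # A: actor overlap
        jaccard(e1.targets, e2.targets),             # T: target overlap
        1.0 if e1.location == e2.location else 0.0,  # L: location
        1.0 if e1.date == e2.date else 0.0,          # D: date
    ]
    return sum(w * p for w, p in zip(weights, parts))
```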

There is only a small amount of research concerned with event coreference resolution. From the literature, we can highlight that there are unsupervised, semi-supervised and supervised approaches. Bejan and Harabagiu [15] applied an unsupervised Bayesian hierarchical Hidden Markov Model to validate event mentions on the ACE dataset. In addition, they presented the ECB8 – now ECB+ – dataset, an event cross-document referenced dataset built from several news articles. In their

8http://www.newsreader-project.eu/results/data/the-ecb-corpus/

work, they assessed different parametrised versions of the hierarchical Markov model, and found a model able to automatically select feature values. On the other hand, Wei et al. [167] presented a heuristic approach to handle this learning task via semantic graphs. Their algorithm works with a semantic similarity metric to assess how similar certain event mentions are. Their work did not use event components, but rather the whole labelled sentence, to assess such semantic similarity over a subset of the TREC corpus.

In addition, Chen and Ng [39] utilised a joint Integer Linear Programming approach to build relations between all event component candidates and validate how similar a pair of event components is using a linear function. Their results surpassed the unsupervised approach using a non-extensively annotated corpus in the Chinese language, encouraging the use and extension of such approaches in our domain. Recent work has used similar computations for other languages [42].

2.4.6 Event Synthesis and Other Tasks

Event Synthesis

Event Synthesis can be defined as follows: given a pairwise set of matched events {⟨e_a, e_b⟩, ..., ⟨e_n, e_m⟩}, each event containing its own known labels L^e_i = {l_1, ..., l_z}, the idea is to select the true label for each feature amongst the matched events to obtain the final version of the real event te_i with the true labels TL^e_i = {tl_1, ..., tl_z}.
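As a simple illustration (majority voting is an assumed strategy here, not a method proposed in the literature above), label selection across matched reports of the same physical event could be sketched as:

```python
from collections import Counter

def synthesise(matched_events):
    """Given matched reports of the same physical event, pick the most
    frequently reported value per label slot (majority-vote sketch;
    merging of list-valued slots such as actors/targets is omitted)."""
    slots = ("category", "location", "date")
    true_labels = {}
    for slot in slots:
        votes = Counter(getattr(e, slot) for e in matched_events)
        true_labels[slot] = votes.most_common(1)[0][0]
    return true_labels
```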

Relating to event synthesis, we can highlight the work of Ji et al. [74] as an initial discussion on how to generate and evaluate such a cross-event extraction task. Peyrard et al. [124] focused on a strategy to summarise multiple documents within 100 or 200 words, optimising a recall-based metric – ROUGE – to obtain the highest n-gram overlap between multi-document summaries. The dataset used was DUC9; however, this dataset is not publicly available, being accessible only to invited institutions, and the task was related to sentence summarisation.

Efforts made by Kedzie et al. [77] were directed at building sequential Event Detection updates, with a system based on a probabilistic approach. This is slightly different from our problem, in the sense that we are more interested in returning an updated version of a structured event rather than

9https://www-nlpir.nist.gov/projects/duc/

returning a new summary update.

From the event synthesis task, it can be concluded that more researchers are being drawn to the topic, which has hitherto mainly concentrated on event summary updates and has not addressed event synthesis and summarisation. Hence, drift detection techniques might be needed to evaluate the impact of event updates over time, supported by multiple sources of information. More research can explore the benefits of multi-Stream Mining techniques in this area.

Other Event Mining Tasks In addition to event synthesis, recent research has explored event inference as a novel task to be addressed in Event Mining [132]. The idea is to infer future states, or event states, from a given event. The proposed approach was an encoder-decoder method, as used for word embeddings, using a neural network to reach a certain chain of states.

From event coreference resolution, we have identified that, to the best of our knowledge, there is no research on online event coreference resolution. Unsupervised approaches using event similarity metrics might be used to handle event representation in a vector space [53], which can properly identify distances between reported events. In addition, it might be interesting to evaluate the evolution of an initially reported event by using drift detection techniques.

It might also be useful to validate differences between reported events in order to assess data quality across the reported datasets. This is of particular interest in the socio-political conflict analysis field, given that researchers rely on the data to assess delicate topics. Therefore, event coreference can be applied to find such discrepancies and validate them either manually or through automated event synthesis.

2.5 Stream Mining Algorithms

In this section, the main techniques in Stream Mining are discussed, showing that there is a research gap in the area of Stream Mining techniques and algorithms dealing with multiple sources of information, especially for unstructured text data. More research is needed to address the lack of stationarity in streaming textual information. Better Stream Mining approaches will benefit several important areas of text analytics, including entity recognition, relation extraction and topic

modelling. This review is focused on drift detection methods, since those methods are particularly relevant in the National Security domain, e.g. detecting early complex events from several information sources or forecasting a terrorist attack from open data. This is because data sources often shift probability distributions over event categories, and the relevant actors also change from time to time. For instance, new event categories might appear months after the first definition, and new labelling might appear and change the data distribution over time. Also, new persons of interest can emerge, such as a change in the head of a terrorist or government organisation. Therefore, Stream Mining techniques have to address shifts in information. In addition, analysts tend to change the way they analyse information, and new categories will eventually be inserted. New category discovery is known as novelty detection in the Stream Mining community [49].

In Stream Mining, as defined in [114, 22, 57], the idea is to perform a learning task, either supervised or unsupervised, in an online fashion. That is, given a stream instance x_i arriving at time i, the task is to classify or label x_i with a predicted value y*_i = H(x_i), where H(x_i) is a hypothesis function based on the instance x_i. In addition, H(x_i) needs to be updated incrementally in one pass, avoiding the use of batch sets of information, because Stream Mining algorithms maintain sufficient statistics in memory to calculate the required function H(x_i). Stream Mining assumes that such a learning task can potentially be larger than a computer's memory, and the learner should return the most accurate prediction possible at any time. Hulten and Domingos [71] initially developed the idea of incremental learning, and their foundations paved the way for Stream Mining algorithms. A comprehensive view of different Stream Mining algorithms and their uses is given in [21].

2.5.1 Drift Detection and Stationarity

Drift is defined as a change in a probability distribution of a population over time. Since the main challenge in Stream Mining is to maintain a stable measure of efficiency, drift is important to detect: the idea is to maintain a relatively constant predictive performance at any time, so that the algorithm can keep relatively stable prediction measures throughout its lifetime. In addition, Stream Mining methods need to be efficient in terms of computational time. To address this, the concept of drift was proposed and is now being extensively studied by researchers.

By definition, drift is a way of measuring and detecting any change in the stability of a distribution in a data sample over some period [59]. This is typically addressed by comparing the joint distributions of samples at different points in time; if the joint probability functions are different, then drift is detected. The formal definition is given by the following formula:

∃X : p_{t0}(X, y) ≠ p_{t1}(X, y)    (1)

where X represents the input attributes and y the predicted class or target variable, at time points t0 and t1. By Bayesian probability laws, the fact that the joint distribution changes over time implies that the prior probabilities p(y) and p(X) may change over time, and then the predictive probability p(y | X) may be affected, reducing the effectiveness of the predictive model. Virtual drift occurs when drift is detected but p(y | X) is not affected, i.e. although the joint distribution changes, this does not affect the predictions; real drift occurs when the change in the joint distribution does affect the prediction. A review of drift detection methods is given by Gama et al. [59]. The main approaches can be divided into online and windowing approaches. It is important to note that the data distribution may change in different ways, so several types of drift might occur, such as sudden, recurring, gradual and outlier (noise) drift. Typically, a mix or a variation of these types is observed in real-life data. To address this challenge, several techniques have been developed by researchers.

Regarding drift distribution techniques, Page [116] introduced a cumulative sum algorithm (CUSUM) to detect changes in the probability of statistical parameters. CUSUM resamples the parameter – the mean is usually taken as an effective parameter – with every arriving instance and counts the error according to a defined cumulative sum rule. The main idea is to accumulate the changes in the parameter using a cumulative sum that grows when the process changes significantly. If the difference between the true value and the predicted value is greater than zero, CUSUM carries this value forward into the calculation for the next instance arriving at time i + 1. If the cumulative value exceeds a defined threshold, the algorithm considers the current instance a significant change. CUSUM tends to be a sensitive detector, and the threshold has to be defined before the process takes place.
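A minimal one-sided CUSUM sketch follows; the slack `k` and threshold `h` are parameters that, as noted above, must be fixed before the process starts, and the values shown are illustrative:

```python
class Cusum:
    """One-sided cumulative-sum change detector (textbook sketch)."""

    def __init__(self, k=0.05, h=5.0):
        self.k = k        # allowed slack per observation
        self.h = h        # decision threshold, fixed in advance
        self.s = 0.0      # running cumulative sum

    def update(self, residual):
        """residual = observed deviation at time i; returns True on change."""
        self.s = max(0.0, self.s + residual - self.k)
        if self.s > self.h:
            self.s = 0.0  # reset after signalling a significant change
            return True
        return False
```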

Gama [57] created the Drift Detection Method (DDM) as an automatic drift detection mechanism.

The algorithm performs online learning and calculates the current classifier's error rate. If the error rate exceeds a warning threshold, DDM buffers the incoming instances until the error rate reaches a drift threshold; it then trains a new model on the buffered window and discards the current model used for prediction, producing a new classifier. DDM is useful for detecting sudden drift but does not perform well with gradual changes, as its thresholds are more stable and it detects fewer changes than when an abrupt change takes place. To address this issue, Baena-García et al. [11] developed EDDM, a mechanism based on monitoring and tracking the mean and standard deviation of the errors.

Bifet and Gavaldà [20] proposed ADWIN, an adaptive-window change detector. ADWIN maintains an automatic sliding training window which holds the current data distribution using the Bernstein bound. ADWIN keeps the data points which stay within the bound, comparing the initial window W against two sub-windows W0 and W1. If the difference between the means of the data points in W0 and W1 is statistically significant, ADWIN removes the data points in

W0 and keeps W1 as the current sample. ADWIN outperformed DDM in experiments, but large computational time consumption has been reported [20], as well as issues with noisy data [59, 64].
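For illustration, a simplified version of ADWIN's window-splitting test is sketched below; it uses a plain Hoeffding-style cut over raw values in [0, 1], whereas the real algorithm uses the Bernstein bound over an exponential bucket compression of the window:

```python
import math
from collections import deque

def adwin_style_check(window, delta=0.002):
    """Compare every split W = W0 · W1; shrink W when the sub-window
    means differ by more than a Hoeffding-style bound (simplified sketch)."""
    w = list(window)
    n = len(w)
    for cut in range(1, n):
        w0, w1 = w[:cut], w[cut:]
        m0, m1 = sum(w0) / len(w0), sum(w1) / len(w1)
        # Harmonic-mean window size; real ADWIN uses the Bernstein inequality.
        m = 1.0 / (1.0 / len(w0) + 1.0 / len(w1))
        eps = math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 * n / delta))
        if abs(m0 - m1) > eps:
            return deque(w1)   # drop W0, keep W1 as the current sample
    return deque(w)
```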

In addition, Pears et al. [120] presented a new version of the SeqDrift1 algorithm, which includes the use of reservoir sampling and the Bernstein bound. The reservoir sample allows the algorithm to run with lower false positive rates than ADWIN, and the use of sequential hypothesis testing greatly reduces the time complexity against the same baseline. Possible future work could investigate automatic setting of the bound threshold. Recently, drift detection has been assessed on various datasets to validate different drift detection alternatives for the Hoeffding Adaptive Tree (HAT), using typical ADWIN, DDM, the Page-Hinkley test (PH) and HDDM. From the observed experiments, it can be said that no specific drift detection technique provides consistently better results than the others; as stated by the authors [159], the chosen drift detection technique should be analysed together with the dataset and the learning task.

Lastly, recent advances in drift detection theory have led to new metrics for drift assessment, for example lift-per-drift [8], a metric that assesses any drift detection technique against the same model without drift detection. The metric is as follows:

lpd = (acc_drift − acc_nodrift) / #drifts,                        if #drifts ≥ 1 and r = 1
lpd = (acc_drift − acc_nodrift · (1 − r)) / ((1 − r) · #drifts),  if #drifts ≥ 1 and 0 < r < 1
lpd = 0,                                                          otherwise

in which acc_drift and acc_nodrift are the model accuracies with and without drift detection respectively, #drifts is the number of drifts detected by the algorithm, and r is a cost-sensitive penalisation that defaults to 1. In this way, different drift techniques can be assessed for a Stream Mining model to validate whether or not drift detection makes any difference in terms of accuracy.
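The metric can be transcribed directly (a sketch; argument names follow the notation above):

```python
def lift_per_drift(acc_drift, acc_nodrift, n_drifts, r=1.0):
    """Lift-per-drift [8]: accuracy gain of drift detection per detected
    drift; r is the cost-sensitive penalisation (default 1)."""
    if n_drifts >= 1 and r == 1.0:
        return (acc_drift - acc_nodrift) / n_drifts
    if n_drifts >= 1 and 0.0 < r < 1.0:
        return (acc_drift - acc_nodrift * (1.0 - r)) / ((1.0 - r) * n_drifts)
    return 0.0
```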

Although there is an improved breed of single-stream drift detection mechanisms, important issues remain to be resolved in order to detect drift over multiple data streams. In this case, stationarity differs from one source to another, and more research is needed to evaluate the correct alternatives for addressing drift so as to be useful for topic detection or complex event recognition. One approach could be to assess a local learner for each stream dataset and then combine the results in a way that is efficient in terms of time and correctness.

Inductive transfer learning techniques [117] can be useful to combine a set of heterogeneous learners into a more accurate global model, but those techniques should be efficiently modelled using parallel implementations with improved time complexity. Another approach might be to join the datasets in a single stream processing engine in order to process them as a single stream; the main challenge of this approach is data normalisation, in the sense that the whole dataset must come from the same distribution.

2.5.2 Incremental Stream Classifiers

Incremental learning methods seek to estimate the class label of each incoming instance as it arrives. In this strategy, testing is done before training, and each is done once. When the true label is received after some time, an error function f(ŷ_t, y_t) is computed, where y_t is the true label and ŷ_t is the predicted label at time t. Finally, the method flags drift if the error function exceeds a threshold, and the classifier L is retrained with the new labelled example (X, y). The Very Fast Decision Tree (VFDT) of Hulten et al. [71] is one example of

Page 38 of 175 an online detector mechanism, another methods makes use of SVMs [79]. Incremental Stream Mining methods are models that can be updated each instance at a time, without any form of windows. Perhaps the best known and most used incremental online learning model is the Naïve Bayes algorithm. Other incremental approaches have been developed, such as Hoeffding trees introduced by Hulten and Domingos [71], ensembles introduced by Oza [114], and online rule learning [139]. A comprehensive study of last state-of-the-art online ensemble techniques can be found in [63].

Furthermore, techniques such as Perceptrons, stochastic gradient descent (SGD) for SVMs, logistic regression models, linear regression and moving averages can be employed for both classification and regression tasks [21, 91]. Generally speaking, a learning decay rate is used as a forgetting mechanism as each instance arrives [1], in order to forget older instances at a specified rate. Usually, an exponential forgetting mechanism is shown to be useful for online incremental classifiers.
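As a sketch of exponential forgetting in an incremental classifier, the following multinomial Naïve Bayes decays all of its sufficient statistics before each update; the decay constant and the simplified smoothing are illustrative choices, not a specific published configuration:

```python
import math
from collections import defaultdict

class ForgettingNaiveBayes:
    """Incremental multinomial Naive Bayes whose sufficient statistics
    decay exponentially before every update (illustrative sketch)."""

    def __init__(self, decay=0.999):
        self.decay = decay
        self.class_counts = defaultdict(float)                       # class -> weight
        self.feat_counts = defaultdict(lambda: defaultdict(float))   # class -> token -> weight

    def learn(self, tokens, y):
        # Forgetting step: shrink every stored count towards zero.
        for c in list(self.class_counts):
            self.class_counts[c] *= self.decay
            for t in self.feat_counts[c]:
                self.feat_counts[c][t] *= self.decay
        self.class_counts[y] += 1.0
        for t in tokens:
            self.feat_counts[y][t] += 1.0

    def predict(self, tokens):
        best, best_lp = None, float("-inf")
        total = sum(self.class_counts.values()) or 1.0
        for c, cc in self.class_counts.items():
            denom = sum(self.feat_counts[c].values()) + 1.0  # simplified smoothing
            lp = math.log(cc / total)
            for t in tokens:
                lp += math.log((self.feat_counts[c].get(t, 0.0) + 1.0) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```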

Active learning [178] has also been proposed when it is difficult to obtain large labelled sets of data. In these scenarios, the idea is to sample appropriately, based on selective sampling procedures that also work under drift uncertainty. Classifiers are therefore less prone to drift overfitting and are evolvable in nature under drift scenarios.

2.5.3 Windowing Approaches

In addition to incremental methods, windowing mechanisms have been extensively proposed in the field. The main idea is to keep a finite, most recent sample, or window, from the data stream in main memory, and to monitor the distribution over the current window. The window is updated when a set of examples is given after some time; at that point, drift is measured over the given sample and, usually, a new learning model is trained. The Drift Detection Method (DDM) [57] and its family are the most representative approaches. DDM varies the size of its window according to a warning threshold given by evaluating p_i + s_i >= p_min + 2 · s_min, where p_i is an error probability following a binomial distribution, counting the number of misclassifications that the current model makes up to time i, p_min is the minimum value seen for the current model, and s_i is the standard deviation of the model, corresponding to s_i = sqrt(p_i(1 − p_i)/i). If the warning threshold is reached, the algorithm creates a new sample window until drift is finally detected

Page 39 of 175 when pi + si >= pmin + 3 ∗ smin. When this happens, the stored variable window is used to train a new classifier. Different improvements have been done to the basic DDM method [11, 155].
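A compact sketch of the DDM test described above follows; the variable names mirror the formulas, while the surrounding bookkeeping (state enum, counters) is an illustrative assumption:

    // Sketch of the DDM test: p_i is the running error rate after i
    // predictions, s_i = sqrt(p_i (1 - p_i) / i) its standard deviation.
    class DDMDetector {
        enum State { STABLE, WARNING, DRIFT }
        private int i = 0, errors = 0;
        private double pMin = Double.MAX_VALUE, sMin = Double.MAX_VALUE;

        State update(boolean misclassified) {
            i++;
            if (misclassified) errors++;
            double p = (double) errors / i;
            double s = Math.sqrt(p * (1 - p) / i);
            if (p + s < pMin + sMin) { pMin = p; sMin = s; }     // track minimum
            if (p + s >= pMin + 3 * sMin) return State.DRIFT;    // retrain
            if (p + s >= pMin + 2 * sMin) return State.WARNING;  // buffer window
            return State.STABLE;
        }
    }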

Other techniques update the current model and keep sufficient statistics to update the learner [71]. The main problem with these techniques lies in the selection of the window size. Adaptive Windowing (ADWIN) and its variants [20] allow the learner to use a variable window size depending on statistical measures of drift. The idea is to train the model according to a variable window size so as to react accurately to drift complexity. The main idea is to keep growing a window of size W as each instance arrives. The window is partitioned into sub-windows; if the mean of sub-window W0 is statistically the same as that of W1, both windows are merged. If drift is detected by comparing the means of the sub-windows, a first-in first-out (FIFO) policy is used to remove the oldest part of the window. Although drift detection is used in ADWIN, the pruning strategy may discard the most informative window. Although ADWIN is presented as an improvement over DDM, a recent evaluation ranks DDM above the other techniques, performing well under both sudden and gradual drift [64].

Another way to improve or maintain predictive performance over time is to implement forgetting mechanisms. In general, forgetting mechanisms can be modelled in several ways, such as using a FIFO structure or a decay function over time to forget past samples [1]. The main issue with these techniques is that useful learned models are forgotten and can no longer be used to predict the future. When seasonal drift occurs after similar past samples have been forgotten, the model needs to be trained again, which is inefficient in the presence of recurrent drift. Note that little research has been implemented or tested under unstructured stream scenarios using Stream Mining evaluation methods.1

To sum up, although drift detection techniques can easily be used for text data [2, pp. 312–316], little evaluation has been done to assess the challenge of high-dimensional data. In addition, none of the presented techniques addresses the challenge of dealing with multiple stream datasets, and there is a need to quickly represent unstructured data so that it can be consumed by stream learners.

1 Software can be downloaded from https://www.cs.waikato.ac.nz/~abifet/MOA-TweetReader/; it uses a non-streaming TF-IDF representation over the whole dataset.

2.5.4 Rule Learning

Inductive Logic Programming (ILP) was extensively studied decades ago for building integrity rules to validate data consistency, specifically in the field of deductive databases [12, 38, 102] and over ground facts [17, pp. 209–210]. Logic programming is an effective and attractive way to build, maintain and store consistent knowledge bases.

From the reviewed literature, interesting work in [95] and [96] used ILP and the power of ontologies to extract semantic relations between event arguments and to re-populate the ontology based on such semantic relationships. The authors built ILP rules to create clauses identifying type and subtype hierarchies from the ACE 2004/2005 dataset. These research insights show the feasibility of using logic programming to validate information consistency, which can be extended to Event Extraction as a whole, which is our aim for this research.

2.5.5 Time Series Based Approaches

Time series analysis focuses on predicting and identifying the evolving nature of event data streams. Cai et al. [34] exploit time series networks to validate the co-evolution of event mentions. It is important to note that time series analysis is not fully useful for Event Extraction and Information Extraction tasks in general, given the nature of the classification tasks, as it deals with real-valued forecasting rather than categorical forecasting. Time series analysis uses frequentist statistical techniques to predict behaviours in numerical data. It is particularly useful for predicting future stock prices or monetary exchange rates, but it is not fully suitable for classification of data streams. A comprehensive treatment of time series analysis is provided by Shumway [151], in which several statistical time series analysis techniques are explained in detail, including the following models:

• Auto-regressive (AR) models.

• Moving average (MA) models.

• Autoregressive moving average (ARMA) models.

• Autoregressive integrated moving average (ARIMA) models.

• Seasonal autoregressive integrated moving average (SARIMA) models.

• Intervention models.

• Generalised autoregressive conditional heteroskedasticity (GARCH) models.

The above-mentioned approaches are constructed and analysed in order to select the most appropriate model for the forecasting task. A large number of selection techniques can be used to apply such models in R.

2.5.6 Ensembles and Meta-Algorithms

More recent techniques in the data mining area include a set of ensemble algorithms and drift detection methods. Ensemble techniques are meta-algorithms running on top of base learners to train several versions of a model with the same training data; generally, a voting technique is used to perform the classification task [22]. The advantage of ensemble techniques is the improvement in accuracy over the base learner, at the cost of time complexity [23]. Ensembles should be as diverse as necessary, that is, have as many unique classifiers as possible, so that each can perform well for a specific characteristic of the dataset. Ensemble algorithms combine base learners in different ways. The most common combination is the well-known voting mechanism, which sums and weights the predictions of each classifier to produce a final prediction. The use of other taxonomic features for classifying online ensembles is proposed by Murilo-Gomes et al. [63], with a set of 4 main categories: combination style, inducing characteristic, base learner dependencies, and updating dynamics.

Online bagging and boosting were presented by Oza [114] as the first such algorithms in an online fashion. In online bagging, a set of m classifiers is trained, with each incoming instance used to update each model k times, where k is drawn from a Poisson(1) distribution; thus the number of times each model is trained on an instance is chosen randomly and each model is updated accordingly. Boosting algorithms, on the other hand, train the set of learned models sequentially, i.e. the arriving instance is used to train all models, but the instance is re-weighted at each learning stage if the current learner misclassifies it. In this way, the more "important" data points are used to train the next model. The classification task is done by taking the weighted sum of the predictions of each model in the ensemble. Note that these methods do not use drift detection techniques, so virtual drift cannot be detected.
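The Poisson(1) weighting at the core of online bagging can be sketched as follows; the BaseLearner interface is an illustrative assumption, and Poisson sampling uses Knuth's inversion method:

    import java.util.List;
    import java.util.Random;

    // Sketch of Oza's online bagging: each incoming instance is used to
    // train each ensemble member k times, with k ~ Poisson(1).
    class OnlineBagging {
        interface BaseLearner { void train(double[] x, String y); }

        private final List<BaseLearner> members;
        private final Random rng = new Random(1);

        OnlineBagging(List<BaseLearner> members) { this.members = members; }

        void trainOnInstance(double[] x, String y) {
            for (BaseLearner m : members) {
                int k = poisson(1.0);
                for (int j = 0; j < k; j++) m.train(x, y); // weight by repetition
            }
        }

        // Knuth's inversion method for sampling Poisson(lambda)
        private int poisson(double lambda) {
            double l = Math.exp(-lambda), p = 1.0;
            int k = 0;
            do { k++; p *= rng.nextDouble(); } while (p > l);
            return k - 1;
        }
    }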

Context-based algorithms such as CAPA [177] aim to keep an ensemble of algorithms per context. For example, in sales forecasting a seasonal context (winter, summer, fall, spring) might be used to store a set of trained classifiers for that specific context. If the context changes, a previously stored set of classifiers can be retrieved and updated with the current seasonal examples. Although useful, neither context-aware algorithms nor ensemble techniques were designed to deal with multiple streams of information. In addition, real datasets show a mixture of drifts rather than a single drift distribution, especially in text mining. Similar techniques have been designed to compare the windows used in each iteration of the learning process using statistical tests. Calvo [35] uses a Mann-Whitney test to select the members of the ensemble model. Although the technique shows reasonable efficiency in terms of accuracy and time, the multi-stream approach assumes the same distribution throughout the inputs.

Recent advances in ensembles seek to diversify the models. In [152], the authors presented a novel technique for building ensembles using diverse online learners. Brzezinski [31, 30] also presents a set of novel methods for constructing ensembles based on drift mechanisms, and Žliobaitė [179] deals with evolving datasets using active learning. These models were used for numeric data streams, but none of them was tested on Information Extraction tasks in an online learning setting.

In addition, there are ensemble classifiers that partition the dataset to ease computational analysis, i.e. to reduce the amount of information each local classifier needs to learn during training: these are known as input manipulation ensembles. Dataset partitioning has been found useful for increasing the diversity of the base learners used in the ensemble. There are different ways of implementing input manipulation. In horizontal partitioning, ensembles split the data by rows, i.e. a subset of instances from the input dataset, so that each base classifier focuses only on certain sub-samples or chunks of the dataset, whereas vertical partitioning focuses on subspace formation by splitting the features of the dataset.

In general, window-based mechanisms and drift detection techniques are a basic form of horizontal partitioning, as they resample data once drift is detected, but they use only a single classifier. In contrast, the goal of Oza bagging is to build an ensemble of base classifiers using their own detected drift windows, similar to Brzezinski's and Žliobaitė's methods [31, 30, 179]. Drift-based horizontal partitioning has been found useful when dealing with seasonal drift, but is computationally expensive compared to base learners.

On the other hand, Random Forests [29] are a particular set of tree ensemble classifiers that perform vertical subspacing. The main idea is to perform vertical subspacing by choosing random features and training base tree learners on different features in each model. Partitioning the dataset in this way results in a diverse set of base tree models, which can enhance the final prediction. The splitting principle of Random Forests can be used with other base learners, such as nearest neighbour methods and linear classifiers.
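Vertical partitioning of this kind amounts to giving each base learner a fixed random subset of feature indices; a minimal sketch (all names illustrative):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Sketch of vertical partitioning (random subspacing): each base
    // learner only ever sees a fixed random subset of the features.
    class RandomSubspace {
        static int[] sampleSubspace(int numFeatures, int subspaceSize, long seed) {
            List<Integer> all = new ArrayList<>();
            for (int f = 0; f < numFeatures; f++) all.add(f);
            Collections.shuffle(all, new Random(seed));
            return all.subList(0, subspaceSize)
                      .stream().mapToInt(Integer::intValue).toArray();
        }

        // Project a full feature vector onto the learner's subspace.
        static double[] project(double[] x, int[] subspace) {
            double[] projected = new double[subspace.length];
            for (int j = 0; j < subspace.length; j++) projected[j] = x[subspace[j]];
            return projected;
        }
    }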

In summary, ensemble methods have improved data mining tasks over a single stream source of information. However, only a limited number of ensemble techniques have been tested under real conditions. Most research evaluations have used datasets of less than 1 GB, and there is little or no research on detecting drift over multiple datasets. The latest advances in ensemble approaches have tested Stream Mining on distributed platforms [44].

2.5.7 Distributed Stream Mining

According to Zeng et al. [175], Distributed Data Mining is the set of data mining techniques running on parallel architectures. The main goal of Distributed Data Mining is to mine huge amounts of data that can neither be stored in a single location nor processed by a single server. The main techniques include parallelisation-based models using the ideas proposed by Park et al. [118], and meta-learning techniques that build a model from intermediate results taken from local models, such as Collective Data Mining (CDM) [76]. Most meta-learning techniques are ensembles [162, pp. 158–160], and most are implemented on P2P networks. Nonetheless, SPEs play an important role in the field, given that those architectures are aware of computational communication and reliability. Recent platforms, including Apache Mahout, are good examples of Data Mining approaches on Distributed Computing.

On the other hand, the area of Data Mining is being extended to build feasible models for data streams. The basic approach is to run classic data mining on top of a distributed environment, an approach called Distributed Stream Mining. This field is focused on converting classic data mining applications into a distributed fashion. Algorithms such as k-means clustering and Bayesian methods have been converted using parallelisation techniques [175]. Applications such as Apache Mahout [46] implement data mining algorithms for distributed processing engines. The main issue with this approach is that the model cannot be updated in real time and the algorithms take a long time to build the mining model.

Regarding distributed techniques, the first clustering technique to deal with streams in a distributed fashion was by Kargupta in the VEDAS system [75]. The main goal of this application was to monitor a vehicle fleet in order to detect anomalies in real time. The system was built on a distributed architecture in which the sensor readings of each vehicle were processed by a data mining algorithm locally installed on a Personal Digital Assistant (PDA). Each PDA learnt its own model and sent its results to a central system. The system performs Principal Component Analysis (PCA) to reduce the dimensionality of the stream data and runs a clustering technique over the projected space. The clustering technique was a k-means approach in which the k normal behaviours were empirically selected by the user. The clustering algorithm used a sliding window to generate the clusters as data arrived, according to a polygon boundary. A sliding window approach was chosen mainly because of memory constraints on each local node. In addition, statistical measures such as the mean, variance and covariances were stored to detect drift and manually retrain the model. Although the clustering technique built in VEDAS dealt with drift, the data mining algorithm was not compared against other techniques. In addition, only structured information was used in this approach, and SPEs were not mature enough to be used at that point in time.

One interesting approach was given by Hong et al. [69] to mine multiple text streams using a mixture of probabilities following an LDA-like model. They build a clustering technique in which the model learns a set of hidden parameters. The model assigns to each example in the stream a probability of belonging to a common or a local topic. Local topics are unshared topics seen only in a single source, while shared or common topics are those seen across the whole set of streams. In this way, the model clusters documents according to a combination of distributions, with the Dirichlet distribution as the main component. In addition, spikes in time are detected for each assigned topic using a temporal dynamics technique. The model performs better than LDA and extends knowledge to multiple streams; however, drift, i.e. the non-stationarity of the topics, is not addressed, and Gibbs sampling [43] over high dimensions in large datasets might be insufficient. The experimental evaluation shows promising results, and this technique could be implemented in parallel to improve the runtime measure (which is not given in the report). In addition, the model considers only flat topics rather than the hierarchical topics useful for modelling complex events.

Finally, distributed data mining and meta-learning methods have not been applied to Twitter, and the research found on horizontal partitioning reported poor results on the final target accuracies [126, 18, 131, 48, 51, 76, 86, 174, 50, 118, 162, 175, 163, 75, 7, 98, 19, 44]. However, there is no research on identifying the best way to make such partitions, for example by using mathematical measures such as those found in information theory, which we propose in order to achieve good performance in this form of distributed data mining. In conclusion, the techniques mentioned above have found horizontal splitting to be an unfavourable way to build ensembles. Nonetheless, ensemble algorithms might be boosted by better dataset partitioning if an accuracy improvement can be guaranteed.

2.6 Stream Mining Approaches Applied to Text Mining

The Stream Mining literature includes several methods and algorithms which have been widely used, but which are yet to be applied to text mining [71, 115, 22, 152, 20]. There are efforts to use text streams on top of batch-learning classifiers [6, 16, 103], as well as online text classification on Twitter [69, 72, 2, 33]. However, none of them has attempted to characterise a fine-grained event as we aim to do, in order to extract not only the category of a tweet, but also event components such as actors, targets, locations and dates inside the tweet. In addition, Chen et al. [40] implemented a novel approach to updating ontologies under Stream Mining scenarios, using semantic concept drift. In their research, an Ontology Stream was defined as a semantic, evolutionary structure which depicts snapshots of reality that need to be updated from time to time. They updated Ontology Streams by detecting drift, understood as a semantically significant difference between ontology snapshots. As a result, predictions over two different ontology datasets improved by over 25% on average.

In summary, drift detection methods have been improved for mining over a single stream source of information. However, to the best of our knowledge, there is no research assessing drift for Event Extraction. Most of the research has been evaluated using small datasets, and there is little or no research on detecting drift over multiple datasets.

Another challenge in complex Event Detection is to identify how complex the event is and, in this way, how large or small the training window should be. If an event is complex, then drift is expected to be more unstable, due to the long sequence of sub-events that can lead to a change in the distribution of the labelled categories or clustered data. It is unclear whether concept drift can be managed over distributed models, each learning at its own rate, and whether it is useful to combine the results of one model with another to enrich the knowledge task. More research is needed in this direction. A potentially useful approach is to have models automatically detect drift from different data sources. The complexity of an event can be directly related to the type of drift or stationarity across different text streams.

Another challenge is drift adaptability, i.e. how well the algorithm reacts to (predicts) changes in the data distribution [64]. On the one hand, a large window size yields a slowly changing learning algorithm; on the other hand, a short time window yields a dynamic, fast-reacting algorithm in the presence of drift, as in ADWIN [20]. A further issue is how to efficiently select a small or a large window under mixed drift; recent techniques propose new directions addressing mixed drift [30]. More research is needed to evaluate and compare scalable drift techniques. This research, however, focuses on handling multiple sources of information automatically in the learning model.

2.7 Summary

Several challenges remain open in the field of Distributed Stream Mining for text data. In this section, we emphasise the gaps to be addressed in this research. In summary, the most important challenge is to extract events from multiple text streams and correctly assess the main components of an event as defined in the previous chapter. Drift detection techniques can be used for this purpose, but there are still research gaps in successfully using drift detection for Event Extraction. Drift is important in this context because the technique is aware of changes in the stationarity over a set of examples, which is useful for detecting event updates in the case of event synthesis.

From the literature, we found that the problem of extracting events can be tackled under online learning assumptions: be ready to predict at any time, learn in one pass, and learn in limited time and memory. The true label can be applied online each time a real labelled instance arrives, updating the model at any time. Interactive labelling and human feedback are used to update the online model as needed, which is required in scenarios where drift can dramatically change the classification of events over a period of days or weeks. Stream Mining models can be used in this task to enhance time performance, as shown in the subsequent chapters of this thesis, which is highly valuable for social conflict analysis, and in particular for reacting quickly to violent events, in which the extraction task is needed as fast as possible. Batch-learning models can take hours, if not days (depending on the dataset size), to update, which is not acceptable for real-time violence monitoring [37, 33, 144]. Moreover, Stream Mining can improve the responsiveness of event extraction systems while remaining accurate.

In addition, there is an opportunity to address Event Detection, Event Classification, Argument Detection and Argument Classification under near-real-time conditions. Therefore, we aim to apply Stream Mining approaches to Event Extraction, Event Detection, Event Classification, Argument Detection and Argument Classification – the Event Mining tasks – under a novel stream Event Mining framework.

Finally, deep neural networks are not explainable models, as they do not translate easily into human language. Explainability is particularly useful for event extraction tasks, given that accuracies are improved by the constant interaction and feedback between humans and machines via recommendations and human-in-the-loop mechanisms [67]. Additionally, pre-processing mechanisms can run in near-real time to allow online learning techniques to be useful. Lastly, event coreference resolution and event synthesis are identified as research gaps, but addressing them is beyond the scope of this thesis.

Chapter 3

The Stream Event Mining Framework - SEMF

In this chapter, the Stream Event Mining Framework (SEMF) is presented. As stated in previous chapters, this framework aims to efficiently deal with multiple sources of information for different Event Mining tasks under Stream Mining constraints. This chapter gives the principles, metrics, tools and practices behind the proposed solutions for Event Mining in general, and online Event Detection and extraction in particular. The methodology can be further extended to include other Event Mining tasks such as event coreference resolution.

Initially, we present the overall framework, including principles and definitions of the main Event Mining tasks we aim to address, and a general explanation of how Event Mining works in the online machine learning setting, defining the main metrics. Secondly, we present the domain-specific datasets on which the framework was tested, including two Afghanistan-Pakistan (AfPak) datasets. Additionally, we present the ACLED conflict dataset along with an initial data analysis.

SEMF is empirically demonstrated in subsequent chapters to perform well on different Event Mining tasks under different learning scenarios. Chapter 4 contains an application and evaluation of the proposed integrity constraint learning approach in a single-stream version, in order to validate improvements over the existing single-source baselines on the event extraction task, whereas in Chapter 5 an initial exploration of partitions is made, proposing a comparison between single and multiple integrity constraint models. Further experiments applying SEMF with multi-stream ensembles are then presented, showing that our framework selects the best way to combine such different sources of information, improving final results on the whole event extraction task.

3.1 Principles and Definitions

In general, the idea of Event Extraction from text is to identify, at the text-fragment granularity, which types, actors, targets, locations and times are involved in a specific event. This information comes from an unstructured text instance that can also be categorised within a set of predefined event categories. We now define in detail the input (text stream) and output (event) to set up our learning tasks.

3.1.1 Input: Text Stream

A text stream X is a potentially infinite sequence of instances {x1, x2, ..., xi, ...} arriving in time order. In addition, each text stream instance xi is composed of a word-set W(xi) = {w1, ..., wm}.

Similarly, a set of multiple text streams is simply a set of different text streams M = {X1, ..., Xn}, in which Xi is the i-th text stream in the set. In this way, a text stream is used to characterise events as follows.
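These definitions translate directly into a minimal data structure; the following Java sketch uses illustrative names that are not part of any framework API:

    import java.util.Iterator;
    import java.util.List;

    // A text stream instance x_i: an ordered arrival with its word-set W(x_i).
    record TextInstance(long arrivalIndex, List<String> words) {}

    // A text stream X is consumed one instance at a time, in arrival order;
    // a multi-stream M is simply a list of such streams.
    interface TextStream extends Iterator<TextInstance> {}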

3.1.2 Event as a Knowledge Base Structure

The event definition found in Section 1.1 is used here to express an event as a data structure. Hence, an event extracted from a text stream can be stored in a Knowledge Base structure if we follow the component-tuple structure ⟨C, A, T, L, D⟩, in which each component is a set, represented as follows:

• Event types: C is the set of labelled classes, or labelled categories, of e, defined in a domain-specific ontology. It may contain more than one class.

• Actors: A is a set of zero or more named entities executing the main action of the event. An actor can be represented as an ontology entity or as a named entity contained in xi.

• Targets: T is a set of named entities representing the targets on which the action is executed. As with actors, targets are represented either as an ontology entity or as a named entity contained in xi.

• Locations: L contains the geographic locations (geographic entities) in which the event takes place; locations can also be represented as an ontology entity or as a named entity found in xi.

• Time: D is a set representing the datetimes at which the event was executed. Event dates can be obtained from metadata, from lexical inference, or from a combination of both.

It is also important to remark that our definition allows an event to have multiple labels, rather than just one single event type label. Therefore, an event can be stored in an event knowledge base. Moreover, our event definition allows the use of canonical entities in the form of ontology concepts. That is, event components can come from any of the recognised entities in the raw text, or can be mapped against an ontology. A minimal data-structure sketch of this tuple is given below.
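Under the above definitions, an event row in the knowledge base can be sketched as follows (illustrative names; sets capture the multi-entity and multi-label nature of each component):

    import java.time.LocalDateTime;
    import java.util.Set;

    // The event tuple <C, A, T, L, D>: every component is a set, so an
    // event may carry several types, actors, targets, locations and dates.
    record Event(Set<String> types,          // C: ontology event categories
                 Set<String> actors,         // A: ontology or named entities
                 Set<String> targets,        // T: ontology or named entities
                 Set<String> locations,      // L: geographic entities
                 Set<LocalDateTime> dates) {} // D: from metadata or lexical inference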

3.1.3 Ontology Structure

An ontology is a well-defined, structured and hierarchical set of concepts that represents the entities of a particular domain and their relationships. Ontologies [9] comprise a set of entities or individuals, which belong to certain classes or categories, where the classes have a hierarchical structure, together with a set of relationships between individuals that gives knowledge about the specific domain. Usually, ontologies on the semantic web are expressed in the OWL1 language, using RDF to represent the ontology as a graph structure, or in textual representations in which ontology values are hierarchically separated by the dot symbol ".". Each value preceding a dot is one level higher in the hierarchy. For example, values and concepts in a specific ontology might include "Organisation." or "Location.Afghanistan.Nangarhar". In general, ontologies can be used to enrich and standardise information, especially for knowledge base construction [84], making them particularly useful for Event Extraction processes.

1https://www.w3.org/OWL/
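The dot-separated textual representation can be manipulated with simple string operations, as in the following sketch (the sample value is taken from the example above):

    public class OntologyValueDemo {
        public static void main(String[] args) {
            // Splitting a dot-separated ontology value into hierarchy levels.
            String value = "Location.Afghanistan.Nangarhar";
            String[] levels = value.split("\\.");
            System.out.println(levels[0]);                 // "Location": top level
            System.out.println(levels[levels.length - 1]); // "Nangarhar": most specific
        }
    }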

For conflict-driven events, one of the more commonly used ontologies for event coding is CAMEO [145]. The Afghanistan-Pakistan (AfPak) ontology is based on CAMEO coding [67], with a finer level of granularity, covering Afghanistan places, Afghan government agencies, and relevant persons and organisations. In total, the ontology contains 441 individuals, 155 organisations, 709 locations, 7 equipment categories, and 119 event types.

3.2 SEMF Architecture

Our Event Mining framework deals with events in logical form to populate a domain-specific event knowledge base, following database constraint principles as shown in [133] and knowledge base construction principles and methods for building logically consistent event knowledge bases [84]. The Stream Event Mining Framework (SEMF) addresses all these characteristics, as its primary purpose is to provide the set of practices required to build event knowledge bases in near-real time and, at the same time, the means to measure the efficiency of any model that complies with the SEMF framework.

The SEMF framework is presented in Figure 3.1, defining the main tasks required to perform any near-real-time Event Mining task and its subsequent evaluation under the Stream Mining constraints. First, the Event Mining task to perform is selected: Event Classification, Event Detection, Argument Classification, or the whole event extraction task. Second, textual information sources are selected, ranging from tweets to news and RSS feeds. If needed, events can be read from a database, which is particularly useful when joining information from different sources (stream reader) to be integrated into the stream.

Third, text pre-processing techniques are applied in near-real time, selecting individually whether the specified task should run with stemming, lemmatisation, stop word removal or number removal, in addition to Part-of-Speech (PoS) tagging and semantic parsing, in which all words are transformed to build relevant features to be further used by the machine learning algorithm. Frequently, TF-(I)DF-like conversions are used [138], plus some additional semantic features to contextualise the algorithmic procedure.

Lastly, an initial prediction is made for the given instance, using either a single-layered or a multi-layered learning approach, depending on the Event Mining task to perform.

Figure 3.1: SEMF framework architecture

For instance, event classification uses a single-layered Stream Mining approach, whereas Event Extraction runs on top of a multi-layered Stream Mining model, as this task requires multiple learning models to perform the whole extraction task (Table 3.1).

Table 3.1: Event Mining Tasks vs Multi-layer Approach

Event Mining task         Layering approach
Event Detection           Single layer learner
Event Classification      Single layer learner
Argument Detection        Single layer learner
Argument Classification   Single layer learner
Event Extraction          Multi layer learner
Event Co-Reference        Multi layer learner

In this research, we created novel Stream Mining algorithms conforming to the SEMF framework for both single-layered learning under multiple text streams and multi-layered learning for multiple text streams. Models for each layered approach are presented in subsequent chapters of this thesis. Once a prediction is made, a training phase is executed on the learning model, receiving training inputs in a prequential fashion. Finally, extracted events and labelled events are stored in a Knowledge Base (KB), and evaluation metrics are stored in each model's performance statistics. Definitions of how each SEMF subtask is implemented are given in the following section, and were also briefly discussed in [36].

3.2.1 Online Text Pre-Processing

Online text pre-processing is performed at this stage. The idea is to transform a text instance into numbers, or into a form more easily read by classification algorithms. Usually, online pre-processing comprises a numeric transformation in the form of TF-DF, TF-IDF or frequency count procedures, which can be implemented with current state-of-the-art libraries, e.g. OpenNLP1 and Stanford CoreNLP2, and more recently the Google NLP API3.

In addition, Twitter has its own streaming REST API that can be used to pull data given a set of query items. Query items can contain keywords, locations, hashtags, or user mentions. The Twitter API also provides access to initial pre-processing and data cleansing, including hashtag cleansing, JSON formatting, retweet identification, URL linkage and metadata manipulation [158, pp. 45–60].

Recall that in the Stream Mining setting, a pure TF-(I)DF calculation, or a word embedding representation as in word2vec [106] or similar models, does not entirely suit the Stream Mining constraints, as such representations need to be calculated over the whole set of instances beforehand, which is not possible under real-time scenarios. However, word embeddings could be considered if a large number of training examples is given beforehand, or if a windowed mechanism is implemented to recalculate the word vectors every given number of instances.

We initially follow an incremental version of TF-(I)DF as proposed in [164], in which a windowed mechanism is applied to update the corpus statistics every established number of instances. Once the features have been updated, the whole model can be trained and tested under the new calculations. Word embeddings, in particular, were not taken into account, as an incremental setting is not available for these models, and an early attempt to make incremental updates gave misleading results in the Stream Mining model. Further discussion of the differences between text pre-processing techniques is given in the subsequent chapter of this thesis.
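The following sketch illustrates one plausible shape of such a windowed incremental TF-IDF, in which document frequencies accumulate per instance and IDF values are refreshed every updateEvery instances; the names and the exact refresh policy are illustrative assumptions, not the method of [164] verbatim:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of an incremental TF-IDF: document-frequency counts are
    // accumulated as instances arrive, and IDF values are recomputed
    // every 'updateEvery' instances rather than over the whole corpus.
    class IncrementalTfIdf {
        private final Map<String, Integer> docFreq = new HashMap<>();
        private final Map<String, Double> idf = new HashMap<>();
        private int docsSeen = 0;
        private final int updateEvery;

        IncrementalTfIdf(int updateEvery) { this.updateEvery = updateEvery; }

        void observe(List<String> words) {
            docsSeen++;
            words.stream().distinct()
                 .forEach(w -> docFreq.merge(w, 1, Integer::sum));
            if (docsSeen % updateEvery == 0) {        // windowed refresh
                docFreq.forEach((w, df) ->
                    idf.put(w, Math.log((double) docsSeen / df)));
            }
        }

        double tfidf(String word, long tfInDoc) {
            return tfInDoc * idf.getOrDefault(word, 0.0);
        }
    }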

The resulting numeric representation is understood by base algorithms, such as Bayesian classifiers, Hoeffding trees or stream ensembles. Depending on the learning task, it might be more convenient to transform a text instance into a different representation, e.g. a word-based representation, in which the instance is split into words, or into word chunks, like noun phrases, in cases where the learning task is to classify the word itself, as is the case for argument classification. Sometimes it is better to extract entities, as these entities are the ones to be classified into actors, targets or locations in the role classification task. In this case, existing NLP APIs are particularly useful in recognising them.

1 https://opennlp.apache.org
2 https://nlp.stanford.edu
3 https://google.com

It is important to remark that Event Extraction is a multi-layered learning task, i.e. a learned event is the result of learning its internal components, that is, learning the tuple ⟨C, A, T, L, D⟩ by using, first, a learning model for each of its components and, second, learning the best component combination (second learning layer) to return accurate events. For multi-layered learning tasks, such as Event Extraction, text pre-processing is used to create event components, i.e. the outputs of the first learning layer. These outputs are used as inputs to the next learning layer, specifically the Event Extraction learning layer, i.e. the event candidate selection learning task.

Table 3.2: Text Pre-processing Techniques in SEMF Framework

Event Mining task                       Available text pre-processing technique
All tasks                               lemmatisation
All tasks                               Porter stemming
All tasks                               Standard stop word filtering
All tasks                               Custom stop word filtering
All tasks                               Number removal
All tasks                               PoS tagging
All tasks                               Date and time extraction
Event Detection and classification      Frequency count
Event Detection and classification      TF-(I)DF
Event Detection and classification      Single Term Statistics (STS)
Argument detection and classification   Entity recognition
Argument detection and classification   Word windowing
Argument detection and classification   Semantic parser
Argument detection and classification   Syntactic parser
Event extraction                        All of the above
Event coreference                       All of the above

Any of the pre-processing techniques mentioned in Table 3.2 is available in the SEMF framework, according to the specific Event Mining task to perform. The main difference between pre-processing techniques is their featurisation: either a numeric matrix representation or a split word/entity representation with semantic contextual features, as is the case for argument detection and classification. It is important to emphasise that multi-layered tasks, such as event extraction, work with both featurisation approaches, as they need to perform both learning sub-tasks, and therefore need the text pre-processing techniques available for Event Detection and classification as well as for argument detection and classification.

3.2.2 Online Event Detection and Classification

Event Detection and classification is performed using trigger word identification and subsequent classification of such words. Sometimes the ontology can be used to perform string matching, which drives better results in the identification and subsequent classification task. For instance, if the text instance contains the word "attack", it is more likely that such an instance is labelled as an attack. This word can easily be matched against the ontology values available at the time of the online evaluation.
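A sketch of this ontology-based trigger matching is shown below; the trigger map is a hypothetical fragment of the ontology's lexical values, not the actual AfPak ontology content:

    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    // Sketch of trigger-word matching: words of the instance are looked up
    // against lexical values of the ontology to propose an event type.
    class TriggerMatcher {
        // Hypothetical fragment of ontology trigger values -> event types
        private static final Map<String, String> TRIGGERS = Map.of(
                "attack", "Attack",
                "killed", "Kill",
                "wounded", "Injury");

        static Optional<String> match(List<String> words) {
            return words.stream()
                        .map(String::toLowerCase)
                        .map(TRIGGERS::get)
                        .filter(t -> t != null)
                        .findFirst();
        }
    }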

Online Event Detection and classification occurs when the trained Stream Mining algorithm labels a text instance xi with one or more event types c, as defined in Section 3.1.2. Event types take the form of the concepts defined in the ontology. An event can be categorised into multiple types; therefore, the Stream Mining algorithm selected to perform event classification should be able to produce multi-labelled predictions.

Once a text instance is classified and an event is detected as such, it is assumed that there is one detected event per predicted type label, i.e. if text instance xi is labelled twice, it is assumed that the Stream Mining model detected two different events within the same instance. In this way, the calculation of true positives and true negatives becomes more natural, and evaluation can be made against other existing baselines that perform similar calculations (as in the JET system and similar event pipelines).

In general, any online machine learning model can be used for this learning task if it conforms to the online learning setting. For implementation purposes, the SEMF framework performs predictions using any available learning model in the MOA platform4. MOA is an existing and widely used Stream Mining framework and evaluation toolkit written in Java [22]. The platform exposes a Java API that can be extended when additional Stream Mining models are created, as in this research work. Therefore, all the Stream Mining models created in this thesis were written in MOA, extending the existing Java API for our Event Mining purposes. MOA has the advantage of including prequential evaluation and of integrating with the WEKA Java API. A more comprehensive review of WEKA algorithms can be found in [168], Chapters 10 to 17.

4 http://moa.cms.waikato.ac.nz/
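For illustration, a minimal prequential loop over the MOA Java API looks roughly as follows; class and method names are those of recent MOA releases, and the ARFF file name is a placeholder:

    import com.yahoo.labs.samoa.instances.Instance;
    import moa.classifiers.bayes.NaiveBayes;
    import moa.streams.ArffFileStream;

    // Minimal prequential (test-then-train) evaluation with MOA.
    public class MoaPrequentialDemo {
        public static void main(String[] args) {
            // -1: last attribute of the ARFF file is the class attribute
            ArffFileStream stream = new ArffFileStream("events.arff", -1);
            stream.prepareForUse();

            NaiveBayes learner = new NaiveBayes();
            learner.setModelContext(stream.getHeader());
            learner.prepareForUse();

            int seen = 0, correct = 0;
            while (stream.hasMoreInstances() && seen < 10_000) {
                Instance inst = stream.nextInstance().getData();
                if (learner.correctlyClassifies(inst)) correct++; // test first
                learner.trainOnInstance(inst);                    // then train
                seen++;
            }
            System.out.printf("Prequential accuracy: %.3f%n",
                              (double) correct / seen);
        }
    }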

3.2.3 Online Argument Detection and Classification

The case of argument detection and classification does not differ much from Event Detection and classification. While event classification deals with a multi-labelling problem, argument classification performs single-label predictions, and it only uses the Stream Mining algorithms available under these conditions. As with event classification, any single-label Stream Mining model can be chosen from the MOA platform.

3.2.4 Online Event Extraction

Our event extraction problem is defined as follows: given a text stream X = {x1, ..., xi, ...} and a text instance xi at a particular time i, extract the set of events E(xi) = {ei,1, ..., ei,n} in xi, with n a finite number of events to be stored in the event knowledge base KB. The learning task is defined to be online, i.e. it follows the Stream Mining principles: be ready to predict at any time, under time and memory constraints, accessing each example one at a time [58]. Our proposed solution is a novel, supervised multi-layered Information Extraction (IE) approach that performs separate classification learning tasks in the first layer, making use of existing Stream Mining techniques, and integrity constraint learning in the second layer.

3.2.5 Event Coreference Resolution

This particular learning task is to find the closest references between event mentions, i.e. to find the group of text instances reporting the same real event. To achieve this, we propose a stream clustering mechanism using a similarity metric we call the eventiveness similarity metric, combined with the cosine measure as defined in [157]. Stop word removal, lemmatisation and Porter stemming are applied before running the cosine function. Finally, a Bayesian model is applied to perform the binary classification task.

3.3 Domain Specific Datasets

In this section, we present the Afghanistan-Pakistan (AfPak) and ACLED datasets used in this research. These are event-annotated datasets, extracted from both news articles and Twitter. The idea is to help social science analysts describe and automate the complex and expensive task of extracting factual reported information from different news sources, gathering as much information as possible. In this way, analysts are able to accurately analyse a particular conflict, relying on collected, processed and curated information from such sources, allowing them to identify possible roots of conflict, recurrent events, and actors/targets in a particular social conflict.

All the presented datasets were annotated by following specific methodologies according to the analysed conflict domain and the scope of the defined event types to be annotated. All presented datasets were processed as a set of event tuples and added to a knowledge database stored in PostgreSQL1.

3.3.1 The Afghanistan-Pakistan Socio-Political Conflict Datasets

The Afghanistan-Pakistan (AfPak) socio-political conflict datasets are a set of human-annotated events from a set of local and conflict-oriented text media sources. Two datasets were built for analysing and testing Event Extraction, one from news articles and one from Twitter. The datasets followed a rigorous annotation process, as follows.

Annotation Process and Validation

In essence, event coding requires both the development of a coding scheme, which in our case is given by the use of an ontology, and the systematic application of coding rules to the extracted data for inclusion into the knowledge base [142]. Numerous coding schemes have been developed, originally focused on international relations [104, 10, 145], later expanded to include domestic relations and local events [27, 61], and recently compared against each other [166, 134].

1https://www.postgresql.org

Initially, for data collection purposes, both sources, Twitter and news articles, were gathered using a big data platform deployed with Apache Spark2 on the Elasticsearch platform3. For this purpose, social scientists provided a list of sources (Tables 3.3 and 3.4). Data were collected using the Twitter API, and an overview of the annotation procedure is given in Figure 3.3.

Table 3.3: # Tweets by Author

Author                 # Tweets
SOHR                   1061
AAN Afghanistan        153
Afghanistan News.Net   89
Al-Monitor             1772
Clarion Project        378
Daniele Raineri        1501
Dawn.com               10069
Habib Khan Totakhil    758
INSO                   9
ISW                    386
Jihad Threat Monitor   117
Khaama Press (KP)      4105
Pajhwok Afghan News    2458
Syria Deeply           23
The Frontier Post      6289
The Long War Journal   82
U.S. Inst. of Peace    311

In the news articles dataset, the annotation was done by two annotators using software (the Event Coding Assistant) [85] that allowed them to highlight the relevant information (actors, targets, locations and times) in the text sentences of a news article (Figure 3.2), and to categorise the article into the set of categories defined in the AfPak ontology. The system was used between December 2016 and February 2017 by two experts. Annotator 1 worked the first 6 days, annotator 2 worked the following 15 days, and both annotators worked in parallel for the next 16 days. It was noticed that the annotators had a learning curve of 22 days, after which the annotation became stable. The Twitter dataset was annotated using the same system by one annotator between November 2016 and January 2017. The first annotation set was made between May and July 2016, and the second set gathered tweets between August and November 2016. The first set was discarded, as it was noticed that the annotator became proficient only after the second week of using the annotation tool.

2 http://spark.apache.org
3 https://www.elastic.co

Table 3.4: # News by Source

News Source               # News articles
Alemarah-English.com      1438
Dawn.com                  4706
Long War Journal          324
Jihadology.net            981
NY Times                  2093
shahamat-English.com      101
The Guardian              6710
Aghanistan-analysts.org   68
AghanistanNews.net        1808
Khaama.com                1651
Memrijttm.org             316
Pajhwok Afghan News       2595
Spinsaba.com              25
ToloNews.com              3341


Figure 3.2: The Event Coding Assistant annotation tool [85]

In total, 29,561 tweets and 45,000 news articles were processed for the period between August and November 2016. Regarding data filtering and cleansing, a domain-specific ontology was built by the social science analysts to replicate procedures similar to those of GDELT and CAMEO, in which events are coded with ontology instances (more details on the ontology are given in the next section). Data filtering and cleansing thus took place by filtering the data using domain-specific ontology topics, organisations, persons, and locations.

Data filtering occurred during Twitter gathering and news article scraping, and during local storage.

Figure 3.3: AfPak events annotation pipeline

Local storage only took into account those text articles covering at least one ontology value, or a possible event of interest, via ontology filtering, i.e. only tweets and news with values in the AfPak ontology (keywords such as Afghanistan or Taliban) were added to the raw repository for further analysis. Data cleansing removed HTML links and noise such as images and advertising. Once the data had been cleaned and filtered, the annotation process took place using a tailor-made annotation tool [85]. This software allows the annotator to extract an event of interest from each incoming text. An event of interest, therefore, was an event contained in the sample dataset related to the socio-political conflict analysed in this research work.

Data validation took place after the annotation phase: both annotated event datasets were double-checked by the annotators, who made a second pass over each annotated event. In addition, a random 10% sample of the whole annotated dataset was validated by a third expert annotator, and changes/discrepancies were reflected in both the News and Twitter datasets.

Domain Specific Ontology

As previously mentioned, an ontology can be described as a structured set of information, usually in the form of a knowledge base, represented in a persistent format, e.g. a database or a set of text files, capturing the classification of entities and their relationships to one another [9]. In other words, an ontology is a hierarchically structured set of entities and their relationships in a particular domain.

In general, ontologies are meant to be used to enrich and standardise information, and event extraction in particular offers a reasonable scenario in which to exploit ontologies. Ontologies are highly expressive, and they allow users to relate and organise information in a structured, hierarchical and relational form, easing the visualisation and analysis of the knowledge base. It is important to note that the AfPak ontology is also hierarchically structured; therefore, the annotator has more flexibility in the coding process, in which raw text can be related to any level of the hierarchy. The ontology extends CAMEO [145] to include not only event types, as described above, but also Afghanistan-specific individuals, organisations, locations, and types of equipment. Definitions of the ontology types are given in Appendix D.

Figure 3.4: Events in News articles time series

AfPak News Dataset

The Afghanistan-Pakistan news dataset was initially used for event classification tasks [67]. The dataset comprises a fine-grained set of human-annotated events from a selection of news articles given by the experts. The annotation procedure followed the same process as the Twitter dataset. Human annotators mapped and identified actors, targets, locations, and event types against the defined Afghanistan-Pakistan (AfPak) ontology. As a result, annotators identified 1,478 events mapped into 72 different parent event categories. Only relevant events were categorised and extracted. A skewed distribution is observed in this dataset, as shown in Figure 3.9.

The chart in Figure 3.4 shows trends in the number of annotated events over time, between August and November 2016. In the chart, major military battles are marked throughout the period (the Jani Khel offensive, the battle of Kunduz and the battle of Boz Qandahari). There is a high spike of annotated events in mid-August, driven by an increase of reported events from the Taliban perspective, which later in the year was reduced as the website went offline. Also, of the 10 sources, the most reported source was Tolonews, followed by Alemarah (the Taliban website) and Khaama Press (Figure 3.5).

Figure 3.5: Events by sources in AfPak News

Regarding the other variables, similarly skewed distributions are seen for both actors and targets across the entire dataset. Figures 3.6 and 3.7 show histograms of the most frequent actors and targets annotated in the news articles dataset. From the data, we can state that there is a prevalence of events annotated with the Taliban as an actor, and the other main parties to the conflict also show a high tendency to appear as actors in the events. As expected, the Taliban also appear at the top of the histogram, as they were highly active during the examined period. The main difference between actors and targets is that civilians and Afghan locations, in general, appear as the remaining top targets in the dataset. In the annotated dataset there are 138 fine-grained categories, but we focus on classifying top-level categories, ending up with 58 highly imbalanced labels to recognise. The majority class among the fine-grained categories accounted for 12%, and among the top-level categories 20% (the kill event type), and a large number of sparse classes, such as negotiate or mistreat, were found with around 1% of the data.

Figure 3.6: Top 12 Actors in AfPak News

Figure 3.7: Top 12 Targets in AfPak News

Figure 3.8: Top 12 Locations in AfPak News

Similarly, location entities (also found as ontology concepts) were marked in the annotated events, shown in Figure 3.8. Kunduz appears to be the highest value. This seems reasonable, as the Battle of Kunduz took place during the selected annotation timeframe (October 2016). Nangarhar and Kabul appear next, as both places are major cities and economic centres of Afghanistan [169]. It is also worth mentioning that battles appear to be causally related in the periods between them, as observed by the AAN (Afghanistan Analysts Network) [137].

Finally, a correlation matrix is given in Table 3.5, analysing event categories vs actors and event categories vs targets. Correlation analysis gives insight into whether there is a direct dependency between the labelled components. A relatively strong negative correlation is seen between the actor and target values; intuitively, entities annotated as actors in an event are not selected as targets at the same time.

Figure 3.9: Top 20 event types distribution in AfPak News dataset

Table 3.5: News Articles ρ Correlation Matrix

              Actors    Targets   Event Types
Actors         1        -0.398    -0.067
Targets       -0.398     1        -0.066
Event Types   -0.067    -0.066     1

From the initial exploratory analysis, it can be seen that the news articles dataset is a multi-class, multi-labelled dataset, containing mainly two tasks to perform (event type classification and argument (entity) role classification). From the class distributions shown in the data, we can

conclude that there is a repetitive skewness pattern across the annotated values. Also, there is an inverse correlation between actor and target values. This result seems reasonable, as actors are not usually chosen as targets within the same event. Lastly, the possibility of creating separate machine learning models opens up, as there is a degree of dependency between sources and event category values.

AfPak Twitter Dataset

Similarly to the previous dataset, we present the AfPak Twitter dataset, an annotated event knowledge base obtained from publicly available Twitter data, filtered to obtain information on political violence across Afghanistan. The list of sources was explicitly chosen and filtered by social science experts. This dataset has 1,061 individually categorised events between July and November 2016. Each of these events is annotated with types, actors, actions, and targets accordingly, and contains dates and locations. Some events do not contain the full set of actors and targets; we strictly annotate those including at least an actor or a target in the sentence.

The dataset was categorised using the AfPak-modified version of the CAMEO domain-specific ontology. In the annotated dataset there are 138 fine-grained categories, but we focus on classifying top-level categories, ending up with 58 highly imbalanced labels to recognise. The majority class among the fine-grained categories was about 12%, and among the top-level categories about 20% (the kill event type), and a large number of sparse classes, such as negotiate or mistreat, were found with around 1% of the data.

We have estimated that the labelled sample is sufficient to reach a 95% confidence level with a 5% error margin as a significant sample of the population of tweets in the annotated period, given that the available unlabelled dataset has 28,991 instances in total, of which 1,495 were annotated. The sample was annotated during October and November 2016. From the dataset, we were also able to identify the non-labelled tweets, which were around 45% of the whole annotated sample. Tweets were automatically rejected for annotation if the sentence lacked entities specified in the domain-specific ontology made for this purpose. From the dataset, there were 843 annotated tweets, and 64 of those text sentences were annotated as containing more than one event. For this analysis, there are 907 categorised events in the dataset.
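As a check on this claim, applying the standard finite-population sample-size formula (an assumption about how the estimate was derived), with z = 1.96, p = 0.5, e = 0.05 and N = 28,991:

    n_0 = \frac{z^2 \, p(1-p)}{e^2} = \frac{1.96^2 \times 0.25}{0.05^2} \approx 384.2,
    \qquad
    n = \frac{n_0}{1 + (n_0 - 1)/N} = \frac{384.2}{1 + 383.2/28991} \approx 379

so the 1,495 annotated instances comfortably exceed the required sample of roughly 379.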

Similar to the AfPak News dataset, it can be seen that skewed distributions are also prevalent in this dataset, as shown in Figure 3.11, as a consequence of annotating with a granular domain-specific ontology. It is essential to note the similarity between the two datasets, Twitter and News articles, as they exhibit very close distributions, in which the main annotated categories are Kill, Injury and Attack, placing the highest labelled category at nearly 20% of the total. There are some variations between categories, but the essence of the skewness and the major categories remain consistent.

Regarding correlations between the main event components, we can observe a direct correlation between actors, targets and event types. All combinations have positive correlations, with actors vs event types the highest (0.67). This suggests that, contrary to the News articles dataset, Twitter annotation relies much more heavily on the coding scheme, and that the reported actors and targets are fewer in number compared to the News articles dataset.

Figure 3.10: Events by source (author) in AfPak Twitter

Figure 3.11: Top 20 event types in AfPak Twitter dataset

In summary, it can be stated that the Twitter and News articles datasets present similar distributional patterns in all annotated event components. Differences between the two text datasets include the nature of the presented data, as Twitter is more concise, whereas news articles present more contextual information. Therefore, more granularity is reflected in the News dataset, especially

Figure 3.12: Top 12 actors in AfPak Twitter

Figure 3.13: Top 12 targets in AfPak Twitter

Figure 3.14: Top 12 locations in AfPak Twitter

Table 3.6: AfPak Twitter ρ Correlation Matrix

              Actors   Targets   Event Types
Actors         1        0.58      0.67
Targets        0.58     1         0.54
Event Types    0.67     0.54      1

In addition, the correlation matrices support this idea, as Twitter presents a clear, direct correlation between the different annotated components, as seen in Table 3.6. In contrast, the News articles dataset shows less consistent correlations, except for the actors vs targets correlation. Despite this difference, both annotations present almost the same top results, including the top 3 targets, top 3 actors and top 3 categories, suggesting validity in the annotation procedure.

This exploratory data analysis suggests that it might be worth building models for each of the reported authors in the AfPak datasets, given the high degree of correlation between the reported source of information and the labelled categories in each of the event components. Subsequent chapters of this thesis investigate the idea of training Event Mining models for each source of information, and even further, for highly correlated features. From both datasets, there is a negative correlation between an entity being used as actor and as target within the same event, yet the same pairings recur often, as there is a constant conflict between known parties (Taliban vs Afghan police/army, Taliban/government vs civilians).

3.3.2 ACLED Dataset Sample

The Armed Conflict Location & Event Data Project (ACLED4) is also a conflict-related collection of publicly available fine-grained events across Latin America, Europe, Africa, Asia, and the Middle East. The dataset records dates, actors, targets, locations, and fatalities of reported violence in these regions.

The downloaded and analysed dataset comprises 119,163 events between 2010 and 2018, classified into 8 different categories. It also contains around 380 different sources of information over 11 different countries, including 2,402 events from Pakistan. For full Event Extraction experimentation, a sample from the ACLED dataset was taken, counting only events belonging to the Afghanistan region. This sampling was complemented with an ontology constructed from ACLED's entities and their a.k.a. links. This was necessary because SPLICER (next chapter) uses the ontology as background knowledge to initially infer locations. In addition, it was not possible to use the AfPak ontology, as events were reported in ACLED using different actors and locations. Sometimes, ACLED's coded actors and targets were not present in the raw text, even with

4https://www.acleddata.com

synonym substitution. For this reason, it was necessary to filter out inconsistent instances by mapping all event components against the ACLED ontology. Instances were removed if any annotated component was missing from the raw text.

In total, there are 4,307 fully annotated ACLED instances with the following statistics:

Table 3.7: ACLED Afghan Sample Statistics

Distinct events   Count
# Events          4,307
# Entities           43
# Locations         519
# Categories          8

Table 3.8: ACLED Top 3 Values per Component

            Top 1                                 Top 2                     Top 3
Actors      Taliban (59.4%)                       IS (10.3%)                police (9.4%)
Targets     police (43.3%)                        Taliban (22.9%)           IS (10.3%)
Locations   Kabul (4.1%)                          Ghazni (3.8%)             Jalalabad (3.6%)
Categories  Battle-No chg. of territory (64.3%)   Remote violence (24.4%)   Violence against civilians (6.2%)

From Table 3.8, we can see that the most frequent event type is “Battle-No change of territory”, with 64.3% of the total; both the baselines and our algorithms are expected to perform better than the reported majority classes. Similarly, the top labelled actor, target and location reach 59.4%, 43.3% and 4.1% respectively. The top labelled entity is Taliban, appearing in 3,620 text instances as either an actor or a target, depending on the event type and the annotator's style. Table 3.8 also shows the skewed distribution of actors.

Particularly for Afghanistan, the ACLED dataset shows similar distributions to the AfPak datasets, with some variations. An initial analysis shows that, as ACLED focuses more specifically on violence, it shows more relations between the Taliban and the military forces. A similar pattern is also seen in target entities (Figure 3.15).

Finally, ACLED differs from AfPak regarding reported locations, as Nangarhar is the province with the most battles in the country (Figure 3.16). As shown by both figures, the datasets tend to include more events of specific interest according to each dataset's purpose. ACLED tends to report more battles and ground fights, whereas AfPak contains more granular details on what happened, not only in battles but also in internal displacement and inter-group negotiations.

Figure 3.15: Actors distribution in ACLED dataset

3.3.3 Stream Datasets in Other Domains

In addition to the socio-political conflict datasets described in the previous section, Stream Mining classifiers can also be tested in different domains. There are a number of available Stream Mining datasets outside the text processing domain that can be used for testing on more generic classification tasks. This research includes some single layer Stream Mining classifiers that can be used in other classification tasks, apart from our main text Event Mining classification tasks. The following datasets were selected for generic Stream Mining testing:

Cover Type
The cover type dataset was initially used by Blackard [25] and is one of the most common datasets in the Stream Mining literature5. It contains cartographic variables used to predict a forest cover type category, with 581,012 instances, 54 attributes, and 7 classes. For testing purposes, we used a normalised version of the dataset.

Airlines
The airlines dataset6 contains information about delays of scheduled flight departures.

5https://archive.ics.uci.edu/ml/datasets/covertype
6https://www.openml.org/d/1169

Figure 3.16: Locations distribution in ACLED dataset

The main task in this dataset is to predict whether or not a flight will be delayed. The dataset contains 8 features and 539,383 instances. The majority class is around 55.7% in this binary classification task.

Electricity
The electricity dataset was developed by Harries [66] and contains data about the New South Wales electricity market. Prices are set every 5 minutes and are heavily affected by changes in demand and supply over time. The dataset contains 45,312 instances, and the class identifies the price trend, either UP or DOWN, with respect to the previous 24 hours. The majority class (the DOWN label) covers close to 57.5% of the data. The normalised dataset can be downloaded from the MOA page7.

7https://moa.cms.waikato.ac.nz/datasets/

3.4 Evaluation and Metrics

A point on which the SEMF framework differs slightly from other Event Mining frameworks is the evaluation of online, or Stream Mining, algorithms across the whole Event Mining task. As mentioned previously, the goal of this research is to implement Stream Mining methods across the whole event extraction process, so the SEMF framework considers the online evaluation of Stream Mining methods for Event Mining tasks. Therefore, the SEMF framework adopts the notions of being ready to predict at any time, working with potentially infinite instances, being able to shift models throughout time (drift adaptation), working with a limited amount of memory, and training in an updatable fashion, in a single pass as each instance arrives, as stated in [22, 91, 55, 56].

3.4.1 Prequential Evaluation

Prequential metrics were developed by Gama et al. [58] and are widely used in the Stream Mining community. The basic idea behind prequential evaluation is to iteratively test a model as each instance arrives in the prediction component of the Stream Mining task. Every instance is first tested (or predicted, as known in other machine learning communities), and the same instance is then used for training the evaluated model. The main difference

between prequential tests and batch machine learning tests is the way machine learning models are trained and tested. Batch learning requires a training set, usually used in a cross-validation setting, and then a separate set of test instances for model evaluation. On the other hand, prequential evaluation does not require cross-validation (although it can be combined with it if needed).
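The test-then-train cycle can be summarised in a short sketch; the `predict`/`learn` interface below is hypothetical, standing in for any Stream Mining classifier conforming to the framework:

```python
def prequential_accuracy(stream, model):
    """Test-then-train loop: every instance is predicted first,
    then used (once) to update the model."""
    correct = seen = 0
    for x, y_true in stream:          # time-ordered (instance, label) pairs
        y_pred = model.predict(x)     # 1. test on the still-unseen instance
        correct += int(y_pred == y_true)
        seen += 1
        model.learn(x, y_true)        # 2. single-pass incremental update
    return correct / max(seen, 1)     # running prequential accuracy
```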

Prequential methods are the de-facto testing standard in Stream Mining, as it has been proven statistically that the prequential error converges to that of a holdout estimator [165]. Correspondingly, prequential metrics are presented for the following Event Mining tasks: Event Detection, Argument Detection, Argument Classification, Event Extraction, and Event Coreference resolution. The main metrics are precision, recall and F1, specifically tailored for each Event Mining task.

3.4.2 Precision, Recall and F1

All Event Mining tasks are analysed in terms of prequential precision (P), recall (R) and F1 score. The difference between the Event Mining tasks lies in the way true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are defined and calculated. We first present the definitions of precision, recall and F1, and then the definitions for each Event Mining task. Precision is defined as:

P = TP / (TP + FP)    (3.1)

where TP are the true positives, i.e. the number of instances correctly predicted by the Stream Mining model, and the false positives FP are the number of instances incorrectly predicted by the model. Similarly, recall is defined as:

R = TP / (TP + FN)    (3.2)

in which FN are the false negatives, i.e. the number of correct labels that the Stream

Mining algorithm did not assign. F1 then is defined as the harmonic mean between P and R:

F1 = 2 · (P · R) / (P + R)    (3.3)

These metrics have been widely used in data mining and machine learning research [32].

3.4.3 Event Detection and Classification Measures

As mentioned above, Event Classification metrics follow from the way TP, TN, FP and FN are calculated. For Event Classification tasks, we define true positives based on the intersection between the predicted and the true categories of an event e. The set of true categories of e is denoted C, according to our definition in Section 3.1.2. Consequently, a true positive is a correctly classified event type in the predicted set of event categories C′. Thus, TP for Event Classification is defined as the number of correctly predicted event types in C′ belonging to C:

TP = |C′ ∩ C|    (3.4)

Similarly, false positives and false negatives are defined as:

FP = |C′| − |C′ ∩ C|    (3.5)

FN = |C| − |C′ ∩ C|    (3.6)

Calculations for Event Detection and Classification are performed in SEMF according to these definitions.
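A minimal sketch of these counts, treating the true and predicted category sets as Python sets (the function name is ours, not part of SEMF):

```python
def event_classification_prf(pred_types, true_types):
    """Per-event counts following Eqs. (3.4)-(3.6), then Eqs. (3.1)-(3.3)."""
    c_pred, c_true = set(pred_types), set(true_types)
    tp = len(c_pred & c_true)              # TP = |C' ∩ C|
    fp = len(c_pred) - tp                  # FP = |C'| - |C' ∩ C|
    fn = len(c_true) - tp                  # FN = |C| - |C' ∩ C|
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# e.g. predicting {Kill, Strike} against true {Kill, Attack}: P = R = F1 = 0.5
print(event_classification_prf({"Kill", "Strike"}, {"Kill", "Attack"}))
```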

3.4.4 Argument Detection and Classification Measures

For Argument Classification, detected arguments become instances, and argument roles (Actor, Target, Location) become labels for the classification task. As a result, for Argument Detection and

Classification, the true positives are the correct classifications of a given entity as actor, target or location. That is, a true positive TP is counted when the entity is correctly categorised into the correct component set:

TP = |A′ ∩ A| + |T′ ∩ T| + |L′ ∩ L|    (3.7)

FP = (|A′| − |A′ ∩ A|) + (|T′| − |T′ ∩ T|) + (|L′| − |L′ ∩ L|)    (3.8)

FN = (|A| − |A′ ∩ A|) + (|T| − |T′ ∩ T|) + (|L| − |L′ ∩ L|)    (3.9)

where A′ is the predicted set of actors and A the real set of labelled actors, T′ the predicted set of targets and T the real set of targets, and L′ the predicted set of locations and L the real set of locations.
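The same counting, summed over the three role sets, can be sketched as follows (an illustrative helper, with each role mapped to a set of entities):

```python
def argument_counts(pred, true):
    """TP, FP, FN summed over roles, following Eqs. (3.7)-(3.9).
    `pred` and `true` map 'actor'/'target'/'location' to entity sets."""
    tp = fp = fn = 0
    for role in ("actor", "target", "location"):
        overlap = len(pred[role] & true[role])
        tp += overlap
        fp += len(pred[role]) - overlap
        fn += len(true[role]) - overlap
    return tp, fp, fn
```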

3.4.5 Event Extraction Metrics

Final event extraction results are evaluated using the following metrics: prequential precision, recall and F1 over the text classification tasks, together with an event correctness metric – the function Ψ(Y, Y′) – whose intuitive meaning is the extent to which the extracted event resembles the real event. This is measured as:

Ψ(Y, Y′) = |Y′ ∩ Y| / |Y|    (3.10)

in which Ψ(Y, Y′) measures the overlap between the predicted event components set Y′ and the real event components set Y. Therefore, Ψ(Y, Y′) will be close to 1 if Y′ contains the maximum number of components in Y, and 0 if Y′ is a completely unrelated event, that is, although plausible, Y′ does not contain any item in Y. Thus, the goal is to maximise this function, and at the same time we will explore the prequential precision, recall and F1 metrics, changing the true pos-

itives (TP), false positives (FP) and false negatives (FN) so that the results reflect event extraction component predictions, that is, extending the calculations to Event Detection and Classification, and Argument Detection and Classification.

TP = |Y′ ∩ Y|    (3.11)

FP = |Y′| − |Y′ ∩ Y|    (3.12)

FN = |Y| − |Y ∩ Y′|    (3.13)

The true positives derive from the number of components present in both the predicted and the annotated event, whereas the false positives derive from the predicted components not present in the real annotated event Y. Similarly, the false negatives are drawn from the number of annotated components not present in the predicted event Y′.
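Both Ψ and the component-level counts reduce to set operations over the ELF components; a sketch, representing each component as a (role, value) pair:

```python
def event_correctness(pred_elf, true_elf):
    """Ψ(Y, Y') = |Y' ∩ Y| / |Y|  (Eq. 3.10)."""
    y_pred, y_true = set(pred_elf), set(true_elf)
    return len(y_pred & y_true) / len(y_true) if y_true else 0.0

def extraction_counts(pred_elf, true_elf):
    """TP, FP, FN over event components, following Eqs. (3.11)-(3.13)."""
    y_pred, y_true = set(pred_elf), set(true_elf)
    tp = len(y_pred & y_true)
    return tp, len(y_pred) - tp, len(y_true) - tp

y_true = {("action", "Kill"), ("actor", "Taliban"), ("location", "Kabul")}
y_pred = {("action", "Kill"), ("actor", "Taliban"), ("location", "Ghazni")}
print(event_correctness(y_pred, y_true))   # 2/3: two of three components match
```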

3.5 Summary

In this chapter, an overview of the SEMF framework was given to define the Event Mining tasks to be performed under online constraints. In this way, the SEMF framework deals with different single-layered and multi-layered Event Mining tasks in near-real time. The definitions of the framework were given, including the Event Mining learning tasks to perform, text pre-processing techniques, online learning algorithms, and the metrics used to evaluate each Event Mining task.

Event datasets were also introduced and initially explored, showing certain similarities between each of them. We presented 4 datasets to be further explored and used to test Event Mining tasks, particularly event classification and the whole event extraction task. The datasets included the AfPak Twitter and News articles event datasets and the ACLED dataset.

Skewed distributions were commonly observed across all datasets and entities to be extracted. The main event categories relate to violent events, including battles, attacks, killings, and injuries. The main actors and targets for Afghanistan, in particular, relate to the Taliban and the Afghan army. In addition, the ACLED dataset has a different annotation style from the AfPak datasets.

In the next chapter, we perform Event Extraction from the AfPak and ACLED datasets. We use the annotated data as prequential training instances, conforming to the requirements of the SEMF framework under real-time conditions. Details of the first event extraction model and its experiments are given in the next chapter of this dissertation.

Chapter 4

Event Extraction From Text Streams Using SPLICER

This chapter explains the application of SPLICER (Stream Processing with Logical Integrity Constraints for Event extRaction), a near-real time Event Extraction pipeline based on integrity constraint rule learning. As discussed in Chapter 2, current Event Extraction approaches are either computationally expensive batch pipelines, or yes/no real-time Event Detection models that do not extract fine-grained event components such as actors, targets or locations. Here, we present a novel, evolvable constraint rule learning approach designed to induce exceptions based on incorrect component combinations, to avoid further errors. In this way, a multi-layered learning approach uses statistical stream classifiers in the first layer to identify components, and integrity constraint rules in the second layer to extract events. Experimentation shows overall improvements in prequential F1 by extracting more logical event component combinations than statistical pipelines. Our approach improves prequential F1 by around 10% over previous event extraction systems, and is tested under Stream Mining constraints in socio-political conflict domains.

To fully extract the whole event from a given text instance, we propose a new rule learning approach to identify and selectively choose the right set of event components for a given instance, such that the chosen set does not violate a learned set of constraints. A new online rule induction algorithm is proposed to recognise “wrong” Event Logical Forms (ELFs). This differs from previous stream rule learning in that it uses a bottom-up approach: it first recognises incorrect ELF combinations, and then attempts to classify an event using the rules inferred by the training algorithm. The goal is to extract the most likely event set as defined in Chapter 3.

The remainder of this chapter is organised as follows. The main rule induction algorithm is explained, including the multi-task classification approach used to find out which event components can be combined in the final event candidate representation. Finally, results and conclusions are given in the last section.

4.1 Introduction

As mentioned in Chapter 2, Event Extraction pipelines identify and extract events and their underlying components using batch-learning pipelines, and some of them use numeric embeddings to feed neural networks. We have shown that there is a research opportunity for performing Event Extraction with Stream Mining, in which both the learning and the prediction tasks are executed in near-real time, allowing interactive user feedback without waiting for large batch-learning training processes. Moreover, first-order logic provides an explainable representation of the model, enabling humans to interact with and give feedback to the learning model.

In this chapter, we present a novel approach to Event Extraction based on online integrity constraint learning. The idea is to find wrong component combinations in the event knowledge base to avoid misleading information (false positives), on top of a learning layer that uses known Stream Mining classifiers (a multi-layered learning approach). In this way, the procedure is flexible enough to support a large number of event component combinations, improving the overall recall of the general task. Furthermore, the experimental procedure adds novelty in being, to the best of our knowledge, the first Event Extraction method tested under prequential approaches, and also in using a comprehensive metric that captures how well the whole Event Extraction task performs, rather than giving partial classification results.

4.2 SEMF Framework and SPLICER

Recall our definitions of events and text streams in Chapter 1. In general, the idea of Event Extraction from text is to identify, from a text fragment, which event types, actors, targets and locations are involved in a specific collected event. This information comes from an unstructured text instance that can also be categorised within a set of predefined event categories. To emphasise, the input (text stream) and output (event) are defined to set up our learning tasks:

Input: Text Stream

A text stream X is a potentially infinite sequence of instances {x1, x2, ..., xi, ...} arriving in time order.

In addition, each text stream instance xi is composed of a word set W(xi) = {w1, ..., wm}. A text stream is then used to characterise events as follows:

Output: Event

Given a text instance xi, an event e from xi is a component tuple ⟨C, A, T, L, D⟩, in which each component is a set, representing event types (C), actors (A), targets (T), locations (L) and dates (D). A more detailed definition is given in Section 3.1.2. It is important to remark that our definition differs from previous ones in that an event can have multiple labels, rather than a single event type label. Hence, an event can be stored in an event knowledge base in the form of Event Logical Forms (ELFs), supported by an ontology as background knowledge B, with the aim of using such forms in the rule learning component of the model. Both the knowledge in B and the ELFs are logical data structures defined in the following sections.
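For illustration, the event tuple can be represented as a simple record type; the field names below are ours, not part of the framework:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """An event e = <C, A, T, L, D>; every component is a set, so a single
    event may carry multiple types, actors, targets, locations and dates."""
    types: set = field(default_factory=set)      # C: event categories
    actors: set = field(default_factory=set)     # A
    targets: set = field(default_factory=set)    # T
    locations: set = field(default_factory=set)  # L
    dates: set = field(default_factory=set)      # D

e = Event(types={"Kill"}, actors={"Taliban"}, locations={"Nangarhar"})
```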

4.2.1 Ontology as Base Knowledge

An ontology O (as explained in Section 3.1.3) represents a set of entities and their relationships in a particular domain. Our objective is to use O within a knowledge base KB as additional background knowledge about our domain of interest. This is achieved by translating O into background knowledge B in the form of first-order logic definite clauses. In this translation, each node in the ontology becomes a constant in B, and each link in O becomes a relation instance in B (except for a.k.a. links). An ontology is translated into a set of logical expressions in B using the following steps, resulting in a Datalog1 program:

1http://www.ccs.neu.edu/home/ramsdell/tools/datalog/datalog.html

• All ontology values taken from existing nodes in O are translated into constants in B, and are represented with upper-case words, e.g. Afghanistan, Omar_Manzur (except that for nodes that represent alternative names for the same entity via the a.k.a. relation, only one canonical name is used for the entity).

• All ontology classes and their hierarchical structure in O (equivalent to the terms Class and subClassOf used in the OWL language) translate into the relation is-a(x, y) in B. For example, the class Location from the AfPak ontology is translated into B as the constant Location, and if Afghanistan is a particular location then it translates into the relation is-a(Afghanistan, Location). In addition, the relation is-a(x, y) follows the transitivity axiom:

is-a(x, y) ∧ is-a(y, z) → is-a(x, z)    (4.1)

• Links in O (except for a.k.a. links) become atoms in B of the form f(x, y). Each link in O creates a new atom in B, such as part-of(Sangsar, Punjawye_D) from the AfPak ontology links in O (see the example in Figure 4.1). Specifically, for the socio-political ontology translation, we also created the relations part-of(x, y) to represent districts and states, and member-of(x, y) to represent affiliations between persons and organisations. We defined the following transitivity axioms:

part-of(x, y) ∧ part-of(y, z) → part-of(x, z) (4.2)

member-of(x, y) ∧ member-of(y, z) → member-of(x, z) (4.3)

Ontology Translation Example
Figure 4.1 shows an example of how an ontology is translated into a knowledge base, in which the top levels of the hierarchy are the classes Place, Person, Organisation and Action. Each node has its own relationships, depicted by the predicates is-a, part-of and member-of.

Figure 4.1: AfPak ontology fragment used for logic translation

Then, the result of applying the above definitions for translating O from Figure 4.1 into B is

the following set of first-order clauses, in which the following are constants:

Place

Action

Person

Organization

Similarly, nodes in O translate into the following constants in B:

Group_Fight
Border_Clash
Mullah_Omar
Mohammed_Omar
Taliban_Leader
Taliban
Haqqani
Sangsar
Punjawye_D
Afghanistan

The links in the ontology translate into the following atoms:

is-a(Afghanistan, Place)
is-a(Punjawye_D, Place)
is-a(Sangsar, Place)
is-a(Taliban_Leader, Person)
is-a(Haqqani, Organization)
is-a(Group_Fight, Action)
is-a(Border_Clash, Group_Fight)
is-a(Mullah_Omar, Taliban_Leader)
member-of(Mullah_Omar, Taliban)
member-of(Taliban, Haqqani)
part-of(Sangsar, Punjawye_D)
part-of(Punjawye_D, Afghanistan)

The following facts can be inferred using transitivity:

is-a(Mullah_Omar, Person)

member-of(Mullah_Omar, Haqqani)

part-of(Sangsar, Afghanistan)

is-a(Border_Clash, Action)

In this way, the ontology in the example from Figure 4.1 is incorporated into background knowl- edge B to be used for building the event knowledge base KB.
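The forward chaining needed for the transitivity axioms (4.1)-(4.3) can be sketched over ground facts stored as triples. In practice the thesis delegates this to a Datalog/Prolog engine, so the following is only an illustration:

```python
def transitive_closure(facts):
    """Close a fact set under the transitivity axioms (4.1)-(4.3).
    `facts` holds (relation, x, y) triples, e.g. ('is-a', 'Sangsar', 'Place')."""
    transitive = {"is-a", "part-of", "member-of"}
    closed = set(facts)
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for rel, x, y in list(closed):
            if rel not in transitive:
                continue
            for rel2, y2, z in list(closed):
                if rel2 == rel and y2 == y and (rel, x, z) not in closed:
                    closed.add((rel, x, z))
                    changed = True
    return closed

facts = {("member-of", "Mullah_Omar", "Taliban"),
         ("member-of", "Taliban", "Haqqani")}
assert ("member-of", "Mullah_Omar", "Haqqani") in transitive_closure(facts)
```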

4.2.2 Event Logical Form (ELF)

An event e, then, is represented by an event logical form (ELF), expressed as a set of logical atoms, in which each of the tuple elements in ⟨C, A, T, L, D⟩ is represented as a set of atoms using the relations action(e, c), actor(e, a), target(e, t), location(e, l), time(e, d) for each c ∈ C, a ∈ A, t ∈ T, l ∈ L, d ∈ D.

In addition to ELFs, our framework also works with background knowledge B, represented as a set of clauses and relations derived from the logical representation of an ontology. Thus, any element in B can be used to represent any component in the tuple ⟨C, A, T, L, D⟩. Note that ELF relations are part of the knowledge base KB. The use of ontology values to represent ELFs is called event coding. In addition, an ELF can also contain entities that are not constants in B; these entities are treated as text strings.

In this way, our Event Knowledge Base (KB) is a finite set of ELFs plus the logical axioms derived from our ontology B (background knowledge). Consequently, Event Extraction produces items following a tuple-defined structure, similar to database attributes, in which each attribute should adhere to a set of domain integrity constraints [133] represented as definite clauses.

Example: An extracted event from a tweet is shown in Figure 4.2.

Figure 4.2: Representation and extraction of an event in Twitter. In this example the tweet is represented as an ELF in KB

In the following example, an event e = ⟨C, A, T, L, D⟩ is characterised by a selection of event types, actors, targets and locations coming directly from the text in Figure 4.2. Hence, e has the following event logical form:

is-a(e, Event)

action(e, Kill)

action(e, Strike)

actor(e, UAV)

target(e, Taliban)

target(e, Daesh_fighters)

location(e, Nangarhar)

Note that the ELF representation uses canonical names, replacing the raw value with its corresponding a.k.a. (as in actor(e, UAV)); thus there is no need to translate the a.k.a. links into the background knowledge B. If the raw value is not present in the ontology, then the raw string value is used in the ELF (as in target(e, Daesh_fighters)). The following facts from the background

knowledge B are used when processing this example:

is-a(Taliban, Organization)
is-a(Nangarhar, Place)
is-a(Afghanistan, Place)
part-of(Nangarhar, Afghanistan)
is-a(UAV, Artifact)
is-a(Kill, Action)
is-a(Haqqani, Organization)
member-of(Taliban, Haqqani)
is-a(Strike, Attack)
is-a(Attack, Action)

Also, it is important to note that our ontology translation is general, in that it can be applied to any ontology consisting of hierarchically structured sets of classes (taxonomies) and relations between entities. Integrity constraint checking is done with SWI-Prolog2. In this way, our ultimate goal is to automatically extract ELFs from a given text stream X. Our approach is to build a set of integrity constraints IC so that KB is filled only with consistent ELFs. Therefore, our proposed solution implements a way to create such rules or constraints by rule learning, defined in the following section.

4.3 SPLICER – Stream Processing with Logical Integrity Constraints for Event extRaction

In this section, a novel approach for integrity constraint learning is given, specifically addressing the correct generation of the event knowledge base from near-real time data sources. This approach is motivated by the problem of correct event candidate generation from texts represented as ELFs. Integrity constraints were initially introduced by Reiter [133] and further used in database representation and modelling. Consequently, our approach solves an integrity constraint induction problem.

SPLICER – Stream Processing with Logical Integrity Constraints for Event extRaction – is a multi-layered learning approach using probabilistic learning in the first layer and reasoning in the second (Figure 4.3). The system starts with no integrity constraints; these are learned

2https://www.swi-prolog.org

during the training/test cycle. Details are provided in the next section.

Figure 4.3: SPLICER pipeline to extract events and integrity constraints

Our framework, therefore, rather than choosing the correct items to form a tuple, rejects a candidate (or knowledge base state) if it is not consistent with the given integrity constraints, i.e. the chosen candidate should not be an exception captured by the integrity constraint rule set. Hence, the intuition is to take advantage of constraint induction on top of the Stream Mining classification approaches to generate accurate knowledge with which to populate the database of events.

In this way, correct candidates can be identified more efficiently and consistently, making the selection much more flexible. In the following, building on our previous definitions of integrity constraints, we define the learning tasks for the pre-processing stage and the first and second layers.

4.3.1 Text Pre-Processing

We perform text pre-processing in order to use available structured classifiers, converting each text stream instance into a numeric representation for each of the learning tasks. Depending on the task, we convert from text to either an ontology feature vector or a semantic feature vector.

A window of a fixed number of text stream instances is pre-processed every 1000 instances, according to our computational capabilities, which consisted of a 2.7 GHz Intel Core i5 and 16 GB of RAM. The window can be set to more or fewer instances according to the corpus size and the available memory. Empirically, a short pre-processing window leads to lower accuracy, as it makes the dictionary and the number of features grow every time the window is processed; conversely, a large number of features might result in memory overflow errors. We used the windowed incremental approach proposed in [164].

4.3.1.1 Incremental Pre-Processing for Event Type Classification

For the event type classification task, the given text instance is converted into a set of features representing the text instance in a numeric matrix, in the form of term frequency plus document frequency (TF-DF). In this method, each instance is matched against a set of word attributes coming from the text corpus. Each text instance is tokenised, and each word is matched against the word-set attributes. The matrix is filled using the frequency of each word in the text instance, together with how many times the word is seen in the whole window, i.e. in how many other documents the word appears; each document is counted just once, even if the word appears multiple times in it.

A windowed approach is used because Term Frequency-Document Frequency (TF-DF) and Term Frequency-Inverse Document Frequency (TF-IDF) [138] require the whole text corpus for training. By using a fixed or sliding window, the classifier can be fed every thousand instances; otherwise the whole matrix would change its values with each instance, misleading the online algorithm. The following is a general procedure for building incremental term weighting mechanisms (a sketch is given at the end of this subsection):

1. Collect text instances until the window size is reached (1000 instances in this scenario).

2. Once the window is full, for every instance within the window, split the text into words, and perform lemmatisation, stemming and stop-word removal. We used OpenNLP1 in its Java library version for this purpose.

3. For each token in the text instance, term and document frequency counts are calculated to build TF-(I)DF.

4. The text instance is transformed into an incremental word matrix (bag of words) to be passed to the event classification algorithm.

5. Each word's part-of-speech (PoS) tag, produced by OpenNLP2, is added to the feature set. PoS tags follow the Penn Treebank tag set3.

6. Repeat for each instance until the end of the window is reached.

1https://opennlp.apache.org
2https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
3https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Afterwards, each word token is matched hierarchically against the ontology (Appendix D). If the given word appears in any of the ontology labels, we assign it a weight according to the hierarchical level of the matched category. For example, the word “kill” appears in two ontology labels, “kill” and “complaint.kill civilians”, in the AfPak ontology; the feature for the label “kill” then receives more weight than that for the label “complaint.kill civilians” in our feature representation.

We found that after a certain number of instances the number of resulting features tends to stabilise, following Zipf's law [94]. In addition, standard cleaning was applied to the Twitter text, removing hashtags and performing name conversions against the ontology. It is also worth noting that this is similar to previous research in the area [154].
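The per-window weighting can be sketched as below. The thesis does not pin down exactly how the TF and DF counts are combined, so the multiplicative weighting here is an assumption:

```python
from collections import Counter

def tfdf_window(window_docs):
    """Build TF-DF vectors for one window of pre-processed documents.
    Each document is a list of tokens (already lemmatised, stop words removed)."""
    df = Counter()
    for doc in window_docs:
        df.update(set(doc))                 # each document counts once per term
    vocab = sorted(df)
    vectors = []
    for doc in window_docs:
        tf = Counter(doc)                   # raw term frequency in the document
        vectors.append([tf[t] * df[t] for t in vocab])  # assumed TF x DF weight
    return vocab, vectors
```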

4.3.1.2 Text Pre-Processing for Event Argument Classification

In contrast, the pre-processing technique used for Event Argument Classification differs from the one explained above. Here, we split each incoming instance into tokens, identifying the relevant entities in each instance. As a result, the input to the classifier is a stream of terms, each with the set of contextual features depicted in Table 4.1. Similar feature representations were used in [93], but we propose a different semantic enrichment using the ontology.

First, a sentence detector is applied to the text instance to identify its sentence structure. After this, an NER algorithm plus a part-of-speech (PoS) tagger is used for analysing each text instance using OpenNLP4. However, stop words were not removed, nor were other standard transformations applied, given that we wanted to keep the semantic knowledge conveyed by these words. For instance, the term “by” is a preposition (expressed as PREP by the lexical annotator), and in our domain an actor is more likely to be found after the preposition “by” in the text. In this way, we build a sliding-window feature representation that reflects the context of each named entity in the instance vector used by the learning algorithm.

Once we have obtained this initial token set, Named Entity Recognition (NER) is performed using the OpenNLP tool, and the resulting entities are matched against our domain specific ontology. For each named entity, if the words in the entity are contained in any of the ontology entity values, we add the matched ontology value as an ontology person, organisation or location feature.

4https://opennlp.apache.org

Table 4.1: Event Component Features

Feature                 Type     Definition
token value             string   text value of the named entity to classify as a likely event component
POS tag                 nominal  part-of-speech tag of the evaluated word
is on subject           binary   whether or not the given word is in the subject part of the sentence
main verb               string   main verb detected from the sentence
main verb voice         nominal  whether the main verb is in active or passive voice
ontology location       binary   whether or not the word is part of a location in the ontology
ontology PER/ORG        binary   whether or not the word is part of a person or organisation in the ontology
preposition before      binary   whether there is a preposition before the word
token value-2           string   value of 2 words before the current instance
POS tag-2               nominal  part-of-speech of word-2
token value-1           string   value of the word before the current instance
POS tag-1               nominal  part-of-speech of word-1
token value+1           string   value of the word after the current instance
POS tag+1               nominal  part-of-speech of word+1
token value+2           string   value of 2 words after the current instance
POS tag+2               nominal  part-of-speech of word+2
word-main verb length   number   number of words between the current word and the main verb

Finally, the raw text is converted into a structured representation of the text stream plus the real label of each word according to the corresponding component category: Actor (ACT), Target (TGT), Location (LOC), or no role (NONE), as given by the expert in order to validate the classification results. Once the text has been pre-processed, we perform the required Event Extraction tasks.
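A reduced sketch of the per-token feature construction of Table 4.1 follows. The main-verb and subject features are omitted here because they require a parse, and `ontology_entities` is a hypothetical surface-string-to-class lookup:

```python
def context_features(tokens, pos_tags, i, ontology_entities):
    """Sliding-window features for token i, after Table 4.1 (subset)."""
    def tok(j): return tokens[j] if 0 <= j < len(tokens) else None
    def pos(j): return pos_tags[j] if 0 <= j < len(pos_tags) else None
    onto = ontology_entities.get(tok(i))
    return {
        "token value": tok(i),
        "POS tag": pos(i),
        "ontology location": onto == "Location",
        "ontology PER/ORG": onto in ("Person", "Organization"),
        "preposition before": pos(i - 1) == "IN",   # Penn Treebank preposition tag
        "token value-2": tok(i - 2), "POS tag-2": pos(i - 2),
        "token value-1": tok(i - 1), "POS tag-1": pos(i - 1),
        "token value+1": tok(i + 1), "POS tag+1": pos(i + 1),
        "token value+2": tok(i + 2), "POS tag+2": pos(i + 2),
    }
```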

4.3.2 First Layer: Component Detection and Classification

Given an instance xi from a text stream X, the goal of a first-level stream classifier is to classify xi according to a given set of labels Υ = {υ1, ..., υz} belonging to a particular labelling family Υ ∈ FΥ. We focus in particular on the event type and argument classification learn-

ing tasks. The output of these classification tasks is used as input for the next learning layer.

Any Stream Mining classifier can be used in this first layer classification, as long as it follows the Stream Mining framework unified in [55, 56]. The classifier L can then learn using any Stream Mining algorithm conforming to the given definitions. Learning correctness is assessed by any chosen objective function, such as a Bayesian error function in prequential mode.

A pool of learners L = {l1, ..., lm} is proposed to predict the most likely sets of event types and arguments separately; the second learning layer (IC induction) is then responsible for joining consistent ELFs and for constraint checking. Because our learning problem amounts to selecting a maximally covering ELF candidate from the inputs, conforming to KB and IC, the pool of learners classifies the most likely set of event components to form the final ELF candidate, and the second learning layer checks the plausibility of that candidate.

4.3.3 Second Layer: Integrity Constraint Learning

The definitions above are used to predict correct, or more precisely, non-inconsistent ELFs. Since the system does not start with any integrity constraints, we face an integrity constraint induction problem: we are interested in discovering such constraints from the stream. Therefore, we propose a stream-based induction program to discover integrity constraints.

4.3.4 Horn Integrity Constraint

Integrity constraints are a good mechanism for maintaining consistent dataset representations in knowledge bases; the term refers to the rules used to “enforce legal states” in database feeding processes [133]. Our idea is to enforce evolving legal states, i.e. to update Horn constraints in our system as every instance arrives. An integrity constraint is a Horn clause of the form:

⊥ ← a1 ∧ a2 ∧ ... (4.4)

in which a1, a2, ... can be any atoms and ⊥ is a logical contradiction. In this way, as defined by [99], the knowledge base is represented using consistent first-order sentences to be updated by SPLICER.

4.3.5 Integrity Constraint Consistency and Evolvability

A knowledge base KB is said to be consistent with the integrity constraint set IC if:

KB ∪ IC ⊬ ⊥    (4.5)

Each integrity constraint has a confidence support greater than a certain threshold β. SPLICER deals with rule evolvability by counting rule supports from the training data as the model is trained over time. For this reason, SPLICER includes an additional integrity constraint pruning mechanism, explained later in this chapter, based on the Hoeffding bound [68], re-worked by Domingos et al. [45] as follows:

Definition 4.3.1. Let n be the number of independent observations of a random variable r with range R. The Hoeffding bound ensures, with probability 1 − δ, that the true mean E_true deviates from the estimated mean E_est by less than ε:

P(|E_true − E_est| > ε) < δ    (4.6)

ε = √(R² · ln(1/δ) / (2n))    (4.7)

where 1 − δ denotes the desired confidence level of the algorithm.

Therefore, an event candidate e can be derived by extracting a consistent ELF, i.e. our extracted components in e are likely to be a valid combination, given that e is not contradicted by IC.

Integrity Constraint Example
An integrity constraint might be given by the clause:

⊥ ← action(e, Bombing) ∧ actor(e, Minister_of_Mines)

Then, we can infer that any extracted ELF that incorrectly places the Minister of Mines as the perpetrator of a bombing attack is inconsistent.
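Since both the constraint bodies and the ELF candidates are ground atoms here, the rejection test reduces to subset checking. In SPLICER this checking is done in SWI-Prolog, so the set-based version below is only an approximation for illustration:

```python
def is_rejected(elf, integrity_constraints):
    """A candidate is rejected when the body of any constraint
    ⊥ ← a1 ∧ a2 ∧ ... is fully contained in the candidate's atoms."""
    atoms = set(elf)
    return any(set(body) <= atoms for body in integrity_constraints)

ic = [{("action", "Bombing"), ("actor", "Minister_of_Mines")}]
elf = {("action", "Bombing"), ("actor", "Minister_of_Mines"),
       ("location", "Kabul")}
print(is_rejected(elf, ic))   # True: the candidate is inconsistent
```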

In the previous section, we defined the foundations for learning plausible ELF candidates by rejecting inconsistent candidates from the final prediction. A plausible ELF Y′ is found when Y′ is the “best guess” of components belonging to the real ELF Y. We state the learning goal in Section 4.3.6.

4.3.6 Plausible Candidate Selection

Given an instance represented by a non-inconsistent ELF Y′ and the ground-truth ELF Y, the goal is to cover as many components as possible in Y′ such that:

B ∪ IC ∪ Y′ ⊬ ⊥    (4.8)

The best candidate selection is given by maximising the formula:

Ψ(Y, Y′) = |Y′ ∩ Y| / |Y|    (4.9)

The components of Y′ are chosen from the top categories predicted by SPLICER's first layer, ranked by probability of occurrence. If SPLICER extracts two plausible ELFs Y′ and Z′ with Ψ(Y, Y′) = Ψ(Y, Z′), the system chooses the one with the highest joint probability, max(P(Y′), P(Z′)), where P(Y′) = ∏_{comp ∈ Y′} p(comp) and p(comp) is the probability of the top label in each component comp. Consequently, SPLICER incrementally chooses plausible candidates via the following incremental integrity constraint rule learning method.
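To make the tie-breaking concrete, a small sketch; `candidates` is a hypothetical list of per-component top-label probabilities for ELFs with equal Ψ coverage:

```python
from math import prod

def joint_probability(component_probs):
    """P(Y') = product over p(comp), the probability of the top label
    chosen for each component of the candidate ELF."""
    return prod(component_probs)

# Two plausible ELFs with equal coverage: the second has the higher joint
# probability (~0.294 vs 0.24) and would be chosen.
candidates = [[0.8, 0.5, 0.6], [0.7, 0.7, 0.6]]
best = max(candidates, key=joint_probability)
print(best)
```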

4.4 SPLICER's ELF Selection Algorithm

In this section, the Horn integrity constraint learning algorithm is presented, together with an item-set selection procedure performed in the testing phase of the prequential approach, defining the top-level induction algorithm. The integrity constraint testing and training algorithms are presented in Algorithms 1 and 2.

The test and train algorithms work with any instance arriving at time index in a time-ordered fashion. First, the base Stream Mining models predict their respective tasks in their test methods. Secondly, the test algorithm takes the outputs of the first-level pool of classifiers and validates the plausibility of the candidate Y′ using the integrity rule set IC.

Algorithm 1: Event extraction procedure: test(x_index, IS_c)
Input: A set of predicted items IS_c = {f_c1(x_i), f_c2(x_i), ..., f_cm(x_i)} obtained from the pool C = {c1, ..., cm} of single stream learners, each predicting an individual learning task for the stream instance x_index; background knowledge B; a set of integrity constraint rules IC.
Output: An event candidate Y′.

Y′ ⇐ initialCandidate ⇐ IS_c
if initialCandidate is rejected by IC then
    Y′ ⇐ generateNextBestCandidate(c1, ..., cm, x_i)
    if Y′ ≡ ∅ then
        Y′ ⇐ initialCandidate − rejectedItems(IC, initialCandidate)
    end
end
return Y′

Test algorithm: if Y′ is inconsistent with KB, SPLICER's testing algorithm attempts to select an alternative candidate by choosing the next possible combination returned by the pool of learners L; this work is done in the generateNextBestCandidate method. An alternative candidate is thus chosen with the next predicted label in L. The intuition is that although we want to be confident in generating a correct answer, we also want to maximise the productivity of the base learners. If there is no better candidate, the initial candidate is selected as a last resort, without the subset of incorrect items in Y′. This approach is particularly useful as it increases partially right answers while maintaining the best possible recall, without rejecting a potentially good candidate to populate KB. The training algorithm is shown in Algorithm 2.

Algorithm 2: Integrity constraint induction algorithm: train(x_index, Y, Y′)
Input: Text instance x_index; predicted event Y′ using the pool of single label classifiers C = {c1, ..., cm} and the set of constraints IC; the real event Y; initial set of integrity constraints IC; an ontology set B and a set of blocked constraints BS; an objective function Ψ(Y, Y′) → [0, 1].
Output: An updated integrity constraint set IC.

for item ∈ Y′ ∧ item ∉ Y do
    NR ⇐ makeConstraint(item)
    if NR ∈ IC then
        updateSupport(NR, IC)
    else
        if NR ∈ BS then
            coverageError ⇐ computeCovError(NR, Ψ(Y, Y′))
            hoeffdingBound ⇐ computeHoeffdingBound(θ, support(NR) + unSupport(NR))
            if coverageError <= hoeffdingBound then
                BS ⇐ BS − NR
            end
        end
        IC ⇐ IC ∪ NR
    end
end
if isRejected(IC, Y) then
    updateBlockingConstraints(Y, IC, BS)
end
return updated IC, updated BS

Training algorithm: the algorithm creates, maintains and prunes the set of evolving integrity constraints IC. First, the makeConstraint method creates integrity constraints derived from the current instance Y, for every incorrect combination of components extracted in Y′. It then checks and updates the support of all existing constraints in IC. If any newly created constraint is found in the already pruned constraint set BS, the algorithm computes the coverage error, set as unsupport / (support + unsupport), to determine whether the pruned constraint should be reinstated into IC. This results in an evolving mechanism using the Hoeffding bound [71], calculated from the support(NR) and unsupport(NR) constraint metrics and the parameter θ, set to 0.9 confidence by default. The Hoeffding bound is computed as:

ε = √(r² · ln(1/θ) / (2n))    (4.10)

in which r = 1 and n is the number of processed instances. Additionally, the function updateBlockingConstraints(Y, IC, BS) attempts to avoid the generation of rules that block good candidates from being validated and returned. Blocking rules thus act as a pruning mechanism that removes such rules from the constraint set. If a rule in the constraint set is discovered to constantly reject the annotated event during the training phase, the algorithm assesses how well supported this rule is compared to previous rules. If the rule has a coverage error above the Hoeffding bound [71], we are confident in pruning the rule from the constraint set, and we move it to the blocking set. Similarly, if the rule has a coverage error below the Hoeffding bound and it is already in the blocked set, the rule might be useful again and is removed from the blocking set.
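The pruning decision can be sketched as follows, with the support/unsupport counters kept per constraint (the counter and function names are ours):

```python
import math

def should_prune(support, unsupport, theta=0.9):
    """Move a constraint to the blocked set BS when its coverage error
    unsupport / (support + unsupport) exceeds the Hoeffding bound of Eq. (4.10)."""
    n = support + unsupport
    if n == 0:
        return False
    coverage_error = unsupport / n
    epsilon = math.sqrt(math.log(1.0 / theta) / (2.0 * n))  # r = 1
    return coverage_error > epsilon
```

Reinstatement is the symmetric check: a constraint already in BS whose coverage error falls at or below the bound is moved back into IC.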

Finally, it is also important to remark that the rejection-based integrity constraint knowledge base population framework deals with multi-labelled prediction, that is, any base Stream Mining algorithm may return one or more predictions in the testing phase.

4.5 Experiments

4.5.1 Datasets

The baselines and SPLICER were tested using the AfPak event datasets and the ACLED dataset, described in Section 3.3.

4.5.2 Event Extraction Baselines

There is no direct comparison to our approach, given that current baselines were tested under batch learning scenarios (see Section 2.4). Nonetheless, we replicate the Event Extraction task using the JET system [65], and additionally we present a baseline that independently joins individual Bayesian classifiers from two separate event component classification tasks.

JET System
The JET system was used with its off-the-shelf parameters to identify events from the text datasets1. Where needed, equivalences were drawn according to the definitions given in the ACE annotation guidelines2. A table of equivalences is given in Appendix B.

The JET system involves a set of steps required to perform this complex Event Extraction task. Initially, the pipeline extracts all textual information from the whole set of text instances, and performs rule operations to analyse and extract entities via part-of-speech pattern recognition. To extract an event, patterns are drawn from a previous, manually constructed set of English-based rules that recognise trigger words and likely argument entities. Secondly, once trigger words and argument entities are recognised, the system performs a classification step via maximum entropy classification [65].

This model was trained using the ACE competition dataset [65], which comprises news articles from a range of sources and categories. Similar systems have used the same training data [93, 111]; however, due to the nature of our problem, a domain specific dataset was used for testing [145]. Off-the-shelf models were taken for the experiments, as re-training would have involved expensive re-annotation and processing. Nonetheless, we believe that the experiments make a fair comparison, as all methods were tested on unseen text instances, in addition to the normalised classification explained in the previous section.

Independent Multinomial Naïve Bayes (MNB) Learners
For the independent stream learners, we trained two separate, independent Bayesian approaches to deal with event type classification and argument role classification (Actor, Target, Location). Each learner outputs a predicted label per incoming instance, but there is no procedure to perform joint prediction. The baseline extracts the highest label from the event type learning task and the best event role candidate, using the best performing base learner (online Multinomial Naïve Bayes – MNB). For the event type classification, TF-DF featurisation was used (Section 3.2.1), whereas for the role classifier, each named entity was found using OpenNLP3 and treated as a potential argument (stream instance).

1https://cs.nyu.edu/grishman/jet/license.html
2https://www.ldc.upenn.edu/collaborations/past-projects/ace/annotation-tasks-and-specifications
3https://opennlp.apache.org

Evaluation metrics
Given that previous approaches focused on batch, per-task precision and recall metrics, we define our true positives, false positives and false negatives based on the distance between the predicted ELF Y′ and the real event Y, as defined in Section 3.4.5: TP = |Y′ ∩ Y|, FP = |Y′| − |Y′ ∩ Y| and FN = |Y| − |Y′ ∩ Y|. Accuracy calculations were performed according to these metrics. Event Extraction metrics, alongside Argument Detection and Classification metrics and Event Detection and Classification metrics, are used to analyse the results of this chapter.

4.6 Analysis of the Results

This section reports results for all the presented datasets. The AfPak Twitter and News datasets are analysed separately, as both use the AfPak Ontology (Section 3.3.1), and ACLED is analysed afterwards. In general, model and algorithm outputs tend to show consistent accuracy patterns. SPLICER outperformed the other mechanisms in the overall Event Extraction task, even though the JET system was trained in batch mode. Details are presented in the next sub-sections.

Table 4.2: Prequential Event Extraction Results

Component Detection
                      Twitter                      News
Method        Prec     Rec      F1         Prec     Rec      F1
JET           70.4%    48.4%    57.4%      36.9%    35.2%    36%
Indep. MNB    51.7%    74.2%    61%        28.17%   64.3%    39.2%
SPLICER       52.7%    78.3%    63%        55.3%    57.6%    56.5%

Component Classification
                      Twitter                      News
Method        Prec     Rec      F1         Prec     Rec      F1
JET           35%      24.1%    28.5%      18%      17.2%    17.6%
Indep. MNB    21.7%    31.1%    25.6%      10.2%    23.5%    14.3%
SPLICER       27.7%    41.2%    33.2%      26.1%    27.3%    26.7%

Table 4.3: Prequential Event Detection and Classification Results

Event Detection
                      Twitter                      News
Method        Prec     Rec      F1         Prec     Rec      F1
JET           100%     65.1%    78.8%      100%     56.7%    72.4%
Indep. MNB    52.8%    100%     69.1%      61.7%    100%     76.3%
SPLICER       53.1%    100%     69.3%      93.3%    100%     96.5%

Event Classification
                      Twitter                      News
Method        Prec     Rec      F1         Prec     Rec      F1
JET           32.5%    21.2%    25.7%      28.7%    16.3%    20.8%
Indep. MNB    23.3%    44.1%    30.5%      13.7%    22.2%    17%
SPLICER       24.4%    46.1%    32%        22.9%    24.5%    23.7%

Table 4.4: Prequential Argument Detection and Classification Results

Argument Detection
                      Twitter                      News
Method        Prec     Rec      F1         Prec     Rec      F1
JET           56.4%    39.9%    46.8%      22.3%    25.3%    23.7%
Indep. MNB    50.8%    61.1%    55.5%      16.8%    47.9%    24.9%
SPLICER       52.4%    67.3%    58.9%      37.1%    38.1%    37.5%

Argument Classification
                      Twitter                      News
Method        Prec     Rec      F1         Prec     Rec      F1
JET           36.1%    25.5%    29.9%      15.5%    17.6%    16.5%
Indep. MNB    20.5%    24.6%    22.3%      8.1%     23%      11.9%
SPLICER       30.2%    38.8%    34%        27.7%    28.5%    28.1%

4.6.1 AfPak Twitter and News

Event Extraction results are given in Tables 4.2, 4.3, and 4.4, for the whole Event Extraction process, Event Detection and Classification, and Argument Detection and Classification respectively. As presented, the prequential recall and F1 metrics improve when the proposed soft-constraint rule learning approach is used. First, it is noticeable that the JET system does not perform as well as on the ACE dataset, given the skewed nature of the AfPak datasets. In general, SPLICER worked better on the component detection task, reaching 63% prequential F1 for Twitter and 56.5% for the news articles dataset (Table 4.2).

From the results, JET was the more precise method for detecting and classifying event components. JET's approach of detecting and classifying information via hand-crafted rules proved effective, resulting in higher precision than SPLICER on Twitter (70.4% and 35% for detection and classification respectively), whereas SPLICER reached 52.7% and 27.7% precision for detection and classification (Table 4.2). However, this is not the case for

News articles, where SPLICER beats JET's precision with 55.3% vs 36.9% for detection, and 26.1% vs 18% for classification (Table 4.2). This is explained by the fact that the AfPak News dataset contains more training examples than the AfPak Twitter dataset, allowing both the base MNB layers and the rule learning layer to train more effectively.

Apart from precision, SPLICER outperformed both JET and the single layer MNB classifiers in terms of recall and F1, with the exception of Event Detection. SPLICER obtained higher recall in event type classification because our method attempts to classify every single incoming instance, whereas JET attempts to classify only those instances matched by its manually defined set of rules, built for a broader news corpus.

Encouraging results were found in all detection tasks: SPLICER reached 96.5% (news) and 69.3% (Twitter) prequential F1 on the Event Detection task. Although Event Detection showed a reasonably good level of automation, manual classification accuracy is still hard to reach, especially for such skewed distributions. Table 4.3 shows that all learning methods are highly accurate at detecting events in both tweets and news articles, so Event Extraction methods are particularly effective at recognising events with a high chance of success. However, more research is needed to enhance the effectiveness of fully automated Event Extraction systems in the classification task, especially for dealing with class imbalance, given that our best method (SPLICER) reached 32% and 24% F1 for event classification (Table 4.3). Although SPLICER outperformed JET and the MNB learners on this task, more research is needed to reach a higher effectiveness level when majority classes cover under 20% of the total. The ACLED experiments show that SPLICER also outperforms JET in the case of larger majority classes, reaching more than 60% F1 in the event classification task.

Moreover, the skewed distributions in the event type classification task reduced overall task accuracy, as expected, for all methods. It is important to notice that the JET system does achieve a good F1 result on the Event Detection task, given that it uses more powerful and complete, manually built syntactic and semantic English rules, whereas SPLICER uses a more basic syntactic representation implemented using CoreNLP featurisation procedures.

Similarly, argument detection and classification reached better results using SPLICER (Table 4.4). Argument classification is particularly complex in the sense that human annotations tend to be subjective in the labelling of actors and targets. In the raw dataset, it was found that

Page 100 of 175 some texts were labelled using all entities as actors and targets, whereas there were some other instances in which annotation was made using only one or two entities from the text. That put an additional complexity to the algorithms, especially MNB and SPLICER for being trained on-the- fly, reacting and fluctuating across the training set. Nevertheless, online rule learning was found as an improving mechanism to handle such changes on top of the MNB layer.

An improvement over the baseline on the F1 metric is always reached in the selected datasets. As expected, the soft-constraint learning approach increases recall, as alternative candidate generation improves the algorithm’s flexibility: on the Event Detection task, F1 improves from JET’s 57.3% to 63% on Twitter (Table 4.2), and from 36% to 56.5% on the AfPak News dataset.

An intuitive way of improving accuracy under such skewed scenarios is to use ensembles. An ensemble technique (SLICER) is tested in the next chapters of this thesis. Another avenue yet to be explored is a joint human-AI method that allows the user to guide the algorithm towards better results. However, this scenario is out of the scope of this thesis.

4.6.2 ACLED

Similarly, experiments were performed on the ACLED dataset to show the validity of SPLICER under a different dataset and a different ontology, although the domain is the same. Results are shown in Tables 4.5, 4.6 and 4.7.

Table 4.5: Prequential ACLED Event Extraction Results

Component Detection (ACLED)
Method       Prec    Rec     F1
JET          24.7%   48.6%   32.8%
Indep. MNB   31.5%   84%     45.8%
SPLICER      34%     90.5%   49.4%

Component Classification (ACLED)
Method       Prec    Rec     F1
JET          9.4%    18.5%   12.5%
Indep. MNB   16.5%   44%     24%
SPLICER      21.6%   57.7%   31.4%

It can be seen from Tables 4.5 and 4.6 that improvements in classification accuracy were achieved, given the characteristics of the classification task. Event classification results were higher compared with AfPak, ranging from 55% to 64.3% in terms of the prequential F1 metric.

Table 4.6: ACLED Event Detection and Classification Results

Event Detection (ACLED)
Method       Prec    Rec     F1
JET          100%    98.5%   99.2%
Indep. MNB   100%    99.9%   99.9%
SPLICER      100%    99.9%   99.9%

Event Classification (ACLED)
Method       Prec    Rec     F1
JET          38%     37.5%   37.7%
Indep. MNB   60.8%   60.8%   60.8%
SPLICER      64.3%   64.3%   64.3%

Table 4.7: ACLED Argument Detection and Classification Results

Argument Detection (ACLED)
Method       Prec    Rec     F1
JET          14%     32.2%   19.6%
Indep. MNB   24.5%   78.8%   37.4%
SPLICER      27.3%   87.5%   41.5%

Argument Classification (ACLED)
Method       Prec    Rec     F1
JET          5.3%    12.3%   7.5%
Indep. MNB   12%     38.5%   18.3%
SPLICER      17.5%   56.6%   26.8%

It is also important to note that higher accuracies were not possible due to the annotation nature of ACLED: examples might contain multiple events within a sentence, but only one is labelled, taking into account extra features such as the number of casualties. This made MNB produce contradictory predictions, and such instances “confused” the algorithm to the point where it was better to use the majority class with rules on top. An advantage of SPLICER is that it can recreate results with different algorithms plus its stream rule mining mechanism, which improved results by 7 percentage points in the classification task, leading to an improvement of 30% F1 with respect to the previous highest baseline on the whole Event Extraction task.

Additionally, argument classification results were similar to AfPak, with an overall F1 of 26.8% for the best method (SPLICER in Table 4.7). Note that these metrics counted only exact extractions. Approximate extractions, such as taking the most relevant word in each extracted entity, were not evaluated in these experiments; had they been, argument classification F1 would be expected to rise above the reported figures.

Finally, SPLICER increased recall in all Event Extraction tasks, sometimes approaching 90%, as seen in Table 4.7 for argument detection, where it reached 87.5% recall for ACLED arguments. These recall results validate our theory of integrity constraint learning, which outputs a high number of predictions while continually improving precision over the MNB baselines. Finally, it is important to mention the impact of concept drift within the experiments, which SPLICER handles in order to improve the model’s response to distribution changes. The next section discusses how SPLICER’s rule pruning addresses concept drift.

4.6.3 The Impact of Pruning Rules

As explained earlier, SPLICER’s second layer algorithm deals with changes through a pruning mechanism based on the Hoeffding bound: when drift is detected, SPLICER drops the rules that fall above the Hoeffding bound metric. In this section, the impact of pruning rules with the Hoeffding bound calculation is analysed by comparing SPLICER against a variant that does not prune rules. Results for the whole Event Extraction test on the AfPak datasets are shown in Table 4.8:
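To make the pruning criterion concrete, the following minimal sketch shows how a Hoeffding bound can gate rule removal. It assumes each rule tracks simple errors/n prediction counts; the names and the δ value are illustrative, not SPLICER’s MOA implementation.

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound epsilon: with probability 1 - delta, the true mean of a
    variable with range R differs from its sample mean by at most epsilon."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def prune_rules(rules, delta=1e-7):
    """Keep a rule only while its observed error is not significantly worse
    than the best rule's error, 'significantly' meaning beyond the Hoeffding
    epsilon. Each rule is assumed to expose (errors, n) prediction counts."""
    observed = [r.errors / r.n for r in rules if r.n > 0]
    if not observed:
        return rules                      # no evidence yet, keep everything
    best_error = min(observed)
    kept = []
    for r in rules:
        if r.n == 0:
            kept.append(r)                # untested rule, keep it for now
            continue
        eps = hoeffding_bound(1.0, delta, r.n)   # error rates lie in [0, 1]
        if (r.errors / r.n) - best_error <= eps:
            kept.append(r)                # not provably worse than the best
    return kept
```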

Table 4.8: AfPak Pruning vs Non-pruning

Components Detection
                Twitter                     News
Method         Prec    Rec     F1         Prec    Rec     F1
SPLICER-NoP    64.3%   52.2%   59.4%      33.1%   61.2%   43%
SPLICER        52.7%   78.3%   63%        55.3%   57.6%   56.5%

Components Classification
                Twitter                     News
Method         Prec    Rec     F1         Prec    Rec     F1
SPLICER-NoP    29.2%   25%     27%        14.9%   27.5%   19.3%
SPLICER        27.7%   41.2%   33.2%      26.1%   27.3%   26.7%

As seen in Table 4.8, pruning rules makes a major difference in terms of the F1 measure, which is our ultimate optimisation goal, as users expect a high number of correct elements with the highest possible accuracy. SPLICER’s pruning mechanism improves F1 for both Twitter and News articles. On Twitter, omitting rule pruning reduces F1 from 63% to 59% in detection, and from 33% to 27% in classification.

Similarly, the News corpus shows the same effect, with F1 falling from 56% to 43% in the detection task and from 26% to 19% in component classification. This reduction in accuracy occurs during sudden distribution changes throughout the dataset, due to changes in the reported events and their underlying nature; for example, there was a higher number of peace talks and negotiations after the battle of Kunduz, which shifted the distributions of actors, targets and event types. Therefore, it is advisable to use rule pruning mechanisms with SPLICER to obtain better results.

4.6.4 Runtime Analysis

Table 4.9 shows the running times of the baselines compared against SPLICER for all the reported datasets. Overall, SPLICER’s running time is higher than its counterpart’s, given that it performs a second layer rule learning algorithm that increases both data volume and running time. Nonetheless, multi-layering allows the algorithm to reduce the time spent on rule search and pathfinding. SPLICER’s runtime was between 25% and 35% slower than MNB, though this does not dramatically affect the performance of the whole Event Extraction task. SPLICER was not compared against JET, given that JET was trained beforehand with a larger corpus, and its predictions are given without prequential training in the middle of the prediction phase.

Table 4.9: Event Extraction AfPak and ACLED Runtimes

             Runtime (secs)
Method       Twitter   News    ACLED
Indep. MNB   210       860     1,280
SPLICER      302       1,384   1,631

All tests were performed on a 64-bit CPU computer with 16 GB of RAM and 256 GB of hard disk space. Computing power affects the time spent by each algorithm to perform the whole task. More importantly, however, memory restrictions were the most significant hardware factor affecting the runtimes. Complexity analysis suggests the same result, as SPLICER runs in Θ(N · K · L), where N is the number of instances, K is the number of learned rules (not large), and L = 10 is the average number of components in each message. In contrast, a plain independent MNB classifier runs in Θ(N) plus the training runtime of each individual MNB classifier. The main difference is the rule search and training in SPLICER.

4.7 Conclusions

In this chapter, we presented how SPLICER can efficiently deal with Event Extraction tasks, under a novel approach for online Event Extraction on text streams. A constraint induction model and algorithms were presented and tested under the prequential paradigm. The system was tested using two domain specific Event Extraction datasets, the AfPak Twitter and News datasets. Both datasets were annotated by social scientists using a domain specific ontology created for that purpose. In addition, both datasets showed skewed distributions in both event classification and entity role classification, making it hard to predict classes in a machine learning automation context.

Moreover, the domain specific datasets were tested under prequential conditions and compared against an Event Extraction baseline. An increase of up to 13% in the F1 metric was observed on the Event Extraction task under the event correctness metric, for example on the AfPak News dataset. Additional experiments were performed on the ACLED dataset to show the reliability of our approach, with similar results and higher accuracies allowed by the dataset’s statistical characteristics. Empirical evaluations suggest the feasibility of SPLICER. Although SPLICER’s runtime is higher than MNB’s, given its top-layer rule learning mechanism and rule search, performance is not reduced dramatically, and improvements of over 10% F1 are worth the runtime sacrifice.

All in all, SPLICER is recommended for Event Extraction tasks under Stream Mining constraints, as it is shown to perform more accurately under different scenarios. It is highly advisable to build a well defined ontology for the specific domain, and SPLICER is highly recommended for automatic Event Detection and argument detection tasks. Details on how to build an ontology for political events can be found in the CAMEO codebook [145]. A key research avenue is therefore explored in the next chapter, which compares SPLICER in two different scenarios, single vs multiple sources of information, leading to the creation of an ensemble model (SLICER) that automatically splits subspaces to improve accuracy.

Finally, SPLICER is also highly suitable for use as a recommender system showing possible event categories and argument roles to a user, who could then improve the accuracy of Event Extraction tasks and reduce annotation time. This could open new research frontiers in joint human-machine Event Extraction under near-real time scenarios, as performed in [67] with batch learning algorithms.

Chapter 5

Multi-stream Event Components Classification Using SLICER

In this chapter, a novel Stream Mining ensemble is presented to tackle event classification and argument classification tasks. The method is based on the hypothesis that, if splits are correctly assessed, multiple learning models can boost the accuracy of a single Stream Mining model. As a result, the SLICER ensemble assesses the feasibility of automatically splitting a single stream into multiple streams of information to gain accuracy.

Stream Mining classifiers have not been used for such text mining tasks; rather, they have been used for simpler Event Detection techniques. Moreover, none of them have addressed the imbalanced properties of a fine-grained event ontology category set such as CAMEO, and there is little work on dealing with the overlapping categories seen in real-world datasets, which are very likely to be error-prone. Our idea is to use a Stream Ensemble classifier to validate the pertinence of splitting text datasets into multiple subspaces.

Interestingly, other ensemble approaches suggest avoiding horizontal splitting, as in [81]: researchers have found that a vertical split works reasonably well under distributed conditions. Other ensemble approaches can be found in [63], in which a large number of algorithms is thoroughly studied. From that study, it can be inferred that the main features of stream ensembles are their topology, prediction mechanism, diversification, and updating dynamics. Concerning their topology, stream ensembles can change their base learner compositions, such as hierarchical (random forests), flat or networked learners. On the other hand, ensembles can vary their prediction mechanism: they can predict classification tasks using majority voting, weighted voting, ranked voting or selective voting. Moreover, regarding diversity, ensembles can be made of different learners of the same base model (such as random forests, where all base learners are trees) or of heterogeneous models [152].

SLICER can be categorised as a combined boosted meta-learner in terms of its topology, a non-diverse horizontal ensemble regarding the base model, and a drift-based windowed ensemble regarding its updating mechanism. SLICER assesses the feasibility of splitting the stream into horizontal subspaces; that is, it detects when it is more useful to split a dataset by using two main assessment techniques. First, it chooses when to re-train the ensemble from scratch using drift detection. Second, when drift is detected, it re-assesses each attribute’s information gain (IG) metric in order to determine whether there is an attribute for which it is worth splitting the ensemble into separate models per attribute value. If this is the case, the algorithm automatically splits the stream and trains separate models for each data subspace. Results confirm the feasibility of this approach, as prequential accuracy is increased for both event classification and argument classification tasks.

Intuitively, the proposed ensemble approach is capable of determining better ways of splitting to generate separate – or complementary – learning models according to the characteristics of different inputs, and it finds when it is better to create a single classifier rather than a set of multiple, diverse classifiers partitioning over different data subspaces. The method also provides a way of determining how best to split the dataset to give the best possible prediction.

This chapter is organised as follows: first, an initial explanation of event classification and argument classification is given, as expressed in previous chapters. Second, the SLICER method is presented and explained. Third, the experiments and baselines are explained, and lastly, the results are analysed and conclusions are drawn by comparing SLICER against similar stream ensembles. Finally, SLICER and SPLICER are used jointly for the whole Event Extraction task.

5.1 Problem Definition

It is important to emphasise the difference between the experiments for the whole Event Extraction task and single layer event and argument classification. Recall that Chapter 4 addresses Event Extraction as a whole, i.e. all sub-tasks at once, with single classifiers working on each Event Extraction sub-task. In contrast, this chapter explores in detail how to improve each Event Extraction sub-task separately, to then combine such improvements with SLICER. Details on how such a combination of methods performs are provided in the next chapter of this thesis.

An example of the proposed Event Extraction task is given in Figure 5.1. Here, we focus our efforts on improving the most challenging learning sub-tasks found, including event type identification and classification, and entity categorisation of actors, targets, locations and time. Time extraction is effectively handled by existing NLP tools such as Stanford CoreNLP1. We proceed to formulate the main classification task problems.

Figure 5.1: Representation and extraction of an event in Twitter. In this example the tweet is represented as three separate events containing some common components

Event Identification and Detection

Given a text instance $x_j$, identify a set of z event candidates $EC(x_j) = \{aw_1, \dots, aw_z\}$, represented by a set of action words $aw_i$, including nouns, adjectives, verbs or adverbs found within the set of words in $x_j$.

1https://stanfordnlp.github.io/CoreNLP/

Event Classification: Given a text stream instance $x_j$ and an action word $aw_i$, classify $\{x_j, aw_i\}$ with a label $l_i$ from $L = \{l_1, \dots, l_n\}$, namely event types. The set of labels corresponds to a set of event categories defined by a human expert, and it is usually a fine-grained domain specific category set.

Event Argument Identification: Given a text stream instance $x_j$ and its corresponding word set $W(x_j) = \{w_1, \dots, w_m\}$, identify each relevant word – usually named entities – $w_i$ as an event argument in any of the event candidates in $EC(x_j)$.

Event Argument Classification: Given a text stream instance $x_j$ and its corresponding word set $W(x_j) = \{w_1, \dots, w_m\}$, label each relevant word – usually named entities – $w_i$ as an actor, target or location according to the role it plays in the given event candidate in $EC(x_j)$.

5.2 SLICER

In this section, the proposed method is described in more detail. The result is a systematic process using carefully selected window-based text pre-processing and online learning models and algorithms, following a similar approach to SPLICER but focusing on single layer classification tasks. The process is divided into two single learning tasks: event classification and event argument classification. Both sub-learning tasks use the SLICER ensemble to perform the required prediction. We further combine SPLICER and SLICER in the whole Event Extraction process.

The process consists of the following steps: first, it gathers textual information from the incoming source or joint sources; second, it performs two separate text pre-processing stages, one to build the features for event type detection and classification, the other to perform event component identification and classification.

During the pre-processing stage, a feature transformation is performed according to each learning task. In the next stage, the resulting features are used to classify event types and event arguments. Here we make use of our novel Stream Mining algorithm – SLICER – which aims to boost accuracy on under-represented classes while effectively executing both testing and training on the fly, as defined in [91, 54, 22]. These steps are explained in the next section.

5.2.1 Text Pre-Processing

We perform text pre-processing in order to use available structured classifiers, converting each text stream instance into a numeric representation for each learning task. Depending on the task, we convert from text to either an ontology feature vector or a semantic feature vector.

5.2.1.1 Event Classification Text Pre-Processing

For this Event Mining task, we perform a windowed, numeric feature transformation to produce the input of our learning procedures. At this stage, each arriving text instance is analysed and partitioned into words (tokenisation). Words are stored in a hash table indexed by stemmed words using the Porter stemming procedure [125], and counted by the number of appearances within the document (term frequency). Second, the pre-processing procedure performs a windowed calculation of the numeric text representation, similarly to [164]; in contrast, however, our framework does not maintain an ever-growing dictionary but instead recreates the dictionary with a fixed number of instances in the window. The reason for this design decision lies in Stream Mining requirements: the algorithm should be able to perform well at any time with limited memory, and endless dictionaries use a large amount of memory. We also noticed that an incremental approach does not perform well compared with windowed approaches, as the numeric calculation shifts every time a new document enters the corpus. As a consequence, statistical learners perform poorly because the statistical calculations shift quite frequently, making whole-corpus updates impractical.

Therefore, a TF-DF or TF-IDF representation is computed, recalculating dictionaries and counters after every fixed number of instances, following the steps mentioned in Section 3.2.1. The numeric representations used word stemming, lemmatisation, stop words and numeric and syntactic filtering [2]. In addition to numeric featurisation, stop word filtering is performed over the word dictionary. Once the bag-of-words representation is calculated, we proceed to perform the prediction phase on the current text instance.
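As an illustration of this design, the following minimal sketch recreates the dictionary after a fixed number of instances. The window size, the whitespace tokeniser and the smoothed IDF are assumptions of the sketch; the real pipeline additionally applies Porter stemming, lemmatisation and syntactic filtering before counting.

```python
import math
from collections import Counter

class WindowedTfIdf:
    """Recreates the dictionary and document-frequency counts after every
    `window_size` instances, so memory stays bounded as the stream grows."""

    def __init__(self, window_size=500, stopwords=frozenset()):
        self.window_size = window_size
        self.stopwords = stopwords
        self.n_docs = 0        # documents seen in the current window
        self.df = Counter()    # per-window document frequencies

    def transform(self, text: str) -> dict:
        tokens = [t for t in text.lower().split() if t not in self.stopwords]
        if self.n_docs == self.window_size:
            # Recreate the dictionary instead of letting it grow endlessly.
            self.n_docs, self.df = 0, Counter()
        self.n_docs += 1
        self.df.update(set(tokens))
        tf = Counter(tokens)
        # Smoothed IDF so the very first documents still get non-zero weights.
        return {w: tf[w] * math.log(1.0 + self.n_docs / self.df[w]) for w in tf}
```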

5.2.1.2 Argument Classification Text Pre-Processing

Apart from Event Classification, we also tested SLICER for Argument Classification tasks. Therefore, we needed to build a different feature representation from that of the event classification tasks. In this particular case, the idea is to label an incoming word, or named entity, into an action, target, location labelling space (event component labelling). Consequently, it is more convenient to treat each word or named entity as a separate instance.

Consequently, we perform word partitioning as each text instance arrives, carefully using named entity recognition to find word chunks that may be jointly categorised. OpenNLP tools are also used to perform word splitting and tagging. Extra features were also used to create word contexts, and global and local features were used as in [93].

Any Stream Mining classifier can be used in this first classification layer as long as it follows the Stream Mining framework previously defined in [114, 22, 71, 91] and later unified by Gama [55, 56]. The classifier can then learn using any Stream Mining algorithm conforming to the previous definitions. Any chosen objective function assesses the learning correctness, such as a Bayesian error function in prequential mode as proposed by Gama et al. [55].

5.3 SLICER - Splitting to Locally Indexed Classifiers Ensemble boosteR

The SLICER algorithm responds to the question of what is the best way to combine multiple sources of information for Stream Mining tasks. First, it attempts to predict a possible value $Y_i^*$ given an instance $X_i = \{f_1, \dots, f_m\}$ with m features at time i. The second task is to find the best possible combination of feature values that best reflects the real tuple $T_i^r = \langle f_1^r, \dots, f_p^r \rangle$, given z sources of information and z tuple candidates $\{T^1 = \langle f_1^1, \dots, f_p^1 \rangle, \dots, T^z = \langle f_1^z, \dots, f_p^z \rangle\}$. To present the framework, we first provide the definitions of a multi-stream and the SLICER method defined for classification tasks.

Multi-Stream: Recalling the definitions given in Chapter 2, a stream is an instance dataset $X = \{x_1, \dots, x_i\}$ arriving at a specific time rate, one instance at a time, in which each instance $x_i$ at time i consists of m feature values $x_i = \{f_1, \dots, f_m\}$. A multi-stream, then, is a set of z different streams $Z = \{X_1, \dots, X_z\}$, in which streams might be combined to generate a prediction $Y_i^*$ at time i.

Figure 5.2: Joint, single classifier and multiple classifier MSM approaches

The first intuition here is that every single source element in Z is intended to help the objective learning task. Also, the sources in Z might share a set of common features, that is, features sharing the same attribute space and attribute concept, while some of the features in each source $z_j$ might be unique to that source. The SLICER approach should therefore deal with both common and unique features in each source of information.

5.3.1 Classification Task

A classification learning prediction task in SLICER establishes a possible value $Y_i^*$ at time i from a multi-stream Z, which contains z instances $\{x_1, \dots, x_z\}$ arriving within a time range. Therefore, a joint feature set from Z, $\{f_1, \dots, f_p\}$, comes from the combination of all sources in Z. A representation is shown in Figure 5.2.

In this way, we might deal with the prediction task using three different learning schemes. The first comes from a single joint model over the combined set of features in Z. The second is built using separate classifiers per source of information (SOI), finding the best answer or a combination of answers via voting over the ensemble output. The third combines both: a global classifier over the common features is boosted by local per-source classifiers, as described below. We therefore define the stream classifier as a meta-learning stream algorithm that performs the classification task.

In summary, the classifier aims to selectively split the dataset across a set of classifiers. The algorithm uses a global classifier which decides the final prediction output based on the performance of local classifiers. Local classifiers are built by simply selecting the feature with the maximum gain value. An example of the process is shown in Figure 5.3.

In general, an information quality metric is used to validate the pertinence of splitting sources either horizontally or vertically, according to the highest information quality value. Therefore, SLICER provides a method to detect when it is better to use any of these three different learning models, via Instance Information Gain (IIG). The proposed procedure is described as follows:

Figure 5.3: The SLICER process. The ensemble chooses when to split the incoming stream according to a gain metric, and it is trained at every drift using (E)DDM [57, 11] or any other drift detection approach

5.3.2 Subspacing Procedure

In general, SLICER attempts to split the given dataset using dataset subspacing. Each dataset subspace trains a separate classifier that counts as a local prediction later in the test phase. Each local classifier makes a prediction, and a global meta-learner decides the final answer, taking into account the pertinent local predictor and combining the answers via majority voting against a global classifier (boosting mechanism). However, it is not always better to split a dataset; instead, the procedure makes a best effort to determine when the ensemble should train with more classifiers. The procedure includes the following steps, and it is depicted in Algorithms 3 and 5 (the training and testing subspacing algorithms):

1. First, the subspacing algorithm gets the feature with the highest Information Gain (IG).

2. For that feature, the procedure validates the pertinence of a partition via minimum sampling of the least represented value.

3. If horizontal sampling is not possible, the procedure validates the pertinence of a vertical partition if the last IG > θ, θ being the chosen threshold.

4. If partitions are not pertinent, then the ensemble works with a single classifier.

5. The slicing decision is performed whenever drift is detected.

5.3.2.1 Information Gain (IG) and Minimum Sampling

The data partition is made using an entropy measure, called information gain (IG), described as follows:

$$IG_a = \frac{H(y) - H(y \mid a)}{H(a)} \qquad (5.1)$$

Equation 5.1 scores each attribute a, producing a ranking of the most relevant features found in the pre-processed dataset; this is calculated before the training tasks. H is the information entropy defined by Shannon [150], taking the form of total and conditional entropies:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \qquad (5.2)$$

$$H(X \mid Y) = -\sum_{i,j} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(y_j)} \qquad (5.3)$$
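For reference, Equations 5.1–5.3 can be computed directly from value counts. The following sketch illustrates the calculation; base-2 logarithms are an assumption of the sketch (the base cancels in the ratio of Equation 5.1).

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) of a list of nominal values (Equation 5.2)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(labels, attribute):
    """H(y|a): label entropy within each attribute value, weighted by the
    proportion of instances holding that value (Equation 5.3)."""
    n = len(labels)
    by_value = {}
    for y, a in zip(labels, attribute):
        by_value.setdefault(a, []).append(y)
    return sum((len(ys) / n) * entropy(ys) for ys in by_value.values())

def information_gain(labels, attribute):
    """IG_a = (H(y) - H(y|a)) / H(a), as in Equation 5.1."""
    return (entropy(labels) - conditional_entropy(labels, attribute)) / entropy(attribute)
```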

In addition to IG, SLICER checks whether the chosen feature has enough support for the values of the least represented class. Calculating a minimum support allows the dataset to be confidently split into subspaces without loss of accuracy. For this task, we used the Cochran sampling formula, recommended by statisticians when sampling large populations (the formula for proportions in chapter 4 of [43]). Cochran’s minimum sampling calculation formula is:

$$n_0 = \frac{Z^2 \cdot p \cdot q}{e^2} \qquad (5.4)$$

where e is the estimated error (set to 4%), and $p = \#values_a / \#instances$ is the proportion of the population holding the least represented nominal value; p is calculated by maintaining population statistics, i.e. counting the number of instances seen so far for each feature value. Similarly, $q = 1 - p$, and Z is the z-value found in a Z table, in our case 1.96, the value for a 95% confidence interval.

If the current number of instances of the least represented value is less than $n_0$, then slicing is not possible and SLICER maintains a single classifier. Conversely, if this number is greater than $n_0$, the slice is made and local classifiers are trained until the drift detection window is reached and drift is confirmed by the drift detection technique. As a result, the measure is reweighted according to how significant the least probable value is, giving more importance to attributes with high influence on the least represented classes. This information is calculated in SLICER’s training algorithm (Algorithm 3).
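A minimal sketch of this feasibility test, combining Equation 5.4 with running value counts, is shown below. The counting structure is an assumption of the sketch; Z = 1.96 and e = 4% follow the text.

```python
def cochran_minimum_sample(p: float, z: float = 1.96, e: float = 0.04) -> float:
    """Cochran's minimum sample size n0 = Z^2 * p * q / e^2 (Equation 5.4)."""
    q = 1.0 - p
    return (z ** 2) * p * q / (e ** 2)

def can_split_horizontally(value_counts: dict) -> bool:
    """Allow a horizontal split only when the least represented feature value
    already has at least the Cochran minimum number of instances."""
    total = sum(value_counts.values())
    least = min(value_counts.values())
    p = least / total                  # proportion of the rarest value
    return least >= cochran_minimum_sample(p)

# Example: with counts {"twitter": 420, "news": 130}, p = 130/550 and the
# split is allowed only if 130 exceeds the computed n0.
```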

5.4 SLICER Training and Testing Algorithms

With both formulas in mind, SLICER’s algorithms are presented in Algorithms 3 and 5 for training and testing respectively. SLICER allows the user to choose either a “vertical” or a “horizontal” partition. During the training phase, the algorithm creates an initial set of classifiers, in either a vertical or a horizontal fashion. First, the system builds up a window of examples until drift is detected, and at the same time it trains the global classifier to keep a model available for prediction. If drift is detected, SLICER calculates the best feature to split on (calculateHighestIGFeature(x_index)). If the user chooses a horizontal partition, SLICER splits and creates local models for each nominal value of the chosen feature. Conversely, if vertical partitioning is chosen, SLICER automatically splits and creates two vertical classifiers, one with the features with the highest IG, and one with the remaining features.

Algorithm 3: SLICER training ensemble algorithm

Input: Stream instance x_index of pre-processed texts with features F = {f_1, ..., f_m} and label Y_index.
globalClassifier = initialiseClassifier(commonFeatures(x_index) ∪ {localTrainOutput});
classifiers = initializeLocalClassifiers(x_index);
if (globalClassifier == ∅) then
    drift = calculateDDM(x_index);
    if drift != OUT_OF_CONTROL then
        train(globalClassifier, commonFeatures(x_index) ∪ {Y_index});
        if partitionOption == HORIZONTAL then
            localClassifier = getLocalClassifier(x_index);
            train(localClassifier, localFeatures(x_index) ∪ {Y_index});
        end
        if partitionOption == VERTICAL then
            trainVerticalC(classifiers, localFeatures(x_index) ∪ {Y_index});
        end
    end
    if drift == OUT_OF_CONTROL then
        buildNewModel(x_index, drift)
    end
    updateStatistics(x_index, classifiers, globalClassifier);
end

If the drift detection method finds the dataset at an OUT_OF_CONTROL drift level, SLICER reconstructs new global and local models from scratch, using the instances collected during the drift warning phase (trainWarningModel(x_index, w)). SLICER can use any of the drift detection methods available in the MOA framework. Finally, during the prediction phase, SLICER attempts to predict the label of x_index by using a combination of the global and local outputs, via a normalised majority voting sum.

In summary, the classifier aims to split the dataset across a set of classifiers. The algorithm uses a global classifier which decides the final prediction output, boosted by the performance of the others. In this way, SLICER works as a boosting ensemble algorithm, in which the base classifier (the global classifier) is boosted by local classifiers with more expertise in their own subspaces.

5.4.1 Horizontal Splitting

The feature with the highest IG is used to create local classifiers for each of its values. It is important to remark that we convert all numeric features to nominal values before the partition.

Algorithm 4: SLICER build new local models algorithm

Input: Stream instance x_index of pre-processed texts with features F = {f_1, ..., f_m} and label Y_index.
w = Buffered sliding window of instances stored if a Drift Warning was found.
infoGainIndex = calculateHighestIGFeature(x_index);
if (infoGainIndex > 0) then
    horizontalAttributeWarningModel = infoGainIndex;
end
classifiers = initializeLocalClassifiers(x_index);
trainWarningModel(x_index, w);
resetCurrentModels();
drift = IN_CONTROL;

Algorithm 5: SLICER prediction algorithm

Input: Stream X = {x_1, x_2, x_3, ..., x_t} of pre-processed texts with features F = {f_1, ..., f_m} and label Y_index. Window w of initial instances to train.
index = 0;
partitionFeature = getTopFeature(X);
for instance x_index in X do
    localClassifier = findLocalClassifToTest(x_index, partitionFeature);
    localOutput = getPrediction(x_index);
    globalOutput = getPrediction(commonFeatures(x_index) ∪ {localTrainOutput});
    finalPredOutput = globalOutput + localOutput;
    return normalize(finalPredOutput);
end
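The final voting step of Algorithm 5 amounts to summing the global and local class scores and normalising the result. A minimal sketch, assuming each classifier exposes a class-score dictionary:

```python
def combine_predictions(global_scores: dict, local_scores: dict) -> dict:
    """Sum the class scores of the global and the selected local classifier
    and normalise them into a distribution (the majority-voting boost)."""
    combined = {c: global_scores.get(c, 0.0) + local_scores.get(c, 0.0)
                for c in set(global_scores) | set(local_scores)}
    total = sum(combined.values()) or 1.0
    return {c: s / total for c, s in combined.items()}

# Example usage: the predicted label is the argmax of the normalised sum.
# scores = combine_predictions(global_out, local_out)
# prediction = max(scores, key=scores.get)
```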

Once the relevant features are obtained, the local classifiers are built by making a horizontal partition of the dataset. The global algorithm determines whether there is enough confidence to attempt the prediction by comparing its output against the local prediction, and the highest value is then selected according to the confidence of each classifier. Each partition also contains the label feature, in order to perform a test-then-train prequential stream classification task following the approach described by Gama in [56]. Any available Stream Mining classifier can be used for training both global and local classifiers, but without diversity.

In the experiments, the global classifier which gave us the best results is the Multinomial Naïve Bayes classifier, which is used in all learning tasks: event type, argument classification and stream classification.

5.4.2 Vertical Splitting

In addition, and similarly to the horizontal splitting procedure, vertical splitting is also proposed as a valid form of partitioning. Recent advances in Stream Mining algorithms, specifically in ensembles [63] and under stream processing platforms [81, 44], led to the Vertical Hoeffding Tree. The authors [44] claimed that a vertical partition leads to accuracies similar to joint classifiers and, with enough instances, to faster and equal or better prediction results.

In contrast to other research works, our vertical split takes place automatically to boost the global classifier. First, the algorithm chooses the best-ranked features for training, according to the information threshold θ. The best-ranked set of features is chosen to train the global model, while a local classifier evaluates the remaining features. During the prediction phase, the local and global classifiers merge their prediction outputs via a voting mechanism.
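A sketch of this automatic vertical partition follows, assuming a per-feature IG map such as the one computed in Section 5.3.2.1; the map and the threshold value are illustrative.

```python
def vertical_partition(features_ig: dict, theta: float):
    """Split the feature set by information gain: features at or above theta
    train the global model, the remainder trains a local booster whose
    output is later merged with the global one via voting."""
    global_features = {f for f, ig in features_ig.items() if ig >= theta}
    local_features = set(features_ig) - global_features
    return global_features, local_features

# Example: vertical_partition({"author": 0.9, "kill": 0.4, "city": 0.1}, 0.5)
# yields ({"author"}, {"kill", "city"}).
```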

5.5 Experiments

To evaluate Event Detection and Event Classification learning, a prequential method was implemented during the evaluation phase. In contrast to other Event Extraction works, such as the ACE competition, we are devoted to online or Stream Mining methods and classification techniques; the ACE competition and other similar efforts focus on evaluating batch learning techniques.

With prequential evaluation, each incoming instance is used as a test instance, and then the classifiers perform an online training step. It is important to note that, under streaming settings, the most used metric is prequential accuracy [22]. Nonetheless, our learning tasks require the analysis of other metrics. In particular, research in Event Extraction has been validated using micro precision, micro recall and the F1 metric. In our case, we are interested in obtaining useful (precise) and relevant values over sparsely sampled class labels.

Standard precision and recall, as defined in Chapter 3, are evaluated in the experiments.

5.5.1 Datasets

Two main sets of experiments are proposed to show the feasibility of SLICER. First, we show that SLICER works better using horizontal slicing on the event classification task. Second, we tested SLICER on the argument classification task. The main goal of the experiments is to show that SLICER works well under different Event Extraction learning sub-tasks (event classification and argument classification), and that the ensemble improves the accuracy of the learning task against single online learners and also against other kinds of ensembles.

Event Mining Datasets: We worked with the AfPak and ACLED datasets presented in Section 3. Similarly, the ACLED and AfPak datasets were used for argument detection and classification. Recall that both datasets were extensively analysed in Section 3.3. Both datasets, ACLED and AfPak, deal with entities coded in an ontology. Distributions across annotated entities are also skewed, and the top mentioned entities are civilians, the Afghan National Army, and Taliban forces.

Other Stream Mining Datasets: SLICER was also tested on other classification tasks to show its usefulness in different data domains beyond text mining. SLICER was tested with the datasets described in Section 3.3.3.

Featurisation
For Event Detection and Event Classification, a TF-IDF representation of the relevant words (nouns, adjectives, and verbs) is used, as proposed in [149]. It is important to remark that a pre-processing step was performed in all cases to select the most relevant words, and that the same pre-processing procedure was applied to all baselines, as explained in Section 5.2.1.

In the case of Argument Detection and Argument Classification, the features used are described in Section 4.3.1.2. Recall that in this case we first recognise entities and then classify them as actor, target or location.

5.5.2 Stream Mining Baselines

To compare our results, we used several state-of-the-art stream classifiers implemented in WEKA1 and MOA2. Our stream meta-learner is also implemented in MOA. SLICER is compared against the following baselines:

Majority Class – MC
Our first baseline is the majority class, which predicts the largest class in the dataset as instances arrive at the learning algorithm. The majority class classifier implemented in MOA counts the number of records labelled in each category and predicts the most frequent class label seen so far in the dataset.


Multinomial Naïve Bayes – MNB
We use an online Multinomial Naïve Bayes classifier, which is commonly tested on text classification problems. The featurisation process only considers ontology tags for each instance as the main features. Similarly to the Multinomial Naïve Bayes defined in [22], the posterior probability of class c_i is represented by:

$$P(c_i \mid o) = \frac{P(c_i) \cdot \prod_{o} P(o \mid c_i)^{w_{od}}}{P(d)} \qquad (5.5)$$

where $w_{od}$ is the float value calculated when word w occurs in ontology value d, and $P(d)$ is the probability of drawing document d – the tweet itself for our purposes – from the analysed corpus. An online version of the Multinomial Naïve Bayes method is used as a baseline in our experiments, using the MOA implementation.
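For reference, a minimal incremental classifier consistent with Equation 5.5 can be sketched as follows. Laplace smoothing and log-space scoring are assumptions of this sketch, not necessarily MOA’s implementation.

```python
import math
from collections import defaultdict

class OnlineMultinomialNB:
    """Incremental multinomial Naive Bayes: counts are updated one instance
    at a time, so the model can be trained and tested prequentially."""

    def __init__(self):
        self.class_counts = defaultdict(float)
        self.word_counts = defaultdict(lambda: defaultdict(float))
        self.class_totals = defaultdict(float)
        self.vocab = set()

    def learn(self, features: dict, label: str):
        self.class_counts[label] += 1.0
        for word, weight in features.items():
            self.word_counts[label][word] += weight
            self.class_totals[label] += weight
            self.vocab.add(word)

    def predict(self, features: dict):
        n = sum(self.class_counts.values())
        v = len(self.vocab) or 1
        best, best_score = None, -math.inf
        for c, cc in self.class_counts.items():
            score = math.log(cc / n)           # log P(c_i)
            for word, weight in features.items():
                # Laplace-smoothed likelihood, weighted by the feature value,
                # mirroring the P(o|c_i)^{w_od} factor of Equation 5.5.
                p = (self.word_counts[c][word] + 1.0) / (self.class_totals[c] + v)
                score += weight * math.log(p)
            if score > best_score:
                best, best_score = c, score
        return best
```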

Hoeffding Tree – HT
A Hoeffding tree classifier is also used to validate the effect of the meta-learning layer on top of a single Stream Mining method other than a Bayesian classifier. Implementation details can be found in [71].

1http://www.cs.waikato.ac.nz/ml/weka/ 2https://moa.cms.waikato.ac.nz

A Hoeffding tree uses a divide and conquer algorithm to select the correct leaf for predicting classes according to attribute values. Similarly to SLICER, a Hoeffding tree creates tree leaves based on information gain (IG) values, and then calculates simple count statistics at each leaf. The class with the highest probability is returned for each prediction.

Hoeffding trees also provide pruning and refining mechanisms if the accuracy falls below the Hoeffding bound (more details can be found in Section 2.5), acting as an evolving mechanism that adjusts to drift conditions.

Online SVM – SVM
Similarly, an online version of the SVM classifier was tested on the MOA platform using the SMO algorithm. A definition of the online SVM classifier can be found in [2].

Oza Bagging – OBag
Oza Bagging is one of the most recognised stream ensembles in the literature and one of the first ensemble methods used in the community. Oza bagging is well known for using a bag of classifiers, each trained using random sampling following a Poisson distribution; the final answer is given by the highest vote [115].
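To illustrate the sampling scheme, a minimal sketch of online bagging’s Poisson(1) training step follows; the learner interface is an assumption of the sketch.

```python
import math
import random

def poisson_1() -> int:
    """Sample k ~ Poisson(lambda = 1) via inversion (Knuth's method)."""
    l, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= random.random()
        if p <= l:
            return k
        k += 1

def oza_bag_train(learners, x, y):
    """Online bagging: each base learner sees the instance k ~ Poisson(1)
    times, simulating bootstrap resampling on a stream [115]."""
    for learner in learners:
        for _ in range(poisson_1()):
            learner.learn(x, y)
```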

Oza Boosting – OBoost
Similarly to SLICER, Oza boosting is a weighted combination of all learners trained by the ensemble. Boosting algorithms usually use weights as a “punishment” mechanism, in which errors in h_i from base models h_1, ..., h_m are avoided by knowing the errors of the previous learner h_{i−1}, giving such samples more representation in the Poisson distribution [115].

Diversified Dynamic Weighted Majority Ensemble – DDWM
The DDWM algorithm uses a range of diverse base algorithms to classify instances in classification tasks [152]. Similarly to Oza’s algorithms, DDWM uses a Poisson distribution during the training phase, and two ensembles, a low-diversity and a high-diversity ensemble, to train and re-weight models. It uses a weighted sum over the whole set of classifiers as the final prediction.

5.5.3 Drift Detection Methods

Our experiments include testing of DDM [57], EDDM [11] (non-parametric), the Page-Hinkley Test (PHT) and CUSUM [116] as the available alternatives for drift detection in SLICER and the other drift-detection-aware models included in these experiments. Testing was done using the following parameters (Table 5.1):

Table 5.1: Drift Detection Methods Parameter Configuration

Drift detection model   Parameter                  Value
DDM                     min number of instances    10% of the dataset
PHT                     min number of instances    10% of the dataset
PHT                     δ                          0.005 (default)
PHT                     λ                          50 (default)
PHT                     α                          1 (default)
CUSUM                   min number of instances    10% of the dataset
CUSUM                   δ                          0.005 (default)
CUSUM                   λ                          50 (default)
CUSUM                   α                          1 (default)
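For context, DDM monitors the online error rate and flags warning and drift levels relative to its running minima [57]. A minimal sketch follows; the 2σ/3σ thresholds follow the published method, while the minimum-instance guard is configurable, as in Table 5.1.

```python
import math

class DDM:
    """Drift Detection Method: tracks the prequential error rate p and its
    standard deviation s; warns at p + s > p_min + 2*s_min and signals drift
    at p + s > p_min + 3*s_min, at which point models are rebuilt."""

    def __init__(self, min_instances=30):
        self.min_instances = min_instances
        self.reset()

    def reset(self):
        self.n, self.errors = 0, 0
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, is_error: bool) -> str:
        self.n += 1
        self.errors += int(is_error)
        if self.n < self.min_instances:
            return "IN_CONTROL"
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s   # new best operating point
        if p + s > self.p_min + 3 * self.s_min:
            self.reset()                    # drift: start statistics afresh
            return "OUT_OF_CONTROL"
        if p + s > self.p_min + 2 * self.s_min:
            return "WARNING"                # buffer instances for retraining
        return "IN_CONTROL"
```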

5.6 Analysis of the Results

Results of the experimentation for both event classification and argument classification are presented in the following sub-sections. We observe that prequential F1 tends to be higher using SLICER than other ensembles and classic algorithms.

Table 5.2 presents Event Detection and Classification results, while Table 5.3 presents results on argument identification and classification tasks.

5.6.1 Event Detection and Classification

Results in Table 5.2 suggest that SLICER outperformed known Stream Mining algorithms in event classification tasks. Results show that SLICER achieves better F1 on Event Classification for both AfPak datasets. It is interesting to note that HT obtained the highest result for the Event Detection task, given that HT reduced the number of irrelevant events. However, during the classification task, SLICER was found to be more effective, with nearly 32% F1 compared against the 30% F1 of a Hoeffding Tree.

Similarly, SLICER outperformed the analysed ensembles by larger margins, by taking advantage of drift detection techniques. Therefore, SLICER can be recommended when dealing with the classification of text datasets, especially under skewed distributions.

Table 5.2: Prequential Event Detection and Classification Results (Micro Precision, Recall and F1)

Event Detection
                Twitter                   News
Method         P       R      F1         P       R      F1
MC             52.8%   100%   69.1%      61.7%   100%   76.3%
MNB            52.8%   100%   69.1%      61.7%   100%   76.3%
HT             70.8%   100%   82.9%      61.9%   100%   76.4%
SVM            63.9%   100%   78%        61.5%   100%   76.1%
OBag           52.9%   100%   69.2%      61.6%   100%   76.2%
OBoost         52.1%   100%   68.5%      61.9%   100%   76.4%
DM Ensemble    52.5%   100%   68.9%      61.6%   100%   76.3%
SLICER         52.7%   100%   69.1%      61.9%   100%   76.4%

Event Classification
                Twitter                   News
Method         P       R      F1         P       R      F1
MC             19%     22.8%  21.1%      9.9%    16.2%  12.2%
MNB            23.3%   44.1%  30.5%      13.7%   22.2%  17%
HT             18.4%   26%    21.6%      14.2%   23%    17.6%
SVM            23.1%   36.2%  28.2%      14.8%   24%    18.3%
OBag           23%     43.3%  30%        18.2%   29.6%  22.5%
OBoost         22.5%   43%    29.5%      18.2%   29.4%  22.6%
DM Ensemble    21.7%   41.3%  28.5%      18.6%   30.2%  23%
SLICER         23.9%   45.4%  31.4%      22.3%   36.2%  27.6%

Regarding event component identification, the best performance on this task was obtained with a combination of a Hoeffding tree and our proposed semantic featurisation, reaching up to 76% prequential F1. Recall that the last reported experiments in recent work [119, 149] obtained 61.2% standard F1 with batch learning. The fact that our method improves prequential accuracy by more than 10% may set a new baseline for future research in the area, specifically for stream Event Extraction mining.

5.6.2 Argument Detection and Classification

Regarding Argument Detection and Argument Classification, SLICER boosted the base MNB classifier with similar results. F1 improved by 3% to 4% over the best ensemble (the DM ensemble) in both datasets. This improvement, added to the Event Detection and Event Classification performance results, suggests that SLICER’s change adaptation mechanism decides better ways to boost base learners than Oza bags and DM ensembles, as it takes advantage of knowing when drift is happening while dynamically changing local boosters over time. By experimentation, the best drift detection technique for these datasets was DDM, with a minimum instance seed of 9 instances. EDDM showed results close to the existing ensembles.

Table 5.3: Prequential Argument Detection and Classification Results

Argument Detection
            Twitter                   News
Method     P       R      F1         P       R      F1
MNB        50.8%   61.1%  55.5%      14.5%   47.9%  22.9%
HT         32%     24%    27.5%      21.8%   15.5%  18.1%
SVM        30.9%   28.3%  29.5%      14%     7.4%   9.6%
OBag       28.2%   59.1%  38.2%      14.8%   39.3%  21.5%
OBoost     27.5%   77.2%  40.5%      14.2%   59.7%  23%
DDWM       27.8%   64.2%  38.7%      14.7%   42.1%  21.8%
SLICER     51.1%   61.7%  55.9%      16%     43.5%  23.5%

Argument Classification
            Twitter                   News
Method     P       R      F1         P       R      F1
MNB        20.5%   24.6%  22.3%      6.9%    23%    10.7%
HT         16.3%   12.3%  14%        13%     9.2%   10.8%
SVM        12.6%   11.5%  12%        6%      2.8%   3.7%
OBag       11.6%   24.2%  15.6%      7.4%    19.5%  10.6%
OBoost     10.6%   29.8%  15.6%      4.5%    19.2%  7.4%
DDWM       11.8%   27.4%  16.5%      7.2%    20.7%  10.7%
SLICER     20.7%   25%    22.7%      8.6%    23.5%  12.6%

In addition to overall accuracies, the next section analyses the effect of drift identification against a version without a drift detection mechanism. Although accuracies over the AfPak datasets are low under independent classification tasks, a joint mechanism such as SPLICER can be used on top of SLICER to boost predictions further. The combination of both methods is used and analysed below.

5.6.3 The Effect of Drift in SLICER

As mentioned above, SLICER handles drift by using any of the implemented drift detection techniques found in MOA. SLICER was tested using DDM [57], EDDM [11], and HDDM [52]. Of all the tested drift detection techniques, DDM performed best and was used as the default drift detection technique during the experiments. We found that DDM is useful for dealing with both the gradual drift found in the AfPak datasets and the sudden drift reflected when battle events occurred. Our results are consistent with the findings of Gonçalves et al. [64] in their comparative study, in which DDM was found to be more accurate than the other evaluated methods and particularly suited to gradual drift scenarios.

Drift detection was found useful for improving Event Classification and Argument Classification accuracies in all datasets. This was expected, as the analysed domain constantly changes and evolves. The Afghan conflict is an internal social, international and humanitarian conflict that has lasted for more than two decades. New actors, groups, forms of terrorism, and attack and battle strategies have appeared, for instance drone attacks.

In this case and during the length of the conflict, there were two main battles that shifted event type distributions and actors/targets. These battles, Kunduz and Nangarhar, affected the accuracy of algorithms without a drift detection mechanism. This fact is supported by the lift-per-drift metric [8], calculated for SLICER and presented in Table 5.4 using the equation shown in Section 2.5.1, with r = 0.5 to emphasise the differences between drifts.

Table 5.4: SLICER’s F1 and Calculated Lift-per-drift for Event Detection (ED), Event Classification (EC), Argument Detection (AD) and Argument Classification (AC) in the AfPak Datasets

                   Twitter                          News
Method            ED      EC      AD     AC        ED      EC      AD     AC
SLICER no drift   69.1%   30.5%   55%    22.3%     76.2%   16.9%   22.3%  10.7%
SLICER DDM        69.1%   31.4%   56%    22.6%     76.4%   27.6%   23.5%  12.6%
Lift-per-drift    0%      2%      2%     0.6%      0.4%    21.5%   2.5%   4%

Results from Table 5.4 suggest that drift detection at least maintains the accuracies of the baseline without drift detection, and significantly improves F1 for AfPak News event classification in particular. Only one case (Event Detection F1 in Twitter) shows no effect from using drift detection with SLICER. This result is expected, as events occurred in almost every tweet given to the model, since irrelevant tweets were filtered out using keywords before the learning task.

In addition, the results show that there is more change in the news articles dataset than in Twitter. This could be explained by the two different human annotators in the process making different annotation choices. Although both datasets were checked by experts, there is still bias between annotators, as well as subjectivity in choosing actors, targets or categories. It is important to note that SLICER has been shown experimentally to perform better. However, there is an open question over other domains, different from our current Event Mining conflict datasets.

5.6.4 SLICER in Other Domains

As explained earlier, SLICER was also tested on widely used Stream Mining datasets, to validate the feasibility of its use in different domains, as SLICER is a single layer Stream Mining classifier. Table 5.5 shows results for the cover type, airlines and electricity datasets.

Due to the serial correlation nature of the electricity dataset, SLICER is also compared against a no-change predictor (NOC), which predicts that the next value equals the last value seen, i.e. $\hat{Y}_{t+1} = Y_t$, and against the Dynamic Weighted Majority (DWM) classifier presented by Kolter and Maloof [80], which dynamically creates, updates and removes experts from the ensemble based on each local classifier’s past performance, and adds new experts based on the overall performance, maintaining a fixed given number of experts. Tests included a comparison between drift detection techniques using a Bayesian classifier against SLICER with Bayesian local and global classifiers, using default parameter values and information gain as the selected information quality metric to split on. Additionally, a test with a global Naïve Bayes boosted by local NoChange detectors was included (SLICER (NOC+NB)) to validate whether SLICER is capable of boosting a global classifier combined with a serial correlation-based algorithm. Results are reported in Table 5.5.

From the results, SLICER reported the best results on each of the tested datasets (best accuracy rates are in bold). Non-significant differences are seen in 3 out of the 12 test comparisons made between SLICER and NB with a drift detection technique. All SLICER tests performed better than the baseline NB without a drift detection method. In addition, the best reported accuracies on each dataset were obtained using SLICER, reaching 95.04% on the cover type dataset, 68.1% on the airlines dataset, and 85.5% on the electricity dataset. It is important to note that the Electricity and Cover Type datasets are highly time-dependent, and that NOC accuracies are amongst the best on these datasets. Nonetheless, no-change detectors were less accurate than NB-based algorithms (58% for NOC vs. 64% for NB) in the absence of serial correlation, as seen in the Airlines dataset. A weaker serial correlation is also seen in the textual datasets from Tables 5.3 and 5.2.

Table 5.5: SLICER’s Accuracy on Other Domain Datasets (best accuracy rates are in bold)

Drift         Method             CoverType   Airlines   Electricity
No drift      NB                 60.5%       64.5%      73.4%
              DWM (NB)           82.88%      67.15%     79.71%
              NOC                95.03%      58%        85.3%
DDM           NB                 78.7%       65.3%      81.7%
              DWM (NB)           87.4%       67.15%     84.1%
              SLICER (NB)        81.9%       66.7%      81.8%
              SLICER (NOC+NB)    95.03%      60.45%     85.4%
EDDM          NB                 85.1%       65.1%      84.8%
              DWM (NB)           86.3%       67.1%      84.6%
              SLICER (NB)        86.3%       66.2%      85.4%
              SLICER (NOC+NB)    95.03%      58.9%      85.5%
CuSum         NB                 81.5%       67.5%      79.3%
              DWM (NB)           83.4%       66.9%      80.8%
              SLICER (NB)        81.2%       68.1%      83.2%
              SLICER (NOC+NB)    95.04%      58%        85.4%
PageHinkley   NB                 80.1%       67%        78%
              DWM (NB)           83%         67.1%      80%
              SLICER (NB)        79.5%       67.8%      79.9%
              SLICER (NOC+NB)    95.03%      58%        85.3%

It can be noted that the best reported accuracy on the Cover Type dataset involved a CuSum mechanism for drift detection, but it was very close to the result reported by NOC, and the difference in accuracy is negligible (0.01%). Similarly, the best reported accuracy for the Electricity dataset, 85.5%, was obtained with SLICER using a global NB boosted with NOC local experts. In addition, the flexibility of SLICER allowed us to use two different learners whose combination produced better results than the NOC algorithms and the DWM ensemble.

In addition, it is important to note that the best accuracies were reached on top of the best reported drift detection technique for each dataset. For example, CuSum improved accuracies on two of the datasets, and EDDM was the best drift technique for the electricity dataset. Consequently, the overall performance obtained during empirical testing suggests that SLICER can be used to improve the accuracy of stream classification tasks.

5.7 Multi-stream Event Extraction Using SPLICER and SLICER

This section explores the application of both the SPLICER and SLICER methods to perform Event Extraction from multiple text streams. The idea is to compare Event Mining applied to single vs multiple text streams, analysing whether source splitting is useful for Event Extraction on textual datasets. As suggested by experimentation, we found that source splitting is useful when horizontal splitting is carefully done, taking into consideration sufficient instance information gain (IIG) when splitting the models.

We first present an extended analysis of the Twitter and news articles AfPak datasets, in order to present further relationships between components and datasets, demonstrating the feasibility of splitting this dataset by source of information. Afterwards, a joint model using SLICER for single component classification and SPLICER for Event Extraction is applied to a joint version of Twitter and News, and conclusions are drawn from the results of the experiments, suggesting that the Instance Information Gain (IIG) metric is useful for splitting models while considering concept drift.

As explained in the Event Mining framework in Chapter 3, the idea is to perform experiments for a wide range of Event Mining tasks. In this case, we are particularly interested in improving the results of the Event Extraction baselines shown in Chapter 4, by combining SPLICER and SLICER to validate whether an automatic splitting ensemble can enhance the performance of SPLICER.

Therefore, our main goal is to increase the accuracy, especially F1, of the whole Event Extraction learning task for social conflict datasets, by using a multi-layer rule based learning technique (SPLICER) on top of an automatic multi-stream ensemble (SLICER).

5.7.1 Methodology

The proposed approach combines both methods: SLICER as a first layer ensemble and SPLICER as a second layer event extractor. First, SLICER is used for event component classification, i.e. classifying event types and argument roles. Second, SPLICER is used to enhance the whole Event Extraction task, using its base knowledge in the form of ontologies, in addition to the rules learned by the model in near-real time.

The Event Extraction pipeline is shown in Figure 5.4. A combination of both mechanisms allows the implementation of a fully automated multi-stream Event Extraction pipeline, with each model used in a different Event Extraction layer. Moreover, our research findings suggest that if we have enough IIG confidence, it is better to split models into separate classifiers, as in the case of the AfPak datasets.
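As a sketch of this two-layer arrangement (hypothetical method names such as predict, mentions and assemble; the actual implementations are the MOA-based SLICER and SPLICER models), the layers compose as follows:

def extract_event(document, slicer_event_type, slicer_arg_role, splicer):
    # Layer 1: SLICER instances classify the event components.
    event_type = slicer_event_type.predict(document)            # e.g. "Battle"
    arg_roles = [slicer_arg_role.predict(m) for m in document.mentions]
    # Layer 2: SPLICER assembles the full event, constrained by its
    # ontology and the rules it has learned online.
    return splicer.assemble(document, event_type, arg_roles)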

Figure 5.4: EE pipeline using SLICER in the first layer and SPLICER in the second to reach the full MSM approach

In order to test the feasibility of this approach, the following section presents a comparison against the Event Extraction experiments of Chapter 4.

5.7.2 Experiments

The idea of having multiple text streams was developed in Chapter 3, where we defined the term as a set of one or more text streams from different sources. Note that it is possible to build different text sub-streams from a single stream. SLICER automatically creates multi-streams from a joint single-stream source, making horizontal partitions according to features and feature values.
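The partitioning step can be sketched as follows (a simplification, not the full SLICER algorithm; learn_one is an assumed incremental-learner interface):

class HorizontalPartitioner:
    """Sketch of horizontal stream partitioning: each instance updates a
    global model plus a local model keyed by the value of the chosen
    split attribute, effectively creating one sub-stream per value."""

    def __init__(self, make_model, split_attr):
        self.make_model = make_model      # factory for incremental learners
        self.split_attr = split_attr      # e.g. "SOURCE" or "AUTHOR"
        self.global_model = make_model()
        self.local_models = {}            # one local model per attribute value

    def learn_one(self, x, y):
        self.global_model.learn_one(x, y)
        local = self.local_models.setdefault(x[self.split_attr],
                                             self.make_model())
        local.learn_one(x, y)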

This section explores how splitting can be done not only for single-layer Event Mining tasks, as was done previously (for Event Classification and Argument Classification), but also for multi-layer Event Extraction tasks. In order to perform this analysis, we compare a joint version of the AfPak datasets (Twitter and News) against a horizontally partitioned version of the datasets, to validate whether results are better when IIG suggests splitting.

Joint Twitter and News AfPak Dataset

Tests were made using a joint Twitter and News AfPak dataset, described in Section 3.3.1, which combines both datasets ordered by event date. In addition, the joint dataset contains an additional column (SOURCE) that identifies whether the reported event was reported via Twitter or News.
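As an illustration of this joint construction (a sketch only; field names such as "date" are assumptions), the two date-ordered streams can be merged as follows:

import heapq

def joint_stream(twitter_events, news_events):
    # Tag each record with its origin, then merge the two date-ordered
    # streams into a single joint stream carrying a SOURCE column.
    tagged_twitter = ({**e, "SOURCE": "Twitter"} for e in twitter_events)
    tagged_news = ({**e, "SOURCE": "News"} for e in news_events)
    return heapq.merge(tagged_twitter, tagged_news, key=lambda e: e["date"])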

AfPak Twitter Dataset – Author Correlation

The IIG metric suggests a partition by authors, as this feature obtained the best IIG value. An initial intuition is that there is a co-dependence or correlation between tweets by authors and the reported event type categories. Table 5.6 shows the resulting Pearson correlation matrix between the two variables, authors and event type categories. Pearson correlation was applied after both nominal variables were label-encoded and normalised. An initial analysis of the correlation matrix suggests almost no correlation between event type categories and author categories in Twitter.

Table 5.6: Twitter Authors vs Types Correlation Matrix

              Authors   Event Types
Authors       1         -0.04
Event Types   -0.04     1
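For illustration, the label-encode-then-correlate step can be sketched as follows (toy values standing in for the real AfPak columns):

import numpy as np

def label_encode(values):
    codes = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return np.array([codes[v] for v in values], dtype=float)

authors = label_encode(["a1", "a2", "a1", "a3", "a2"])
types = label_encode(["Battle", "Attack", "Battle", "Riot", "Attack"])

# Normalise, then take the Pearson correlation coefficient.
authors = (authors - authors.mean()) / authors.std()
types = (types - types.mean()) / types.std()
r = float(np.corrcoef(authors, types)[0, 1])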

Similar values are reached when analysing actors vs source and targets vs source. Nonetheless, if IIG is calculated for all Event Classification features, we find that the best IIG is generated by the AUTHOR attribute (IG = 2.32) against the second-best value (the numerical representation of the word “Kill”, IG = 4.82); recall that, for this metric, lower values are better. Consequently, we expect that the SLICER algorithm will select this feature to partition the stream into separate models by source of information.

AfPak News Dataset

Contrary to Twitter, news article type categories seem to be negatively correlated with the information source, as shown in Table 5.7. The correlation coefficients suggest a relationship between event types and the source of information (SOI). This type of correlation is likely to be generated by news sources that tend to report certain types of event more than others. For instance, the web portal www.afghanistan-analysts.org is more interested in publishing battle events and territory disputes than attacks on civilians. However, similarly to Twitter, the AfPak News AUTHOR feature reaches the best IG (IG = 0.23).

Table 5.7: News Authors vs Types Correlation Matrix

              Authors   Event Types
Authors       1         -0.32
Event Types   -0.32     1

5.7.3 Event Extraction Baselines

The same baselines used in Chapter 4 were used to compare full Event Extraction against established Event Extraction systems. JET was used as an initial baseline, tested with the joint version of the datasets. Note that a multi-streamed version of JET does not apply, as JET was trained on a different news corpus; we use basic transfer learning to mimic Stream Mining testing with a batch-trained model.

SPLICER: SPLICER is also trained and tested as a baseline, and the model is used on all versions of the dataset. First, a single-stream version of the joint AfPak dataset was used to test SPLICER and compare it against the results found in Chapter 4. Secondly, a multi-stream version of SPLICER was analysed, with results presented in this chapter. Finally, SPLICER was tested using SLICER on top of MNB classifiers to split the stream further by authors.
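All methods above are compared under the prequential protocol used throughout this thesis. As a minimal sketch (assuming an incremental learner exposing hypothetical predict_one and learn_one methods), the test-then-train loop is:

def prequential(model, stream, score):
    """Prequential (test-then-train) protocol: each arriving instance is
    first used to test the current model and only then to update it."""
    total, n = 0.0, 0
    for x, y in stream:
        y_pred = model.predict_one(x)   # test first...
        total += score(y, y_pred)
        model.learn_one(x, y)           # ...then train
        n += 1
    return total / n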

5.7.4 Analysis of the Results

Table 5.8: Prequential Event Extraction Results (Joint Twitter + News)

Components Detection
Method                         P       R       F1
JET                            33.5%   28.7%   30.9%
Single Stream SPLICER          37%     56.5%   42.2%
Multi Stream SPLICER           54.2%   58.3%   56.2%
Multi Stream SPLICER+SLICER    54.2%   64.8%   59%

Components Classification
Method                         P       R       F1
JET                            23%     19.7%   21.3%
Single Stream SPLICER          23.1%   24.8%   23.9%
Multi Stream SPLICER           26.3%   28.3%   27.2%
Multi Stream SPLICER+SLICER    26.9%   32.1%   29.3%

Table 5.9: Prequential Event Detection and Event Classification Results (Joint Twitter + News)

Event Detection
Method                         P       R       F1
JET                            41.4%   24.8%   31%
Single Stream SPLICER          74.6%   100%    85.5%
Multi Stream SPLICER           72%     100%    83.7%
Multi Stream SPLICER+SLICER    72.2%   100%    83.9%

Event Classification
Method                         P       R       F1
JET                            30.3%   18.2%   22.7%
Single Stream SPLICER          19.4%   26%     22.2%
Multi Stream SPLICER           23%     31.9%   26.7%
Multi Stream SPLICER+SLICER    23.1%   32%     26.8%

Table 5.10: Prequential Argument Detection and Argument Classification Results (Joint Twitter + News)

Arguments Detection
Method                         P       R       F1
JET                            31.2%   30.5%   30.9%
Single Stream SPLICER          22.6%   41.3%   29.2%
Multi Stream SPLICER           41.5%   38.4%   39.8%
Multi Stream SPLICER+SLICER    43.5%   48.1%   45.7%

Arguments Classification
Method                         P       R       F1
JET                            20.9%   20.5%   20.7%
Single Stream SPLICER          25.7%   24.4%   25%
Multi Stream SPLICER           28.7%   26.6%   27.5%
Multi Stream SPLICER+SLICER    29.1%   32.2%   30.6%

Results for the AfPak datasets are shown in Tables 5.8 to 5.10, covering the whole Event Extraction task. The combination of SPLICER and SLICER makes a significant difference compared against the initial SPLICER baselines. First, we can see a major improvement between a single model for the joint AfPak dataset and a split, multi-stream version of the AfPak dataset. It was initially found that splitting the dataset by main source of information, as proposed by IIG, improved overall F1 from 42.2% to 56.2% in the detection and from 23.9% to 27.2% in the classification.

We performed a manual split, as it was straightforward to visualise such a split, and prequential training was not difficult as only two models needed to be trained and tested. Further splitting was decided by SLICER, initially splitting by the feature “KILL-TF-IDF”, and later by the “AUTHOR” feature when the second drift was detected; two drifts were detected in total. It was found that SPLICER + SLICER performed better than multi-stream SPLICER, improving results from 56.2% to 59% F1 in the event components detection task, and from 27.2% to 29.3% in the event components classification task.

Similar results are found when analysing each first-layer Event Extraction task. The best Event Detection result was found using the joint AfPak dataset, as the binomial learning task has more examples to train on. Nonetheless, a higher F1 is achieved in the Event Classification task, suggesting that the IG metric is boosting the target learning tasks, as expected from our previous analysis. Similarly, Argument Detection and Argument Classification F1 scores were significantly improved using SPLICER + SLICER.

In general, improvements were made by using multi-stream versions of both algorithms, correctly split by the best feature found by the IG metric and supported by the minimum sampling requirement. The results validate the thesis that multi-Stream Mining can be performed automatically by analysing statistical measures, such as IG, that provide enough confidence that splitting the information will yield a gain. As a result, the joint use of SPLICER + SLICER is suitable for Event Extraction in near-real time, while benefiting from the algorithmic and design features of Stream Mining algorithms.

5.7.5 Runtime Analysis

Table 5.11 shows baseline runtimes compared against SLICER for all the reported datasets. Overall, it can be seen that SLICER’s runtime is higher than that of its single-classifier counterpart, given that it adds a hierarchical layer that increases both data volume and processing time. Nonetheless, SLICER’s runtime was between 25% and 35% faster than other ensemble models on several datasets, and the extra layer does not dramatically affect performance compared against such ensembles, even running faster than algorithms such as Oza Bagging while obtaining better accuracy results.

All tests were performed on a 64-bit CPU computer with 16 GB of RAM and 256 GB of hard disk space. Complexity analysis suggests the same result, as SLICER runs in Θ(N ∗ K), where N is the number of instances and K is the number of learned models. In contrast, a plain independent MNB classifier runs in Θ(N). Runtime can increase markedly if the selected split mechanism creates many models. However, it is important to note that SLICER’s runtime is comparable to that of the other ensemble methods.

Table 5.11: SLICER Runtime Results (secs)

Method    Twitter   News   ACLED   CoverType   Airlines   Electricity
MNB       151       450    969     28          11         0.16
HT        98        517    1280    166         367        0.4
SVM       87        498    1320    110         11         0.15
OBag      320       794    1720    262         6840       15
OBoost    340       738    1743    249         6732       14
DDWM      270       771    -F-     211         1244       5
SLICER    350       891    2312    283         4268       9

5.8 Summary

In this chapter, a new Stream Ensemble learning model – SLICER – for text classification was described and analysed under prequential conditions. The method boosts prequential macro precision and macro recall, as well as standard precision, recall and F1, on imbalanced classes, using local boosting to improve performance by around 3% to 7% F1 on the Event Detection / Event Classification and Argument Detection / Argument Classification tasks. Furthermore, the method also obtained better results than MNB classifiers.

Moreover, the model was tested on other domain datasets and stream classification tasks, showing effectiveness and reliability compared against baselines and drift-detection-based classifiers. Significant improvements were found in all tested datasets, and the highest overall accuracy on each dataset was attained by SLICER. In addition, a combination of the SPLICER and SLICER methods was tested on a joint version of the AfPak datasets to validate the feasibility of creating multi-streams using the best IIG feature. Exploratory analysis showed that author-based partitions reached the best IG, suggesting such splits.

Experimental tests were performed using both joint and split AfPak datasets. An improvement of between 4% and 8% in F1 was obtained by using SPLICER + SLICER compared against multi-stream SPLICER. Results are consistent across event sub-tasks, showing the feasibility of using both algorithms together to automatically deal with multiple sources of information in Stream Event Extraction learning tasks.

All in all, SLICER works effectively in Stream Mining and Event Mining tasks. Future work could involve running SPLICER + SLICER on datasets from other domains.

Chapter 6

Conclusion

This research work has investigated how to efficiently apply Stream Mining methods to NLP classification tasks, in particular to the Event Extraction, Event Detection, Argument Detection and Argument Classification learning tasks. A major intended application is the analysis of social and political conflicts by social scientists. Automated Event Extraction techniques allow users to collect and transform information extremely quickly, and novel analyses can be made by using quantitative analysis in conjunction with current socio-political qualitative analyses.

All in all, this research work aimed to answer the following questions:

1. How can Event Mining tasks be run under Stream Mining scenarios with higher accuracies than batch learning techniques?

2. How can machine learning models for text classification be run over different sources of information using online learning?

3. How does drift affect the accuracy of Stream Mining models in Event Mining tasks?

Answers found during this research can be summarised as follows:

1. We have developed a comprehensive framework (the SEMF framework) to efficiently deal with several Event Mining tasks under near-real time conditions, using Stream Mining approaches. Results in Chapters 4 and 5 show the feasibility of the framework.

2. We found an automated way of dealing with multiple sources of information using a boosting ensemble algorithm (SLICER), which can be used not only for text classification but also for other stream classification tasks. Results show that SLICER efficiently boosts accuracy by analysing when there is a good opportunity to split, using a combination of information quality and drift assessment to auto-regulate local learners.

3. An analysis of how drift affects accuracy was performed over the chosen learning tasks, using a specific drift assessment metric, allowing us to better understand the effect of drift on different datasets. We found that there is a temporal change in the Event Extraction rules learned over time, detected in SPLICER by the Hoeffding bound. In addition, drift detection improves the self-regulation and self-evolution of SLICER, yielding significant improvements compared against non-drift-detection techniques.

This research work provides a better understanding of how to deal with text data when performing real-time Stream Mining tasks under challenging scenarios such as social conflict analysis. As a result, the SEMF framework was developed to perform Event Mining tasks with online learning models, and the SPLICER and SLICER models were built on top of MOA to perform such operations.

6.1 Summary

6.1.1 SPLICER

We developed SPLICER, a novel multi-task stream learning algorithm capable of efficient Event Extraction under Stream Mining constraints. The algorithm uses a rule-based learning procedure not only to extract events, but also to give the user a set of rules to better understand the decisions made by the algorithm. SPLICER is explainable, and the user can interact further with the algorithm by deciding which rules can be used. The explainability of our model is tightly tied to the rule selection made by human annotators in the human-in-the-loop model of Heap et al. [67], in which annotation was made jointly by humans and machines, and the model (rules) was improved and changed by humans. SPLICER uses a partially explainable machine learning model with rules in the top layer, which can combine other machine learning models in the base layer to perform the predictions. Empirical research on both AfPak datasets and ACLED shows the efficiency and reliability of our results on the whole Event Extraction task.

In addition, SPLICER showed reliability under different datasets and ontologies, as the algorithm was tested on AfPak and ACLED, which require different ontologies: although all datasets describe events from the Afghan conflict, their categories and granularities differ, leading to differences between the two ontologies. Manual annotation, as in ACLED, was found to be expensive, while fully automated techniques, as in GDELT or JET, were found to be inaccurate or difficult to use. In addition, there were no Stream Mining techniques dealing with complex information extraction tasks, only with Event Detection.

Base knowledge derived from ontologies proved useful for combining entities into ELFs that were then extracted by an integrity-constraint rule extractor, enhancing recall over hard-constraint versions of SPLICER. In addition, an analysis of SPLICER’s pruning mechanism was made, in order to show the appropriateness of using the Hoeffding bound to maintain accuracy over time.
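For reference, the Hoeffding bound [68] guarantees that, with probability 1 − δ, the observed mean of n samples of a variable with range R lies within ε of the true mean, where ε = sqrt(R² ln(1/δ) / (2n)). A minimal sketch of this computation (as used, for instance, in Hoeffding trees to decide when enough evidence has accumulated to keep or prune a rule):

import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the observed mean of
    n samples of a variable with the given range is within epsilon of the
    true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))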

Overall, improvements in the F1 metric of between 10% and 25% against baselines were reached in all event component subtasks. Finally, runtime analysis suggests that SPLICER runs moderately slower than independent classifiers, though its time complexity is mitigated by rule pruning. Future work can be devoted to reducing the time complexity of the rule search and construction algorithms.

6.1.2 SLICER

Chapter 5 presents a novel stream ensemble that automatically detects when it is better to use multi-streams rather than single streams for real-time classification tasks. The idea behind SLICER relies on the fact that statistical measures can guide how an ensemble partitions a dataset to obtain higher accuracy. This research proposes two information quality metrics, Information Gain (IG) and the Gini index, to select the best attribute on which to horizontally split a stream.
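As with Information Gain, the Gini index can be computed per candidate attribute over a window of instances. A minimal sketch (lower weighted impurity after the split is better):

from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(rows, labels, attr):
    """Weighted Gini impurity after horizontally splitting on `attr`."""
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return sum(len(ys) / len(labels) * gini(ys) for ys in by_value.values())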

Horizontal partitions were later used to train local classifiers on instances with particular attribute values. A horizontal partition is made once the selected attribute reaches a certain sampling confidence, using Cochran sampling. This extra step gives the ensemble the required confidence to split a single stream, creating a multi-stream to train local classifiers per attribute value. During the prediction phase, SLICER combines the chosen local classifier with a maintained global classifier and makes a normalised combined vote to give the final answer.
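The sampling-confidence step can be illustrated with Cochran’s sample size formula [43] (a sketch; the exact integration into SLICER’s splitting test is simplified here):

import math

def cochran_sample_size(z=1.96, p=0.5, e=0.05):
    """Cochran's minimum sample size n0 = z^2 p (1 - p) / e^2, where z is
    the z-score for the desired confidence, p the estimated proportion and
    e the margin of error. A partition is only created once the candidate
    attribute has been observed at least this many times."""
    return math.ceil((z ** 2) * p * (1 - p) / (e ** 2))

# e.g. 95% confidence with a 5% margin of error: cochran_sample_size() == 385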

An additional step in SLICER’s training algorithm is the idea of rebuilding the local classifiers if drift is found over time. When drift is signalled, SLICER stores a window of instances that is used to train new local classifiers. Once drift is confirmed, SLICER removes the old local learners and replaces them with new classifiers built on the current best splitting attribute, chosen using an information quality metric.
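A simplified sketch of this reaction to drift (hypothetical helper names; split_cost is an information quality metric where lower is better, e.g. the weighted Gini impurity sketched above):

def rebuild_on_drift(window, attrs, make_model, split_cost):
    """On a confirmed drift: discard old local learners, re-select the best
    splitting attribute on the buffered window, and train fresh local
    models from that window."""
    rows = [x for x, _ in window]
    labels = [y for _, y in window]
    best_attr = min(attrs, key=lambda a: split_cost(rows, labels, a))
    local_models = {}
    for x, y in window:
        local_models.setdefault(x[best_attr], make_model()).learn_one(x, y)
    return best_attr, local_models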

Experiments were initially performed on single-layer text classification tasks for Event Mining, such as Event Classification and Argument Classification, using the AfPak datasets. Improvements were found over single Stream Mining baselines as well as other Stream Mining ensembles and boosting mechanisms. In addition, an assessment of the effect of drift detection techniques was made by comparing SLICER with and without a drift detection method. The best reported drift detector for these datasets was DDM. Improvements of between 2% and 21% in the lift-per-drift measure were found when comparing the two mechanisms.

Finally, SLICER was tested on a set of well-known datasets used in the Stream Mining community, showing feasibility and reliability of results, improving accuracy on all tested datasets, including the Electricity, Cover Type and Airlines datasets, making its use possible in other Stream Mining classification domains.

In summary, SLICER was found to be useful for detecting when and how to split single streams into multi-streams in order to boost accuracy in real time, while taking into consideration distribution changes over a range of datasets. Further tests might include implementing SLICER on parallel architectures, and comparing SLICER against vertical ensembles available on other platforms.

In addition, this thesis presented a combined usage of the two methods for the whole Event Extraction task. Empirical experimentation compared a joint version of the AfPak datasets as a single dataset, and results using SPLICER, and SPLICER with SLICER, were analysed, suggesting a further improvement of 3% in the whole Event Extraction F1 measure.

6.2 Limitations

This research work has limitations in several respects. Firstly, the SPLICER model heavily relies on the ontology definition and expert knowledge, which might lead to wrong assumptions if wrongly or weakly modelled. Domain-specific ontology modelling is a highly labour-intensive task if we want to achieve efficient outputs from the training-testing phase. For example, the Afghanistan domain-specific ontology took around six months to fully develop. Therefore, a drawback of using ontologies as base knowledge is the inherent requirement of a carefully modelled ontology that can be used at full capacity during the labelling phase. In addition, ontologies evolve over time, so they need to be updated from time to time to properly reflect current reality, which also takes effort.

Secondly, SPLICER’s pre-processing techniques can be further improved, as it does not use the word vector representations employed in neural network architectures. As explained in Chapter 4, real-time and online procedures cannot use word vector representations as such; these feature representation techniques need to be computed in batch learning, contrary to the nature of our online algorithm. Nonetheless, it is important to point out that neural networks could be used online if word vectors could be represented incrementally and efficiently. In addition, SPLICER is not a fully interpretable machine learning technique, and more research needs to be done on improving the interpretability of mathematical models, such as Naïve Bayes, that help to improve the overall accuracy of the whole model.

Thirdly, SLICER does not allow hybrid partitions. However, this thesis focused more on the overall usefulness of Stream Mining techniques in real-life applications. Additionally, it would be useful to extend drift detection techniques to allow the automatic selection of a drift detector based on the type of drift present in the data, as we found that DDM worked particularly well for the tested datasets, since most of them (social conflict datasets) exhibit gradual and sudden drift conditions.

6.3 Future Work

From this thesis, further work can be done to improve final Event Extraction accuracy, especially dealing with better procedures to improve precision, recall and F1 over skewed distributions under high levels of uncertainty, e.g. re-sampling ensembles [60]. Some papers suggest attacking the skewness problem in the data processing stage [100], others suggest a feature extraction mechanism, while others treat the issue in the algorithmic procedure. Attacking data skewness from the pre-processing stage can be considered first. This research work applied a particular window-based technique to create numerical features from text datasets, in order to run the algorithms under incremental Stream Mining constraints. However, more research can be performed to extend the instance-by-instance, incremental approach to these classification tasks. Another avenue to explore is how to manage feature vector representations incrementally, i.e. how to manage word vector models in real time under Stream Mining constraints. This could potentially pave the way to new explorations of deep learning approaches for text streams.

Additionally, further feature extraction approaches can be built by using our base knowledge (the ontology) as a feature extraction mechanism. This research tested an initial flavour of ontology-based feature extraction, simply filtering words by their appearance in the ontology values. However, more research can be done to extend this idea and make it more dynamic, involving semantic representations and word vector clustering to find word synonyms, or to transform the data based on this base knowledge.

From the algorithmic point of view, more research efforts can be made by combining human-machine frameworks, as in [67], with Stream Mining approaches, to allow faster predictions under Stream Mining constraints. There are also other Event Mining tasks to address under both near-real time and batch learning scenarios. Event synthesis is one important field to analyse in the near future, especially because the proliferation of fake news has increased researchers’ interest in finding a solution to this particular problem. An initial solution proposed in this thesis is the idea of synthesising the information coming from different and diverse sources. Hence, information quality is crucial to perform accurate identification of fake news, and new groundwork can be done in the area of information quality and information trustworthiness over the years to come. Tied up with event synthesis, event coreference resolution can solve some of the above-mentioned problems.

Moreover, although distributed algorithms have been proposed to handle drift, none of them has been implemented on a Stream Processing Engine platform. This is promising, since Stream Processing Engines are naturally built to handle huge streams of information in a distributed fashion. SLICER can also be extended and analysed to achieve higher accuracy by using a hybrid horizontal-vertical partitioning. It might be useful to extend SLICER on the SAMOA platform to analyse its behaviour in a fully distributed environment.

Finally, there are alternative types of data sources to work with, apart from purely textual data. Novel investigations can arise from analysing video streams in real time, in order to extract events from such a challenging data source. Moreover, combining a diversity of sources can enhance algorithm accuracy by taking into account separate reports and views of the same real-world event.

Bibliography

[1] Charu C Aggarwal, Jiawei Han, Jianyong Wang, and Philip S Yu, A Framework for Clustering Evolving Data Streams, Proceedings of the 29th International Conference on Very Large Data Bases, 2003, pp. 81–92.

[2] Charu C Aggarwal and ChengXiang Zhai (eds.), Mining Text Data, Springer Science and Business Media, New York, NY, 2012.

[3] David Ahn, The Stages of Event Extraction, Proceedings of the Workshop on Annotating and Reasoning about Time and Events, 2006, pp. 1–8.

[4] L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Göker, I. Kompatsiaris, and A. Jaimes, Sensing Trending Topics in Twitter, IEEE Transactions on Multimedia 15 (2013), no. 6, 1268–1282.

[5] Mohammad Akbari, Xia Hu, Liqiang Nie, and Tat-Seng Chua, From Tweets to Wellness: Wellness Event Detection from Twitter Streams, Proceedings of the Thirtieth AAAI Con- ference on Artificial Intelligence (AAAI-16), 2016, pp. 87–93.

[6] James Allan, Jaime G Carbonell, George Doddington, Jonathan Yamron, and Yiming Yang, Topic Detection and Tracking Pilot Study Final Report, Proceedings of the DARPA Broad- cast News Transcription and Understanding Workshop, February 1998, pp. 194–218.

[7] Lisa Amini, Henrique Andrade, Ranjita Bhagwan, Frank Eskesen, Richard King, Philippe Selo, Yoonho Park, and Chitra Venkatramani, SPC: A Distributed, Scalable Platform for Data Mining, Proceedings of the 4th International Workshop on Data Mining Standards, Services and Platforms, 2006, pp. 27–37.

[8] Robert Anderson, Yun Sing Koh, and Gillian Dobbie, Lift-Per-Drift: An Evaluation Metric for Classification Frameworks with Concept Drift Detection, AI 2018: Advances in Artificial Intelligence (T. Mitrovic, B. Xue, and X. Li, eds.), Springer, Cham, 2018, pp. 630–642.

[9] Robert Arp, Barry Smith, and Andrew D. Spear, What is an Ontology?, pp. 1–27, MIT Press, Cambridge, MA, 2015.

[10] Edward E Azar, The Conflict and Peace Data Bank (COPDAB) Project, Journal of Conflict Resolution 24 (1980), 143–152.

[11] Manuel Baena-García, José del Campo-Ávila, Raúl Fidalgo, Albert Bifet, Ricard Gavaldà, and Rafael Morales-Bueno, Early Drift Detection Method, Proceedings of the Fourth International Workshop on Knowledge Discovery from Data Streams, 2006, pp. 77–86.

[12] Francois Bancilhon and Raghu Ramakrishnan, Performance Evaluation of Data Intensive Logic Programs, Foundations of Deductive Databases and Logic Programming (Jack Minker, ed.), Morgan Kaufmann, Los Altos, CA, 1988, pp. 439 – 517.

[13] Jon Barwise and John Perry, Situations and Attitudes, MIT Press, Cambridge, MA, 1983.

[14] Hila Becker, Mor Naaman, and Luis Gravano, Beyond Trending Topics: Real-World Event Identification on Twitter, Proceedings of the International Conference on Web and Social Media (ICWSM), 2011, pp. 438–441.

[15] Cosmin Adrian Bejan and Sanda Harabagiu, Unsupervised Event Coreference Resolution, Computational Linguistics 40 (2014), 311–347.

[16] Cosmin Adrian Bejan and Sanda M Harabagiu, Using Clustering Methods for Discovering Event Structures, Proceedings of the Twenty-Third AAAI Conference on Artificial Intelli- gence (AAAI-08), 2008, pp. 1776–1777.

[17] Mordechai Ben-Ari, Mathematical Logic for Computer Science, Springer, London, 2012.

[18] Gurpreet Singh Bhamra, Ajit Kumar Verma, and Ram Bahadur Patel, Agent Enriched Distributed Association Rules Mining: A Review, Agents and Data Mining Interaction. (Longbing Cao, A.L.C. Bazzan, A.L. Symeonidis, V.I. Gorodetsky, G. Weiss, and P.S. Yu, eds.), Springer, Berlin, 2011, pp. 30–45.

[19] Albert Bifet, Gianmarco de Francisci Morales, Jesse Read, Geoff Holmes, and Bernhard Pfahringer, Efficient Online Evaluation of Big Data Stream Classifiers, Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 59–68.

[20] Albert Bifet and Ricard Gavaldà, Learning from Time-Changing Data with Adaptive Windowing, Proceedings of the 2007 SIAM International Conference on Data Mining, 2007, pp. 443–448.

[21] Albert Bifet, Ricard Gavaldà, Geoff Holmes, and Bernhard Pfahringer, Machine Learning for Data Streams with Practical Examples in MOA, MIT Press, Cambridge, MA, 2018.

[22] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer, MOA: Massive Online Analysis, Department of Computer Science, University of Waikato, 2010.

[23] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà, New Ensemble Methods for Evolving Data Streams, Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2009, pp. 139–148.

[24] Albert Bifet and Richard Kirkby, Massive Online Analysis, The University of Waikato, 2009.

[25] Jock A. Blackard, Comparison of Neural Networks and Discriminant Analysis in Predicting Forest Cover Types, Ph.D. thesis, Department of Forest Sciences, Colorado State University Fort Collins, 1998.

[26] David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

[27] Doug Bond, Joe Bond, Churl Oh, J Craig Jenkins, and Charles Lewis Taylor, Integrated Data for Events Analysis (IDEA): An Event Typology for Automated Events Data Development, Journal of Peace Research 40 (2003), 733–745.

[28] Léon Bottou, On-line Learning and Stochastic Approximations, On-line Learning in Neu- ral Networks (David Saad, ed.), Cambridge University Press, New York, NY, USA, 1998, pp. 9–42.

[29] Leo Breiman, Random Forests, Machine Learning 45 (2001), 5–32.

[30] Dariusz Brzeziński, Block-Based and Online Ensembles for Concept-Drifting Data Streams, Ph.D. thesis, Institute of Computing Science, Poznan University of Technology, 2015.

[31] Dariusz Brzezinski and Jerzy Stefanowski, Reacting to Different Types of Concept Drift: The Accuracy Updated Ensemble Algorithm, IEEE Transactions on Neural Networks and Learning Systems 25 (2014), 81–94.

[32] Michael Buckland and Fredric Gey, The Relationship Between Recall and Precision, Jour- nal of the American Society for Information Science 45 (1994), 12–19.

[33] Cody Buntain, Discovering Credible Events in Near Real Time from Social Media Streams, Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 481–485.

[34] Yongjie Cai, Hanghang Tong, Wei Fan, Ping Ji, and Qing He, Facets: Fast Comprehensive Mining of Coevolving High-order Time Series, Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 79–88.

[35] John Calvo, Context-aware Multi-stream Mining for Predictions in Real-time Monitoring Systems, M.Sc. Thesis. School of Computer Science and Engineering, University of the Andes, 2013.

[36] John Calvo Martinez, Event Mining over Distributed Text Streams, Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 745–746.

[37] Carlos Castillo, Big Crisis Data: Social Media in Disasters and Time-Critical Situations, Cambridge University Press, Cambridge, 2016.

[38] U.S. Chakravarthy, John Grant, and Jack Minker, Foundations of Semantic Query Optimization for Deductive Databases, Foundations of Deductive Databases and Logic Pro- gramming (Jack Minker, ed.), Morgan Kaufmann, Los Altos, CA, 1988, pp. 243–273.

[39] Chen Chen and Vincent Ng, Joint Inference over a Lightly Supervised Information Extraction Pipeline: Towards Event Coreference Resolution for Resource-Scarce Languages, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), 2016, pp. 2913–2920.

[40] Jiaoyan Chen, Freddy Lecue, Jeff Z. Pan, and Huajun Chen, Learning from Ontology Streams with Semantic Concept Drift, Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), 2017, pp. 957–963.

[41] Sven Chojnacki, Christian Ickler, Michael Spies, and John Wiesel, Event Data on Armed Conflict and Security: New Perspectives, Old Challenges, and Some Solutions, Interna- tional Interactions 38 (2012), no. 4, 382–401.

[42] Prafulla Kumar Choubey and Ruihong Huang, Improving Event Coreference Resolution by Modeling Correlations between Event Coreference Chains and Document Topic Structures, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 485–495.

[43] William G. Cochran, Sampling Techniques, 3rd Edition ed., John Wiley & Sons, New York, NY, 1977.

[44] Gianmarco De Francisci Morales and Albert Bifet, SAMOA: Scalable Advanced Massive Online Analysis, Journal of Machine Learning Research 16 (2015), 149–153.

[45] Pedro Domingos and Geoff Hulten, Mining High-speed Data Streams, Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’00) (2000), 71–80.

[46] Rui Máximo Esteves and Chunming Rong, Using Mahout for Clustering Wikipedia’s Latest Articles: A Comparison between k-means and Fuzzy c-means in the Cloud, Proceedings of The IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom), 2011, pp. 565–569.

[47] Ariadna Estévez, Human Rights, Migration, and Social Conflict. Towards a Decolonized Global Justice, 1 ed., Palgrave Macmillan, New York, NY, USA, 2012.

[48] Theodoros Evgeniou and Massimiliano Pontil, Regularized Multi-task Learning, Proceed- ings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 109–117.

[49] Elaine Faria, Isabel Gonçalves, André de Carvalho, and João Gama, Novelty Detection in Data Streams, Artificial Intelligence Review 45 (2015), 235–269.

[50] Valerie Fiolet and Bernard Toursel, Distributed Data Mining, Scalable Computing: Practice and Experience 6 (2002), 349–367.

[51] Peter Flach, Machine Learning: The Art And Science Of Algorithms That Make Sense Of Data, Cambridge University Press, Cambridge, 2012.

[52] Isvani Frías-Blanco, Jose Campo-Avila, Gonzalo Ramos, Rafael Morales-Bueno, Agustin Diaz, and Yaile Caballero Mota, Online and Non-Parametric Drift Detection Methods Based on Hoeffding Bounds, IEEE Transactions on Knowledge and Data Engineering 27 (2015), 810–823.

[53] Guoji Fu, Bo Yuan, Qiqi Duan, and Xin Yao, Representation Learning for Heterogeneous Information Networks via Embedding Events, arXiv:1901.10234 (2019).

[54] Mohamed Medhat Gaber, Advances in Data Stream Mining, Wiley Interdisciplinary Re- views: Data Mining and Knowledge Discovery 2 (2012), 79–85.

[55] João Gama, Knowledge Discovery from Data Streams, CRC Press, Boca Raton, FL, 2010.

[56] ———, A Survey on Learning from Data Streams: Current and Future Trends, Progress in Artificial Intelligence 1 (2012), 45–55.

[57] João Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues, Learning with Drift Detection, Advances in Artificial Intelligence-SBIA 2004 (Ana L. C. Bazzan and Sofiane Labidi, eds.), Springer, Berlin, 2004, pp. 286–295.

[58] João Gama, Raquel Sebastião, and Pedro Pereira Rodrigues, On Evaluating Stream Learning Algorithms, Machine Learning 90 (2013), 317–346.

[59] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia, A Survey on Concept Drift Adaptation, ACM Computing Surveys (CSUR) 46 (2014), 1–44.

[60] Jing Gao, Wei Fan, Jiawei Han, and Philip S. Yu, A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions, Proceedings of the 2007 SIAM International Conference on Data Mining, 2007, pp. 3–14.

[61] Deborah J Gerner, Philip A Schrodt, Omür Yilmaz, and Rajaa Abu-Jabr, Conflict and Mediation Event Observations (CAMEO): A New Event Data Framework for the Analysis of Foreign Policy Interactions, Presented at the Annual Meeting of the International Studies Association, 2002.

[62] Nils Petter Gleditsch, Peter Wallensteen, Mikael Eriksson, Margareta Sollenberg, and Håvard Strand, Armed Conflict 1946-2001: A New Dataset, Journal of Peace Research 39 (2002), no. 5, 615–637.

[63] Heitor Murilo Gomes, Jean Paul Barddal, Fabrício Enembreck, and Albert Bifet, A Survey on Ensemble Learning for Data Stream Classification, ACM Computing Surveys (CSUR) 50 (2017), 1–23.

[64] Paulo M Gonçalves, Silas GT de Carvalho Santos, Roberto SM Barros, and Davi CL Vieira, A Comparative Study on Concept Drift Detectors, Expert Systems with Applications 41 (2014), 8144–8156.

[65] Ralph Grishman, David Westbrook, and Adam Meyers, NYU’s English ACE 2005 System Description, Proceedings of ACE 2005 Evaluation Workshop, 2005, pp. 1–51.

[66] Michael Harries, SPLICE-2 Comparative Evaluation: Electricity Pricing, Tech. report, The University of New South Wales, 1999.

[67] Bradford Heap, Alfred Krzywicki, Susanne Schmeidl, Wayne Wobcke, and Michael Bain, A Joint Human/Machine Process for Coding Events and Conflict Drivers, Advanced Data Mining and Applications (G. Cong, W.-C. Peng, W.E. Zhang, C. Li, and A Sun, eds.), Springer (Cham), 2017, pp. 639–654.

[68] Wassily Hoeffding, Probability Inequalities for Sums of Bounded Random Variables, The Collected Works of Wassily Hoeffding (1994), 409—426.

[69] Liangjie Hong, Byron Dom, Siva Gurumurthy, and Kostas Tsioutsiouliklis, A Time-dependent Topic Model for Multiple Text Streams, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 832–840.

[70] Lifu Huang, Heng Ji, Kyunghyun Cho, Ido Dagan, Sebastian Riedel, and Clare Voss, Zero-Shot Transfer Learning for Event Extraction, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), 2018, pp. 2160– 2170.

[71] Geoff Hulten, Laurie Spencer, and Pedro Domingos, Mining Time-changing Data Streams, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Dis- covery and Data Mining, 2001, pp. 97–106.

[72] Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg, Processing Social Media Messages in Mass Emergency: A Survey, ACM Computing Surveys 47 (2015), 1– 38.

[73] J Craig Jenkins and Thomas V Maher, What Should We Do about Source Selection in Event Data? Challenges, Progress, and Possible Solutions, International Journal of Sociology 46 (2016), 42–57.

[74] Heng Ji, Ralph Grishman, Zheng Chen, and Prashant Gupta, Cross-document Event Extraction and Tracking: Task, Evaluation, Techniques and Challenges, Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), 2009, pp. 166–172.

[75] Hillol Kargupta, Ruchita Bhargava, Kun Liu, Michael Powers, Patrick Blair, Samuel Bushra, James Dull, Kakali Sarkar, Martin Klein, and Mitesh Vasa, VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring, Proceedings of the SIAM International Conference on Data Mining, 2004, pp. 300–311.

[76] Hillol Kargupta, Byung-Hoon Park, Daryl Hershberger, and Erik Johnson, Collective Data Mining: A New Perspective toward Distributed Data Analysis, Advances in Distributed and Parallel Knowledge Discovery (Hillol Kargupta and Philip Chan, eds.), MIT Press, Cambridge, MA, 1999, pp. 133–184.

[77] Chris Kedzie, Fernando Diaz, and Kathleen McKeown, Real-time Web Scale Event Summarization Using Sequential Decision Making, Proceedings of the Twenty-Fifth In- ternational Joint Conference on Artificial Intelligence (IJCAI-16), 2016, pp. 3754–3760.

[78] Gary King and Will Lowe, An Automated Information Extraction Tool for International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design, International Organization 57 (2003), 617–642.

[79] Ralf Klinkenberg and Thorsten Joachims, Detecting Concept Drift with Support Vector Machines, Proceedings of the Seventeenth International Conference on Machine Learning, 2000, pp. 487–494.

[80] J. Zico Kolter and Marcus A. Maloof, Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts, Machine Learning Research 8 (2007), 2755–2790.

[81] Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, and Arinto Murdopo, VHT: Vertical Hoeffding Tree, Proceedings of the IEEE International Conference on Big Data, 2016, pp. 915–922.

[82] Alfred Krzywicki and Wayne Wobcke, Incremental E-Mail Classification and Rule Suggestion Using Simple Term Statistics, AI 2009: Advances in Artificial Intelligence (A. Nicholson and X. Li, eds.), Springer, Berlin, 2009, pp. 250–259.

[83] ———, Exploiting Concept Clumping for Efficient Incremental E-mail Categorization, Advanced Data Mining and Applications (Longbing Cao, Yong Feng, and Jiang Zhong, eds.), Springer, Berlin, 2010, pp. 244–258.

[84] Alfred Krzywicki, Wayne Wobcke, Michael Bain, John Calvo Martinez, and Paul Compton, Data Mining for Building Knowledge Bases: Techniques, Architectures and Applications, The Knowledge Engineering Review 31 (2016), 97–123.

[85] Alfred Krzywicki, Wayne Wobcke, Michael Bain, Susanne Schmeidl, and Bradford Heap, A Knowledge Acquisition Method for Event Extraction and Coding Based on Deep Patterns, Knowledge Management and Acquisition for Intelligent Systems (Kenichi Yoshida and Maria Lee, eds.), Springer (Cham), 2018, pp. 16–31.

[86] Kazuto Kubota, Akihiko Nakase, Hiroshi Sakai, and Shigeru Oyanagi, Parallelization of Decision Tree Algorithm and its Performance Evaluation, Proceedings of The Fourth In- ternational Conference/Exhibition on High Performance Computing in the Asia-Pacific Re- gion, vol. 2, 2000, pp. 574–579.

[87] Giridhar Kumaran and James Allan, Text Classification and Named Entities for New Event Detection, Proceedings of the 27th Annual International Conference on Research and De- velopment in Information Retrieval (SIGIR) (2004), 297–304.

[88] Gary LaFree and Laura Dugan, Handbook of Computational Approaches to Counterterrorism, ch. The Global Terrorism Database, 1970-2010, pp. 3–22, Springer, New York, NY, 2013.

[89] Jennifer Lautenschlager and A. Shilliday, Data for a Global ICEWS and Ongoing Research, Presented at 2nd International Conference on Cross-Cultural Decision Making, 2012.

[90] Kalev Leetaru and Philip A Schrodt, GDELT: Global Data on Events, Location, and Tone, 1979-2012, Presented at ISA Annual Convention, vol. 2, 2013, pp. 1–49.

[91] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press, Cambridge, 2014.

[92] Jiwei Li, Alan Ritter, Claire Cardie, and Eduard H Hovy, Major Life Event Extraction from Twitter Based on Congratulations/Condolences Speech Acts, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP-14), 2014, pp. 1997–2007.

[93] Qi Li, Heng Ji, and Liang Huang, Joint Event Extraction via Structured Prediction with Global Features, Proceedings of the 51st Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), 2013, pp. 73–82.

[94] Wentian Li, Random Texts Exhibit Zipf’s-law-like Word Frequency Distribution, IEEE Transactions on Information Theory 38 (1992), 1842–1845.

[95] R. Lima, B. Espinasse, and F. Freitas, Relation Extraction from Texts with Symbolic Rules Induced by Inductive Logic Programming, Proceedings of the 27th International Confer- ence on Tools with Artificial Intelligence, 2015, pp. 194–201.

[96] Rinaldo Lima, Bernard Espinasse, and Fred Freitas, OntoILPER: An Ontology-and Inductive Logic Programming-based System to Extract Entities and Relations from Text, Knowledge and Information Systems 56 (2018), 223–255.

[97] Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun, Nugget Proposal Networks for Chinese Event Detection, Proceedings of the 56th Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), 2018, pp. 1565–1574.

[98] Ying Liu, Alok Choudhary, Jianhong Zhou, and Ashfaq Khokhar, A Scalable Distributed Stream Mining System for Highway Traffic Data, Proceedings of the International Confer- ence on Knowledge Discovery in Databases, 2006, pp. 309–321.

[99] John W. Lloyd and Rodney W. Topor, A Basis for Deductive Database Systems, The Journal of Logic Programming 2 (1985), 93–109.

[100] Rushi Longadge and Snehalata Dongre, Class Imbalance Problem in Data Mining Review, arXiv:1305.1707 (2013).

[101] Raymond W. Mack and Richard C. Snyder, The Analysis of Social Conflict—Toward an Overview and Synthesis, Conflict Resolution 1 (1957), no. 2, 212–248.

[102] Sanjay Manchanda and David Scott Warren, A Logic-based Language for Database Updates, Foundations of Deductive Databases and Logic Programming (Jack Minker, ed.), Morgan Kaufmann, Los Altos, CA, 1988, pp. 363–394.

[103] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval, vol. 1, Cambridge University Press, Cambridge, 2008.

[104] Charles McClelland, World Event/Interaction Survey, 1966-1978, Ann Arbor, MI: Inter-University Consortium for Political and Social Research (1999), 1–22.

[105] Richard McCreadie, Craig Macdonald, Iadh Ounis, Miles Osborne, and Sasa Petrovic, Scalable Distributed Event Detection for Twitter, Proceedings of the IEEE International Conference on Big Data, 2013, pp. 543–549.

[106] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781 (2013).

[107] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, Distributed Representations of Words and Phrases and Their Compositionality, Advances in Neural Information Processing Systems (NIPS), 2013, pp. 3111–3119.

[108] Sean Monahan and Mary Brunson, Qualities of Eventiveness, Proceedings of the Sec- ond Workshop on EVENTS: Definition, Detection, Coreference, and Representation, 2014, pp. 59–68.

[109] Duc T Nguyen and Jason J Jung, Real-time Event Detection on Social Data Stream, Mobile Networks and Applications 20 (2015), 475–486.

[110] Thien Huu Nguyen, Deep Learning for Information Extraction, Ph.D. thesis, Department of Computer Science, New York University, 2018.

[111] Thien Huu Nguyen and Ralph Grishman, Graph Convolutional Networks with Argument-Aware Pooling for Event Detection, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018, pp. 5900–5907.

[112] Clayton Norris, Philip Schrodt, and John Beieler, PETRARCH2: Another Event Coding Program, The Journal of Open Source Software 2 (2017), 1–133.

[113] Andrei Olariu, Efficient Online Summarization of Microblogging Streams, Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, 2014, pp. 236–240.

[114] Nikunj C Oza, Online Ensemble Learning, Ph.D. thesis, Department of Electrical Engineer- ing and Computer Science, University of California, Berkeley, 2001, p. 1109.

[115] ———, Online Bagging and Boosting, Proceedings of The IEEE International Conference on Systems, Man and Cybernetics, 2005, pp. 2340–2345.

[116] E.S. Page, Continuous Inspection Schemes, Biometrika 41 (1954), 100–115.

[117] Sinno Jialin Pan and Qiang Yang, A Survey on Transfer Learning, IEEE Transactions on Knowledge and Data Engineering 22 (2010), 1345–1359.

[118] Byung-Hoon Park and Hillol Kargupta, Distributed Data Mining, Algorithms, Systems, and Applications (N. Ye, ed.), Lawrence Erlbaum Associates, Mahwah, NJ, 2002, pp. 341–358.

[119] Ellie Pavlick, Heng Ji, Xiaoman Pan, and Chris Callison-Burch, The Gun Violence Database: A New Task and Data Set for NLP, Proceedings of the International Conference on Empirical Methods in Natural Language Processing (EMNLP-16), 2016, pp. 1018–1024.

[120] Russel Pears, Sripirakas Sakthithasan, and Yun Sing Koh, Detecting Concept Change in Dynamic Data Streams, Machine Learning 97 (2014), 259–293.

[121] Saša Petrović, Miles Osborne, and Victor Lavrenko, Streaming First Story Detection with Application to Twitter, Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 181–189.

[122] Sasa Petrovic, Miles Osborne, and Victor Lavrenko, The Edinburgh Twitter Corpus, Pro- ceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media, 2010, pp. 25–26.

[123] Sasa Petrovic, Miles Osborne, Richard McCreadie, Craig Macdonald, Iadh Ounis, and Luke Shrimpton, Can Twitter Replace Newswire for Breaking News?, Proceedings of The 7th International AAAI Conference On Weblogs And Social Media (ICWSM), 2013.

[124] Maxime Peyrard and Judith Eckle-Kohler, Optimizing an Approximation of Rouge-A Problem-reduction Approach to Extractive Multi-Document Summarization, Proceedings

of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1825–1836.

[125] Martin F. Porter, An Algorithm for Suffix Stripping, Program 14 (1980), no. 3, 130–137.

[126] Andreas Prodromidis, Philip Chan, and Salvatore Stolfo, Meta-learning in Distributed Data Mining Systems: Issues and Approaches, Advances in Distributed and Parallel Knowledge Discovery 3 (2000), 81–114.

[127] Dean Pruitt, Sung Hee Kim, and Jeffrey Rubin, Social Conflict: Escalation, Stalemate, and Settlement., 3rd edition ed., McGraw-Hill, New York, NY, 2004.

[128] Willard V. Quine, Events and Reification, Actions and Events: Perspectives on the Philoso- phy of Donald Davidson (E. LePore and B. P. McLaughlin, eds.), Blackwell, Oxford, 1985, pp. 162–171.

[129] Gabriel A. Radvansky and Jeffrey M. Zacks, Event Perception, Wiley Interdisciplinary Re- views: Cognitive Science 2 (2011), no. 6, 608–620.

[130] Clionadh Raleigh, Andrew Linke, Håvard Hegre, and Joakim Karlsen, Introducing ACLED: An Armed Conflict Location and Event Dataset: Special Data Feature, Journal of Peace Research 47 (2010), 651–660.

[131] Thirunavukarasu Ramkumar, Shanmugasundaram Hariharan, and Shanmugam Selvamuthukumaran, A Survey on Mining Multiple Data Sources, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 3 (2013), 1–11.

[132] Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi, Event2Mind: Commonsense Inference on Events, Intents, and Reactions, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), 2018, pp. 463–473.

[133] Raymond Reiter, On Integrity Constraints, Proceedings of the 2nd Conference on Theoret- ical Aspects of Reasoning about Knowledge, 1988, pp. 97–111.

[134] Jorge A Restrepo, Michael Spagat, and Juan F Vargas, Special Data Feature; The Severity of the Colombian Conflict: Cross-country Datasets Versus New Micro-data, Journal of Peace Research 43 (2006), 99–115.

[135] Alan Ritter, Sam Clark, and Oren Etzioni, Named Entity Recognition in Tweets: An Experimental Study, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-11), 2011, pp. 1524–1534.

[136] Alan Ritter, Oren Etzioni, and Sam Clark, Open Domain Event Extraction from Twitter, Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 1104–1112.

[137] Thomas Ruttig, In Search of a Peace Process: A ‘New’ HPC and Ultimatum for Taliban, Afghanistan Analysis Network 26 (2016), 1–8.

[138] Gerard Salton and Christopher Buckley, Term-weighting Approaches in Automatic Text Retrieval, Information Processing & Management 24 (1988), 513–523.

[139] Biplab Kumer Sarker, Toshiya Hirata, Kuniaki Uehara, and Virendra C Bhavsar, Mining Association Rules from Multi-stream Time Series Data on Multiprocessor Systems, Parallel and Distributed Processing and Applications (ISPA-05) (Y. Pan, D. Chen, M. Guo, J. Cao, and J. Dongarra, eds.), Springer, Berlin, 2005, pp. 662–667.

[140] Susanne Schmeidl, From Root Cause Assessment to Preventive Diplomacy: Possibilities and Limitations of the Early Warning of Forced Migration, Ph.D. thesis, Department of Sociology, Ohio State University, 1995.

[141] Susanne Schmeidl, Internal Displacement in Afghanistan: The Tip of the Iceberg, Afghanistan-Challenges and Prospects (Srinjoy Bose, Nishank Motwani, and William Maley, eds.), Taylor & Francis, London, 2017, pp. 169–187.

[142] Philip Schrodt and Jay Yonamine, A Guide to Event Data: Past, Present, and Future, All Azimuth: A Journal of Foreign Policy and Peace 2 (2013), 5–22.

[143] Philip A. Schrodt, Kansas Event Data System (KEDS), Tech. Report Version 1.0, Depart- ment of Political Science. University of Kansas, 1998.

[144] Philip A Schrodt, Automated Production of High-volume, Real-time Political Event Data, Presented at the American Political Science Association Annual Meeting, 2010.

[145] ———, CAMEO: Conflict and Mediation Event Observations Event and Actor Codebook, Tech. report, Department of Political Science. Pennsylvania State University, March 2012.

[146] Philip A Schrodt, John Beieler, and Muhammed Idris, Three’s a Charm?: Open Event Data Coding with EL:DIABLO, PETRARCH, and the Open Event Data Alliance, Proceedings of the International Studies Association Annual Convention (ISA-14), 2014, pp. 1–27.

[147] Philip A. Schrodt and David Van Brackle, Handbook of Computational Approaches to Counterterrorism, ch. Automated Coding of Political Event Data, pp. 23–49, Springer, New York, NY, 2013.

[148] Steven Seidman, Contested Knowledge, sixth edition ed., Wiley-Blackwell (Sussex), 2016.

[149] Lei Sha, Jing Liu, Chin-Yew Lin, Sujian Li, Baobao Chang, and Zhifang Sui, RBPB: Regularization-Based Pattern Balancing Method for Event Extraction, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1224–1234.

[150] Claude E Shannon, A Mathematical Theory of Communication, The Bell System Technical Journal 27 (1948), 379–423.

[151] Robert H Shumway and David S Stoffer, Time Series Analysis and its Applications, Springer Science & Business Media, Berlin, 2013.

[152] Parneeta Sidhu and MPS Bhatia, A Novel Online Ensemble Approach to Handle Concept Drifting Data Streams: Diversified Dynamic Weighted Majority, International Journal of Machine Learning and Cybernetics 9 (2015), 1–25.

[153] Blaz Skrlj, Matej Martinc, Jan Kralj, Nada Lavrac, and Senja Pollak, tax2vec: Constructing Interpretable Features from Taxonomies for Short Text Classification, ArXiv abs/1902.00438 (2019).

[154] Irena Spasic, Sophia Ananiadou, John McNaught, and Anand Kumar, Text Mining and Ontologies in Biomedicine: Making Sense of Raw Text, Briefings in Bioinformatics 6 (2005), 239–251.

[155] Eduardo J Spinosa, André Ponce de Leon F de Carvalho, and João Gama, OLINDDA: A Cluster-based Approach for Detecting Novelty and Concept Drift in Data Streams, Proceedings of the 2007 ACM Symposium on Applied Computing, 2007, pp. 448–452.

[156] KG Srinivasa and Anil Kumar Muppalla, Guide to High Performance Distributed Computing: Case Studies with Hadoop, Scalding and Spark, Springer, Cham, 2015.

[157] Michael Steinbach, George Karypis, and Vipin Kumar, A Comparison of Document Clustering Techniques, Proceedings of the KDD Workshop on Text Mining, 2000, pp. 525–526.

[158] Zachary C. Steinert-Threlkeld, Twitter as Data, Elements in Quantitative and Computational Methods for the Social Sciences, Cambridge University Press, Cambridge, 2018.

[159] Moana Stirling, Yun Sing Koh, Philippe Fournier-Viger, and Sri Devi Ravana, Concept Drift Detector Selection for Hoeffding Adaptive Trees, AI 2018: Advances in Artificial Intelligence (Tanja Mitrovic, Bing Xue, and Xiaodong Li, eds.), Springer, Cham, 2018, pp. 730–736.

[160] V. Subrahmanian (ed.), Handbook of Computational Approaches to Counterterrorism, vol. 1, Springer, New York, NY, 2013.

[161] Riccardo Tommasini, Pieter Bonte, Emanuele Della Valle, Erik Mannens, Filip De Turck, and Femke Ongenae, Towards Ontology-Based Event Processing, OWL: Experiences and Directions – Reasoner Evaluation (2017), 115–127.

[162] Grigorios Tsoumakas and Ioannis Vlahavas, Distributed Data Mining, Database Technologies: Concepts, Methodologies, Tools, and Applications (John Erickson, ed.), IGI Global, Hershey, PA, 2009, pp. 157–164.

[163] S Urmela and M Nandhini, Approaches and Techniques of Distributed Data Mining: A Comprehensive Study, International Journal of Engineering and Technology 9 (2017), 63–76.

[164] Jayashree Venkatesh, Pairwise Document Similarity using an Incremental Approach to TF-IDF, M.Sc. thesis, Department of Computer Science, North Carolina State University, 2010.

[165] Abraham Wald, Sequential Analysis, Courier Corporation, North Chelmsford, MA, 1973.

[166] Michael D Ward, Andreas Beger, Josh Cutler, Matt Dickenson, Cassy Dorff, and Ben Radford, Comparing GDELT and ICEWS Event Data, Analysis 21 (2013), 267–297.

[167] Yifang Wei, Lisa Singh, Brian Gallagher, and David Buttler, Overlapping Target Event and Story Line Detection of Online Newspaper Articles, Proceedings of the IEEE International Conference on Data Science and Advanced Analytics, 2016, pp. 222–232.

[168] Ian H Witten, Eibe Frank, and Mark A Hall, Data Mining: Practical Machine Learning Tools and Techniques, third ed., Morgan Kaufmann, Burlington, MA, 2011.

[169] Siegfried Wolf, The Battle over Kunduz and its Implications, Social Science Research Network (2015), 1–3.

[170] Chen Xing, Yuan Wang, Jie Liu, Yalou Huang, and Wei-Ying Ma, Hashtag-Based Sub-Event Discovery Using Mutually Generative LDA in Twitter, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), 2016, pp. 2666–2672.

[171] Yiming Yang, Jian Zhang, Jaime Carbonell, and Chun Jin, Topic-conditioned Novelty Detection, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 688–693.

[172] Wenlin Yao and Ruihong Huang, Temporal Event Knowledge Acquisition via Identifying Narratives, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 537–547.

[173] James E Yonamine, Predicting Future Levels of Violence in Afghanistan Districts using GDELT, Unpublished Manuscript (2013).

[174] Mohammed J Zaki and Ching-Tien Ho, Large-scale Parallel Data Mining, Springer Science & Business Media, Berlin, 2000.

[175] Li Zeng, Ling Li, Lian Duan, Kevin Lu, Zhongzhi Shi, Maoguang Wang, Wenjuan Wu, and Ping Luo, Distributed Data Mining: A Survey, Information Technology and Management 13 (2012), 403–409.

[176] Deyu Zhou, Tianmeng Gao, and Yulan He, Jointly Event Extraction and Visualization on Twitter via Probabilistic Modelling, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 269–278.

[177] Indrė Žliobaitė, Adaptive Training Set Formation, Ph.D. thesis, Department of Physical Science and Informatics, Vilniaus Universitetas, 2010.

[178] Indrė Žliobaitė, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes, Active Learning with Evolving Streaming Data, Machine Learning and Knowledge Discovery in Databases (Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, eds.), Springer, Berlin, 2011, pp. 597–612.

[179] Indrė Žliobaitė, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes, Active Learning with Evolving Streaming Data, Proceedings of the European Conference of Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2011, pp. 597–612.

APPENDICES

A Table of Equivalences Between JET and AfPak Categories

AfPak Category | JET Type | JET Subtype
Abduct | Movement | Transport
Accept | Justice | Execute
Accuse | Justice | Charge-Indict
Alliance | Contact | Meet
Announce | Business | End-Org
Arrest | Justice | Arrest-Jail
Arson | Conflict | Attack
Assault | Conflict | Attack
Attack | Conflict | Attack
Capture | Justice | Arrest-Jail
Coerce | Conflict | Demonstrate
Complain | Contact | Meet
Consult | Contact | Meet
Cooperation | Contact | Meet
Corruption | Justice | Charge-Indict
Critique | Business | End-Org
Declare | Contact | Phone-Write
Demand action | Contact | Phone-Write
Demonstration | Conflict | Demonstrate
Deny | Contact | Phone-Write
Deploy | Conflict | Demonstrate
Detain | Justice | Arrest-Jail
Displace | Movement | Transport
Engage in political cooperation | Contact | Meet
Explode | Conflict | Attack
Express intent | Contact | Meet
Fight | Conflict | Attack
Freeing or liberate | Justice | Release-Parole
Gain territory | Conflict | Attack
Helicopter crash | Conflict | Attack
Increase aid | Contact | Meet
Injure | Life | Injure
Intensify fight | Conflict | Attack
Internally Displaced Persons returning | Movement | Transport
Intimidate | Conflict | Attack
Investigate | Justice | Sentence
Kill | Life | Die
Lose territory | Conflict | Attack
Make a public statement | Business | End-Org
Military operation | Conflict | Demonstrate
Negotiate | Contact | Meet
Protest | Conflict | Demonstrate
Provide international aid | Contact | Phone-Write
Recruit | Contact | Meet
Refugee repatriation | Movement | Transport
Reject | Contact | Phone-Write
Request assistance | Contact | Phone-Write
Rescue | Justice | Release-Parole
Shot down | Conflict | Attack
Steal | Conflict | Attack
Torture | Life | Injure
Violent Crime | Life | Die
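Operationally, this equivalence table is a plain lookup. The listing below is a minimal sketch of one way it might be held in memory during conversion, assuming a dictionary representation; the names AFPAK_TO_JET and afpak_to_jet are illustrative, not identifiers from the thesis implementation, and only a few rows of the table are reproduced.

    # Minimal sketch: the AfPak -> JET category equivalences as a lookup
    # table (only a few rows of the table above are reproduced here).
    from typing import Optional, Tuple

    AFPAK_TO_JET = {
        "Abduct": ("Movement", "Transport"),
        "Arrest": ("Justice", "Arrest-Jail"),
        "Kill": ("Life", "Die"),
        "Protest": ("Conflict", "Demonstrate"),
        "Torture": ("Life", "Injure"),
    }

    def afpak_to_jet(category: str) -> Optional[Tuple[str, str]]:
        """Return the (JET type, JET subtype) pair for an AfPak category,
        or None when the category has no row in the table."""
        return AFPAK_TO_JET.get(category)

    assert afpak_to_jet("Kill") == ("Life", "Die")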

B Table of Equivalences Between JET and AfPak Argument Roles

JET Role | AfPak Equivalence
Target | Target
Entity | Actor
Place | Location
Person | Actor or Target, depending on the event’s relation with the trigger
Defendant | Actor
Seller | Target
Origin | Actor
Artifact | Actor
Instrument | Actor
Agent | Actor
Victim | Target
Prosecutor | Actor
Destination | Location
Attacker | Actor
Buyer | Actor
Org | Target
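All of the rows above are unconditional except Person, which resolves to Actor or Target depending on how the mention relates to the event trigger. The following is a minimal sketch of that rule, assuming the caller supplies the relation as a boolean flag; the names jet_role_to_afpak and acts_on_trigger are illustrative, not from the thesis code.

    # Sketch of the JET -> AfPak role conversion. "Person" is the one
    # conditional row: Actor when the mention initiates the trigger
    # event, Target when it is affected by it.
    JET_TO_AFPAK_ROLE = {
        "Target": "Target", "Entity": "Actor", "Place": "Location",
        "Defendant": "Actor", "Seller": "Target", "Origin": "Actor",
        "Artifact": "Actor", "Instrument": "Actor", "Agent": "Actor",
        "Victim": "Target", "Prosecutor": "Actor", "Destination": "Location",
        "Attacker": "Actor", "Buyer": "Actor", "Org": "Target",
    }

    def jet_role_to_afpak(jet_role: str, acts_on_trigger: bool = False) -> str:
        if jet_role == "Person":
            return "Actor" if acts_on_trigger else "Target"
        return JET_TO_AFPAK_ROLE[jet_role]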

C Table of Equivalences Between JET and ACLED Categories

JET Type | JET Subtype | ACLED Category
Business | End-Org | Strategic development
Conflict | Attack | Battle-No change of territory
Conflict | Demonstrate | Riots/Protests
Contact | Phone-Write | Strategic development
Contact | Meet | Strategic development
Justice | Release-Parole | –
Justice | Sentence | –
Justice | Sue | –
Justice | Charge-Indict | –
Justice | Execute | –
Justice | Arrest-Jail | –
Life | Injure | Violence against civilians
Life | Die | Violence against civilians
Movement | Transport | Non-violent transfer of territory
Personnel | Elect | –
Personnel | End-Position | –
Transaction | Transfer-Ownership | Battle-Government regains territory
Transaction | Transfer-Money | Non-violent transfer of territory
– | Demonstrate | Riots/Protests
– | Meet | Strategic development
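Unlike the AfPak mapping, several JET (type, subtype) pairs have an empty ACLED cell (marked “–” above), so any converter has to tolerate missing entries. Below is a minimal sketch under that assumption; the names are illustrative, and None marks a pair with no ACLED equivalent.

    # Sketch of the JET -> ACLED lookup keyed by (type, subtype). Pairs
    # whose ACLED cell is empty map to None and would be skipped when
    # producing ACLED-style records. Partial listing only.
    JET_TO_ACLED = {
        ("Conflict", "Attack"): "Battle-No change of territory",
        ("Conflict", "Demonstrate"): "Riots/Protests",
        ("Life", "Die"): "Violence against civilians",
        ("Movement", "Transport"): "Non-violent transfer of territory",
        ("Justice", "Arrest-Jail"): None,  # empty ACLED cell in the table
    }

    def jet_to_acled(jet_type, jet_subtype):
        return JET_TO_ACLED.get((jet_type, jet_subtype))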

D AfPak Ontology Values

This section lists all of the ontology values used in the AfPak Ontology. The values are given in the following table; a short note on how the labels are structured follows it.

Ontology Type | Description
Abduct.Recovery.aka: kidnap recovery | When people are rescued from being kidnapped.
Abduct.aka: kidnap | When civilians were reported as kidnapped or abducted.
Accept.Accept policy | When a government or institution accepted a new policy.
Accept.Accept reform | When a government or institution accepted an existing policy change.
Accept.Assistance | When a government or institution accepted humanitarian assistance.
Accept.aka: approve | When a government or institution accepted a new policy or legal action.
Accuse.Of crime, corruption | When a person is accused of corruption or crime.
Accuse.of aggression | When a person is accused of verbal aggression.
Accuse.of human rights abuses | When a person is accused of human rights violations.
Accuse.of war crimes | When a person is accused of war crimes.
Accuse.not specified below | When a person from a governmental agency or institution is accused of something not covered above.
Address corruption | When an organisation judges or adopts anti-corruption policies.
Alliance.Break | When two different governments or parties break an alliance.
Alliance.Change | When two different governments or parties change an alliance to other parties.
Alliance.Form | When two different governments or parties make a strategic alliance.
Announce | When a governmental institution announces a national policy or economic measure.
Announce.Financial Reform | When a governmental institution announces a financial reform in the form of taxes, pensions or similar.
Appeal for aid.Appeal for economic aid | When an institution asks for economic help.
Appeal for aid.Appeal for humanitarian aid | When an institution asks for humanitarian help.
Appeal for aid.Appeal for military aid | When an institution asks for military assistance.
Appeal for aid.Appeal for military protection | When an institution asks for peacekeeping or a peacekeeping army (UN army).
Arrest.aka: Detain | When a person or group of persons has been arrested.
Arson | When arson was reported at a specific location.
Assault | When a person or group of persons takes violent control of a particular target.
Assault.Sexual assault.aka: rape | When a person or group of persons commits a violent sexual assault on a victim.
Attack.Airstrike | Airstrike attack.
Attack.Ambush | Ambush attack.
Attack.Bombing | Bombing attack.
Attack.Complex.aka: coordinated attack | Involves different parties coordinated at different locations.
Attack.Drone strike.aka: Drone | Drone attack, usually in the form of a bomb attack.
Attack.Green on Blue | –
Attack.Suicide | Suicide attack.
Attack | Other types of attack not mentioned above.
Border Close | Self-explanatory; for locations closing borders.
Border Open | Self-explanatory; for locations opening borders.
Break ceasefire | When groups in conflict break a previous ceasefire agreement.
Break.Contract | When two or more organisations break economic deals.
Break.Peace treaty | When two or more parties in conflict break a formal peace agreement.
Capture | When a person, group or army is being captured.
Capture.Weapons | When a fighting group has illegally captured weapons from counterparts.
Charge.Abduct | When a group or person is legally charged with the crime of abduction.
Charge.Corruption | When a person is being legally charged with the crime of corruption.
Charge.Sexual assault | When a person is being legally charged with the crime of sexual assault.
Claim responsibility | When a group or person admits responsibility for an action or event.
Coerce | When a group or person attempts coercive actions over a target (usually civilians).
Coerce.Attack cybernetically | When a group or person performs cyber-security attacks with the aim of taking down an online service.
Coerce.Ban political parties or politicians | When a group or person is banned from belonging to political parties.
Coerce.Confiscate property | When a government or institution confiscates property.
Coerce.Destroy property | When a government or institution destroys property.
Coerce.Expel or deport individuals | When a government or institution expels or deports individuals. Extradition is also included in this category.
Coerce.Impose administrative sanctions, not specified below | When a government or institution imposes sanctions, usually economic.
Coerce.Impose curfew | When a government imposes a curfew.
Coerce.Impose restrictions on political freedoms | When a government or institution imposes restrictions on political freedoms.
Coerce.Impose state of emergency or martial law | When a government imposes a state of emergency or martial law.
Coerce.Seize or damage property, not specified below | When a group or institution seizes property, usually of military use.
Coerce.Use tactics of violent repression | When a group or institution performs violent repression actions, such as protest counteractions.
Coerce.not specified below | Self-explanatory.
Complain | Self-explanatory.
Complain.officially | Diplomatic complaint made by one country’s governmental institution against another country’s government.
Condemn | Public statement from a public figure condemning violent events.
Condemn.Attack | Self-explanatory.
Condemn.Attack.Airstrike | Self-explanatory.
Condemn.Attack.Drone strike | Self-explanatory.
Condemn.Attack.Suicide | Self-explanatory.
Condemn.Human shielding | Self-explanatory.
Condemn.Kidnap | Self-explanatory.
Condemn.Killing aid workers | Self-explanatory.
Condemn.Killing civilians | Self-explanatory.
Condemn.Killing civilians.Women and children | Self-explanatory.
Condemn.Killing elders | Self-explanatory.
Condemn.Killing journalist | Self-explanatory.
Condemn.Killing religious figure | Self-explanatory.
Conquer | To gain a specific territory.
Consult | To ask for strategic consulting.
Consult.Discuss by telephone | Self-explanatory.
Consult.Engage in negotiation | To ask for strategic infrastructure or diplomatic negotiations.
Consult.Make a visit | Self-explanatory.
Consult.Mediate | Self-explanatory.
Consult.Meet at a third location | Self-explanatory.
Consult.not specified below | Self-explanatory.
Cooperation | To engage in political or military cooperation.
Cooperation.Counter-terrorism | To engage in political and military cooperation for counter-terrorism.
Cooperation.Military cooperation | To engage in military cooperation.
Corruption | Text input about corruption cases.
Critique | To critique a government’s political actions.
Death sentence | Death sentence given to any person, usually after a judgement process.
Declare | To formally declare or change a political status.
Declare.Peace | Self-explanatory.
Declare.Terrorist organisation | To formally declare the creation of a new terrorist group.
Declare.Victory | Self-explanatory.
Declare.War | Self-explanatory.
Demand action | To express via public communication a proof of action towards peace processes, such as asking for hostage liberations.
Demonstration.Protesting | Self-explanatory.
Demonstration.Supporting | Self-explanatory.
Deny.Deny responsibility | To deny responsibility for an action, such as a terrorist attack.
Deploy.Military | Deployment of military weapons in a certain location.
Destroy.Weapons | To destroy weapons of a certain target.
Displace.aka: Flee | To escape, or to leave an original settlement, usually civilians.
Economic assistance | To give economic assistance between institutions.
Embed | To attach civilians in a conflict, usually journalists.
Engage in political cooperation | Self-explanatory.
Explode | To explode an explosive artefact.
Express intent.Address corruption | Self-explanatory.
Extort | Self-explanatory.
Fight | Self-explanatory.
Fight.Border clash | Marked when a reported fight between groups took place at country borders.
Fight.Employ aerial weapons | Fight reported in which air artillery was used.
Fight.Employ precision-guided aerial munitions | Fight where precision-guided air weaponry was used.
Fight.Employ remotely piloted aerial munitions | Fight in which drones were used as weapons.
Fight.Ground combat | Self-explanatory.
Fight.Impose blockade, restrict movement | Self-explanatory; usually targeted at civilians.
Fight.Occupy territory | Self-explanatory.
Fight.Violate ceasefire | Self-explanatory.
Fight.With artillery and tanks | Self-explanatory.
Fight.With small arms and light weapons | Self-explanatory.
Foreign investments | Self-explanatory.
Gain territory.aka: Seize; Capture | Self-explanatory.
Helicopter crash | Self-explanatory.
Hijack | Terrorist aerial or technological hijacking.
Human shielding | Self-explanatory.
Humanitarian assistance | Used when institutions give assistance to any humanitarian crisis found at any location.
Impose sanctions | Sanctions imposed between governmental agencies.
Increase aid.Economic assistance | Self-explanatory.
Increase aid.Humanitarian assistance | Used when humanitarian assistance was already underway, but increased as a result of additional conflict escalations.
Increase aid.Military assistance | Self-explanatory.
Injure.aka: Wound | Self-explanatory.
Intensify fight | Self-explanatory.
Internally Displaced Persons returning | Self-explanatory.
Intimidate.aka: threat, threaten | Self-explanatory.
Investigate | Self-explanatory.
Investigate.Investigate crime, corruption | Self-explanatory.
Investigate.Investigate human rights abuses | Self-explanatory.
Investigate.Investigate military action | Self-explanatory.
Investigate.Investigate war crimes | Self-explanatory.
Join peace process.aka: reconcile | Self-explanatory. Marked when official reconciliation reports are made.
Kill.Assassinate | Self-explanatory.
Kill.Battle | Marked when the killing was the result of a battle.
Kill.Behead | Marked when the target was reported as beheaded.
Kill.Execute | Marked when the target was reported as executed, but not beheaded.
Kill.Kill aid workers | Marked when aid workers were targeted and killed.
Kill.Kill civilians | Marked when civilians were targeted and killed.
Kill.Kill civilians.Women and children | Marked when women and/or children were targeted and killed.
Kill.Kill elders | Marked when elders were targeted and killed.
Kill.Kill journalist | Marked when journalists were targeted and killed.
Kill.Kill religious figure | Marked when religious figures were targeted and killed.
Kill.aka: Die, Dead, Casualty, Casualties, Death | Marked when no details can be extracted from the above categories.
Launch.Rocket | Self-explanatory.
Lose territory.aka: Retreat | Self-explanatory.
Make a public statement.Acknowledge or claim responsibility | Self-explanatory. Usually found in battle or terrorist attack claims.
Make a public statement.Consider policy option | Self-explanatory.
Make a public statement.Decline comment | When a public figure declined or abstained from commenting on an issue.
Make a public statement.Deny responsibility | Self-explanatory.
Make a public statement.Engage in symbolic act | Self-explanatory.
Make a public statement.Express accord | Self-explanatory.
Make a public statement.Make empathetic comment | Self-explanatory.
Make a public statement.Make optimistic comment | Self-explanatory.
Make a public statement.Make pessimistic comment | Self-explanatory.
Make a public statement.aka: Say | Used when public statements did not belong to the categories above.
Military assistance | Self-explanatory.
Military operation.Begin | Self-explanatory.
Military operation.End | Self-explanatory.
Military operation.Launch | Self-explanatory.
Military operation | Self-explanatory. Used if no details are specified about the start, end or launch of any specific operation.
Military protection or peacekeeping assistance | –
Mistreat | Self-explanatory.
Negotiate.Ceasefire | Self-explanatory.
Negotiate.Contract | Self-explanatory.
Negotiate.Formal agreement | Self-explanatory.
Negotiate.Peace treaty | Self-explanatory.
Negotiate | Used when negotiations did not belong to the categories above.
Night raids | Self-explanatory.
Poppy eradication.aka: Opium eradication | Self-explanatory.
Praise | Self-explanatory.
Prevent.Assault | Self-explanatory.
Prevent.Attack | Self-explanatory.
Prevent.aka: Foil | Self-explanatory. Used for the prevention of other conflict-based events apart from assaults and attacks.
Protest.Attack.Airstrike | Protest event against airstrikes.
Protest.Attack.Suicide | Protest event against suicide attacks.
Protest.Fight | Protest event against fights or battles.
Protest.Human shielding | Protest event against human shielding events.
Protest.Kidnap | Protest event against kidnapping.
Protest.Killing aid workers | Protest event against the killing of aid workers.
Protest.Killing civilians.Women and children | Protest event against the killing of women and children.
Protest.Killing elders | Protest event against the killing of elders.
Protest.Killing journalist | Protest event against the killing of journalists.
Protest.Night raids | Protest event against night raids.
Protest.Poppy eradication | Protest event against poppy eradication.
Recruit.Forced | Self-explanatory.
Reduce or stop aid.Economic assistance | Self-explanatory.
Reduce or stop aid.Humanitarian assistance | Self-explanatory.
Reduce or stop aid.Military assistance | Self-explanatory.
Refugee repatriation | Self-explanatory.
Reject | Self-explanatory.
Reject.Reject assistance | Self-explanatory.
Reject.Reject claim | Self-explanatory.
Reject.Reject policy | Self-explanatory.
Reject.Reject reform | Self-explanatory.
Request assistance | Self-explanatory.
Rescue.aka: rescued; rescues | Self-explanatory.
Sale.Weapons | Self-explanatory.
Shot down | When artillery or military equipment is neutralised by the conflict’s counterpart.
Shot down.Aircraft | Self-explanatory.
Shot down.Drone | Self-explanatory.
Shot down.Helicopter | Self-explanatory.
Signing.Contract | Self-explanatory.
Signing.Formal agreement | Self-explanatory. Usually used in peace processes and ceasefires.
Signing.Peace treaty | Self-explanatory.
Steal | Self-explanatory.
Support | Endorsement from an institution or public figure of a specific cause.
Support.Political | Self-explanatory.
Surrender | Self-explanatory.
Torture | Self-explanatory.
Violent Crime | Violent events separate from politically-oriented violent crimes.
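The labels in this table follow a simple convention: dots separate levels of the category hierarchy, and a trailing “aka:” segment lists surface-form synonyms, separated by commas or semicolons, that can serve as trigger cues. The following is a minimal parsing sketch under that reading; parse_ontology_label is an illustrative name, not a function from the thesis code.

    import re

    # Split a dotted ontology label into its category path and any
    # "aka:" aliases, e.g. "Gain territory.aka: Seize; Capture"
    # -> (["Gain territory"], ["Seize", "Capture"]).
    def parse_ontology_label(label):
        parts = label.split(".")
        aliases = []
        if parts and parts[-1].strip().startswith("aka:"):
            alias_text = parts.pop().strip()[len("aka:"):]
            aliases = [a.strip() for a in re.split(r"[;,]", alias_text) if a.strip()]
        return parts, aliases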
