Event Mining Over Multiple Text Streams
John Calvo Martinez

A thesis in fulfilment of the requirements for the degree of Doctor of Philosophy

School of Computer Science and Engineering
Faculty of Engineering

December 2019

Thesis/Dissertation Sheet

Surname/Family Name: Calvo Martinez
Given Name/s: John Steven
Abbreviation for degree as given in the University calendar: Ph.D. – Research in Computer Science and Eng.
Faculty: Engineering
School: Computer Science
Thesis Title: Event Mining over Multiple Text Streams

Abstract

Event Mining is the set of information extraction tasks that aim to extract events from text, identifying the what (action or event category), the who (actors and targets), the when (date), and the where (location). Extracting events requires a number of automated steps to recognise all of these components. Current state-of-the-art event extraction systems rely on batch learning, but analysts need to understand socio-political conflict in near real time. Therefore, a key research question is how to perform Event Mining in near-real-time scenarios. In this thesis, a novel framework was developed for Event Extraction, Event Detection, Event Classification, and Argument Classification, using online learning and prequential testing to work in near-real-time scenarios. The framework was tested using three different social science datasets of the Afghanistan-Pakistan conflict, using events reported by news, social media and local experts (ACLED). A novel method called SPLICER was built to tackle these tasks in real time, using stream mining models in a multi-layered constraint learning approach. The feasibility of SPLICER is shown against state-of-the-art event extraction systems. Results show that SPLICER outperformed baselines by more than 10% F1 on both the ACLED and AfPak datasets.
Knowledge represented in domain-specific ontologies, in conjunction with constraint learning, is used along with base stream mining algorithms. In addition to improvements over stream mining algorithms, this thesis addresses the question of how to automatically combine multiple sources of information for stream classification. We propose SLICER, a stream mining ensemble that handles stream partitioning automatically. It assesses when and how it is better to “horizontally” split a stream dataset to build multiple local models that boost global models. SLICER was tested on single-layer event mining tasks, showing better results than single stream mining baselines and classic stream mining ensembles. Finally, SPLICER and SLICER were used jointly, improving results by 3 to 5% for event extraction tasks. In conclusion, stream mining algorithms can be used efficiently for event mining and other information extraction tasks, if horizontal partitions are carefully made by using Information Gain or Gini measures to split the source into multiple streams.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Signature ……………    Witness Signature ……………    Date ……………

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use.
Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

FOR OFFICE USE ONLY
Date of completion of requirements for Award:

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed ……………
Date ……………

INCLUSION OF PUBLICATIONS STATEMENT

UNSW is supportive of candidates publishing their research results during their candidature as detailed in the UNSW Thesis Examination Procedure. Publications can be used in their thesis in lieu of a Chapter if:
• The student contributed greater than 50% of the content in the publication and is the "primary author", i.e. the student was responsible primarily for the planning, execution and preparation of the work for publication
• The student has approval to include the publication in their thesis in lieu of a Chapter from their supervisor and Postgraduate Coordinator
• The publication is not subject to any obligations or contractual agreements with a third party that would constrain its inclusion in the thesis

Please indicate whether this thesis contains published material or not.
□ This thesis contains no publications, either published or submitted for publication
□ Some of the work described in this thesis has been published and it has been documented in the relevant Chapters with acknowledgement
□ This thesis has publications (either published or submitted for publication) incorporated into it in lieu of a chapter and the details are presented below

CANDIDATE'S DECLARATION

I declare that:
• I have complied with the Thesis Examination Procedure
• where I have used a publication in lieu of a Chapter, the listed publication(s) below meet(s) the requirements to be included in the thesis.

Signature    Date (dd/mm/yy)

Postgraduate Coordinator's Declaration

I declare that:
• the information below is accurate
• where listed publication(s) have been used in lieu of Chapter(s), their use complies with the Thesis Examination Procedure
• the minimum requirements for the format of the thesis have been met.

PGC's Name    PGC's Signature    Date (dd/mm/yy)

COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’

Signed ……………
Date ……………
AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ……………
Date ……………

Dedication

My love for you has brought hope.
My love for you has brought disappointment.
My love for you has evolved.
My love for you goes beyond the human.
For my love, remembered today and always, for Olga.

Acknowledgements

Real-world applications demand increasingly complex process automation techniques, drawing on a variety of data sources that arrive at unprecedented speed and in varied structures. Real-world datasets are noisy, imperfect and imbalanced. The world needs more effort from us as scientists to deal with such complexity in an age of imperfect information. Our main lesson learned is that research is better done under real-life scenarios, and we hope our contribution can help find a balance between research and real-world impact.

I would like to thank everyone who was around me during this major step in my life: the family, friends and co-workers who helped me complete this thesis in one way or another.
• I would like to thank my sponsors, Data to Decisions Cooperative Research Centre and Colciencias, for giving me the opportunity to be involved in this endeavour.
• I would like to thank my supervisor, Associate Professor Wayne Wobcke. You have been such an inspiration and guide during these years. Thanks for teaching me not only how to be a good researcher but also how to be an impactful and ethical professional. I really appreciated and enjoyed the opportunity to work with you during these years.
• Thanks to Dr Alfred Krzywicki for his helpful guidance and comments on my research area.
I really appreciate the opportunity to work with you and to build a meaningful project. It was always useful to have your expert eye on my project.
• Thanks to Dr Mike Bain for his guidance as my co-supervisor. I hugely enjoyed learning from you in classes and, above all, in our weekly meetings.
• Thanks to Dr Bradford Heap for his invaluable comments. I gained an enormous amount of knowledge from you when we worked together.
• Special thanks to Dr Susanne Schmeidl, for whom I have special admiration.