Outlier Management in Intelligent Data Analysis

OUTLIER MANAGEMENT IN INTELLIGENT DATA ANALYSIS

J. Gongxian Cheng

A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy

Department of Computer Science
Birkbeck College
University of London

September 2000

ABSTRACT

Despite the many statistical methods for outlier detection and robust analysis, there is little work on analysing the outliers themselves to determine their origins. For example, there are “good” outliers, which provide useful information that can lead to the discovery of new knowledge, and “bad” outliers, which include noisy data points. Successfully distinguishing between different types of outliers is an important issue in many applications, including fraud detection, medical tests, process analysis and scientific discovery. It requires not only an understanding of the mathematical properties of the data but also relevant knowledge of the domain context in which the outliers occur.

This thesis presents a novel attempt to automate the use of domain knowledge to help distinguish between different types of outliers. Two complementary knowledge-based outlier analysis strategies are proposed: one uses knowledge of how “normal data” should be distributed in a domain of interest in order to identify “good” outliers, and the other uses an understanding of “bad” outliers. This kind of knowledge-based outlier analysis is a useful extension to existing work on outlier detection in both the statistical and computing communities. In addition, a novel way of visualising and detecting outliers using self-organising maps is proposed, active control of data quality is introduced at the data collection stage, and an interactive procedure for knowledge discovery from noisy data is suggested.
The methods proposed for outlier management are applied to a class of medical screening applications, where data were collected in different clinical environments, including GP clinics and large-scale field investigations. Several evaluation strategies are proposed to assess various aspects of the proposed methods. Extensive experiments have demonstrated that problem-solving results are improved under the proposed outlier management methods. A number of examples are discussed to explain how the proposed methods can be applied to other applications as a general methodology for outlier management.

Dedicated to the memory of my Aunt, Maofen.

Acknowledgements

It has been a long journey since I started the work presented in this thesis. Now that the time has come to write this page, it is not easy to enumerate all of the support that I have received throughout my studies. I might not even know some of the people involved, but I am sure that without many of them, the thesis would not have become a reality.

I especially appreciate the numerous inspirational and passionate brainstorming sessions with Dr Xiaohui Liu and Dr John Wu, which have contributed a great deal to this thesis. I am also most grateful to Professor Barrie Jones, who has not only secured part of the funding for this study but, more importantly, has also offered unreserved support throughout my studies at the University. He has also helped review many of my papers and has organised several trial runs with the IDA system, which collected critically important test data. My gratitude also goes to Professor George Loizou. Not only has he supported my studies, but he has also been a major driving force for the project.
I must also thank Dr Roger Johnson, Dr Ken Thomas, Dr Trevor Fenner, Dr Roger Mitton and Dr Phil Docking at Birkbeck College, Professor Roger Hitchings and Dr Richard Wormald at the Moorfields Eye Hospital, Professor Gordon Johnson and Professor Fred Fitzke at the Institute of Ophthalmology, University College London, and Dr Simon Cousens at the London School of Hygiene and Tropical Medicine, who have all played important roles at different stages in my studies. I wish to thank Dr Stephen Corcoran for collecting GP data from the MRC pilot study; Professor A. Abiose at the National Eye Centre of Kaduna, Nigeria, Mr B. Adeniyi and other members of the onchocerciasis research team for collecting the onchocercal data from field trials in Nigeria; and Grace Wu for collecting glaucomatous data at the Moorfields Eye Hospital.

I am also thankful to Jonathan Collins, Michael Hickman, Sylvie Jami and Kwa-Wen Cho, who provided help with many experiments related to this thesis. The United Nations Development Programme (UNDP) and the British Council for Prevention of Blindness have provided funding for my study; the UK’s Medical Research Council, the UNDP, the World Bank and the World Health Organisation have funded the large-scale field trials of this system. Vital data were obtained from these trials.

Words are not enough to express my thanks to my supervisor, Dr Xiaohui Liu, who has been vigorously supporting my studies, guiding my thesis, and making my life in London so enjoyable. The same thanks also go to Dr John Wu and his family: it has been such a precious experience with you that I will remember it forever.

This thesis would not have been possible without my wife Chunyu. Thank you for your love and encouragement. I would also like to thank our beloved Sabrina for being quiet and understanding during the writing of the thesis. Last but by no means least, thanks to my parents for loving me, believing in me, and empowering my thought throughout my life.
Declaration

The work presented in the thesis is my own except where it is explicitly stated otherwise. The papers have been jointly published with my supervisor, Dr X. Liu, and with fellow research workers, notably Dr J. Wu. However, I have personally conducted the research reported in the thesis, and am responsible for the results obtained.

Gongxian Cheng: ____________
Xiaohui Liu: ____________

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
DECLARATION
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF EQUATIONS
ACRONYMS
CHAPTER 1 INTRODUCTION
    1.1 INTELLIGENT DATA ANALYSIS
    1.2 OUTLIER MANAGEMENT
        1.2.1 Issues and Challenges
        1.2.2 An Intelligent Data Analysis Approach
        1.2.3 Applying the Approach to Real-World Problems
    1.3 KEY CONTRIBUTIONS
    1.4 A BRIEF DESCRIPTION OF CONTENTS
CHAPTER 2 BACKGROUND
    2.1 OUTLIER MANAGEMENT
        2.1.1 Why Do Outliers Occur?
        2.1.2 Outlier Detection
        2.1.3 Methods for Handling Outliers
    2.2 DIVERSITY OF IDA METHODS
        2.2.1 Artificial Neural Networks
        2.2.2 Bayesian Networks
