Outlier Detection

COMP20008 Elements of Data Processing Project discussion, outlier detection Announcements • Project has been released • Workshop next week – will be released today by 2pm • Friday next week 25th March (Good Friday holiday) – No lecture – No workshop – People who attend the Friday Workshop, you may come instead to the Wednesday afternoon workshop (3:15-5:15) and we should be able to accommodate you. Please bring a laptop if you can. Plan • Project discussion • Introduction to some outlier detection techniques Project Scenario • The Victorian Minister for Data Science and the Mayor of the Melbourne City Council wish to understand more about how open data can be used to benefit Melbourne. • At a high level, they would like to see demonstrations of how open data can be wrangled to gain insight into issues affecting Melbourne, for a broad range of areas such as transport, health, business, education, tourism, the environment, communities, the arts, commerce, public amenities, employment, sport, usage of facilities, real estate, finance or urban planning. • You are a data science consultant who is hoping to convince the Minister and the Mayor about the benefits of open data. Project Phases • Phase 1 (10%): write a proposal, outlining a question in a chosen domain that is relevant to Melbourne and proposing a data wrangling project to demonstrate the benefits of processing open datasets to answer this question. • Phase 2 (5%): commence the investigation, generate initial findings, provide an interim report on what you have learnt so far • Phase 3 (10%): oral presentation • Phase 4 (25%): deliver a written report outlining your methodology and findings. Deliver Python code as well Datasets • See the LMS Advice • Phase 1 – likely to be the most difficult phase, get started early! Need to • Formulate a question for a chosen domain • Explain who might care about this question, why and possible impact an answer would have • Select at least 2 datasets to help answer this question • Give thought to what processing, integration and visualisation is needed for these datasets? • Enrich compared to the raw data • Consider potential difficulties • Consider amount of effort needed for writing Python code Formulating a question • Browse the datasets. See what catches your eye. E.g. – Pedestrian count information around Melbourne – Usage of bikeshare pods – Bicycle volumes – Crime by location – Enrolments and class sizes at schools – Historical weather data – Road crash information – Mobile speed camera locations – Water consumption – Locations of council maintained trees around City of Melbourne – Projected population of Melbourne – ---- Starting Examples • Health: Analysis of bicycle usage over time, how is it correlated with weather, traffic volume? Might this information be useful for fitness campaigns? (bicycle data, traffic data, weather data) • Education: Do we have enough schools in Melbourne to support our growing population? (population data, school data) • Tourism: How do visitors behave when they visit Melbourne, what do they photograph? Can this knowledge be used to make Melbourne more tourist friendly? (open Flickr data, twitter) Final Report • Will need to discuss • What processing and integration was done and why • What was found – tables, graphs and visualisations • Static, not dynamic visualisations. You are not being asked to develop an interactive tool • Explain how the findings answer the question originally posed. Who might be interested and why? • Provide documented Python code Due dates and workload • Phase 1 (pitch): 12pm 4th April • Phase 2 (interim report): 12pm 22nd April • Phase 3: Oral presentation in weeks 11 and 12 • Phase 4 (final report): 12pm Friday 20th May • Phases 1,2 and 4: ~45-50hrs work • Phase 3: ~10-12 hrs work • Total is ~55-62 hrs work – Tightly scope a question. – Given the time constraints, we are not expecting a major investigation of the question. Rather, a proof of concept that might be used as evidence to conduct a more resource intensive investigation. Advice (2) • Make use of LMS discussion forum for questions. • Project is structured in phases – intention of promoting steady progress and to help avoid getting stuck – more feedback opportunities – take each phase seriously COMP20008 Elements of Data Processing Outlier Detection Outlier analysis • Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism (Hawkins, 1980) – Ex.: Unusual credit card purchase, sports: Michael Jordon, Lance Franklin, … • From a statistics perspective – Normal (non-outlier) objects are generated using some statistical process – The outlier objects deviate from this generating process Appl. Statist. (1978), 27, No. 3, pp. 242-250 The Study of Outliers: Purpose and Model By Vic BARNrr University of Sheffield, Britain [Received January 1978. Revised April 1978] SUMMARY Outliers may influence the analysis of a set of data in various different ways. Some practical examples are used to motivate a categorization of the different aims in handling outliers and of the different models which might be employed to reflect the presence of outliers. Keywords: OUTLIERS; OUTLIER-GENERATING MODELS; TEST OF DISCORDANCY 1. INTRODUCTION THE legal case of Hadlum v. Hadlum, held in 1949, is interesting! Mr Hadlum was appealing against the failure of his earlier petition for divorce on the grounds of Mrs Hadlum's claimed adultery. The sole evidence of adultery consisted of the birth of a child to Mrs Hadlum on 12 August 1945: 349 days after Mr Hadlum had left for military service abroad. The claim was essentially that the outlier of 349 days (compared with an average of 280 days) was discordant: statistically unreasonable in relation to the distribution of human gestation periods. The appeal judges agreed that the limit of credibility had to be drawn somewhere, but that on medical evidence 349, whilst improbable, was scientifically possible. The appeal failed. (Later, in 1951, in Preston-Jones v. Preston-Jones, the House of Lords drew the limit at 360 Example: Hadlum vs Hadlum paternity case days: in M.-T. v. M.-T., 1949, 340 days had been ruled "impossible in the light of modern gynaecological evidence"). Let us look at the distribution of gestation times (which is more than was done in any of the court cases). Fig. 1 (based on Chamberlain, 1975) shows the distribution of lengths of gestation for a sample of 13,634 births, and• the outliers described Paternity case: “The study of outliers”, V. Barnett, Journal of the above. Royal Statistical Society, 27(3), 1978 Percentage (n D 13,634) 20 - Hadlum v. Nor a -Taheu 30 3504550 Week FIG. 1. Distribution of human gestation periods. 242 This content downloaded from 128.250.144.144 on Thu, 17 Mar 2016 18:17:17 UTC All use subject to JSTOR Terms and Conditions Outlier analysis • Outliers can be different from the noise data – Noise is random error or variance in a measured variable – Noise should be removed before outlier detection • Outliers are interesting: Violation of the mechanism that generates the normal data • Applications: – Credit card fraud detection (change in behaviour) – Telecom fraud detection – Medical analysis (unusual test results) – Sports (identifying exceptional talent) Australian Rules Football • Daniel Giansiracusa Outlyingness of Daniel Giansiracusa (see arrow) versus 626 other players 60 50 40 30 20 A 10 Number of goals scored in 2014 0 0 10 20 30 40 50 60 70 80 90 100 Average percentage of time on field Why do we care? • Compute the average age of people in this room – Skewed results • Compute the average salary of people in this room – What if Donald Trump is in the audience? Types of Outliers • Global outlier (or point anomaly) Global Outlier – Object is Og if it significantly deviates from the rest of the data set – Ex. Intrusion detection in computer networks – Issue: Find an appropriate measurement of deviation • Contextual outlier (or conditional outlier) – Object is Oc if it deviates significantly based on a selected context – Is 5o in Melbourne an outlier? (depending on summer or winter?) – Attributes of data should be divided into two groups • Contextual attributes: defines the context, e.g., time & location • Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature – Issue: How to define or formulate meaningful context? Exercise • Provide two examples to motivate the importance of each of – A global outlier – A contextual outlier Types of Outliers (II) • Collective Outliers – A subset of data objects collectively deviate significantly from the whole data set, even if the individual data objects may not be outliers – Applications: E.g., intrusion detection: Collective Outlier • When a number of computers keep sending denial-of-service packages to each other n Detection of collective outliers n Consider not only behavior of individual objects, but also that of groups of objects n A data set may have multiple types of outlier n One object could be characterised as more than one type of outlier Challenges of Outlier Detection n Modeling normal objects and outliers properly n Hard to enumerate all possible normal behaviors in an application n The border between normal and outlier objects is often a gray area n Application-specific outlier detection n Choice of distance measure among objects and the model of relationship among objects are often application-dependent n E.g., clinic data: a small deviation could be an outlier; while in marketing analysis, larger

Load more