Inter-Quartile Range, - Outliers, - Boxplots
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
A Review on Outlier/Anomaly Detection in Time Series Data
A review on outlier/anomaly detection in time series data ANE BLÁZQUEZ-GARCÍA and ANGEL CONDE, Ikerlan Technology Research Centre, Basque Research and Technology Alliance (BRTA), Spain USUE MORI, Intelligent Systems Group (ISG), Department of Computer Science and Artificial Intelligence, University of the Basque Country (UPV/EHU), Spain JOSE A. LOZANO, Intelligent Systems Group (ISG), Department of Computer Science and Artificial Intelligence, University of the Basque Country (UPV/EHU), Spain and Basque Center for Applied Mathematics (BCAM), Spain Recent advances in technology have brought major breakthroughs in data collection, enabling a large amount of data to be gathered over time and thus generating time series. Mining this data has become an important task for researchers and practitioners in the past few years, including the detection of outliers or anomalies that may represent errors or events of interest. This review aims to provide a structured and comprehensive state-of-the-art on outlier detection techniques in the context of time series. To this end, a taxonomy is presented based on the main aspects that characterize an outlier detection technique. Additional Key Words and Phrases: Outlier detection, anomaly detection, time series, data mining, taxonomy, software 1 INTRODUCTION Recent advances in technology allow us to collect a large amount of data over time in diverse research areas. Observations that have been recorded in an orderly fashion and which are correlated in time constitute a time series. Time series data mining aims to extract all meaningful knowledge from this data, and several mining tasks (e.g., classification, clustering, forecasting, and outlier detection) have been considered in the literature [Esling and Agon 2012; Fu 2011; Ratanamahatana et al. -
Outlier Identification.Pdf
STATGRAPHICS – Rev. 7/6/2009 Outlier Identification Summary The Outlier Identification procedure is designed to help determine whether or not a sample of n numeric observations contains outliers. By “outlier”, we mean an observation that does not come from the same distribution as the rest of the sample. Both graphical methods and formal statistical tests are included. The procedure will also save a column back to the datasheet identifying the outliers in a form that can be used in the Select field on other data input dialog boxes. Sample StatFolio: outlier.sgp Sample Data: The file bodytemp.sgd file contains data describing the body temperature of a sample of n = 130 people. It was obtained from the Journal of Statistical Education Data Archive (www.amstat.org/publications/jse/jse_data_archive.html) and originally appeared in the Journal of the American Medical Association. The first 20 rows of the file are shown below. Temperature Gender Heart Rate 98.4 Male 84 98.4 Male 82 98.2 Female 65 97.8 Female 71 98 Male 78 97.9 Male 72 99 Female 79 98.5 Male 68 98.8 Female 64 98 Male 67 97.4 Male 78 98.8 Male 78 99.5 Male 75 98 Female 73 100.8 Female 77 97.1 Male 75 98 Male 71 98.7 Female 72 98.9 Male 80 99 Male 75 2009 by StatPoint Technologies, Inc. Outlier Identification - 1 STATGRAPHICS – Rev. 7/6/2009 Data Input The data to be analyzed consist of a single numeric column containing n = 2 or more observations. -
A Machine Learning Approach to Outlier Detection and Imputation of Missing Data1
Ninth IFC Conference on “Are post-crisis statistical initiatives completed?” Basel, 30-31 August 2018 A machine learning approach to outlier detection and imputation of missing data1 Nicola Benatti, European Central Bank 1 This paper was prepared for the meeting. The views expressed are those of the authors and do not necessarily reflect the views of the BIS, the IFC or the central banks and other institutions represented at the meeting. A machine learning approach to outlier detection and imputation of missing data Nicola Benatti In the era of ready-to-go analysis of high-dimensional datasets, data quality is essential for economists to guarantee robust results. Traditional techniques for outlier detection tend to exclude the tails of distributions and ignore the data generation processes of specific datasets. At the same time, multiple imputation of missing values is traditionally an iterative process based on linear estimations, implying the use of simplified data generation models. In this paper I propose the use of common machine learning algorithms (i.e. boosted trees, cross validation and cluster analysis) to determine the data generation models of a firm-level dataset in order to detect outliers and impute missing values. Keywords: machine learning, outlier detection, imputation, firm data JEL classification: C81, C55, C53, D22 Contents A machine learning approach to outlier detection and imputation of missing data ... 1 Introduction .............................................................................................................................................. -
Fan Chart: Methodology and Its Application to Inflation Forecasting in India
W P S (DEPR) : 5 / 2011 RBI WORKING PAPER SERIES Fan Chart: Methodology and its Application to Inflation Forecasting in India Nivedita Banerjee and Abhiman Das DEPARTMENT OF ECONOMIC AND POLICY RESEARCH MAY 2011 The Reserve Bank of India (RBI) introduced the RBI Working Papers series in May 2011. These papers present research in progress of the staff members of RBI and are disseminated to elicit comments and further debate. The views expressed in these papers are those of authors and not that of RBI. Comments and observations may please be forwarded to authors. Citation and use of such papers should take into account its provisional character. Copyright: Reserve Bank of India 2011 Fan Chart: Methodology and its Application to Inflation Forecasting in India Nivedita Banerjee 1 (Email) and Abhiman Das (Email) Abstract Ever since Bank of England first published its inflation forecasts in February 1996, the Fan Chart has been an integral part of inflation reports of central banks around the world. The Fan Chart is basically used to improve presentation: to focus attention on the whole of the forecast distribution, rather than on small changes to the central projection. However, forecast distribution is a technical concept originated from the statistical sampling distribution. In this paper, we have presented the technical details underlying the derivation of Fan Chart used in representing the uncertainty in inflation forecasts. The uncertainty occurs because of the inter-play of the macro-economic variables affecting inflation. The uncertainty in the macro-economic variables is based on their historical standard deviation of the forecast errors, but we also allow these to be subjectively adjusted. -
Finding Basic Statistics Using Minitab 1. Put Your Data Values in One of the Columns of the Minitab Worksheet
Finding Basic Statistics Using Minitab 1. Put your data values in one of the columns of the Minitab worksheet. 2. Add a variable name in the gray box just above the data values. 3. Click on “Stat”, then click on “Basic Statistics”, and then click on "Display Descriptive Statistics". 4. Choose the variable you want the basic statistics for click on “Select”. 5. Click on the “Statistics” box and then check the box next to each statistic you want to see, and uncheck the boxes next to those you do not what to see. 6. Click on “OK” in that window and click on “OK” in the next window. 7. The values of all the statistics you selected will appear in the Session window. Example (Navidi & Monk, Elementary Statistics, 2nd edition, #31 p. 143, 1st 4 columns): This data gives the number of tribal casinos in a sample of 16 states. 3 7 14 114 2 3 7 8 26 4 3 14 70 3 21 1 Open Minitab and enter the data under C1. The table below shows a portion of the entered data. ↓ C1 C2 Casinos 1 3 2 7 3 14 4 114 5 2 6 3 7 7 8 8 9 26 10 4 Now click on “Stat” and then choose “Basic Statistics” and “Display Descriptive Statistics”. Click in the box under “Variables:”, choose C1 from the window at left, and then click on the “Select” button. Next click on the “Statistics” button and choose which statistics you want to find. We will usually be interested in the following statistics: mean, standard error of the mean, standard deviation, minimum, maximum, range, first quartile, median, third quartile, interquartile range, mode, and skewness. -
Effects of Outliers with an Outlier
• The mean is a good measure to use to describe data that are close in value. • The median more accurately describes data Effects of Outliers with an outlier. • The mode is a good measure to use when you have categorical data; for example, if each student records his or her favorite color, the color (a category) listed most often is the mode of the data. In the data set below, the value 12 is much less than Additional Example 2: Exploring the Effects of the other values in the set. An extreme value such as Outliers on Measures of Central Tendency this is called an outlier. The data shows Sara’s scores for the last 5 35, 38, 27, 12, 30, 41, 31, 35 math tests: 88, 90, 55, 94, and 89. Identify the outlier in the data set. Then determine how the outlier affects the mean, median, and x mode of the data. x x xx x x x 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 55, 88, 89, 90, 94 outlier 55 1 Additional Example 2 Continued Additional Example 2 Continued With the Outlier Without the Outlier 55, 88, 89, 90, 94 55, 88, 89, 90, 94 outlier 55 mean: median: mode: mean: median: mode: 55+88+89+90+94 = 416 55, 88, 89, 90, 94 88+89+90+94 = 361 88, 89,+ 90, 94 416 5 = 83.2 361 4 = 90.25 2 = 89.5 The mean is 83.2. The median is 89. There is no mode. -
How Significant Is a Boxplot Outlier?
Journal of Statistics Education, Volume 19, Number 2(2011) How Significant Is A Boxplot Outlier? Robert Dawson Saint Mary’s University Journal of Statistics Education Volume 19, Number 2(2011), www.amstat.org/publications/jse/v19n2/dawson.pdf Copyright © 2011 by Robert Dawson all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor. Key Words: Boxplot; Outlier; Significance; Spreadsheet; Simulation. Abstract It is common to consider Tukey’s schematic (“full”) boxplot as an informal test for the existence of outliers. While the procedure is useful, it should be used with caution, as at least 30% of samples from a normally-distributed population of any size will be flagged as containing an outlier, while for small samples (N<10) even extreme outliers indicate little. This fact is most easily seen using a simulation, which ideally students should perform for themselves. The majority of students who learn about boxplots are not familiar with the tools (such as R) that upper-level students might use for such a simulation. This article shows how, with appropriate guidance, a class can use a spreadsheet such as Excel to explore this issue. 1. Introduction Exploratory data analysis suddenly got a lot cooler on February 4, 2009, when a boxplot was featured in Randall Munroe's popular Webcomic xkcd. 1 Journal of Statistics Education, Volume 19, Number 2(2011) Figure 1 “Boyfriend” (Feb. 4 2009, http://xkcd.com/539/) Used by permission of Randall Munroe But is Megan justified in claiming “statistical significance”? We shall explore this intriguing question using Microsoft Excel. -
Lecture 14 Testing for Kurtosis
9/8/2016 CHE384, From Data to Decisions: Measurement, Kurtosis Uncertainty, Analysis, and Modeling • For any distribution, the kurtosis (sometimes Lecture 14 called the excess kurtosis) is defined as Testing for Kurtosis 3 (old notation = ) • For a unimodal, symmetric distribution, Chris A. Mack – a positive kurtosis means “heavy tails” and a more Adjunct Associate Professor peaked center compared to a normal distribution – a negative kurtosis means “light tails” and a more spread center compared to a normal distribution http://www.lithoguru.com/scientist/statistics/ © Chris Mack, 2016Data to Decisions 1 © Chris Mack, 2016Data to Decisions 2 Kurtosis Examples One Impact of Excess Kurtosis • For the Student’s t • For a normal distribution, the sample distribution, the variance will have an expected value of s2, excess kurtosis is and a variance of 6 2 4 1 for DF > 4 ( for DF ≤ 4 the kurtosis is infinite) • For a distribution with excess kurtosis • For a uniform 2 1 1 distribution, 1 2 © Chris Mack, 2016Data to Decisions 3 © Chris Mack, 2016Data to Decisions 4 Sample Kurtosis Sample Kurtosis • For a sample of size n, the sample kurtosis is • An unbiased estimator of the sample excess 1 kurtosis is ∑ ̅ 1 3 3 1 6 1 2 3 ∑ ̅ Standard Error: • For large n, the sampling distribution of 1 24 2 1 approaches Normal with mean 0 and variance 2 1 of 24/n 3 5 • For small samples, this estimator is biased D. N. Joanes and C. A. Gill, “Comparing Measures of Sample Skewness and Kurtosis”, The Statistician, 47(1),183–189 (1998). -
Continuous Random Variables
ST 380 Probability and Statistics for the Physical Sciences Continuous Random Variables Recall: A continuous random variable X satisfies: 1 its range is the union of one or more real number intervals; 2 P(X = c) = 0 for every c in the range of X . Examples: The depth of a lake at a randomly chosen location. The pH of a random sample of effluent. The precipitation on a randomly chosen day is not a continuous random variable: its range is [0; 1), and P(X = c) = 0 for any c > 0, but P(X = 0) > 0. 1 / 12 Continuous Random Variables Probability Density Function ST 380 Probability and Statistics for the Physical Sciences Discretized Data Suppose that we measure the depth of the lake, but round the depth off to some unit. The rounded value Y is a discrete random variable; we can display its probability mass function as a bar graph, because each mass actually represents an interval of values of X . In R source("discretize.R") discretize(0.5) 2 / 12 Continuous Random Variables Probability Density Function ST 380 Probability and Statistics for the Physical Sciences As the rounding unit becomes smaller, the bar graph more accurately represents the continuous distribution: discretize(0.25) discretize(0.1) When the rounding unit is very small, the bar graph approximates a smooth function: discretize(0.01) plot(f, from = 1, to = 5) 3 / 12 Continuous Random Variables Probability Density Function ST 380 Probability and Statistics for the Physical Sciences The probability that X is between two values a and b, P(a ≤ X ≤ b), can be approximated by P(a ≤ Y ≤ b). -
Binomial and Normal Distributions
Probability distributions Introduction What is a probability? If I perform n experiments and a particular event occurs on r occasions, the “relative frequency” of r this event is simply . This is an experimental observation that gives us an estimate of the n probability – there will be some random variation above and below the actual probability. If I do a large number of experiments, the relative frequency gets closer to the probability. We define the probability as meaning the limit of the relative frequency as the number of experiments tends to infinity. Usually we can calculate the probability from theoretical considerations (number of beads in a bag, number of faces of a dice, tree diagram, any other kind of statistical model). What is a random variable? A random variable (r.v.) is the value that might be obtained from some kind of experiment or measurement process in which there is some random uncertainty. A discrete r.v. takes a finite number of possible values with distinct steps between them. A continuous r.v. takes an infinite number of values which vary smoothly. We talk not of the probability of getting a particular value but of the probability that a value lies between certain limits. Random variables are given names which start with a capital letter. An r.v. is a numerical value (eg. a head is not a value but the number of heads in 10 throws is an r.v.) Eg the score when throwing a dice (discrete); the air temperature at a random time and date (continuous); the age of a cat chosen at random. -
On Median and Quartile Sets of Ordered Random Variables*
ISSN 2590-9770 The Art of Discrete and Applied Mathematics 3 (2020) #P2.08 https://doi.org/10.26493/2590-9770.1326.9fd (Also available at http://adam-journal.eu) On median and quartile sets of ordered random variables* Iztok Banicˇ Faculty of Natural Sciences and Mathematics, University of Maribor, Koroskaˇ 160, SI-2000 Maribor, Slovenia, and Institute of Mathematics, Physics and Mechanics, Jadranska 19, SI-1000 Ljubljana, Slovenia, and Andrej Marusiˇ cˇ Institute, University of Primorska, Muzejski trg 2, SI-6000 Koper, Slovenia Janez Zerovnikˇ Faculty of Mechanical Engineering, University of Ljubljana, Askerˇ cevaˇ 6, SI-1000 Ljubljana, Slovenia, and Institute of Mathematics, Physics and Mechanics, Jadranska 19, SI-1000 Ljubljana, Slovenia Received 30 November 2018, accepted 13 August 2019, published online 21 August 2020 Abstract We give new results about the set of all medians, the set of all first quartiles and the set of all third quartiles of a finite dataset. We also give new and interesting results about rela- tionships between these sets. We also use these results to provide an elementary correctness proof of the Langford’s doubling method. Keywords: Statistics, probability, median, first quartile, third quartile, median set, first quartile set, third quartile set. Math. Subj. Class. (2020): 62-07, 60E05, 60-08, 60A05, 62A01 1 Introduction Quantiles play a fundamental role in statistics: they are the critical values used in hypoth- esis testing and interval estimation. Often they are the characteristics of distributions we usually wish to estimate. The use of quantiles as primary measure of performance has *This work was supported in part by the Slovenian Research Agency (grants J1-8155, N1-0071, P2-0248, and J1-1693.) E-mail addresses: [email protected] (Iztok Banic),ˇ [email protected] (Janez Zerovnik)ˇ cb This work is licensed under https://creativecommons.org/licenses/by/4.0/ 2 Art Discrete Appl. -
A SYSTEMATIC REVIEW of OUTLIERS DETECTION TECHNIQUES in MEDICAL DATA Preliminary Study
A SYSTEMATIC REVIEW OF OUTLIERS DETECTION TECHNIQUES IN MEDICAL DATA Preliminary Study Juliano Gaspar1,2, Emanuel Catumbela1,2, Bernardo Marques1,2 and Alberto Freitas1,2 1Department of Biostatistics and Medical Informatics, Faculty of Medicine, University of Porto, Porto, Portugal 2CINTESIS - Center for Research in Health Technologies and Information Systems University of Porto, Porto, Portugal Keywords: Outliers detection, Data mining, Medical data. Abstract: Background: Patient medical records contain many entries relating to patient conditions, treatments and lab results. Generally involve multiple types of data and produces a large amount of information. These databases can provide important information for clinical decision and to support the management of the hospital. Medical databases have some specificities not often found in others non-medical databases. In this context, outlier detection techniques can be used to detect abnormal patterns in health records (for instance, problems in data quality) and this contributing to better data and better knowledge in the process of decision making. Aim: This systematic review intention to provide a better comprehension about the techniques used to detect outliers in healthcare data, for creates automatisms for those methods in the order to facilitate the access to information with quality in healthcare. Methods: The literature was systematically reviewed to identify articles mentioning outlier detection techniques or anomalies in medical data. Four distinct bibliographic databases were searched: Medline, ISI, IEEE and EBSCO. Results: From 4071 distinct papers selected, 80 were included after applying inclusion and exclusion criteria. According to the medical specialty 32% of the techniques are intended for oncology and 37% of them using patient data.