Chapter 3: Measures of Central Tendency
Recommended publications
- A Review on Outlier/Anomaly Detection in Time Series Data
ANE BLÁZQUEZ-GARCÍA and ANGEL CONDE, Ikerlan Technology Research Centre, Basque Research and Technology Alliance (BRTA), Spain; USUE MORI, Intelligent Systems Group (ISG), Department of Computer Science and Artificial Intelligence, University of the Basque Country (UPV/EHU), Spain; JOSE A. LOZANO, Intelligent Systems Group (ISG), Department of Computer Science and Artificial Intelligence, University of the Basque Country (UPV/EHU), Spain, and Basque Center for Applied Mathematics (BCAM), Spain.

Recent advances in technology have brought major breakthroughs in data collection, enabling a large amount of data to be gathered over time and thus generating time series. Mining this data has become an important task for researchers and practitioners in the past few years, including the detection of outliers or anomalies that may represent errors or events of interest. This review aims to provide a structured and comprehensive state-of-the-art on outlier detection techniques in the context of time series. To this end, a taxonomy is presented based on the main aspects that characterize an outlier detection technique.

Additional Key Words and Phrases: outlier detection, anomaly detection, time series, data mining, taxonomy, software

1 INTRODUCTION. Recent advances in technology allow us to collect a large amount of data over time in diverse research areas. Observations that have been recorded in an orderly fashion and which are correlated in time constitute a time series. Time series data mining aims to extract all meaningful knowledge from this data, and several mining tasks (e.g., classification, clustering, forecasting, and outlier detection) have been considered in the literature [Esling and Agon 2012; Fu 2011; Ratanamahatana et al. ...]
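The abstract does not spell out a particular algorithm, so the sketch below is a hedged illustration of one of the simplest point-outlier techniques such surveys cover: a rolling z-score test. The window size, threshold, and synthetic series are arbitrary assumptions, not values from the paper.

```python
import statistics

def rolling_zscore_outliers(series, window=20, threshold=3.0):
    """Flag indices whose value lies more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = statistics.mean(history)
        sigma = statistics.stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# A slow upward trend with one injected spike at index 60.
data = [0.1 * t for t in range(100)]
data[60] += 15.0
print(rolling_zscore_outliers(data))  # -> [60]
```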
- Mean, Median and Mode
INFORMATION AND DATA (Class 6, Sub: Mathematics). The words Data and Information may look similar, and many people use them interchangeably, but there are important differences between them.

What is Data? Data is a raw and unorganized fact that must be processed to make it meaningful. Data can be simple and at the same time unorganized until it is organized. Generally, data comprises facts, observations, perceptions, numbers, characters, symbols, images, etc. Data must be interpreted, by a human or machine, to derive meaning; on its own, data is meaningless. Data contains numbers, statements, and characters in a raw form.

What is Information? Information is a set of data which is processed in a meaningful way according to the given requirement. Information is processed, structured, or presented in a given context to make it meaningful and useful. It is processed data that possesses context, relevance, and purpose; producing it involves manipulation of raw data.

Mean, Median and Mode. We use statistics such as the mean, median and mode to obtain information about a population from our sample set of observed values.

Mean. The mean (arithmetic mean, or average) of a set of data values is the sum of all of the data values divided by the number of data values.

Example 1. The marks of seven students in a mathematics test with a maximum possible mark of 20 are given below: 15, 13, 18, 16, 14, 17, 12. Find the mean of this set of data values. Solution: the mean is (15 + 13 + 18 + 16 + 14 + 17 + 12) / 7 = 105 / 7 = 15, so the mean mark is 15.
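A minimal sketch of how the mean (and, for comparison, the median and mode) could be computed for Example 1; the second `retest` data set is hypothetical, added only because a mode needs repeated values:

```python
import statistics

marks = [15, 13, 18, 16, 14, 17, 12]

mean = sum(marks) / len(marks)      # (15 + 13 + 18 + 16 + 14 + 17 + 12) / 7 = 105 / 7
median = statistics.median(marks)   # middle value of the sorted marks
print(mean, median)                 # 15.0 15

# The mode is only informative when values repeat; a second, hypothetical set:
retest = [12, 15, 15, 16, 17, 15, 18]
print(statistics.mode(retest))      # 15 occurs most often
```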
- Simple Mean, Weighted Mean, or Harmonic Mean
Multiply or Divide? A Best Practice for Factor Analysis. 7–10 June 2011. Dr. Shu-Ping Hu and Alfred Smith, CCEA. Presented at the 2011 ISPA/SCEA Joint Annual Conference and Training Workshop (www.iceaaonline.com).

Objectives. It is common to estimate hours as a simple factor of a technical parameter such as weight, aperture, power, or source lines of code (SLOC), i.e., hours = a * TechParameter. "Software development hours = a * SLOC" is used as an example; the concept is applicable to any factor cost estimating relationship (CER). Our objective is to address how to best estimate "a": Should we multiply SLOC by Hours/SLOC or divide SLOC by SLOC/Hour? Should we use a simple, weighted, or harmonic mean? What is the role of regression analysis? Uncertainty should be based on the prediction interval rather than just the range. Our goal is to provide analysts a better understanding of the choices available and how to select the right approach.

Outline. Definitions ...
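To make the choice concrete, here is a hedged sketch of the three candidate averages applied to made-up (hours, SLOC) pairs; none of the numbers come from the presentation:

```python
# Hypothetical historical programs as (hours, SLOC) pairs -- illustrative only.
programs = [(5200, 10000), (11800, 25000), (30100, 80000)]

factors = [hours / sloc for hours, sloc in programs]     # observed Hours/SLOC ratios

simple = sum(factors) / len(factors)                     # unweighted average of the ratios
weighted = (sum(h for h, s in programs) /
            sum(s for h, s in programs))                 # total hours / total SLOC
harmonic = len(factors) / sum(1 / f for f in factors)    # reciprocal of the mean SLOC/Hour

new_sloc = 40000                                          # size of the program being estimated
for name, a in [("simple", simple), ("weighted", weighted), ("harmonic", harmonic)]:
    print(f"{name:8s} a = {a:.3f}, estimated hours = {a * new_sloc:,.0f}")
```

The three statistics give noticeably different values of "a" on the same data, which is exactly the choice the presentation sets out to resolve.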
- Accelerating Raw Data Analysis with the ACCORDA Software and Hardware Architecture
Yuanwei Fang (CS Department, University of Chicago), Chen Zou (CS Department, University of Chicago), Andrew A. Chien (University of Chicago and Argonne National Laboratory).

ABSTRACT. The data science revolution and growing popularity of data lakes make efficient processing of raw data increasingly important. To address this, we propose the ACCelerated Operators for Raw Data Analysis (ACCORDA) architecture. By extending the operator interface (subtype with encoding) and employing a uniform runtime worker model, ACCORDA integrates data transformation acceleration seamlessly, enabling a new class of encoding optimizations and robust high-performance raw data processing. Together, these key features preserve the software system architecture, empowering state-of-art heuristic optimizations to drive flexible data encoding for performance. ACCORDA derives performance from its software architecture, but depends critically on the ...

1. INTRODUCTION. Driven by a rapid increase in quantity, type, and number of sources of data, many data analytics approaches use the data lake model. Incoming data is pooled in large quantity (hence the name "lake") and, because of varied origin (e.g., purchased from other providers, internal employee data, web scraped, database dumps, etc.), it varies in size, encoding, age, quality, etc. Additionally, it often lacks consistency or any common schema. Analytics applications pull data from the lake as they see fit, digesting and processing it on-demand for a current analytics task. The applications of data lakes are vast, including machine learning, data mining, artificial intelligence, ad-hoc query, and data visualization. Fast direct processing on raw data is critical for these applications.
- Hydraulics Manual Glossary
GLOSSARY OF HIGHWAY-RELATED DRAINAGE TERMS (reprinted from the 1999 edition of the American Association of State Highway and Transportation Officials Model Drainage Manual).

G.1 Introduction. This Glossary is divided into three parts: Introduction, Glossary, and References. It is not intended that all the terms in this Glossary be rigorously accurate or complete; realistically, this is impossible. Depending on the circumstance, a particular term may have several meanings; this can never change. The primary purpose of this Glossary is to define the terms found in the Highway Drainage Guidelines and Model Drainage Manual in a manner that makes them easier to interpret and understand. A lesser purpose is to provide a compendium of terms that will be useful for both the novice as well as the more experienced hydraulics engineer. This Glossary may also help those who are unfamiliar with highway drainage design to become more understanding and appreciative of this complex science as well as facilitate communication between the highway hydraulics engineer and others.

Where readily available, the source of a definition has been referenced. For clarity or format purposes, cited definitions may have some additional verbiage contained in double brackets [ ]. Conversely, three "dots" (...) are used to indicate where some parts of a cited definition were eliminated. Also, as might be expected, different sources were found to use different hyphenation and terminology practices for the same words. Insignificant changes in this regard were made to some cited references and elsewhere to gain uniformity for the terms contained in this Glossary: as an example, "groundwater" vice "ground-water" or "ground water," and "cross section area" vice "cross-sectional area." Cited definitions were taken primarily from two sources: W.B. ...
- Big Data and Open Data as Sustainability Tools: A Working Paper Prepared by the Economic Commission for Latin America and the Caribbean
Project Document. Big data and open data as sustainability tools: a working paper prepared by the Economic Commission for Latin America and the Caribbean (ECLAC), supported by the European Commission.

This paper was prepared by Wilson Peres in the framework of a joint effort between the Division of Production, Productivity and Management and the Climate Change Unit of the Sustainable Development and Human Settlements Division of the Economic Commission for Latin America and the Caribbean (ECLAC). Some of the input for this document was made possible by a contribution from the European Commission through the EUROCLIMA programme. The author is grateful for the comments of Valeria Jordán and Laura Palacios of the Division of Production, Productivity and Management of ECLAC on the first version of this paper. He also thanks the coordinators of the ECLAC-International Development Research Centre (IDRC)-W3C Brazil project on Open Data for Development in Latin America and the Caribbean (see [online] www.od4d.org): Elisa Calza (United Nations University-Maastricht Economic and Social Research Institute on Innovation and Technology (UNU-MERIT) PhD Fellow) and Vagner Diniz (manager of the World Wide Web Consortium (W3C) Brazil Office). The European Commission support for the production of this publication does not constitute endorsement of the contents, which reflect the views of the author only, and the Commission cannot be held responsible for any use which may be made of the information obtained herein.
- Notes on Calculating Computer Performance
Bruce Jacob and Trevor Mudge, Advanced Computer Architecture Lab, EECS Department, University of Michigan. {blj,tnm}@umich.edu

Abstract. This report explains what it means to characterize the performance of a computer, and which methods are appropriate and inappropriate for the task. The most widely used metric is the performance on the SPEC benchmark suite of programs; currently, the results of running the SPEC benchmark suite are compiled into a single number using the geometric mean. The primary reason for using the geometric mean is that it preserves values across normalization, but unfortunately, it does not preserve total run time, which is probably the figure of greatest interest when performances are being compared. Cycles per Instruction (CPI) is another widely used metric, but this method is invalid, even if comparing machines with identical clock speeds. Comparing CPI values to judge performance falls prey to the same problems as averaging normalized values. In general, normalized values must not be averaged; instead of the geometric mean, either the harmonic or the arithmetic mean is the appropriate method for averaging a set of running times. The arithmetic mean should be used to average times, and the harmonic mean should be used to average rates (1/time). A number of published SPECmarks are recomputed using these means to demonstrate the effect of choosing a favorable algorithm.

1.0 Performance and the Use of Means. We want to summarize the performance of a computer; the easiest way uses a single number that can be compared against the numbers of other machines.
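The report's rule of thumb (arithmetic mean for times, harmonic mean for rates) can be checked on a tiny made-up workload; the benchmark times below are assumptions for illustration only:

```python
# Run times (seconds) of one machine on three benchmarks -- hypothetical numbers.
times = [2.0, 4.0, 8.0]

# Arithmetic mean is the right average for times: it preserves total run time.
arithmetic_time = sum(times) / len(times)                  # 14 / 3 ≈ 4.67 s

# Rates (work per second) should be averaged with the harmonic mean.
rates = [1.0 / t for t in times]
harmonic_rate = len(rates) / sum(1.0 / r for r in rates)   # 3 / 14 per second

# The two summaries agree: 1 / harmonic_rate equals the arithmetic mean time.
assert abs(1.0 / harmonic_rate - arithmetic_time) < 1e-12

# The geometric mean does not preserve total run time.
geometric_time = (times[0] * times[1] * times[2]) ** (1 / 3)   # 4.0 s
print(arithmetic_time, 1.0 / harmonic_rate, geometric_time)
```

Note how the geometric mean (4.0 s here) corresponds to neither the total nor the average run time, which is the report's central objection to using it for aggregation.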
Cross-Sectional Skewness
Sangmin Oh (Booth School of Business, The University of Chicago) and Jessica A. Wachter (The Wharton School, University of Pennsylvania). June 18, 2019.

Abstract. This paper evaluates skewness in the cross-section of stock returns in light of predictions from a well-known class of models. Cross-sectional skewness in monthly returns far exceeds what the standard lognormal model of returns would predict. In spite of the fact that cross-sectional skewness is positive, aggregate market skewness is negative. We present a model that accounts for both of these facts. This model also exhibits long-horizon skewness through the mechanism of nonstationary firm shares. The authors thank Hendrik Bessembinder, John Campbell, Marco Grotteria, Nishad Kapadia, Yapai Zhang, and seminar participants at the Wharton School for helpful comments.

1 Introduction. Underlying the cross-section of stock returns is a universe of heterogeneous entities commonly referred to as firms. What is the most useful approach to modeling these firms? For the aggregate market, there is a wide consensus concerning the form a model needs to take to be a plausible account of the data. While there are important differences, quantitatively successful models tend to feature a stochastic discount factor with stationary growth rates and permanent shocks, combined with aggregate cash flows that, too, have stationary growth rates and permanent shocks. No such consensus exists for the cross-section. We start with a simple model for stock returns to illustrate the puzzle. The model is not meant to be the final word on the cross-section, but rather to show that the most straightforward way to extend the consensus for the aggregate to the cross-section runs quickly into difficulties both with regard to data and to theory.
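As a hedged illustration of the quantity being measured (not of the authors' model), this sketch computes the skewness of one simulated monthly cross-section of returns; the lognormal parameters and firm count are arbitrary assumptions:

```python
import math
import random

random.seed(0)

# One month of returns for 5,000 hypothetical firms, drawn from a lognormal model.
returns = [math.exp(random.gauss(0.0, 0.3)) - 1.0 for _ in range(5000)]

n = len(returns)
mean = sum(returns) / n
sd = math.sqrt(sum((r - mean) ** 2 for r in returns) / n)

# Third standardized moment: positive when the cross-section is right-skewed.
skewness = sum(((r - mean) / sd) ** 3 for r in returns) / n
print(skewness)   # > 0 for a lognormal cross-section
```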
- Math 140 Introductory Statistics
Math 140 Introductory Statistics. Professor Bernardo Ábrego. Lecture 16, Sections 7.1 and 7.2.

Notation. Population: mean µ, standard deviation σ, size N. Sample: mean x̄, standard deviation s, size n. Sampling distribution of the sample mean: mean µ_x̄, standard deviation σ_x̄.

Properties of the sampling distribution of the sample mean. The mean µ_x̄ of the sampling distribution of x̄ equals the mean of the population µ: µ_x̄ = µ. The standard deviation σ_x̄ of the sampling distribution of x̄, also called the standard error of the mean, equals the standard deviation of the population σ divided by the square root of the sample size n: σ_x̄ = σ/√n. The shape of the sampling distribution will be approximately normal if the population is approximately normal; for other populations, the sampling distribution becomes more normal as n increases. This property is called the Central Limit Theorem. Problems usually involve a combination of these three properties, together with what we learned about the normal distribution.

Example 1: Average number of children. What is the probability that a random sample of 20 families in the United States will have an average of 1.5 children or fewer? The number of children per family, x, has proportions P(x): 0 children, 0.524; 1 child, 0.201; 2 children, 0.179; 3 children, 0.070; 4 or more, 0.026. The population mean is µ = 0.873 and the population standard deviation is σ = 1.095. [A bar chart of this distribution, proportion versus number of children 0–4, accompanies the slides.] Then µ_x̄ = µ = 0.873 and σ_x̄ = σ/√n = 1.095/√20 = 0.2448. The z-score of the value 1.5 is z = (x̄ − µ_x̄)/σ_x̄ = (1.5 − 0.873)/0.2448 ≈ 2.56, and normalcdf(−99999, 2.56) ≈ 0.9947. So in a random sample of 20 ...

Example 2: Reasonably likely averages. What average numbers of children are reasonably likely in a random sample of 20 families? Recall that the values that are in the middle 95% of a random distribution are called reasonably likely.
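The Example 1 arithmetic can be reproduced in a few lines; this is only a restatement of the slide's calculation using the standard normal CDF:

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma, n = 0.873, 1.095, 20
standard_error = sigma / math.sqrt(n)     # sigma_xbar = 1.095 / sqrt(20) ≈ 0.2448
z = (1.5 - mu) / standard_error           # ≈ 2.56

# P(sample mean <= 1.5), close to the normalcdf(-99999, 2.56) ≈ 0.9947 on the slides.
print(round(standard_error, 4), round(z, 2), round(normal_cdf(z), 4))
```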
- 4.1 What Is an Average?
STAT1010 – Mean (or Average). Chapter 4: Describing data. 4.1 Averages and measures of center (describing the center of a distribution); 4.2 Shapes of distributions (describing the shape); 4.3 Quantifying variation (describing the spread of a distribution).

4.1 What is an average? In statistics, we generally use the term mean instead of average, and the mean has a specific formula: mean = (sum of all values) / (total number of values). The term average could be interpreted in a variety of ways; thus, we'll focus on the mean of a distribution or set of numbers.

Example: Eight grocery stores sell the PR energy bar for the following prices: $1.09, $1.29, $1.29, $1.35, $1.39, $1.49, $1.59, $1.79. Find the mean of these prices. Solution: mean = ($1.09 + $1.29 + $1.29 + $1.35 + $1.39 + $1.49 + $1.59 + $1.79) / 8 = $1.41, so the mean price is $1.41.

Example: Octane rating, n = 40 observations: 87.4, 88.4, 88.7, 88.9, 89.3, 89.3, 89.6, 89.7, 89.8, 89.8, 89.9, 90.0, 90.1, 90.3, 90.4, 90.4, 90.4, 90.5, 90.6, 90.7, 91.0, 91.1, 91.1, 91.2, 91.2, 91.6, 91.6, 91.8, 91.8, 92.2, 92.2, 92.2, 92.3, 92.6, 92.7, 92.7, 93.0, 93.3, 93.7, 94.4.

Technical note (shorthand formula for the mean): Let x1, x2, …, xn represent n values. ...
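A minimal check of the grocery-store example, applying the same formula mean = (sum of all values) / (total number of values):

```python
prices = [1.09, 1.29, 1.29, 1.35, 1.39, 1.49, 1.59, 1.79]

n = len(prices)
mean = sum(prices) / n        # (1.09 + 1.29 + ... + 1.79) / 8
print(f"${mean:.2f}")         # $1.41
```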
- Have You Seen the Standard Deviation?
Nepalese Journal of Statistics, Vol. 3, 1-10. Jyotirmoy Sarkar and Mamunur Rashid. Submitted: 28 August 2018; accepted: 19 March 2019; published online: 16 September 2019. DOI: https://doi.org/10.3126/njs.v3i0.25574

ABSTRACT. Background: Sarkar and Rashid (2016a) introduced a geometric way to visualize the mean based on either the empirical cumulative distribution function of raw data, or the cumulative histogram of tabular data. Objective: Here, we extend the geometric method to visualize measures of spread such as the mean deviation, the root mean squared deviation and the standard deviation of similar data. Materials and Methods: We utilized an elementary high school geometric method and the graph of a quadratic transformation. Results: We obtain concrete depictions of various measures of spread. Conclusion: We anticipate such visualizations will help readers understand, distinguish and remember these concepts.

Keywords: Cumulative histogram, dot plot, empirical cumulative distribution function, histogram, mean, standard deviation.

Address correspondence to the author: Department of Mathematical Sciences, Indiana University-Purdue University Indianapolis, Indianapolis, IN 46202, USA.
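The geometric construction itself is not reproduced here, but the three spread measures the article visualizes can be computed directly; the data values below are made up for illustration:

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]      # a small made-up sample
n = len(data)
mean = sum(data) / n                 # 40 / 8 = 5.0

mean_deviation = sum(abs(x - mean) for x in data) / n                  # 1.5
rmsd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)               # 2.0 (divisor n)
sample_sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))    # ≈ 2.14 (divisor n - 1)

print(mean_deviation, rmsd, sample_sd)
```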
- Standards and Guidelines for Statistical Surveys (Office of Management and Budget, September 2006)
OFFICE OF MANAGEMENT AND BUDGET, Standards and Guidelines for Statistical Surveys, September 2006.

Table of Contents:
List of Standards for Statistical Surveys (p. i)
Introduction (p. 1)
Section 1: Development of Concepts, Methods, and Design (p. 5)
  1.1 Survey Planning (p. 5)
  1.2 Survey Design (p. 7)
  1.3 Survey Response Rates (p. 8)
  1.4 Pretesting Survey Systems (p. 9)
Section 2: Collection of Data (p. 9)
  2.1 Developing Sampling Frames (p. 9)
  2.2 Required Notifications to Potential Survey Respondents (p. 10)
  2.3 Data Collection Methodology (p. 11)
Section 3: Processing and Editing of Data (p. 13)
  3.1 Data Editing ...