Analysis of Variance Time Series Biostatistics Big Data
Introduction to Statistical Data Analysis Lecture 9: Miscellaneous Topics
James V. Lambers
Department of Mathematics The University of Southern Mississippi
Analysis of Variance
Previously, we have learned how to test a single mean, and how to compare two means. Analysis of variance, also known as ANOVA, is useful for comparing three or more population means.
The Hypotheses
Suppose that we have $m$ samples, the $i$th of size $n_i$, for $i = 1, 2, \ldots, m$, drawn from $m$ populations with means $\mu_1, \mu_2, \ldots, \mu_m$.
Let the $i$th sample consist of observations $x_{ij}$, $j = 1, 2, \ldots, n_i$, with sample mean $\bar{x}_i$.
For one-way ANOVA, the null hypothesis is $H_0 : \mu_1 = \mu_2 = \cdots = \mu_m$. That is, all of the population means are equal.
The alternative hypothesis $H_1$ is that at least two of the population means differ.
Stuff We Need
To perform one-way ANOVA, we need to compute the following:
The grand mean, which is the mean of all observations from all samples:
\[ \bar{\bar{x}} = \frac{\sum_{i=1}^{m} n_i \bar{x}_i}{\sum_{i=1}^{m} n_i} \]
The “sum of squares within groups”:
\[ SSW = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \]
The “sum of squares between groups”:
\[ SSB = \sum_{i=1}^{m} n_i (\bar{x}_i - \bar{\bar{x}})^2 \]
More Stuff We Need
Let $n = n_1 + n_2 + \cdots + n_m$ be the total number of observations from all samples.
\[ MSW = \frac{SSW}{n - m} \]
is the “within-sample variance”, also known as the “mean square error” (MSE). It is an estimate of the variance $\sigma^2$ whether $H_0$ is true or not.
\[ MSB = \frac{SSB}{m - 1} \]
is the “between-sample variance”. It is an estimate of the variance only if $H_0$ is true. Otherwise, it tends to be large compared to $MSW$.
The Test
The test statistic is
\[ F^* = \frac{MSB}{MSW}. \]
This is compared to the critical value $F_{\alpha,m-1,n-m}$ from the $F$-distribution.
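If $F^* \le F_{\alpha,m-1,n-m}$, then we do not reject $H_0$; the data do not provide sufficient evidence that the population means differ. Alternatively, we could compute the p-value $P(F > F^*)$ and compare it to our level of significance $\alpha$, rejecting $H_0$ if the p-value is less than $\alpha$.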
The F-distribution
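The $F$-distribution
1. is not symmetric; it is skewed to the right, with most of its mass near zero.
2. becomes more symmetric as its degrees of freedom increase.
3. has a total area under the curve of 1.
4. has a mean that is approximately 1 (exactly $d_2/(d_2 - 2)$ when the denominator degrees of freedom $d_2$ exceed 2).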
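The properties listed above can be checked numerically in R. The following is a minimal sketch; the degrees of freedom `d1 = 5` and `d2 = 20` are arbitrary values chosen for illustration and do not come from the lecture.

```r
# Illustrative degrees of freedom (assumed values, not from the slides)
d1 <- 5
d2 <- 20

# Total area under the density curve (should be 1 up to numerical error)
area <- integrate(function(x) df(x, d1, d2), lower = 0, upper = Inf)$value

# Mean of the F-distribution, valid for d2 > 2
f_mean <- d2 / (d2 - 2)

# Median; a median below the mean is consistent with right skew
f_median <- qf(0.5, d1, d2)

c(area = area, mean = f_mean, median = f_median)
```

Repeating the calculation with larger values of `d1` and `d2` should bring the median and the mean closer together, in line with property 2.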
Pairwise Comparisons
Suppose that $H_0$ from ANOVA is rejected, so that we conclude that at least two of the means differ. To find out which means are different, we can use the Scheffé test to compare each pair of sample means. For each pair $(i, j)$ with $1 \le i < j \le m$, the test statistic is
\[ F_S^* = \frac{(\bar{x}_i - \bar{x}_j)^2}{\dfrac{SSW}{n - m}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}. \]
This is compared to $F_{SC} = (m - 1)F_{\alpha,m-1,n-m}$.
If $F_S^* > F_{SC}$, we conclude that means $i$ and $j$ are different.
Example
A consumer group is testing the gas mileage of three different models of cars. Each car was driven 500 miles and the mileage recorded as follows:

Model 1   Model 2   Model 3
 22.5      18.7      17.2
 20.8      19.8      18.0
 22.0      20.4      21.1
 23.6      18.0      19.8
 21.3      21.4      18.6
 22.5      19.7
Example: One-way ANOVA
We have $n_1 = n_2 = 6$, $n_3 = 5$, $n = n_1 + n_2 + n_3 = 17$, and $m = 3$, so that $n - m = 14$.
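Our means are $\bar{x}_1 = 22.1$, $\bar{x}_2 = 19.7$, $\bar{x}_3 = 18.9$, and
\[ \bar{\bar{x}} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + n_3 \bar{x}_3}{n_1 + n_2 + n_3} = 20.3. \]
We then compute $SSW = 21.61$ and $SSB = 31.45$, followed by
\[ MSW = \frac{SSW}{n - m} = \frac{21.61}{14} = 1.54, \qquad MSB = \frac{SSB}{m - 1} = \frac{31.45}{2} = 15.73, \]
which yields $F^* = MSB/MSW = 10.19$ (computed from unrounded values).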
This is compared to $F_{0.05,2,14} = 3.74$. Since $F^* > F_{0.05,2,14}$, we reject $H_0$ and conclude that the population means are not all equal.
Example: Pairwise Comparisons
To see which means are not equal, we obtain
\[ F_{SC} = (m - 1)F_{\alpha,m-1,n-m} = 2F_{0.05,2,14} = 7.48. \]
This is compared against $F_S^*$ for all 3 pairs of means:
\[ F^*_{S,1,2} = \frac{(\bar{x}_1 - \bar{x}_2)^2}{\dfrac{SSW}{n - m}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)} = 11.66, \]
and similarly, $F^*_{S,1,3} = 17.83$ and $F^*_{S,2,3} = 0.93$.
Since $F^*_{S,1,2}$ and $F^*_{S,1,3}$ are both greater than $F_{SC}$, we conclude that those pairs of means are different, whereas $F^*_{S,2,3} < F_{SC}$, so we cannot conclude that $\mu_2$ and $\mu_3$ differ.
R Statements
The following R statements can be used for the preceding example:
> alpha=0.05
> x1=c(22.5,20.8,22.0,23.6,21.3,22.5)
> x2=c(18.7,19.8,20.4,18.0,21.4,19.7)
> x3=c(17.2,18.0,21.1,19.8,18.6)
> n1=length(x1)
> n2=length(x2)
> n3=length(x3)
> x1b=mean(x1)
> x2b=mean(x2)
> x3b=mean(x3)
> x=c(x1,x2,x3)
> xbb=mean(x)
R Statements, cont’d
> SSW=sum((x1-x1b)^2)+sum((x2-x2b)^2)+sum((x3-x3b)^2)
> SSB=n1*(x1b-xbb)^2+n2*(x2b-xbb)^2+n3*(x3b-xbb)^2
> n=n1+n2+n3
> m=3
> MSW=SSW/(n-m)
> MSB=SSB/(m-1)
> F=MSB/MSW
> FC=qf(1-alpha,m-1,n-m)
> F
[1] 10.18602
> FC
[1] 3.738892
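As a cross-check on the hand computation, the same test can be run with R's built-in model-fitting functions. This is only a sketch: the variable names `mileage` and `model` are introduced here for illustration, and the block also computes the p-value $P(F > F^*)$ mentioned earlier as an alternative to the critical value.

```r
# Mileage data from the example, stacked into one vector
mileage <- c(22.5, 20.8, 22.0, 23.6, 21.3, 22.5,   # Model 1
             18.7, 19.8, 20.4, 18.0, 21.4, 19.7,   # Model 2
             17.2, 18.0, 21.1, 19.8, 18.6)         # Model 3
model <- factor(rep(c("Model 1", "Model 2", "Model 3"), times = c(6, 6, 5)))

# One-way ANOVA table obtained from a linear model fit
tab <- anova(lm(mileage ~ model))
F_stat  <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]

# The same p-value computed directly as P(F > F*) with 2 and 14 degrees of freedom
p_direct <- pf(F_stat, df1 = 2, df2 = 14, lower.tail = FALSE)

alpha <- 0.05
reject_H0 <- p_value < alpha
print(tab)
```

The F value in the printed table should agree with the value 10.18602 obtained above, and the decision based on the p-value should match the one based on the critical value.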
R Code, Pairwise Comparisons
The following R code performs the pairwise comparisons:
> FSC=(m-1)*qf(1-alpha,m-1,n-m)
> FC12=(x1b-x2b)^2/(SSW/(n-m)*(1/n1+1/n2))
> FC13=(x1b-x3b)^2/(SSW/(n-m)*(1/n1+1/n3))
> FC23=(x2b-x3b)^2/(SSW/(n-m)*(1/n2+1/n3))
> FSC
[1] 7.477784
> FC12
[1] 11.66415
> FC13
[1] 17.82672
> FC23
[1] 0.9328217
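With more than three groups, writing out each pair by hand becomes tedious. The sketch below wraps the same Scheffé computation in a function that loops over all pairs; `scheffe_pairs` is a hypothetical helper written for this illustration, not a function from base R or from the lecture.

```r
# Data from the example
x1 <- c(22.5, 20.8, 22.0, 23.6, 21.3, 22.5)
x2 <- c(18.7, 19.8, 20.4, 18.0, 21.4, 19.7)
x3 <- c(17.2, 18.0, 21.1, 19.8, 18.6)

# Scheffe pairwise comparisons for a list of samples (hypothetical helper)
scheffe_pairs <- function(samples, alpha = 0.05) {
  m     <- length(samples)
  sizes <- sapply(samples, length)
  means <- sapply(samples, mean)
  n     <- sum(sizes)
  # Within-group sum of squares and the corresponding mean square
  SSW <- sum(sapply(samples, function(x) sum((x - mean(x))^2)))
  MSW <- SSW / (n - m)
  # Scheffe critical value
  FSC <- (m - 1) * qf(1 - alpha, m - 1, n - m)
  pairs <- combn(m, 2)              # each column is one pair (i, j) with i < j
  i <- pairs[1, ]
  j <- pairs[2, ]
  FS <- (means[i] - means[j])^2 / (MSW * (1 / sizes[i] + 1 / sizes[j]))
  data.frame(i = i, j = j, FS = FS, FSC = FSC, different = FS > FSC)
}

res <- scheffe_pairs(list(x1, x2, x3))
print(res)
```

For the mileage data, the `FS` column should reproduce the values of FC12, FC13 and FC23 shown above.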
Time Series
A time series is a sequence of data points measured at successive, equally spaced points in time.
Examples: the daily closing value of the Dow Jones Industrial Average, or the annual flow volume of the Mississippi River.
Time series are important in statistics, but also in many other fields, such as signal processing, econometrics, mathematical finance, meteorology, seismology, and more.
Time Series Analysis
Time series analysis refers to a broad spectrum of methods for extracting meaningful statistics and characteristics from time series. While regression analysis is used to compare time series to one another, time series analysis itself is concerned with the study of a single series. Time series analysis differs from other forms of data analysis due to the natural ordering of the observations, which is absent from general data sets. Time series analysis has many motivations, but for statistics, the main motivation is forecasting.
Types of Time Series Analysis
Methods for time series analysis can be classified in several ways:
- Frequency domain (e.g., Fourier analysis or wavelet analysis) or time domain (auto-correlation, cross-correlation)
- Parametric (regression, moving averages) or non-parametric (estimating covariance or spectrum)
- Linear or non-linear
- Univariate or multivariate
Time Series Patterns
Much of time series analysis is about detecting patterns within the data. Types of patterns include:
- Stationary
- Linear trend
- Cyclic trend
- Random variations
Exploratory Analysis
Exploratory analysis is used to get a feel for the behavior of the observations, as a function of time. Some approaches to exploratory analysis are:
- Line chart, for manual examination
- Autocorrelation, for detecting statistical dependence or correlation between points separated by a time lag
- Spectral analysis, to identify cyclic behavior (not necessarily seasonal, such as sunspots)
- Trend estimation/decomposition, to separate the time series into components representing trend, seasonality, variation at different speeds, and irregularity
Line Chart Example
(Figure omitted: line chart of TB cases over time.)
Note that while TB cases generally declined, spikes can be detected in 1975 and 1990.
Autocorrelation
Let $X_i$ be the value of a random process at time $i$, with mean $\mu_i$ and standard deviation $\sigma_i$. The autocorrelation between times $s$ and $t$ is
\[ R(s, t) = \frac{E[(X_t - \mu_t)(X_s - \mu_s)]}{\sigma_t \sigma_s}. \]
This value is between $-1$ and $1$, with $\pm 1$ representing perfect linear correlation. Autocorrelation is useful for finding repeated patterns in spite of noise.
Stationary Processes
A random process is stationary if its joint probability distribution does not change when shifted in time. If a time series is a second-order stationary process (that is, the mean and variance are time-independent, and the covariance between two times depends only on their separation), then the autocorrelation depends only on the lag $\tau = s - t$ between $t$ and $s$:
\[ R(\tau) = \frac{E[(X_t - \mu)(X_{t+\tau} - \mu)]}{\sigma^2} \]
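To illustrate how autocorrelation reveals a repeated pattern in spite of noise, the sketch below estimates $R(\tau)$ for a simulated series using R's `acf` function. The series (a sine wave with period 12 plus Gaussian noise), the seed, and the noise level are all assumptions made for this illustration.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
t <- 1:240
period <- 12                       # assumed period of the hidden cycle
x <- sin(2 * pi * t / period) + rnorm(length(t), sd = 0.5)

# Sample autocorrelation for lags 0, 1, ..., 24 (no plot)
r <- as.numeric(acf(x, lag.max = 24, plot = FALSE)$acf)

r[1]                # lag 0: always 1
r[period + 1]       # lag 12: positive, since the cycle repeats
r[period / 2 + 1]   # lag 6: negative, half a cycle out of phase
```

Calling `acf(x)` without `plot = FALSE` draws the correlogram, in which the cycle should appear as an oscillation across the lags.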
What is Biostatistics?
Biostatistics is the application of statistics to biological research.
Biostatistics deals with the design of experiments, the collection of data, the analysis of data, and the interpretation of results.
Applications
- Public health: epidemiology, health services research, nutrition, environmental health, healthcare policy and management. Check out https://publichealthwatch.wordpress.com by @RVAwonk
- Design and analysis of clinical trials
- Assessment of severity state of a patient
- Population genetics: correlating phenotype with genotype
- Human genetics: correlating alleles with diseases
- Climate envelope modeling: correlations between species distributions and environmental variables define a species’ tolerance
- Sequence analysis: matching new and existing DNA/RNA/peptide sequences to understand the biology of organisms
Studies in epidemiology
Types of studies:
- Case series
- Case-control studies
- Cohort studies
Case Series
A case series study compares periods of time during which patients are exposed to some potentially illness-causing factor with periods without exposure.
Poisson regression techniques are used to compare incidence rates between exposed and unexposed periods.
In Poisson regression, the response Y is assumed to follow a Poisson distribution, and the logarithm of its expected value depends linearly on the independent variables.
R code: glm(y ~ offset(log(exposure)) + x, family=poisson(link=log))
Case series are used to study adverse reactions to vaccination.
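The `glm` call above can be tried on a small data set. In the sketch below the counts and observation times are hypothetical numbers invented for illustration; `exposure` holds the time at risk (it enters the model as an offset) and `x` indicates whether the period is an exposed one. With only two rows the fitted model simply reproduces the crude rate ratio, $(12/400)/(20/2000) = 3$.

```r
# Hypothetical aggregated data: events and observation time per period type
d <- data.frame(
  y        = c(12, 20),      # number of adverse events
  exposure = c(400, 2000),   # person-days at risk
  x        = c(1, 0)         # 1 = exposed period, 0 = unexposed period
)

fit <- glm(y ~ offset(log(exposure)) + x,
           family = poisson(link = "log"), data = d)

# exp(coefficient of x) is the incidence rate ratio, exposed vs. unexposed
rate_ratio <- unname(exp(coef(fit)["x"]))
rate_ratio
```

With real data there would be one row per subject and period, possibly with further covariates; `summary(fit)` then also reports standard errors for the coefficients.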
Case-control Studies
Case-control studies are retrospective studies that select patients based on their disease status.
Given this table:

            Cases   Controls
Exposed       A        B
Unexposed     C        D

these studies examine the statistic AD/BC, the odds ratio (OR).
If OR ≫ 1, cases are likely linked to exposure. If OR ∼ 1, cases are not likely associated with exposure. If OR ≪ 1, exposure is protective.
Drawbacks: sensitive to bias, and cost prohibitive for smaller values of OR.
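As a small worked example, the odds ratio can be computed directly from the four cell counts. The counts below are hypothetical. The confidence interval is an addition not covered in the slides: it uses the usual large-sample approximation on the log scale, with standard error $\sqrt{1/A + 1/B + 1/C + 1/D}$.

```r
# Hypothetical 2x2 counts (illustration only)
A <- 40; B <- 20   # exposed:   cases, controls
C <- 60; D <- 80   # unexposed: cases, controls

# Odds ratio AD/BC
OR <- (A * D) / (B * C)

# Approximate 95% confidence interval, computed on the log scale
se_log_or <- sqrt(1 / A + 1 / B + 1 / C + 1 / D)
ci <- exp(log(OR) + c(-1, 1) * qnorm(0.975) * se_log_or)

c(OR = OR, lower = ci[1], upper = ci[2])
```

An interval that lies entirely above 1 supports an association between exposure and disease, subject to the bias concerns noted above.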
Cohort Studies
Cohort studies are prospective studies that select subjects based on their exposure status.
Using the same table as in Case-control Studies, the statistic of interest is the relative risk
\[ RR = \frac{P_e}{P_u}, \qquad P_e = \frac{A}{A + B}, \qquad P_u = \frac{C}{C + D}. \]
RR > 1 shows association.
RR is more reliable than OR, but cohort studies are more costly, and follow-up is problematic due to their long duration.
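The relative risk formula can be evaluated in the same way. The counts below are again hypothetical and chosen only to show the arithmetic.

```r
# Hypothetical 2x2 counts (illustration only)
A <- 40; B <- 20   # exposed:   with disease, without disease
C <- 60; D <- 80   # unexposed: with disease, without disease

P_e <- A / (A + B)   # risk among the exposed
P_u <- C / (C + D)   # risk among the unexposed
RR  <- P_e / P_u     # relative risk

c(P_e = P_e, P_u = P_u, RR = RR)
```

Here $P_e = 2/3$ and $P_u = 3/7$, so $RR = 14/9 \approx 1.56$, which indicates a higher risk among the exposed in this made-up table.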
Precision and Bias
Sources of error in epidemiological studies:
- Random error: can be reduced by increasing sample size or reducing measurement error, but both are costly
- Selection bias: for example, nonsmokers tend to participate in studies more often than smokers, which can skew results
- Information bias: systematic error in assessment of variables, for example recall bias
- Confounding: bias due to mixing of extraneous effects (confounders) with effects of interest
Jackknifing
Jackknifing is a resampling technique useful for variance and bias estimation.
After estimating a parameter from a sample of size $n$ (e.g. estimating a population mean with a sample mean), jackknifing entails computing $n$ new estimates, each based on the $n - 1$ observations that remain when one observation is left out, and then averaging those new estimates.
This allows bias to be substantially reduced.
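The following sketch shows the leave-one-out procedure and the resulting bias correction. The choice of estimator is an assumption made for illustration: the variance with divisor $n$, which is known to be biased. For this particular estimator the jackknife correction recovers the usual unbiased sample variance exactly. The data are the Model 1 mileages from the ANOVA example.

```r
x <- c(22.5, 20.8, 22.0, 23.6, 21.3, 22.5)   # Model 1 mileages
n <- length(x)

# A deliberately biased estimator: variance with divisor n instead of n - 1
biased_var <- function(v) mean((v - mean(v))^2)

theta_hat <- biased_var(x)

# Leave-one-out estimates, each based on n - 1 observations
theta_loo <- sapply(seq_len(n), function(i) biased_var(x[-i]))

# Jackknife bias estimate and bias-corrected estimate
bias_jack  <- (n - 1) * (mean(theta_loo) - theta_hat)
theta_jack <- theta_hat - bias_jack

c(biased = theta_hat, jackknife = theta_jack, unbiased = var(x))
```

The last two entries of the printed vector should coincide, since `var` uses the divisor $n - 1$.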
Bootstrapping
Inspired by jackknifing, bootstrapping (Efron 1979) is a resampling technique based on the idea that making an inference about a population based on a sample is analogous to making an inference about the sample based on a resample, but the accuracy of the latter can be measured because the “exact answer” is known.
Process: given an original sample of size $n$, take resamples of size $n$ using sampling with replacement, to obtain a sampling distribution of the desired parameter.
Bootstrapping is useful when the underlying distribution of the population is unknown, or when the sample size is too small for standard hypothesis testing.
It is particularly advantageous for obtaining standard errors and confidence intervals for complex estimators such as correlation coefficients.
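The process described above can be sketched for a correlation coefficient. The example uses R's built-in `cars` data set (speed and stopping distance), which is not part of the lecture; the seed and the number of resamples `B` are arbitrary choices, and the percentile interval is only one of several ways to form a bootstrap confidence interval.

```r
set.seed(42)                       # arbitrary seed, for reproducibility
data(cars)                         # built-in data set: speed and stopping distance
n <- nrow(cars)
r_hat <- cor(cars$speed, cars$dist)

B <- 2000                          # number of resamples (assumed value)
r_boot <- replicate(B, {
  idx <- sample(n, size = n, replace = TRUE)   # resample rows with replacement
  cor(cars$speed[idx], cars$dist[idx])
})

se_boot <- sd(r_boot)                          # bootstrap standard error
ci <- quantile(r_boot, c(0.025, 0.975))        # percentile 95% interval

list(estimate = r_hat, se = se_boot, ci = ci)
```

A histogram of `r_boot` gives a picture of the estimated sampling distribution of the correlation coefficient.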
What is Big Data?
Wikipedia: “Big data” refers to any collection of data sets so large and complex that processing with on-hand data management tools or traditional data processing applications becomes problematic.
SAS/Doug Laney: Big data is characterized by the “three V’s”: volume, velocity and variety. SAS adds two more dimensions: variability and complexity.
Whereas business intelligence uses descriptive statistics, big data uses inferential statistics.
Examples
- UPS uses telematic sensors in its trucks, in conjunction with map data, to redesign delivery routes and reduce fuel costs
- United Healthcare analyzes voice data from customer calls to detect dissatisfied customers for possible intervention
- Bank of America analyzes customer interactions in order to present appealing offers
- Sears has used big data technology and real-time processing to release complex marketing campaigns in one week rather than eight weeks
- GE installs sensors on the blades of its gas turbines that generate gigabytes of data per day (each) to detect and analyze defects
- Schneider uses driving sensors on its trucks to detect indicators of potential accidents and allows intervention before they occur
Case Study: 2012 Presidential Election
Organization of Obama campaign: driven by centralized analytics.
Key ingredients:
- Dedicated analytics department, and analysts at campaign offices and in the field
- HP Vertica, R and Stata
- Hadoop considered, but not as scalable or efficient, plus steeper learning curve
- AirWolf: data from field fed back into Vertica to drive digital arm of campaign
- Media Optimizer: combined voter data, TV ratings data and ad price data for targeted ad buys
Exploratory Data Analysis
The size and complexity of raw data sets inhibit the discovery of insight from Big Data. Only if the data is viewed in the right way can such insight be gained.
Statistical methods and machine learning algorithms can only help so much. The best tool for the job? Our brains!
Exploratory Data Analysis (EDA) involves “playing with data”, and is the first step in working with Big Data.
Techniques: visualization, principal component analysis, multidimensional scaling, etc.
Requirement: creativity!
Business Analytics
Business analytics uses data from past performance to predict future performance and guide business planning.
Types of business analytics:
- Descriptive analytics: descriptive statistics applied to historical data
- Predictive analytics: techniques from inferential statistics and machine learning to make predictions from historical data
- Prescriptive analytics: a variation of predictive analytics in which data is used to guide decisions and predict their effects
Predictive analytics
Main idea: use predictive models to forecast data that does not yet exist.
Techniques used: regression (linear, logistic, time series, etc.), machine learning (neural networks, support vector machines, etc.)
Examples:
- CRM: using customer information (e.g. order history) to predict future purchases and promote relevant products at touch points
- Clinical decision support: using patient data to predict development of disease
- Non-temporal prediction: using social media activity data to predict potential to influence, or predicting one’s sentiment from their postings
Prescriptive analytics
Beyond predicting what may happen, prescriptive analytics guides decisions based on predictions.
Prescriptive analytics adds these components to predictive analytics:
- Actionability: must be able to act on whatever prediction is produced
- Feedback: the result of any action is fed back into the predictive model to refine the prediction
Requirement: predictive window > reaction time
Examples:
- Oil and gas: guiding deployment of capital to maximize effectiveness of resource extraction in spite of uncertainty
- Healthcare: guiding care delivery and capital investments by providers, or administration of clinical trials for pharmaceutical companies
Google Flu Trends
In 2008, Google launched “Google Flu Trends”, to track the spread of flu across the United States. Using their top 50 million search terms, they looked for a correlation between searches and flu symptoms. As a result, flu could be tracked by Google Flu Trends more rapidly than by the CDC, without using data relied upon by the CDC obtained from physicians, and without any hypothesis about which search terms might correlate with flu symptoms; algorithms did all the work.
Target
In Minnesota, a man complained to Target because they were sending maternity-oriented coupons to his teenage daughter. As it turned out, she really was pregnant. Target suspected before her father did, based on her purchases of unscented wipes and magnesium supplements.
Street Bump
The city of Boston released the Street Bump app to enable smartphones to automatically detect potholes using their accelerometers. This way, city workers did not need to patrol the streets looking for potholes; the app could tell them where to go!
Implications?
Incidents like these can lead us to believe:
- Data analysis can produce amazingly accurate results
- Every single data point can be captured (“n = All”), so who needs sampling anymore?
- Why worry about causation, if correlation tells us what we need to know?
- Given enough data, the numbers speak for themselves, so who needs theory?
Not So Fast...
By 2013, Google Flu Trends had lost its accuracy, instead dramatically overstating the prevalence of flu.
The lesson: correlation isn’t everything. If you don’t know the reason behind a correlation, you also don’t know what can cause it to break down!
Possible explanation: news stories about the flu in 2012 may have prompted flu-related searches by healthy people.
Remember history: The Literary Digest sampled over 2 million readers to predict the result of the 1936 presidential election and got it wrong, while Gallup got it right with 3000. Statistical techniques matter.
Oh, What About Target?
They may have “discovered” one pregnancy, but they used similar coupon-distribution techniques on many other customers for whom their “discoveries” were wrong. The numbers don’t tell a story by themselves, no matter how many of them there are. Rudimentary analysis produces, at best, a somewhat educated guess.
“There are a lot of small data problems in big data...they don’t disappear because you’ve got lots of the stuff. They get worse.” – Prof. David Spiegelhalter, Cambridge
And, About those Potholes...
The Street Bump app works well, but consider this: which residents does it really serve? The effect of the app was that potholes were fixed in areas with young, affluent residents who are more likely to own smartphones. Every bump from every enabled smartphone may have been recorded, but many potholes were still overlooked! The lesson: “n = All” is a seductive illusion. Don’t fall for it!
Big Data: Warning #1
Using very large data sets can help detect correlations that might be overlooked by smaller samples. However, that doesn’t mean that these correlations are meaningful! From 2006 to 2011, the market share of Internet Explorer dropped precipitously...as did the murder rate in the United States. What does this tell us?
Big Data: Warning #2
Big data should be considered a tool for scientific inquiry, not a replacement! Example: molecular biologists can’t understand the structure of proteins from data alone; any statistical analysis must be informed by knowledge of the underlying physics and biochemistry.
Big Data: Warning #3
Tools based on big data can be gamed!
Example: “Google bombing” to manipulate search results.
A more insidious example: big data programs for grading student essays tend to favor characteristics like sentence length or word sophistication, due to their correlation with grades given by humans. What do you expect will happen?
Big Data: Warning #4
Large data sets are constantly changing, and so is the way in which they are collected. This is especially true for data based on web requests, such as data that Google collects. In fact, the changing nature of its data, and failure to take this change into account, is considered a contributing factor to the issues with Google Flu Trends.
Big Data: Warning #5
Beware of the echo-chamber effect: much of big data comes from the web, but much of what’s on the web comes from big data! Example: Google Translate relies on pairs of parallel texts written in different languages, such as Wikipedia articles written in two languages. But how many of these foreign-language Wikipedia pages were written using Google Translate?!
Big Data: Warning #6
If you look long enough for a correlation, you will find one! That is, given enough data, strong correlations may appear simply by chance. The more data, the more bogus patterns can be found, resulting in badly flawed inferences!
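This warning is easy to demonstrate by simulation. In the sketch below every variable is independent random noise, so every true correlation is zero, yet searching through all pairs still turns up correlations that look strong. The numbers of observations and variables and the seed are arbitrary choices made for this illustration.

```r
set.seed(2024)                     # arbitrary seed, for reproducibility
n_obs  <- 20                       # observations per variable
n_vars <- 200                      # number of unrelated variables

# Every column is independent noise, so all true correlations are zero
X <- matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)

R <- cor(X)
r_pairs <- R[upper.tri(R)]         # one entry per distinct pair of variables

length(r_pairs)                    # number of pairs searched: 200 * 199 / 2
max(abs(r_pairs))                  # strongest correlation found by chance
mean(abs(r_pairs) > 0.5)           # share of pairs with |r| > 0.5
```

Increasing `n_vars` while keeping `n_obs` fixed enlarges the number of pairs searched, which tends to make the strongest chance correlation even larger.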
Big Data: Warning #7
Don’t be fooled into thinking that big data can deliver scientific-sounding answers to questions that we could never precisely answer. Example: Wikipedia is being used to rank people in terms of “historical importance” or “cultural contributions.” Two separate projects of this kind correctly identify Jesus, Lincoln and Shakespeare as very important people, but are we really supposed to believe that Nostradamus was the 20th most important writer in history, or that Francis Scott Key was the 19th most important poet? Big data can reduce anything to a single number, but that doesn’t mean it should be accepted!
Big Data: Warning #8
Big data works well when analyzing patterns that appear very frequently, but falters on rare occurrences. Example: search engines rely on trigrams, which are three-word sequences. Use Google Translate to translate the trigram “dumbed-down escapist fare” into German, and then back to English.
Big Data: Warning #9
The biggest problem of all with big data? People who fall for the hype! We need to view big data with a reasonable perspective. There is no silver bullet. Big data is a valuable resource, it’s not going away, and its full implications are yet to be realized. But that doesn’t mean we should even consider tossing out centuries of ingenuity that has led to the data analysis techniques that we have today. In fact, the arrival of big data means we need these techniques more than ever!