Big Data: Effective Processing and Analysis of Very Large and Unstructured Data for Official Statistics

Big Data and Inference in Official Statistics, Part 2

Diego Zardetto, Istat ([email protected])

Eurostat

Big Data: Methodological Pitfalls

 Outline
• Representativeness (w.r.t. the desired target population)
   Selection bias
   Actual target population unknown
   Often sample units’ identity unclear/fuzzy
   Pre-processing errors (acting like measurement errors in surveys)
   Social media & sentiment analysis: pointless babble & social bots
• Ubiquitous correlations
   Causation fallacy
   Spurious correlations
• Structural break in Nowcasting Algorithms


Social Networks: representativeness (1/3)

 The “Pew Research Center” carries out periodic sample surveys on the way US people use the Internet
 According to its Social Media Update 2013 research:
• Among US Internet users aged 18+ (as of Sept 2013):
   73% of online adults use some Social Network
   71% of online adults use Facebook
   18% of online adults use Twitter
   17% of online adults use Instagram
   21% of online adults use Pinterest
   22% of online adults use LinkedIn
• Subpopulations using these Social platforms turn out to be quite different in terms of:
   Gender, Ethnicity, Age, Education, Income, Urbanity


Social Networks: representativeness (2/3)

 According to the Social Media Update 2013 research:
• Social platform users turn out to match typical social & demographic profiles
   Pinterest is skewed towards females
   LinkedIn over-represents higher education and higher income
   Twitter and Instagram are biased towards younger adults, urban dwellers, and non-whites
   Facebook too is skewed towards younger people (though much less than Twitter and Instagram) and females
• More interestingly, even taking all such subpopulations as a whole, Social Network users turn out to constitute a biased sample of US online adults
• Big Data generated via Social Networks are therefore very likely to be non-representative of the US general population


Social Networks: representativeness (3/3)

 Incidence of Social Network users among US Internet adults within specific social/demographic groups – Sept 2013
• Superscript letters (e.g. a) indicate statistically significant differences between cells


Tweets: the relevance issue

 There is growing evidence that a big portion of tweets lack any relevant content
• And even more so from the Official Statistics perspective
 Such tweets are categorized as Pointless Babble
• e.g. “I am eating a sandwich now”
 Pear analytics’ “Twitter Study 2009” estimated that about 40% of tweets were Pointless Babble
 Recent extensive studies conducted by Statistics Netherlands on millions of Dutch tweets revealed that more than 50% had no statistical interest
 Challenge: how to filter out such big noise in order to uncover meaningful signals inside huge amounts of tweets?


How many Twitter users are actually humans?

 According to her Twitter bio, Carina Santos (@scarina91) is a young, blonde-haired girl based in Rio de Janeiro, working as a journalist at Globo, a 24-hour news channel in Brazil
• She sends around 50 tweets per day on a variety of popular topics (spanning from sports news to gossip) to her nearly 1,000 followers
• Twitalyzer, a social media analytics firm, judged her to be more influential than online personalities like Oprah Winfrey or the Dalai Lama

How many Twitter users are actually humans?

 Carina Santos is not a human but rather a social bot created in 2011 by a team of Brazilian computer science researchers:
• a software robot able to interact and communicate with humans by following appropriate social behaviors and rules
 “She” constantly checked Twitter API restrictions to avoid being identified as a bot and blocked by Twitter:
   maximum number of requests per hour is 350
   each user can follow up to 2,000 users
   a user cannot follow more than 1,000 users per day


Tweets: the Social Bots issue

 Security analysts and artificial intelligence researchers agree that the current amount of Twitter bots is big and growing fast
• It has been observed that thousands of dormant Twitter profiles suddenly began posting hundreds of messages during recent years’ social turmoil and political crises around the world (e.g. the Syrian civil war and the Arab Spring protests in Egypt, Libya, Tunisia)
 PeekAnalytics estimated in 2011 that only 35% of Twitter users’ followers were real people
 Maybe this estimate is not accurate, or even grossly exaggerated, but the problem is still there:
• How much of what can be mined from tweets can safely be used to analyze, interpret or predict people’s behavior?
 Challenge: beyond removing the pointless babble noise, how to discriminate bot-generated tweets deliberately intended to mimic meaningful human signals? How to adjust analyses for bot traffic?


Inference in the Official Statistics Realm

 Outline
 Traditional paradigm
• Top-down: data are planned
• Traditional inference approaches
   Design based survey sampling theory
   Model-assisted approach
   Model based inference
 The need of a new paradigm to deal with Big Data
• Bottom-up: data are already there
• Exploratory analysis / Knowledge discovery approach
   Algorithmic inference: data mining techniques, machine learning, …


Design Based Inference (1/2)

 Probability sampling
• Samples are drawn by means of rigorously random algorithms
• Every unit in the population has a known, non-zero probability of being selected in the random sample
 Data sampling is entirely controlled
 Ideally, THE STATISTICIAN IS THE ONE AND ONLY RANDOMIZER
   “Ideally” means ignoring all non-sampling errors (e.g. list problems, total and item nonresponse, measurement errors, …)
 This approach makes it possible to:
• Build unbiased estimators (or nearly so)
   even if samples are not naively representative, because we can adjust for unequal inclusion probabilities
• Exploit probability theory to assess the quality of the obtained estimates


Design Based Inference (2/2)

$$U = \{1, 2, \ldots, N\} \qquad s = \{1, 2, \ldots, n\}, \quad s \in S$$

$$p : S \to [0, 1] \qquad p(s) = \Pr(s \text{ is selected})$$

$$\pi_k = \Pr(k \in s) = \sum_{s \ni k} p(s) \qquad \pi_{kj} = \Pr(\{k, j\} \subset s) = \sum_{s \supset \{k, j\}} p(s)$$

$$\hat{Y}_{HT} = \sum_{k \in s} d_k y_k = \sum_{k \in s} y_k / \pi_k \qquad d_k = 1/\pi_k$$

$$\hat{V}(\hat{Y}_{HT}) = \sum_{k \in s} \sum_{j \in s} \frac{\Delta_{kj}}{\pi_{kj}}\, d_k y_k\, d_j y_j = \sum_{k \in s} \sum_{j \in s} \left(1 - \frac{\pi_k \pi_j}{\pi_{kj}}\right) d_k y_k\, d_j y_j \qquad \Delta_{kj} = \pi_{kj} - \pi_k \pi_j$$


Model Assisted Inference (1/2)

 Reference framework is still Design Based
• Observed values (y, x) of the interest variable Y and the auxiliary variables X are deterministic, non-random quantities
• Relations (if any) between Y and X are generated by Nature (i.e. by real-world, domain-specific phenomena)
 Information about the target population is available from sources external to the survey at hand
• Can use this information to describe relations between Y and X through a model
 Model Assisted Inference is a suite of methods to improve the quality of Design Based inferences by hinging upon available auxiliary information in a systematic and rigorous way:
   build more efficient (but still nearly unbiased) estimators
   reduce nonresponse bias
 The model is assisting only (i.e. descriptive): no stochastic structure is ever assumed!


Model Assisted Inference (2/2)

$$\xi : \; y_k = \mathbf{x}_k \cdot \boldsymbol{\beta} + \varepsilon_k$$

$$E_\xi(\varepsilon_k) = 0 \quad \forall k \in U \qquad V_\xi(\varepsilon_k) = \sigma^2 < \infty \quad \forall k \in U \qquad \mathrm{Cov}_\xi(\varepsilon_k, \varepsilon_j) = 0 \quad \forall k \neq j \in U$$

$$\hat{Y}_{GREG} = \Big(\sum_{k \in U} \mathbf{x}_k\Big) \cdot \hat{\boldsymbol{\beta}} + \sum_{k \in s} d_k \,(y_k - \mathbf{x}_k \cdot \hat{\boldsymbol{\beta}}) = \hat{Y}_{HT} + (\mathbf{X} - \hat{\mathbf{X}}_{HT}) \cdot \hat{\boldsymbol{\beta}} = \sum_{k \in s} w_k y_k$$

$$\mathbf{X} = \sum_{k \in U} \mathbf{x}_k \qquad w_k = g_k d_k \qquad g_k = 1 + (\mathbf{X} - \hat{\mathbf{X}}_{HT})^t \cdot \mathbf{T}^{-1} \cdot \mathbf{x}_k \qquad \mathbf{T} = \sum_{k \in s} d_k \mathbf{x}_k \mathbf{x}_k^t$$

$$e_k = y_k - \mathbf{x}_k \cdot \hat{\boldsymbol{\beta}} \qquad \hat{V}(\hat{Y}_{GREG}) \approx \hat{V}(\hat{Y}_{GREG,lin}) = \hat{V}\Big(\sum_{k \in s} d_k \,(g_k e_k)\Big)$$
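To make the mechanics concrete, a minimal pure-Python sketch of the GREG point estimator in the special case of a single auxiliary variable and no intercept; function and variable names are illustrative, not taken from any survey package.

```python
def greg_total(sample_x, sample_y, d, x_pop_total):
    """GREG estimator of the population total with one auxiliary variable
    and no intercept: beta_hat is the design-weighted least-squares slope,
    and the HT estimator of the residual total corrects the synthetic
    part x_pop_total * beta_hat."""
    beta_hat = (sum(dk * xk * yk for dk, xk, yk in zip(d, sample_x, sample_y))
                / sum(dk * xk * xk for dk, xk in zip(d, sample_x)))
    residual_ht = sum(dk * (yk - beta_hat * xk)
                      for dk, xk, yk in zip(d, sample_x, sample_y))
    return x_pop_total * beta_hat + residual_ht

# Toy data: y is exactly 2x, so the GREG estimate must be 2 * X_total
# whatever the design weights are (the residual correction vanishes).
greg_estimate = greg_total([1.0, 2.0, 3.0], [2.0, 4.0, 6.0],
                           [10.0, 10.0, 10.0], 100.0)
```

When the model fits perfectly, the estimator collapses onto the known auxiliary total, which is exactly the efficiency gain the slide describes.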


Model Based Inference (1/2)

 The sampling design is non-informative and can be ignored
   Equivalent to assuming that the data have been selected by simple random sampling with replacement
 Observed values (y, x) are regarded as realizations of random variables: Y (the response variable) and X (the explanatory variables)
 Assumptions are made on the probability distribution of Y and (sometimes) of X
 A statistical model linking the response variable Y to the explanatory variables X is proposed and tested against the observed data (y, x)
• Often models describe how the conditional expectation of Y given X (or some nice/smart function of it, think of GLM) depends on X
 NATURE IS THE RANDOMIZER
• A Model is a (human-interpretable) guess on “how” Nature is working


Model Based Inference (2/2)

$$\xi : \; \mathbf{Y} \sim N(\mathbf{X} \boldsymbol{\beta}, \, \sigma^2 \mathbf{I})$$

$$E_\xi(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X} \boldsymbol{\beta} \qquad V_\xi(\mathbf{Y} \mid \mathbf{X}) = \sigma^2 \mathbf{I}, \quad \sigma^2 < \infty$$

$$\hat{y}(\mathbf{x}) = \mathbf{x} \cdot \hat{\boldsymbol{\beta}} \qquad \hat{\boldsymbol{\beta}} = (\mathbf{X}^t \mathbf{X})^{-1} (\mathbf{X}^t \mathbf{Y})$$

$$\hat{Y}_\xi = \sum_{k \in s} y_k + \sum_{k \in (U - s)} \hat{y}_k = \sum_{k \in s} y_k + \sum_{k \in (U - s)} \mathbf{x}_k \cdot \hat{\boldsymbol{\beta}}$$
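A minimal pure-Python sketch of this prediction estimator, using OLS through the origin with a single explanatory variable; the data and names are invented for the example.

```python
def model_based_total(sample, nonsample_x):
    """Model-based estimator of the population total: observed y values
    enter as they are, while unobserved units are predicted from the
    fitted model y_hat = beta_hat * x (OLS through the origin)."""
    beta_hat = (sum(x * y for x, y in sample)
                / sum(x * x for x, _ in sample))
    return sum(y for _, y in sample) + sum(beta_hat * x for x in nonsample_x)

# Observed part of the population (x_k, y_k) and the x values of the
# unobserved part (U - s):
sample = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0)]
nonsample_x = [4.0, 5.0]
total_hat = model_based_total(sample, nonsample_x)
```

Note how the design plays no role here: only the model ξ justifies filling in the non-sampled units.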


Algorithmic Inference (1/2)

 Models are replaced by Algorithms, whose ultimate goal is to map X values (input) to Y values (predicted output)
• Machine Learning: emphasis on prediction, i.e. learning data properties from a training set and being able to mimic “what” would happen to another data set
• Data Mining: emphasis on data exploration and discovery (of correlations, rules, clusters, patterns, hidden structures, …)
 Fitting a Model is replaced by tuning an Algorithm, i.e. adjusting its free parameters so that it “works well”
• To avoid overfitting, to choose among different candidate algorithms, and to validate results, the input data (X, Y) are often split into:
   Training set (to learn from the data)
   Validation set (to avoid overfitting / select the best candidate algorithm)
   Test set (to evaluate the quality of the results: accuracy, sensitivity, specificity, generalization error, …)
 Algorithms are validated by assessing their performance on a test set
• This replaces hypothesis tests on goodness-of-fit statistics and residual diagnostics for models
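The split-and-tune protocol above can be sketched with a deliberately trivial one-parameter "algorithm" on synthetic data; everything here (the predictor y = αx, the grid of candidates, the 60/20/20 split) is an invented toy, not a recommended recipe.

```python
import random

random.seed(0)
# Synthetic data: y = 3x + noise. The "algorithm" is the one-parameter
# predictor y_hat = alpha * x, so tuning means choosing alpha.
data = [(x, 3 * x + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)

# Split the input data into training / validation / test sets (60/20/20).
train, valid, test = data[:60], data[60:80], data[80:]

def loss(dataset, alpha):
    """Mean squared prediction error of y_hat = alpha * x."""
    return sum((y - alpha * x) ** 2 for x, y in dataset) / len(dataset)

# Tune the free parameter on the training set (a simple grid search);
# with several competing algorithm families, the validation loss would
# arbitrate among their tuned representatives.
candidates = [i / 10 for i in range(61)]
alpha_star = min(candidates, key=lambda a: loss(train, a))
validation_error = loss(valid, alpha_star)

# The test set, untouched during tuning, estimates the generalization error.
test_error = loss(test, alpha_star)
```

The essential discipline is that the test set never influences any choice made during tuning, so `test_error` is an honest stand-in for performance on new data.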


Algorithmic Inference (2/2)

Input:
• Data set: D = {x_k, y_k}, k = 1, …, n
• Family of algorithms: F : x → ŷ, with F = F(x, α_i), i = 1, …, q
• Loss function: Loss : (ŷ − y) → ℝ⁺

Output:
• α_i* such that Loss(D) is minimum

$$\hat{Y}_{Alg} = \sum_{k \in s} y_k + \sum_{k \in (U - s)} \hat{y}_k = \sum_{k \in s} y_k + \sum_{k \in (U - s)} F(\mathbf{x}_k, \alpha_i^*)$$


Big Data: why traditional inference methods cannot succeed

 The computational complexity barrier
• Example: matrix inversion (ubiquitous: least squares estimators, GLM maximum likelihood via the Newton-Raphson algorithm) is O(n^3)
• Most traditional algorithms are difficult to parallelize (for achieving Hadoop / MapReduce scalability)
• …
 Extreme sensitivity to erroneous data / outliers
• Big Data are noisy and unstructured
   But due to the huge volume, one cannot apply thorough procedures for Editing & Imputation / Outlier detection
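To illustrate what "parallelizable" means here, a toy pure-Python sketch (not actual Hadoop code) of a MapReduce-style aggregation: a mean decomposes into associative per-chunk (sum, count) pairs that independent workers can compute and merge, whereas the matrix inversion inside a least-squares fit admits no such simple decomposition.

```python
from functools import reduce

def map_chunk(chunk):
    """Map step: each worker reduces its chunk to a (sum, count) pair."""
    return (sum(chunk), len(chunk))

def combine(a, b):
    """Reduce step: (sum, count) pairs merge associatively, in any order."""
    return (a[0] + b[0], a[1] + b[1])

data = list(range(1, 101))                            # units 1 .. 100
chunks = [data[i:i + 25] for i in range(0, 100, 25)]  # 4 simulated workers
total, count = reduce(combine, map(map_chunk, chunks))
mean = total / count                                  # 50.5
```

Because `combine` is associative, the four map results could be merged in any order on any machine, which is exactly the property an O(n^3) inversion lacks.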


Conclusion (1/2)

 Current methods in Official Statistics (e.g. design based and model assisted survey sampling theory, regression theory, generalized linear models, small area estimation methods, …) hinge upon specific features of NSIs’ traditional data, namely
• small amounts of high quality data
 These methods:
• are extremely sensitive to outliers and erroneous data (which explains the tremendous effort put by NSIs into data checking and cleaning activities)
• typically exhibit high computational complexity (power-law behavior is the rule, a feature that hinders their scalability on huge amounts of data)
 Synthesis: NSIs’ statistical methods and Big Data are poles apart, at present
 Diagnosis: in order to let Big Data gain ground in Official Statistics, NSIs will have to undertake a radical paradigm shift in statistical methodology


Conclusion (2/2)

 Although it is far from obvious how to translate such awareness into actual proposals, we deem that new candidate methods should be:
1. more robust (i.e. more tolerant towards both erroneous data and departures from model assumptions), perhaps at the price of some accuracy loss
2. less demanding in terms of a clear and complete understanding of the obtained results in the light of an explicit statistical model (think of Artificial Neural Networks, Support Vector Machines, Classification and Regression Trees, Random Forests, …)
3. based on approximate (rather than exact) optimization techniques, which:
   are able to cope with noisy objective functions (as implied by low quality input data)
   typically ensure the mandatory scalability requirement inherent in Big Data processing, thanks to their implicit parallelism (think of stochastic metaheuristics like, e.g., Evolutionary Algorithms, Ant Colonies, Particle Swarms, …)


Tentative Bibliography (1/2)

 [Pew Research Center] “Social Media Update 2013”
   http://www.pewinternet.org/2013/12/30/social-media-update-2013/

 [Pew Research Center] “Social Networking Fact Sheet”, Sept 2013
   http://www.pewinternet.org/fact-sheets/social-networking-fact-sheet/

 [Pear analytics] “Twitter Study 2009”, San Antonio, Texas, USA, Aug 2009
   https://www.pearanalytics.com/wp-content/uploads/2012/12/Twitter-Study-August-2009.pdf

 [P. Daas et al.] “Twitter as a potential data source for statistics”, Discussion paper 201221, The Hague/Heerlen: Statistics Netherlands, Dec 2012
   http://www.cbs.nl/NR/rdonlyres/04B7DD23-5443-4F98-B466-1C67AAA19527/0/201221x10pub.pdf

 [J. Messias et al.] “You followed my bot! Transforming robots into influential users in Twitter”, First Monday 18(7), 2013
   http://firstmonday.org/ojs/index.php/fm/article/view/4217/3700

 [D. Main] “How Much of Twitter Is Spam?”, Popular Mechanics, 4 Aug 2011
   http://www.popularmechanics.com/technology/how-much-of-twitter-is-spam


Tentative Bibliography (2/2)

 [P. Daas et al.] “Big Data as a Source of Statistical Information”, The Survey Statistician, Jan 2014
   http://isi.cbs.nl/iass/N69.pdf

 [P. Daas et al.] “Big Data and Official Statistics”, NTTS conference, Brussels, Belgium, Mar 2013
   http://www.cros-portal.eu/sites/default/files/NTTS2013fullPaper_76.pdf

 [B. Buelens et al.] “Shifting paradigms in official statistics: from design-based to model-based to algorithmic inference”, Discussion paper, Statistics Netherlands, 2012
   http://www.cbs.nl/NR/rdonlyres/A94F8139-3DEE-45E3-AE38-772F8869DD8C/0/201218x10pub.pdf

 [M. Scannapieco et al.] “Placing Big Data in Official Statistics: A Big Challenge?”, NTTS conference, Brussels, Belgium, Mar 2013
   http://cros-portal.eu/sites/default/files/NTTS2013fullPaper_214.pdf

 [L. Breiman] “Statistical modeling: The two cultures”, Statistical Science, Vol. 16, No. 3, 2001
   http://www.uni-leipzig.de/~strimmer/lab/courses/ss09/current-topics/download/breiman2001.pdf

