Big Data Effective Processing and Analysis of Very Large and Unstructured Data for Official Statistics
Big Data and Inference in Official Statistics
Part 2
Diego Zardetto, Istat ([email protected])
Eurostat

Big Data: Methodological Pitfalls
Outline
• Representativeness (w.r.t. the desired target population)
  - Selection bias
  - Actual target population unknown
  - Often sample units’ identity unclear/fuzzy
  - Pre-processing errors (acting like measurement errors in surveys)
  - Social media & sentiment analysis: pointless babble & social bots
• Ubiquitous correlations
  - Causation fallacy
  - Spurious correlations
• Structural break in Nowcasting Algorithms

Social Networks: representativeness (1/3)
The Pew Research Center carries out periodic sample surveys on how people in the US use the Internet.
According to its Social Media Update 2013 research:
• Among US Internet users aged 18+ (as of Sept 2013)
  - 73% of online adults use some Social Networks
  - 71% of online adults use Facebook
  - 18% of online adults use Twitter
  - 17% of online adults use Instagram
  - 21% of online adults use Pinterest
  - 22% of online adults use LinkedIn
• Subpopulations using these Social platforms turn out to be quite different in terms of: Gender, Ethnicity, Age, Education, Income, Urbanity

Social Networks: representativeness (2/3)
According to the Social Media Update 2013 research:
• Social platform users turn out to belong to typical social & demographic profiles
  - Pinterest is skewed towards females
  - LinkedIn over-represents higher education and higher income
  - Twitter and Instagram are biased towards younger adults, urban dwellers, and non-whites
  - Facebook too is skewed towards younger people (though much less than Twitter and Instagram) and females
• More interestingly, even looking at all such subpopulations as a whole, Social Network users turn out to constitute a biased sample of US online adults
• Big Data generated via Social Networks are very likely to be non-representative of the US general population
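The selection bias described above can be made concrete with a small worked example. The following is only a minimal sketch: the two age groups, their population shares, platform usage rates and opinion rates are hypothetical numbers chosen for illustration, not figures from the Pew survey. It compares the true population proportion with the proportion one would observe by looking at platform users only.

```python
# Hypothetical figures, for illustration only (NOT taken from the Pew survey).
# group: (population share, platform usage rate, share with some characteristic)
groups = {
    "18-29": (0.25, 0.80, 0.60),
    "30+": (0.75, 0.20, 0.30),
}


def population_proportion(groups):
    """True proportion of the characteristic in the whole population."""
    return sum(share * y for share, _, y in groups.values())


def platform_proportion(groups):
    """Proportion observed among platform users only (self-selected sample)."""
    users = sum(share * use for share, use, _ in groups.values())
    with_characteristic = sum(share * use * y for share, use, y in groups.values())
    return with_characteristic / users


true_value = population_proportion(groups)
naive_value = platform_proportion(groups)
bias = naive_value - true_value

print(f"population: {true_value:.3f}")
print(f"platform users only: {naive_value:.3f}")
print(f"bias: {bias:+.3f}")
```

Because the younger group is both over-represented among users and more likely to have the characteristic, the platform-only figure overstates the population value. Collecting more data from the same platform would not shrink this gap, since it comes from who is observed rather than from how many.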
Social Networks: representativeness (3/3)
Incidence of Social Network users among US Internet adults within specific social/demographic groups – Sept 2013
• Superscript letters (e.g. a) indicate statistically significant differences between cells

Tweets: the relevance issue
There is growing evidence that a large portion of tweets lack any relevant information
• And even more so from the Official Statistics perspective
Such tweets are categorized as Pointless Babble
• e.g. “I am eating a sandwich now”
Pear Analytics’ “Twitter Study 2009” estimated that about 40% of tweets were Pointless Babble
Recent extensive studies conducted by Statistics Netherlands on millions of Dutch tweets revealed that more than 50% had no statistical interest
Challenge: how to filter out so much noise in order to uncover meaningful signals inside huge amounts of tweets?

How many Twitter users are actually humans?
According to her Twitter bio, Carina Santos (@scarina91) is a young, blonde-haired girl based in Rio de Janeiro, working as a journalist at Globo, a 24-hour news channel in Brazil
• She sends around 50 tweets per day on a variety of popular topics (ranging from sports news to gossip) to her nearly 1000 followers
• Twitalyzer, a social media analytics firm, judged her to be more influential than online personalities like Oprah Winfrey or the Dalai Lama
Carina Santos is not a human but rather a social bot created in 2011 by a team of Brazilian computer science researchers:
• a software robot able to interact and communicate with humans by following appropriate social behaviors and rules
“She” constantly checked Twitter API restrictions to avoid being identified as a bot and blocked by Twitter
• Maximum number of requests per hour is 350
• Each user can follow up to 2,000 users
• A user cannot follow more than 1,000 users per day

Tweets: the Social Bots issue
Security analysts and artificial intelligence researchers agree that the current number of Twitter bots is large and growing fast
• It has been observed that thousands of dormant Twitter profiles suddenly began posting hundreds of messages during the social turmoil and political crises of recent years around the world (e.g. Syrian civil war and Arab Spring protests in Egypt, Libya, Tunisia)
PeekAnalytics estimated in 2011 that only 35% of Twitter users’ followers were real people
Maybe this estimate is not accurate or even grossly exaggerated, but the problem is still there:
• How much of what can be mined from tweets can safely be used to analyze, interpret or predict people’s behavior?
Challenge: beyond removing the Pointless Babble noise, how to discriminate bot-generated tweets deliberately intended to mimic meaningful human signals? How to adjust analyses for bot traffic?
Inference in the Official Statistics Realm
Outline
Traditional paradigm
• Top-down: data are planned
• Traditional inference approaches
  - Design based survey sampling theory
  - Model-assisted approach
  - Model based inference
The need for a new paradigm to deal with Big Data
• Bottom-up: data are already there
• Exploratory analysis / Knowledge discovery approach
  - Algorithmic inference: data mining techniques, machine learning, …

Design Based Inference (1/2)
Probability sampling
• Samples are drawn by means of rigorously random algorithms
• Every unit in the population has a known, non-zero probability of being selected in the random sample
Data sampling is entirely controlled
Ideally, THE STATISTICIAN IS THE ONE AND ONLY RANDOMIZER
• “Ideally” means ignoring all non-sampling errors (e.g. list problems, total and item nonresponse, measurement errors, …)
This approach makes it possible to:
• Build unbiased (or nearly unbiased) estimators even if samples are not naively representative, because we can adjust for unequal inclusion probabilities
• Exploit probability theory to assess the quality of the obtained estimates

Design Based Inference (2/2)
$$U = \{1, 2, \ldots, N\}, \qquad s = \{1, 2, \ldots, n\}, \qquad s \in S$$
$$p : S \to [0, 1], \qquad p(s) = \Pr(s \text{ is selected})$$
$$\pi_k = \Pr(k \in s) = \sum_{s \ni k} p(s)$$
$$\pi_{kj} = \Pr(\{k, j\} \subset s) = \sum_{s \supset \{k, j\}} p(s)$$
$$\hat{Y}_{HT} = \sum_{k \in s} d_k y_k = \sum_{k \in s} y_k \, (1/\pi_k)$$
$$\hat{V}(\hat{Y}_{HT}) = \sum_{k \in s} \sum_{j \in s} \frac{\Delta_{kj}}{\pi_{kj}} \, d_k y_k \, d_j y_j = \sum_{k \in s} \sum_{j \in s} \left(1 - \frac{\pi_k \pi_j}{\pi_{kj}}\right) d_k y_k \, d_j y_j$$

Model Assisted Inference (1/2)
Reference framework is still Design Based
• Observed values (y, x) of interest variable Y and auxiliary variables X are deterministic, non-random quantities
• Relations (if any) between Y and X are generated by Nature (i.e. by real-world, domain-specific phenomena)
Information about the target population is available from sources external to the survey at hand
• Can use this information to describe relations between Y and X through a model
Model Assisted Inference is a suite of methods to improve the quality of Design Based inferences by exploiting available auxiliary information in a systematic and rigorous way
• build more efficient (but still nearly unbiased) estimators
• reduce nonresponse bias
The model is assisting only (i.e. descriptive): no stochastic structure is ever assumed!

Model Assisted Inference (2/2)
$$\xi : \; y_k \sim \mathbf{x}_k \cdot \boldsymbol{\beta} + \varepsilon_k$$
$$E_\xi(\varepsilon_k) = 0 \quad \forall k \in U$$
$$V_\xi(\varepsilon_k) = \sigma^2 < \infty \quad \forall k \in U$$
$$\mathrm{Cov}_\xi(\varepsilon_k, \varepsilon_j) = 0 \quad \forall k \neq j \in U$$
$$\hat{Y}_{GREG} = \Big(\sum_{k \in U} \mathbf{x}_k\Big) \cdot \hat{\boldsymbol{\beta}} + \sum_{k \in s} d_k \,(y_k - \mathbf{x}_k \cdot \hat{\boldsymbol{\beta}}) = \hat{Y}_{HT} + (\mathbf{X} - \hat{\mathbf{X}}_{HT}) \cdot \hat{\boldsymbol{\beta}} = \sum_{k \in s} w_k y_k$$
$$\mathbf{X} = \sum_{k \in U} \mathbf{x}_k, \qquad w_k(d) = g_k(d) \, d_k$$
$$g_k = 1 + (\mathbf{X} - \hat{\mathbf{X}}_{HT}) \cdot \hat{\mathbf{T}}^{-1} \cdot \mathbf{x}_k^t, \qquad \hat{\mathbf{T}} = \sum_{k \in s} d_k \, \mathbf{x}_k^t \cdot \mathbf{x}_k$$
$$\hat{e}_k = y_k - \mathbf{x}_k \cdot \hat{\boldsymbol{\beta}}, \qquad \hat{V}(\hat{Y}_{GREG}) \approx \hat{V}(\hat{Y}_{GREG,lin}) = \hat{V}\Big(\sum_{k \in s} d_k \,(g_k \hat{e}_k)\Big)$$

Model Based Inference (1/2)
Sampling design is non-informative and can be ignored
• Equivalent to assuming that the data have been selected by simple random sampling with replacement
Observed values (y, x) are regarded as realizations of random variables: Y (the response variable) and X (the explanatory variables)
Assumptions are made on the probability distribution of Y and (sometimes) of X
A statistical model linking the response variable Y to the explanatory variables X is proposed and tested against the observed data (y, x)
• Often models describe how the conditional expectation of Y given X (or some nice/smart function of it, think of GLM) depends on X
NATURE IS THE RANDOMIZER
• A Model is a (human interpretable) guess on “how” Nature is working

Model Based Inference (2/2)
$$\xi : \; \mathbf{Y} \sim N(\mathbf{X} \cdot \boldsymbol{\beta}, \; \sigma^2 \mathbf{1})$$
$$E_\xi(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X} \cdot \boldsymbol{\beta}$$
$$V_\xi(\mathbf{Y} \mid \mathbf{X}) = \sigma^2 \mathbf{1}, \qquad \sigma^2 < \infty$$
$$\hat{y}(\mathbf{x}) = \mathbf{x} \cdot \hat{\boldsymbol{\beta}}, \qquad \hat{\boldsymbol{\beta}} = (\mathbf{X}^t \mathbf{X})^{-1} \cdot (\mathbf{X}^t \mathbf{Y})$$
$$\hat{Y}_\xi = \sum_{k \in s} y_k + \sum_{k \in (U - s)} \hat{y}_k = \sum_{k \in s} y_k + \sum_{k \in (U - s)} \mathbf{x}_k \cdot \hat{\boldsymbol{\beta}}$$

Algorithmic Inference (1/2)
Models are replaced by Algorithms, whose ultimate goal is to map X values (input) to Y values (predicted output)
• Machine Learning: emphasis on prediction, i.e. learning data properties from a training set and being able to mimic “what” would happen to another data set
• Data Mining: emphasis on data exploration and discovery (of correlations, rules, clusters, patterns, hidden structures, …)
Fitting a Model is replaced by tuning