
Recent Research on Big Data and Machine Learning at the Bank of Italy

Presentation by Juri Marcucci
Bank of Italy and BIS Workshop on “Computing Platforms for Big Data and Machine Learning”, Bank of Italy, January 15th, 2019

DG Economics, Statistics, and Research, Bank of Italy (Email: [email protected])
Disclaimer: The views expressed are those of the authors and do not involve the responsibility of the Bank of Italy.

Outline

1. Big Data definition

2. Why should central banks use Big Data?
2.1 Bank of Italy internal WG/task force on Big Data
3. Recent research at the Bank of Italy: a few use cases to illustrate the power of the approach

3.1 Inflation Expectations
3.2 Economic Policy Uncertainty (EPU) and Private Consumption
3.3 Sentiment from Social Media and Banks
3.4 Housing Market

1. Definition of Big Data

Big data are the effects of digitalization

• Berman (2013): “Big Data can be characterized by the three V’s: volume (large amounts of data), variety (different types of data), and velocity (constant accumulation of new data).”
• A mine of useful information for economists, statisticians and social scientists in general!

4 V’s: volume, velocity, variety, and value

[Diagram of the four V’s: Volume, Variety, Velocity, Value]

Or even more V’s?

1. Volume
2. Velocity
3. Variety
4. Value
5. Variability
6. Validity
7. Venue
8. Veracity
9. Viability
10. Vincularity
11. Virility
12. Viscosity
13. Visibility
14. Visible
15. Visualization
16. Vitality
17. Vocabulary
18. Volatility
19. Vagueness
20. ...

Definition of Big Data

Pragmatic definition
• If the data are too big for a single machine, or
• if processing takes too long on a single machine,
• then you have a Big Data problem that requires Big Data tools and techniques and...
• ...a good IT department!

How sampling differs with “Big Data”

• Sampling starts with a preconceived idea of the outcome
• With sampling, few data points are extremely valuable (n = 1,000)
• Big data: you don’t know what the data holds
• Big data: many data points, each extremely cheap (n = all)
• Large data sets are messy, incomplete, inconsistent, and error-prone; they require lots of data munging and data wrangling
• Signal vs. noise

Big Data for economic research

Public
• Online news
• Twitter and social media
• Web corpus
• eBay
• Wikipedia
• Online reviews
• Prices and price comparisons
• Google Trends
• Real estate data
• ...

Restricted access / Private
• Scanner data
• Credit card transactions
• Facebook behavior
• Online browsing
• Cell phone data
• Internet search and advertising
• Health data
• Energy utilization
• Cities data (transportation, housing, environment, crime, education)
• ...

2. Why Should Central Banks Use Big Data?

Why should central banks use Big Data?

• Central banks and policy makers can/should use Big Data for macroeconomic analysis and nowcasting/forecasting macro or financial variables

• Can complement official statistics with data which are

• more granular and more complex, • more timely, • but unstructured (80-90% text)

• Can use internet-based information to gauge • public sentiment and expectation formation • impact of communication (e.g. Tobback, Nardelli and Martens, 2017)

• Can build new indicators ⇒ more accurate picture of economic reality for better policy making

Big Data at the Bank of Italy

• At the Bank of Italy we set up a multi-disciplinary internal working group which works across departments collaborating with the IT department. • Different skills: economists, statisticians, computer scientists, mathematicians,... • Two strands of projects: • Big data techniques and Machine Learning (ML) algorithms for economic analysis and economic research (e.g. build new indicators) using • granular data from web, social networks, news, payments, online ads, etc. • ML techniques, Neural networks, Deep learning, Natural Language Processing (NLP). • Using ML for statistical production (e.g. data pre-processing, outlier detection, etc.)

3. Recent Research at the Bank of Italy

Some recent papers on Big Data and ML

1. “Can We Measure Inflation Expectations Using Twitter?”, by Cristina Angelico, Marcello Miccoli, Juri Marcucci and Filippo Quarta
2. “News and Consumer Card Payments”, by Guerino Ardizzi, Simone Emiliozzi, Juri Marcucci and Libero Monteforte
3. “Listening to the Buzz: Social Media Sentiment and Retail Depositors’ Trust”, by Matteo Accornero and Mirko Moscatelli
4. “Twitter Sentiment and Banks’ Financial Ratios: Is There A Causal Link?”, by Giuseppe Bruno, Paola Cerchiello, Juri Marcucci, and Giancarlo Nicola
5. “Predicting the Italian Unemployment Rate with Google and Twitter”, by Francesco D’Amuri and Juri Marcucci
6. “The Sentiment Hidden in Italian Texts Through the Lens of a New Dictionary”, by Giuseppe Bruno, Juri Marcucci, Attilio Mattiocco, Marco Scarnò, and Donatella Sforzini
7. “Big Housing Data (Immobiliare.it): Strengths and Weaknesses”, by Michele Loberto, Andrea Luciani, and Marco Pangallo
8. “[…] Communications: Information Extraction and Semantic Analysis”, by Giuseppe Bruno
9. ...

3.1 Can We Measure Inflation Expectations Using Twitter?
by Cristina Angelico, Marcello Miccoli, Juri Marcucci, and Filippo Quarta

The paper in one wordcloud

Motivation

Inflation Expectations play a crucial role in macroeconomics:

• Key to understanding consumption and investment choices
• Informative on the effectiveness of central bank actions (at both short and long horizons)

Available sources of expectations:

• Survey-based: “true” expectations, but low frequency • Market-based: high frequency, but risk premia

Can we use social media to elicit inflation expectations? That could combine ‘true’ expectations with high frequency.

Data selection – relevant keywords

Keywords (EN/IT) to select tweets that talk about inflation/deflation

• price(s), cost of living • prezzo, prezzi, costo della vita • expensive bills, inflation, expensive, high prices, high-prices, high gas prices, higher bill, higher rents, high gasoline price, high oil prices, high gas bills • caro bollette, inflazione, caro, caro prezzi, caroprezzi, benzina alle stelle, bolletta salata, caro affitti, caro benzina, caro carburante, caro gas • deflation, disinflation, sale(s), less expensive, less expensive bills

• deflazione, disinflazione, ribassi, ribasso, meno caro, bollette più leggere
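To make this selection step concrete, here is a minimal sketch in Python of how such a keyword filter could work (the keyword lists are abbreviated, and `select_tweets` is a hypothetical helper, not the authors' actual pipeline):

```python
import re

# Abbreviated EN/IT keyword lists (see the full sets above).
KEYWORDS = [
    "inflation", "deflation", "disinflation", "cost of living", "high prices",
    "inflazione", "deflazione", "disinflazione", "costo della vita", "caro benzina",
]

# One case-insensitive pattern with word boundaries around each phrase.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    flags=re.IGNORECASE,
)

def select_tweets(tweets):
    """Keep only the tweets mentioning at least one inflation-related keyword."""
    return [t for t in tweets if PATTERN.search(t)]

sample = [
    "L'inflazione continua a salire",   # matches "inflazione"
    "Nice weather in Rome today",       # no match, filtered out
]
print(select_tweets(sample))
```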

Data

• 7.4 million tweets
• Sample: 1 June 2013 – 25 June 2018
• About 780,000 individual users
• Tweets contain one (or more) of the keywords
• Full text and metadata (i.e. users’ bios)

Intuition

• Tweets reflect info on current or future prices • They can be inputs to the expectations formation process

Twitter-based inflation expectations

Aim
• Select tweets related to price dynamics
• Build meaningful indexes of inflation expectations

Many tweets ⇒ “noise” (e.g. advertisements)

Two-step procedure
1. Topic analysis
• Exploit the full text of the tweets
• Isolate valuable signals and filter the tweets of interest

2. Dictionary-based approach
• Exploit the semantics of the keywords
• Build a set of indicators on the filtered data

Step one: topic analysis

Latent Dirichlet Allocation (LDA)

• Documents are random mixtures over latent topics • Each topic is described by a distribution over words

Topic analysis: details

Steps:
1. Text cleaning
• Stopwords, punctuation, stemming, IDF filtering
2. User-pooling
• To get a set of documents, given the shortness of tweets
3. Three runs of the LDA
• Independent runs to deal with the stability of the topics
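A minimal sketch of the user-pooling and LDA steps, with scikit-learn standing in for whatever implementation the authors actually used (the paper runs LDA with 150 topics and three independent runs; here two topics on toy data):

```python
from collections import defaultdict

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical (user_id, cleaned_tweet) pairs, already stopword-stripped and stemmed.
tweets = [
    ("u1", "inflazion aument prezz"),
    ("u1", "istat inflazion bass"),
    ("u2", "iphon apple nuov smartphone"),
    ("u2", "galaxy samsung offert"),
]

# User-pooling: concatenate each user's tweets into one document,
# which compensates for the shortness of individual tweets.
pooled = defaultdict(list)
for user, text in tweets:
    pooled[user].append(text)
corpus = [" ".join(texts) for texts in pooled.values()]

# Document-term matrix (an IDF cutoff could be applied via min_df/max_df).
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)

# LDA: documents are mixtures over latent topics, topics are word distributions.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# Print the top words of each topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:3]]
    print(f"Topic {k}: {top}")
```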

Topics discovered by LDA: valuable signals?

Italian stems with [English] glosses:

Smartphone e-commerce    Topic 1                  Topic 2
iphon                    inflazion [inflat]       salar [wage]
apple                    ribass [sale]            inflazion [inflat]
caratterist [features]   petrol [oil]             deflazion [deflat]
galaxy                   bors [stock exchange]    lavor [job]
samsung                  ital                     claud
ital                     deflazion [deflat]       aument [increas]
nuov [new]               istat                    bass [low]
galaxy                   cresc [growth]           ital
smartphone               aument [increas]         disoccupaz [unempl]
offert [offer]           merc [market]            econom [economy]

• 150 topics (but we tried many) • Focus on tweets assigned to these 2 topics (inflation/deflation) • Final dataset: 678,476 tweets

Second step: dictionary-based approach

How can we aggregate tweets to get insights on agents’ beliefs?

• Keywords reflect a message on the direction of price changes • Index computed as daily raw count of tweets

• Index Neutral: price(s) + cost of living

• Index Up: inflation + expensive bills + expensive + high prices + high gas prices + higher bill + higher rents + high gasoline prices + high gas bills

• Index Down: deflation + disinflation + sale(s) + less expensive + less expensive bills
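In pandas, the daily raw counts behind the three indexes could be computed along these lines (a sketch on toy data; the timestamp/bucket layout is an assumption, not the authors' code):

```python
import pandas as pd

# Hypothetical filtered tweets: a timestamp plus the dictionary bucket
# ("up", "down" or "neutral") of the matched keyword.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2016-01-05 09:00", "2016-01-05 18:30", "2016-01-06 12:00"]
    ),
    "bucket": ["up", "down", "up"],
})

# Daily raw count of tweets per bucket -> Index Up / Index Down / Index Neutral.
daily = pd.crosstab(df["timestamp"].dt.floor("D"), df["bucket"])
print(daily)
```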

Dictionary-based indexes

[Figure: Dictionary-based inflation indexes, i.e. daily tweet counts for Index NEUTRAL, Index UP and Index DOWN, Jan 2014 - Jan 2018. Spikes line up with events such as ISTAT releases (0.7% yoy, 1/14/2014; 0.0% yoy, 1/7/2015; 1.5% yoy, 2/28/2017; -0.1% yoy, 8/29/2014; -0.2% mom, 2/29/2016; 0.5% mom, 1/05/2017) and Draghi's Lecture (2/4/2016).]

Twitter-based directional indicators

$\pi^e_0 = \text{Index Up} - \text{Index Down}$

1. Infl. Exp. #1: filtering on event dummies, standardization, winsorizing, backward-looking MA (10, 30, 60 days)
2. Infl. Exp. #2: standardization, winsorizing, backward-looking MA (10, 30, 60 days)
3. Infl. Exp. #3: exponential smoothing

$\pi^e_{\ln} = \ln(\text{Index Up} + 1) - \ln(\text{Index Down} + 1)$

4. Infl. Exp. #4: backward-looking MA (10, 30, 60 days)
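A sketch of how indicators #2 and #4 could be built from the daily counts (the winsorization cutoffs and full-sample standardization are illustrative assumptions; the slide does not report the exact parameters):

```python
import numpy as np
import pandas as pd

def directional_indicator(up, down, window=30, log=False):
    """Twitter-based directional indicator from daily Up/Down counts."""
    raw = np.log(up + 1) - np.log(down + 1) if log else up - down
    # Winsorize (1st/99th percentiles here, an illustrative choice).
    lo, hi = raw.quantile([0.01, 0.99])
    raw = raw.clip(lower=lo, upper=hi)
    # Standardize, then apply a backward-looking moving average.
    z = (raw - raw.mean()) / raw.std()
    return z.rolling(window).mean()

# Tiny synthetic example.
idx = pd.date_range("2016-01-01", periods=90, freq="D")
rng = np.random.default_rng(0)
up = pd.Series(rng.poisson(100, 90), index=idx)
down = pd.Series(rng.poisson(80, 90), index=idx)
print(directional_indicator(up, down, window=30, log=True).tail())
```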

Twitter-based directional indicators

[Figure: daily indicators, 6/01/2013 - 1/1/2018: Infl. Exp. 1 MA(30), Infl. Exp. 2 MA(30), Infl. Exp. 3 (Exp-opt), Infl. Exp. 4 (ln) MA(30)]

Do we capture inflation expectations?

Compare Twitter-based indicators with:

• Survey-based • ISTAT monthly consumers’ expectations • Qualitative expectations over next 12m inflation • First best, but short time series

• Market-based • Daily inflation swap rates with 1 year maturity • Quantitative expectations • Second best, given the caveat of risk premia
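Before any such comparison, the daily Twitter series has to be aligned with the monthly survey; one plausible alignment (month-end sampling is an assumption, not necessarily the paper's choice) is:

```python
import pandas as pd

def monthly_correlation(twitter_daily, istat_monthly):
    """Correlate a daily Twitter indicator with a monthly survey series."""
    # Sample the daily Twitter index at each month end
    # ("ME" is the month-end alias in recent pandas)...
    twitter_monthly = twitter_daily.resample("ME").last()
    # ...and correlate it with the survey balance on the common sample.
    aligned = pd.concat(
        {"twitter": twitter_monthly, "istat": istat_monthly}, axis=1
    ).dropna()
    return aligned["twitter"].corr(aligned["istat"])

# Example with synthetic series (both trend upward, so corr is close to 1).
days = pd.date_range("2016-01-01", "2016-12-31", freq="D")
tw = pd.Series(range(len(days)), index=days, dtype=float)
months = pd.date_range("2016-01-31", "2016-12-31", freq="ME")
istat = pd.Series(range(12), index=months, dtype=float)
print(monthly_correlation(tw, istat))
```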

Twitter vs. survey-based inflation expectations

[Figure: scatter plots with 95% CIs of Infl. Exp. 1 MA(30), Infl. Exp. 2 MA(30), Infl. Exp. 4 (Exp) and Infl. Exp. 5 (ln) MA(30) against ISTAT inflation expectations]

Correlations: Twitter and ISTAT Infl. Exp.

Infl. Exp. 1 MA(30): 0.512***
Infl. Exp. 2 MA(30): 0.513***
Infl. Exp. 3 (Exp -0.1): 0.418***
Infl. Exp. 4 (ln) MA(30): 0.501***

Twitter vs. market-based inflation expectations

[Figure: scatter plots with 95% CIs of Infl. Exp. 1 MA(30), Infl. Exp. 2 MA(30), Infl. Exp. 4 (Exp) and Infl. Exp. 5 (ln) MA(30) against the 1-year Italian inflation swap rate]

Correlations: Twitter Infl. Exp. and IT Infl. Swap 1Y

Infl. Exp. 1 MA(30): 0.546***
Infl. Exp. 2 MA(30): 0.539***
Infl. Exp. 3 (Exp-0.1): 0.535***
Infl. Exp. 4 (ln) MA(30): 0.625***

Informativeness exercises

(Dependent variable: ISTAT consumers’ inflation expectations, $E_t\pi_{t,t+12}$)

                                   (1)        (2)        (3)        (4)        (5)
ISTAT lagged $E_t\pi_{t,t+12}$   0.596***   0.309***   0.437***   0.294***   0.265***
                                (0.0425)   (0.0664)   (0.0556)   (0.0540)   (0.0543)
Inflation swap 1y rate, MA(10)              5.765***              4.446***   5.046***
                                            (1.043)               (1.093)    (0.893)
Infl. Exp. 4 (ln), MA(10)                              4.493***   2.301***   2.033***
                                                       (0.954)    (0.688)    (0.704)
Flash infl. estimate                                                         -355.7**
                                                                             (159.2)
Constant                        -3.338***  -9.571***  -4.590***  -8.787***  -9.223***
                                 (0.819)    (1.298)    (0.453)    (1.148)    (1.083)
Observations                        61         61         61         61         61
Adj. R2                           0.354      0.532      0.476      0.550      0.562
Root MSE                          4.566      3.885      4.109      3.812      3.76

• The inflation swap $IS^{1y}_t$ and the Twitter-based Infl. Exp.$_t$ have explanatory power and the correct sign
• The Twitter index provides small, but additional, information
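A sketch of a specification like column (4) in statsmodels, on synthetic data with hypothetical variable names (the slide does not report the error structure; HAC standard errors are an illustrative choice):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic monthly data with the same sample size as the table (n = 61).
rng = np.random.default_rng(0)
n = 61
df = pd.DataFrame({
    "istat_exp_lag": rng.normal(size=n),    # lagged ISTAT expectations
    "swap_1y_ma10": rng.normal(size=n),     # 1y inflation swap, 10-day MA
    "twitter_ln_ma10": rng.normal(size=n),  # Infl. Exp. 4 (ln), 10-day MA
})
df["istat_exp"] = (0.3 * df["istat_exp_lag"] + 4.4 * df["swap_1y_ma10"]
                   + 2.3 * df["twitter_ln_ma10"] + rng.normal(size=n))

# Column (4)-style specification: lagged expectations + swap + Twitter index.
model = smf.ols("istat_exp ~ istat_exp_lag + swap_1y_ma10 + twitter_ln_ma10",
                data=df)
result = model.fit(cov_type="HAC", cov_kwds={"maxlags": 3})
print(result.summary())
```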

3.2 News and Consumer Card Payments
by Guerino Ardizzi, Simone Emiliozzi, Juri Marcucci, and Libero Monteforte

News and consumer card payments (Ardizzi, Emiliozzi, Marcucci, and Monteforte)

• Aim: Investigate how private consumption (and the preference for cash) in Italy reacts to news about Economic Policy Uncertainty (EPU) at daily frequency.
• Data: Twitter, Bloomberg, Factiva (news), payments (POS and ATM)
• Findings: Daily shocks to EPU temporarily reduce purchases (especially in crisis periods) and increase ATM cash withdrawals.

33 News and Consumer Card Payments

by G. Ardizzi, S. Emiliozzi, J. Marcucci* and L. Monteforte (Bank of Italy)

SITE Workshop Macroeconomics of Uncertainty and Volatility

Stanford, August 23, 2018

* The views expressed are those of the authors only and do not involve the responsibility of the Bank of Italy.

The paper in one chart

• Data: Bloomberg, Twitter, Factiva (news) ⇒ EPU & E(P)U indices
• Debit card payments: POS and ATM, daily frequency, strong seasonal patterns
• Method: local projections (a sketch follows below)
• Finding: temporary decrease in consumption; a precautionary channel?
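The local-projection step could be sketched as follows: Jordà-style regressions of the h-step-ahead payment variable on today's EPU shock, one OLS per horizon (controls, seasonal adjustment and the exact shock construction are omitted; everything here is illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def local_projection_irf(y, shock, horizons=10):
    """Impulse response of y to an EPU shock, one OLS per horizon h."""
    irf = []
    for h in range(horizons + 1):
        # Regress y_{t+h} on shock_t (real applications add controls).
        data = pd.concat({"lead": y.shift(-h), "shock": shock}, axis=1).dropna()
        X = sm.add_constant(data["shock"])
        fit = sm.OLS(data["lead"], X).fit(cov_type="HAC",
                                          cov_kwds={"maxlags": h + 1})
        irf.append(fit.params["shock"])
    return pd.Series(irf, name="irf")

# Synthetic daily data: y responds negatively to yesterday's shock.
idx = pd.date_range("2007-04-01", periods=500, freq="D")
rng = np.random.default_rng(1)
shock = pd.Series(rng.normal(size=500), index=idx)
y = -0.2 * shock.shift(1).fillna(0.0) + rng.normal(size=500)
print(local_projection_irf(y, shock, horizons=5))
```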

Daily E(P)U in Italian (Twitter) (HPT API)

[Figure: daily Twitter-based E(P)U index in Italian, built via the HPT API]

POS ← EPU: whole sample (2007m4-16m9)

• EPU generates a non-negligible reduction in purchases (at 95% confidence).
• The effects tend to vanish after 1-2 months, except for the Twitter-based E(P)U.
• Consistent with asymmetric effects on consumption of positive/negative transitory and unexpected income shocks.

3.3 Listening to the Buzz: Social Media Sentiment and Retail Depositors’ Trust
by Matteo Accornero and Mirko Moscatelli

Accornero and Moscatelli - Listening to the buzz: social media sentiment and retail depositors’ trust

• Aim: Investigate how social media sentiment and volume can anticipate bank runs or deposit withdrawals from the most-buzzed-about Italian banks; create early-warning indicators and evaluate retail depositors’ trust.
• Data: Twitter and data on Italian banks’ deposits and characteristics
• Findings: Significant correlation between social media sentiment and the monthly variation of retail deposits. Twitter can provide information on contagion and early-warning indicators for banks’ liquidity distress.

39 LISTENING TO THE BUZZ: SOCIAL MEDIA SENTIMENT AND RETAIL DEPOSITORS’ TRUST

Matteo Accornero, Mirko Moscatelli

Descriptive statistics

THE DYNAMICS OF SENTIMENT AND FUNDING FOR ‘NON-DISTRESSED’ BANKS

[Chart: retail deposits and retail uninsured deposits (left scale) vs. sentiment (right scale)]

Descriptive statistics

THE DYNAMICS OF SENTIMENT AND FUNDING FOR ‘DISTRESSED’ BANKS

[Chart: retail deposits and retail uninsured deposits (left scale) vs. sentiment (right scale), Apr-15 to Apr-16]

3.4 Big Housing Data (Immobiliare.it): Strengths and Weaknesses
by Michele Loberto, Andrea Luciani, and Marco Pangallo

• Aim: Use a new data set of online real estate ads to analyze market microstructure and build new indicators for the real estate market.
• Data: the website www.immobiliare.it (similar to Zoopla or Zillow)
• Findings: Analysis of market segmentation along the location and characteristics of dwellings; new, more timely and higher-frequency indicators of the real estate market.

Housing units dataset

• Final dataset
• From 1,037,095 ads to 654,000 dwellings (37% duplicates)
• Duplicates increase with city size and there is variability across cities
• Additional controls (ads more than 2 weeks old, etc.) ⇒ 465,000 dwellings
• Validation against indicators from official statistical sources
  • Number of housing units going out of the market (a proxy for transactions) lower than official sales
  • Average price level coherent with average prices from OMI and with the average discount from the BI survey (but at lower frequency!)
  • Time on market largely coherent with estimates from the quarterly BI survey

From the paper, the deduplication procedure works as follows. A classification tree outputs a probability that two ads are duplicates; if this probability is larger than 0.5, the two ads are considered as referring to the same housing unit. In the last step, starting from the list of duplicated pairs, clusters of ads that refer to unique housing units are created. Suppose for instance that ads A, B and C refer to the same dwelling: it is possible that the pairs (A,B) and (B,C) are classified as duplicates, but the pair (A,C) is not. In this case methods from graph theory are used, and a cluster of ads is considered as referring to the same housing unit if an internal similarity condition is satisfied (a sketch of this clustering step follows below). Finally, coherent information coming from the different ads is aggregated by taking the average of the values or the most frequent characteristics. According to this procedure, the total number of dwellings is about 654,000 units against 1,037,095 posted ads, i.e. the number of effective dwellings is only 63% of the posted ads. As Table 5 shows, the large majority of dwellings have only one associated ad, while duplicates are concentrated over a smaller number of houses.

Number of associated ads:   1        2        3       4       5      6      7 or more
Number of dwellings:        465,041  113,365  37,566  15,981  7,723  4,264  9,559

Table 5: Distribution of dwellings by number of associated ads

The main trouble with duplicates does not arise on a single day, where they account for about 20% of total ads; the real issue is that they accumulate across several weeks. The share of duplicates over total ads increases with city size, with significant variability across cities (for example, the ratio between the number of ads and housing units is 1.75 for Naples and 2.15 for Milan). After deduplication, additional controls address potential errors in the data: only dwellings that have been on the market for at least two weeks are kept, and dwellings whose price is not sufficiently consistent with their characteristics are dropped; this also identifies foreclosure auctions that were not reported in the textual description. The approach consists of running a hedonic regression, estimating for each dwelling the ratio between asking and predicted price, and eliminating housing units with a ratio lower than 0.5 or higher than 1.5. (The range is kept relatively wide because the hedonic regression is limited to the housing-unit characteristics least affected by missing data; missing characteristics are imputed with the approach of Honaker, King and Blackwell, https://gking.harvard.edu/amelia.) The cleaned sample used in most applications consists of dwellings on the market after January 1, 2016, and amounts to about 465,000 housing units. The paper also compares the original ads dataset with the derived housing-units dataset, to find out under what circumstances omitting the deduplication procedure would bias the results.
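A sketch of the graph-theoretic clustering step described above: ads are linked whenever their pairwise duplicate probability exceeds 0.5, and connected components become candidate housing units (networkx stands in for whatever tooling the authors used; the internal similarity condition is omitted):

```python
import networkx as nx

# Hypothetical output of the classification tree: duplicate probability per pair.
pair_probs = {
    ("A", "B"): 0.92,   # A and B very likely the same dwelling
    ("B", "C"): 0.71,   # B and C likely the same dwelling
    ("A", "C"): 0.35,   # below threshold, but linked through B
    ("D", "E"): 0.10,   # distinct dwellings
}

# Link ads whose duplicate probability exceeds 0.5.
G = nx.Graph()
G.add_nodes_from({ad for pair in pair_probs for ad in pair})
G.add_edges_from(pair for pair, p in pair_probs.items() if p > 0.5)

# Each connected component is a candidate housing unit; the paper additionally
# requires an internal similarity condition before accepting a cluster.
units = list(nx.connected_components(G))
print(units)  # e.g. [{'A', 'B', 'C'}, {'D'}, {'E'}]
```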

Validation

• Information from the housing dataset is coherent with well-established statistical sources

Applications

1. Analysis of the heterogeneity of the housing market
2. Analysis of the segmentation of the housing market
3. Nowcasting of aggregate/local prices
4. Computation of a quality-adjusted price index to control for the evolution of supply composition
5. Study of the evolution of housing demand, to improve the forecasting of prices and sales
6. Hedonic regressions (a sketch follows below)
7. Etc.
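For item 6, and for the price-consistency filter described two slides back, a minimal hedonic regression sketch on synthetic data (column names and the log-linear form are illustrative; the 0.5-1.5 ratio rule is from the paper):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dwellings data: asking price and a few hedonic characteristics.
rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "floor_area": rng.uniform(30, 150, n),
    "rooms": rng.integers(1, 6, n),
})
df["price"] = np.exp(10 + 0.01 * df["floor_area"]
                     + 0.05 * df["rooms"] + rng.normal(0, 0.2, n))

# Hedonic regression of log price on characteristics.
fit = smf.ols("np.log(price) ~ floor_area + rooms", data=df).fit()
df["predicted"] = np.exp(fit.fittedvalues)

# Drop dwellings whose asking/predicted price ratio is outside [0.5, 1.5],
# as in the paper's cleaning step (also flags unreported foreclosure auctions).
ratio = df["price"] / df["predicted"]
clean = df[(ratio >= 0.5) & (ratio <= 1.5)]
print(f"kept {len(clean)} of {len(df)} dwellings")
```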

Heterogeneity of the Italian housing market

From the paper: […] online interest for a particular area is a leading indicator of prices. Finally, we test a prediction of search theory, finding no significant support. The common denominator of these exercises is that they would not be possible with any other currently available public data source on the Italian real-estate market.

5.1 Heterogeneity. Heterogeneity is a key property of the housing market. For example, certain segments of the market may be disproportionately affected by evolving credit conditions (Landvoigt et al., 2015). Heterogeneity occurs between and within cities, but also between and within neighborhoods.

[Figure 5: Heterogeneity in the distributions of price per m2 and floor area; (a) price per m2, (b) floor area]

Figures 5(a) and 5(b) show the distribution of asking prices per m2 and floor area respectively. Both distributions are skewed, with heavy right tails, indicating the existence of housing units with extremely high values. We represent the price per m2 in spatial form in Figure 6, where we focus on the cities of Rome and Milan. In order to smooth the spatial distribution and mitigate the problem of outliers, we plot a kernel approximation of the prices. An important difference between the two cities is that in Milan prices decline radially from the center, whereas in Rome we observe hotspots of high prices in peripheral neighborhoods (Appia Antica and EUR). Moreover, in Rome prices do not decline radially from the center, because prices north of the center are higher than prices south of it. This difference can be traced back to historical, infrastructural and geographical reasons. Price levels are similar in Rome and Milan, and are among the highest across Italian cities. In Appendix E we show similar maps for eight other major cities: Turin, Naples, Genoa, Palermo, Venice, Florence, Bari and Bologna. The patterns are similar, with high heterogeneity within and between cities. The cheapest city is Palermo, with prices ranging from 611 to 3,242 euros per m2, while the most expensive is Milan, with a price range of 1,600-9,200 euros per m2. In Figure 7 we plot other variables: instead of a kernel approximation, we aggregate these quantities over OMI micro-zones and color the OMI polygons according to the quartiles of the distribution. Figure 7(a) represents the median number of clicks on housing units, a proxy of demand. Compared to Figure 6(a), demand is highly correlated with price per m2, probably because both correlate with an intrinsic attractiveness of the neighborhoods. There are some exceptions though. Consider the OMI micro-zone in the […]

Heterogeneity of the Italian housing market - price per m2, Rome

[Figure 6(a): Kernel approximation of the (asking) price per m2 during 2017Q1, Rome]

Heterogeneity of the Italian housing market - price per m2, Milan

[Figure 6(b): Kernel approximation of the (asking) price per m2 during 2017Q1, Milan]

Conclusions and take-home lessons

• Big Data and Machine Learning techniques are becoming ubiquitous in business, academia and institutions.
• Strengths...
  • Timely observations rather than lagged and costly surveys
  • Many advantages over administrative data
  • Unlimited potential to answer business and policy questions ⇒ value for our business
• ...and weaknesses
  • Problems of representativeness; less structured data
  • Privacy issues
  • Noise vs. signal
  • ...
• More research is needed...
• ...but everything we have done so far would have been impossible without not only the support but also the close collaboration and co-operation of our IT Department!

Take-home lessons

• Collaboration and co-operation across all departments, and in particular with IT, is crucial for genuine Big Data projects

• Put together different skills to use new, voluminous data and get smarter or better answers to well-known questions

• Data sharing and tools sharing

• We need to change not only the IT infrastructure but also our ways of thinking so that we can head towards a more integrated and unified framework to obtain the greatest value from Big Data.

Thank you!

[email protected]