Big Data Investments: Effects of Internet Search Queries on German Stocks

Alternative Investments 7

By Jan Becker

1st edition 2015. Paperback. 92 pages. ISBN 978 3 95934 597 2. Format (W x L): 14.8 x 21 cm



Excerpt

Chapter 1.7, Data Scope of Analysis

The data scope of this study covers the German stock indices DAX® (Deutscher Aktien IndeX), MDAX® (Mid-Cap-DAX) and SDAX® (Small-Cap-DAX). These three are the German prime standard market indices for large, medium-sized and small exchange-listed companies. The blue chip index DAX covers 80% of Germany's free float market and consists of the 30 largest companies in terms of market capitalization and exchange turnover. MDAX and SDAX each have 50 constituents and follow directly below the DAX constituents. Adding the TECDAX®, which consists of the 30 largest technology shares, would complete the full prime standard. The composition of all indices is reviewed and rebalanced on a quarterly basis, except for new listings, deletions or mergers, which are taken into account immediately (Deutsche Börse 2013, p. 19). The TECDAX was initially not included in the sample for two reasons. First, the sample size had to be kept manageable. Second, given the online business model of some companies (e.g. Xing or Freenet), the correlation between search queries and company success was assumed to be high ex ante. For the theory to generalize, the hypothesis should hold for standard companies too.

1.7.1, Timeframe

The overall timeframe of nine years and four months, from 10 January 2004 to 4 May 2013, runs from the first publicly available observation downloadable from Google to the time of this study. All regressions are based on this timeframe. It should be noted that two major macroeconomic crises fall into this period: the global financial crisis of 2008/09 and the European sovereign debt crisis, which has been going on since 2010. Both crises affected the global economy and led to a slowdown of production. The financial crisis of 2008 is sometimes also referred to as the "Great Recession".

1.7.2, Necessary Adjustments in Sample Selection

Not all stocks could be included in the analysis over this period, which introduces a small selection bias. The final structure of constituents as of 6 May 2013 is modified with respect to the initial setup of January 2004 in the following way. First, all stocks included in an index in 2013 must also have been in one of the three indices at the starting point of the analysis in 2004, to ensure that all control variables are available and that the stocks were already exchange listed and tradable. This implies a survivorship bias, because companies which defaulted or merged in the meantime are excluded. Second, companies that were taken private are excluded, because trading prices are no longer quoted. Third, companies newly listed after 2004 are not included in the sample. Several event studies concerning IPOs cover this topic (cf. Da, Engelberg and Gao, 2011). The main argument for adjusting the sample is to focus on a continuous and comparable data basis.

1.8, Search Engines – Gateways to Information

The internet search query data is downloaded from Google. According to webhits.de, the American company Google Inc. had a German market share of 80.4% (Webhits, 2013), and netmarketshare.com reported a somewhat higher share of 83.18% in its global ranking (Netmarketshare, 2013). This leads to the assumption that Google data is, to a certain extent, suitable for testing the hypotheses and has the necessary scope to draw statistically significant conclusions about overall search activity.

1.8.1, The Google Tool

Historically, there were two services, "Google Trends" and "Google Insights for Search", which were merged into "Google Trends" in September 2012 (Google, 2012b). Since then the combined interface, under the name Google Trends, has been the only remaining platform.

The service is provided by Google Inc. ("Google"), located at 1600 Amphitheatre Parkway, Mountain View, CA 94043, United States, and can be accessed via http://www.google.com/trends/. Access to the data is free of charge, and all time series can be downloaded after registering and logging in with a free Google Account at the website.

The tool allows the user to look up an index for a specific search query from 2004 until today and is available for worldwide data. The user interface offers four options to narrow the query: search type, region, time frame and category. For this study "Web Search" is the relevant search type; it could also be changed to "Product" or "News Search". "YouTube" and "Image Search" are not relevant. Region-specific queries can be restricted to a country, a state or, in some cases, a city. For example, Germany can be broken down into the state of "Hessen", but the city of Frankfurt is not yet available as a time series, although Google already displays the current search activity by city.

Depending on the query's frequency, some time series are still available only on a monthly basis, whereas the most common downloadable format is weekly. Daily time series can be downloaded for the last 90 days; this can be extended further into the past by manually downloading one-month windows via the "Select dates" functionality and then chaining the parts of the time series together manually.
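
The manual chaining of one-month windows could be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the parts are joined, so the sketch assumes that consecutive windows share at least one day and rescales each later window by the ratio of the values on the shared days. The function name chain_windows and the sample values are hypothetical.

```python
from datetime import date


def chain_windows(windows):
    """Chain separately downloaded windows into one daily series.

    `windows` is a time-ordered list of dicts mapping date -> index value.
    Assumption (not from the source): consecutive windows overlap on at
    least one day with a non-zero value in both, so that the later window
    can be rescaled to the level of the earlier one.
    """
    chained = dict(windows[0])
    for window in windows[1:]:
        overlap = [d for d in window
                   if d in chained and window[d] > 0 and chained[d] > 0]
        if not overlap:
            raise ValueError("windows must overlap on a non-zero observation")
        # The ratio of levels on the shared days puts the new window
        # on the scale of the series chained so far.
        factor = sum(chained[d] for d in overlap) / sum(window[d] for d in overlap)
        for day, value in window.items():
            if day not in chained:
                chained[day] = value * factor
    return dict(sorted(chained.items()))


# Hypothetical values: two windows sharing 31 January 2013.
january = {date(2013, 1, 30): 40, date(2013, 1, 31): 50}
february = {date(2013, 1, 31): 100, date(2013, 2, 1): 80}
series = chain_windows([january, february])
```

Whether such a rescaling is appropriate depends on how each window was scaled when it was downloaded, so the result should be checked against the weekly series for the same period.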

Under the category filter there are 26 options with 241 subcategories available. Using the example of Choi and Varian (2009a, p. 4): "…query [car tire] would be assigned to category Vehicle Tires which is a subcategory of Auto Parts which is a subcategory of Automotive".

Of major interest for stocks are the categories "Business & Industrial" and "Finance". These categories do not always deliver time series for all queries. In the later analysis the most general form across all categories is therefore applied instead of the "Finance" filter, to maximize the likelihood of actually obtaining a time series to analyze. Other research has focused on these specific categories (e.g. Fink and Johann, 2013).

The query can be compared to its category. In this case the time series is scaled to a percentage of its initial value and thus represents a growth rate (Google, 2012c). The category can add important information with respect to seasonality.
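
One possible reading of this scaling is the percentage change of each observation relative to the first one. The sketch below follows that reading; the exact formula used by Google is not given in the text, and the function name growth_vs_start and the sample values are hypothetical.

```python
def growth_vs_start(series):
    """Return each value as the percentage change relative to the first value.

    Assumption: "percentage of the initial starting value" is read here as
    (value - start) / start * 100, so the first observation maps to 0.
    """
    start = series[0]
    if start == 0:
        raise ValueError("the starting value must be non-zero")
    return [(value - start) * 100 / start for value in series]


# Hypothetical index values for a query and for its category.
query_growth = growth_vs_start([50, 55, 45])
category_growth = growth_vs_start([80, 84, 76])
```

Placing the two growth series side by side shows whether a movement in the query is specific to it or shared with the whole category, for instance because of seasonality.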

1.8.2, Grouping

It is possible to group up to 25 search terms via a "+" sign. The items are then displayed as separate graphs. To specify a query for a combination of terms, quotation marks ("x+y") have to be set at the beginning and end of each request.

1.8.3, Multiple Counting is automatically avoided by Google

To avoid multiple counting, the requests are filtered by their IP address. The IP address (Internet Protocol address) is a numerical label assigned to each computer that uses the Internet Protocol for communication. It can be used for host or network interface identification and for location addressing. Because each user is identified via an IP address, only the sum of their daily queries becomes part of the search volume index. If a user searches from more than one computer (IP address), however, the queries are counted multiple times. From publicly available data it cannot be determined for how many cross-sectional queries the same user is responsible. It is even possible that a single user is responsible for generating all the signals over time.

1.8.4, Synthetic Index rather than actual Numbers

Google does not publish the overall sum of search queries, but calculates an index. This index is bounded between 0 and 100 and is recalculated in specific situations: whenever search queries reach a new maximum, this quantity is set to 100, and from then on quantities are scaled by it (divided by the maximum and multiplied by 100) until a new high is reached. The old values are not recalculated and remain scaled by their old maxima. The index could be interpreted as a percentage index (Google, 2012a).
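
The scaling described above can be illustrated with a small sketch. It follows one reading of the description, namely that each observation is divided by the highest quantity seen up to that point and multiplied by 100. It is not Google's documented algorithm; the function name running_max_index and the query counts are hypothetical.

```python
def running_max_index(quantities):
    """Scale raw query counts to a 0-100 index using the running maximum.

    Each value is divided by the largest count observed so far and
    multiplied by 100. Earlier index values are never recalculated, so
    they stay scaled by the maximum that applied at their time.
    """
    index = []
    running_max = 0
    for quantity in quantities:
        running_max = max(running_max, quantity)
        index.append(100 * quantity / running_max if running_max > 0 else 0.0)
    return index


# Two hypothetical stocks with very different raw query counts.
small_stock = running_max_index([10, 20, 10, 40, 20])
large_stock = running_max_index([1000, 2000, 1000, 4000, 2000])
# small_stock == large_stock == [100.0, 100.0, 50.0, 100.0, 50.0]
```

Both hypothetical stocks produce the same index although their raw counts differ by a factor of 100, which shows why the index alone does not reveal absolute search intensity.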

For this reason it is difficult to compare different stocks by their search intensity. Because the actual quantities are not available, an increase in queries for one stock cannot be attributed to a decrease in queries for another. Such an attribution could be interesting with respect to actual sales and shipped units of competitors. Moreover, the rescaling does not allow conclusions to be drawn about the original query quantity for a company, because the basis is constantly shifting.

1.8.5, Empty Values

Another drawback is Google's practice of publishing an index value of "0" instead of a very small number whenever the search queries fall below a certain threshold level. To date, Google has not transparently explained how this threshold level is determined (Google, 2010).

1.8.6, Limited by German Language

As the later data will show, most of the relevant data originates from the German language area and from queries within Germany. This is also true for most queries which refer to internationally exporting companies such as car manufacturers (e.g. BMW) and is in contrast to previous studies on individual stocks from the US market. At the level of the DAX there are comparably more international queries than for the smaller company indices MDAX and SDAX. This may hint at home bias and at the local degree of familiarity with smaller stocks.

When analyzing queries that combine the stock's name with a second word, the language barriers become more obvious. When searching for terms like "Aktie" (English: stock) or "Dividende" (English: dividend), even small changes in wording can shift the origin of the data from German-speaking to English-speaking countries. A study by Mondria and Wu (2011) showed that home bias delivers higher returns through the advantage of higher information density. Therefore this study uses the German terms in the regression models. A comparable study by Bank, Larch and Peter (2010) on the German stock market, covering all Xetra-listed stocks, used the names of the companies without the suffix "AG". Their queries are restricted to German queries only. Fink and Johann (2013) apply the category filter "Finance" in addition to the name of the German companies when downloading the data. This procedure takes advantage of a particular Google feature, which classifies a query according to the website finally accessed after the query is submitted. This departure from other studies adds information to the query, and the authors show that it improves the query quality. As it is not transparent how Google classifies "Finance" queries, this study nevertheless uses the standard query method and sets the focus via the additional terms "AG" and "Aktie".

1.8.7, Exact Wording of Search Terms and Search Term Combinations

When searching for data on a search engine, one question which arises is: what do people type into the search engine? Most users start by typing in just one search term (cf. Spink et al., 2001). This seems to be common practice and is also supported by the data set later. To set up a list of words, the most common reference name for each company is used (e.g. BMW for Bayerische Motoren Werke Aktiengesellschaft). This approach had some minor flaws, because the common German short names of some stocks coincide with ordinary words in the English language. For example, "MAN" is a German producer in the automobile industry and "Metro" a big retailer of consumer products. In these cases a more stock-related perspective was introduced by searching for the combination of the stock's name with "AG" (Aktiengesellschaft), the German abbreviation corresponding to PLC (public limited company). Altogether, four main search combinations evolved: the common search name, declared as "Name", the name plus "AG", the name plus "Aktie" and the name plus "News". It should be noted that the available data frequency decreases dramatically when terms are combined. In the initial setup many more terms were included, but not enough data sets could be extracted. These terms were: name plus "Report", name plus "Return", name plus "Rendite", name plus "HV" (English: shareholders' meeting), name plus "IR", name plus "Investor Relations", name plus "Bilanz" (English: balance sheet), name plus "P&L" and name plus "GuV" (English: P&L).
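
The four main search combinations can be generated mechanically from a list of common company names, as in the following sketch. The function name build_queries and the joining of the terms with a single space are assumptions made for illustration; the text does not state the exact query strings that were entered.

```python
# The four combinations named in the text, keyed by their label.
# Assumption: name and suffix are joined by a single space.
PATTERNS = {
    "Name": "{name}",
    "AG": "{name} AG",
    "Aktie": "{name} Aktie",
    "News": "{name} News",
}


def build_queries(name):
    """Return the four search strings for one company's common name."""
    return {label: pattern.format(name=name)
            for label, pattern in PATTERNS.items()}


# Example with names mentioned in the text.
queries = {name: build_queries(name) for name in ["BMW", "MAN", "Metro"]}
```

For ambiguous names such as "MAN" or "Metro", the entries under "AG" and "Aktie" give the more stock-related variants described above.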

The fact that only top-level search terms are available may support the initial assumption that single-word queries are preferred over full sentences, or it may be due to Google's policy of not publishing time series which fall below a certain threshold level.