Individual Investors, Social Media and Chinese Stock Market: a Correlation Study

By

Yonghui Wu

B.E., Jiao Tong University, 2007
M.E., Shanghai Jiao Tong University, 2010

SUBMITTED TO THE MIT SLOAN SCHOOL OF MANAGEMENT IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE IN MANAGEMENT STUDIES AT THE MASSACHUSETTS INSTITUTE OF TECHNOLOGY

JUNE 2016

©2016 Yonghui Wu. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: Signature redacted
MIT Sloan School of Management
May 6, 2016

Certified by: Signature redacted
Erik Brynjolfsson
Schussel Family Professor
Thesis Supervisor

Accepted by: Signature redacted
Rodrigo S. Verdi
Associate Professor of Accounting
Program Director, M.S. in Management Studies Program
MIT Sloan School of Management

Individual Investors, Social Media and Chinese Stock Market: a Correlation Study

By

Yonghui Wu

Submitted to the MIT Sloan School of Management on May 6, 2016 in partial fulfillment of the requirements for the Degree of Master of Science in Management Studies.

ABSTRACT

The Chinese stock market is a unique financial market characterized by heavy involvement of individual investors. This thesis explores how the sentiment expressed on social media is correlated with the stock market in China. Textual analysis of posts from one of the most popular social media sites in China is conducted based on HowNet and NTUSD, the two most commonly used Chinese sentiment dictionaries.

The correlation matrices and regressions between sentiment ratios and returns over 9 holding periods for all 30 sample securities reveal that a correlation exists between investor sentiment on social media and the future returns of the Chinese stock market. In addition, I find that the negative sentiment ratio is superior to the positive sentiment ratio, and that the correlation of the sentiment ratio with returns persists across future holding periods. By comparing different stocks and indices, I also find that a well-established market index correlates better with social media sentiment than individual stocks do, and that well-known 'star' stocks correlate better with social media than other stocks. However, when I test a VAR model on the Shanghai Composite Index, I find that the model is stable but shows no Granger causality. Better data and improved analysis are needed to predict the stock market with social media.

Thesis Supervisor: Erik Brynjolfsson
Title: Schussel Family Professor

Acknowledgements

I feel grateful and privileged to have worked with my thesis advisor, Professor Erik Brynjolfsson. I would like to thank him for guiding me through the thesis process, and for his prompt feedback and suggestions regarding the direction and resources of this study. I would also like to thank Professor Marshall Van Alstyne and my fellow students for their valuable comments and encouragement on this research in the class Economics of Digitalization.

This study was very new and challenging for me because I had little prior experience in programming. This thesis would not have been possible without the help of my friend Lerith Tian. Lerith has helped me tremendously with Python programming and textual analysis. I am very grateful for his help and learned a lot from his patient guidance.

I also benefited a lot from my other friends. Shuyi Yu provided me with many valuable suggestions on statistical analysis. Shan Huang helped me narrow down the research scope at the very beginning. Alora Chen, Jin Jing Liu and Liam O'Dea greatly supported me during my preparation for this thesis. I am indebted to these dear friends of mine.

Last but not least, I would like to thank my parents Shunfeng Wu and Ganying Deng as well as my sister Yonghong Wu. Thank you for always believing in me and standing behind all my endeavors.

Contents

1 Introduction
  1.1 Literature review on investor sentiment and the stock market
  1.2 Social media and stock markets in China
  1.3 Literature review on Chinese NLP
  1.4 Summary

2 Data
  2.1 Guba
    2.1.1 Guba as a social media in China
    2.1.2 Posts and samples
  2.2 Financial data

3 Method
  3.1 Dictionaries and word list
  3.2 Segmenting and parsing the posts
  3.3 Quantifying the positive and negative sentiment
  3.4 Regression methods

4 Results
  4.1 Correlation Analysis
    4.1.1 Positive ratio vs. negative ratio
    4.1.2 Different holding periods and securities
  4.2 Regression Analysis
    4.2.1 Positive ratio vs. negative ratio
    4.2.2 Difference between stocks
    4.2.3 Difference between stocks and indices
  4.3 Time-series Analysis
    4.3.1 Lag selection
    4.3.2 VAR Model
    4.3.3 Stability test
    4.3.4 Granger causality analysis

5 Conclusion

List of Figures

1 Social network penetration in China from 2012 to 2018
2 Domestic market capitalization of stock exchanges in the world in 2014
3 Accumulated Number of Individual A Share Accounts in China
4 Trading Volume of Different Investor Type in China (2011, 2012)
5 Timespan and Posts Under the Selected 30 sections
6 Company Information of the 28 Selected Stocks
7 Summary of Method
8 List of Positive & Negative Words for Stock Market
9 Denotations of Returns
10 Correlation of Negative Ratio and Returns of Sample Stocks and Indices
11 Correlation of Positive Ratio to Returns of Sample Stocks and Indices
12 Average of Correlation Coefficients for Positive and Negative Ratio
13 Negative Correlations with Different Returns for Sample Securities
14 Positive Correlation with Different Holding Period Return
15 Sample Securities Ranked by Posts per Day
16 Insignificant Positive Ratio and Significant Negative Ratio for 11 Stocks
17 Positive Ratio and Negative Ratio are Both Significant for 17 stocks
18 Outliers in Regression Coefficients
19 Coefficients of Positive Ratio and Negative Ratio for Stocks
20 Regression Results for Indices
21 Coefficients for Positive Ratio: Indices vs. Stocks
22 Coefficients for Negative Ratio: Indices vs. Stocks
23 Lag Length Selection
24 VAR Model
25 Unit Root Check
26 Granger Causality Test

1 Introduction

1.1 Literature review on investor sentiment and the stock market

Behavioral science tells us that emotions can influence people's decisions. In finance, many researchers in behavioral finance have found that stock performance is affected by investor behavior and sentiment. In the standard finance model, unemotional investors always force capital market prices to equal the rational present value of expected future cash flows. Behavioral finance has grown substantially in the past decade to augment this standard model.

The first dimension of behavioral finance concerns behavior patterns. Many behavior patterns and biases have been discovered. For example, M. Seasholes and N. Zhu (2010) find that individuals tilt their portfolios towards locally headquartered firms, and that this local bias does not bring them superior returns. J. Engelberg and C. Parsons (2011) identify a causal effect of local media on trading behavior: all else equal, local press coverage increases the daily trading volume of local retail investors.

The second dimension of behavioral finance is related to sentiment. One of the most important assumptions in behavioral finance is that investors are subject to sentiment. Investor sentiment, defined broadly, is a belief about future cash flows and investment risks that is not justified by the facts at hand (Baker 2007). The question is no longer whether investor sentiment affects stock prices, but rather how to measure investor sentiment and quantify its effects. Many measures have been developed, such as Investor Surveys (Qiu and Welch, 2004), Investor Mood (Kamstra, Kramer and Levi 2003), Retail Investor Trades (Barber, Odean, and Zhu, 2003), IPO First-Day Returns, and option-implied volatility (the Market Volatility Index, or VIX, which measures the implied volatility of options on the Standard and Poor's 100 stock index). However, those measures are only proxies for investor sentiment, not direct measurements. In addition, data availability narrows these options considerably, because data for some measures are costly and sometimes subjective.

A direct way to measure investor sentiment is to quantify the language used in the media. By quantifying language, researchers can examine and judge the directional impact of a limitless variety of events. Tetlock (2008) analyzes the negative words in the Wall Street Journal and concludes that negative words in firm-specific stories leading up to earnings announcements significantly contribute to a useful measure of firms' fundamentals. Feng Li (2009) uses a Naive Bayesian algorithm to detect tone in Management's Discussion and Analysis of Financial Condition and Results of Operations (MD&A) and finds that the tone of the forward-looking statements is positively correlated with future performance and has explanatory power incremental to other variables. Loughran and McDonald (2010) compare the widely used Harvard-IV-4 TagNeg (H4N) file with five other word lists. Those studies have explored the quantification of news in journals and of public information released by companies.

As the internet begins to play a major role in business and in people's everyday lives, social media has become one of the most important venues where individual investors share their opinions on financial securities. The content on social media, diverse in quality, huge in quantity and different from traditional media, provides a direct source of data for measuring investor sentiment. A growing body of literature has addressed the relationship between user-generated content on social media and the stock market in the United States. Bollen, Mao and Zeng (2011) find that the mood expressed on Twitter among financial investors can predict daily stock returns. Karabulut (2013) finds that the National Happiness Index issued by Facebook is correlated with daily stock returns and trading volume. Gilbert and Karahalios construct an Anxiety Index based on Twitter to predict the stock market. Chen and De (2014) conduct research on Seeking Alpha and find evidence that views expressed on Seeking Alpha can predict future stock returns. With its ever-growing amount of user-generated content, social media has become an important confluence of investment sentiment and has exerted its influence on financial markets.

However, little English-language literature addresses the correlation between social media and the stock market in countries other than the United States. In this study, I explore the correlation between social media and the stock market in China.

Figure 1: Social network penetration in China from 2012 to 2018

1.2 Social media and stock markets in China

The past decade has witnessed the boom of social media in China. In addition to having the world's biggest Internet user base (513 million people, more than double the 245 million users in the United States), China also has the world's most active environment for social media. More than 300 million people use it, from blogs to social-networking sites to microblogs and other online communities. That is roughly equivalent to the combined population of France, Germany, Italy, Spain, and the United Kingdom. In addition, China's online users spend more than 40 percent of their time online on social media, a figure that continues to rise rapidly. Statistics show that Chinese people spend an average of 3 hours daily on social media, 1 hour more than on TV. WeChat, the popular application that combines the best of Facebook and WhatsApp, had 600 million monthly active users as of June 2015. With such a high penetration of social media in China, how social media has affected the stock market becomes an interesting question.

Figure 2: Domestic market capitalization of stock exchanges in the world in 2014 (in billion U.S. dollars)

In addition, the Chinese stock markets have been growing very fast in the past two decades, though they are still young by global standards. Taken together, China's two stock markets ranked second in the world in market capitalization in 2014, behind the New York Stock Exchange. As of December 31, 2015, the total number of A share accounts had reached over 200 million. Trading in the Chinese stock markets is also very active compared with many other stock markets in the world, which makes them among the most dynamic stock markets in the world. Volatility is also high in the Chinese stock markets. For example, the Chinese stock market suffered a major crash in 2015. The crash began with the popping of the stock market bubble on 12 June 2015. A third of the value of A-shares was lost within one month of the event. Against such a dynamic and volatile backdrop, studying the effects of investor sentiment on the Chinese stock markets has the potential to help us understand this market better. For example, what role does investor sentiment play in this volatile market? Do the findings on investor sentiment in the United States apply to the Chinese stock markets?

Before answering those questions, the information and market characteristics of the Chinese stock markets need to be considered. According to Fama's (1970) efficient market hypothesis (EMH), efficient stock markets accurately reflect all available information at all times. Weak-form efficiency implies that current prices reflect all historical price information, and the semi-strong form implies that all public information is fully reflected in market prices. In the strongest form of the theory, even private (insider) information is already incorporated into market prices. Many studies have been conducted to identify which form the Chinese stock market belongs to, and the results are mixed. Amelie and Olivier found that B shares on Chinese stock exchanges do not follow the random walk hypothesis and are therefore significantly inefficient, whereas A shares are more efficient. Nisar adopted three methods to test the random walk theory and concluded that the Chinese stock market is inefficient. Malkiel argued that the Chinese stock market was broad-form efficient because semi-strong and weak form both existed in the Chinese stock market. As many economists suggest, a well-functioning financial system should be supported by a strong legal system and by proper corporate governance, but China has neither. This fact may partially explain the mixed results in testing the EMH. Moreover, behavioral finance also sheds new light on resolving the contradictions in applying the EMH to the Chinese market.

Figure 3: Accumulated Number of Individual A Share Accounts in China from 2001 to 2015

One feature that sets the Chinese stock market apart from any other market in the world is that individual investors are very active in China. Individual investors, who hold merely 26% of total market capitalization, account for 78% of daily trading volume. China's approximately 200 million retail investors trade more often than investors in any other country: 81 percent of individual investors trade at least once a month, compared with 53 percent in the U.S., according to a survey by State Street in 2015. Moreover, individual investors in China are not as educated as those in other countries: as a survey by Bloomberg shows, more than two-thirds of the most recent batch of new investors did not even graduate from high school.

Figure 4: Trading Volume of Different Investor Type in China (2011, 2012)

Much of the literature has identified that individual investors are prone to many biases. De Bondt (1998) portrays individual investors as people who discover naive patterns in past price movements, share popular models of value, and trade in suboptimal ways. Subsequent researchers have discovered many biases that individual investors suffer from, such as overconfidence, availability, framing and accounting biases. Barber (2008) found that individual investors are prone to invest in attention-grabbing targets because of their limited energy and time to search for investments. With all those portraits and biases documented in the literature, it is natural to wonder whether those biases are present in investors' social media postings. The large base of individual investors and the heavy involvement of individuals in daily stock trading in China make the Chinese stock market a unique place to observe how social media has affected people's investment behavior.

1.3 Literature review on Chinese NLP

Natural language processing (NLP) is the ability of a computer program to understand human speech as it is spoken. Human languages, usually referred to as natural languages, are dynamic sets of symbols and corresponding rules for communication. NLP can be considered a technique for realizing linguistic theory to facilitate real-world applications, such as online content analysis, machine learning, etc., and it has grown very fast as a component of artificial intelligence (AI) since its inception in the 1950s. In this regard, Chinese NLP is no exception. However, the lack of clear delimiters between words in Chinese sets Chinese NLP apart from that of western languages. Unlike English text, in which words are delimited by white spaces, Chinese text represents sentences as strings of Chinese characters (hanzi) without similar natural delimiters between them. For this reason, automatic word segmentation, the major step in Chinese morphological analysis, lays the foundation of any modern Chinese information system (Wong, Li, Xu, Zhang, 2010).

The problem of Chinese word segmentation has been studied by researchers for many years. Several different algorithms have been proposed, which, generally speaking, can be classified into character-based approaches and word-based approaches (Wong, Li, Xu, Zhang, 2010). The character-based approach is mainly used to process classical Chinese texts. It is simple and easy to use, which leads to the further advantages of reduced costs and minimal overhead in the indexing and querying process. Word-based approaches, on the other hand, are gaining popularity because of the increasing computational power of computers. Word-based approaches, as the name implies, attempt to extract complete words from sentences. They can be further categorized as statistics-based, dictionary-based, comprehension-based, and learning-based approaches. Because much of the English-language literature on social media and stock markets uses lexicons and word lists, in this study I follow suit and choose a dictionary-based approach to analyze online content on Chinese social media.

1.4 Summary

One major assumption of behavioral finance is that investors are affected by sentiment, and this has been supported by many studies in past decades. Now, in the era of social media, a burgeoning literature has begun to address the effects of social media on the stock market. Researchers have investigated the explanatory power of social media in the United States, such as Twitter, Facebook, Seeking Alpha, etc., and positive evidence has been found. As China is now the largest market for social media and has the second largest stock markets in the world, the effect of social media on Chinese stock markets becomes an interesting question. In addition, the development of Chinese NLP techniques has provided ample lexicons and tools for textual analysis in Chinese. Therefore, in this study I explore the correlation between social media and stock markets in China. Specifically, I explore whether a correlation between social media and the Chinese stock market exists, and how the correlation differs between the negative sentiment ratio and the positive sentiment ratio. Correlations for different stocks and market indices are also compared to see which "tags" or topics connect social media and the Chinese stock market most closely. Following prior literature, I also test the correlation of social media sentiment with returns over different future periods to see how the correlation persists.

2 Data

This study uses post data from Guba, a popular investment-themed social media site in China. Financial price data are downloaded from iFind. The sample period varies because of data availability on Guba.

2.1 Guba

2.1.1 Guba as a social media in China

Guba (http://guba.com.cn/) is the most popular investment-related online community in China. It is part of Eastmoney.com, which is the largest online financial media outlet in China. In the Alexa ranking system, Eastmoney ranks 772 globally, well above the ranking of Seeking Alpha (1,480). According to iResearch, during January 2015 the Daily Unique Visitors to Eastmoney.com was 15,210,00, 5.9% of all netizens in China. Guba's mobile application is also widely downloaded among smartphone users. In addition, the Weekly Effective Viewing Duration on Eastmoney during January 2015 reached 21,220,000 hours, which indicates strong user loyalty. It is quite easy to post on Guba, which does not even require registration before posting. This ease of posting increases the popularity of Guba among individual investors, especially less educated investors.

Guba is a topic-based, forum-style social media site. Each topic is named after one stock or index, so posts are naturally categorized under the different stocks or indices of the stock market. There are over 2200 topics, covering almost every individual stock and market index. In this study, I download all the posts under the selected stocks and indices, and all the posts under one stock or index are analyzed to extract the investor sentiment for that stock or index. I use the stock name to denote the discussions under one specific topic.

2.1.2 Posts and samples

I download posts under 30 sections in total. The sections and post information are shown in Figure 5. Among the 30 sections, two are market indices: the Shanghai Composite Index and China Stock Index Futures. Both are typical indices for the overall stock market in China. China Stock Index Futures are known to be very responsive to market information and more liquid than equities. I also download the posts of 28 individual stocks. My selection criteria for these stocks are (1) representativeness of the A share market in China; (2) relative popularity among all sections; and (3) heterogeneity. These stocks are all component stocks of the CSI 300, a capitalization-weighted stock market index designed to replicate the performance of 300 stocks traded on the Shanghai and Shenzhen stock exchanges, so they are very representative of the Chinese stock markets. They are also of different sizes and from different industries.

Figure 6 shows additional information on those stocks. The 28 stocks comprise a mixture of industries, such as media, financial services, retail, construction, transportation, etc. Market capitalization varies from 19,205 million CNY to 1,124,975 million CNY. The PE ratio also varies a lot, from a low of 6 to a high of 188. Companies like China Vanke and OCT Group have been listed since the 1990s, while some companies only became public in recent years. In this regard, the companies are very heterogeneous.

The time spans of the posts downloaded in this study vary across sections. One reason is that the companies were listed at different times, and posts centered on a company's shares are only possible after its date of listing. Another reason for the variation in periods is technical: for very popular sections such as Shanghai Composite, a huge amount of data exists, and some data may have been lost on the server or in the process of downloading. In order to ensure data quality, I discard periods that contain consecutive short blank intervals.

Category  Ticker  English Name             Beginning Date  Ending Date  Total Posts
Indices   000001  Shanghai Composite       11/18/2013      11/15/2015   1,129,503
Indices   000300  China Stock Future       11/6/2014       11/18/2015   85,094
Stocks    000002  China Vanke              11/10/2013      12/10/2015   76,914
Stocks    000069  OCT Group                7/18/2008       12/28/2015   68,198
Stocks    000156  WASU Media               10/19/2012      12/31/2015   28,099
Stocks    000157  -                        10/9/2008       12/31/2015   126,625
Stocks    000333  -                        9/19/2013       12/31/2015   44,541
Stocks    000338  -                        6/20/2008       12/31/2015   115,473
Stocks    000651  Gree Electrical          9/26/2008       12/31/2015   44,821
Stocks    000712  Golden Dragon            12/23/2008      12/30/2015   6,455
Stocks    000725  BOE Tech                 4/3/2009        12/31/2015   63,236
Stocks    000728  Guoyuan Securities       7/1/2008        12/31/2015   75,854
Stocks    002008  Han's Laser              218!2013        12/29/2015   51,425
Stocks    002024  Suning                   4/25/2013       12/31/2015   72,274
Stocks    600038  Avicopter                7/17/2008       12/8/2015    15,441
Stocks    601618  MCC                      9/10/2009       12/31/2015   62,570
Stocks    601669  Power Construction       9/25/2011       12/31/2015   132,530
Stocks    601688  -                        2/9/2010        12/31/2015   53,592
Stocks    601699  Lu'an Envir. Energy      7/8/2008        12/31/2015   48,943
Stocks    601718  Jihua Group              8/24/2010       12/23/2015   16,791
Stocks    601766  CRRC Corp.               7/31/2008       12/31/2015   556,721
Stocks    601872  CMES                     6/2/2008        12/31/2015   75,511
Stocks    601888  CITS                     9/22/2009       12/31/2015   30,882
Stocks    601898  China Coal Energy        6/1/2008        12/31/2015   83,052
Stocks    601899  Zijin Mining             6/6/2008        12/31/2015   157,243
Stocks    601901  Founder Securities       7/31/2011       12/31/2015   147,603
Stocks    601919  COSCO                    6/2/2008        12/31/2015   124,354
Stocks    601928  Phoenix Publ. & Media    11/17/2011      12/31/2015   83,290
Stocks    601939  China Construction Bank  6/6/2008        12/31/2015   101,991
Stocks    603993  -                        9/10/2012       12/31/2015   55,400
Total                                                                   3,734,426

Figure 5: Timespan and Posts Under the Selected 30 sections

Ticker     Market Value (Dec 31 2015), Millions CNY  PE (Dec 31 2015)  Date of Listing  Industry
000002.SZ  263,094                                   17                1991-01-29       Real Estate
000069.SZ  64,715                                    14                1997-09-10       Real Estate
000156.SZ  47,014                                    123               2000-09-06       Media & Entertainment
000157.SZ  36,937                                    69                2000-10-12       Heavy Machinery
000333.SZ  140,001                                   13                2013-09-18       Electrical Manufacturing
000338.SZ  36,225                                    8                 2007-04-30       Automobile
000651.SZ  134,452                                   9                 1996-11-18       Electrical Manufacturing
000712.SZ  26,092                                    67                1997-04-15       Financial Service
000728.SZ  44,369                                    32                1997-06-16       Financial Service
000725.SZ  103,151                                   41                2001-01-12       Electrical Manufacturing
002008.SZ  27,339                                    39                2004-06-25       High-tech Manufacturing
002024.SZ  99,302                                    115               2004-07-21       Retail
600038.SH  31,083                                    94                2000-12-18       Aircraft
601618.SH  103,387                                   29                2009-09-21       Construction
601669.SH  110,450                                   23                2011-10-18       Construction
601688.SH  133,389                                   31                2010-02-26       Financial Service
601699.SH  19,205                                    20                2006-09-22       Mining
601718.SH  44,240                                    38                2010-08-16       Textile
601766.SH  329,574                                   66                2008-08-18       Railway Equipment
601872.SH  37,573                                    188               2006-12-01       Transportation
601888.SH  57,901                                    39                2009-10-15       Travelling
601898.SH  65,588                                    105               2008-02-01       Mining
601899.SH  65,441                                    32                2008-04-25       Mining
601901.SH

Figure 6: Company Information of the 28 Selected Stocks

2.2 Financial data

I download all financial data from Wind Financial Terminal, a leading financial data provider in China that covers stocks, bonds, funds, indices, warrants, commodity futures, foreign exchange, and the macro industry. Daily closing prices are downloaded for the Shanghai Composite Index, China Stock Index Futures and all the individual stocks. Following prior literature (Chen, De, et al. 2011), I compute returns for different holding periods: R1 denotes the return from holding the stock for 1 day, i.e. buying at the closing price of day t - 1 and selling at the closing price of day t. R2 denotes the return from a 2-day holding period, and R3 denotes the return from a 3-day holding period. The same goes for R4, R5, R7, R10, R15 and R30.
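The holding-period returns can be illustrated with a minimal sketch. It is not the code used in this study, and it rests on one assumption: the text defines only R1 explicitly (buy at the close of day t - 1, sell at the close of day t), so the sketch extends that convention to a k-day holding period by selling at the close of day t + k - 1. The function name and the sample prices are hypothetical.

```python
def holding_period_return(closes, t, k):
    """Return from buying at the close of day t-1 and selling at the close of day t+k-1.

    `closes` is a list of daily closing prices indexed by trading day.
    Returns None when the window falls outside the available data.
    ASSUMPTION: the k-day convention generalizes the R1 definition in the text.
    """
    if t < 1 or k < 1 or t + k - 1 >= len(closes):
        return None
    return closes[t + k - 1] / closes[t - 1] - 1.0


# Made-up closing prices for four consecutive trading days.
closes = [100.0, 102.0, 101.0, 105.0]
r1 = holding_period_return(closes, t=1, k=1)  # buy at day 0 close, sell at day 1 close
r3 = holding_period_return(closes, t=1, k=3)  # buy at day 0 close, sell at day 3 close
```

Returning None for windows that run past the end of the price series keeps incomplete holding periods out of any later correlation or regression.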

3 Method

In this study I use the dictionary-based approach of Chinese natural language processing for textual analysis. The first step is to pick a suitable lexicon. In order to make the textual analysis reflect the sentiment of Guba posts accurately, the existing dictionaries also need to be augmented. I select the most popular words used in stock discussions among Chinese netizens and add them to the positive and negative dictionaries of HowNet and NTUSD. Some of these words are catchwords and some are jargon. The second step is to segment all the posts into words. To do this, the Jieba parser is used to parse all the posts from Guba. After parsing, I compute the total counts of negative and positive words, and the ratio of each to the total word count for each day. With those ratios and the price data, I conduct correlation and regression analysis.
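The final step, relating daily sentiment ratios to returns, can be sketched with a plain Pearson correlation coefficient. This is only an illustration of the statistic, not the analysis code of this study; the function name and the sample series are hypothetical, made-up values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up daily negative-sentiment ratios and same-day R1 returns.
negative_ratio = [0.01, 0.02, 0.03, 0.04]
r1_returns = [0.02, 0.01, 0.00, -0.01]
rho = pearson(negative_ratio, r1_returns)  # perfectly linear toy data, so rho is -1
```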

3.1 Dictionaries and word list

Many lexicons have been developed for textual analysis in NLP. Of all these lexicons, HowNet and NTUSD are the two most popular dictionaries. HowNet is an online common-sense knowledge base that unveils inter-conceptual relationships and inter-attribute relationships of concepts as connoted in lexicons of Chinese and their English equivalents.

Figure 7: Summary of Method. Data Collecting: (1) posts, (2) prices. Text Analysis: (1) dictionary, (2) parsing, (3) number of sentiment words, (4) sentiment ratio. Statistical Analysis: (1) correlation analysis, (2) linear regression, (3) time-series regression.

NTUSD (National Taiwan University Sentiment Dictionary) is based on the Chinese language and is independent of other languages. Xu, Zhao, Qiu and Hu (2010) compare those dictionaries and find that neither is sufficient on its own: many sentiment words are not included in the current Chinese sentiment dictionaries. For example, HowNet contains 3969 positive words and 3755 negative words, and NTUSD contains 2648 positive words and 7742 negative words. Among them, only 669 positive words and 877 negative words are shared. In order to better measure the sentiment behind the posts, I combine the lexicons of HowNet and NTUSD as the base dictionary for this study.

In addition, I manually select and produce a word list specially designed for the stock market. I first use the Chinese parser to divide all the posts into words and extract the 300 most frequent words. I then distribute the list of 300 words to three experienced individual investors, who are also heavy users of Guba. They mark the words which they think express positive or negative emotions. In the meantime, in order to eliminate parsing errors, I also read over 400 pages of posts to pick out all positive and negative words. Based on the two lists, I build a list of 53 positive words and 70 negative words (Figure 8).
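The two dictionary operations described above, merging several word lists into one lexicon and extracting the most frequent words from the segmented posts, can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function names are hypothetical, and the sample words are toy placeholders, not actual entries of HowNet or NTUSD.

```python
from collections import Counter


def build_lexicon(*word_lists):
    """Union several word lists into one set, dropping duplicates."""
    lexicon = set()
    for words in word_lists:
        lexicon.update(words)
    return lexicon


def top_words(segmented_posts, n):
    """Return the n most frequent words across all segmented posts."""
    counts = Counter(word for post in segmented_posts for word in post)
    return [word for word, _ in counts.most_common(n)]


# Toy placeholder word lists (NOT real dictionary contents).
hownet_positive = {"好", "优秀"}
ntusd_positive = {"好", "上涨"}
custom_positive = {"涨停"}

positive = build_lexicon(hownet_positive, ntusd_positive, custom_positive)
shared = hownet_positive & ntusd_positive  # words present in both dictionaries

# Toy segmented posts; each post is a list of words.
posts = [["涨停", "好"], ["涨停", "跌"], ["涨停", "好"]]
candidates = top_words(posts, 2)  # most frequent words, to be reviewed by hand
```

Using sets makes the overlap between dictionaries a single intersection, which is the kind of comparison behind the shared-word counts reported above.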

3.2 Segmenting and parsing the posts

Chinese differs greatly from English in its syntactic features. For example, Chinese makes less use of function words and morphology than English, and verbs appear in a single form with few supporting function words. Subject pro-drop, the null realization of uncontrolled pronominal subjects, is also widespread in Chinese but rare in English. Natural Language Processing (NLP) tools therefore differ considerably between the two languages. Several open-source Chinese NLP tools are widely used, such as Jieba, BosonNLP, NLPIR and LTP-Cloud. The Stanford NLP Group and the Berkeley NLP Group also provide Chinese segmenters and parsers.

Figure 8: List of Positive & Negative Words for Stock Market (panels: Negative words; Positive words)

I adopt the Jieba package (https://github.com/fxsjy/jieba) because it is widely used in the social media industry and is considered stable and accurate for Chinese segmentation. Jieba is a free, open-source tool written in Python. Because its algorithm is based on a trie structure, Jieba can enumerate all possible ways of splitting a sentence into words and find the most probable path through dynamic programming. In addition, Jieba uses a Hidden Markov Model with the Viterbi algorithm to detect new words. With these features, Jieba has become one of the most popular segmenters among Chinese users.
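The dictionary-plus-dynamic-programming idea described above can be sketched in a few lines. This is an illustrative toy, not Jieba's actual code: the word frequencies are invented, a plain dictionary lookup stands in for the trie, and the HMM step for new words is omitted.

```python
import math


def build_dag(sentence, freq):
    """For each start index, list the end indices of dictionary words starting there."""
    n = len(sentence)
    dag = {}
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in freq]
        dag[start] = ends or [start]  # unknown character: treat it as a one-character word
    return dag


def segment(sentence, freq):
    """Return the segmentation with the highest product of word probabilities."""
    log_total = math.log(sum(freq.values()))
    n = len(sentence)
    dag = build_dag(sentence, freq)
    # best[i] = (best log-probability of sentence[i:], end index of its first word)
    best = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(freq.get(sentence[start:end + 1], 1)) - log_total + best[end + 1][0], end)
            for end in dag[start]
        )
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


# Invented frequencies for a toy dictionary.
toy_freq = {"今天": 40, "股票": 50, "大涨": 30, "今": 3, "天": 8,
            "股": 5, "票": 5, "大": 10, "涨": 10}
tokens = segment("今天股票大涨", toy_freq)  # ['今天', '股票', '大涨']
```

The right-to-left pass keeps, for every position, only the best way to segment the rest of the sentence, so the number of candidate paths never has to be enumerated explicitly.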

3.3 Quantifying the positive and negative sentiment

Prior literature has noted that positive words in English are of limited use in measuring sentiment, because they are frequently subject to negation, and corporate communications rarely convey positive news using negated words (Loughran and McDonald 2011). However, this may not be the case on social media, where people are less careful about their language. Moreover, many of the positive words in Chinese are verbs; a Chinese investor who wants to express the opposite opinion would simply use the opposite verb. Negation of the original verb is not the same as using the opposite verb. Some research on Weibo uses positive sentiment in its tests and finds that positive measures are also useful (Pang, Li, et al.). Based on these considerations, I include positive measures in this study.

For every section, I calculate the total word count per day, the count of positive words per day, and the count of negative words per day. Formally, I use the frequencies of positive words and negative words as the measures of positive and negative sentiment.

$$\mathrm{NegRatio} = \frac{\text{No. of Negative Words}}{\text{Total Word Count}}$$

$$\mathrm{PosRatio} = \frac{\text{No. of Positive Words}}{\text{Total Word Count}}$$
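As a minimal sketch of these two ratios, the function below counts dictionary hits in one day's segmented tokens. The function name, the token list and the tiny word sets are hypothetical and serve only to show the arithmetic.

```python
def daily_sentiment_ratios(tokens, positive_words, negative_words):
    """Return (PosRatio, NegRatio) for one day's tokens from one section."""
    total = len(tokens)
    if total == 0:
        return 0.0, 0.0  # no posts that day; the choice of 0.0 is an assumption
    pos_count = sum(1 for t in tokens if t in positive_words)
    neg_count = sum(1 for t in tokens if t in negative_words)
    return pos_count / total, neg_count / total


# Hypothetical example: 10 tokens, 2 positive hits, 3 negative hits.
day_tokens = ["今天", "股票", "大涨", "利好", "但是", "明天", "套牢", "暴跌", "风险", "很大"]
pos_ratio, neg_ratio = daily_sentiment_ratios(
    day_tokens, {"大涨", "利好"}, {"套牢", "暴跌", "风险"}
)
```

Each occurrence of a word is counted, so a negative word repeated three times in a day contributes three to the numerator.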

3.4 Regression methods

I first test the correlation of the sentiment ratios at day t with the returns of 9 different holding periods (R1, R2, R3, R4, R5, R7, R10, R15 and R30). In this study, Ri denotes the return of the holding period from day t to day t + i. For example, R1 is the same-day return at day t, because the holding period is just 1 day; R2 is the return from day t to day t + 2; and R30 is the return from holding the security for one month from day t. Figure 9 shows the relationship of these returns. After the correlation analysis, I test the linear relationship between the returns of different holding periods and the sentiment measures from Guba using the following regression.

$$R_{i,t,n} = \alpha_{i,t} + \beta_{i,1}\,\mathrm{NegRatio}_{i,t} + \beta_{i,2}\,\mathrm{PosRatio}_{i,t} + \epsilon_{i,t}$$

The dependent variable is the return $R_{i,t,n}$, where i indexes sections, t denotes the day on which the posts are published on Guba, and n denotes the number of days in the holding period. Specifically, I choose 9 holding periods to study: R1, R2, R3, R4, R5, R7, R10, R15 and R30. I therefore run 9 regressions for each stock and index to see how the sentiment ratios relate to the different returns.
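A regression of this form, with an intercept and two regressors, can be sketched with the normal equations. The code below is an illustration in plain Python rather than the statistical package used for the thesis results, and the data are synthetic, generated from known coefficients so that the fit can be checked.

```python
def ols_two_regressors(y, x1, x2):
    """Least-squares fit of y = a + b1*x1 + b2*x2; returns (a, b1, b2)."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Normal equations: (X'X) beta = X'y, stored as a 3x4 augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return tuple(aug[i][3] / aug[i][i] for i in range(3))


# Synthetic data: returns built exactly from alpha=0.01, beta1=-2, beta2=1.
neg = [0.02, 0.03, 0.025, 0.04, 0.035]
pos = [0.05, 0.04, 0.06, 0.03, 0.045]
ret = [0.01 - 2.0 * n + 1.0 * p for n, p in zip(neg, pos)]
alpha, beta_neg, beta_pos = ols_two_regressors(ret, neg, pos)
```

In the study itself this fit would be repeated once per holding period for each security, giving 9 sets of coefficients per section.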

After the linear regressions with future returns, I also conduct time series analysis and construct vector autoregression (VAR) models for selected stocks and indices using different lags of the positive ratio and the negative ratio. Granger causality tests are also conducted to assess the predictive power of these models.

Figure 9: Denotations of Returns (timeline from day t through t+2, t+3, t+4, t+5, t+7, t+10, t+15 and t+30; the negative ratio and positive ratio are measured at day t)
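The correlation step can be sketched as follows. The exact price convention behind each return is not stated in the text, so the code assumes close-to-close returns in which R1 at day t compares the close of day t with the close of day t-1 and Rn extends the holding period to n trading days; both this convention and the toy price series are assumptions for illustration.

```python
import math


def holding_period_returns(closes, n):
    """R_n for each day t >= 1: closes[t + n - 1] / closes[t - 1] - 1 (assumed convention)."""
    return [closes[t + n - 1] / closes[t - 1] - 1 for t in range(1, len(closes) - n + 1)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series with non-zero variance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy closing prices and a toy negative ratio for days 1..3.
closes = [100.0, 110.0, 99.0, 108.9]
r1 = holding_period_returns(closes, 1)   # approximately [0.1, -0.1, 0.1]
r2 = holding_period_returns(closes, 2)   # approximately [-0.01, -0.01]
neg_ratio = [0.01, 0.03, 0.01]
corr = pearson(neg_ratio, r1)            # perfectly negative in this toy case
```

Longer holding periods leave fewer usable days at the end of the sample, so the sentiment series has to be truncated to the same length before the correlation is computed.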

4 Results

4.1 Correlation Analysis

4.1.1 Positive ratio v.s. Negative ratio

Figure 10 shows the correlation coefficients between the negative ratio and the different returns, and Figure 11 shows those between the positive ratio and the different returns.

First, the results for the negative ratio are far more statistically significant than those for the positive ratio. Of the 30 sample stocks and indices, 29 securities show a very significant correlation between the negative ratio and the different returns (most are significant at the 1% level), while only 16 securities show a significant correlation between the positive ratio and the different returns. In this sense, the negative ratio has a more substantial correlation with stock returns.

Second, the correlation coefficients of the negative ratio are much larger in absolute value than those of the positive ratio. For almost every stock and index, the negative ratio has a much higher correlation with returns. The only exception is the Shanghai Composite, whose positive and negative ratios have almost the same level of correlation. Most of the positive correlation coefficients are much lower than those of the negative ratio. Figure 12 shows

21 licker Ratio R1 12 K3 R4 R5 R? RIO R15 R30 -0.195"' (11100)2 negative ratio -0.223*** -0.280' .313"* -0283'" 4210*'* -0281*** 0.24* -0266"* 00(81618 negative ratio -0- 1* 0.68* -0.155*** -0. 138** 134* ~0.137*** -0 141** -0.1254*** 0855"* 0,0449 0 157*** 000156 negative ratio -00258 -0.0268 00205 -0 0106 0,000432 0.00640 A.0171 -0.117*** 410990* -9082** 000157 negative ratio -0 175* * -0. 1901* -0184** -168"-O 0.155** -00925' 000333 negative ratio -009761 -0J39 0.13* -0131" -0.122* -0.0678 -0:0278 -0,0634 -0.142** 0130* -0.12"* .0864"' 000338 negaive ratio 07131** -0.142+1 -0,134*** -0 130** 1 0153* -0.135"' -0.114++ -0.127* 000651 negative ratio -0,04* -0108" -0.126*** -0 126**-OI22** -0t23"' -0.0338 -oOso3 000712 negative ratio -0.181 .158** 4.143* -0,107' -0. 0* -0.0935 -A0645 000725 ncgative ratio -011 0"'* -3114** -0.0976** -00946* -0.0701* -0.0348 -00379 410304 -0.0169 -0. 157"' -0.0665 000728 negative ratio -094*** -0.212* -0.211"' -0,216' -0.218"* -0.1910** -0.194** -0.196-" .236"* -0.199" 002008 nogativeratio -0.255'** -. 268- -0.286"* -0233*** -0.187* -0.158** -0.171" 4150"' -0.142* -0122"* 002024 negtive ratio -0 "* -0.192*** -0.174** 4- 179*** -0-171' -0.1321 -0.127W*4 -0.133"' .y44'** -0.0886* 6o )38 negative ratio -0 143*** -0.117* -0.117* -0.1151' - .0418 0,00468 601618 negative ratio -0.0534 -0.0851* -0.0956" -0.1956" -0.0870' -0.07204 -0.0750' -0.230"' -0226"' .225"* -0.24* 601669 negative ratio -024,i' -0.235"' -0.253** -0.237"' -0 22* 601688 negative ratio -0164*' -076*** -0.144"** -0115"' -0136 * -0.112** -00837' -0.108" 4.08310 -0171"' -0.191' -0184"*' 601699 negative ratio 033* -0-173"* -0.080.' -0.181* -0.181"* -0.168" 601718 ne-gaye ratio -095* -0.285** -0.287*' -0.268"* -0,166* -0,116 -0,124 -0.118 -).164* 601766 negative ratio -0.2"2*** 4219"' -0.214"'0 -0 197"* -0.195** -0.180*** -0164*"* -0.148"' -0138*"' 601871 negative ratio -050** -0.163 -0.145"' -0.133*** -0.133"' 4.143'" 4138.' -, 138"' 4112*" 601888 Iegat ve, ratio O 08** -0. 
i I1"' -0.111*"' 4.106'" -0AW93*** -00859"' -0.0794"w -0.0814** -0.0748** 601898 negative rtio -0104*** -0.111*** -0.105*** -0. 101'** -0.0933*** -0.08061" -00848*** -0.0716** -0.0231 601899 negative ratio 00903" -0.0925* -0.0924** -0.0799* -0.0982" -D103" -0.132*0" -0.120"' -0,103" 601901 ncgativc ratio -0,2* -0.192 -0.195" -0. 1802" -0 l "' -0. 138*** -0.114** -0.080" -0.0546 601919 nrgativ ratio .0157*"* -0.192" -. 192"* -1. 08" -0-" 0182*** -0,184" -015** -Y4117w*' 601928 negative ratio -0-163*** -0.155' -0,141*** -0 120'* -0.14** -0.126" -0.132** -0. 106** -0.0769* 601939 negative ratio -0101*** 407 -013' -080971* 093 * 00924*** -0.0433#" -. 082* -0.0896" -0.0563 -0.1000 4).112* 601993 negative ratio -0 119*.* -3-0" A0, 113" 5** -0.0939" -0.0587 -0,0558 4 0.279f -0.338'** -0426"'" 41441" negative ratio -0 23 -0.179"* -0.216**' 0 222"' -0240*'

Sbangha -0488** -0.499** -0454** -0.538*'* ) 0.w0" -0516"* Cbpoia negative ratio -0361"'* -0.466"' 42***

* p<0.05, ** p<0.01, *** p<0.001

Figure 10: Correlation of Negative Ratio and Returns of Sample Stocks and Indices

fitker Ratio R1 R2 R3 R4 R5 R7 RIO R15 1R30 00(032 positiv ratio 0.0814 0.0862 0049* 0.109' 0 1 0059 0 007(5 00148 40654 5%"'69 )0s0iVCno t .0600* 0045" 0.058 7 0565' o0567* 0.0505k 60.675'" 045 00416 o03(Oi7 0,0330 0 O066 (11210* 000156 roeotive reo 0.0142 o,0257 0.0252 0.0240 0.024 0001 ; iive man A0712"* 0190" 011768** 00827** 0.079** 0.0616' 00621' 0.0782" 0,01'2 000333 positive ratio 0.00676 0.240 0.04' 0.0627 0,0504 O 4O 0.0400 00721 it 108' 2' .0022* 113"* 14(33A posiaive ia't U.0417 0.0343 0,0477 0150.0 t.'622* 0.0622* 0.062 0 ) 0253 0051 pxitive rato 0.0847' Q. I I" 0,082" 0 0514* 0.0718 0i4 0 077 00374 000712 positivrerdai 00200 0.305 00.344 0.0314 0.0212 -0 (W!21 00143 0.0512 0,-06 000725 poiive Mr6o 0.0721' 0.0599 a0474 0.0353 0.0269 0. 051 7 0.0338 0557 0.02W8 000729 posltive ratio 0 028 0.0279 0-0303 0.094 0.0339 00456 0.0512* 00558' 0.8753" 0.0114r 0021)0 poia, rat ( 0135"' Q34*"' 0103" 0.,094* 0077 00491 0.0824 0047 002024 pomitiv ratc 0 151"** 0154-' .142*** 0.123 0.9501 00755 0.C659 00597 f01359 60833 pOitic Vrwio 0.0438 ',0493 0.0624 0,0297 0.0223 0020 U.018 0,0320 0.0133 00524 601611 postve ro 0 110* 00827* 0.0875* 0.0928" 0.063" 0 107" 0.0975* 0106* 0,17*" 601669 pontiac i I 0. 115, 103" 0.104"0 115"* 0.1U0S" 0113" 0.120400 1315" 00859" 0.0974" 00875" 0 t17#. 11614 i 0.0929" O18" (,104" 0.13**' 0,0957" 601699 poestiv tao 0123" 11105"'* 0.0833"' 0.07%* 00" 00604' 0.0477 0 015 "5 )'327 113* 601719 poiivc rtO 0.0626 0117 0.174* 0. 18* 0 * 0.250" 0,276** 0-21" 0764" 00684* 601766 postiv tiratc 0128*" 0J24** 01124** 0 111" 0.10'*** 0.)894** 0.0759** A002126 601871 poitiv Muo't 0O625* 061(3*" 0.0683" 0.0459* 0.0334 0.0256 0.0329 0005 601801i positive ratio 3 013' 0+867** 0.0743" 0.0575* 0.0593* 003-il 0.0591 0.0779"* 014 006133 01670" 601S91 pt10 iv9ra.o 0.040* Q,1677-- A0583* 0.0500' 0.0459 0.07 0.426 0.0878' 0,0589 60189') positia rato 009C? + 026*** 01 0.125*** 0.129*** 0. 
1200" 0.104+ 0132*** .147" 0147*"* 60101 positive 'MO .116"' 0132*** 0149** 0153*" 0.140"' 1I40"' 0.0373" 0.078301 0.0722 601919 pie t-c 00779' 00301 00793** 0.952* 0.098*" 0031" 00281 (0.0.59 60t72a pimitiv rat - 10M.1 0106*1, 0083"' 0.0713' 0.056 0(0F5 0),441 0. U10** 0.108*' 0 tItI,* 601939 " itivr rtxl 00766" ' 4f7' 0.0554* 0.0661 * 49' 1 101" 0.0242 V)t941 Pt idvl rato 00415 0910 00339 0.0359 0.43 0 - 0.0105 00108 -C02 -0 0643 101Y31 .0.70 Inde Fatie p tiivL rt-o U.0471 U0464 0100-14 -4. 019 -Cii)9 0,426"* 0.410"' 0164"' 0320"* 0229"' .1"'5** ponstirveattit 40"" 0400"' 0466*** poOl'S '* <0' "* p

Figure 11: Correlation of Positive Ratio to Returns of Sample Stocks and Indices

                                   R1      R2      R3      R4      R5      R7      R10     R15     R30
Positive ratio   Corr. coeff.      0.085   0.093   0.092   0.087   0.083   0.076   0.071   0.071   0.066
                 Average variance  0.0051  0.0073  0.0065  0.0059  0.0063  0.0053  0.0054  0.0036  0.0040
Negative ratio   Corr. coeff.      -0.156  -0.173  -0.173  -0.163  -0.155  -0.146  -0.143  -0.141  -0.118
                 Average variance  0.0043  0.0066  0.0077  0.0076  0.0076  0.0091  0.0107  0.0128  0.0146

Figure 12: Average of Correlation Coefficients for Positive and Negative Ratio

the average correlation coefficients of the positive ratio and the negative ratio. The correlation coefficients of the negative ratio are nearly twice those of the positive ratio in absolute value.

Thus, both the magnitude and the statistical significance are better when the negative ratio is used. This result accords with studies on English text. Positive words are subject to negation, so a positive word may express either a positive or a negative tone. Initially, I wondered whether this would not be the case in Chinese, but the correlation analysis shows that it holds in Chinese as well. I think the main reason lies in the dictionaries I use. Many positive words in HowNet and NTUSD are adjectives, which are more often subject to negation. Although I add a word list with many positive verbs and nouns to augment the dictionary, it makes up only a small part of the lexicon. Therefore, the negative ratio is better than the positive ratio in terms of both the degree and the significance of correlation.

4.1.2 Different Holding Periods and securities

Figure 13 shows the correlation coefficients for the different holding-period returns of each stock. There is no clear sign that the correlation with future returns decreases as the holding period gets longer. However, the variation across securities is very large.

Figure 13: Negative Correlations with Different Returns for Sample Securities (chart title: Negative Ratio Correlation with Returns of Sample Stocks and Indices)

Figure 14: Positive Correlation with Different Holding Period Return (chart title: Positive Ratio Correlation with Returns of Sample Stocks and Indices)

First, the Shanghai Composite Index has the largest correlation coefficients for both the positive and negative ratios with the returns of all holding periods. The China Stock Index Future also has a very high correlation with the negative ratio. Individual stocks vary a lot in their level of correlation. Some have relatively large correlation coefficients (in absolute value), such as 000002 (China Vanke), 002008 (Han's Laser), 601669 (Power Construction), 601766 (CRRC Corp) and 000157 (Zoomlion).

Why do these securities have higher correlation coefficients? One possible reason is that they have more data. Figure 15 shows that the Shanghai Composite Index (szzs) receives 1554 posts per day, more than five times as many as the China Stock Index Future (gzqh) and far more than any individual stock. 000002 (China Vanke), 002008 (Han's Laser), 601669 (Power Construction), 601766 (CRRC Corp) and 000157 (Zoomlion) all have frequent posts per day.

One outlier is 601718 (Jihua Group). Although it has very few posts per day, its positive correlation coefficients are larger than those of many other stocks. Considering the low significance of 601718's correlation coefficients in Figure 10 and Figure 11 and its low post frequency (rank 28 in the sample), I think 601718 is simply an outlier that can be ignored. Another outlier is 000156 (WASU Media), which has large positive correlation coefficients for the negative ratio, contrary to common sense. One reason for this abnormality may lie in its relatively infrequent posts.

4.2 Regression Analysis

4.2.1 Positive ratio v.s. negative ratio

Similar to the correlation results, the regression results for the positive ratio are not as good as those for the negative ratio in terms of significance. As Figure 16 shows, for 11 stocks the coefficients of the positive ratio are not significant across all 9 returns. In contrast, the coefficients of the negative ratio are significant for almost all sample stocks and returns (see Figure 16 and Figure 17).

Ticker   Post total   Days    Posts per day   Rank
szzs     1,129,503    727     1554            1
gzqh     85,094       377     226             2
601766   556,721      2,709   206             3
000002   76,914       760     101             4
601901   147,603      1,614   91              5
601669   132,530      1,558   85              6
002024   72,274       980     74              7
601899   157,243      2,764   57              8
601928   83,290       1,505   55              9
000333   44,541       833     53              10
002008   51,425       1,044   49              11
000157   126,625      2,639   48              12
601993   55,400       1,207   46              13
601919   124,354      2,768   45              14
000338   115,473      2,750   42              15
601939   101,991      2,764   37              16
601898   83,052       2,769   30              17
000728   75,854       2,739   28              18
601871   75,511       2,768   27              19
601618   62,570       2,303   27              20
000725   63,236       2,463   26              21
000069   68,198       2,719   25              22
601688   53,592       2,151   25              23
000156   28,099       1,168   24              24
601699   48,943       2,732   18              25
000651   44,821       2,652   17              26
601888   30,882       2,291   13              27
601718   16,791       1,947   9               28
600038   15,441       2,700   6               29
000712   6,455        2,563   3               30

Figure 15: Sample Securities Ranked by Posts per Day

In addition, Figure 16 and Figure 17 make it clear that, for the same stock, the coefficients of the negative ratio are always larger in absolute value than those of the positive ratio. This shows that in the regression model

$$R_{i,t,n} = \alpha_{i,t} + \beta_{i,1}\,\mathrm{NegRatio}_{i,t} + \beta_{i,2}\,\mathrm{PosRatio}_{i,t} + \epsilon_{i,t}$$

$\beta_{i,1}$ is larger in absolute value than $\beta_{i,2}$, i.e. the negative ratio has a stronger correlation with stock returns. As discussed above, this is because positive words are subject to negation. This study shows that the same holds for textual analysis in Chinese, especially when the dictionaries used consist largely of adjectives. Therefore, I conclude that the negative ratio is a better measure of investor sentiment, because it has a stable and significant correlation with the returns of different holding periods.

4.2.2 Difference between stocks

The correlations diverge widely across stocks (Figure 18). The coefficients for $\mathrm{NegRatio}_{i,t}$ range from -3 to +15 for individual stocks, and the coefficients

26 VARIABLES RI R2 R3 R4 R5 R7 RIO R15 R3 0 2.668 6.156* 14.16*** 000156 positive ratio 0.387 1.058 1.248 1.362 1.586 2.743 negative ratio -0.653 -0.972 -0.876 -0.486 0.0824 0.496 1.287 3.781 15.01*4* 0.262 -0.176 -0.282 -1.406* 000002 positive ratio 0.224 0.322 0.425 0.597* 0.685* negative ratio -0.683*** -I.255*** -L691*** -.7Q9*** -1.890*** -2.170*** -2.379*** -2.750*** -2.902*** positive-ratio -0.0524 -0,0286 0.0883 0.278 0.216 0.348 0.414 0.871 1.847** negative ratio -0.300** -0.613*** -0.734*** -0.762*** -0.800*** -0.486 -0.192 -0.576 -1.190* positive-ratio 0.0662 0.0661 0.145 0.214* 0.258* 0.313* 0.385** 0.480** 1.460*** neative ratio -0.297*** -0.477*** -0.547*** -0.607*** -0.806*** -0.880** -0.957** -1109-** 1004*** -0.0431 0.0179 0.206 0.2 000712 positive-ratio -0.00593 0.0117 0.0263 0.036 0.017 negative ratio -0.214*** -0.299*** -0.334*** -0.292** -0.309** -0.356* -0.286 -0.148 -0.535 positive-ratio 0.0538** 0.0629 0.058 0.0448 0.0383 0.107 0.0769 0.156* 0.104 000725 negative-ratio -0.0980*** -0.151*** -0.155*** -0.171*** -0.141** -0.0708 -0.0955 -0.0796 -0.0664

000728 positive ratio 0.0426 0.0573 0.102 0.128 0.116 0.204 0.272* 0.392* 0.991*** 0-0728 negativeratio -0.483*** -0.774*** -0.941*** -1.103*** -1.240*** -1.244*** -1.468*** -1.500*** -0.914** positiveratio 0.0696 0.162 0.292** 0.292* 0.249 0.292 0.214 0.129 0.348 601928 negative ratio -0.342*** -0.463*** -0.506*** -0,508*** -0.503*** -0.723*** -0.906*** -0.864*** -0.855**

0.231 0.142 0.218 0.483 601993 positive ratio 0.152 0.15 0.238 0.276 0.341 negative ratio -0.312*** -0.487*** -0.602*** -0.616*** -0.567*** -0.411* -0.443 -0.925*** -1.380*** -0.0278 0.0396 -0.0204 600038 positive ratio 0.0325 0.0751 0.135 0.0354 -0.00429 0.0168 negative ratio -0.226*** -0.267*** -0.324*** -0.389*** -0.498*** -0.544*** -0.677*** .-0.805*** -0.708** 3.266*** 3.275*** 4.430** 601718 positive ratio 0.0888 0.288 0.660* 0.852* 1.517** 2.311*** negative ratio -0.453** -O.872*** -1.039*** -1.121*** -0.724* -0.591 -0.708 -0.983 -2.752*

Figure 16: Insignificant Positive Ratio and Significant Negative Ratio for 11 Stocks for PosRatioi. range from 14.6 to minus 1.4. First of all, I think the large coefficient of

+15 for the negative ratio and the large coefficient of 14.6 for the positive ratio (both from 000156) are outliers and should be ignored. As discussed in the correlation analysis, 000156's correlation is not significant at any acceptable significance level, and its post frequency is also very low. I therefore exclude 000156's data from the following discussion.

After removing the outliers, the coefficient values still vary widely across stocks. For example, 000002 (China Vanke), 002008 (Han's Laser), 601669 (China Power Construction), 601699 (Lu'an Environment Energy), 601766 (CRRC Corp) and 601919 (COSCO) have larger coefficients for the negative ratio than other stocks (as the negative ratio is superior to the positive ratio, I focus on the negative ratio). This may also be related to the number of posts per day, because 000002 (China Vanke), 002008 (Han's Laser), 601669 (China Power Construction) and 601766 (CRRC Corp) have many more posts per day than other stocks.

27 R7 RIO R30 ~15 VARIABLES RI R2 R3 R4 R5

0,377 0.512 0.645 -2.055 601766 positive ratio 0.829*** L176*** 1382"* 0.943 0.865 - 5 " -6.34 - 2 negative rato -04715*** -1.388* ' 1.817 * -2. 196*" poitverai 09.-A6 83**265** -6,3 0.141 0.13 096' 0.74"'015- 0006 601872 poitiveratio 0.111*** 0.174*** 0.218*** 0170*0 41"-0 ' 0.1"' ratio -0.224*** -0360*** -0,396** *- negative ------. -0410 -4 ** -0.641*** 077* 2" 034 0.0650*0positiveratio 0162*** 0.168*** 0. 147* 09 negative ratio -0.144"* -0.216*** -0.261' -Q.282* -0291*** -0.288*** -03090" -0.369* -0451 0.451 0 poSitive ratio 0. 142** 0.211" 0.257" 0.364" 0.435' 0.601*0 0.68** 098 601919 negative ratio -0.338*** -0.621"* -0,769"' -0.95*** -1.065' L-501' 0,321" * 0. 0-- 0.1-----88 positive ratio 0.0969" 0 136** .-14:9 000069 -0.691*** -0.82** 0. " ,23 i ratio-0.283"*** -0.460*** -0.515"* -0.522** -0,57 *** positive ratio 0.139** 0.267*** 0282** 0..6 0.256 0.1410 021 015 000651 eaieai 16 -02,760,14 negativeratio -0 36*' -0.227** -0.321' -0.370*** -0.395*** -0.462*** -0.600*** -0.594** -0.939*** . 13 0.604** 0513" 0,549** 0.451 0.325 -. 0.4 " - 0,376 002008 positive. ratio 0.42"3*** -1264*** 1.157 - -0 35 negative ratio -. 635*** -. 971"' -. 462* 0..4413" 0. 5" 0160" 0.214' 0.262' 0.3*6" 1 i1* -- -* positive ratio 022 .36* .4** 0-519**0 0.705"** 601669 -- * * . negative ratio -0.463*** -0.694*** -954 1046* 6 , -129*** -l8**-67*-2.093*** -2 99*** ratio 0.268'* 0402*** 0465*** 0.446" 0354 01. 0332 positive . 121 94 0311 002024 -0.946"' -1,44**" negative ratio -0276 -0398*** -0451*** -0.556*** -0.61'7*** -0.761*** -0.813"' 4"* 0.562' '6.0 601939 positive ratio 0.132*** 0.166"' 160** 0214** 0.3 "** 0, * 0-5 ' 0662*** L 043* -0144*** -0.208*** -0.231" -0.257"'* -0289"' negative ratio "*** 0443* "' 0 . * -0.343*** 0 ]" 0.39 .5*** 0380 ' 0 443 positive ratio 0.328"'* 0.4 1 601699 0-.0 " -. 4 negativo ratio -0.418** -0.621"** -0.801*** -0.930** -1 045"'* *11 -1,424*" 0.544*** 046064* 0.9512862' 2.5714 ratio 0.207' 0.397*** 0462*** 0.563*** 29"* -. 
1 601688 positive -0648*** 602 -820"' -0.. 0.7 * -. negative ratio -0.388*** -0.630*** 0.060 -2**-.3**0775** " -129*082 2"' 1 610** 0.360*** 0.508 "' 0.5 0.235** 0.297 " 0.624 ratio 0.171*** 0.173** -0211 0.284** 082*** 601618 positive -066*** -0.196*** -0.2* negative ratio -0.0463 -0.114** 0.509 0.355*** 0.442*** 0.503"' 0.571 0.640"' 0.599** 0.617' 601899 positive ratio 0.180** 1.167"'.6-587" -1.45"' negative ratio -0 177** -0.251' -0.308* -0.286 -0.429" 0.342*** 0.425*** 4 " 0.424" 0.508" 0764***.. 0.33 000157 positive ratio 0.197"** 0.288*** -0.784"' -. 799"** -0,9.33 ratio -0.366*** -0.557*** -0.705*** -0.787** -0801* -0.460"' "' - .90 7 * * * negative 7 . 8 " 10 "' 2 . 2 1 8 .5 75 * ** 0 .8 17* * " 0 9 97 "' 60 1 90 1 po sitiv e ratio 0 .3 3 6 * * * 0 -0.970"' -1.0 *** -1 *4" - 26"' .5976 negative ratio -0.409"' -0.682"' -0.868*** -0950*** 03 601898 positive ratio 0.0772" 0.159*" 0.169* 0.168" 0174" -0.376" -0.384"' -0.40*" -0.4 0.*224 negative ratio -0.78*** -0.279*** -0.325*** -0.362***

Figure 17: Positive Ratio and Negative Ratio are Both Significant for 17 stocks


Figure 18: Outliers in Regression Coefficients

On the other hand, the larger coefficients also indicate greater investor attention to those stocks. For example, CRRC Corp has been a buzzword in China since 2014, because the Chinese government encourages the export of railway equipment under the "Belt and Road" policy raised by Chinese President Xi Jinping. CRRC Corp is also one of the largest companies in the Chinese stock market and has the capacity to attract millions of individual investors. The mania for CRRC Corp drove its stock price up 400% from Dec 2014 to April 2015. After that, the price fell drastically back to its original 2014 level. Such drastic price changes inevitably attract investors to talk about the stock. Thus, the more people talk about it on social media, the better the posts reflect investor sentiment, and the better the regression coefficients.

Apart from 601766 (CRRC Corp), 000002 (China Vanke) is the No. 1 real estate developer in China, and its CEO is a very famous public figure in China. 002008 (Han's Laser) is a world-leading laser equipment producer and has been regarded as a major player in the 3D printing industry. 601669 (China Power Construction) is widely expected to benefit from the "Belt and Road" policy raised by the Chinese government. All of these stocks have attractive stories that draw discussion on social media. Compared with them, other stocks are less attention-grabbing, so their coefficients are much smaller.

Figure 19: Coefficients of Positive Ratio and Negative Ratio for Stocks

4.2.3 Difference between stocks and indices

The regression results in Figure 20 show that the Shanghai Composite Index has very good regression results: the coefficients for the negative ratio and the positive ratio are almost all significant at the 1% level, and the R-squared values for all holding-period returns are above 20%, much higher than in any regression for individual stocks. An interesting result is that the coefficients of the negative ratio get larger (in absolute value) as the holding period gets longer. This indicates that future returns depend heavily on lagged investor sentiment. (To analyze this, I also conduct time series analysis and build a VAR model for the Shanghai Composite Index.)

Things are very different for the China Stock Index Future. The coefficients of the positive ratio are not significant for any holding-period return, whereas the coefficients of the negative ratio are all significant at the 1% level. Also, all the coefficients are much smaller than those of the Shanghai Composite Index. The failure of the positive ratio to explain the China Stock Index Future and the smaller coefficients may lie in its much lower number of posts per day compared with the Shanghai Composite Index. Another underlying reason is the relatively small size of China's futures market. In addition, individual investors are required to deposit 500,000 RMB before they can start trading index futures, a rule that limits their involvement in the futures market. Therefore, although index futures are widely perceived as good indicators of stock markets worldwide, investor sentiment about index futures on social media may not be a good indicator of the stock market, because index futures are not widely traded among individual investors and usually have a higher entry barrier.

30 VARIABLES ri r2 r3 r4 r5 r7 rIO r15 r30 positiveratio ].626*** 2.967*** 3.171*** 3.115*** 3.415** 3.050*** 2.735*** 1.103 -0.993

negative ratio -0.536*** -1.065*** -1.413*** -1.713*** -2.030*** -2.741*** 3.589*** -4.748*** -7.429*** ShanghaI 3594 0.383*** Composite Constant -0.0631*** -0.11* -0.107*** -00907** -00925** -0.0424 0.0114 0.149*** Observations 487 487 487 487 487 487 487 487 487 R-squared 0.22 0.348 0.335 0.314 0.319 0.313 0.31 0.294 0.267 positive-ratio 0.269 0.415 0.114 -0.0825 -0.326 -0.221 -0.819 -1.498 -1.824 negative ratio -0.393** -0.846*** -1.181*** -1.324*** -1.610*** -2.288*** -- 5.138*** -8.209*** Fude Constant 0.00568 0.0203 0.0520* 0.0701** 0.0970** 0,126*** 0.210*** 0.331*** 0.512-** Future Observations 253 253 253 253 253 253 253 253 253 R-squared 0.018 0.035 0.047 0.049 0.058 0.078 0.117 0.187 0.198

Figure 20: Regression Results for Indices

Comparing the regression results of indices and stocks (Figure 21 and Figure 22), we can see that for the negative ratio, the Shanghai Composite Index has the largest coefficients for almost all returns, which suggests that investor sentiment about the Shanghai Composite Index is the most correlated with the stock market. For the positive ratio, however, the coefficients of the Shanghai Composite Index are not the largest. Stocks such as 601901 (Founder Securities) and 601718 (Jihua Group) also have large positive ratio coefficients in their regressions. As discussed before, many coefficients of the positive ratio are not significant, so their values are not as useful as those of the negative ratio.

4.3 Time-series Analysis

As the Shanghai Composite Index's case shows, lagged values of the independent variables clearly matter. In technical terminology, a regression with such lagged variables is called a vector autoregression (VAR). I therefore construct a VAR model with lagged variables to better capture this relationship.

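$$R_{i,t} = \alpha_{i,t} + \beta_{i,0}\,\mathrm{NegRatio}_{i,t} + \beta_{i,1}\,\mathrm{NegRatio}_{i,t-1} + \dots + \gamma_{i,0}\,\mathrm{PosRatio}_{i,t} + \gamma_{i,1}\,\mathrm{PosRatio}_{i,t-1} + \dots + \epsilon_{i,t}$$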

Figure 21: Coefficients for Positive Ratio: Indices v.s. Stocks

Figure 22: Coefficients for Negative Ratio: Indices v.s. Stocks

Selection-order criteria (number of obs. = 483)

Lag   LL        LR       df   p      AIC       HQIC      SBIC
0     4977.96                        -20.60    -20.59    -20.57
1     5431.98   908.03   9    0.00   -22.44    -22.40    -22.34
2     5482.19   100.41   9    0.00   -22.61    -22.54*   -22.43*
3     5496.17   27.98    9    0.00   -22.63    -22.53    -22.37
4     5510.93   29.52*   9    0.00   -22.66*   -22.53    -22.32

Figure 23: Lag Length Selection

Because the Shanghai Composite Index has the largest significant coefficients and the largest R-squared among all the sample stocks and indices, I use it to build the VAR model and run the Granger causality test.

4.3.1 Lag selection

To construct a good VAR model, the first step is to find the optimal lag length for the return of the Shanghai Composite Index. I use the Akaike Information Criterion (AIC) to determine the lag length, because for monthly and daily VAR models the AIC tends to produce the most accurate structural and semi-structural impulse response estimates for realistic sample sizes (Ivanov and Kilian, 2005).

As Figure 23 shows, the AIC reaches its minimum of -22.6581 at lag 4. I therefore choose a lag length of 4 days for the VAR model.
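The lag choice can be checked with a short sketch. The log-likelihoods below are the LL column of Figure 23; the per-observation AIC formula and the parameter count (three equations, each with a constant and three variables per lag) are assumptions on my part, chosen because they reproduce the AIC column of the figure.

```python
def var_aic(log_likelihood, lag, n_obs, n_vars=3):
    """Per-observation AIC: (-2*LL + 2*k) / T, with k = n_vars * (n_vars*lag + 1)."""
    n_params = n_vars * (n_vars * lag + 1)
    return (-2.0 * log_likelihood + 2.0 * n_params) / n_obs


# LL values from Figure 23 (483 observations).
log_likelihoods = {0: 4977.96, 1: 5431.98, 2: 5482.19, 3: 5496.17, 4: 5510.93}
aic = {lag: var_aic(ll, lag, 483) for lag, ll in log_likelihoods.items()}
best_lag = min(aic, key=aic.get)  # the lag with the smallest AIC
```

Under this formula the AIC at lag 4 is about -22.6581, matching the value quoted above. The asterisks in Figure 23 show that HQIC and SBIC, which penalize extra parameters more heavily, are minimized at lag 2 instead.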

4.3.2 VAR Model

Figure 24 shows the details of this VAR model.

4.3.3 Stability test

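The necessary and sufficient condition for stability is that all roots of the characteristic polynomial lie outside the unit circle or, equivalently, that all eigenvalues of the companion matrix (the reciprocals of those roots) lie inside it. Figure 25 shows that all eigenvalues of the companion matrix lie within the unit circle, so the VAR model is stable.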
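The stability condition can be illustrated on a much smaller model than the one estimated here. The sketch below uses a univariate AR(2) process, whose companion matrix is 2x2 and whose eigenvalues have a closed form; the coefficients are invented, and the thesis model (three variables, four lags) would need a numerical eigenvalue routine instead.

```python
import cmath


def ar2_companion_eigenvalues(a1, a2):
    """Eigenvalues of [[a1, a2], [1, 0]], the companion matrix of
    y_t = a1*y_{t-1} + a2*y_{t-2} + e_t (roots of lam^2 - a1*lam - a2 = 0)."""
    disc = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + disc) / 2.0, (a1 - disc) / 2.0


def is_stable(eigenvalues):
    """Stable when every eigenvalue lies strictly inside the unit circle."""
    return all(abs(lam) < 1.0 for lam in eigenvalues)


stable_case = is_stable(ar2_companion_eigenvalues(0.5, 0.3))     # True
unstable_case = is_stable(ar2_companion_eigenvalues(0.9, 0.3))   # False
complex_case = is_stable(ar2_companion_eigenvalues(0.5, -0.5))   # True, complex pair
```

The eigenvalues are the reciprocals of the roots of 1 - a1*z - a2*z^2, which is why "eigenvalues inside the unit circle" and "characteristic roots outside the unit circle" describe the same condition.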

Vector Autoregression Results

Equation         Parms   RMSE   R-sq   chi2      p>chi2   No. of obs.
ri               13      0.02   0.07   35.63     0.00     483
positive_ratio   13      0.00   0.30   205.51    0.00     483
negative_ratio   13      0.00   0.77   1644.86   0.00     483

Equation: ri

Variable         Lag   Coef.   Std. Err.   z       P>|z|
ri               L1    0.13    0.06        2.27    0.02
ri               L2    -0.11   0.06        -1.82   0.07
ri               L3    -0.01   0.06        -0.22   0.83
ri               L4    0.18    0.05        3.41    0.00
positive_ratio   L1    0.17    0.32        0.54    0.59
positive_ratio   L2    0.00    0.34        0.00    1.00
positive_ratio   L3    -0.33   0.34        -0.98   0.33
positive_ratio   L4    -0.30   0.31        -1.00   0.32
negative_ratio   L1    -0.09   0.24        -0.38   0.70
negative_ratio   L2    0.17    0.27        0.66    0.51
negative_ratio   L3    -0.27   0.26        -1.01   0.32
negative_ratio   L4    0.08    0.23        0.34    0.74
cons                   0.03    0.02        1.42    0.16

Figure 24: VAR Model

Figure 25: Unit Root Check (plot of the roots of the companion matrix)

Granger causality Wald tests

Equation         Excluded         chi2    df   Prob>chi2
ri               positive_ratio   3.56    4    0.47
ri               negative_ratio   1.98    4    0.74
ri               all              5.68    8    0.68
positive_ratio   ri               24.05   4    0.00
positive_ratio   negative_ratio   7.38    4    0.12
positive_ratio   all              39.66   8    0.00
negative_ratio   ri               36.53   4    0.00
negative_ratio   positive_ratio   15.78   4    0.00
negative_ratio   all              58.07   8    0.00

Figure 26: Granger Causality Test

4.3.4 Granger causality analysis

Figure 26 shows the results of the Granger causality test. In the return equation, the p-values for the positive ratio (0.47) and the negative ratio (0.74) are well above conventional significance levels, so the null hypothesis that the sentiment ratios do not Granger-cause stock returns cannot be rejected. In other words, we cannot assert that the positive and negative sentiment ratios for the Shanghai Composite Index are predictive of market returns. Although the Shanghai Composite Index has very significant linear regressions on social media sentiment ratios, it remains challenging to use past sentiment ratios to predict future returns.
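The p-values in Figure 26 can be recovered from the Wald statistics with a short sketch. It relies on the closed form of the chi-square survival function for even degrees of freedom, which covers the df values of 4 and 8 used here; the function name and the 5% threshold are illustrative choices.

```python
import math


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom:
    exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!"""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


# (equation, excluded variable, chi2, df) for the return equation in Figure 26.
return_tests = [
    ("ri", "positive_ratio", 3.56, 4),
    ("ri", "negative_ratio", 1.98, 4),
    ("ri", "all", 5.68, 8),
]
p_values = {excluded: chi2_sf_even_df(stat, df) for _, excluded, stat, df in return_tests}
rejected = {excluded: p < 0.05 for excluded, p in p_values.items()}
```

All three p-values come out above 0.05 (roughly 0.47, 0.74 and 0.68), so none of the exclusion hypotheses for the return equation is rejected.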

5 Conclusion

Since the 1980s, behavioral finance has posited that investors are driven by sentiment, and this assumption has been confirmed by many studies of traditional media, such as newspapers, journals and annual reports. However, sentiment extracted from traditional media may represent the opinions of only a small fraction of investors. In the era of the internet and big data, everyone is connected by social media, which has become an important confluence where investors' sentiments are gathered and preserved. Thus many researchers are investigating how to extract sentiment from crowds of people and better predict the financial market.

In this spirit, I study the relationship between social media and the Chinese stock market, given that little of the literature covers Chinese social media and the Chinese stock market. I choose guba.com.cn, one of the most popular stock-related social media sites in China, for this study, and download 3,734,426 posts from 2008 to 2015 under the sections of 30 sample securities, including the Shanghai Composite Index, the China Stock Index Future and 28 individual stocks. Textual analysis of those posts is conducted and sentiment ratios are extracted. I have analyzed the correlation matrices and regressions between sentiment ratios and returns of 9 holding periods for all 30 sample securities, and have arrived at the following findings.

1. Negative sentiment ratio is superior to positive sentiment ratio. I have found that many researchers in the Chinese literature use both sentiment measures, while most of the English literature uses the negative ratio. Language differences could be one reason for this difference. Therefore, I deliberately compare the two kinds of sentiment ratios and find that the negative ratio is superior. First, the correlation coefficients of positive ratio to the returns are all smaller than those of negative ratio for all 30 samples. The same holds for the regression coefficients.
In addition, more than half of the coefficients of positive ratio are not statistically significant. Therefore, I conclude that negative sentiment ratio is better than positive sentiment ratio, especially when dictionary-based textual analysis is used in the research.

2. Correlation of sentiment ratio to return is persistent in future holding periods. I compute the correlation coefficients and regression coefficients of the sentiment ratios for 9 returns of different holding periods. There is no sign that the correlation decreases as the holding period becomes longer. On the contrary, in the case of the Shanghai Composite Index, the coefficients of negative ratio become larger in absolute value when the holding period increases from 1 day to 30 days, whereas for the other samples the absolute values of the coefficients change irregularly and no clear pattern can be found. One thing is clear, however: the correlation does not decrease as the holding period gets longer. This result indicates that lagged terms may help predict stock returns; however, since no sign of decreasing correlation is identified, finding the optimal lag length could be difficult.

3. Well-established market index has better correlation with social media than individual stocks, and well-known 'star' stocks have better correlation with social media than other stocks. In this study, the Shanghai Composite Index has the best correlation results: the largest coefficients of all samples, all significant at the 1% significance level, and R-squared values above 20% in all 9 regressions. One reason is that the discussion about the Shanghai Composite Index is the most active: 1554 posts are submitted every day. Although individual investors may focus on only a few individual stocks, they all care about the market index. That is why the Shanghai Composite Index has the most frequent postings and the best correlation with social media. Among individual stocks, 601766 (CRRC Corp), 000002 (China Vanke), 002008 (Han's Laser) and 601669 (China Power Construction) have relatively better correlation with social media sentiment ratios, because they are all popular stocks among individual investors and have more posts per day than other stocks.

4. Better data and improved analysis are needed to predict stock market with social media. I test the VAR model on the Shanghai Composite Index, and find that the model is stable but shows no Granger causality.
I think the reason for this result lies in the quality of the data and the textual analysis method. The quality of the data may not be good enough, especially since many posts on Guba are rumors and unfounded speculation (many investors regard Guba as full of rumors), so the information may be distorted. A remedy could also be to improve the textual analysis method, for example by adapting the dictionary to financial markets and improving the parsing tools.

All in all, this study confirms that correlation exists between investor sentiment on social media and the returns of the Chinese stock market, and that negative sentiment ratio is a better indicator for this correlation. However, in order to predict the stock market, the quantity and quality of content on social media matter greatly: the more content, the larger the correlation. As social media attracts more and more users to generate content, the explanatory and predictive power of social media is likely to grow in the future.
