CENTRAL POLICY UNIT, THE GOVERNMENT OF THE HONG KONG SPECIAL ADMINISTRATIVE REGION

A STUDY ON UNDERSTANDING AND ANALYZING ONLINE PUBLIC OPINION IN “HONG KONG CYBERSPACE”

THE UNIVERSITY OF HONG KONG

NOVEMBER 2011


Research Team

Principal Investigator: King-wa Fu, Journalism and Media Studies Centre, The University of Hong Kong

Co-Investigator: Michael Chau, Innovation and Information Management, School of Business, The University of Hong Kong

Developer: Cédric Sam, Journalism and Media Studies Centre, The University of Hong Kong


Table of Contents

Executive Summary
Introduction
    Online Opinion in Hong Kong
    What are lacking
    Policy Implications
    Objectives of this Study
    Scope of the Study
Approach and Methodology
    Data collection
        Online discussion forum
        Blogs
        Twitter and microblogging
        Facebook Groups/Pages/Events
    Data analysis
        Time-trend statistics
        Keyword analysis
        Social network analysis
        Sentiment analysis
Technical Architecture
    Definitions
    Sina Weibo
        Background
        Data collection
        Reports
        Search engine
        Scripts
    Twitter
        Background
        Data collection
        Reports
        Scripts
    Facebook
        Background
        The shotgun
        The watch list
        The watching
        Limitations
    Blogs
        Background
        Creating list of blogs to follow
        Collecting posts
        Scripts
        Limitations
    HK Forums
        Background
        Forums under study
        Data structure
        Programs
Results
    Time-trend statistics
        Total amount of postings/tweets/retweets
        Frequency of each source
        Unique contributors by source
        Discussion Forum Topics
        Retweeting and Reposting
        Active users and their last update
        Peak hour of publication
        Article length
        Other characteristics (e.g. whether it was attached with image, link, or unit icon)
    Keyword analysis
        Term frequency analysis
        Term frequency inverse document frequency (TF-IDF) analysis
    Social Network Analysis
    Sentiment analysis
        Method
        Results
    Case Study 1: Policy on subsidizing home ownership (政府資助市民置業)
    Case Study 2: Policy on mitigating income inequality (改善貧富差距)
Conclusions and Recommendations
References
Appendix


Executive Summary

As a growing number of Hong Kong citizens, particularly the younger generation, use the Internet as a channel to voice opinions on a variety of subjects, ranging from party politics to social policy debate, policy makers urgently need to investigate the characteristics of online user-generated comments and to establish a mechanism for collecting online information systematically.

This study aims to:

1. describe and understand the nature of online public opinion (or user-generated contents in general) in the “Hong Kong cyberspace”;

2. develop a robust and reliable data collection method for sampling the space of online public opinion;

3. collect and analyze online policy discussion of two selected topics.

As shown in this report, project websites, and the daily email reports, online public opinion data in Hong Kong have been systematically collected, stored, and subsequently analyzed quantitatively and qualitatively. Analytical tools used in this study consist of time-trend statistics, keyword analysis, social network analysis, and sentiment analysis.

Results demonstrate that the system sampled the space of Hong Kong online public opinion and collected about 800-1,200 government-related posts each day, accounting for 0.3-0.6% of all data collected across every type of topic. The volume of posts fluctuated drastically over the study period, depending on whether heated social discussion or public incidents occurred on a given day. Roughly half of the government-related posts originated from online discussion forums and the other half from blogs or microblogs. These contents were created by approximately 500-600 unique Internet accounts every day, of which around 60% were online discussion forum users and 40% were bloggers or microbloggers. Approximately 300-500 online forum users and 500-600 bloggers/microbloggers posted at least once about the Hong Kong government in a 7-day period, and 3,300-3,500 online forum users and 2,500-3,000 bloggers/microbloggers made at least one such post in a 30-day period. Their posts were usually published between 9 a.m. and midnight.

Sentiment scores were computed to indicate the extent to which the Hong Kong government is negatively evaluated by online users, reflecting the amount of criticism found in online public opinion. Moreover, case studies using social network analysis identified key microbloggers who played a significant role in the process of information diffusion, namely online “opinion leaders”.

Case studies of “Policy on subsidizing home ownership” and “policy on mitigating income inequality” were used as examples to illustrate the ways to analyze online public opinion by deploying all the above analytical tools.

Significant findings of this study are highlighted as follows:


1) Identifying the dynamics of online opinion - Our data show that the number of posts increases and the sentiment score decreases when public outcry over controversial issues occurs. This analysis helps alert relevant parties to respond promptly to the public's reaction and negative sentiment;

2) Recognizing emerging social agenda - Major keywords are extracted from online contents, for example emerging topics that appear frequently now but were rarely found in the past. This finding helps us identify emerging social agenda that may shortly become a major social concern and may require policy makers' close attention;

3) Finding “opinion leaders” - Digital presences of “opinion leaders” are named and identified. This helps various stakeholders in the government departments locate “public figures” in their corresponding policy areas;

4) Tracking “public opinion” - The sentiment score developed in this study may serve as a fast-track “proxy” indicator of the overall public opinion toward Hong Kong government;

5) Associating with policy discussion - The results of the two case studies demonstrate the ways in which online opinion may be incorporated into policy discussion.

As a result of these major findings, a few recommendations are made as follows:

1) To establish an online public opinion tracking system that helps policy makers and the public keep track of online discussion on various social and political topics, and understand the changes, especially short-term changes, in citizens' online sentiment toward public governance and social policy;

2) To formulate a cross-departmental online public engagement policy that addresses citizens' growing demand for public engagement, both online and offline, and for effective public governance;

3) To review and strengthen the ways to make full use of the opportunities offered by digital and social media, for example via channels such as microblogs, discussion forums, and social networking sites, for the provision of prompt and proactive response to the general public;

4) To promote a culture of online policy deliberation, a process through which social and policy issues are debated, discussed, and finally agreed upon between governments, social stakeholders, and interest groups. It should also be supported by the promotion of citizens' media literacy and democratic literacy as well as an open social policy and a transparent information policy;

5) To research online behavior and new media use for the development of citizen-centric public engagement policy strategies.


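Abstract (摘要)

More and more Hong Kong citizens, especially the younger generation, express their opinions through the Internet. The opinions cover a wide range of subjects, from legislative politics to social policy. Policy makers therefore urgently need to understand the characteristics of online opinion expression and to set up a mechanism for collecting the relevant information systematically.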

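The objectives of this study are:

1) to describe and understand online public opinion in Hong Kong (drawn from user-generated contents);

2) to develop a flexible and reliable method for collecting online public opinion data;

3) to collect and analyze online discussion of two designated policies.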

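As shown by this report, the project websites and the daily email updates, data on Hong Kong online public opinion can be collected and stored systematically and analyzed with qualitative and quantitative research methods. The analytical tools include time-trend statistics, keyword analysis, social network analysis and sentiment analysis.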

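The results also show that the system collected on average about 800 to 1,000 samples of online content related to the Hong Kong government each day, equal to about 0.3 to 0.6% of the total volume of data collected daily (covering any subject). The number of samples nevertheless varied greatly, depending on whether a social discussion attracting public attention arose that day. Each day about half of the content samples came from online discussion forums and the other half from blogs or microblogs, originating from about 500 to 600 unique online accounts, of which nearly 60% belonged to online discussion forums and 40% to blogs or microblogs. Roughly 300 to 500 online forum users and 400 to 600 bloggers or microbloggers published at least one item about the government within seven days, and about 3,300 to 3,500 online forum users and 2,500 to 3,000 bloggers or microbloggers published at least one related item within thirty days. Posts were mainly published between 9 a.m. and midnight.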

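The computed sentiment score reflects the level at which the government is negatively evaluated and indicates the amount of criticism of the government in online public opinion. In addition, case studies using social network analysis can identify microbloggers who play a key role in the spread of information, who may also be called online “opinion leaders”.

We also took “Policy on subsidizing home ownership” (政府資助市民置業) and “mitigating income inequality” (改善貧富差距) as case studies and analyzed Hong Kong online public opinion with the methods above.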

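The main findings of this report are as follows:

1. Identifying changes in online opinion - The data indicate that when a major social controversy arises, the number of online messages increases and the score reflecting negative sentiment falls. This analysis helps relevant organizations or persons respond promptly to social issues;

2. Grasping emerging social issues - The main keywords in online contents reveal the appearance of emerging social issues, particularly when ranked by a specific method that favors keywords that were rare in the past but appear frequently within a short period. This method can be used to grasp emerging social issues and helps policy makers understand important social information and matters requiring special attention;

3. Finding “opinion leaders” - The approach can be used to locate online “opinion leaders” in the digital world, which helps relevant organizations or persons understand the views of key figures in the policy areas concerned;

4. Monitoring “public opinion” - The score developed in this project to reflect negative sentiment may serve as an indicator for quickly grasping public opinion;

5. Connection with policy discussion - The two case studies show how opinions published online can be analyzed with the tools above to assist the relevant policy discussion.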

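Based on the above results, we make the following recommendations:

1. Policy-making authorities need to establish a mechanism for collecting online public opinion, in order to understand changes in the opinions citizens express online about social issues and to grasp public sentiment toward governance and social policy, especially its short-term rises and falls;

2. Formulate a cross-departmental online public engagement policy that responds to citizens' demands for both traditional and online participation and to the public's rising expectations of the standard of governance;

3. Review and strengthen the use of digital and social media, particularly microblogs, discussion forums and social media, so as to respond actively to different social issues;

4. Encourage a culture of online policy deliberation, with concrete steps to promote policy discussion among the government, various groups and individuals and to help reach consensus, and raise citizens' media literacy and democratic literacy; the government should also establish open and transparent policies and an open information policy;

5. Research citizens' online and new media usage behavior to help build a citizen-centered public engagement policy.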


Introduction

Online media have provided novel platforms for citizens to participate in a variety of civic and political activities. One recent and prominent example in Hong Kong is the widespread use of Web 2.0 applications, such as blogs, Facebook and Twitter, during the anti-express-rail movement at the beginning of 2010 to mobilize participants and disseminate activists’ information via the Web (Lai, 2009).

The new media are widely believed to enable individuals to express their opinions publicly in a variety of ways, such as writing on personal blogs or online forums, replying to other people’s blog or forum posts, tweeting their updates and retweeting other users’ messages, or engaging in social network sites like Facebook or MySpace. These modes of civic participation appear to broaden and redefine, if not revolutionize, the conventional meaning of public opinion, which generally refers to voting, party activities, people’s views found in newspaper op-ed pieces or letters to the editor, radio phone-in programs, submissions to government public consultations, or responses to opinion polls.

Departing from these relatively passive modes of expression or a narrow conception of public opinion, online participation is recognized as opening up new avenues for enriching political discussion and societal debates and increasing the diversity of opinion (Organisation for Economic Co-operation and Development, 2007b).

Online Opinion in Hong Kong

Currently, there is very limited information about the nature of online public opinion in Hong Kong. The annual Thematic Household Survey (THS) on IT Usage and Penetration, conducted by the Hong Kong Census and Statistics Department since 2006 (Census and Statistics Department, 2009), and the annual Hong Kong Internet Project, undertaken by the City University of Hong Kong since 2000 (City University of Hong Kong, 2009), have not yet covered user-generated contents, let alone online public opinion.

A recent market report finds that the proportions of Hong Kong people who, in the past month, created/updated a blog, updated/maintained a profile on a social networking site, and contributed to an online forum/discussion group ranged from 9-23%, 18-33%, and 13-26% respectively, depending on age group (Synovate, 2010).

What are lacking

What we are lacking is not only survey-based usage data, but also an overall analysis of the characteristics of online opinion (or user-generated contents in general) in Hong Kong and a systematic approach to data analysis. Following this line of thought, some common questions arise: How often do people contribute pieces to online forums or comments to other people’s blogs? How frequently do online citizens update their personal blogs/microblogs or social network profiles? What topics do netizens usually write about? What types of discussion are more likely to invite replies, comments, and heated debates? What percentage of user-generated contributions can be classified as carrying originality and creativity?

For research on public opinion specifically, there are further questions: How can we collect and analyze online public opinion in a systematic manner? How can we incorporate these views into public governance, such as policy formulation and the democratic process? What is the connection, if any, between online opinion and offline opinion? Can online opinion function as an agenda-setter or opinion leader, roles now played by traditional media? How can government fully utilize the online environment to facilitate citizens’ civic engagement?

Policy Implications

As a recent OECD report points out, the increasingly important role of user-generated contents in mass communication has various implications for policymakers, business, and Web users (Organisation for Economic Co-operation and Development, 2007b). Areas of policy concern include, but are not limited to, the following headings: i) enhancing R&D, innovation and technology; ii) developing a competitive, non-discriminatory policy framework; iii) enhancing the infrastructure; iv) shaping business and regulatory environments; v) governments as producers and users of content; and vi) better measurement.

While many people believe that digital media could encourage citizens’ democratic participation as well as expression of opinion, some empirical findings in the US suggest that this notion may be too optimistic and that the evidence is inconclusive (Hindman, 2009; Smith, Schlozman, Verba, & Brady, 2009). As noted by Dahlgren (2009, p. 7), “they are (the newer Information Technology and Communications) contributing to a reconfiguration of political life – though it is still unclear if this will be sufficient to reconstruct democracy.”

In a nutshell, our society and government urgently need to collect reliable, good-quality data for making informed policy decisions, making sense of the development of the evolving online public opinion, and understanding netizens’ actual behaviour in creating online contents. Governments worldwide have been moving toward establishing a unified and systematic approach for collecting and analyzing data on user-generated contents (Organisation for Economic Co-operation and Development, 2007a).

Objectives of this Study

There are three objectives of this study.


1. To describe and understand the nature of online public opinion (or user-generated contents in general) in the “Hong Kong cyberspace”1;

2. To develop a robust and reliable data collection method for sampling the space of online public opinion;

3. To collect and analyze online policy discussion of two selected topics.

Scope of the Study

The study aims at covering the following areas:

a. To use Web mining and/or opinion mining techniques, deploying Web crawlers and/or third-party APIs and/or open-source software, to explore the overall “Hong Kong Cyberspace” systematically and identify/extract user-generated contents in Hong Kong;

b. To explore 4 approaches (basic daily/weekly statistics, keyword analysis, social network analysis, and sentiment analysis) for data analysis;

c. To analyze the postings on two selected topics: subsidizing home ownership (政府資助市民置業) and wealth gap (改善貧富差距);

d. To develop and deliver a Web-based “Hong Kong Online Opinion” pilot system for user acceptance testing.

1 There is indeed no such thing as a “Hong Kong cyberspace”, since the online sphere is inherently global in scale. The term loosely refers to the Internet community created and maintained by online individuals, mostly but not necessarily Hong Kong citizens, who regularly contribute contents associated with Hong Kong.

Also, the terms “Internet”, “net”, “web”, “cyberspace”, and “online sphere” are used largely interchangeably throughout the text. I must, of course, admit that this is, technologically speaking, not accurate.

Approach and Methodology

Online public opinion, in this study, is operationalized as the user-inputted, non-firewall-protected, and publicly accessible web-based textual contents obtained from major and popular online outlets that primarily carry contributions from Hong Kong Internet users. For some networks such as Facebook, “public” data is whatever is accessible with a normal user account.

Generally speaking, this study consists of two major parts: 1) data collection; and 2) data analysis. This section briefly describes the approach and methodology of each part; technical details are provided in the next section, “Technical Architecture”.

Data collection

This study deploys web mining techniques, aiming to discover patterns in web contents. Computer programs are developed to explore the overall “Hong Kong cyberspace” and to identify/extract online public opinion in Hong Kong, as defined above, systematically and extensively. The Web user population is identified on the basis of a set of pre-defined rules (e.g. self-reported country, region, province, city or locale) and updated via Web crawling at regular time intervals. Contents and user profiles are retrieved, stored, and indexed in a data archive for further data manipulation, content analysis, and/or post-processing. Data analysis results are presented on a web-based interface.

Different websites and types of Internet content are modeled in very different ways. We need a common data structure that lets us compare these networks and types of content with one another.

Microblogging platforms such as Twitter and Sina Weibo, or social networking websites such as Facebook, offer APIs (Application Programming Interfaces), which act as standardized “doors” for accessing the website's data. They are normally offered to developers who wish to build applications on top of these networks. We use them instead to gather data. Networks may impose rate limits and overall access limits to prevent abuse. We discuss those limits in the methodology below.

Moreover, modern web publishing systems, such as blogs, have the advantage of frequently, if not always, offering RSS and Atom feeds, which are standardized syndication formats for frequently updated works. We can download a copy of these files at a given time interval and pull the contents of these blogs or websites. Some feeds only give a preview of the actual contents.
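The feed-polling idea can be illustrated with a minimal sketch. It assumes an RSS 2.0 feed whose items carry `guid`, `title`, `link` and `description` elements; the sample feed, the function names `parse_rss_items` and `new_items`, and the use of the GUID as a de-duplication key are illustrative assumptions, not the project's actual scripts.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real collector would download this text
# from the blog's feed URL at a fixed interval.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/posts/1</link>
      <guid>http://example.org/posts/1</guid>
      <description>Preview of the first post</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/posts/2</link>
      <guid>http://example.org/posts/2</guid>
      <description>Preview of the second post</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(feed_text):
    """Return one dict per <item> element found in an RSS 2.0 document."""
    root = ET.fromstring(feed_text)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        items.append({
            # Fall back to the link when the feed provides no guid.
            "guid": item.findtext("guid", default=link),
            "title": item.findtext("title", default=""),
            "link": link,
            # May be only a preview of the full post.
            "summary": item.findtext("description", default=""),
        })
    return items


def new_items(items, seen_guids):
    """Keep only items not seen in earlier polls and remember them."""
    fresh = [it for it in items if it["guid"] not in seen_guids]
    seen_guids.update(it["guid"] for it in fresh)
    return fresh
```

Calling `new_items` on every poll with the same `seen_guids` set means a post is stored once even though the feed keeps listing it across successive downloads.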

A Web crawler is another approach, used to “download” web contents from a set of websites, say online discussion forums, regularly and in an orderly manner.

Four types of online platforms are analyzed in this study: discussion forums, blogs, Twitter/microblogs, and Facebook entities.

Online discussion forum

Online discussion forums are one of the major channels through which Hong Kong people express opinions via the Internet. A market report finds that the proportions of Hong Kong people who contributed to an online forum/discussion group in the past month are 25% (aged 12-19), 26% (aged 20-29), 19% (aged 30-44), and 13% (aged 45-64) respectively (Synovate, 2010).

There are a few dozen online discussion forums specifically targeted at Hong Kong people. At least the top three major discussion forums in Hong Kong2, namely discuss.com.hk, uwants.com, and hkreporter.com, were selected and analyzed in this study.

Blogs

Blogs are another key channel through which Hong Kong people contribute opinions and ideas to the public sphere. According to Synovate’s 2010 study, the proportions of Hong Kong citizens who created or updated a blog in the past month are 19% (aged 12-19), 23% (aged 20-29), 12% (aged 30-44), and 9% (aged 45-64) respectively.

There is no publicly available directory of Hong Kong bloggers, nor a list of Hong Kong blogging service providers. This study explores a pool of the Hong Kong blogger “population” by referring to a list of self-reported Hong Kong microbloggers who provide a personal blog address on their Sina Weibo pages.

Twitter and microblogging

Twitter and other forms of microblogs are increasingly popular and becoming important outlets for rapid dissemination of information and even opinion online. These applications enable users to send, read, reply, or retweet (or repost) other users' messages, namely tweets, and to follow other users or to be followed by others. Tweets are text messages of up to 140 characters (English or Chinese) displayed on the user's profile page. Some micro-blogging service providers allow users to upload images and videos.

There is no publicly available directory of Hong Kong Twitter users/microbloggers, nor a list of Hong Kong Twitter/microblogging service providers. This study develops a pool of the Hong Kong Twitter/microblogger “population” on the basis of an operational definition of “Hong Kong Twitter user/microblogger”, i.e. self-reported country, region, province, city or locale.

2 Top Sites in Hong Kong: http://www.alexa.com/topsites/countries/HK

Most microblogging service providers, like Twitter or Sina Weibo, provide an application programming interface (API) that enables developers or researchers to write computer programs for collecting, searching, and retrieving data and contents from the service.

Facebook Groups/Pages/Events

Facebook is a social networking service and is very popular in Hong Kong. Synovate’s 2010 report finds that the percentages of Hong Kong people who updated/maintained a profile on a social networking site in the past month are 32% (aged 12-19), 33% (aged 20-29), 26% (aged 30-44), and 18% (aged 45-64) respectively. Facebook users may create a personal profile, add other users as friends, and exchange messages, but these profiles and contents are not necessarily made public. Nevertheless, Facebook users can join interest groups (called Facebook groups) in which their basic profile, information, and comments (if any) are accessible to the public. This study seeks to study the publicly accessible opinion expressed by Facebook group users.

There is no Hong Kong Facebook group/page/event directory publicly available. This study develops a pool of Hong Kong related Facebook groups/pages/events by web search using a list of government related keywords (Appendix A).

Facebook provides application programming interface (API) that enables us to develop computer programs for collecting, searching, retrieving data and contents from Facebook groups.

Data analysis

This study seeks to explore four different approaches of data analysis. Each approach is briefly described as follows.

Time-trend statistics

Overall statistics are recorded on a daily, weekly, and monthly basis. They include, but are not limited to, the following indicators:

• Total amount of postings/tweets/retweets

• Frequency of each type

• Number of new blogs/threads/users added

• Number of follow-ups posted/followers

• Percentage of active users (within a period) and average number of postings

• Peak hour of publication

• Article length

• Other characteristics (like whether it was attached with music, image, movie, etc.)

Keyword analysis

This approach is used to identify key topics of the day/week/month from the text. Examples of indicators include the highest-frequency keywords and the special keywords. Open-source Chinese segmentation software is deployed to extract keyword information from the text.
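As an illustration of these two kinds of indicator, the sketch below counts term frequencies and scores today's terms against past documents with a common smoothed TF-IDF formulation. It assumes the text has already been split into tokens by a segmentation tool; the function names and the particular smoothing are assumptions made for this example and are not necessarily what the study used.

```python
import math
from collections import Counter


def term_frequency(docs):
    """Count how often each token occurs across a list of tokenized docs."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


def tf_idf(today_tokens, past_docs):
    """Score today's terms: frequent today, rare in the past scores high.

    Uses idf = ln((1 + N) / (1 + df)) + 1, one textbook smoothing among
    several; N is the number of past docs, df the number containing the term.
    """
    tf = Counter(today_tokens)
    n_docs = len(past_docs)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in past_docs if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = count * idf
    return scores
```

A term such as 政府 that appears in most past documents receives a low weight, while a term that is new today stands out, which matches the notion of "special" or emerging keywords.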

Social network analysis

Social network analysis (SNA) is an approach to analyzing the relationships (web links and inter-links) between content contributors from a network perspective. Quantitative indicators, such as centrality (measured by degree, closeness, or betweenness) and clustering, are used to describe the status and characteristics of the network, for example “opinion leaders”, “groups”, or the popularity of a node (contributor) within the network structure.

The network data is usually presented and visualized in sociogram, i.e. a graphical presentation of social links between contributors. Data visualization aims to present data visually in a sensible and interpretable manner. This approach is usually used to visualize social networking relationships. Given a subset of the “microblog sphere”, we formalize it as a graph comprised of vertices (nodes) and edges (relationships). The vertices typically represent entities of the system, which are the individual users. There are individual groups, lists or hash tags too. The edges represent the relationship between such entities.

In the simplest expression of a graph, such as the Sina Weibo graph, the vertices are the users and the edges are simple following-follower relationships. We can also define the relationship as a mention, repost, or retweet, and choose a scale by which to assign a value. For example, if two users are friends (mutual following), their relationship is “bonified” if they are found to mention each other in their own user timelines.
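A minimal sketch of such a graph follows. It reads “bonified” as adding a bonus weight to the edge, which is an interpretation rather than something the text defines; the function names, the base weight of 1.0 and the bonus value are likewise assumptions.

```python
from collections import defaultdict


def build_weighted_edges(follows, mentions, bonus=1.0):
    """Return {(follower, followed): weight} for a follow graph.

    follows  -- iterable of (follower, followed) pairs
    mentions -- iterable of (author, mentioned_user) pairs
    An edge between mutual followers who also mention each other
    gets an extra `bonus` on top of the base weight of 1.0.
    """
    follow_set = set(follows)
    mention_set = set(mentions)
    weights = {}
    for a, b in follow_set:
        weight = 1.0
        mutual_follow = (b, a) in follow_set
        mutual_mention = (a, b) in mention_set and (b, a) in mention_set
        if mutual_follow and mutual_mention:
            weight += bonus
        weights[(a, b)] = weight
    return weights


def weighted_in_degree(weights):
    """Sum incoming edge weights per user, a simple degree centrality."""
    degree = defaultdict(float)
    for (_, followed), weight in weights.items():
        degree[followed] += weight
    return dict(degree)
```

Users with a high weighted in-degree are followed, and talked with, by many others, which makes this one simple candidate measure for spotting potential “opinion leaders”.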

Sentiment analysis

Computational linguistic approaches, for example part-of-speech (POS) tagging, are tried out to analyze opinion text. One application is to identify the negative sentiment of an opinion (toward its target) computationally. Another potential application is the prediction of conventional public opinion measures, say phone surveys.


Open-source Chinese parser and POS tagger software are deployed to conduct analysis on the collected text.


Technical Architecture

Since October 2010, we have gathered data from various digital media sites, namely Twitter, Facebook, Sina Weibo, blogs and discussion forums. This section presents the system architecture and methodology and outlines some of their limitations.

Definitions

API: Application Programming Interface. In general, it is a specification that programmers use to access an application. In our case, an API takes the form of a URL that allows programmers to download particular data in a standard format. APIs are like doors to various online programs; in our case, doors to the vast stores of data created by users of social networks.

Timeline: A timeline is a collection of posts (or statuses, for microblogs), arranged in the order in which they were published.
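To make the two definitions concrete, the sketch below builds a request URL and parses a JSON reply into a timeline. The endpoint `api.example.org`, the parameter names and the response fields are all hypothetical placeholders; each real network documents its own URLs and formats.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
API_BASE = "https://api.example.org/statuses/user_timeline.json"

# Canned reply standing in for what such an endpoint might return:
# a timeline, newest post first.
SAMPLE_RESPONSE = (
    '[{"id": 2, "text": "newer post"}, {"id": 1, "text": "older post"}]'
)


def build_request_url(user_id, count=20):
    """Compose the URL that asks for one user's most recent posts."""
    return API_BASE + "?" + urlencode({"user_id": user_id, "count": count})


def parse_timeline(body):
    """Turn the JSON body into a list of (post id, text) tuples."""
    return [(post["id"], post["text"]) for post in json.loads(body)]
```

Because the reply arrives in a standard format such as JSON, the same few lines of parsing code can serve every user on the list.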

Sina Weibo

Background

We started work on Sina Weibo in November 2010. Weibo emerged as a major and indispensable social media platform in Hong Kong and China during 2010. It was included in the project a month after Twitter, when we realized that its API was at least as easy to use as Twitter’s.

Data collection

In outline, our strategy was to build a list of users and use it to collect the users’ timelines at various time intervals. This is the same strategy we used from early on in the project.

User list

• The Hong Kong user list was created by continuously searching, through the API, for posts made by users who declared themselves to be from Hong Kong. In Sina Weibo the user’s self-declared location is selected from a drop-down menu, unlike on Twitter, where the location is left to the user’s choice. The user entity for these posts is retrieved and stored in the database.

• In the latter part of the project, the list was augmented by keeping the user entities encountered when refreshing user timelines, including the original posters of retweeted statuses that were saved. New users were also added when retrieving;

• As of November 2011, the Hong Kong list comprises 175K users. We also keep a “priority list”, a subset of the larger list that contains active users who update frequently (more than once a day). It contains about 21K users as of now.

Collecting posts

The vast majority of the posts we collected were obtained by getting user timelines from the API. Some supplementary posts were also collected when retrieving reposts of posts of interest.

For each user list, we attempt to retrieve every user’s timeline on Saturdays, Sundays and Wednesdays, usually within the day. On other days of the week, we only go through the priority lists, as defined in the previous section.
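The priority list and the weekly schedule can be sketched as follows. The function names, the input shape (one user id per collected post) and the strict "more than one post per day on average" test are assumptions made for illustration; the project's own selection logic is not given in the text.

```python
import datetime
from collections import Counter

# Full lists are refreshed on Wednesday, Saturday and Sunday.
# datetime.date.weekday() numbers Monday as 0.
FULL_LIST_WEEKDAYS = {2, 5, 6}


def build_priority_list(post_user_ids, window_days, min_posts_per_day=1.0):
    """Select users who posted more than once a day on average.

    post_user_ids -- one user id per post collected in the window
    window_days   -- length of the observation window in days
    """
    counts = Counter(post_user_ids)
    return {
        uid for uid, n in counts.items()
        if n / window_days > min_posts_per_day
    }


def users_to_refresh(day, full_list, priority_list):
    """Pick which list of timelines to refresh on a given date."""
    if day.weekday() in FULL_LIST_WEEKDAYS:
        return full_list
    return priority_list
```

Refreshing the smaller list on most days keeps frequent posters up to date while limiting the number of API calls spent on rarely updated accounts.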

The process of updating one timeline goes as follows:

• Send a request to the API.

• Attempt to insert the posts one by one, starting with the newest (posts are pre-ordered in the API).

• When 5 consecutive posts are found to already exist in the database, stop the loop.

Timelines are refreshed for 25 users in parallel to speed up the process (the bottleneck is I/O, not calling the API). If the API limit is reached, we pause for a few seconds to let the limit for our IP refresh. If the limit is reached again after 2-3 more attempts, we pause the process until the next API reset time.
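The stopping rule in the steps above can be sketched as below. An in-memory set of ids stands in for the database, and the function name and the shape of a post (a dict with an `id` key) are assumptions for this example.

```python
def insert_new_posts(posts, existing_ids, stop_after=5):
    """Insert posts newest-first until `stop_after` known posts in a row.

    posts        -- list of dicts with an "id" key, newest first
    existing_ids -- set of post ids already stored (stands in for the DB)
    Returns the ids that were newly inserted, in the order processed.
    """
    inserted = []
    consecutive_known = 0
    for post in posts:
        if post["id"] in existing_ids:
            consecutive_known += 1
            if consecutive_known >= stop_after:
                break  # the rest of the timeline is assumed to be stored
        else:
            consecutive_known = 0
            existing_ids.add(post["id"])
            inserted.append(post["id"])
    return inserted
```

Requiring several known posts in a row, rather than stopping at the first one, tolerates isolated gaps in the stored data; the trade-off is that anything older than the run of known posts is never examined.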

Storing posts

The posts are stored in a database table called rp_sinaweibo. The table’s columns closely model the fields available through the API. We also added our own “helper” columns, such as the self-explanatory “deleted” and “dbinserted”. New columns were added in October 2011 to accommodate the new fields provided in the API.

The table is range partitioned by week of the created_at value. The child tables are named after the ISO year and ISO week: “rp_sinaweibo_y2011w45” contains only posts made during ISO week 45 of 2011. We created indexes on frequently used columns such as user_id, created_at and retweeted_status.
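The naming rule for the child tables can be sketched as below. The function name is hypothetical, and zero-padding of single-digit week numbers is an assumption, since the only example given (week 45) does not show how earlier weeks are written.

```python
import datetime


def partition_name(created_at, parent="rp_sinaweibo"):
    """Return the child table name for a post's created_at date.

    Uses the ISO year and ISO week, so days around New Year fall into
    the ISO year that owns their week. Zero-padding of the week number
    is an assumption made for this sketch.
    """
    iso_year, iso_week, _ = created_at.isocalendar()
    return "%s_y%04dw%02d" % (parent, iso_year, iso_week)
```

Using the ISO year rather than the calendar year matters at year boundaries: 1 January 2012 was a Sunday and therefore belongs to ISO week 52 of 2011.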

Reports

We produce a number of reports based on the data. The main ones are:

• Most reposted posts among sample: http://research.jmsc.hku.hk/social/sinaweibo/

• E-mail and keyword analysis: http://research.jmsc.hku.hk/social/reports.py/general/


Search engine

We use Lucene to provide full-text search ability on posts’ text field. Here is the URL to the search tool: http://research.jmsc.hku.hk/social/search.py/sinaweibo/. Posts are added to the index on a daily basis.

Scripts

Here is a brief description of the scripts used for collecting and storing Sina Weibo data:

• sinaweibo.oauth.py and sinaweibo2.oauth.py: The two main scripts used to collect data from the API. sinaweibo2 is the October 2011 revision that uses the Weibo API V2 and is currently under development. They build mainly on weibopy2, a library provided by Sina, largely inspired by a similar library for Twitter, and modified by us to reflect progressive changes in the API. They handle every type of data collection, from timelines to comments (with the exception of friendships).

• mypass.py: Contains all the credentials used in the code, such as API keys and database passwords.

• sinatrace.py: Gets all the reposts of a given post and produces a JSON file that attempts to trace the path of reposting based on user screen names.

• sinagetter.sh and sinastorage.py: The early scripts used to get and store data from the API, before basic authentication was banned from Sina Weibo in late spring 2011.

• sinamostretweeted.sh, sinamostretweeted_firstpass.py and sinamostretweeted_secondpass.py: Handle the “most reposted” reports for Sina Weibo. The first pass goes through the database and gets counts, while the second pass gets missing users and missing posts and then outputs the JSON used in the reports. The shell script wraps everything together.

• socialreports.py (formerly socialemail.py): Handles the e-mail reports for all networks.
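One way to trace a repost path from screen names, in the spirit of the sinatrace.py description above, is sketched here. It assumes that a repost's text quotes earlier reposters as “//@name:” segments, most recent first; that text convention, the regular expression and the function name are assumptions for illustration and do not describe the actual script.

```python
import re

# Matches "//@screen_name:" with either an ASCII or a full-width colon.
REPOST_MARKER = re.compile(r"//@([^:：\s]+)[:：]")


def repost_path(reposter, text, original_poster):
    """Return screen names from the original poster down to `reposter`.

    Quoted segments in `text` are assumed to list intermediate
    reposters from most recent to oldest, so they are reversed.
    """
    chain = REPOST_MARKER.findall(text)
    return [original_poster] + list(reversed(chain)) + [reposter]
```

Paths recovered this way can then be merged across all reposts of a post to form a diffusion tree; a user who edits or strips the quoted segments would break the chain, so the result is best treated as an approximation.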

Twitter

Background

Twitter was the first social media platform we worked on, starting in early September 2010. The initial experiment consisted of following a list of 1,000 influential users. We have since found ways to expand this list so that it consists mostly of Hong Kong-based users, and to follow what they are saying.

Data collection

We collected data from Twitter as we did on Weibo, based on a user list that we query on the API at regular intervals of time.


User lists

The user lists were constructed gradually, growing from the low thousands of users at the beginning to 16,565 users at the latest revision at the end of July 2011. To build those lists, we first selected users at large who we thought might belong to the Hong Kong population. We used a method that we call “geograb”, in which we search the Twitter API for tweets originating from the Hong Kong territory.

We then used a second method that consisted of finding user names in the tweets we had collected thus far. Those new names were searched against the Twitter API and saved in the lists. At each iteration, we refined the lists to find Hong Kong users. We identified them through three criteria:

  Their location or description field contained “Hong Kong,” “HK,” “香港,” or “HKG” (plus other lexical variants).

  Their location field contained coordinates that correspond roughly to Hong Kong.

  Their time zone was Hong Kong Time.

The first criterion is the most reliable, although some users may declare themselves to be living in several places. The second criterion, based on coordinates, may not always be exact, but is a fairly good indicator (it may include users from neighbouring Shenzhen). However, we may not be able to account for users who set their location dynamically, and there is no way to evaluate those who are only transiently in Hong Kong. With the third criterion, we found through thorough sampling that a sizeable proportion of the people choosing the Hong Kong time zone in their settings come from the Philippines, as determined by their stated location.

Overall, we think that the proportion of users not really from Hong Kong is fairly small and filtered out when doing term analysis. All the users are pooled into a single list that we then use for collecting tweets.
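The three criteria can be illustrated with a short Python sketch. It is not the project's actual code: the function name, the bounding box used for "roughly Hong Kong", and the time-zone label "Hong Kong" are all assumptions made for the example, and the keyword list omits the unlisted lexical variants.

```python
import re

# Criterion 1: keywords named in the text (other variants are not listed there).
LATIN_KEYWORDS = re.compile(r"\b(hong kong|hkg|hk)\b", re.IGNORECASE)
CJK_KEYWORD = "香港"

# Criterion 2: a "lat, lon" pair inside a rough bounding box.
# The box limits are an assumption of this sketch, not values from the study.
COORDS = re.compile(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")
HK_LAT_RANGE = (22.1, 22.6)
HK_LON_RANGE = (113.8, 114.5)

# Criterion 3: the label used for Hong Kong Time is assumed here.
HK_TIME_ZONE = "Hong Kong"


def matched_criteria(location="", description="", time_zone=""):
    """Return the names of the criteria that a user profile satisfies."""
    hits = []
    text = f"{location} {description}"
    if LATIN_KEYWORDS.search(text) or CJK_KEYWORD in text:
        hits.append("keyword")
    match = COORDS.search(location)
    if match:
        lat, lon = float(match.group(1)), float(match.group(2))
        in_lat = HK_LAT_RANGE[0] <= lat <= HK_LAT_RANGE[1]
        in_lon = HK_LON_RANGE[0] <= lon <= HK_LON_RANGE[1]
        if in_lat and in_lon:
            hits.append("coordinates")
    if time_zone == HK_TIME_ZONE:
        hits.append("time_zone")
    return hits


print(matched_criteria(location="Manila", time_zone="Hong Kong"))
```

Returning the list of matched criteria, rather than a single yes/no answer, makes it easy to treat a time-zone-only match with more caution, in line with the Philippines finding above.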

Collecting tweets

We randomly go through the list and query the API for their timeline. The Twitter API allows the downloading of up to 16 pages of 200 tweets, so we can theoretically access the latest 3200 tweets made by a user. This is useful to get older tweets made by recently added users. The collection job is started on a daily basis and usually completed within 48 hours.
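The paging arithmetic (16 pages of 200 tweets, hence at most 3,200 tweets per user) can be sketched as below. The `fetch_page` callable is a hypothetical stand-in for the real API request, and `fake_fetch` only simulates a user with 450 tweets; neither is part of the project's scripts.

```python
from typing import Callable, Dict, Iterator, List

PAGE_SIZE = 200  # tweets per page, as stated above
MAX_PAGES = 16   # pages available per user, as stated above


def iter_timeline(fetch_page: Callable[[int, int], List[Dict]]) -> Iterator[Dict]:
    """Yield a user's tweets page by page until the API runs dry or the cap is hit."""
    for page in range(1, MAX_PAGES + 1):
        tweets = fetch_page(page, PAGE_SIZE)
        if not tweets:
            break  # no older tweets left for this user
        yield from tweets


def fake_fetch(page: int, count: int, total: int = 450) -> List[Dict]:
    """Simulated API call: a user with `total` tweets, served `count` per page."""
    start = (page - 1) * count
    return [{"id": i} for i in range(start, min(start + count, total))]


print(PAGE_SIZE * MAX_PAGES)                  # upper bound per user
print(len(list(iter_timeline(fake_fetch))))   # tweets retrieved for the simulated user
```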

Storing tweets

While the number of tweets is not as large as for Weibo (because of the size of the lists), we have nonetheless partitioned the table by week of creation (see the Sina Weibo section). The tweets table contains the same information as is available on the API.


Reports

We produce a number of reports based on the data. The main ones are:

  Most reposted posts among sample: http://research.jmsc.hku.hk/social/twitter/

  E-mail and keyword analysis: http://research.jmsc.hku.hk/social/reports.py/general/

Scripts

Here is a brief description of the scripts used for collecting and storing Twitter data:

  twitter.oauth.py: The main script that collects tweets from user timelines, user information and friendships.

  mypass.py: Contains all the credentials used in the code, such as API keys and database passwords.

  twitter.geograb.py: A script to perform the “geograbbing” of geo-tagged tweets near or in Hong Kong.

 twitter.db.py: Some secondary scripts built on top of Python libraries provided by Twitter to access their API.

 mostretweeted.sh: Leverages twitter.db.py to get the most reposted tweets within an interval of time.

Facebook

Background

In November 2010, we started collecting data on Facebook through their Graph API. Because location-specific data is difficult to obtain, we rely on a list of keywords to search for entities (events, pages, groups, etc.) in a process we call the shotgun. Then, through a set of criteria that we developed, we select entities to “watch”, by which we mean checking their characteristics at regular intervals of time. The Facebook Graph API changes constantly, and some fine-tuning is required to keep up.

The shotgun

Using a set of Hong Kong-specific keywords, we queried the search functions of the API. The keywords were chosen for their specificity to Hong Kong. For instance, we include names of places, public people and various organizations of all affiliations. The shotgun currently relies on a list of 114 such words. This is to broaden the search as much as possible to then refine according to our needs. We run the shotgun from 2 to 4 times per day. Each keyword is searched against the API for each of the following entities:

  Events

  Groups

  Pages

  Posts

  Applications

  Users

We also developed other scripts that collect likes and comments from posts. In Facebook parlance, “posts” may be links, status updates, photos, etc. The likes and comments are currently not used in our analysis, however.

The watch list

Once the Facebook entities are inserted into our database, we want to follow their evolution. For groups, we want to look at membership. For events, we are interested in the people attending and the people invited. For pages, we would like the fan count. The number of entities added to our database every day typically ranges from 500 to 1,000. Polling every entity several times per day is impossible, both because of the usage limits imposed by Facebook on their API and simply because of system load. Also, we are not interested in following outdated entities or ones unrelated to our study. The watch lists are our solution. When entities are first discovered, they are neither watched nor un-watched. These are the criteria that we consider:

  Rate of increase

  Absolute numbers

 Start and end of the event (we un-watch events that ended a week ago)

We also bring some custom corrections:

  Un-watch entities that match Taiwan-related keywords

  Remove entities retrieved from the API longer than a month ago

Once an entity is un-watched, it may not return to the watch list. This is why we have set very stringent conditions to un-watch an entity that is currently being watched.
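The three states (not yet decided, watched, un-watched) and the one-way nature of un-watching can be sketched as a small state update in Python. Everything numeric here is a placeholder: the thresholds `MIN_COUNT` and `MIN_DAILY_INCREASE`, the keyword tuple, and the way the criteria are combined are assumptions for illustration, since the text does not give the actual rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

TAIWAN_KEYWORDS = ("台灣", "Taiwan")  # placeholder list, not the real one
MIN_COUNT = 100                       # hypothetical "absolute numbers" threshold
MIN_DAILY_INCREASE = 10               # hypothetical "rate of increase" threshold


@dataclass
class Entity:
    kind: str                             # "event", "group" or "page"
    name: str
    first_seen: datetime                  # when the API first returned it
    count: int                            # attendees, members or fans
    daily_increase: int
    end_time: Optional[datetime] = None   # events only
    state: str = "undecided"              # "undecided", "watched" or "unwatched"


def update_state(entity: Entity, now: datetime) -> str:
    """Apply the watch-list criteria once and return the new state."""
    if entity.state == "unwatched":
        return entity.state  # an un-watched entity never returns to the list
    ended = entity.end_time is not None and now - entity.end_time > timedelta(days=7)
    stale = now - entity.first_seen > timedelta(days=30)
    off_topic = any(keyword in entity.name for keyword in TAIWAN_KEYWORDS)
    if ended or stale or off_topic:
        entity.state = "unwatched"
    elif entity.count >= MIN_COUNT or entity.daily_increase >= MIN_DAILY_INCREASE:
        entity.state = "watched"
    return entity.state


now = datetime(2011, 11, 1)
group = Entity("group", "香港關注組", now - timedelta(days=2), 500, 50)
print(update_state(group, now))
```

Because un-watching is irreversible in this sketch, as it is in the system described above, the conditions that lead to it should be the strictest ones.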

The watching

Three different Facebook entities may be watched: events, groups and pages. In the database, we created separate tables that keep track of the information for each type of entity, with a timestamp. These tables record the following:

 Events: Number of users attending, maybe attending, not attending, invited (no answer).

  Groups: Number of known members.

  Pages: Number of fans.

Watched entities are polled twice a day.

Limitations

One major limitation is that we can only see through the API what a typical user can see on Facebook. This means that some groups may be marked as secret or private, and their information may be hidden. In finding entities, we are also limited by the keywords that we choose, even though we have chosen a wide-ranging list of terms. The mechanism for marking entities for watching has been fine-tuned, but could be improved to be more adaptable.

Blogs

Background

Blogs were the last piece added to our project. We completed the crawling system for blogs at the end of September 2011.

Each blog system, such as Blogger or Sina Blog, has its own way of organizing the information that matters to our problem. We did not have this problem with the newer social media systems, as their APIs provided data in known formats that were easy to parse. To avoid developing for multiple blog platforms, we decided instead to use RSS feeds to collect data. RSS is a commonly adopted syndication format. It has the advantage of being public.

Creating list of blogs to follow

We leveraged Sina Weibo user information to derive a list of Hong Kong blogs to follow. We took the value of the URL field for each of the Hong Kong users in our database, and did the same for our Twitter users from Hong Kong. We downloaded each URL, checked whether it contained an RSS feed, and stored the feed if it did. Using this method, we were able to isolate 17,756 RSS feed URLs of blogs that we strongly believe are Hong Kong-based.

Collecting posts

Once a day, we go through the entire list of RSS feed URLs and download them. We try to insert posts based on their date. If a post with the same date and time is already stored, we consider the new one a duplicate and the stored post is not updated.
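The duplicate rule can be sketched in a few lines of Python. This is an illustration only: the in-memory dictionary stands in for the database, the entry fields are hypothetical, and keying on the feed URL together with the timestamp (rather than on the timestamp alone) is an assumption.

```python
from datetime import datetime


def insert_new_posts(stored: dict, feed_url: str, entries: list) -> int:
    """Add entries whose (feed, publication time) pair is not stored yet.

    Returns the number of entries added. Existing entries are never updated.
    """
    added = 0
    for entry in entries:
        key = (feed_url, entry["published"])
        if key in stored:
            continue  # same date and time: treated as a duplicate
        stored[key] = entry
        added += 1
    return added


stored = {}
feed = "http://example.org/feed"  # placeholder URL
entries = [
    {"published": datetime(2011, 10, 1, 9, 0), "title": "first version"},
    {"published": datetime(2011, 10, 1, 9, 0), "title": "edited version"},
    {"published": datetime(2011, 10, 2, 9, 0), "title": "another post"},
]
print(insert_new_posts(stored, feed, entries))
```

The second entry shares its timestamp with the first, so it is skipped and the stored title remains "first version". Later edits to a post are therefore not captured.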

Scripts

  blogs.parse.py: The main script, which takes any standard RSS feed and attempts to add new entries to the database.

Limitations

One of the limitations of our system is that it does not check for posts that may have been updated by the user. We make the assumption (which should be correct) that a blogger would not publish incomplete posts and later replace them with versions containing a lot more information. Also, posts in RSS feeds are sometimes only previews of the complete posts, because some bloggers or blog providers prefer to send users to the website’s pages.

HK Forums

Background

Forums are one of the most popular means of communication in Hong Kong. The crawling of forums was developed in spring 2011.

Forums under study

We are interested in three forums, namely Uwants, Discuss.hk and HKReporter. We also investigated the HKGolden forum, but encountered technical difficulties. All three forum systems are built on the same framework, so code developed for one can be reused for another.

Data structure

We created analogous tables for each of the forums:

  Topic is the top-most type of entity in forums. Topics are created to house a conversation between users.

 Post is for the entries of topics. They are contributed by individual users and are the main source of contents for our study.

 Image, Link, Attach: Are all attached to single topics.

 Smilies: URLs of the available emoticons per forum.

  Field: Type of topic. Each topic is in one field.

  User: Information about users of the forum.

Programs

The application that collects and classifies the forum entries was written in Java. Since the forum topics are numbered by uniformly increasing IDs, our program systematically goes through a range of IDs known to be recent. Topics and the posts they contain are stored in our database.

Further iterations of the program verify if new posts were made recently for existing topics. If none are found, we may eventually specify that a given topic is “finished”. This helps us limit the number of topics to follow and verify.


Results

Time-trend statistics

Total amount of postings/tweets/retweets

[Figure: time trend of the overall daily number of posts from all sources. Annotated spikes: Andy Tsang’s Theory of “Black Shadow”; Stephen Lam appointed as new Chief Secretary; Donald Tsang’s “no place for triad society”.]

The above figure shows the time trend of the overall number of posts from the different sources (Discuss HK forum, Uwants forum, HKreporter forum, Twitter, Sina Weibo posts and comments, and blogs) during the period from July 11 to October 24, 2011. If the few spikes are not counted, the daily number usually ranges between 600 and 1,200 posts. The highest spike, on October 14, climbed to 2,937 posts in a single day.

The three highest spikes are marked with arrows and labeled with the corresponding public incidents: “Chief of Police Andy Tsang’s ‘Theory of Black Shadow’” (August 29), “Stephen Lam appointed as Chief Secretary” (September 30), and “Chief Executive Donald Tsang’s saying ‘no place for triad society’” (October 14). This suggests that public outcry over a government decision or a senior official’s controversial remark often triggers an online reaction in the form of users’ posting or reposting.

Moreover, the “Theory of Black Shadow” and “Stephen Lam’s appointment” incidents each lasted about three or four days before or after the spike, longer than the “Donald Tsang’s ‘no place for triad society’” incident, which lasted only one day after the spike. Within the two longer periods, “double-spiked” or even “triple-spiked” patterns are observed. This may be explained by the prolonged online debate on the issues against the government and by the online mobilization activities of the social movement via the Internet within those periods.

Frequency of each source

[Figure: time trend of the daily number of posts from each source. Annotated events: Andy Tsang’s Theory of “Black Shadow”; Apple Daily’s headline “只要通車不要救人 他媽的!”; Stephen Lam appointed as new Chief Secretary; Donald Tsang’s “no place for triad society”.]

The above figure shows the breakdown of the number of posts from each source. Discuss HK forum and Twitter are the two main sources of online posts related to the Hong Kong government. Uwants forum, HKreporter forum, and Sina Weibo form the second tier. Sina Weibo comments and blogs have relatively fewer such posts.

Discuss HK forum and Twitter are also the major contributors to the spikes identified (the three public incidents mentioned above). The spikes in Twitter posts seem to come one or two days earlier than those in Discuss HK forum posts. This observation may be explained by the nature of microblogging: microblogs are able to respond faster and diffuse more broadly than other online media such as discussion forums.

Relatively speaking, Uwants and HKreporter do not appear to contribute much to the total number of posts. But their post counts increased markedly at the spike of “Chief Executive Donald Tsang’s saying ‘no place for triad society’” on October 14, indicating the sensitivity of their users’ reactions to some controversial issues related to the Hong Kong government.

Sina Weibo is a special online medium. Its count pattern is not only sensitive to local Hong Kong issues (such as the three public incidents mentioned above), but also reflects online users’ responses to social issues in Mainland China. For example, on July 26, 2011, the Hong Kong newspaper Apple Daily ran a “723 train crash” story headlined “只要通車 不要救人 他媽的!” to criticize the Wenzhou government and the Ministry of Railways in China, which drew attention in both Mainland China and Hong Kong, particularly with reference to the recent high-speed railway debate in Hong Kong.

Unique contributors by source

[Figure: time trend of the daily number of unique contributors by source. Annotated events: Andy Tsang’s Theory of “Black Shadow”; Apple Daily’s headline “只要通車不要救人 他媽的!”; Stephen Lam appointed as new Chief Secretary; Donald Tsang’s “no place for triad society”.]

The above figure displays the time trend of the number of unique contributors by source. A unique contributor is identified by user identifier number (not user name or display name) within each source on a given day. A unique contributor may write more than one post per day. When interpreting the numbers of posts, the unique contributor count helps reflect the actual number of online participants who create online content.

Tweets are found to be created by a comparatively small group of Twitter users, as measured by the counts-to-unique contributor ratio. For example, there were 901 tweets on August 29, 2011, contributed by 159 Twitter users, whereas 422 posts on Discuss HK were created by 275 forum users. The counts-to-unique contributor ratios of Twitter and Discuss HK on that day were 5.7 and 1.5 respectively, a large difference. It is attributable to the nature of Twitter (microblogging with social media characteristics), in which a single tweet is able to diffuse through a series of retweets via a large interconnected user network.

But this observation seems to be a characteristic of Twitter alone, not of microblogs in general. Sina Weibo shows no such pattern. For instance, on September 30, 436 Sina Weibo posts were created by 341 users, a counts-to-unique contributor ratio of 1.3. On the same day, there were 985 Discuss HK posts made by 532 forum users, a ratio of 1.9. The difference between the two is relatively small.
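The ratios quoted above can be reproduced with a short calculation. The sketch below only restates the arithmetic; the function and variable names are illustrative.

```python
def counts_to_unique_ratio(posts: int, contributors: int) -> float:
    """Average number of posts per unique contributor on a given day."""
    return posts / contributors


# (source, date) -> (number of posts, number of unique contributors),
# using the figures quoted in the text.
figures = {
    ("Twitter", "2011-08-29"): (901, 159),
    ("Discuss HK", "2011-08-29"): (422, 275),
    ("Sina Weibo", "2011-09-30"): (436, 341),
    ("Discuss HK", "2011-09-30"): (985, 532),
}

ratios = {key: round(counts_to_unique_ratio(*value), 1) for key, value in figures.items()}
for (source, day), ratio in ratios.items():
    print(source, day, ratio)
```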


Discussion Forum Topics

[Figure: time trend of the daily number of discussion forum topics by forum. Annotated events: Andy Tsang’s Theory of “Black Shadow”; Apple Daily’s headline “只要通車不要救人 他媽的!”; Stephen Lam appointed as new Chief Secretary; Donald Tsang’s “no place for triad society”.]

The number of discussion forum topics is used to indicate how many distinct conversations, i.e. “threads”, are going on per day. It is not equivalent to the number of posts: an individual topic or conversation may generate, and usually does generate, more than one post. In other words, a large number of posts may be created by a small number of conversations.

As shown in the above figure, daily changes in the number of discussion forum topics were not as salient as those observed in the graph of the number of posts, particularly at the points of the few public incidents mentioned. For instance, there were 226 government-related Discuss HK forum topics on 30 September 2011, the date when Stephen Lam was appointed as the new Chief Secretary, but the total number of posts amounted to 985 on that day. The average number of posts per topic was 4.4.

On 14 October 2011, when the Chief Executive Donald Tsang made his “no place for triad society” remark, the number of government-related Discuss HK forum topics climbed to 349 and the total number of posts reached a record high of 1,185. The average number of posts per topic was about 3.4.

From the above graph, HKreporter seems to be more sensitive than Uwants to public controversies relating to the Hong Kong government, such as “Stephen Lam appointed as new Chief Secretary” and “Donald Tsang’s ‘no place for triad society’”. This may be because quite a number of HKreporter’s posts were associated with a few active local political groups.


Retweeting and Reposting

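[Figure: daily percentage of retweets (Twitter) and reposts (Sina Weibo) among all posts. Annotated events: Andy Tsang’s Theory of “Black Shadow”; Apple Daily’s headline “只要通車不要救人 他媽的!”; Stephen Lam appointed as new Chief Secretary; Donald Tsang’s “no place for triad society”.]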

Microblogging is a type of social media characterized by rapid information diffusion in the form of retweets or reposts, in which a message, whether original or not, is forwarded by a microblogger to his or her own followers, who may number zero, a few, or more than 10,000. When forwarding a message, microbloggers may add their own comments to the retweet or repost.

The above figure shows the percentage of retweets and reposts among the total posts on Twitter and Sina Weibo on each individual day, i.e. the retweets-to-tweets (Twitter) and reposts-to-total (Sina Weibo) ratios. As seen in the figure, about 10-20% of tweets were retweets and about 30-40% of Sina Weibo posts were reposts. But when public controversies related to the Hong Kong government occurred, the retweets-to-tweets and reposts-to-total ratios went up considerably.

For example, on 29-30 August 2011 (Andy Tsang’s Theory of “Black Shadow”), one third of the total number of tweets were retweets. On 30 September 2011 (Stephen Lam appointed as new Chief Secretary), half of the Sina Weibo posts were reposts. An extreme case was found on 26 July 2011 (Apple Daily’s headline 只要通車不要救人 他媽的) – over 80% of the Sina Weibo posts were reposts. A close association between high ratio of retweets/reposts and occurrence of public controversy may suggest that the online opinion was propagated primarily via retweeting or reposting of a few original messages by a small group of microbloggers.


Active users and their last update

The above figure shows the distribution of the number of active users who contributed Hong Kong government-related posts to Sina Weibo in the past 30 days. The results were obtained on November 1, 2011. The number of active users in the past 30 days totalled 2,356.

The y-axis of the chart indicates how many of those 2,356 microbloggers made their last update post on the date marked on the x-axis. For example, 30% (707 out of 2,356) published their last update posts on November 1 (today), 40% (947 out of 2,356) made their last posts on October 31 (yesterday) and 8% (199 out of 2,356) posted their last Weibo update on October 30 (two days ago). In other words, among those 2,356 microbloggers who contributed government-related contents to Sina Weibo in the past 30 days, about 78% made at least one post in the past 3 days. This group of microbloggers may be classified as regular contributors (at least one update every 3 days) to Sina Weibo. Conversely, about 20% may be regarded as irregular Sina Weibo contributors.


Peak hour of publication

The above figure displays the hours (in 24-hour clock time) when discussion forum users (blue line) and microbloggers (red line) posted their messages online. Y-axis indicates the percentage of posts over the total in the past seven days that were published at the hour (x-axis 0-23 hour in 24 hour clock time).

As seen in the above figure, the posting-time patterns of discussion forum users and microblog users look fairly similar.


Article length

The above figure presents the average number of Chinese or English characters of the posts obtained from different sources during the period from October 9 to October 19, 2011. From the chart, we find that microblogs, e.g. Sina Weibo or Sina Weibo Comment, had relatively shorter posts, usually of fewer than 100 words, whereas forum posts were often lengthy. For example, the average length of HKreporter discussion posts ranged from 400 to 600 words, and Uwants forum published pieces with lengths between 300 and 500 words. Blogs, however, published articles with a wide range of average lengths: their average article length on 12 October 2011 was 299 words but climbed to 3,088 words on 13 October 2011.

The above numbers reflect the defining features of the various types of publishing - discussion forum, microblogging, and conventional blogging. The length limit of Sina Weibo and Twitter is 140 characters. Article length is often limited in discussion forums too, but users rarely reach the limit. Usually, no such restriction is imposed on blogging.


Other characteristics (e.g. whether it was attached with image, link, or unit icon)


The above three figures are time trends of the forum posts that had i) images; ii) external links; and iii) unit icons, i.e. emotion icons, attached. The results do not seem to suggest any connection between the use of images, links, or unit icons and the occurrence of significant public controversy relating to the Hong Kong government.

There are obvious variations in the use of images and unit icons between the three discussion forums. For example, Uwants is the forum with the highest usage of image attachments and Discuss HK is the one with the largest proportion of posts using unit icons. By contrast, there is no clear difference in the use of URL links between the three discussion forums.


Keyword analysis

Each day, the textual contents of all the posts from the different sources are aggregated, and keywords are extracted and analyzed. Before the analysis, the textual contents are filtered and preprocessed as follows:

1. Filtering out bracketed special symbols of the form [ABC], e.g. unit icon, image, attachment, or link
2. Filtering out Twitter or Sina Weibo hashtags #XYZ or #XYZ#
3. Filtering out Twitter or Sina Weibo citations @XYZ
4. Filtering out URL Web links http://xxx.xxx.xxx/yyy/zzz.html
5. Filtering out punctuation, special characters, control characters and non-Chinese & non-English characters (e.g. Japanese or Korean)
6. Filtering out dates
7. Segmenting Chinese words and phrases using the open-source software NLPbamboo (link: http://code.google.com/p/nlpbamboo/)
8. Removing Chinese or English stop words, e.g. 但是, 当, 到, 得, 的, a, about, above, across, after, or again

Then, we used the software R (http://www.r-project.org/) to conduct text mining (library package tm) and generate the term-document matrix and TF-IDF weighting.
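The filtering steps can be approximated with regular expressions, as in the Python sketch below. It is a rough illustration rather than the pipeline actually used: the regular expressions (especially the date formats) are guesses, URLs are removed first so that fragments containing "#" or "@" are not mistaken for hashtags or citations, digits are kept, the stop-word set contains only the examples quoted above, and plain whitespace splitting stands in for NLPbamboo segmentation.

```python
import re

# Only the example stop words quoted in the text; the real list is longer.
STOP_WORDS = {"但是", "当", "到", "得", "的",
              "a", "about", "above", "across", "after", "again"}

URL_RE = re.compile(r"https?://\S+")                 # step 4
BRACKET_RE = re.compile(r"\[[^\[\]]*\]")             # step 1: [ABC]
HASHTAG_RE = re.compile(r"#[^#\s]+#|#\w+")           # step 2: #XYZ# or #XYZ
MENTION_RE = re.compile(r"@\w+")                     # step 3: @XYZ
DATE_RE = re.compile(                                # step 6 (assumed formats)
    r"\d{4}[-/]\d{1,2}[-/]\d{1,2}|(?:\d{4}年)?\d{1,2}月\d{1,2}日")
# Step 5: drop everything except Latin letters, digits, CJK ideographs, spaces.
NOT_KEPT_RE = re.compile(r"[^A-Za-z0-9\u4e00-\u9fff\s]")

FILTERS = (URL_RE, BRACKET_RE, HASHTAG_RE, MENTION_RE, DATE_RE, NOT_KEPT_RE)


def preprocess(text, segment=str.split):
    """Filter a post and return its tokens with stop words removed.

    `segment` is a placeholder for a real Chinese word segmenter (step 7).
    """
    for pattern in FILTERS:
        text = pattern.sub(" ", text)
    tokens = [token.lower() for token in segment(text)]
    return [token for token in tokens if token not in STOP_WORDS]  # step 8


sample = "[img] @someone 但是 government policy #HK# http://t.cn/abc 2011-10-23 の!"
print(preprocess(sample))
```

Passing the segmenter in as a parameter keeps the filtering logic independent of whichever segmentation tool is available.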

Term frequency analysis

After filtering and preprocessing, Chinese or English terms are space-separated and their frequencies are counted. Term frequency analysis is a simple way to examine the extent to which a Chinese or English term is used in the content.

For example, below is a bar chart showing the term frequencies on 23 October 2011. The most frequent keyword was “黎智英”, followed by “執政黨” and “唐英年”.


Term frequency inverse document frequency (TF-IDF) analysis

Term frequency inverse document frequency (TF-IDF) analysis investigates the importance of a term in a document by computing its term frequency weighted by its inverse document frequency, i.e. giving the term less weight when it occurs frequently in the other documents of the whole corpus. The aim is to identify terms that are frequent recently but were infrequent in the past. TF-IDF is used on the assumption that term frequency alone may not be good enough to indicate a term’s importance. It can be deployed to identify “special keywords” that occur frequently in the current document (e.g. today’s document) but not in the whole corpus (the data archive).

First of all, a corpus of Hong Kong government-related contents was built by searching our posts archive for the list of government-related keywords over the period from November 2010 to June 2011. Following the same procedures as stated above, the obtained contents were filtered and preprocessed. The resultant data was used as the corpus for the TF-IDF computation.

For example, on 23 October 2011, two sets of terms were generated, ranked by term frequency and by TF-IDF respectively. The table below presents the top 10 terms under each ranking.
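A minimal Python sketch of the idea follows. The weighting used here, term frequency multiplied by the natural logarithm of (number of documents / document frequency), is one common variant and is an assumption; the exact weighting and normalisation applied in R are not specified in the text. The toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(today_tokens, archive_docs):
    """Score each term of today's document against an archive of documents.

    `today_tokens` is a list of tokens; `archive_docs` is a list of token lists.
    Today's document counts as part of the corpus, so every document
    frequency is at least 1.
    """
    term_freq = Counter(today_tokens)
    archive_sets = [set(doc) for doc in archive_docs]
    n_docs = len(archive_sets) + 1
    scores = {}
    for term, freq in term_freq.items():
        doc_freq = 1 + sum(1 for doc in archive_sets if term in doc)
        scores[term] = freq * math.log(n_docs / doc_freq)
    return scores


# Toy data: "候選人" is frequent today but also common in the archive,
# while "卡扎菲" is less frequent today but absent from the archive.
today = ["候選人"] * 5 + ["卡扎菲"] * 3
archive = [["候選人", "基本法"], ["候選人"], ["基本法"]]
scores = tf_idf(today, archive)
for term in sorted(scores, key=scores.get, reverse=True):
    print(term, round(scores[term], 2))
```

In the toy data the rarer term outranks the more frequent one, which is the effect TF-IDF is meant to produce.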

Top10 TF Words    Term frequency    Top10 TF-IDF Words    TF-IDF
黎智英    289    卡扎菲    574.4
执政党    130    陶君行    486.6
唐英年    123    梁振英    476.6
基本法    113    天主教    456.9
自食其力    105    纳税人    439.7
候選人    87    申請人    422.1
共产党    78    社会主义    420.9
黃毓民    76    反對派    360.9
skip    75    100    360.5
民主黨    73    林瑞麟    359.0

Based on the top ranked terms by using TF-IDF, the results indicate that a number of “unusual” topics were found in the government-related posts on 23 October 2011. The first one was about the death of Muammar Gaddafi, in which the term “卡扎菲” was frequently used but was not often found in the old archive.

The second top-ranked term related to Jimmy Lai’s political donation “scandal”, in which the term “黎智英” was repeatedly mentioned and ranked first by term frequency. Since “陶君行” was not used as frequently as “黎智英”, it did not rank among the top 10 by term frequency, but it ranked second on the TF-IDF list. TF-IDF gave a higher weighting to “陶君行”, which was rarely found in the old archive but was relatively common on 23 October 2011.


Time-lines of top10 keywords are shown in the following tables.

Major keywords used of each week (ranked by term frequency)

6-Nov    7-Nov    8-Nov    9-Nov    10-Nov    11-Nov
候選人    创业板    投票率    民主黨    支持者    支持者
民主黨    投票率    候選人    候選人    候選人    梁家傑
朱永新    候選人    民主黨    支持者    民主黨    民主派
反對派    黨加油    黨加油    投票率    民主派    基本法
民主派    民主黨    何俊仁    民主派    梁家傑    候選人
基本法    民主派    民建聯    何俊仁    民建聯    民建聯
丰子恺    何俊仁    民主派    民建聯    投票率    黃毓民
促进会    民建聯    李永達    梁家傑    基本法    民主黨
教育者    區議會    區議會    黃毓民    黃毓民    梁振英
梁振英    支持者    支持者    李永達    反對派    反對派

Major keywords used of each week (ranked by term frequency inverse document frequency)

6-Nov    7-Nov    8-Nov    9-Nov    10-Nov    11-Nov
投白票    中小企业    李卓人    陶君行    何俊仁    意大利
超級區    投白票    梁家傑    超級區    二檔案    發言人
急症室    超級區    超級區    痛定思痛    意大利    打官司
活躍症    投票站    投票站    李卓人    李永達    投票率
廣東省    李永達    陳景輝    大部份    陶君行    菲律賓
朱綺華    急症室    前特首特    稅務局    發言人    王德全
港交所    活躍症    對學界    基本法    純為民    李志喜
發言人    百分點    單仲偕    馮檢基    超級區    公民党
陳方安生    梁家傑    梁美芬    工作室    王德全    技術性
主題曲    帕潘德里歐    沈旭輝    北京市    痛定思痛    梁耀忠

The above keyword timelines present the changes in keyword use in the week following the District Council election on November 6, 2011. As seen in the tables, election-related keywords were dominant. Names of many pan-democratic candidates, for example 李卓人, 李永達, and 何俊仁, were often mentioned and ranked among the top 10 keywords. Leaders of political parties, e.g. 黃毓民 and 梁家傑, were cited a number of times too.


Social Network Analysis

In addition to time trend and text analysis, social media is by nature an interconnected information network. If data is available, its network characteristics can be analyzed and its pathway of information diffusion can be visualized and the major participants involved (say opinion leaders) can be identified. Our collected data on Sina Weibo support our social network analysis.

The information diffusion via Sina Weibo reposting is represented by a directed graph of interconnections between microbloggers, in which a link between two nodes signifies a repost sent from one microblogger to another. For example, in the figure below, nodes B, C, and D repost a message from node A, and nodes B, E, and F further repost the message from node C.

[Figure: example repost network with nodes A, B, C, D, E, and F.]

To analyze the graph, we used R to conduct social network analysis (library package igraph), evaluating node-level parameters such as in-degree, out-degree, and betweenness centrality, which are routinely used to represent the importance and the centrality of each individual node (in this case, microblogger) within a network (Chin & Chignell, 2006; Freeman, 1978). Moreover, degree distribution, diameter, average path length, and global clustering coefficient (Lewis, 2009; Wasserman & Faust, 1994) were used to compare networks.

For instance, in the above social graph, each individual node represents a microblogger who reposts a message, say microblogger A with a list of followers $F_A = [F_{A1}, F_{A2}, \ldots, F_{Af}]$, where $f$ is the total number of followers of A. The following node-level and network-level topological parameters are used:

 Out-degree centrality of A means the number of A’s followers who eventually repost the message after receiving it from A. Therefore, if A has 20 followers and only the followers B, C, and D repost the message sent by A, the out-degree centrality of A is 3. The out-degree centrality serves as an indicator of the effectiveness of A in propagating the message;


  In-degree centrality of a microblogger represents the number of times that the microblogger reposts the message received from other microbloggers. When following other microbloggers, a microblogger may receive the same message from multiple sources and could thus repost it more than once. For example, microblogger D receives reposts from microbloggers A and E, so the in-degree centrality of D is 2;

  Betweenness centrality of a microblogger represents the total count of pairs of nodes in the network whose shortest path passes through that microblogger, denoting the relative importance of the microblogger’s position as a gatekeeper within the network. For example, the shortest paths from A to E and from A to F must pass through node C;

  Degree distribution of a network denotes the degree sequence, either out-degree or in-degree, of all nodes in the network. If a network follows a power law distribution, its degree distribution is represented by $h(k) \sim k^{-q}$, where $k$ is the degree and $q$ is an exponent (Lewis, 2009);

  Diameter of a network represents the longest of the shortest paths between any two nodes in the network. The longer the diameter, the more steps the message is reposted through by microbloggers and sustained in the network;

  Density of a social group represents the extent to which nodes are interconnected. It is approximated by the ratio of the actual number of edges to the number of edges in a complete graph (Lewis, 2009);

  A network’s average path length is the average number of steps along the shortest paths between all pairs of connected nodes in the network, showing the effectiveness of information diffusion. Generally speaking, the shorter the average path length, the higher the effectiveness of information diffusion;

  Global clustering coefficient indicates the extent to which the nodes of a network tend to cluster together;

  Network visualization employed the Fruchterman and Reingold graph layout algorithm provided by the igraph package.
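Several of these measures can be worked through on a small example in plain Python (the study itself used R with igraph). The edge list below is an assumption: it is one reading of the worked examples in the bullets (B, C, and D repost from A; E and F repost from C; D also reposts from E) and may not match the original figure exactly. An edge (X, Y) means that Y reposted the message from X.

```python
from collections import deque

# Assumed example network: (source, reposter) pairs.
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("C", "E"), ("C", "F"), ("E", "D")]


def degree_counts(edges):
    """Return (out_degree, in_degree) dictionaries for a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    in_degree = {node: 0 for node in nodes}
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return out_degree, in_degree


def density(n_nodes, n_edges):
    """Edges present divided by edges in a complete directed graph."""
    return n_edges / (n_nodes * (n_nodes - 1))


def shortest_path_lengths(edges):
    """Lengths of the shortest directed paths between all reachable pairs."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)
        successors.setdefault(target, [])
    lengths = []
    for start in successors:
        distance = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            current = queue.popleft()
            for nxt in successors[current]:
                if nxt not in distance:
                    distance[nxt] = distance[current] + 1
                    queue.append(nxt)
        lengths.extend(d for node, d in distance.items() if node != start)
    return lengths


out_degree, in_degree = degree_counts(EDGES)
lengths = shortest_path_lengths(EDGES)
diameter = max(lengths)
average_path_length = sum(lengths) / len(lengths)

print(out_degree["A"], in_degree["D"])       # 3 and 2, as in the bullets above
print(density(len(out_degree), len(EDGES)))
print(diameter, round(average_path_length, 2))
```

The directed-graph definition of density used in the sketch, E / (N(N - 1)), is an assumption about how the measure was computed.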

Social network analysis is deployed to investigate the information diffusion pattern, network characteristics, and to identify “opinion leaders” who are located at the important position of the reposting network, i.e. measured by betweenness. The three case studies are listed as follows: 1) RTHK protest against the newly appointed director; 2); Stephen Lam’s new appointment as Chief Secretary; 3) Hong Kong domestic helper’s right of abode lawsuit.


Sina Weibo #3358173450683498 Author: 廖伟棠 | 關注/followers:27457 | 粉絲/friends:881 | 微博/posts:5180 | | | 香港 | member since 2010-02-14 00:00:00

港台牛比,我敢說全世界都罕有這樣迎接單位新上任領導的:(綜合報道)(星島日報報 道)政務官鄧忍光空降香港電台接任廣播處長,昨正式履新,獲港台工會高舉黑色港台 台徽字牌、鋪黑地氈「迎接」,穿上黑衣的員工更大喊不要政務官、不要變官媒。(星 島日報報道)

This message was posted by 廖伟棠 on 2011-09-16 08:53:25. He is a Hong Kong-based writer. His Sina Weibo was followed by 27,457 microbloggers at the time of report writing. He made this post in response to a Sing Tao Daily news story about the RTHK staff’s protest against the newly appointed Director of Broadcasting. His post was reposted by at least 3,040 unique Weibo users. During the peak period from 13:00 to 14:00, it was reposted more than 350 times (see the time-trend figure below). A figure showing the pathway of information diffusion and the top 10 reposters can be found on the next page. The top reposter, 杨锦麟, is a Hong Kong-based journalist. Most of the other reposters were from Mainland China.


Top 10 reposters

User Name (Province) | Time | Betweenness | Num_of_Followers | Out_degree | In_degree
杨锦麟 (HK) | 9/16 10:56 | 102 | 362047 | 76 | 1
小记者亮冰娜 (Henan) | 9/16 18:50 | 48 | 9547 | 9 | 1
平民内参 (Henan) | 9/16 18:53 | 42 | 17526 | 8 | 1
李佳佳Audrey (Guangdong) | 9/16 18:51 | 33 | 25687 | 14 | 1
大饼卷馒头就米饭O (Peking) | 9/16 19:24 | 28 | 37 | 5 | 1
何云H (Guangdong) | 9/16 18:41 | 25 | 4667 | 1 | 1
打酱油的躲猫猫 (HK) | 9/16 13:36 | 24 | 11329 | 7 | 1
郭家翠 (Zhejiang) | 9/16 19:22 | 24 | 15 | 1 | 1
李诺言 (Sichuan) | 9/16 18:57 | 22 | 4130 | 3 | 1
远征锋 (Guangdong) | 9/16 14:08 | 20 | 2 | 7 | 1

Network Characteristics:
Number of nodes: 3040
Number of edges: 1126
Density: 0.00012
Average degree: 0.741
Diameter: 8
Average path length: 1.57
Global Cluster Coefficient: 0
Power law exponent coefficient: 2.71
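As a quick consistency check, the density and average degree reported above can be reproduced from the node and edge counts alone. The sketch below assumes a directed graph without self-loops, so that a complete graph has N(N-1) edges, and assumes that the average degree counts both the in-degree and the out-degree of each node (2E/N). These formulas are inferred from the reported figures rather than stated in the report, and the function names are illustrative.

```python
def density(num_nodes, num_edges):
    """Share of the N(N-1) possible directed edges that are present (assumed formula)."""
    return num_edges / (num_nodes * (num_nodes - 1))

def average_degree(num_nodes, num_edges):
    """Mean of in-degree plus out-degree per node (assumed formula)."""
    return 2 * num_edges / num_nodes

# Node and edge counts of the RTHK reposting network reported above.
d = density(3040, 1126)
k = average_degree(3040, 1126)
```

Rounded to the precision used in the report, `d` is 0.00012 and `k` is 0.741, which agrees with the reported values.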


Sina Weibo #3363301825166195 Author:梁東 leungdont | 關注/followers:4073 | 粉絲/friends:436 | 微博/posts:2462 | | | 香港 | member since 2010-05-31 00:00:00

【新華社報導林瑞麟接任政務司司長】今天是 Black Friday [泪] http://t.cn/hDqrl9

This message was posted by 梁東 on 2011-09-30 12:31:45, the date on which Mr. Stephen Lam was appointed Chief Secretary of the HKSAR government. 梁東 is a disc jockey in Hong Kong. His Sina Weibo was followed by 4,073 microbloggers at the time of report writing. He posted a short message in response to Mr. Lam’s appointment, saying that “today is a Black Friday”. His post was reposted by no fewer than 145 unique Weibo users, most of whom reposted it shortly after the announcement. A figure showing the pathway of information diffusion and the top 10 reposters can be found on the next page. The top reposter, 柳俊江, is a former Hong Kong-based journalist. Most of the reposters were from Hong Kong.


Top 10 reposters

User Name (Province) | Time | Betweenness | Num_of_Followers | Out_degree | In_degree
柳俊江 (HK) | 9/30 12:47 | 25 | 27338 | 20 | 1
搵食能手 (HK) | 9/30 13:32 | 14 | 11173 | 7 | 1
伍餐肉 (HK) | 9/30 14:09 | 8 | 4254 | 3 | 1
siujam小貓 (HK) | 9/30 12:50 | 4 | 592 | 1 | 1
煮飯喵 (HK) | 9/30 14:23 | 4 | 179 | 2 | 1
paulala13 (HK) | 9/30 12:51 | 3 | 723 | 1 | 1
狸狸 (Yunnan) | 9/30 14:25 | 3 | 7 | 1 | 1
Tobyhk (HK) | 9/30 12:50 | 2 | 102493 | 1 | 1
孫箎 (Guangdong) | 9/30 13:08 | 2 | 81 | 1 | 1
Tomaz_Wong (HK) | 9/30 13:56 | 2 | 95 | 1 | 1

Network Characteristics:
Number of nodes: 145
Number of edges: 67
Density: 0.00321
Average degree: 0.924
Diameter: 4
Average path length: 1.60
Global Cluster Coefficient: 0
Power law exponent coefficient: 2.28


Sina Weibo #3363276742532208 沪港小生 | 關注/followers:26472 | 粉絲/friends:287 | 微博/posts:4228 | | | 香港 其他 | member since 2009-11-16 00:00:00

號 外[围观] 號外[围观] 突發新聞:香港高等法院裁定「外傭爭取居留權案件」涉及入 境條例違反基本法,意味著幾十萬外傭(以菲律賓、印尼為主)若在香港住滿七年, 或申請獲得香港永 久居民權利和特別行政區護照,對香港將有重大社會影響。香港政 府發言人表示,正研究原訟庭的判詞,稍後將向公眾作交代。 Posted on 2011-09-30 10:52:05

This message was posted by 沪港小生 on 2011-09-30 10:52:05. On the morning of 30 September, the High Court ruled that the existing Hong Kong Immigration Ordinance violates the Hong Kong Basic Law with respect to the right of abode of domestic helpers. 沪港小生 is a media columnist. His Sina Weibo was followed by 26,472 microbloggers at the time of report writing. He responded promptly to the news story and forwarded it to his followers. His post was reposted by at least 192 unique Weibo users; some reposts were made shortly after his post and many were made around 7-8pm. A figure showing the pathway of information diffusion and the top 10 reposters can be found on the next page. Most of the top reposters were based in Hong Kong.


Top 10 reposters

User Name (Province) | Time | Betweenness | Num_of_Followers | Out_degree | In_degree
虾抖陆 (HK) | 9/30 11:33 | 5 | 497 | 2 | 2
賽西莉亞兔 (HK) | 9/30 12:17 | 4 | 66 | 2 | 1
寂寞的可乐罐 (HK) | 9/30 12:21 | 4 | 438 | 1 | 1
一坨水晶 (HK) | 9/30 12:15 | 3 | 75 | 1 | 1
冰封雪花飘_Angela (HK) | 9/30 12:34 | 3 | 570 | 1 | 1
北漂中滴小凡 (Peking) | 9/30 11:02 | 1 | 68 | 1 | 1
水缘无忆 (HK) | 9/30 13:34 | 1 | 1509 | 1 | 1
DXZhao (International) | 9/30 14:30 | 1 | 53 | 1 | 1
Kitkatnz (International) | 9/30 17:15 | 1 | 89 | 1 | 1
bestHermeslover (Others) | 9/30 10:52 | 0 | 8 | 0 | 0

Network Characteristics:
Number of nodes: 192
Number of edges: 39
Density: 0.00106
Average degree: 0.406
Diameter: 4
Average path length: 1.418
Global Cluster Coefficient: 0
Power law exponent coefficient: 3.20


Sentiment analysis

Sentiment analysis is a popular application in contemporary text mining and information science (Efron, 2011). Its algorithms computationally classify a set of opinionated contents into categories of tone, usually positive, negative, and neutral. This application has been used to identify sentiment in social media (Pak & Paroubek, 2010; Thelwall, Buckley, & Paltoglou, 2011) and in political comments on Web forums (Abbasi, Chen, & Salem, 2008). Another application of sentiment analysis is to predict public opinion time series (O’Connor, Balasubramanyan, Routledge, & Smith, 2010). According to a review of a variety of sentiment analysis techniques, their accuracy rates often range between 70% and 80% (Annett & Kondrak, 2008).

By deploying sentiment analysis techniques, our study aims to analyze the sentiment of Hong Kong online public opinion. A set of online posts was collected from our archive and classified into negative and positive/neutral sentiment, and the classification was validated by comparison with the sentiment rated by a human rater. Using this sentiment classifier, a time series of sentiment scores was computed and compared with the government approval rate obtained from a public telephone poll.

Method

First, a large corpus of government-related posts was retrieved from our data archive by using government-related keyword search (See Appendix for full list of keywords). In total, 66,468 posts were collected. A human rater was then assigned to classify these posts into two categories: negative sentiment and others. A "negative-sentiment" post is defined as the content that represents a criticism toward the Hong Kong government or social policy in Hong Kong or explicitly identifies at least one public body to which responsibility toward a social policy or a public incident is attributed. Once the classification was completed, a random quality check on a small sample collected from the corpus was conducted by the Principal Investigator.

Through the classification, 1,468 negative sentiment posts and 65,000 other posts were categorized. The Chinese contents of these posts were subsequently segmented (using NLPbamboo) and preprocessed by deleting stop words, punctuation, URL links, and some special symbols used in Twitter and Sina Weibo posts (e.g. @uuu, where uuu is a username; #hhh or #hhh#, where hhh is a hash tag; and emoticons). Finally, a part-of-speech tag was assigned to each space-separated Chinese word.
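The symbol-stripping part of this preprocessing can be sketched with regular expressions. The patterns below are assumptions written for illustration, not the rules actually used in the study, and the sketch does not reproduce the NLPbamboo segmentation, stop-word removal, or part-of-speech tagging. The sample string combines the post quoted earlier in this report with a hypothetical mention and hash tag.

```python
import re

# Illustrative patterns (assumptions, not the study's actual rules).
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@[\w-]+")                 # @uuu
HASHTAG = re.compile(r"#[^#\s]+#?")              # #hhh or #hhh#
EMOTICON = re.compile(r"\[[^\[\]\s]{1,8}\]")     # Weibo-style icons such as [泪]
PUNCT = re.compile(r"[^\w\s]")                   # any remaining punctuation

def clean_post(text):
    """Replace URLs, mentions, hash tags, emoticons and punctuation with spaces."""
    for pattern in (URL, MENTION, HASHTAG, EMOTICON, PUNCT):
        text = pattern.sub(" ", text)
    return " ".join(text.split())                # collapse repeated whitespace

sample = "今天是 Black Friday [泪] http://t.cn/hDqrl9 @someone #施政报告#"
cleaned = clean_post(sample)
```

The order of the substitutions matters: URLs are removed first so that the later punctuation pattern does not break them into fragments.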

Afterward, one thousand samples were randomly selected from each of the two categories of posts, i.e. negative and others, and the combined 2,000 samples were used as the training set for supervised machine learning. After weighting by inverse document frequency and deleting low-frequency terms, a term-document matrix was generated and used as the feature vector for machine learning.
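The construction of a weighted term-document matrix can be sketched as follows. This is a minimal illustration, assuming raw term counts multiplied by log(N/df) and a document-frequency threshold for dropping rare terms; the exact weighting and threshold used with the tm package are not specified in this report, and the function name, the `min_df` parameter, and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs, min_df=2):
    """docs: list of token lists. Returns (vocabulary, rows), one row per document."""
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Drop low-frequency terms (assumed threshold on document frequency).
    vocab = sorted(term for term, count in df.items() if count >= min_df)
    n_docs = len(docs)
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append([tf[term] * math.log(n_docs / df[term]) for term in vocab])
    return vocab, rows

# Hypothetical, already segmented posts.
docs = [
    ["政府", "房屋", "政策"],
    ["政府", "樓價", "房屋", "房屋"],
    ["政府", "天氣"],
]
vocab, matrix = tfidf_matrix(docs)
```

Here 政府 occurs in every document and therefore receives a weight of zero, while 房屋 is weighted by its count in each document; the number of columns kept corresponds to what the report calls the size of the feature vector.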

The performance of machine prediction was tested using feature vectors of different sizes. A support vector machine algorithm was deployed to build a model for sentiment classification and prediction. The remaining 468 negative sentiment posts and 468 randomly chosen posts from the category “others” were combined to form the test set. This process was repeated 10 times to compute a set of outcome indicators on average: accuracy, precision, and recall. Accuracy is calculated by dividing the number of correct predictions by the total number of items in the test set. Precision is the percentage of negative sentiment predictions that were correct. Recall is the percentage of negative sentiment test items that were correctly predicted.
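The three outcome indicators can be computed directly from the actual and predicted labels. In the sketch below, the function name and label strings are illustrative, and the counts are hypothetical: they were chosen so that the resulting rates are close to those reported for a feature vector of size 4,500, and they are not the study's actual predictions, which were averaged over 10 repetitions.

```python
def evaluate(actual, predicted, positive="negative"):
    """Accuracy, precision and recall, treating `positive` as the target class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)   # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)   # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)   # false negatives
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical outcome on a 936-post test set (468 negative, 468 others).
actual = ["negative"] * 468 + ["others"] * 468
predicted = (["negative"] * 375 + ["others"] * 93       # 375 of 468 negatives found
             + ["negative"] * 105 + ["others"] * 363)   # 105 others flagged as negative
scores = evaluate(actual, predicted)
```

With these counts the accuracy is 738/936, the precision is 375/480, and the recall is 375/468.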

We used R version 2.12.1 to conduct text mining analysis (library package tm) and support vector machine analysis (function ksvm in library package kernlab). The open-source Chinese language processing software NLPbamboo (http://code.google.com/p/nlpbamboo/) was deployed to undertake Chinese word segmentation and part-of-speech tagging.

Results

Sentiment Prediction

Results show that the accuracy of negative sentiment prediction was generally above 75% and reached 79% when the size of the feature vector was set to 4,500. With this feature vector size, the precision, i.e. the share of negative sentiment predictions that were correct, was 78%, and the recall, i.e. the share of negative sentiment test items that were correctly predicted, was 80%. The results indicate that our sentiment classifier yields a fair level of accuracy, which is comparable with some of the advanced techniques used in the field (Annett & Kondrak, 2008).

Table: Negative Sentiment Prediction Results

Size of feature vector | Accuracy | Precision | Recall
500 | 0.754 | 0.737 | 0.789
1000 | 0.766 | 0.750 | 0.798
1500 | 0.758 | 0.743 | 0.790
2000 | 0.762 | 0.750 | 0.788
2500 | 0.769 | 0.762 | 0.785
3000 | 0.778 | 0.767 | 0.798
3500 | 0.776 | 0.770 | 0.790
4000 | 0.774 | 0.764 | 0.793
4500 | 0.788 | 0.781 | 0.801
5000 | 0.779 | 0.773 | 0.790


Figure: Negative Sentiment Prediction Results. Accuracy, precision, and recall (vertical axis, 0.700 to 0.820) plotted against the size of the feature vector (horizontal axis, 500 to 5000).

Sentiment Score

Instead of predicting the sentiment of each post individually, we are interested in the overall sentiment of public opinion. We therefore deploy the sentiment classifier to evaluate a daily aggregate score. The advantage of this approach is that, with a sufficiently large sample, it potentially cancels out the random noise of the measurement, i.e. false positives and false negatives, and may obtain a better estimate (O’Connor, et al., 2010).

By deploying the above sentiment classifier and with a set of parameters (e.g. size of feature vector), we construct a sentiment score to indicate the extent to which negative sentiment contents are manifested in the Hong Kong online opinion. Within a time interval t, where t is any time interval: in this case 1 day, sentiment score in t is defined by the following equation.

$$\text{Sentiment Score}_t = -\log\left(\frac{\text{Number of posts with predicted negative sentiment in } t}{\text{Total number of posts in } t}\right)$$

To test the assumption that the aggregated score is able to minimize error, we repeatedly ran the test set 50 times and evaluated the sentiment score of each run. As there are equal numbers of positive and negative samples in the test set, i.e. 468 positive and 468 negative samples, the theoretical value of the score is -log(0.5) = 0.693. The figure shows the distribution of the 50 scores obtained from the classifier. It approximately follows a normal distribution, with a mean of 0.647 and a standard deviation of 0.03. Roughly speaking, the 95% confidence interval ranges from 0.587 to 0.707. In other words, the error should be within the range of -0.106 to 0.014.
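The arithmetic of this check can be retraced in a few lines. The sketch assumes the natural logarithm, which is consistent with -log(0.5) = 0.693, and approximates the 95% interval as the mean plus or minus two standard deviations; the function and variable names are illustrative.

```python
import math

def sentiment_score(num_negative, num_total):
    """Negative log of the share of posts predicted as negative (natural log assumed)."""
    return -math.log(num_negative / num_total)

theoretical = sentiment_score(468, 936)      # balanced test set, share = 0.5
mean, sd = 0.647, 0.03                       # mean and standard deviation over the 50 runs
low, high = mean - 2 * sd, mean + 2 * sd     # rough 95% interval
error_range = (low - theoretical, high - theoretical)
```

Because of the negative sign, a larger share of negative posts gives a lower score, so a falling score indicates more negative sentiment.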

Correlation between Phone Survey and Sentiment Score from Online Opinion

We seek to test the relationship between traditional way of collecting “public opinion”, say by using telephone survey, and our approach of evaluating public sentiment toward government. We obtained the phone survey results from the website of the Public Opinion Programme, The University of Hong Kong (in short HKUPOP, http://hkupop.hku.hk/). HKUPOP publishes polling results of “popularity figures of Chief Executive Donald Tsang and the HKSAR Government” regularly every month. We select the item percentage of “dissatisfaction rate of Hong Kong SAR Government performance” as the reference for comparison.

Starting from 11 July 2011 until now, we retrieved online posts from our archives of Twitter, Sina Weibo, Blogs and the three discussion forums by using government-related keywords search. The contents retrieved every day were inputted to the classifier and a sentiment score was computed.

The first HKUPOP phone survey result that was published after 11 July 2011 was on 25 July 2011. Thus we set a base point 100 to represent the sentiment score on 25 July 2011. All sentiment scores were adjusted: divided by the score on 25 July 2011 and multiplied by 100. By the same token, HKUPOP results were adjusted using same procedures.
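The rescaling to a base point of 100 can be sketched as below. The daily values and the function name are hypothetical and serve only to illustrate the procedure; the same function would apply to the HKUPOP series.

```python
def rebase(series, base_date):
    """Rescale a {date: value} series so that the value on base_date equals 100."""
    base = series[base_date]
    return {date: 100 * (value / base) for date, value in series.items()}

# Hypothetical daily sentiment scores around the base date of 25 July 2011.
scores = {"2011-07-24": 0.60, "2011-07-25": 0.64, "2011-07-26": 0.72}
indexed = rebase(scores, "2011-07-25")
```

After rescaling, both series share the same reference point, so only their relative movements since 25 July 2011 are compared.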


The above figure displays the time trend of the daily sentiment scores and its 3-day moving average line. The monthly phone survey results of the HKUPOP were marked on the graph in orange color (25 July, 23 August, and 20 September). The online version of this graph is available via the link http://research.jmsc.hku.hk/social/sscore/sscore1.html.
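A 3-day moving average of the daily scores can be computed as in the sketch below. The report does not state whether the window is trailing or centred; a trailing window (the current day and the two preceding days) is assumed here, and the daily values are hypothetical.

```python
def moving_average(values, window=3):
    """Trailing moving average; the first window-1 days have no value."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

daily = [96.0, 100.0, 104.0, 99.0, 103.0]    # hypothetical rescaled daily scores
smoothed = moving_average(daily)
```

For these values the smoothed series is 100.0, 101.0, 102.0, which shows how the average damps the day-to-day fluctuation of the raw scores.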

Within the relatively short period from July to September 2011, the moving average curve of the sentiment score seems largely to follow the time trend of the HKUPOP results, which moved slightly upward from July to September 2011. This preliminary finding suggests that the sentiment score tracks the time trend of the public telephone poll, but a longer observation period is needed to establish a longer-term correlation.


Case Study 1: Policy on subsidizing home ownership (政府資助市民置業)

This case study investigates online public opinion on a specific policy topic, in this case the policy on subsidizing home ownership (政府資助市民置業). We conducted an additional keyword search (the keywords are listed below) on the data previously collected using the government-related keyword search. The study period ran from the week starting on 5 September 2011 (hereafter the 5 September week) to the week starting on 24 October 2011 (hereafter the 24 October week). Time-trend charts of weekly total post counts, posts by source, and unique contributors are provided in the following pages. Major keywords in terms of term frequency and term frequency inverse document frequency are listed in table form over the timeline of the study period. Finally, online mentions of major government officers are presented in a time-trend chart. Figures and tables can be found in the following pages. The keywords used were: 資助市民置業, 資助房屋, 居屋, 居者有其屋, 置安心, 先租後買, 補地價, 白表, 綠表, 可租可買, 市區重建, 首次置業, 土地規劃, and 首期.

The figure below displays the weekly numbers of posts from the 5 September week to the 24 October week. The number of posts was approximately 100 per week, except for a spike in the 10 October week, in which 865 relevant posts were published. The spike was attributable to the new housing policy officially announced by the Chief Executive on 12 October 2011. Twitter and Hong Kong Discussion Forum were the two major sources of online posts containing the keywords. The chart showing the numbers of unique contributors can be found on the next page. Another chart shows the mentions of government officers during the study period. Donald Tsang was the most mentioned, especially in the week of the announcement of the Policy Address. Eva Cheng and Henry Tang followed as the second and third most mentioned.


Figure: Online Mentions of Government Officers (Policy on subsidizing home ownership). Frequency of mentions (vertical axis, 0 to 700) by date (horizontal axis), with one series for each of: 曾蔭權, 林瑞麟, 曾俊華, 黃仁龍, 孫明揚, 周一嶽, 曾德成, 張建宗, 林鄭月娥, 陳家強, 鄭汝樺, 中策組, 梁振英, 唐英年, 曾偉雄, 譚志源.


Major keywords used in each week (ranked by term frequency)

05-Sep | 12-Sep | 19-Sep | 26-Sep | 03-Oct | 10-Oct | 17-Oct | 24-Oct
梁振英 | 下一代 | 受訪者 | 梁振英 | 老百姓 | 房屋局 | 梁振英 | 王光亞
董建華 | 年薪局 | 梁振英 | 金管局 | 炎黄子孙 | 施政報 | 空置率 | 金管局
蔡加讚 | 有限公司 | 們瞓街 | 一直乾 | 孙中山 | 梁振英 | 大學生 | 申請法
年輕人 | 納稅人 | 落成量 | 回報率 | 平均化 | 唐英年 | 唐英年 | 唐英年
納稅人 | 房屋局 | 鄉議局 | 場變化 | 房地产 | 二千五百 | 吸纳量 | 林瑞麟
蔡志明 | 馬鞍山 | 勾地表 | 後填海 | 受訪者 | 申請人 | 青年人 | 強積金
規劃署 | 係醫院 | 房屋局 | 施政報 | 年輕人 | 平方呎 | 林瑞麟 | 自由市
參加者 | 啟德舊 | 百分點 | 星期五 | 另一方面 | 補地價 | 補地價 | 應讓市
大角嘴 | 大部分 | 任幫市 | 無能力 | 不劳而获 | 房屋署 | 全世界 | 問責制
工聯會 | 渡假式 | 地產商 | 留意市 | 了不起 | 鄭汝樺 | 黎智英 | 民主黨

Major keywords used in each week (ranked by term frequency inverse document frequency)

05-Sep | 12-Sep | 19-Sep | 26-Sep | 03-Oct | 10-Oct | 17-Oct | 24-Oct
施政報 | 縱容個特 | 阮次山 | 白中開 | 周永新 | 供應量 | 房屋局 | 壽臣山
絡紅人 | 酒店式 | 炒高樓價 | 高峰期 | 投資者 | 靈活性 | 南昌站 | 歌賦山
黨晤特 | 黨死攬死 | 現樓市 | 壽臣山 | 付不起 | 當時市 | 年轻人 | 銀行界
石礦場 | 焚化爐 | 打工仔 | 歌賦山 | 供应量 | 會負責 | 價明明 | 全世界
紀雲峰 | 骨灰龕 | 大部份 | 納稅人 | 宣传家 | 黃毓民 | 大出血 | 年輕人
下一代 | 身體力行 | 殖民地 | 年薪局 | 无可避免 | 王文彥 | 應永記 | 南昌站
免稅額 | 高埔村 | 官迫民 | 自由市 | 用不着 | 梁國雄 | 金正日 | 鄉議局
對示威 | 一個月 | 必需品 | 振英哥 | 经济学派 | 六千五百 | 蕭若元 | 為什麼
汽車展 | 上流社 | 發展商 | 天水圍 | 输身家 | 禁售期 | 權合約 | 丁屋後
愈演愈烈 | 救世主 | 梁展文 | 佛光街 | 陶渊明 | 實用面 | 現時市 | 候損人

There are a number of observations:

• CY Leung (梁振英) was repeatedly mentioned, as he was a key person in developing housing policy and a potential candidate in the Chief Executive election.

• Young people (年輕人) and the next generation (下一代) were frequently found in the contents. This signals a central concern of housing policy: the housing needs of the younger generation.

• After the announcement of the Policy Address, the issue of land premium (補地價) was vigorously debated in society, and this keyword was found in the top 10 list of keywords in two consecutive weeks, the 10 October week and the 17 October week. Supply (供應量) and flexibility (靈活性) were two other keywords that indicate the public expectation of the government's role in supplying land to the market.


Social Network Analysis 香港政府新聞網 | URL: http://news.gov.hk | 關注/followers:418642 | 粉絲 /friends:26 | 微博/posts:1224 | | | 香港 其他 | member since 2010- 07-05 00:00:00

今年的#施政报告#提出复建居者有其屋的新政策。 新计划的对象是每月收入低于30,000元,主要是属 首次置业人士的家庭。计划提供实用面积约400 - 500平方呎的单位,以可负担的楼价出售;预期首批 单位可望于2014或2015年预售。 http://t.cn/aFTebA

Posted on 2011-10-12 12:12:54

This message was posted by 香港政府新聞網 on 2011-10-12 12:12:54, shortly after the announcement of the Chief Executive’s Policy Address 2011. 香港政府新聞網 is operated by the Information Services Department, HKSAR government. Its Sina Weibo was followed by 418,642 microbloggers at the time of report writing. This post concerned one of the major measures proposed in the Policy Address, the new housing policy. It was reposted by at least 30 unique Weibo users, none of whose reposts, however, was reposted further. A figure showing the pathway of information diffusion and the top 10 reposters can be found on the next page.


Top 10 reposters

User Name (Province) | Time | Betweenness | Num_of_Followers | Out_degree | In_degree
GraceYim | 10/12/2011 12:16 | 0 | 24101 | 0 | 0
被賈寶菇迷倒的Laruku貓奴 | 10/12/2011 12:16 | 0 | 378 | 0 | 0
P仔敏 | 10/12/2011 12:17 | 0 | 210 | 0 | 0
左南ZuoNan | 10/12/2011 12:17 | 0 | 1766 | 0 | 0
葛莉絲_W | 10/12/2011 12:19 | 0 | 109 | 0 | 0
TlnK0 | 10/12/2011 12:21 | 0 | 58 | 0 | 0
西大行健非官方微博 | 10/12/2011 12:21 | 0 | 8808 | 0 | 0
Halley哈利 | 10/12/2011 12:21 | 0 | 283 | 0 | 0
京畿白楊樹HK_Uncle | 10/12/2011 12:22 | 0 | 1048 | 0 | 0
joejoecool | 10/12/2011 12:26 | 0 | 35 | 0 | 0

Network Characteristics:
Number of nodes: 30
Number of edges: 1
Density: 0.001149425
Average degree: 0.066666667
Diameter: 1
Average path length: 1
Global Cluster Coefficient: NA
Power law exponent coefficient: 24.55249337


Case Study 2: Policy on mitigating income inequality (改善貧富差距)

This case study investigates online public opinion on a specific policy topic, in this case the policy on mitigating income inequality (改善貧富差距). We conducted an additional keyword search (the keywords are listed below) on the data previously collected using the government-related keyword search. The study period ran from the week starting on 5 September 2011 (hereafter the 5 September week) to the week starting on 24 October 2011 (hereafter the 24 October week). Time-trend charts of weekly total post counts, posts by source, and unique contributors are provided in the following pages. Major keywords in terms of term frequency and term frequency inverse document frequency are listed in table form over the timeline of the study period. Finally, online mentions of major government officers are presented in a time-trend chart. Figures and tables can be found in the following pages. The keywords used were: 貧富差距, 最低工資, 關愛基金, 跨區交通津貼, 綜援, 在職貧窮, 社會保障, 失業, 社會流動, 階級流動, 跨代貧窮, 扶貧, 退休保障, and 再培訓.

The figure below displays the weekly numbers of posts from the 5 September week to the 24 October week. The weekly number of posts ranged from 352 in the 24 October week to 686 in the 10 October week. Even before the release of the Policy Address on 12 October, 631 and 501 relevant posts were published in the 26 September week and the 3 October week respectively. Again, Twitter and the Hong Kong discussion forums were the two major sources of online posts containing the keywords. The chart showing the numbers of unique contributors can be found on the next page. Another chart shows the mentions of government officers during the study period. Donald Tsang was the most mentioned, especially in the week of the announcement of the Policy Address. Matthew Cheung followed as the second most mentioned officer.


Figure: Online Mentions of Government Officers (Policy on mitigating income inequality). Frequency of mentions (vertical axis, 0 to 250) by date (horizontal axis), with one series for each of: 曾蔭權, 林瑞麟, 曾俊華, 黃仁龍, 孫明揚, 周一嶽, 曾德成, 張建宗, 林鄭月娥, 陳家強, 鄭汝樺, 中策組, 梁振英, 唐英年, 曾偉雄, 譚志源, 李少光, 俞宗怡, 邱騰華, 蘇錦樑.


Major keywords used in each week (ranked by term frequency)

05-Sep | 12-Sep | 19-Sep | 26-Sep | 03-Oct | 10-Oct | 17-Oct | 24-Oct
尖沙咀 | 失業率 | 失業率 | 港人治港 | 露宿者 | 失業率 | 失業率 | 港交所
招募月 | 青少年 | 梁振英 | 資本家 | 星期日 | 黃毓民 | 梁振英 | 失業率
新來港 | 該區市 | 示威者 | 菲律賓 | 活生生 | 十二月 | 申請人 | 青少年
申請人 | 束手無策 | 王嘉盈 | 新來港 | 三角褲 | 十一月 | 唐英年 | 投資者
低收入 | 永久性 | 唐英年 | 基本法 | 使用率 | 示威者 | 華爾街 | 強積金
失業率 | 刑事化 | 福利局 | 反對派 | 南昌站 | 青少年 | 候選人 | 政府部
執行委 | 應鼓勵 | 青少年 | 朱婆婆 | 對這群 | 強積金 | 林瑞麟 | 刑事化
裁判官 | 政府部 | 不足率 | 林瑞麟 | 有所不知 | 永久性 | 百萬富翁 | 應鼓勵
參加者 | 門負責 | 召集人 | 普選制 | 東帶路 | 一次性 | 刑事化 | 束手無策
綜援金 | 唐英年 | 政府部 | 殖民地 | 深旺道 | 施政報 | 政府部 | 永久性

Major keywords used in each week (ranked by term frequency inverse document frequency)

05-Sep | 12-Sep | 19-Sep | 26-Sep | 03-Oct | 10-Oct | 17-Oct | 24-Oct
羅致光 | 尖沙咀 | 通脹率 | 代理商 | 脅途人 | 華爾街 | 消費者 | 參與管理者
出版商 | 招募月 | 華爾街 | 層壓式 | 該區露宿者 | 劉兆佳 | 該區市 | 極大化
投注站 | 中小型 | 有生之年 | 滲透進來 | 連翔道 | 內地人 | 門負責 | 出版商
平理性 | 福利局 | 永久性 | 眾所周知 | 關報道 | 政府部 | 束手無策 | 教科書
營飯堂 | 董建華 | 應鼓勵 | 失業率 | 關注露宿者 | 低收入 | 青少年 | 股票市
董建華 | 加拿大 | 該區市 | 申請人 | 無家可歸 | 應鼓勵 | 應鼓勵 | 該區市
永久性 | 梁振英 | 門負責 | 直銷商 | 基本法 | 該區市 | 百分之一 | 門負責
青少年 | 下一代 | 刑事化 | 黎廣德 | 菲律賓 | 門負責 | 一次性 | 關當局
證來港 | 綜援金 | 束手無策 | 青少年 | 周永新 | 刑事化 | 銷貨值 | 一次性
嫁給港 | 傭居港 | 百分點 | 納稅人 | 示威者 | 束手無策 | 永久性 | 梁振英

There are a number of observations:

• Unemployment rate (失業率) is a major keyword used throughout the study period, despite a recent drop in the unemployment rate in Hong Kong.

• “Younger people” (青少年) is the second major key term used in the text. It may be related to the high unemployment rate among the younger generation.

• “束手無策” (at a loss for a solution) is repeatedly found, seemingly indicating the public perception that the income inequality problem in Hong Kong is difficult to resolve.

• After the announcement of the Policy Address, “one-off measure” (一次性) is used in three consecutive weeks from the 10 October week to the 24 October week, reflecting the fact that the Hong Kong government’s anti-poverty measures are usually made on a one-off basis.


Social Network Analysis 香港政府新聞網 | URL: http://news.gov.hk | 關注/followers:418642 | 粉絲 /friends:26 | 微博/posts:1224 | | | 香港 其他 | member since 2010- 07-05 00:00:00

关爱基金新来港人士津贴计划今年10月3日至2012年6月 30日期间接受申请,合资格人士可申请领取6,000元津贴 。计划旨在协助合资格的低收入家庭新来港定居的成年 成员适应和融入香港社会,预计约有23万人受惠。有关 津贴由11月起陆续发放。http://t.cn/a1Buys 关爱基金网页 http://t.cn/a1BuUP [translate] Posted on 2011-09-08 19:01:03

This message was posted by 香港政府新聞網 on 2011-09-08 19:01:03. 香港政府新聞網 is operated by the Information Services Department, HKSAR government. Its Sina Weibo was followed by 418,642 microbloggers at the time of report writing. This post concerned the opening of applications for the “Community Care Fund”, which was one of the major measures of the previous year’s Policy Address. It was reposted by at least 59 unique Weibo users, none of whose reposts, however, was reposted further. A figure showing the pathway of information diffusion and the top 10 reposters can be found on the next page.


Top 10 reposters

User Name (Province) | Time | Betweenness | Num_of_Followers | Out_degree | In_degree
麥田怪圈Miracle (HK) | 9/8/11 19:02 | 0 | 500 | 0 | 0
柏钦 (Others) | 9/8/11 19:02 | 0 | 65 | 0 | 0
Ryusue (Sichuan) | 9/8/11 19:04 | 0 | 313 | 0 | 0
李冰_anitalee (Guangdong) | 9/8/11 19:04 | 0 | 114 | 0 | 0
翟Paul青 (Shanghai) | 9/8/11 19:04 | 0 | 84 | 0 | 0
我不是狐狸晶 (Guangdong) | 9/8/11 19:05 | 0 | 67 | 0 | 0
jenny---wong (Guangdong) | 9/8/11 19:08 | 0 | 21 | 0 | 0
艾特我你就大鑊 (HK) | 9/8/11 19:10 | 0 | 167 | 0 | 0
Lorelei_Kwan (HK) | 9/8/11 19:11 | 0 | 75 | 0 | 0
凡間一塵埃wyn (HK) | 9/8/11 19:12 | 0 | 323 | 0 | 0

Network Characteristics:
Number of nodes: 59
Number of edges: 1
Density: 0.000292227
Average degree: 0.033898305
Diameter: 1
Average path length: 1
Global Cluster Coefficient: NA
Power law exponent coefficient: 24.55249337


Sentiment Analysis (Case study 1 and 2)

Figure: Sentiment score (vertical axis, 0 to 300) by date (horizontal axis, weekly from 9/5/2011 to 10/24/2011), with two series: policy on subsidizing home ownership and policy on mitigating income inequality.

The above figure shows the changes over time in the sentiment scores relating to the policies on subsidizing home ownership (blue line) and on mitigating income inequality (pink line) from September 5 to October 24, 2011. Both scores fell after the announcement of the Chief Executive’s Policy Address on October 12, 2011. Sentiment on the policy on subsidizing home ownership (blue line) declined drastically, whereas sentiment on the policy on mitigating income inequality (pink line) decreased only mildly. This may be attributable to the substance of the Policy Address 2011, in which housing policy was one of the main focuses (e.g. new and “My Home Purchase Plan”), whereas measures to support people in need, i.e. in effect alleviating income inequality, received relatively less emphasis among its major policy areas. As housing policy has been a key public concern, many people discussed the topic offline and online, criticizing the government’s policy and generating posts with negative sentiment. After that week (October 10-16, 2011), however, both scores started to climb and reached their highest level in the last week of October.


Conclusions and Recommendations

As a growing body of Hong Kong citizens, particularly the younger generation, use the Internet as a channel to voice opinions on a variety of subjects, ranging from party politics to social policy debate, policy makers urgently need to investigate the characteristics of online user-generated comments and to establish a mechanism for collecting online information in a systematic manner.

The first primary aim of this study, to address the question of data collection, was successfully achieved. As can be seen in this report, the project websites, and the email update reports sent out every day, online public opinion data in Hong Kong have been systematically collected, stored, and subsequently analyzed quantitatively and qualitatively.

This study deploys a number of analytical tools, which include time-trend statistics, keyword analysis, social network analysis, and sentiment analysis. Time-trend statistics provide an overall macro picture of the quantity of online public opinion on a daily basis over a longer time period. Various types of statistics, for example the number of posts, post counts by source, and the number of unique contributors, are shown to indicate the development of a few controversial local public issues. Keyword analysis assists our understanding of the contents of online opinion in a quantitative manner. Term frequency and term frequency inverse document frequency analyses are the two approaches used to identify “important” keywords or topics.

As social media is by nature a network of information flow, social network analysis helps us examine the information characteristics from a “network perspective”, in which the pattern of information diffusion is visualized and quantified and “opinion leaders” can eventually be identified. Finally, by using a number of text mining techniques, sentiment analysis is conducted to assign a “sentiment score” automatically to online sentiment toward the government on a daily basis. The performance of the sentiment classifier is found to be comparable with that of some commonly used techniques in the field. There is also evidence that the trajectory of the “sentiment score” time trend moves closely with that of the approval rate of the Hong Kong government as obtained from the telephone survey, suggesting that “sentiment scores” may serve as a fast-track “proxy” indicator of overall public opinion toward the Hong Kong government.

Last but not least, the case studies of the “policy on subsidizing home ownership” and the “policy on mitigating income inequality” were used as examples to illustrate how online public opinion can be analyzed by deploying all of the above analytical tools.

Significant findings of this study are highlighted as follows:

1) Identifying the dynamics of online opinion

As found in the results, the amount of online public opinion may change rapidly and public sentiment may fluctuate widely from day to day. Generally speaking, however, our data show that the number of posts increases and the sentiment score decreases when public outcry over controversial issues occurs. These numbers are available the next morning, when the daily email update report is sent. This analysis helps alert relevant departments so that they can respond promptly to public reaction and negative sentiment.

2) Recognizing emerging social agenda

Using various methods to determine term ranking, major keywords are extracted from online contents, for example emerging topics that appear frequently now but were rarely found in the past. This finding helps us identify up-and-coming social agendas that may shortly become major social concerns and may require policy makers’ close attention.

3) Finding “opinion leaders”

In the traditional sense, opinion leaders are those who speak regularly, comment on social issues publicly, and appear repeatedly in radio and television programs or in newspaper and magazine stories. As found in this study, “opinion leaders” in the digital sphere may be understood as those who serve as an information hub, located at a node through which the shortest paths between other pairs of nodes must pass, and who act as a gateway of information diffusion by means of a series of message repostings (or retweetings). Using this technique, “online opinion leaders” in this sense are named and identified. This analysis helps various stakeholders in government departments locate key “public figures” in their corresponding policy areas.

4) Tracking “public opinion”

Based on the sentiment score developed in this study, preliminary evidence suggests that the sentiment score may be able to keep track of the approval rate of the Hong Kong government as measured by phone survey. This tracking is done on a daily basis and in advance of the announcement of phone survey result. It may therefore serve as a fast-track “proxy” indicator of the overall public opinion toward Hong Kong government.

5) Associating with policy discussion

The results of the two case studies demonstrate the ways in which online opinion may be incorporated into the policy discussion. The identified keywords may help establish new policy agenda that government officers may have ignored at the formative process of policy making. The analysis on the mentions of government officers may be able to discover linkage between government departments in respect to a specific social policy.

There are a few challenges that may affect the implications of the findings of this project.

1) Biased sample

Cumulative evidence has shown that online content contributors are a biased sample and are not representative of the overall population. The majority of them belong to younger age groups. Our previous research also finds that younger people with a “critical” political orientation, such as embracing democratic and post-materialist values, are more likely to contribute content to online platforms.

2) Chilling effect

Because opinion is dominated by the group of “critical citizens” described in Point 1, people with alternative or different views may choose not to get involved in online discussion. In theory, this is what is called the “spiral of silence” (Noelle-Neumann, 1974). Consequently, it reinforces the existing “winner-takes-all” status quo, in which individuals with alternative opinions are apt to keep silent in the online sphere.

3) Unorganized and unstructured opinion and subtle use of language

This study uses computational methods to extract essential data from online textual content and present the information analytically. All routine procedures run automatically without human input. However, online content is largely unorganized and unstructured. Subtle components of language use, e.g. hidden linguistic meaning, sarcasm or understatement, are extremely hard, if not impossible, to process and uncover by computational methods alone.

4) Data and technical limitations

Some data collection procedures rely primarily on third-party APIs (e.g. Twitter, Facebook, and Sina Weibo), which offer no guarantee of service or data quality. Because of limited computer hardware resources, our system may not be powerful enough to gather full sets of online data and full lists of online users within the sampling period. The crawling algorithms used in this study may be sub-optimal in terms of performance and reliability. A few operational assumptions, e.g. the definition of “Hong Kong Internet users”, were made during the design process and may affect the results.

As a result of these findings, a few recommendations are made:

1. Establishing an online public opinion tracking system

Our findings strongly suggest that an online public opinion tracking system can help policy makers and the public keep track of online discussion on various social and political topics, and understand changes, especially short-term changes, in citizens' sentiment toward public governance and social policy. Such a system could complement the phone-survey type of public opinion program that is commonly deployed at present. The tracking system should be set up and supported with ongoing financial and human resources. Technical and maintenance work and routine data analysis could be partly outsourced to relevant commercial or academic sectors. However, we recommend that the core daily analysis of changes in public opinion, and any further in-depth investigation, be undertaken by the existing public opinion research units within the government and/or by academic scholars with relevant media research and text mining experience.

2. Cross-departmental online public engagement policy

Amid gradual political development in Hong Kong, citizens’ use of new media for political participation poses a challenge to the government’s existing public engagement system, which has repeatedly been found to fail to consolidate online opinion and satisfy public demand. The Hong Kong government has long recognized the importance of developing an online public engagement policy (Chief Executive of Hong Kong, 2008), but so far a well-established, cross-departmental policy direction remains absent. If the government fails to establish institutional procedures and arrangements to address growing public demand for public engagement, both online and offline, it may undermine effective public governance and create additional social disengagement and cynicism (Dahlgren, 2009). At worst, this would become a major hurdle for the long-term development of democracy in Hong Kong.

3. Proactive response to online negative sentiment

The government’s current practice of responding publicly to negative public sentiment should be reviewed and strengthened. One major area for such a review should be how to make full use of the opportunities offered by digital and social media to provide prompt and proactive responses to the general public. A variety of channels are available, for example microblogs, discussion forums, and social networking sites. A good example from mainland China is the Shanghai Metro’s rapidly posted Sina Weibo apology, which went online on the night of September 27, 2011, the same day an underground train accident happened (in the daytime). The apology was widely circulated via microblogs and became the second most reposted message that day. It may subsequently have helped mitigate the damage to the Shanghai Metro’s public image.

4. Promoting culture of online policy deliberation

Individuals' value systems, especially those of the younger generations (the so-called “post-80s” or “post-90s”), have been undergoing a drastic shift toward more post-materialistic, democratic, and self-expressive values and a demand for high-quality governance. This shift, also evidenced in many democratic countries, is known as the growth of “critical citizens” (Norris, 1999). The trend results in increasing demand from citizens for public deliberation, a process through which social and policy issues are debated, discussed, and finally agreed upon among governments, social stakeholders, and interest groups, usually through an established mechanism with a set of defined procedures. Online media is widely expected to be an important platform for public deliberation (Coleman & Gotze, 2001). But the goal of such online deliberation cannot be achieved overnight. It should be supported by improvements in citizens’ media literacy and democratic literacy, as well as by the government’s open social policy and transparent information policy.

5. Researching online behavior and new media use

Besides online public opinion research, there is an urgent need for policymakers to investigate the characteristics, value systems, patterns of offline and online public and political engagement and participation, and offline and online media use of Hong Kong citizens, especially the younger generation and politically disengaged and disadvantaged individuals, and thus to develop citizen-centric public engagement policy strategies. These research questions are essential for the development of public engagement policy. They serve as a baseline for understanding the characteristics of new media use and political participation among Hong Kong citizens and create a reference for future longitudinal comparative studies. The findings should be of interest to academics, policy makers and political practitioners. The results should have a long-term and substantial impact on the ways in which the government engages with citizens for public deliberation via offline or online platforms.

References

Abbasi, A., Chen, H., & Salem, A. (2008). Sentiment analysis in multiple languages: Feature selection for opinion classification in Web forums. ACM Trans. Inf. Syst., 26(3), 1-34. doi: 10.1145/1361684.1361685

Annett, M., & Kondrak, G. (2008). A comparison of sentiment analysis techniques: polarizing movie blogs. Paper presented at the Proceedings of the Canadian Society for computational studies of intelligence, 21st conference on Advances in artificial intelligence, Windsor, Canada.

Census and Statistics Department. (2009). Hong Kong as an Information Society. Hong Kong: Census and Statistics Department.

Chief Executive of Hong Kong. (2008). The 2008-09 Policy Address - Embracing New Challenges. Hong Kong: Retrieved from http://www.policyaddress.gov.hk/08-09/index.html.

Chin, A., & Chignell, M. (2006). A social hypertext model for finding community in blogs. Paper presented at the Proceedings of the seventeenth conference on Hypertext and hypermedia, Odense, Denmark.

City University of Hong Kong. (2009). Internet Use in Hong Kong: The 2008 Annual Survey Report. Hong Kong: City University of Hong Kong.

Coleman, S., & Gotze, J. (2001). Bowling Together: Online Public Engagement in Policy Deliberation. London: Hansard Society.

Dahlgren, P. (2009). Media and political engagement: citizens, communication, and democracy. Cambridge; New York: Cambridge University Press.

Efron, M. (2011). Information search and retrieval in microblogs. Journal of the American Society for Information Science and Technology, 62(6), 996-1008. doi: 10.1002/asi.21512

Freeman, L. C. (1978). Centrality in social networks: conceptual clarification. Social Networks, 1(3), 215-239. doi: 10.1016/0378-8733(78)90021-7

Hindman, M. S. (2009). The myth of digital democracy. Princeton: Princeton University Press.

Lai, C. (2009). A new generation of rebels who have found their cause, South China Morning Post, p. EDT4.

Lewis, T. G. (2009). Network science: theory and practice. Hoboken, N.J.: John Wiley & Sons.

Noelle-Neumann, E. (1974). The Spiral of Silence: A Theory of Public Opinion. Journal of Communication, 24(2), 43-51. doi: 10.1111/j.1460-2466.1974.tb00367.x

Norris, P. (1999). Critical citizens: global support for democratic government. Oxford: Oxford University Press.

O’Connor, B., Balasubramanyan, R., Routledge, B. R., & Smith, N. A. (2010). From tweets to polls: Linking text sentiment to public opinion time series. Paper presented at the Fourth International AAAI Conference on Weblogs and Social Media.

Organisation for Economic Co-operation and Development. (2007a). Measuring user-created content: Implication for the "ICT access and use by households and individuals" surveys. (DSTI/ICCP/IIS(2207)3/FINAL).

Organisation for Economic Co-operation and Development. (2007b). Participative Web and user-created content: Web 2.0, wikis and social networking. Retrieved from www.sourceoecd.org/scienceIT/9789264037465

Pak, A., & Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion mining. Proceedings of LREC 2010.

Smith, A., Schlozman, K. L., Verba, S., & Brady, H. (2009). The Internet and Civic Engagement. Washington, D.C.: Pew Internet & American Life Project.

Synovate. (2010). Synovate Media Atlas Hong Kong. Retrieved from http://www.synovate.com/news/article/2010/03/lifestyles-and-consumption-habits-of-four-generations-of-hong-kong-consumers-revealed-by-synovate.html

Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406-418. doi: 10.1002/asi.21462

Wasserman, S., & Faust, K. (1994). Social network analysis: methods and applications. Cambridge: Cambridge University Press.

Appendix

Appendix A: Research Team

Role: Principal Investigator
Name: Dr. King-wa Fu
Division of Work:
• Overseeing the whole project
• Developing research strategy and methodology
• Supervising research staff
• Coordinating with the funding body
• Resource and budget control
• Writing up reports

Role: Co-Investigator
Name: Dr. Michael Chau
Division of Work:
• Leveraging his previous research and knowledge on data mining, analysis of web-based content and conducting social network analysis
• Developing research strategy and methodology
• Writing up reports

Role: Developer
Name: Mr. Cédric Sam
Division of Work:
• Assisting PI and Co-I in research development of this project
• System implementation, integration, and computer programming
• Procurement of computer hardware and software
• Developing pilot system and maintaining the service
• Writing up reports

Appendix B: List of Hong Kong politics-related keywords (118 keywords in total)

范徐麗泰 梁振英 特區政府 人大釋法

中聯辦 聯繫匯率 比例代表制 功能組別

動議辯論 政府帳目委員會 法案委員會 私人條例草案

申訴專員 內務委員會 分組點票 人大常委

行政主導 高度自治 港人治港 廉政公署

基本法 審計署 屋宇署 政府統計處

香港海關 環保署 食環署 衞生署

金管局 警務處 房屋署 入境處

政府新聞處 稅務局 勞工處 地政總署

法援署 康文署 規劃署 香港電台

選舉事務處 工業署 運輸署 一國兩制

中央政策組 保安局 俞宗怡 公務員事務局

利益輸送 功能組別 勞工及福利局 區議會

反對派 周一嶽 唐英年 商務及經濟發展局

單程證 地產霸權 孫明揚 官商勾結

建制派 張建宗 律政司司長 政制及內地事務局

政務司 政務司司長 教育局 曾俊華

曾偉雄 曾德成 曾蔭權 李少光

林瑞麟 林鄭月娥 民主派 民建聯

民政事務局 港府 湯顯明 特首

環境局 社民連 社署 立法會

終審法院 綜援 職工盟 蘇錦樑

行政會議 行政長官 財政司 財政司司長

財爺 財經事務及庫務局 貧富懸殊 運輸及房屋局

邱騰華 鄧國斌 鄭汝樺 醫管局

關愛基金 陳家強 食物及衞生局 黃仁龍

人民力量 公民黨 社民連 職工盟

街工 新同盟 四五行動 自由黨

經濟動力 新民黨 民建聯 工聯會

民主黨 民協

Appendix C: List of Hong Kong government-related keywords (99 keywords in total)

范徐麗泰 稅務局 曾俊華 梁振英 勞工處 曾偉雄 特區政府 地政總署 曾德成 人大釋法 法援署 曾蔭權 中聯辦 康文署 李少光 聯繫匯率 規劃署 林瑞麟 比例代表制 香港電台 林鄭月娥 功能組別 選舉事務處 民政事務局 動議辯論 工業署 港府 政府帳目委員會 運輸署 湯顯明 法案委員會 一國兩制 特首 私人條例草案 中央政策組 環境局 申訴專員 保安局 社署 內務委員會 俞宗怡 立法會 分組點票 公務員事務局 終審法院 人大常委 利益輸送 綜援 行政主導 功能組別 蘇錦樑 高度自治 勞工及福利局 行政會議 港人治港 區議會 行政長官 廉政公署 周一嶽 財政司 基本法 唐英年 財政司司長 審計署 商務及經濟發展局 財爺 屋宇署 單程證 財經事務及庫務局 政府統計處 地產霸權 貧富懸殊 香港海關 孫明揚 運輸及房屋局 環保署 官商勾結 邱騰華 食環署 建制派 鄧國斌 衞生署 張建宗 鄭汝樺 金管局 律政司司長 醫管局 警務處 政制及內地事務局 關愛基金 房屋署 政務司 陳家強 入境處 政務司司長 食物及衞生局 政府新聞處 教育局 黃仁龍

Appendix D:

Explanation note for the daily email alert

http://research.jmsc.hku.hk/social/reports.py/general/

Explanation note for the time trend website

http://research.jmsc.hku.hk/social/sscore/sscore1.html

Explanation note for the search engines

http://research.jmsc.hku.hk/social/search.py/hkforums

http://research.jmsc.hku.hk/social/search.py/sinaweibo
