Making sense of “Big Weibo Data”: a data mining approach

Dr. K. W. Fu
Journalism and Media Studies Centre, The University of Hong Kong

Department of Media and Communication, City University of Hong Kong
29 October 2012

• Sina Weibo (新浪微博) – the biggest microblogging site in China
• Basic statistics
– Since August 2009
– Registered users: >300 million
– Authenticated users: 34k
– # of Weibo entries: 3 million per day or 40 per second
– Source: 中国微博元年市场白皮书 (2010)

Weibo and Public Incidents

Qian Yunhui incident (2010)

My Father is Li Gang (2010)

Yihuang Self-Immolation Incident (2010)

Train Crash (2011)

Weibo and NPC/CPPCC meetings

Attempted Suicide

• On Feb 23, 2011, at 9:56pm, a photo of a cut wrist was posted on Sina Weibo with the message: “Today, I am returned back to you. That’s all. You make me feel like falling from heaven to hell. Now I get it.”

Reality Check for the Chinese Microblog Space: a random sampling approach

• We randomly generated possible 10-digit account codes between 1000000000 and the maximum (in this case, 4294917290) and tested their validity one by one with a computer program.
• If a code belonged to a valid account, the account’s timeline was downloaded and saved.
• Data were collected on January 25 and 26, 2012: we randomly generated 130,790 codes, of which 120,788 were invalid and 2 returned an API error, leaving 10,000 valid account codes.
• We revisited all accounts on February 17, 2012, to confirm that they still existed.

Genders and Location
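The random sampling procedure described under "Reality Check for the Chinese Microblog Space" can be sketched roughly as follows. This is a minimal illustration, not the program used in the study: the `is_valid` callback is a hypothetical stand-in for the API lookup, the toy `fake_lookup` rule is invented, and skipping codes that were already tested is an assumption the slides do not state.

```python
import random

ID_MIN = 1_000_000_000   # lower bound quoted in the slides
ID_MAX = 4_294_917_290   # maximum account code quoted in the slides


def sample_valid_accounts(is_valid, target, rng):
    """Draw random 10-digit codes until `target` valid accounts are found.

    `is_valid` is a hypothetical callback standing in for the API lookup:
    it returns True or False, or raises RuntimeError on an API error.
    """
    tested = set()
    valid, n_invalid, n_error = [], 0, 0
    while len(valid) < target:
        code = rng.randint(ID_MIN, ID_MAX)
        if code in tested:          # assumption: never test a code twice
            continue
        tested.add(code)
        try:
            ok = is_valid(code)
        except RuntimeError:
            n_error += 1
            continue
        if ok:
            valid.append(code)
        else:
            n_invalid += 1
    return valid, n_invalid, n_error


def fake_lookup(code):
    # Toy rule (invented): about 1 code in 13 is treated as a valid account.
    return code % 13 == 0


valid, n_invalid, n_error = sample_valid_accounts(fake_lookup, 100, random.Random(0))
```

In the run reported on the slide, 10,000 of the 130,790 generated codes were valid, a hit rate of about 7.6%, which is why the toy lookup accepts roughly one code in thirteen.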

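| Characteristics | Categories | Frequency (%) |
| --- | --- | --- |
| Self-reported gender | Male | 61.0 |
| | Female | 39.0 |
| Self-reported location | Guangdong | 14.1 |
| | Jiangsu | 5.5 |
| | Beijing | 5.2 |
| | (category name missing in source) | 5.0 |
| | Shandong | 4.4 |
| | Shanghai | 4.1 |
| | Hubei | 3.9 |
| | Henan | 3.7 |
| | Fujian | 3.3 |
| | International | 1.1 |
| | Other provinces | 35.3 |
| | Others | 14.2 |
| | Missing | 0.2 |

Followers, friends counts, days since account created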

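| Characteristics | Categories | Frequency (%) |
| --- | --- | --- |
| Followers count | 0 | 51.1 |
| | 1-4 | 24.8 |
| | 5-19 | 12.5 |
| | 20-99 | 7.7 |
| | 100-499 | 3.2 |
| | 500-999 | 0.4 |
| | 1000 and above | 0.3 |
| Friends count | 0 | 32.9 |
| | 1-4 | 19.0 |
| | 5-19 | 18.1 |
| | 20-99 | 24.2 |
| | 100-499 | 5.2 |
| | 500-999 | 0.5 |
| | 1000 and above | 0.2 |
| Days since account created | 0-6 days | 1.1 |
| | 7-29 days | 4.5 |
| | 30-59 days | 7.2 |
| | 60-89 days | 7.5 |
| | 90-179 days | 17.8 |
| | 180-364 days | 35.4 |
| | 365-729 days | 25.0 |
| | 730 days and above | 1.5 |

Total count of statuses and posts in the past 7 days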

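| Characteristics | Categories | Frequency (%) |
| --- | --- | --- |
| Total count of statuses | 0 | 56.8 |
| | 1-4 | 22.1 |
| | 5-19 | 8.6 |
| | 20-99 | 6.5 |
| | 100-499 | 3.9 |
| | 500-999 | 1.2 |
| | 1000 and above | 0.8 |
| Microblogs posted in the past 7 days | 0 | 92.4 |
| | 1 | 2.1 |
| | 2-4 | 2.2 |
| | 5-9 | 1.3 |
| | 10-14 | 0.6 |
| | 15-19 | 0.4 |
| | 20-39 | 0.7 |
| | 40-59 | 0.2 |
| | 60 and above | 0.2 |

Original posts and unique reposts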

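| Characteristics | Categories | Frequency (%) |
| --- | --- | --- |
| Original posts in the past 7 days | 0 | 94.3 |
| | 1 | 2.0 |
| | 2-4 | 1.8 |
| | 5-9 | 1.0 |
| | 10-14 | 0.3 |
| | 15-19 | 0.3 |
| | 20 and above | 0.2 |
| Unique reposts in the past 7 days | 0 | 95.0 |
| | 1 | 1.6 |
| | 2-4 | 1.6 |
| | 5-9 | 0.8 |
| | 10-14 | 0.4 |
| | 15-19 | 0.2 |
| | 20-39 | 0.3 |
| | 40 and above | 0.2 |

Repost and Comment counts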

| Characteristics | Categories | Original posts (N=3,235), frequency (%) |
| --- | --- | --- |
| Frequency of repost | 0 | 89.5 |
| | 1 | 5.9 |
| | 2-4 | 2.9 |
| | 5-9 | 0.9 |
| | 10-14 | 0.2 |
| | 15-19 | 0.1 |
| | 20 and above | 0.5 |
| Comment count | 0 | 65.6 |
| | 1 | 8.2 |
| | 2-4 | 14.3 |
| | 5-9 | 7.6 |
| | 10-14 | 2.3 |
| | 15-19 | 1.2 |
| | 20-39 | 0.6 |
| | 40 and above | 0.1 |

Predicting counts of original posts and unique reposts of the sample by using the hurdle model

| | Original posts: Estimate (SE) | z | p | Unique reposts: Estimate (SE) | z | p |
| --- | --- | --- | --- | --- | --- | --- |
| Count portion | | | | | | |
| Intercept | -1.43 (0.47) | -3.02 | ** | -2.83 (1.21) | -2.34 | * |
| Log count of followers | 0.46 (0.08) | 5.47 | *** | 0.29 (0.09) | 3.24 | ** |
| Log count of friends | 0.07 (0.09) | 0.74 | | 0.48 (0.09) | 5.52 | *** |
| Gender | 0.06 (0.15) | 0.4 | | -0.23 (0.19) | -1.21 | |
| Province (Beijing/Shanghai/Guangdong) | -0.23 (0.16) | -1.39 | | 0.07 (0.19) | 0.35 | |
| Days since created | 0.0006 (0.0004) | -1.45 | | 0.0001 (0.0005) | -2.02 | * |
| Log(theta) | -1.84 (0.44) | -4.21 | *** | -3.04 (1.2) | -2.54 | * |
| Zero portion | | | | | | |
| Intercept | -3.85 (0.21) | -18.65 | *** | -4.3 (0.23) | -18.46 | *** |
| Number of Followers | 0.82 (0.04) | 19.77 | *** | 0.72 (0.04) | 16.48 | *** |
| Number of Friends | 0.08 (0.04) | 1.78 | . | 0.34 (0.05) | 6.86 | *** |
| Gender | -0.27 (0.1) | -2.66 | ** | -0.68 (0.11) | -6.1 | *** |
| Province (Beijing/Shanghai/Guangdong) | 0.42 (0.1) | 3.97 | *** | 0.12 (0.12) | 1.01 | |
| Days since created | 0.0001 (0.0002) | -6.91 | *** | -0.001 (0.0003) | -3.51 | *** |
| Log-likelihood | -2884 (DF=13) | | | -2581 (DF=13) | | |
| AIC | 5793.8 | | | 5188.5 | | |
| Observed % of zero | 94.3% | | | 95.0% | | |
| Predicted % of zero | 92.8% | | | 92.3% | | |

Predicting the sample’s maximum counts of being reposted and commented by using hurdle model

| | Count of being reposted (maximum): Estimate (SE) | z | p | Count of receiving comments (maximum): Estimate (SE) | z | p |
| --- | --- | --- | --- | --- | --- | --- |
| Count portion | | | | | | |
| Intercept | -14.32 (188.8) | -0.08 | | 0.16 (0.4) | 0.4 | |
| Log count of followers | 0.89 (0.2) | 4.5 | *** | 0.33 (0.08) | 4.03 | *** |
| Log count of friends | 0.02 (0.27) | 0.09 | | 0.16 (0.11) | 1.51 | |
| Gender | -0.86 (0.4) | -2.16 | * | -0.37 (0.14) | -2.59 | ** |
| Province (Beijing/Shanghai/Guangdong) | 0.27 (0.38) | 0.71 | | -0.07 (0.14) | -0.49 | |
| Days since created | -0.001 (0.0009) | -1.41 | | -0.001 (0.0004) | -1.53 | |
| Log(theta) | -13.01 (188.8) | -0.07 | | -0.01 (0.17) | -0.08 | |
| Zero portion | | | | | | |
| Intercept | -6.8 (0.45) | -15.04 | *** | -5.75 (0.33) | -17.44 | *** |
| Number of Followers | 0.74 (0.07) | 10.05 | *** | 0.9 (0.06) | 14.64 | *** |
| Number of Friends | 0.27 (0.09) | 2.99 | ** | 0.17 (0.07) | 2.53 | * |
| Gender | -0.55 (0.2) | -2.79 | ** | -0.59 (0.15) | -3.95 | *** |
| Province (Beijing/Shanghai/Guangdong) | 0.4 (0.19) | 2.1 | * | 0.39 (0.15) | 2.65 | ** |
| Days since created | 0.001 (0.0005) | 2.08 | * | 0.00004 (0.0004) | 0.1 | |
| Log-likelihood | -720.6 (DF=13) | | | -1495 (DF=13) | | |
| AIC | 1467.2 | | | 3015.3 | | |
| Observed % of zero | 98.6% | | | 97.3% | | |
| Predicted % of zero | 98.7% | | | 95.5% | | |

WeiboScope

• Developed in-house at JMSC
• Gathered a list of about 400,000 profiles of users with more than 1,000 followers
• Regular automated task to download the posts made by these 400K users

System Architecture

Most reposted Weibo entries

Reposting Weibo

Qian Yunhui incident (2010) Chronology

| Date | Category | Event |
| --- | --- | --- |
| Dec. 25th, 2010 | Summary | Around 9:00 am, Qian Yunhui, an officer of the village government, was knocked down and crushed by a heavy-duty truck. He died immediately. |
| Dec. 25th, 2010 | Public Reaction | At 9:40 am, right after the accident, nearby villagers spontaneously organized to protect the body of Qian. |
| Dec. 26th, 2010 | Weibo | At 21:26, the first Weibo post appeared; it was reposted over 23k times and received over 6k comments. |
| Dec. 27th, 2010 | Mainstream Media | China Youth Daily (中國青年報) reported the accident and pointed out some doubts. Chinanews.com (中新網) reported the press conference held by the local government. |
| Dec. 27th, 2010 | Local Government | The local government held a first press conference. |
| Dec. 29th, 2010 | Local Government | The government of Wenzhou City held a second press conference, claiming again that it was just a normal traffic accident. |
| Jan. 4th, 2011 | Mainstream Media | Southern Metropolis Daily (南方都市報) interviewed the villagers, who said that they had been warned many times that week. This was later reported by many mainstream media outlets. |
| Feb. 1st, 2011 | Judicial System | The city court sentenced the driver of the truck to 3.5 years in prison. |

Searching for “西红柿” (tomato)

“According to relevant laws, regulations and policies, the “西红柿” search results were not shown.”

Deleted posts page

Deleted posts strategy

• Selected small subsets of users
• Regularly check their profiles
• Compare consecutive timelines

Deleted posts timeline

[Figure: timeline along a Time axis. A post is made, the post is later deleted, and our system marks it as deleted at one of its regular "checks".]

"Permission denied"

• Two messages from the API when a previously existing post is deleted:
o "Weibo does not exist"
o "Permission denied"
• No official word, but circumstantial evidence suggests that "permission denied" posts are ones that were manually removed

Weibo Turns the Spotlight on Propaganda
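The deleted-posts strategy and the two API messages described above (comparing consecutive timelines, then checking what the API says about each vanished post) can be sketched as follows. This is a rough illustration under stated assumptions, not the WeiboScope implementation: `fetch_status` is a hypothetical callback that looks up a single post, and the post IDs and the label names are invented for the example.

```python
def find_missing(previous_ids, current_ids):
    """Post IDs seen at the previous check but absent from the current timeline."""
    return sorted(set(previous_ids) - set(current_ids))


def classify_deletions(previous_ids, current_ids, fetch_status):
    """Label each vanished post by the message the API returns for it.

    `fetch_status` is a hypothetical callback: given a post ID, it returns
    the API's message for that post as a string (or None if there is none).
    """
    labels = {}
    for post_id in find_missing(previous_ids, current_ids):
        message = fetch_status(post_id)
        if message == "Permission denied":
            labels[post_id] = "permission_denied"   # possibly removed manually
        elif message == "Weibo does not exist":
            labels[post_id] = "does_not_exist"
        else:
            labels[post_id] = "unknown"
    return labels


# Toy data (invented): two consecutive snapshots of one user's timeline.
earlier = [101, 102, 103, 104]
later = [101, 104, 105]
api_messages = {102: "Permission denied", 103: "Weibo does not exist"}

result = classify_deletions(earlier, later, api_messages.get)
```

A real system would also need to confirm that a missing post did not simply fall outside the downloaded window of the timeline, which is one reason to look up each post individually before marking it as deleted.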

“Reading the Shoddy Performance of U.S. Politicians through the Chen Guangcheng Affair”

May 4, 2012: four Beijing newspapers harshly criticize the U.S. for its involvement in the Chen Guangcheng case.

Critical Post on Beijing Editorials Deleted

Searches for “Beijing Daily” Disabled

China Turns on its Own Propaganda

Xiong Peiyun: China needs principles, not apologies

• How can you not be ashamed that a dignified citizen must flee inside his own country? You must live up to the sun that shines every day across this land. Sixty years [of CCP rule], and what this country needs is to settle its soul. What it needs is a set of core values the people of the nation can be proud of. What this country needs is its own way [and principles], not an apology from another country. I’ve heard the saying, “A just cause deserves abundant support, and an unjust cause must find little support” (得道多助,失道寡助). But I’ve never heard the saying, “An apology deserves abundant support, and no apology must find little support” (得道歉多助,失道歉寡助).

• Source: China Media Project, http://cmp.hku.hk/2012/05/03/22283/

Analyzing Censored Keyword

• Determine a list of Chinese terms that discriminate between censored and uncensored posts written by the same microbloggers
• 17,594 “permission denied” posts captured from January 1, 2012, to June 30, 2012
• Each censored post was paired with two randomly selected uncensored posts published by the same microblogger
• Pre-processing and tokenization

Top 30 keywords for censorship and PAM status according to χ² value. Keywords in red exist in both lists.

| Rank | Predictors for censorship: terms | χ² | RR | Predictors for PAM: terms | χ² | RR |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 重庆 (Chongqing) | 302.38 | 3.25 | 两会 (Two meetings) | 75.88 | 5.67 |
| 2 | 光诚 (Guangcheng) | 248.62 | 32.82 | 雷锋 (Lei Feng) | 56.66 | 3.41 |
| 3 | 陈光诚 (Chen Guangcheng) | 237.42 | 46.72 | 白色 情人节 (White Day) | 37.07 | 5.86 |
| 4 | 两会 (Two meetings) | 232.29 | 5.39 | 人大代表 (National People's Congress) | 33.48 | 3.14 |
| 5 | 骆家辉 (Gary Locke) | 212.82 | 12.38 | 情人节 (Valentine's Day) | 30.66 | 1.42 |
| 6 | 辟谣 (refuting rumours) | 203.14 | 5.16 | 王立军 (Wang Lijun) | 30.57 | 4.91 |
| 7 | 代表 (representative) | 200.1 | 2.75 | 立军 (Lijun) | 30.57 | 4.91 |
| 8 | 薄 (Bo, a family name) | 187.26 | 5.07 | 锋 (Feng) | 27.7 | 1.87 |
| 9 | 日报 (daily newspaper) | 179.82 | 4.21 | 妇女节 (Women's Day) | 24.97 | 3.61 |
| 10 | 公布 财产 (announced assets) | 178.93 | 24.59 | bed | 22.26 | 2.01 |
| 11 | 北京 日报 (Beijing Daily) | 172.05 | 7.91 | 民主 (Democracy) | 22.05 | 1.92 |
| 12 | 薄熙来 (Bo Xilai) | 159.18 | 7.01 | 三八 (Mar-8) | 22.04 | 2.56 |
| 13 | 人大代表 (National People's Congress) | 152.5 | 4.23 | bed 凌乱 (bed, mess) | 21.96 | 3.74 |
| 14 | 骆家辉 公布 (Gary Locke, announced) | 152.2 | 305.96 | 学雷 (learn, Lei) | 21.68 | 3.36 |
| 15 | 财产 (assets) | 150.95 | 3.86 | 吴英 (Wu Ying) | 21.52 | 5.13 |
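The slides do not spell out how the χ² and RR columns were computed. One common reading, sketched here as an assumption, treats each term as a 2×2 table of term presence against censorship status, with a Pearson chi-square (no continuity correction) and RR taken as the censorship rate among posts containing the term divided by the rate among posts without it. The function name and the counts below are invented for illustration.

```python
def chi_square_and_rr(a, b, c, d):
    """Pearson chi-square (no continuity correction) and relative risk for a 2x2 table.

    a: censored posts containing the term
    b: uncensored posts containing the term
    c: censored posts without the term
    d: uncensored posts without the term
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Relative risk: censorship rate with the term vs. without the term.
    rr = (a / (a + b)) / (c / (c + d))
    return chi2, rr


# Invented counts: 100 censored posts and 200 uncensored posts,
# mirroring the 1:2 pairing of censored to uncensored posts.
chi2, rr = chi_square_and_rr(30, 10, 70, 190)
```

Ranking terms by `chi2` and keeping those with `rr` above 1 would give a list in the spirit of the table above. A term that appears only in censored posts makes `b` zero and the rate with the term equal to 1, so very large RR values are possible for rare terms.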

• The top censorship keywords mostly relate to the Bo Xilai scandal, the Chen Guangcheng diplomatic incident, United States Ambassador to China Gary Locke’s financial disclosure, the one-child policy, housing policy, and the pension system.

• The list also includes terms coined by Chinese microbloggers to circumvent the censors, for example Pingxi Wang (平西王, literally "King who pacifies the west," referring to Bo Xilai), cgc (initials of Chen Guangcheng), “crown prince” (儲君, referring to Xi Jinping, China's leader-to-be) and “grass” (艹, an obscure alternative writing of 草, a homophone of a vulgar word).

• A number of “sensitive” terms have lower discriminatory power, which suggests that posts using them to circumvent censorship had higher survival rates. They include “tomato” (西红柿, a homophone of 西红市, “western red city,” which referred to Chongqing) and “head nurse” (护士长, referring to Wang Lijun, a key figure in the Bo Xilai scandal).

Evaluating the Impact of Real-Name Registration (RnR)

• Study period
– T1: the 3 months before March 16, 2012
– T2: the 3 months after
• Those who made no posts during T2 were defined as potentially affected microbloggers (hereafter PAMs)
• We hypothesized that the reduced user activity among PAMs was largely attributable to the chilling effect of RnR
• Each of the 3,000 PAMs (Beijing, Shanghai, Guangdong, and Tianjin) was paired with a matching non-PAM
• We compared the T1 posts of the two groups, 437,153 posts in total
• Among the top keywords with RR > 1, many refer to political scandals, international affairs, or social events and figures
– for instance "two meetings," Lei Feng (雷锋), Wang Lijun (王立军), Wu Ying (吴英, a businesswoman convicted of financial fraud and sentenced to death), Syria (叙利亚), Cultural Revolution (文革), Wukan (乌坎, a village where an anti-corruption protest took place), Article 73 (73 条, new legislation allowing authorities to detain any parties suspected of national security threats) and corruption (腐败)

The Way Forward

• Making the system more effective at collecting “deleted” messages?
• Contributing to news gathering?
• Making the data available to researchers and the public?
• Academic research
– Understanding microblogging in China in general
– Longitudinal random sampling analysis
– Social network analysis
• Funding issue

Questions & Answers