Tweeting Under Pressure: Analyzing Trending Topics and Evolving Word Choice on Sina Weibo

Tweeting Under Pressure: Analyzing Trending Topics and Evolving Word Choice on Sina Weibo Le Chen Chi Zhang Christo Wilson College of Computer and College of Computer and College of Computer and Information Science Information Science Information Science Northeastern University Northeastern University Northeastern University Boston, MA USA Boston, MA USA Boston, MA USA [email protected] [email protected] [email protected] ABSTRACT socialize and share content. However, social media in China also In recent years, social media has risen to prominence in China, with plays a more profound role as a platform for breaking news and sites like Sina Weibo and Renren each boasting hundreds of mil- political commentary that is not available in the state-sanctioned lions of users. Social media in China plays a profound role as a news media. For example, Weibo played a key role in the downfall platform for breaking news and political commentary that is not of once-prominent politician Bo Xilai [17]. available in the state-sanctioned news media. However, like all Like all websites in China, Chinese social media is subject websites in China, Chinese social media is subject to censorship. to government-enforced content regulation policies. The primary Although several studies have identified censorship on Weibo and manifestation of these regulations is censorship, which is known Chinese blogs, to date no studies have examined the overall impact to impact Chinese blogs [25] and Weibo. Current work disagrees of censorship on discourse in social media. on the scope of censorship on Weibo, with estimates ranging from In this study, we examine how censorship impacts discussions 0.01% [42] to 16% [7] of all “weibos” (a.k.a. tweets on Weibo) on Weibo, and how users adapt to avoid censorship. We gather being censored. Users who discuss political issues [42, 49] and tweets and comments from 280K politically active Weibo users for minority groups [7] tend to incur the brunt of censorship. In fact, 44 days and use NLP techniques to identify trending topics. We it is hypothesized that Weibo employs thousands of crowdsourced observe that the magnitude of censorship varies dramatically across workers to manually examine and censor the huge volume of tweets topics, with 82% of tweets in some topics being censored. How- that are generated each day [49]. Thus, tweets may be visible for ever, we find that censorship of a topic correlates with high user minutes, hours, or even days before they are censored, giving re- engagement, suggesting that censorship does not stifle discussion searchers an opportunity to download and analyze them. of sensitive topics. Furthermore, we find that users adopt variants Although it is no secret that tweets on Weibo are censored, how of words (known as morphs) to avoid keyword-based censorship. censorship is applied and the impact that it has on discourse is We analyze emergent morphs to learn how they are adopted and currently unknown. In this study, we seek to answer two key spread by the Weibo user community. questions: first, what is the impact of censorship on discourse on Weibo? In other words, is censorship effective at chilling or even halting discussion on Weibo? Second, do Weibo users adapt in or- Categories and Subject Descriptors der to avoid censorship? Anecdotal evidence suggests that users J.4 [Computer Applications]: Social and behavioral sciences; may use morphs to avoid keyword-based censorship [7, 42], e.g., K.5.2 [Governmental Issues]: Censorship ¨君 (crown prince) instead of `近s (Xi Jinping, the current president of China). However, it is unknown whether this theory is Keywords true, and if so, what the dynamics of morph generation are. These two questions get at the heart of the conflict between information Online social networks; Sina Weibo; Trending topics dissemination and censorship in the highly dynamic, human-driven social media space. 1. INTRODUCTION To answer these questions, we break our study down into three major components. First, we conduct a large scale crawl of Weibo In recent years, social media has risen to prominence in China. for 44 days. Our crawl targeted a connected component of 280,250 Sina Weibo (the Chinese equivalent of Twitter, abbreviated as users who are active on Weibo. The crawler implemented a prioriti- Weibo) boasts 500 million users [45], and Renren (the Chinese zation system where users who tweet more frequently were crawled equivalent of Facebook) boasts 172 million users [22]. Like people more frequently. This enabled the crawler to gather most censored the world over, Chinese users flock to these platforms as places to tweets before they were deleted (censorship can then be identified after-the-fact). In total, our crawl gathered 36.5M tweets, 1% of Permission to make digital or hard copies of all or part of this work for personal or which were censored. We observe that censorship is not applied classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full cita- uniformly, e.g., 82% of tweets from one particularly contentious tion on the first page. Copyrights for components of this work owned by others than topic were censored, while up to 50% of tweets from some celebrity ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- users were censored. publish, to post on servers or to redistribute to lists, requires prior specific permission In addition to tweets, our crawler also gathered all of the com- and/or a fee. Request permissions from [email protected]. ments on each tweet. Comments on Weibo function like comments COSN’13, October 7–8, 2013, Boston, Massachusetts, USA. on Facebook, i.e., users append them to existing tweets. Unlike Copyright 2013 ACM 978-1-4503-2084-9/13/10 ...$15.00. http://dx.doi.org/10.1145/2512938.2512940. prior studies of censorship on Weibo, ours is the first to examine both tweets and comments. This distinction is important, because, delete tweets, etc. However, as we discuss in § 3.1, there are sig- as we show in § 4.2, there are an order of magnitude more com- nificant limitations to Weibo’s APIs. ments than tweets on Weibo. Second, we leverage Latent Dirichlet Allocation (LDA) [9] to 2.2 Government Regulation of the Web extract 37 trending topics from our crawled data. Each one of these The Chinese government enforces several policies to regulate topics corresponds to a real-world event (e.g., the Boston marathon content on websites. As a major social hub, Sina Weibo regulates bombing, Ya’an Earthquake, etc.), and several were heavily cen- content in cooperation with these policies. sored (e.g., a Sichuan official who was criticized following the Ya’an earthquake, an incident between President Xi and a Beijing • Real Name Policy. In March 2012, Weibo implemented taxi driver, etc.). Across these topics, we analyze the relationship the Real Name Registration (RnR) Policy [48]. The policy between the magnitude of censorship and the characteristics exhib- states that users must use their real name when creating a ited by the topic (e.g., number of engaged users, tweets per user, Weibo account, although users may use a pseudonym as their etc.). Contrary to our expectations, we find that users are more ac- public handle on the website. tive in discussing censored topics, indicating that censorship does not have a chilling effect on discussion on Weibo. • Blacklists. Weibo maintains a blacklist of words and Third and finally, we examine the usage of morphs on Weibo. URLs that are not permitted in tweets. For example, tweets We find that 11 of our 37 topics include morphs, in some cases up may not contain links that leverage Google’s URL shortener to 5 morphs per topic. Although we observe that many uncensored goo.gl. e.g., 黑AW topics include morphs for comedic or satirical effect ( • Search Censorship. Weibo does not permit users to ¢AW (Black Cross) in place of (Red Cross)), we also find that search for tweets that contain certain words. The China Dig- morph usage dramatically increases within censored topics. Tem- ital Times maintains an up-to-date list of words impacted by poral analysis reveals that morph usage increases rapidly within search censorship on Weibo [1]. hours of censorship being implemented, suggesting that users adapt their word usage to circumvent censorship. • Tweet Censorship. Several studies have confirmed that We view this study as a first step towards understanding the im- Weibo censors tweets [7, 49]. Tweets may be deleted if pact of censorship on discourse in social media, rather than simply they contain politically sensitive topics, abusive language, quantifying the scope of censorship. This study lays the foundation pornography, or rumors. It is hypothesized that Weibo em- for updating existing information dissemination models, or devel- ploys a heterogeneous strategy for tweet censorship, ranging oping new ones, that take adversarial forces into account. Our re- from keyword filtering to real-time crowdsourced monitor- sults also point towards new techniques for identifying and predict- ing [49]. ing censorship, by using language models to observe when words usage changes (i.e., morphs) in otherwise unexpected ways. Violations and Penalties. Sina Weibo enforces penalty policies against users who violate online content regulations. The Sina 2. BACKGROUND Weibo Community Treaty, launched in May 2013 [6], outlines these We begin by briefly introducing Sina Weibo, comparing its fea- penalty policies. The treaty introduced a credit system for Weibo tures to Twitter, and discussing government regulation of the Web users where credit is deducted for each policy violation. Weibo ac- in China. counts are permanently deleted if their credit reaches 0.

Tweeting Under Pressure: Analyzing Trending Topics and Evolving Word Choice on Sina Weibo

The Versatility of Microblogging

The Limits of Commercialized Censorship in China

Statistical Data Mining for Sina Weibo, a Chinese Micro-Blog: Sentiment Modelling and Randomness Reduction for Topic Modelling

MASTER THESIS in Universal Design of ICT May 2017

The Public Square Project

Analysis of Bytedance with a Close Look on Douyin / Tiktok ——————————————————

Linguistic Fingerprints of Internet Censorship: the Case of Sina Weibo

Arcana: Enabling Private Posts on Public Microblog Platforms

Cyberspace and Gay Rights in a Digital China: Queer Documentary Filmmaking Under State Censorship

Gilbert SHANG NDI Faculty of Social Sciences, Universidad De Los Andes Bogota, Colombia [email protected]

CHECKLIST How to Compare Social Software Platforms

Information, Communication & Society Taking Stock, Moving Forward: The