Information Retrieval and Knowledge Organization: a Perspective from the Philosophy of Science

Article

Birger Hjørland

Department of Communication, University of Copenhagen, 8 Karen Blixens Plads, DK-2300 Copenhagen S, Denmark; [email protected]

Abstract: Information retrieval (IR) is about making systems for finding documents or information. Knowledge organization (KO) is the field concerned with indexing, classification, and representing documents for IR, browsing, and related processes, whether performed by humans or computers. The field of IR is today dominated by search engines like Google. An important difference between KO and IR as research fields is that KO attempts to reflect knowledge as depicted by contemporary scholarship, in contrast to IR, which is based on, for example, “match” techniques, popularity measures or personalization principles. The classification of documents in KO mostly aims at reflecting the classification of knowledge in the sciences. Books about birds, for example, mostly reflect (or aim at reflecting) how birds are classified in ornithology. KO therefore requires access to the adequate subject knowledge; however, this is often characterized by disagreements. At the deepest layer, such disagreements are based on philosophical issues best characterized as “paradigms”. No IR technology and no system of knowledge organization can ever be neutral in relation to paradigmatic conflicts, and therefore such philosophical problems represent the basis for the study of IR and KO.

Keywords: information retrieval; knowledge organization; philosophy of science; classification; knowledge organization systems; ontologies; Kuhnian paradigm theory; pragmatism

Citation: Hjørland, B. Information Retrieval and Knowledge Organization: A Perspective from the Philosophy of Science. Information 2021, 12, 135. https://doi.org/10.3390/info12030135

Academic Editor: Claudio Gnoli

Received: 23 February 2021; Accepted: 8 March 2021; Published: 20 March 2021

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Information 2021, 12, 135. https://doi.org/10.3390/info12030135 https://www.mdpi.com/journal/information

1. Introduction

Information retrieval (IR) and knowledge organization (KO) are two research fields that, on the one hand, are separate fields of study, but on the other hand, have the same aim: to facilitate the findability of documents, knowledge, and information. Anderson and Pérez-Carballo’s [1] handbook of KO has the title Information Retrieval Design, which indicates the close connection between KO and IR, where IR is about search processes, while KO is about designing optimal structures for IR (in addition to other purposes KO may serve).

Today, the field of IR has mainly migrated from information science to computer science and has developed systems used in search engines such as Google, which are extremely successful. There is, however, a surprisingly simple objection to the underlying principles in mainstream IR research: they are not based on scientific or scholarly norms on which documents or knowledge claims have the best scientific or scholarly foundation and correspond to our best theories and findings. It seems obvious that users need to retrieve what is regarded as true knowledge (or to be more precise, what is considered our best substantiated knowledge claims). The main approaches in IR are discussed in Section 3.

In this respect, the field of KO has often had the goal, at the least implicitly, to classify and represent documents and knowledge in accordance with updated scientific knowledge (e.g., to classify books about birds in accordance with the classification of birds in ornithology). Despite this aim, the theory of KO has lacked an adequate toolset of conceptualization to cope with it (and often, approaches have dominated that do not deal adequately with the problems. This is discussed in Section 2).

The purpose of this article is to provide arguments and conceptualization for creating methods and principles to develop systems and processes in IR and KO that aim at providing users with our best substantiated knowledge. To do so, this article argues, we need to base IR on KO, and we need to base KO on insights derived from the philosophy of science.

2. The Field of Knowledge Organization (KO)

KO is sometimes termed “information organization” and [2] (p. 391) found that “information organization” is now by far the most used name for courses in educational programs in North American ALA-accredited programs. It is also sometimes termed “knowledge representation”, as in the subtitle of the official journal of ISKO, Knowledge Organization: “International Journal devoted to Concept Theory, Classification, Indexing and Knowledge Representation”. KO is about knowledge organizing processes such as indexing, tagging, classifying, describing, and organizing documents and information, and about knowledge organization systems (KOS) such as classification systems, thesauri, and ontologies.

KO has a practical aim. Although it is also a theoretical field, the aim of both practical activities and empirical and theoretical research is to support practical activities. The practical aims are the sole justification for KO. It has been claimed, for example, that KOS such as thesauri and controlled vocabularies are obsolete today, cf., [3]. KO must show that its processes and systems are necessary.
If, for example, search engines such as Google can fulfill all practical needs without relying on KO, there is no need for this field. KO must always consider itself in relation to all possible alternative approaches, whether they have been developed inside or outside the field itself. Specifically, we must demonstrate why, for example, Google is not enough (cf., [4]). This article contains arguments about the necessity of KO.

The KOS developed for organizing documents and information are primarily about organizing concepts. The conceptual structures used in KOS are to a large degree derived from specific domains of knowledge. Documents about birds, for example, are often organized according to how ornithologists organize the birds themselves. This principle was recognized by Henry Bliss [5], who argued (p. 37): “To make the [bibliographic] classification conform to the scientific and educational organization of knowledge is to make it the more practical”, and this is probably the main reason the field was and still is named KO.

The roots of KO lie in the following:
(1) the practical classification and indexing of books and other kinds of documents in libraries and bibliographical databases;
(2) philosophical principles, including Aristotle’s logic and Francis Bacon’s classification of knowledge, among many others (the reader should be warned, however, that there are many misunderstandings about philosophical issues, including Aristotle’s role in classification; what is commonly attributed to Aristotle is a myth, see, e.g., [6], Chapter 2: “The Aristotelian Framework”);
(3) scientific and scholarly contributions, including, for example, the contributions of Aristotle, Carl Linnaeus, Charles Darwin and many other scientists to the classification of living organisms and all other things in the world;
(4) developments in information technology (IT), such as databases, communication networks and social media.

The connections between these four fields are important for the development of KO as a field of study and practice but have often been neglected. This article argues that subject knowledge and its foundation in the philosophy of science is the most important perspective for KO, but this perspective has so far not been very influential in KO. Alternative influential perspectives have been:

• To consider KO an intuitive process that does not need deeper justification. The original classification of journals in the citation indexes published by the Institute for Scientific Information, for example, was just intuitive (cf., [7], p. 602). In practice, many KO activities have been driven by this view: you simply start constructing a classification and stop when it seems to suit the purpose. A criticism of this perspective may state that any classification always serves some interests at the expense of other interests, and if unanalyzed, it cannot be optimized for the purpose it is going to serve (cf., [8]). Of course, this view challenges the whole idea of KO as a needed field of study.
• To claim that since only the individual user knows their own “information need”, they are the only person qualified to set principles of what should be found in IR and how documents should be indexed or classified. This is the view underlying user-oriented and cognitive approaches in information science and KO; it is discussed briefly in the present article and in more detail in [9].
• To say that the best way, both economically and qualitatively, is to base KO on user tagging or related “social technologies”, implying that a deeper theoretical understanding of KO is unnecessary. User tagging is seen by some as both “democratic” and economically preferable, while there are also critical voices (for a discussion, see [10]).
• To claim that KO is basically a technological problem,