Knowl. Org. 46(2019)No.2

KO KNOWLEDGE ORGANIZATION

Official Journal of the International Society for Knowledge Organization ISSN 0943 – 7444 International Journal devoted to Concept Theory, Classification, Indexing and Knowledge Representation

Contents

Articles

Li Yang and Yejun Wu. Creating a Taxonomy of Earthquake Disaster Response and Recovery for Online Earthquake Information Management ...... 77

Chengzhi Zhang, Hua Zhao, Xuehua Chi and Shuitian Ma. Information Organization Patterns from Online Users in a Social Network ...... 90

Reviews of Concepts in KO

Koraljka Golub. Automatic Subject Indexing of Text ...... 104

Marcia Lei Zeng. Interoperability ...... 122

Review

Semantic Perception: How the Illusion of a Common Language Arises and Persists, Jody Azzouni. New York: Oxford University Press, 2013. ISBN 9780199967407. 2015. ISBN 9780190275549. Ontology Without Borders, Jody Azzouni. New York, NY: Oxford University Press. ISBN 9780190622558. Reviewed by Matthew Kelly ...... 147

Erratum ...... 154

Books recently published ...... 155

KNOWLEDGE ORGANIZATION

This journal is the organ of the INTERNATIONAL SOCIETY FOR KNOWLEDGE ORGANIZATION (General Secretariat: Amos DAVID, Université de Lorraine, 3 place Godefroy de Bouillon, BP 3397, 54015 Nancy Cedex, France. E-mail: [email protected]).

Editors

Richard P. SMIRAGLIA (Editor-in-Chief), Institute for Knowledge Organization and Structure, Shorewood WI 53211 USA. E-mail: [email protected]

Joshua HENRY, Institute for Knowledge Organization and Structure, Shorewood WI 53211 USA.

Peter TURNER, Institute for Knowledge Organization and Culture, Shorewood WI 53211 USA.

J. Bradford YOUNG (Bibliographic Consultant), Institute for Knowledge Organization and Structure, Shorewood WI 53211, USA.

Editor Emerita

Hope A. OLSON, School of Information Studies, University of Wisconsin-Milwaukee, Milwaukee, Northwest Quad Building B, 2025 E Newport St., Milwaukee, WI 53211 USA. E-mail: [email protected]

Series Editors

Birger HJØRLAND (Reviews of Concepts in Knowledge Organization), Department of Information Studies, University of Copenhagen. E-mail: [email protected]

María J. LÓPEZ-HUERTAS (Research Trajectories in Knowledge Organization), Universidad de Granada, Facultad de Biblioteconomía y Documentación, Campus Universitario de Cartuja, Biblioteca del Colegio Máximo de Cartuja, 18071 Granada, Spain. E-mail: [email protected]

Editorial Board

Thomas DOUSA, The University of Chicago Libraries, 1100 E 57th St, Chicago, IL 60637 USA. E-mail: [email protected]

Melodie J. FOX, Institute for Knowledge Organization and Structure, Shorewood WI 53211 USA. E-mail: [email protected].

Jonathan FURNER, Graduate School of Education & Information Studies, University of California, Los Angeles, 300 Young Dr. N, Mailbox 951520, Los Angeles, CA 90095-1520, USA. E-mail: [email protected]

Claudio GNOLI, University of Pavia, Science and Technology Library, via Ferrata 1, I-27100 Pavia, Italy. E-mail: [email protected]

Ann M. GRAF, School of Library and Information Science, Simmons University, 300 The Fenway, Boston, MA 02115 USA. E-mail: [email protected]

Jane GREENBERG, College of Computing & Informatics, Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104 USA. E-mail: [email protected]

José Augusto Chaves GUIMARÃES, Departamento de Ciência da Informação, Universidade Estadual Paulista–UNESP, Av. Hygino Muzzi Filho 737, 17525-900 Marília SP Brazil. E-mail: [email protected]

Michael KLEINEBERG, Humboldt-Universität zu Berlin, Unter den Linden 6, D-10099 Berlin. E-mail: [email protected]

Kathryn LA BARRE, School of Information Sciences, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, MC-493, Champaign, IL 61820-6211 USA. E-mail: [email protected]

Devika P. MADALLI, Documentation Research and Training Centre (DRTC) Indian Statistical Institute (ISI), Bangalore 560 059, India. E-mail: [email protected]

Daniel MARTÍNEZ-ÁVILA, Departamento de Ciência da Informação, Universidade Estadual Paulista–UNESP, Av. Hygino Muzzi Filho 737, 17525-900 Marília SP Brazil. E-mail: [email protected]

Widad MUSTAFA el HADI, Université Charles de Gaulle Lille 3, URF IDIST, Domaine du Pont de Bois, Villeneuve d’Ascq 59653, France. E-mail: [email protected]

H. Peter OHLY, Prinzenstr. 179, D-53175 Bonn, Germany. E-mail: [email protected]

M. Cristina PATTUELLI, School of Information, Pratt Institute, 144 W. 14th Street, New York, New York 10011, USA. E-mail: [email protected]

K. S. RAGHAVAN, Member-Secretary, Sarada Ranganathan Endowment for Library Science, PES Institute of Technology, 100 Feet Ring Road, BSK 3rd Stage, Bangalore 560085, India. E-mail: [email protected].

Heather Moulaison SANDY, The iSchool at the University of Missouri, 303 Townsend Hall, Columbia, MO 65211, USA. E-mail: [email protected]

M. P. SATIJA, Guru Nanak Dev University, School of Library and Information Science, Amritsar-143 005, India. E-mail: [email protected]

Aida SLAVIC, UDC Consortium, PO Box 90407, 2509 LK The Hague, The Netherlands. E-mail: [email protected]

Renato R. SOUZA, Applied Mathematics School, Getulio Vargas Foundation, Praia de Botafogo, 190, 3o andar, Rio de Janeiro, RJ, 22250-900, Brazil. E-mail: [email protected]

Rick SZOSTAK, University of Alberta, Department of Economics, 4 Edmonton, Alberta, Canada, T6G 2H4. E-mail: [email protected]

Joseph T. TENNIS, The Information School of the University of Washington, Box 352840, Mary Gates Hall Ste 370, Seattle WA 98195-2840 USA. E-mail: [email protected]

Maja ŽUMER, Faculty of Arts, University of Ljubljana, Askerceva 2, Ljubljana 1000 Slovenia. E-mail: [email protected]

Knowl. Org. 46(2019)No.2 77 Li Yang and Yejun Wu. Creating a Taxonomy of Earthquake Disaster Response and Recovery…

Creating a Taxonomy of Earthquake Disaster Response and Recovery for Online Earthquake Information Management

Li Yang*, Yejun Wu**

*Southwest Petroleum University, School of Computer Science, Chengdu, China 610500
**Louisiana State University, School of Library and Information Science, Baton Rouge, LA 70803

Li Yang is a lecturer in the School of Computer Science at Southwest Petroleum University in China. Her research areas include knowledge organization, knowledge discovery and sharing, and information management and service.

Yejun Wu is an associate professor in the School of Library and Information Science at Louisiana State University. He received a PhD in information studies from the College of Information Studies, University of Maryland, College Park (2008). His research areas include knowledge organization and discovery, information retrieval, and digital libraries.

Yang, Li and Yejun Wu. 2019. “Creating a Taxonomy of Earthquake Disaster Response and Recovery for Online Earthquake Information Management.” Knowledge Organization 46(2): 77-89. 43 references. DOI:10.5771/0943-7444-2019-2-77.

Abstract: The goal of this study is to develop a taxonomy of earthquake response and recovery using online information resources for organizing and sharing earthquake-related online information resources. A constructivist/interpretivist research paradigm was used in the study. A combination of top-down and bottom-up approaches was used to build the taxonomy. Facet analysis of disaster management, the timeframe of disaster management, and modular design were performed when designing the taxonomy. Two case studies were done to demonstrate the usefulness of the taxonomy for organizing and sharing information. The facet-based taxonomy can be used to organize online information for browsing and navigation. It can also be used to index and tag online information resources to support searching. It creates a common language for earthquake management stakeholders to share knowledge. The top three level categories of the taxonomy can be applied to the management of other types of disasters. The taxonomy has implications for earthquake online information management, knowledge management and disaster management. The approach can be used to build taxonomies for managing online information resources on other topics (including various types of time-sensitive disaster responses). We propose a common language for sharing information on disasters, which has great social relevance.

Received: 13 August 2018; Revised: 3 December 2018; Accepted: 26 December 2018

Keywords: information, taxonomy, disaster, categories, response, earthquake, recovery

1.0 Introduction

The Internet was reported as the preferred source of information and the most reliable source for news by a majority of American adults in 2009. Disaster communication increasingly occurs via social media for up-to-date information. For example, after the 2011 Japanese tsunami, there were more than 5,500 tweets per second about the disaster (Fraustino et al. 2012). Within the first nine hours after the Ya’an earthquake in China in 2013, the number of tweets about the earthquake on Weibo (a Chinese microblogging website) totaled 64,000,000 (Chen and Fu 2013). Many of the posts were about help-seeking. Sixty-three percent of Canadians said emergency responders should be prepared to respond to calls for help posted on social media (Canadian Red Cross 2012). Online information sources, especially those from social media, have been used for emergency response and recovery.

Earthquakes are among the most common hazards. In the last ten years, 20,351 earthquakes above magnitude 5 hit the Earth with an estimated total of more than 350,000 deaths (USGS 2019). An earthquake and its aftershocks may take a severe toll in casualties and cause severe economic loss. As acute events, earthquake disasters require quick response and relief services (Marianti 2007). Organizing, mapping, disseminating, and communicating information adequately and swiftly are essential to prompt and effective response, especially in the first seventy-two prime hours, by supporting critical rescue decisions, such as where the worst-hit areas are, what the victims need, what to do for vulnerable groups, and how volunteers should cooperate.

The capacity to collect, organize, and disseminate information is critical to disaster response and recovery, especially for rapid-onset events with immediate destruction and death (OCHA-ROAP 2017; Marianti 2007). However, stakeholders of disasters use various technologies and protocols for communication, which makes information exchange difficult or even impossible between various organizations and countries (Knezić et al. 2015). Under extraordinary emergencies, they probably do not have uniform platforms and standards for information exchange, especially for the influx of information from domestic and overseas online sources. Furthermore, the contents and structures of online disaster information vary. News agencies and portals tend to have full coverage of catastrophes. Government authorities and non-government organizations tend to publicize what has been done and what resources are available. Social media sites mainly reveal the concerns of the general public and the needs of victims, and are also places to express emotion. Online information on disaster response and recovery is more often organized simply and casually than structurally and systematically. A taxonomy can be used to describe and organize information, providing a relationship structure and a common context of data, processes and management tools for cooperation, and thereby facilitating information sharing and exchange for disaster response and recovery (Knezić et al. 2015).

The goal of this study is to develop a faceted taxonomy using online information resources for managing online information with respect to earthquake disaster response and recovery. The vocabularies in the taxonomy were extracted from online information resources. We chose online information rather than other sources (e.g., TV media and printed media) because online information grows quickly and dramatically after an earthquake happens. The information that indicates the stakeholders’ demand, reflects the disaster situation and implies the priorities of disaster response activities is critical to disaster management. The taxonomy was built using information resources about past earthquakes and is expected to support indexing and mapping, retrieval and browsing, and the exchange and sharing of information for future earthquakes. In this methodological paper, we focus on the process of constructing a taxonomy of earthquake disaster response and recovery by using online information from earthquake-related websites and social media sites.

The paper is organized as follows: in the research background, several tools and methods of information management in the disaster management domain and related viewpoints are introduced. The methodology is presented subsequently, followed by the procedure of developing the taxonomy, and a discussion of difficult problems and compromised solutions. Two case studies of using the taxonomy for organizing and sharing online earthquake-related information resources are presented afterwards, followed by a summary of findings and implications. The taxonomy can be viewed at http://www.swpu.edu.cn/system/_content/download.jsp?urltype=news.DownloadAttachUrl&owner=1459835785&wbfileid=2539518. The last section concludes the paper and points out future work.

2.0 Research background

Timely, reliable, and accurate information is critical to decision making for disaster management. In intra- and inter-agency interactions, multiple stakeholders collect, collate, and communicate information to coordinate response to, and relief of, human suffering. Accessing high-quality information and sharing it with partners is essential for improving the effectiveness of responses (UNCHA 2002; Dantas et al. 2006; Bharosa et al. 2010).

However, information sharing and coordination in disaster response and recovery are not as fluent and efficient as they should be. Obstacles at various levels, from individual to community, limit the communication of information (Bharosa et al. 2010). Technical constraints make information inaccessible as well (UNCHA 2002). The information sharing model of web portals such as ReliefWeb, the United States Agency for International Development (USAID), Redcross, FEMA, CNN, Sina, Yahoo, etc. is mostly monodirectional. Additionally, information on these portals tends to be general and informative; it focuses on the breadth instead of the depth of information, which could hardly support operations. Many other platforms and systems are designed to promote information sharing, such as the Virtual On-Site Operations Coordination Centre (VOSOCC) (OCHA Field Coordination Support Section 2014), Inter-Organizational Information-Sharing Systems (IOISS) (Bharosa et al. 2010), and Sahana Free and Open Source Disaster Management System (https://sahanafoundation.org/). Some frameworks and platforms were proposed to establish the linkages, templates, and sharing standards to enable information sharing during emergency response activities (Dantas et al. 2006; Martin and Rice 2012; Sakurai 2016). With the development of communication technology, mobile applications were created to support disaster response for private and public communications even without Internet access or cellular data, like FireChat, FEMA App, First Aid, Earthquake by American Red Cross, Hurricane by American Red Cross, etc.

Another trend is that social media has been and will continue to be involved in disaster response and recovery. Because of its huge user base, accessibility, and fast response, social media enables online exchange of information through conversation and interaction, so the public has been turning towards social media sites for timely communication (Yates and Paquette 2011; Besaleva and Weaver 2016). In disaster response, decision-making is based upon information and reports from the public (Dantas, Seville, and Nicholson 2006). Social media like microblogs are believed to be the ideal sources of information, through which information is organized in clusters, such as topics, comments, and tags on posts and images (Garg and Kumar 2016; Yates and Paquette 2011). However, extracting useful information from social media is a challenging task.

Ineffective management of online information will in turn increase workloads for both the public and relief agencies. Several problems emerge due to this. First, people are faced with severe time pressure and a burst of information that can result in cognitive overload (Bharosa, Lee, and Janssen 2010). Second, poor quality of data and information, such as irrelevant, inaccurate, or outdated information, may cause improper, even inaccurate, decisions (Bjerge, Clark, Fisker, and Raju 2016). Third, the public uses natural language on social media. Common naming conventions are absent, leading to communication dilemmas and difficult information retrieval (Yates and Paquette 2011). Fourth, categorization, common standards, and frames for sharing information are underdeveloped (Dantas, Seville, and Nicholson 2006; Yates and Paquette 2011).

There is a gap between the ideal application of information and the existing tools and methods. Raw online information is difficult to use until it is organized. Converting ambiguous terminologies to standard terminologies for annotating information helps to maintain information quality (UNCHA 2002). Decision support requires integration of, and interoperability between, datasets. Time pressure favors standardized information for sudden-onset disasters (Van de Walle and Comes 2015). Taxonomy is a useful tool for standardizing and organizing information (Bardet and Liu 2010).

Taxonomy is a kind of knowledge organization system that models the underlying semantic structure of a domain (Hill et al. 2002). Taxonomies supply terminologies and their relationships. Organizing information by taxonomy can meet users’ specific decision-making and action-taking needs (Pellini and Jones 2011). Several taxonomies related to disaster response and relief have been developed and are introduced below. These taxonomies have different contents, structures, and purposes.

The AIRS/211 LA County Taxonomy (AIRS: Alliance of Information and Referral Systems) (211 LA County 2019) sets a standard for defining human services and for indexing and accessing human services. The “disaster services” section of the taxonomy is built from the perspective of a government agency for organizing, managing, and coordinating disaster services. It covers “disaster management organizations,” “disaster preparedness,” “mitigation,” “warnings,” “response,” “relief,” and “recovery services.” There are many categories that might be adopted to build our taxonomy, e.g., “disaster management organizations,” “disaster preparedness,” “disaster mitigation,” “disaster warnings,” “disaster response service,” “disaster relief services,” “disaster recovery service,” and their sub-categories.

The Humanitarian Decision Makers Taxonomy is restricted to the domain of decision makers in sudden-onset disasters. It presents only the stakeholders of disaster response and recovery. It is the basis for mapping the information needs of humanitarian decision makers in sudden-onset disasters, which helps to identify the decision makers and then target information towards them (Gralla et al. 2013). For our purpose, categories such as “donors,” “international organizations,” “public sector,” “private sectors,” “military,” “media,” “non-governmental organizations,” “individuals,” and their sub-categories serve as a good framework.

The Federal Emergency Management Agency (FEMA) proposed a taxonomy for disaster recovery activities from the perspectives of government agencies, humanitarian organizations, and personnel. It divides disaster recovery activities into four phases: disaster preparedness, short-term recovery, intermediate recovery, and long-term recovery (FEMA 2011). The four phases demonstrate the timeliness of disaster management, which is a marked feature that we should take into account in developing the taxonomy.

3.0 Methodology

A constructivist/interpretivist research paradigm, which is “the theoretical framework of most qualitative research” (Tuli 2010, 100), was used in the study. It sees the world as constructed and interpreted by people in their experiences and interactions with each other and the wider social systems (Bogdan and Biklen 1992; Lincoln and Guba 1985; Maxwell 2006; Merriam 1988; Tuli 2010). This paradigm acknowledges and recognizes the researchers’ active role in constructing the interpretation of the data gathered (Lauckner, Patterson and Terry 2012).

A taxonomy was constructed by the researchers using online information resources. The taxonomy can be evaluated by experts of earthquake disaster management for internal validity, and/or can be adjusted and applied to organize current online information resources to show some external validity. In the constructivist/interpretivist paradigm, we applied the taxonomy to organize current online information resources in two case studies in an attempt to demonstrate the usefulness of the taxonomy for organizing and sharing information. If the two case studies are meaningful and successful, the taxonomy is interpreted by the researchers as useful to some extent.

A taxonomy has a structure, categories, and terms. This section introduces the method of developing the structure and categories of the taxonomy, specifically, the method of selecting information resources and the terms of the taxonomy, and the procedure of developing the taxonomy.

3.1 Developing the structure and categories of the taxonomy

The structure of a taxonomy can be lists, tree structures, hierarchies, polyhierarchies, matrices, facets and system maps (Pellini and Jones 2011). A simple hierarchical structure (e.g., enumerative classification) with an IS-A relation between concepts is the most common structure of taxonomy (Knezić et al. 2015). Sufficient information is needed for multitasking in disaster response and recovery. For example, person finding involves multiple aspects as shown in Figure 1. It is obvious that “personal details” and “missing persons phone lines” are not species of “person finding.” “Contact person” and “name” have a connection of “relation to the sought person.” Various hierarchies are not demonstrated in the mini-taxonomy and not all terms exhibit a genus-species relation. However, this mini-taxonomy serves its purpose. “A scheme of categories of terms which will do more than imitating the genus-species relation” is more suitable for the taxonomy of disaster response and recovery (Foskett 2009, 1819). A hierarchical structure that expresses more than the genus-species (IS-A) relationship between a category and its sub-categories is applied to the development of the earthquake response and recovery taxonomy.

Figure 1. An incomplete taxonomy of person finding in disaster response (data source: vocabularies are extracted from Sina Weibo).

The major categories of the taxonomy are developed based on facet analysis. Facets are “homogeneous or semantically cohesive categories,” which are used to create term groupings of a subject discipline with a manageable size (Svenonius 2000, 139). Facet analysis can “provide a framework within which all the various types of terms can be accommodated, together with rules for their combination” (Foskett 2009, 1819). The categories in a faceted classification can be combined with each other, thus faceted classification can be flexible and can accommodate new phenomena (Vickery 1966, 46; Hjørland 2013; Kwaśnik 1999). In the enumerative classification, all classes are listed. By contrast, new facets can be added as needed and classes are built by combinations of building blocks in the faceted classification (Hjørland 2013). “By combining terms in compound subjects it introduces new logical relations between them, thus better reflecting the complexity of knowledge” (Vickery 1966, 46; Hjørland 2013). A hierarchical and systematically ordered scheme and the syntactic and semantic relations between categories provided by the facet analysis can represent various complicated attributes of disaster response and recovery and also address the need of describing dynamic information.

Facet analysis is conducted based on Ranganathan’s PMEST (“personality,” “matter,” “energy,” “space,” and “time”) facets. P (“personality”) is identified as stakeholders (see Appendix 2 for the category of stakeholders in the taxonomy). M (“matter”) is the disaster itself, such as earthquake, flood, and tsunami. S (“space”) represents the location, such as epicenter, refugee center, hospital, etc. T (“time”) represents time, such as the time a disaster happened, the timeframe of disaster management, and the last known time of a missing person. E (“energy”) is disaster preparedness, response, and recovery according to a timeframe of disaster management, which is addressed below.

Disaster management, also called emergency management, is a dynamic process. It has three phases: preparing, responding, and recovering (UNISDR 2017). Disaster preparedness enables timely, effective, and appropriate responses. Response has to do with immediate needs, short-term needs, and basic needs. Recovery includes restoring and improving the livelihoods, health and other development of disaster-affected societies. Each phase focuses on different missions, thus disparate information is needed for decision support. Disaster management needs to be operated in a timely manner. From the beginning, effective disaster preparedness is measured by how well gaps, duplication of effort, and parallel structures are avoided in a coordinated and timely manner (United Nations 2008). Timely warning, timely liaison, timely role conversion of organizations, timely decisions, timely action, timely information, and timely post-disaster review are required in an effective disaster response and recovery (Carter 2008). The timeframe is a critical attribute of disaster management. In particular, information for response and recovery serves time-sensitive demands and should be disseminated for emergency needs in a timely manner (United Nations 2018). In the AIRS/211 LA County Taxonomy, the “timeframe” elements are featured in the lowest level categories, e.g., “disaster preparedness,” “emergency preparedness and response planning,” “pre-disaster donations collection/storage,” “post disaster emergency medical care,” “post disaster child care.” However, the taxonomy does not organize the categories in accordance with “timeframe.” In the National Disaster Recovery Framework (FEMA 2011), the “timeframe” elements, including preparedness, short-term, intermediate, and long-term, are at the top level. As elements of timeframe, preparedness, response, and recovery are embedded in the taxonomy.

The timeframe is a feature of many disasters, not only earthquakes. A modular design is adopted for multi-field applications. Modularization means the elements of a system are split up and assigned to modules according to a formal architecture or plan (Clark and Baldwin 2002). The original design of the taxonomy focused on the earthquake domain. The first step to modular design is to remove the “earthquake” term from the top categories, particularly from the top three level categories, so that it can be applied to other disasters. The second step is to design the structure of the taxonomy for maximum independence of categories. Particular elements of a modular design may be changed in unforeseen ways as long as the design rules are obeyed, which makes modules tolerant of uncertainty.

When developing the specific categories and terms, a combination of top-down and bottom-up approaches is used. It is the best practice in taxonomy construction as discussed in knowledge organization literature (Wang, Chaudhry and Khoo 2010; Ramos and Rasmus 2003; Cisco and Jackson 2005; Holgate 2004; Wu and Yang 2015). A bottom-up approach builds up important categories from the concepts or vocabularies that are extracted from online information sources. Automated technologies such as information extraction and clustering can automate bottom-up analysis (Ramos and Rasmus 2003) but offer little control over the semantic meaning and organization of higher-level categories (Cisco and Jackson 2005). A top-down approach starts at the general, conceptual levels and establishes an overall framework for the taxonomy based on the objectives of the taxonomy (Ramos and Rasmus 2003). Therefore, it offers control over the top- and higher-level categories of the taxonomy (Cisco and Jackson 2005). A combination of the top-down and bottom-up approaches develops the higher-level categories in the taxonomy first, creates lower-level categories according to the grouping of terms, classifies specific terms into lower-level categories along the way, and refines the lower-level categories according to the constraints of the higher-level categories (Wu and Yang 2015). The higher-level categories can be adjusted and refined according to the need of controlling the lower-level categories.

The primary taxonomy should be broad and shallow since multiple and at times disposable taxonomies can then be used for specific purposes (Pellini and Jones 2011). The top and second level categories are adjusted when the subordinate level categories are created. Third-to-bottom level categories are developed from tags and terms that are extracted from online information using a bottom-up approach.

When developing lower-level categories and collecting terms, folksonomy thinking is integrated into the bottom-up approach. Folksonomy is a powerful and innovative tool that complements taxonomies and helps reduce the taxonomy’s rigidity (Pellini and Jones 2011). Therefore, user-provided tags and terms from social media are absorbed into the taxonomy. For example, road and traffic situations gain intensive attention after an earthquake occurs. “Airport,” “air traffic control,” “broken roads,” “blocked roads,” and “road traffic control” were mentioned by media coverage repeatedly during the earthquakes, such as CNN for Haiti earthquake, Tencent, Sina, and Sohu for Ya’an earthquake (2013), USAID for Ecuador earthquake (2016), and Google Crisis Response for several disasters. “Road blockage,” “traffic accident,” “traffic blockage,” etc. were hot topics that attracted followers’ comments after Ya’an earthquake on microblogs such as Sina Weibo and Tencent Weibo. These terms can be seen as tags and integrated into the taxonomy.

3.2 Selecting information resources and the terms of the taxonomy

The terminology of a subject domain is obtained by explicating natural language words and phrases (Svenonius 2000). Online information, such as social tagging, can actually help in identifying new terms and categories and in adapting and changing existing taxonomies (Pellini and Jones 2011). Terms restricted to the scope of earthquake response and recovery were extracted from three main types of online information sources: web pages, social networks, and research reports. Websites are the main online venue to transfer disaster information, which may contain a large number of terms of special topics. Social networking reflects what the public is paying attention to during an event and provides opportunities for new ways of creating and managing taxonomies (Pellini and Jones 2011). Research reports, normally of specific topics related to disaster management, are important information sources that supplement web pages and social networks. We used web search engines such as Google and Baidu and microblogs’ internal search to retrieve the three types of information sources. We used query terms covering Haiti earthquake (2010), Japan Earthquake (2011), Ya’an earthquake (2013), Nepal earthquake (2015), Ecuador earthquake (2016), emergency response, relief response, disaster response, and disaster recovery, etc. The major information resources under these three sources are provided in Appendix 1.

The scope of the terms needs to be defined when selecting the terms. Classification is purposeful, and every classification brings together resources that go together to differentiate among them (Hemerly et al. 2013). Defining the scope of a domain “helps to ensure that terms to be added or removed from a vocabulary are made on a consistent basis” (Svenonius 2000, 134). Earthquake response and recovery involve a broad range of actions. Terms that match these actions are all selected, such as disaster situation, losses, services, needs, operations, facilities, modality, emotion, programs, and orientation/philosophy (AIRS 2019).

3.3 Procedure

The guidelines from Pellini and Jones (2011) and Wu and Yang (2015) were referenced during the development of the taxonomy. The general procedure is as follows:

Step 1: Select the online information resources and extract terms.
More than 3,500 terms were manually extracted from a total of fifty-three websites and social media sites, and twenty-five digitally published guidelines, field handbooks, and working papers were adopted for extracting terms and categories. When selecting the terms, titles, section headings, and tags of the sites and digital publications, which reflected the topics directly, were extracted preferentially. Then, generally, the nouns in the content that supported the topics were extracted.

Step 2: Normalize all the terms by translating, converting plural forms to singular, removing duplicate terms, and standardizing terms with multiple expressions into one.
A total of 1,574 terms were kept. The mapping of multiple expressions to one standard term is kept for future study.

[...] Recovery Framework, which satisfies the need of various response and recovery services under severe time pressure during an earthquake disaster.

Step 5: Load and accommodate the bottom-level clusters into the basic taxonomy. Build middle-level categories by combining the top-down and bottom-up approaches.

Step 6: Revise and adjust the categories by adding, splitting, transposing, and merging to reach mutual exclusivity between sibling categories.

The seven-level earthquake disaster response and recovery taxonomy was built. The number of categories at each level is shown in Table 1. Appendix 2 presents a snippet of the top two-level categories of the taxonomy, and Appendix 3 presents a snippet of the bottom levels of the taxonomy.

Level                  1st   2nd   3rd   4th   5th   6th   7th
Number of categories     5    50   218   539   563   163    36

Table 1. Numbers of categories at each level of the taxonomy.

4.0 Difficult problems and compromised solutions

A couple of structural problems emerged in the process of the taxonomy development. We have partial solutions but probably not the best ones. The timeframe is an important foundation of the taxonomy. The top-level categories reflect this feature by using time-related terms: response, short-term recovery, intermediate-term recovery, and long-term recovery. However, not all the terms are time-restricted. Many services are processes extended from the response phase to the recovery phase, such as shelter, evacuation center, medical care, food and water supply, and people finding. Within the recovery phase, some services run through the whole process. To solve this problem, we demoted short-term recovery, intermediate-term recovery, and long-term recovery to second-level categories and added the more comprehensive term “recovery” as their top-level category. We then added “general” as the sibling category to the demoted ones. This partially solved the
structural problem but increased the repetition of lower- Step 3: Cluster the terms into homogeneous categories level categories. For example, the third-level category “hous- based on subject. Bottom-level and higher-level clusters ing” exists in all three phases of recovery, and it could not are generated. be classified into “general” category, because “housing” in Six clusters were created for the 1,574 normalized each phase has a special connotation. In long-term recovery, terms. “housing” mainly means permanent housing solution, and Step 4: Build the basic taxonomy from existing taxono- in intermediate recovery it means temporary housing, while mies related to disaster response and recovery. in short-term recovery it means sheltering more than hous- The top two levels were initially built using the two- ing. The current taxonomy achieves categorical homogene- level FEMA categories from the National Disaster ity but also contains term redundancy. Knowl. Org. 46(2019)No.2 83 Li Yang and Yejun Wu. Creating a Taxonomy of Earthquake Disaster Response and Recovery…
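The redundancy described for “housing” can be made concrete with a small sketch. It assumes the recovery branch is held as a nested Python dictionary; that layout and the helper name `repeated_labels` are illustrative assumptions rather than part of the published taxonomy, and only the “housing” category named in the text is filled in.

```python
from collections import defaultdict

# Illustrative fragment of the "recovery" branch:
# top level -> second level -> list of third-level labels.
# Only "housing" comes from the text; the layout is an assumption.
recovery = {
    "recovery": {
        "general": [],
        "short-term recovery": ["housing"],
        "intermediate-term recovery": ["housing"],
        "long-term recovery": ["housing"],
    }
}

def repeated_labels(branch):
    """Return third-level labels that occur under more than one second-level category."""
    parents = defaultdict(list)
    for second_levels in branch.values():
        for second, thirds in second_levels.items():
            for third in thirds:
                parents[third].append(second)
    return {label: where for label, where in parents.items() if len(where) > 1}

duplicates = repeated_labels(recovery)
print(duplicates)
```

A check of this kind only reports where a label repeats; whether the repetition is acceptable (as with the three connotations of “housing”) remains an editorial decision.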

An alternative possible solution to avoiding redundancy is to put those non-time-restricted categories (such as shelter, evacuation center, medical care, food and water supply) at the same hierarchical level as response and recovery. Putting temporal categories together with non-temporal categories can be found in the Library of Congress Classification (LCC) when classifying world history (Library of Congress 2018). Under Sub-class D(204)-(475) of LCC, modern history has the following temporal categories: 1453-1648, 1601-1715, 1715-1789, and 1789-. Under “1789-,” there are both temporal categories (i.e., “period of the French Revolution,” “19th century,” “20th century,” “World War I,” “period between World Wars,” “World War II,” “post-war history”) and non-temporal categories (i.e., “developing countries,” “eastern hemisphere,” “Europe-general”). This approach sacrifices categorical homogeneity to keep structural non-redundancy. Another possible solution is to create a category of “cross-phase issues” to contain those non-time-restricted categories. This category is at the same hierarchical level as response and recovery, but contains hodgepodge terms. This approach avoids redundancy of categories and terms but sacrifices the homogeneity of categories and terms. Which of these three approaches makes more sense to the users will have to be tested by a usability study in the future.

Another problem is that, as a first-level category, “stakeholder” is not dispersed into each phase of the timeframe. For example, “search and rescue team” is not a sub-category of “search and rescue,” “governmental agency” is not a sub-category of “government management,” and “doctor” is not a sub-category of “medical services.” Having a separate category of “stakeholders” can reduce unnecessary repetition and produce a map of stakeholders. However, the semantic relations between actions and stakeholders are cut off. This problem needs a better solution, possibly by adding cross-references between the stakeholders and actions.

5.0 Applying the taxonomy to organize and share online information: two case studies

This section presents two case studies of using the taxonomy to organize and share earthquake response and recovery information: the World Vision web portal and Sina Weibo (a microblog site). The purpose of the two case studies is to demonstrate that the taxonomy is useful and has some external validity.

5.1 Information organization

5.1.1 World Vision web portal

World Vision Inc. is a registered nonprofit organization. One of its pages (now.worldvision.org, accessed 30 June 2016) presents information related to emergencies and disasters to which it had responded or was responding. It has eight menu items in the menu bar, which are “Nepal,” “ebola,” “South Sudan,” “Syria,” “photos,” “videos,” “disaster response,” and “prayer.” These menu items are not classified based on a systematic and normalized taxonomy, but mainly by hot topics such as places, disasters, formats of information, and actions, which is characteristic of web portals designed to catch the eye. An instance of applying the taxonomy to the website is presented in the following.

The instance is the web page titled “Aid Organizations Face Challenges in Rush to Help Nepal Earthquake Survivors Desperately in Need.” Food, water, temporary shelter, the death toll, tarps, blankets, first aid kits, sleeping mats, protection for children, international relief organizations, and flight are mentioned in the web page. If single-label classification is preferred (Qi and Davison 2009), the web page can be classified into the categories “response”-“emergency relief commodity” according to the title to cover a relatively comprehensive topic. If multi-label classification is preferred (Qi and Davison 2009), all the keywords plus the keywords Nepal earthquake, survivor, and aid organizations abstracted from the title could be classified into the corresponding categories. We can index the web page by tagging it with the categories used when classifying, or we can reduce the granularity to get fewer tags. For example, categories “tarps,” “blankets,” and “sleeping mats” share the same third-level category, “household item,” which means “household item” can be used in lieu of the three.

When classifying the eighty-four web pages of the website into the taxonomy, 211 topics were classified into the top-level categories, whereas fifty-seven topics were uncategorized. The classification of the web pages means that, conversely, we can use the corresponding categories of the taxonomy to tag the web pages. Generally, most of the web pages are displayed in chronological order. Valuable attributes of the web pages can be displayed by tags selected from the taxonomy. It is expected that some categories in the taxonomy may need to be revised and some new categories may be added. The following findings can be made based on this case study.

First, using timeframe to organize online disaster response and recovery information is reasonable. The World Vision website has a category of “disaster response,” and all the contents under this category can be classified into “response,” which is a top-level category of the taxonomy.

Second, modularity creates options of application (Clark and Baldwin 2002). The modular design increases the flexibility and feasibility of the taxonomy, since the website, like most other sites, is not restricted to earthquakes. A flexible tool is critical to organize information from this kind of website. For instance, “dead body burial” can be used to manage “dead body burial of Ebola.” All the information under the “prayer” menu item can be classified into the “prayer” category.

Third, the taxonomy has adaptability and extensibility to deal with specific disasters and situations. For example, “children” is an independent category appearing in “Nepal,” “ebola,” “South Sudan,” “Syria” and other emergency responses on the World Vision website. For Nepal’s high altitude and shortage of electricity, categories such as “effect of high altitude,” “winter cold,” and “electricity outage” are suitable for these special situations.

5.1.2 Sina Weibo

Sina Weibo is the most popular microblog in China. The “topics” and the “official accounts” related to disasters are the two primary venues to access disaster information on its site. We have collected sixty-three topics and forty official accounts with more than 3,000 posts. The topics mainly focus on eight fields: earthquake information, relief and response, education, prayer and love, donation, person finding, memorial, and recovery and reconstruction. Some of these topics have classification labels and tags to describe the contents of posts. Some problems appear. First, not all the topics have classification labels and tags. Second, similar topics have discrepant classifications. For instance, the topic “person finding in Ya’an” has the category of “others” and region of “national,” whereas the topic “family finding in Ya’an earthquake” has the category of “social” and region of “Hebei Province of China.”

The posts following one topic are classified automatically into the category of the topic. For instance, there are several posts under the topic “traffic information of Ya’an earthquake.” We can classify all of the posts into the second-level category “traffic and travel information.” If greater granularity is needed, we can break them down by the third-level, fourth-level, etc. For the posts not following related topics, the categories in the earthquake response and recovery taxonomy can be used as facets. The facets can be combined for a post-coordinate index. For example, a post with content “two soldiers have reached Chengdu by air medical evacuation” can be indexed by categories of “Sichuan Ya’an earthquake” + “soldiers” + “injured person” + “medical air evacuation.”

A total of thirty-four categories were added and sixteen were revised after applying the taxonomy to categorize the online disaster-related information on the World Vision web portal and Sina Weibo. Table 2 shows the number of added and revised categories at each level of the taxonomy. Less than 10% of the categories from the second to fifth levels were added or revised, showing a relative stability of the taxonomy, although there is no systematic study so far showing how much revision is considered stable. The added and revised categories have increased the adaptability of the taxonomy.

5.2 Information sharing

A framework of knowledge sharing (Carlile 2004; Yates and Paquette 2011) describes three progressively complex boundaries (syntactic, semantic, and pragmatic) and three progressively complex processes (transfer, translation, and transformation). This framework illustrates how to manage and share knowledge by crossing these boundaries sequentially. The taxonomy contributes to the first two boundaries of the model. The syntactic boundary corresponds to transferring knowledge. The primary focus is the storage and retrieval of knowledge, for which a common lexicon is necessary. The semantic boundary corresponds to translating knowledge. Establishing common meaning is a way to address semantic differences such as ambiguous meaning and interpretive differences.

This framework is used to demonstrate how the taxonomy supports information sharing and exchange in the first two boundaries. As a knowledge organization tool, a taxonomy provides standardized terms systematically. These terms establish the common cognition of a certain domain and are used to organize and retrieve knowledge. The earthquake response and recovery taxonomy contains standardized vocabularies that help users build a cognitive basis for online information exchange. In addition, a taxonomy helps to put knowledge into practice by making sense of the knowledge and providing a common way of working (Pellini and Jones 2011). The structure of the taxonomy, the facets, and the modules in the structure semantically reflect the activities of disaster response and recovery. Table 3 shows exemplifications from the two case studies.

6.0 Summary of findings and implications

An earthquake response and recovery taxonomy has been built for the purpose of managing earthquake-related online information resources. The categories of the taxonomy can be used to organize online information for browsing.

Level                                              1st  2nd  3rd  4th  5th  6th  7th
Number of categories in the original taxonomy        5   50  218  539  563  167   36
Added categories                                     0    3   19    7    5    0    0
Revised categories (including structure update)      0    2    9    5    0    0    0

Table 2. Numbers of added and revised categories at each level.
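The stability figure quoted for the second to fifth levels can be recomputed from Table 2. The sketch below is one possible reading, assuming the share is taken as (added + revised) divided by the number of original categories; the function name `changed_share` is hypothetical.

```python
# Counts for levels 2-5, copied from Table 2.
original = {2: 50, 3: 218, 4: 539, 5: 563}
added = {2: 3, 3: 19, 4: 7, 5: 5}
revised = {2: 2, 3: 9, 4: 5, 5: 0}

def changed_share(levels):
    """Added plus revised categories, relative to the original count, over the given levels."""
    changed = sum(added[level] + revised[level] for level in levels)
    return changed / sum(original[level] for level in levels)

overall = changed_share([2, 3, 4, 5])
per_level = {level: changed_share([level]) for level in original}

print(f"levels 2-5 combined: {overall:.1%}")
for level, share in per_level.items():
    print(f"level {level}: {share:.1%}")
```

Under this assumption the combined share for levels 2 to 5 is 50 out of 1,370 categories, roughly 3.6%. The per-level ratios differ from one another and are highest at the third level, so the combined figure is the one that matches the statement in the text.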

Boundary
    [1] Syntactic boundary
    [2] Semantic boundary

Process
    [1] Transfer information: information processing
    [2] Translate information: creating shared meanings

Solutions of the taxonomy
    [1] Standardized terms to facilitate information retrieval and browsing
    [2] A facet classification for semantically cohesive categories

Instances (World Vision)
    [1] Food insecurity represents “food crisis,” “hunger crisis,” “hungry,” “need food” and other related terms and topics on the World Vision Website.
    [2] A webpage with the topic “Updated Ecuador earthquake: World Vision responds” is classified into “shelter supply,” while a webpage with the topic “Nepal earthquake: Shelter” is classified into “shelter need.”

Instances (Sina Weibo)
    [1] “Person finding” is used to express topics such as “find friends,” “finding relatives,” “missing person” and “find a child” on Sina Weibo.
    [2] Posts and topics like “Ya’an Lushan 7 magnitude earthquake,” “4.20 Lushan earthquake,” “Ya’an earthquake,” “Moving forward to Baoxing,” “Sichuan 4.20 Earthquake relief and response” actually refer to one quake. Baoxing is one of the towns of the epicenter Lushan, which is one of the counties of Ya’an, a city of Sichuan Province in China, and this quake happened on April 20, 2013. Lushan is classified into “epicenter,” Baoxing into “seismic region,” Sichuan into “impact landscape” and Ya’an, as a landmark, into “location.”

Table 3. Information sharing supported by the taxonomy.
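The syntactic-boundary instances in Table 3 amount to a lookup from variant expressions to one standardized term. A minimal sketch follows, using only the pairs listed in the table; the names `VARIANTS`, `standardize`, and `tag` are hypothetical, and the matching rule (case-folding and whitespace collapsing) is an assumption rather than the procedure used in the study.

```python
# Variant expressions mapped to standardized terms; pairs taken from Table 3.
VARIANTS = {
    "food crisis": "food insecurity",
    "hunger crisis": "food insecurity",
    "hungry": "food insecurity",
    "need food": "food insecurity",
    "find friends": "person finding",
    "finding relatives": "person finding",
    "missing person": "person finding",
    "find a child": "person finding",
}

def standardize(expression):
    """Return the standardized term for an expression, or None if it is not mapped."""
    key = " ".join(expression.lower().split())  # case-fold and collapse whitespace
    if key in VARIANTS.values():
        return key  # already a standardized term
    return VARIANTS.get(key)

def tag(expressions):
    """Collect the distinct standardized terms for a list of expressions, in first-seen order."""
    tags = []
    for expression in expressions:
        term = standardize(expression)
        if term is not None and term not in tags:
            tags.append(term)
    return tags

print(tag(["Find friends", "missing person", "need food", "weather"]))
```

Expressions without a mapping are simply left untagged here; in practice they would be candidates for new variants or new categories.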

The facet-based taxonomy can be used to index and tag information. The two case studies show that the taxonomy is useful and has some external validity. Based on the two case studies, when applied to organize online information resources, the taxonomy experienced slight categorical adjustments and so presents a relative stability, adaptability and extensibility; therefore, it has some internal validity. If the taxonomy is acknowledged by the earthquake management community, it can be used to organize and share earthquake-related online information of future earthquakes; therefore, it has an implication for earthquake online information management.

The earthquake response and recovery taxonomy models the knowledge structure of the earthquake management domain. It can be used by earthquake response and recovery personnel to understand the various aspects of earthquake management. It can also be used by stakeholders as a common language for communication and exchange of knowledge. Therefore, it presents a social relevance and has an implication for knowledge management.

The earthquake response and recovery taxonomy was developed from a broader perspective of disaster management based on the timeframe of disaster management and the idea of modular design. The top three level categories can be applied to the management of types of disasters other than earthquakes. Using timeframe to organize online disaster-related information is reasonable based on the two case studies. The modular design allows the taxonomy to be easily revised to be a taxonomy for other types of disasters; therefore, it has an implication for disaster management.

7.0 Conclusions and future work

Online information explosion and time-sensitive demand of victims after an earthquake make disaster response challenging. Obstacles and constraints limit information exchange that supports decision making. Organized information with standard terms and relationship structures can facilitate information sharing during disaster response and recovery. An earthquake response and recovery taxonomy is built using online information resources from various sources for the purpose of managing online information, especially for web portals and social media sites. The taxonomy is developed using a combination of the top-down and bottom-up approaches. The key features are four-fold: a combination of the enumerative and faceted classification, the timeframe-based categories in different phases of disaster management to address the time sensitivity, the folksonomy thinking in category creation contributing to the flexibility of the taxonomy, and the modular architecture for adaptability and extensibility. Because of these features, this approach can be easily extended to building taxonomies for other types of natural disasters, such as flood, tornado, wildfire, etc. With standardized terminology, the taxonomy can be used by the earthquake management community and the public as a common language for exchanging and sharing information. It presents a social relevance and has implications for knowledge management, disaster management, and online information management.

Two case studies were presented to demonstrate its usefulness in managing online earthquake response and recovery information posted on the World Vision website and the Sina Weibo social network site. They demonstrated that the taxonomy is useful and has some external validity. Based on the two case studies, the taxonomy experienced slight categorical adjustments when applied to organize the information resources on the two sites. The taxonomy presents relative stability, adaptability, and extensibility, and so has some internal validity.

The study has found a new problem when building taxonomies for time-sensitive disaster response. Timeframe brings a challenge to the structure of the taxonomy. Duplicating services that span several phases under each timeframe brings redundancy of categories, and so is not an optimal solution. Classifying these services to a category of “services throughout all phases of the disaster” may hinder the understanding of the phases of disaster response and recovery.

The future work contains further studies to reach a balance between timeframe and the category redundancy. A systematic evaluation of the taxonomy is to be done to fully assess its internal validity and to find areas to be improved. More case studies are to be performed in order to demonstrate its external validity and to improve the taxonomy’s structure, categories, and terms. In the meantime, a method is to be developed to automatically apply the taxonomy to earthquake-related sites to manage the online information of ongoing earthquake response events. During this process, the folksonomies for mapping multiple terms to a standardized term can be useful in organizing web pages and online posts into the categories in the taxonomy for browsing or indexing the web pages and posts for searching.

References

AIRS-211 LA County Taxonomy of Human Services (website). 2019. https://211taxonomy.org/
Bardet, Jean-Pierre and Fang Liu. 2010. “Towards Virtual Earthquakes: Using Post-Earthquake Reconnaissance Information.” Online Information Review 34, no. 1: 59-74.
Besaleva, Liliya I. and Alfred C. Weaver. 2016. “Crowdsourcing for Emergency Response.” In Proceedings of the International Conference on Frontiers in Education: Computer Science and Computer Engineering (FECS), July 25-28, Las Vegas, Nevada, USA. CSREA Press: 248-53.
Bharosa, Nitesh, Jinkyu Lee, and Marijn Janssen. 2010. “Challenges and Obstacles in Sharing and Coordinating Information during Multi-Agency Disaster Response: Propositions from Field Exercises.” Information Systems Frontiers 12, no. 1: 49-65.
Bjerge, Benedikte, Nathan Clark, Peter Fisker and Emmanuel Raju. 2016. “Technology and Information Sharing in Disaster Relief.” PLoS ONE 11, no. 9: e0161783. doi:10.1371/journal.pone.0161783
Bogdan, Robert and Sari Knopp Biklen. 1992. Qualitative Research for Education: An Introduction to Theory and Methods. London: Allyn and Bacon.
Canadian Red Cross. 2012. “Social Media during Emergencies.” https://www.redcross.ca/crc/documents/Social-Media-in-Emergencies-Survey-Oct-2012-English.pdf
Carlile, Paul R. 2004. “Transferring, Translating, and Transforming: An Integrative Framework for Managing Knowledge across Boundaries.” Organization Science 15: 555-68.
Carter, Nick W. 2008. Disaster Management: A Disaster Manager’s Handbook. Manila, Philippines: Asian Development Bank. https://www.think-asia.org/handle/11540/5035
Chen, Sisi and Yu Fu. 2013. “突发实践中微博传播的双面性: 以雅安地震为例” [Double Faces of Weibo Communication in Emergencies: The Case of the Ya’an Earthquake]. 青年记者 [Youth Journalist] 10Z: 67-68. doi:10.15997/j.cnki.qnjz.2013.29.033
Cisco, Susan L. and Wanda K. Jackson. 2005. “Creating Order out of Chaos with Taxonomies.” Information Management Journal 39, no. 3: 45-50.
Clark, Kim B. and Carliss Y. Baldwin. 2002. The Power of Modularity. Vol. 1 of The Option Value of Modularity in Design: An Example from Design Rules. Harvard NOM Working Paper No. 02-13; Harvard Business School Working Paper No. 02-078. doi:10.2139/ssrn.31240
Coburn, Andrew W., Gary Bowman, Simon J. Ruffle, Roxane Foulser-Piggott, Daniel Ralph, and Michelle Tuveson. 2014. A Taxonomy of Threats for Complex Risk Management. Centre for Risk Studies, University of Cambridge.
Dantas, Andre, Erica Seville and Alan Nicholson. 2006. “Information Sharing during Disaster: Can We Do It Better?” https://ir.canterbury.ac.nz/handle/10092/2843
ESRI (Environmental Systems Research Institute). 2008. Geographic Information Systems Providing the Platform for Comprehensive Emergency Management. ESRI White Paper. Redlands, CA: ESRI. https://www.esri.com/library/whitepapers/pdfs/gis-platform-emergency-management.pdf
FEMA (Federal Emergency Management Agency). 2011. National Disaster Recovery Framework: Strengthening Disaster Recovery for the Nation. Washington, DC: U.S. Department of Homeland Security, Federal Emergency Management Agency. https://www.fema.gov/pdf/recoveryframework/ndrf.pdf
Foskett, Douglas J. 2010. “Facet Analysis [ELIS Classic].” In Encyclopedia of Library and Information Sciences. 3rd ed., ed. Marcia J. Bates and Mary Niles Maack. Boca Raton: CRC Press, 1818-22.
Fraustino, Julia Daisy, Brooke Liu, and Yan Jin. 2012. Social Media Use during Disasters: A Review of the Knowledge Base and Gaps. College Park, MD: National Consortium for the Study of Terrorism and Responses to Terrorism.
Garg, Muskan and Mukesh Kumar. 2016. “Review on Event Detection Techniques in Social Multimedia.” Online Information Review 40: 347-61.

Gralla, Erica, Jarrod Goentzel, and Bartel Van de Walle. 2013. “Field-Based Decision Makers’ Information Needs in Sudden Onset Disasters.” http://digitalhumanitarians.com/content/decision-makers-needs
Hemerly, Jess, Vivien Petras, Michael Manoochehri, and Longhao Wang. 2013. “Classification: Assigning Resources to Categories.” In The Discipline of Organizing, ed. Robert J. Glushko. Cambridge, MA: MIT Press, 273-316.
Hill, Linda, Olha Buchel, Greg Janee and Marcia Lei Zeng. 2002. “Integration of Knowledge Organization Systems into Digital Library Architectures.” In 13th ASIS SIG/CR Classification Research Workshop, ed. Clare Beghtol, Jonathan Furner, Barbara Kwasnik and Jens-Erik Mai. Advances in Classification Research Online 13. doi:10.7152/acro.v13i1.13835
Hjørland, Birger. 2013. “Facet Analysis: The Logical Approach to Knowledge Organization.” Information Processing & Management 49: 545-57.
Holgate, Lisle. 2004. “Creating and Using Taxonomies to Enhance Enterprise Search.” Information Today 21, no. 7: 10-11.
Knezić, Snježana, Martina Baučić, Toni Kekez, Uberto Delprato, Giovanni Tusa, Alexander Preinerstorfer and Gerald Lichtenegger. 2015. “Taxonomy for Disaster Response: A Methodological Approach.” In Evolving Threats and Vulnerability Landscape: New Challenges for the Emergency Management, Proceedings of the International Emergency Management Society 22nd TIEMS Annual Conference, Rome, Italy, September 30-October 2, 2015, ed. Knezic, Snjezana and Meen Poudyal Chhetri, 22.
Kwasnik, Barbara H. 1999. “The Role of Classification in Knowledge Representation and Discovery.” Library Trends 48, no. 1: 22-47.
Lauckner, Heidi, Margo Paterson, and Terry Krupa. 2012. “Using Constructivist Case Study Methodology to Understand Community Development Processes: Proposed Methodological Questions to Guide the Research Process.” The Qualitative Report 17, no. 13: 1-24.
Library of Congress. 2018. “Library of Congress Classification Outline: D - World History and History of Europe, Asia, Africa, Australia, New Zealand, etc.” https://www.loc.gov/aba/cataloging/classification/lcco/lcco_d.pdf
Lincoln, Yvonna S. and Egon Guba. 1985. Naturalistic Inquiry. Newbury Park, CA: Sage.
Marianti, Ruly. 2007. What Is to Be Done with Disasters? A Literature Survey on Disaster Study and Response. SMERU Working Paper. Jakarta: SMERU Research Institute. http://www.smeru.or.id/sites/default/files/publication/disastermanagement_eng.pdf
Martin, Nigel and John Rice. 2012. “Emergency Communications and Warning Systems: Determining Critical Capacities in the Australian Context.” Disaster Prevention and Management: An International Journal 21, no. 5: 529-40.
Maxwell, Joseph A. 2006. Qualitative Research Design: An Interactive Approach. 2nd ed. Thousand Oaks: Sage.
Merriam, Sharan B. 1988. Qualitative Research and Case Study Applications in Education. San Francisco: Jossey-Bass.
OCHA (United Nations Office for the Coordination of Humanitarian Affairs). 2014. “On-Site Operations Coordination Centre (OSOCC) Guidelines.” https://www.unocha.org/sites/dms/Documents/2014%20OSOCC%20Guidelines_FINAL.pdf
OCHA-ROAP (United Nations Office for the Coordination of Humanitarian Affairs-Regional Office for Asia and the Pacific). 2017. “Disaster Response in Asia and the Pacific: A Guide to International Tools and Services.” https://www.unocha.org/sites/unocha/files/ROAP_DisasterGuide.pdf
Pellini, Arnaldo and Harry Jones. 2011. “Knowledge Taxonomies: A Literature Review.” Evidence Ideas Change (blog), May. https://www.odi.org/publications/5753-knowledge-taxonomies-literature-review
Qi, Xiaoguang and Brian D. Davison. 2009. “Web Page Classification: Features and Algorithms.” ACM Computing Surveys 41, no. 2: article 12.
Quintarelli, Emanuele, Andrea Resmini, and Luca Rosati. 2007. “Facetag: Integrating Bottom-Up and Top-Down Classification in a Social Tagging System.” Bulletin of the Association for Information Science and Technology 33, no. 5: 10-5.
Ramos, Laura and Daniel W. Rasmus. 2013. “Best Practices in Taxonomy Development and Management.” Planning Assumption (blog), Jan. 8. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.201.4848&rep=rep1&type=pdf
Sakurai, Mihoko. 2016. “Communication Platform for Disaster Response.” In Tackling Society's Grand Challenges with Design Science: 11th International Conference, DESRIST 2016, St. John’s, NL, Canada, May 23-25, 2016, Proceeding, ed. Jeffrey Parsons, Tuure Tuunanen, John Venable, Brian Donnellan, Markus Helfert, and Jim Kenneally. Cham: Springer, 223-7. doi:10.1007/978-3-319-39294-3_19
Svenonius, Elaine. 2000. The Intellectual Foundation of Information Organization. Cambridge, MA: MIT Press, 127-46.
Tuli, Fekede. 2010. “The Basis of Distinction between Qualitative and Quantitative Research in Social Science: Reflection on Ontological, Epistemological and Methodological Perspectives.” Ethiopia Journal of Education and Sciences 6, no. 1: 97-108.
UN (United Nations). 2008. Disaster Preparedness for Effective Response: Guidance and Indicator Package for Implementing Priority Five of the Hyogo Framework. Geneva: United Nations. https://www.unisdr.org/files/2909_Disasterpreparednessforeffectiveresponse.pdf

UN (United Nations). 2018. United Nations Disaster Assessment and Coordination: UNDAC Field Handbook. 7th ed. [Geneva]: United Nations Office for the Coordination of Humanitarian Affairs. https://www.unocha.org/sites/unocha/files/1823826E_web_pages.pdf
UNCHA (United Nations Office for the Coordination of Humanitarian Affairs). 2002. “Symposium on Best Practices in Humanitarian Information Exchange.” https://reliefweb.int/report/world/symposium-best-practices-humanitarian-information-exchange
UNISDR (United Nations Office for Disaster Risk Reduction). 2017. “Terminology.” https://www.unisdr.org/we/inform/terminology
USGS (United States Geological Survey). 2019. Earthquake Statistics. http://earthquake.usgs.gov/earthquakes/browse/stats.php
Van de Walle, Bartel and Tina Comes. 2015. “On the Nature of Information Management in Complex and Natural Disasters.” Procedia Engineering 107: 403-11.
Vickery, Brian Campbell. 1966. Faceted Classification Schemes. Rutgers Series on Systems for the Intellectual Organization of Information. 46. New Brunswick, NJ: Graduate School of Library Science at Rutgers University.
Wang, Zhonghong, Abdus Sattar Chaudhry, and Christopher Khoo. 2010. “Support from Bibliographic Tools to Build an Organizational Taxonomy for Navigation: Use of a General Classification Scheme and Domain Thesauri.” Knowledge Organization 37: 256-69.
Wu, Yejun and Li Yang. 2015. “Construction and Evaluation of an Oil Spill Semantic Relation Taxonomy for Supporting Knowledge Discovery.” Knowledge Organization 42: 222-31.
Yates, Dave and Scott Paquette. 2011. “Emergency Knowledge Management and Social Media Technologies: A Case Study of the 2010 Haitian Earthquake.” International Journal of Information Management 31, no. 1: 6-13.

Appendix 1.
Major information resources that are used to build the taxonomy

1. Websites and portals

– Online-published coverage from news agencies, such as CNN.com.
– Coverage from web portals, such as Sina.com.
– Websites of humanitarian organizations, such as Oxfam.org.
– Websites of government agencies, such as FEMA.gov.
– Websites of education and research institutes, such as NIED.go.jp.

2. Social network sites

– Social networks and blogs, such as Sina Weibo.
– Bookmarking sites, such as Blogmarks.net.
– Content communities, such as Quora and Zhihu.

3. Research reports

– Handbooks, such as Disaster Response in Asia and The Pacific: A Guide to International Tools and Services (OCHA-ROAP, 2017).
– Field Handbooks, such as the United Nations Disaster Assessment and Coordination (2018) by the Office for the Coordination of Humanitarian Affairs of the United Nations.
– White papers, such as Geographic Information Systems Providing the Platform for Comprehensive Emergency Management (Environmental Systems Research Institute, 2008).
– Working papers, such as the National Emergency Communications Plan (U.S. Department of Homeland Security: https://www.dhs.gov/publication/2014-national-emergency-communications-plan).
– Research papers, such as A Taxonomy of Threats for Complex Risk Management (Coburn et al., 2014).

Appendix 2.
A snippet of the top two levels of the taxonomy

1 disaster
1.1 disaster damage
1.2 location
1.3 post-disaster effect
1.4 secondary disaster
1.5 time
1.6 type

2 preparedness
2.1 community capacity and resilience-building
2.2 disaster preparedness conduction
2.3 legal preparedness
2.4 mitigation implementation
2.5 partnership building
2.6 pre-disaster recovery planning

3 response
3.1 general
3.2 authority action
3.3 communication
3.4 dead body
3.5 disease
3.6 displacement and shelter
3.7 donation

  3.8 emergency relief commodity
  3.9 emergency response level
  3.10 infrastructure
  3.11 food
  3.12 help seeking
  3.13 logistics
  3.14 paramedic and medical care
  3.15 person searching
  3.16 search and rescue
  3.17 stress
  3.18 traffic and travel information
  3.19 warning and notification
  3.20 water, sanitation and hygiene
4 recovery
  4.1 short-term recovery
  4.2 intermediate recovery
  4.3 long-term recovery
5 stakeholders
  5.1 animal health agency
  5.2 celebrity
  5.3 civil society
  5.4 donor
  5.5 earthquake-affected population
  5.6 education and research
  5.7 expert
  5.8 financial institution
  5.9 general public
  5.10 individual and volunteer organization
  5.11 international organization
  5.12 journalist
  5.13 mother, infant and child
  5.14 national authority
  5.15 non-governmental organization
  5.16 private sector
  5.17 relief and rescue worker

Appendix 3. A snippet of the bottom levels of the taxonomy

3.7 emergency relief commodity
  3.7.1 relief item
  3.7.2 communication item
  3.7.3 power supply item
  3.7.4 household item
    3.7.4.1 waterproof supply
      raingear
      canopy
      tarp
    3.7.4.2 cold-weather essential
      blanket
      clothing
      underwear
      headscarf
      other
      glove
      sleeping bag
      hand warmer
    3.7.4.3 individual emergency bag
      general
      band-aid
    3.7.4.4 hygiene product
      hygiene kit
      oral and dental hygiene
      toothbrush
      toothpaste
      trash bag
      cleaning supply
      bathing soap
      laundry soap
      napkin
      bathing tissue
      hygiene material
      sanitary material for menstruation
      washable nappy or diaper
  3.7.5 medical item
  3.7.6 commodity management
  3.7.7 commodity need

90 Knowl. Org. 46(2019)No.2 Chengzhi Zhang, Hua Zhao, Xuehua Chi and Shuitian Ma. Information Organization Patterns from Online Users in a Social Network

Information Organization Patterns from Online Users in a Social Network† Chengzhi Zhang*, Hua Zhao**, Xuehua Chi*** and Shuitian Ma**** Nanjing University of Science and Technology, Department of Information Management, Nanjing, 210094, China, *, **, ***<[email protected]>, ****

Chengzhi Zhang is Professor of information science in the Department of Information Management, Nanjing University of Science and Technology. He was also a visiting scholar at the University of Pittsburgh, Pittsburgh (2013) and a Senior Research Associate at the City University of Hong Kong, Hong Kong SAR (2010~2011). His research interests are information organization, digital libraries, information retrieval, text mining, natural language processing, etc. He has published approximately 100 articles in refereed journals and conference proceedings including JASIST, ASLIB JIM, OIR, Scientometrics, etc. He is an editorial advisory board member of The Electronic Library, Data Intelligence and Technology Intelligence Engineering.

Hua Zhao received her bachelor’s degree in information management and system and her master’s degree in information science from the Nanjing University of Science and Technology, Nanjing, China, in 2014 and 2017, respectively. In 2018, she joined Ctrip.com International Ltd, Shanghai, China, as an algorithm engineer. Her research interests include text mining and natural language processing.

Xuehua Chi is a graduate student at Nanjing University of Science and Technology. She received her bachelor’s degree in information management and information system from Nanjing University of Science and Technology as well. Her research interests include user modeling, text mining, information systems, and knowledge organization.

Shutian Ma is a doctoral student at Nanjing University of Science and Technology. She is now in the last year of a combined master’s and doctoral program and also received her bachelor’s degree in information management and information system from Nanjing University of Science and Technology. Her research interests include citation recommendation, embedding algorithms, clustering and classification, and knowledge organization.

Zhang, Chengzhi, Hua Zhao, Xuehua Chi and Shuitian Ma. 2019. “Information Organization Patterns from Online Users in a Social Network.” Knowledge Organization 46(2): 90-103. 37 references. DOI:10.5771/0943-7444-2019-2-90.

Abstract: Recent years have seen the rise of user-generated contents (UGCs) in online social media. Diverse UGC sources and information overload are making it increasingly difficult to satisfy personalized information needs. To organize UGCs in a user-centered way, we should not only map them based on textual topics but also link them with users and even user communities. We propose a multi-dimensional framework to organize information by connecting UGCs, users, and user communities. First, we use a topic model to generate a topic hierarchy from UGCs. Second, an author-topic model is applied to learn user interests. Third, user communities are detected through a label propagation algorithm. Finally, a multi-dimensional information organization pattern is formulated based on similarities among the topic hierarchies of UGCs, user interests, and user communities. The results reveal that: 1) our proposed framework can organize information from multiple sources in a user-centered way; 2) hierarchical topic structures can provide comprehensive and in-depth topics for users; and, 3) user communities are efficient in helping people to connect with others who have similar interests.

Received: 3 June 2018; Sixth revision: 17 January 2019; Accepted: 18 January 2019

Keywords: topic model, user, information, social networks

† This work is supported by the National Social Science Fund (No. 14BTQ033), Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Institute of Scientific and Technical Information of China (No. ZD2018-07/01) and Qing Lan Project.


1.0 Introduction

Various online social networks have been created for people to communicate and share information with each other around the world. There is also increasing growth in the numbers of social media users. For example, at the end of September 2017, Twitter had 330 million active monthly users (https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/), while Sina Weibo (https://weibo.com), one of the biggest Chinese microblog platforms, had 376 million active monthly users (http://www.useit.com.cn/thread-17562-1-1.html). Simultaneously, vast quantities of user-generated contents (UGCs), referring to images, videos, text, audio, and any other form of content, are being posted by users in online platforms (https://en.wikipedia.org/wiki/User-generated_content). Due to their huge volume, uneven quality, and dynamic changes, UGCs pose new challenges for mining and organizing content (Zhu et al. 2013, 233).

To provide organized information for users, most of the current social networks methodically present UGCs. For instance, Sina Weibo displays posts in chronological order, whereas users may prefer contents to be sorted by topics of their interests. There are a large number of redundant communities and inactive groups in social networks, since most cannot organize information efficiently to meet users’ needs (Treem and Leonardi 2012, 143). Researchers are seeking ways to group and streamline large amounts of UGCs (Kietzmann et al. 2011; Van Damme et al. 2007). Traditional knowledge organization tools employ conventional relations between concepts, subjects, and information units, whereas current studies focus more on users, aiming to organize and integrate information in a user-centered way (Hjørland 2003; 2014).

There are three indispensable elements of a social network platform: UGCs, users, and user communities. Since most related research is focused on one element in attempting to organize information (Ming et al. 2014; Zhu et al. 2014), we aim to develop a multi-dimensional information organization system by linking the three elements. To represent UGCs, we employ a topic model to construct a topic hierarchy, which has been widely used in many studies (Zhang 2017; Zhu et al. 2013; Zhu et al. 2014). We then generate profiles of users’ interests through user modeling and detect community structure to reveal network organization of users. Finally, to support user-centered information organization, we perform similarity calculations to associate UGCs, users, and user communities with one another.

2.0 Related works

Of the rich set of studies on online information organization, this section mainly discusses research into UGCs and user information behavior.

2.1 Information organization based on UGCs

With the popularization of web 2.0, UGCs have received much attention from researchers. Social tags and folksonomies are becoming popular among different types of UGCs (Kim 2008). Researchers have made particular progress on information organization (Munk and Mørk 2007). Noruzi and Alireza (2006) explored the folksonomy tagging phenomenon and discussed relevant problems. Mathes (2004) provided the first review and survey of social tagging systems. Potnis (2011, 32-35) allowed users to participate in information organization by using folksonomy. Finally, Van Damme et al. (2007) proposed an integrated approach for turning folksonomies into ontologies and using networked knowledge organization systems for information organization.

While mass UGCs provide rich information resources, they also present new challenges, like the integration of heterogeneous contents. Although classifying UGCs in pre-customized categories is simple, classification accuracy depends on the manual maintenance of a category system (Gao 2012, 761). Chen et al. (2011) organized UGCs through a topic model, which does not require maintenance of a classification system. However, the topics obtained from their model were somewhat difficult to understand, and its classification accuracy needs to be improved. Gupta et al. (2010) used social labels for information organization, which greatly reduces manual costs, but their method has problems of data sparseness due to label scarcity. Zhu et al. (2013) proposed topic hierarchy construction for UGCs, while Li (2013) constructed the hierarchical architecture in an entity-based formalism. These studies are very relevant but do not distinguish between different types of UGCs, and few studies have attempted multi-dimensional information organization.

2.2 Information organization based on user information behavior

The research related to user information behavior can be divided into theoretical and empirical works. Theoretical research has explored, for instance, the behavior of information query (Bawden 2006), information interaction (Chen et al. 1998; Keenan et al. 2013), information creation (Maria et al. 2008), information utilization (Kaplan et al. 2010), and information sharing (Stutzman 2006). Empirical research has mainly focused on information behavior (Benevenuto et al. 2009), user influence (Agichtein et al. 2008), and user social relationships (Kwak et al. 2010). Overall, few studies have explored user behavior with respect to organizing information. Bar-Ilan and Belous (2007) studied the theory and practice of information organization and its relationship with human perception, proposing an intuitive information organization system. Bu et al. (2010) emphasized the importance of considering user behavior, participation, and experience when designing an information organization interface. There is clearly a need for more in-depth research of user behavior, especially with respect to organizing information on social networks.

Traditional methods of information organization rely on semantic or topical relations between textual contents to organize UGCs, without considering the users who generate contents and the user communities in which similar users gather. Despite the increasing shift toward user-centered information integration (Zhou 2010, 36-40), focusing especially on relationships between information and users, only a few studies have considered a multi-dimensional approach to organizing information, combining UGCs with users and user communities. By fully considering the features of UGCs, this study endeavors to associate these three elements to generate a new pattern of knowledge organization in social networks.

3.0 Methodology

3.1 Framework

To connect UGCs, users, and user communities, we attempt to map them in a topic space. Our framework comprises four main steps: 1) hierarchical topic construction for UGCs; 2) user modeling to represent user interests; 3) community detection; and, 4) multi-dimensional information association to link UGCs, users, and user communities. Our experimental process is depicted in Figure 1.

We begin by collecting UGCs from verified users’ microblogs and the relationships between users in five domains (football, internet, literature, medicine, law) on Sina Weibo. We then use these UGCs (taken from posts that users create, like, and repost) to generate the topic hierarchy. To obtain a fusion model of users, we merge features learned from users’ interests and relations (follower, following) using an author-topic model. To detect user communities, we employ a label propagation algorithm. Finally, a multi-dimensional information organization system is constructed by calculating topic similarities among the UGC topic hierarchy, the user fusion model, and detected user communities.

In the next section, we will discuss the key techniques employed in UGC topic hierarchy construction, user modeling, community detection, and multi-dimensional information association.

Figure 1. Framework of the study’s multi-dimensional information organization.


3.2 Key techniques

3.2.1 UGC topic hierarchy construction

3.2.1.1 Latent Dirichlet Allocation (LDA)

LDA is a three-level hierarchical Bayesian model (Blei et al. 2003). It assumes that each item in a collection is generated by a finite mixture of topics, each of which is modeled as a multinomial mixture of vocabulary. It also assumes that each document is modeled as a finite mixture over an underlying set of topics. The topic mixture is then drawn from a conjugate Dirichlet prior that remains the same for all documents. The LDA model contains the following parameters: α, the Dirichlet prior parameter of the document-topic distribution; β, the Dirichlet prior parameter of the topic-word distribution; K, the topic number; d, the document; and z, the topic. This paper uses the LDA model to derive topical representations for verified users in the five domains of law, medicine, literature, football, and internet.

3.2.1.2 Document topic extraction

After topic modeling, each document is projected into the topic space with different probability distributions, as shown in equation 1 below. We then set a threshold to assign a specific topic to each document; if the topic probability of a document in the topic space exceeds the predefined threshold, the document is assigned to this topic.

$$\mathit{doc}_i = \{\mathit{topic}_1: w_{i,1},\ \mathit{topic}_2: w_{i,2},\ \ldots,\ \mathit{topic}_j: w_{i,j},\ \ldots,\ \mathit{topic}_n: w_{i,n}\} \quad \text{(equation 1)}$$

Where $\mathit{topic}_j$ represents topic j, $\mathit{doc}_i$ represents document i, and $w_{i,j}$ represents the weight of topic j in document i, with $w_{i,1} + w_{i,2} + \cdots + w_{i,j} + \cdots + w_{i,n} = 1$. The threshold is $p = 1/n$; if $w_{i,j} > p$, we assign this document to $\mathit{topic}_j$, thereby collecting the documents for each topic.

3.2.2 User modeling

3.2.2.1 Author-topic model

The author-topic model uses a topic-based representation to model both document contents and author interests. This model supposes that each author has a topic probability distribution θ, and each topic has a term probability distribution φ, as shown in Figure 2. The model generation process (Steyvers et al. 2004, 307-310) is as follows:

1) Extract the multinomial probability distribution θ for each author;
2) Extract the multinomial probability distribution φ for each topic;
3) For each item in document d: a) extract an author x; b) extract a topic z; c) extract a term w;
4) Repeat extraction Nd times to generate document d.

Figure 2. Author-topic model.

In the author-topic model: θ represents the author-topic probability distribution; φ represents the topic-term probability distribution; α is the Dirichlet prior parameter of the document-topic probability distribution; β is the Dirichlet prior parameter of the topic-term probability distribution; ad represents the uniform distribution of the author’s set; x represents the author; z represents the topic; w is the term; D represents the document set; Nd represents the number of repeated samples; K represents the number of authors; and T represents the number of topics.

3.2.2.2 User fusion model

After first applying the author-topic model to posts in the five domains to model user interests, we used relational data collected on followers and followings to generate a follower collection model and a following collection model, again with the author-topic model. Finally, the fusion model is obtained by fusing the three models (user model, follower collection model, and following collection model).

The user model of each user is denoted as equation 2:

$$u = \{\mathit{topic}_1: w_1,\ \ldots,\ \mathit{topic}_i: w_i,\ \ldots,\ \mathit{topic}_n: w_n\} \quad \text{(equation 2)}$$

Where $w_1 + \cdots + w_i + \cdots + w_n = 1$, u represents the user model, $\mathit{topic}_i$ represents topic i, and $w_i$ represents the weight of topic i.

With reference to Hannon (2010, 201), the follower collection model and the following collection model are obtained by merging the models of a user’s followers and followings, respectively; they complement the user model according to the following equation:

$$U_{\mathit{fusion}} = \{u: w_{u},\ U_{\mathit{followers}}: w_{\mathit{followers}},\ U_{\mathit{followings}}: w_{\mathit{followings}}\} \quad \text{(equation 3)}$$

Where $w_{u} + w_{\mathit{followers}} + w_{\mathit{followings}} = 1$; $u$ represents the user model; $U_{\mathit{followers}}$ represents the followers collection model; $U_{\mathit{followings}}$ represents the followings collection model; $w_{u}$ represents the weight of the user model; $w_{\mathit{followers}}$ represents the weight of the followers collection model; $w_{\mathit{followings}}$ represents the weight of the followings collection model; and $U_{\mathit{fusion}}$ represents the final user model, also called the user fusion model. $U_{\mathit{followers}}$ and $U_{\mathit{followings}}$ are computed by the following formula:

$$U = \sum_{i=1}^{n} u_i \quad \text{(formula 1)}$$

Where $u_i$ represents user i’s own model, and $U$ represents the model results of the user collection $\{u_1, \ldots, u_i, \ldots, u_n\}$.

3.2.3 User community detection

3.2.3.1 Label propagation algorithm

The label propagation algorithm (Raghavan et al. 2007) is a classical algorithm for finding communities and is widely used in large-scale networks. The algorithm assumes that a node’s label is the one carried by the largest number of its neighbors. Nodes with the same label are grouped into the same community. The steps of the label propagation algorithm are as follows:

1) Initialize the labels of all nodes in the network, and give each node a unique label;
2) Set t=1, with t representing the number of iterations;
3) Randomly arrange nodes in the network, and generate sequence X;
4) According to the order in sequence X, let each node’s label be $\arg\max_{v} |N(v)|$, where $N(v)$ represents the set of neighbors with label v;
5) Stop when every node carries the label held by the largest number of its neighbors; otherwise, set t=t+1 and return to the third step.

Through this repeated process, nodes with the same label are grouped into the same community.

3.2.3.2 Community detection

Modularity is a benefit function that measures the quality of a network’s division into groups or communities (Newman et al. 2004). The formula is:

$$Q = \frac{1}{2m} \sum_{i,j} \left(A_{ij} - P_{ij}\right) \delta(C_i, C_j) \quad \text{(formula 2)}$$

Where $A_{ij}$ represents the adjacency matrix of the network graph; m represents the number of edges of the network graph; and $P_{ij}$ represents the expected number of edges between node i and node j in a null model. Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. If nodes i and j are in the same community, $\delta(C_i, C_j) = 1$; otherwise, $\delta(C_i, C_j) = 0$.

3.2.4 Multi-dimensional information association

We use the Jensen–Shannon (JS) divergence distance to calculate the similarity of topics, based on the “topic-term” matrix obtained from the user model and topic model. The formula of the JS divergence distance is as follows:

$$D_{JS}(P \| Q) = \frac{1}{2} \sum_{i} \left[ P(i) \ln \frac{2P(i)}{P(i) + Q(i)} + Q(i) \ln \frac{2Q(i)}{P(i) + Q(i)} \right] \quad \text{(formula 3)}$$

Where $P(i)$ represents the probability of word $i$ in topic P, and $Q(i)$ represents the probability of word $i$ in topic Q.

We can then calculate the similarity among hierarchy topics, users, and user communities to build a multi-dimensional information organization system.

4.0 Experiment

4.1 Dataset

For this study, we collected 4,966 verified users from Sina Weibo. Microblogs were sourced from Tu et al. (2015). The maximum number of microblogs was set to 500 for each user. In total, there are 1,887,633 posts across the five domains, as elaborated in Table 1.

domain | football | internet | law | literature | medicine | Total
posts | 409,389 | 393,187 | 201,477 | 655,816 | 227,764 | 1,887,633

Table 1. Microblogs distribution across different domains.

In addition, 152,976 follower and following relationships were crawled. As Table 2 shows, we obtain 106,388 pairs of followers and 110,367 pairs of following. There are fewer follower–following pairs than the sum of follower and following pairs, which indicates that some pairs of users follow each other.

Relationship type | Follower | Following | Follower + following
pairs | 106,388 | 110,367 | 152,976

Table 2. Relational data distribution.

4.2 Experimental results analysis

4.2.1 UGC topic hierarchy construction

LDA is used to derive the first topic layer using the open source Gibbs sampling tool. The parameters are set as follows (see Section 3.2.1 for parameter definitions): K = 5, α = 50 / K, and β = 0.01. Document collections for each topic are extracted by the document extraction method. We set the topic probability threshold to 1 / K (0.2) and get five new document collections. Subsequently, we use LDA to derive the second topic layer based on these five new collections of documents. By repeating the last steps, we derive the third topic layer, thus completing the UGC topic hierarchy structure, in which the first, second, and third hierarchical layers comprise five, twenty-five, and 125 topics, respectively. The Appendix presents the ten most-common terms for each topic, together with their probabilities.

As shown in the Appendix, the first-layer topics are “literature,” “law,” “medicine,” “football,” and “internet”; the second-layer topic terms for the first-layer topic “medicine” are “experts,” “daily recuperation,” “surgical,” “female medical treatment,” and “diet”; the third-layer topic terms for the second-layer topic of “female medical treatment” are “pregnancy,” “diseases of affluence,” “treatment,” “nursery,” and “pregnancy test.” The third topic layer is the fine-grained description of the first topic layer, showing the hierarchical relationship and distribution of topics in each level.

4.2.2 User modeling

Each user is modeled based on microblogs and relationships. We applied author-topic modeling (ATM) to formulate the user model using microblogs. The ATM parameters are set as follows (see Section 3.2.2 for parameter definitions): K = 50, α = 50 / K, and β = 0.01. We also generate the follower collection model and following collection model for each user by merging the posts of the user’s followers and the posts of the user’s followings, respectively. Note that the follower model and following model are normalized in this study. Finally, the user fusion model is formed by integrating the user, follower, and following models, respectively weighted 0.6, 0.2, and 0.2.

To demonstrate the process, we choose user “178****763” on Sina Weibo, who has a large number of followers. As Table 3 shows, the high-frequency topics of the user model, follower collection model, and following collection model for this individual are “Topic0,” “Topic19,” “Topic47,” “Topic26,” and “Topic3.” The only difference between the follower model and following model is found in the topics’ weights. Compared to the user model, the fusion model substitutes “Topic36” for “Topic13.” Table 4 details the probability of the terms for each topic in Table 3.

User model | Following model | Follower model | Fusion model
Topic0 0.3527 | Topic0 0.3391 | Topic0 0.2526 | Topic0 0.3300
Topic19 0.1875 | Topic19 0.1339 | Topic19 0.1378 | Topic19 0.1669
Topic47 0.1070 | Topic3 0.0560 | Topic36 0.0752 | Topic47 0.0864
Topic26 0.0970 | Topic47 0.0551 | Topic47 0.0558 | Topic26 0.0772
Topic3 0.0664 | Topic10 0.0481 | Topic3 0.0527 | Topic3 0.0616
Topic18 0.0382 | Topic26 0.0480 | Topic26 0.0471 | Topic18 0.0318
Topic13 0.0259 | Topic36 0.0463 | Topic10 0.0440 | Topic36 0.0261

Table 3. Model results based on user and relationship data.

训练 (Training) | 足球赛事 (Football game) | 人生感悟 (Life thoughts) | 家庭 (Family)
Topic19 | Topic0 | Topic47 | Topic26
足球 (football) 0.0368 | 加油 (fighting) 0.0257 | 人生 (life) 0.0219 | 孩子 (child) 0.0090
比赛 (competition) 0.0205 | 比赛 (competition) 0.0241 | 生活 (living) 0.0207 | 妈妈 (mother) 0.0061
俱乐部 (club) 0.0089 | 足球 (football) 0.0142 | 世界 (world) 0.0133 | 回家 (go home) 0.0047
联赛 (league) 0.0059 | 兄弟 (brother) 0.0126 | 生命 (life) 0.0085 | 儿子 (son) 0.0046
冠军 (champion) 0.0053 | 训练 (training) 0.0112 | 梦想 (dream) 0.0067 | 爸爸 (father) 0.0044
国安 (Guoan) 0.0049 | 球迷 (soccer fans) 0.0108 | 内心 (heart) 0.0058 | 女儿 (daughter) 0.0034
进球 (goal) 0.0048 | | |

工作 (Work) | 公益 (Charity) | 生活记录 (Life record) | 宗教 (Religion)
Topic3 | Topic18 | Topic36 | Topic13
同学 (classmate) 0.0064 | 孩子 (child) 0.0351 | 馋嘴 (glutton) 0.0140 | 佛教 (Buddhism) 0.0108
回家 (go home) 0.0049 | 爱心 (love) 0.0120 | 星座 (constellation) 0.0121 | 菩萨 (bodhisattva) 0.0092
心情 (mood) 0.0040 | 父母 (parent) 0.0096 | 休息 (rest) 0.0113 | 修行 (discipline) 0.0078
上班 (go to work) 0.0037 | 祝福 (blessing) 0.0089 | 熊猫 (panda) 0.0061 | 众生 (beings) 0.0070
吃饭 (eat) 0.0036 | 生命 (life) 0.0074 | 加油 (fighting) 0.0095 | 法师 (master) 0.0070
公司 (company) 0.0029 | 传递 (delivery) 0.0074 | 睡觉 (sleep) 0.0074 | 智慧 (wisdom) 0.0065

Table 4. The terms and probabilities of topics presented in Table 3.

High-frequency topics of the three models are “training,” “football game,” “life thoughts,” “family,” and “work”; in the fusion model, “religion” was replaced by “life record.” The high weight of the topics “training” and “football game” indicates that this user likes football or does work related to football; they are also interested in other topics, like “life thoughts,” “family,” “work,” “charity,” and “life record.”

4.2.3 Community detection

We detected four communities based on the relationship dataset. Figure 3 shows the population distribution of the communities’ members.

As Figure 3 shows, Community 3 has the largest number of members (almost 40% of the total); the other three communities have similarly sized populations, each being around 20%.

As Figure 4 shows, users in Community 4 are mainly related to the football domain, and a few users are related to the other four domains. This indicates that users from different domains are linked. There are significantly more men than women in the “football” community, indicating that male users are more interested in football. Finally, the registration date findings indicate that Weibo’s popularity increased significantly in 2010 and 2011, with a high proportion of registrations in both years.

As Figure 5 shows, members of the “football” community come from different provinces. Besides , Shanghai and , there are many users from other regions like Liaoning, Guangdong, Shandong, , Hubei, etc.
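The community detection step of Section 3.2.3 can be illustrated with a minimal sketch. It is not the authors’ implementation: the function names, the toy network, and the tie-breaking rule (a node keeps its current label when that label is already among the most frequent ones) are assumptions made for illustration. The paper does not specify the null model behind the expected edge term, so the sketch assumes the common configuration-model choice, the product of the two node degrees divided by twice the number of edges.

```python
import random
from collections import Counter


def label_propagation(adj, seed=0, max_iter=100):
    """Asynchronous label propagation on an undirected graph.

    adj maps each node to the set of its neighbours; node ids are assumed
    to be mutually comparable (for example, all integers).
    """
    rng = random.Random(seed)
    labels = {node: node for node in adj}      # step 1: one unique label per node
    for _ in range(max_iter):                  # steps 2 and 5: iterate until stable
        order = list(adj)
        rng.shuffle(order)                     # step 3: random visiting sequence X
        changed = False
        for node in order:                     # step 4: adopt the majority label
            if not adj[node]:
                continue                       # an isolated node keeps its label
            counts = Counter(labels[nb] for nb in adj[node])
            top = max(counts.values())
            best = sorted(lab for lab, c in counts.items() if c == top)
            if labels[node] not in best:       # assumed tie rule: keep label if possible
                labels[node] = rng.choice(best)
                changed = True
        if not changed:                        # every node already holds a majority label
            break
    return labels


def modularity(adj, labels):
    """Q = 1/(2m) * sum_ij (A_ij - k_i * k_j / 2m) * delta(C_i, C_j).

    The term k_i * k_j / 2m is an assumed null model (configuration model).
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())   # each edge counted twice
    q = 0.0
    for i in adj:
        for j in adj:
            if labels[i] != labels[j]:
                continue                              # delta(C_i, C_j) = 0
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m


if __name__ == "__main__":
    # Hypothetical toy network: two separate triangles of mutually linked users.
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
    adj = {n: set() for n in range(6)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    labels = label_propagation(adj)
    print(len(set(labels.values())), round(modularity(adj, labels), 3))
```

On the toy network each triangle collapses to a single label, so two communities are found. Real follower graphs are directed and much larger, so a production run would more likely rely on a graph library than on this quadratic modularity loop.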

Figure 3. Community population distribution.

Figure 4. Domain, Gender, and Registration Date Distribution of Community 4.


Figure 5. Province Map of Community 4.
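As a check on the user modeling step in Section 4.2.2, the fusion weights of 0.6, 0.2, and 0.2 can be recombined with the values reported in Table 3. The sketch below is illustrative only: `fuse` is a hypothetical helper, and only the five topics that appear in all four columns of Table 3 are used, because the table lists just the top-ranked topics of each model.

```python
def fuse(user, follower, following, weights=(0.6, 0.2, 0.2)):
    """Weighted combination of three topic-weight dictionaries (equation 3)."""
    w_user, w_follower, w_following = weights
    topics = set(user) | set(follower) | set(following)
    return {
        t: w_user * user.get(t, 0.0)
        + w_follower * follower.get(t, 0.0)
        + w_following * following.get(t, 0.0)
        for t in topics
    }


# Values copied from Table 3 for user "178****763".
user_model = {"Topic0": 0.3527, "Topic19": 0.1875, "Topic47": 0.1070,
              "Topic26": 0.0970, "Topic3": 0.0664}
following_model = {"Topic0": 0.3391, "Topic19": 0.1339, "Topic47": 0.0551,
                   "Topic26": 0.0480, "Topic3": 0.0560}
follower_model = {"Topic0": 0.2526, "Topic19": 0.1378, "Topic47": 0.0558,
                  "Topic26": 0.0471, "Topic3": 0.0527}
reported_fusion = {"Topic0": 0.3300, "Topic19": 0.1669, "Topic47": 0.0864,
                   "Topic26": 0.0772, "Topic3": 0.0616}

fused = fuse(user_model, follower_model, following_model)
# Largest gap between the recomputed and the reported fusion weights.
max_gap = max(abs(fused[t] - reported_fusion[t]) for t in reported_fusion)
```

For these five topics the recomputed weights agree with the reported fusion column to within rounding of the four-decimal inputs, which is consistent with the stated weighting.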

4.2.4 Multi-dimensional association Table 5 shows the term distribution of topic0 (“foot- ball”), topic4 (“football game”), and topic1 (“domestic In this section, we will show the relationship between com- match”). These topics are closely related to topic0 (“foot- munity topics and hierarchical topic for the “football” ball training”). community. Figure 6 shows the associations among user, For fine-grained detail on the community's interest topics, community, and topic hierarchy. we present the third-layer topic distribution of second- User “178****763” in Sina Weibo is assigned to the layer topic 4 in Table 6. “football” community from the experiment results, so we As Table 6 shows, the third-layer topics of second-layer can recommend the “football” community and other topic4 are “international match,” “domestic match,” “player members to them. Community topic0 is closely related to training,” “football show,” and “person.” We can thus rec- first-layer topic3, which is itself closely related to second- ommend topics in a more refined and accurate way based layer topic4, which is, in turn, closely related to third-layer on the associated community and topic hierarchy. Here, we topic1. can recommend these third-layer topics for the “football”

[Figure 6: user → “football” community → community topics → community topic0 → topic hierarchy structure]

First-layer topics: topic0 (0.4780), topic1 (0.4319), topic2 (0.5578), topic3 (0.6625), topic4 (0.4396)

Second-layer topics: topic0 (0.4682), topic1 (0.5589), topic2 (0.4130), topic3 (0.4439), topic4 (0.7328)

Third-layer topics: topic0 (0.5192), topic1 (0.7676), topic2 (0.5016), topic3 (0.5136), topic4 (0.6551)

Figure 6. Associations among user, community, and topic hierarchy in the “football” community.
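The chain of associations described in this section can be traced with a short sketch. It assumes, as an interpretation rather than something the figure states, that the values printed in Figure 6 are similarity scores and that the most closely related topic at each layer is simply the one with the highest score. The dictionary layout and the function names `closest_topic` and `association_chain` are illustrative only and are not taken from the study.

```python
# Values as printed in Figure 6 for the "football" community.
# Assumption: each value is a similarity score linking a topic to the
# element above it (community topic0 -> first -> second -> third layer).
FIGURE_6_SCORES = {
    "first layer": {"topic0": 0.4780, "topic1": 0.4319, "topic2": 0.5578,
                    "topic3": 0.6625, "topic4": 0.4396},
    "second layer": {"topic0": 0.4682, "topic1": 0.5589, "topic2": 0.4130,
                     "topic3": 0.4439, "topic4": 0.7328},
    "third layer": {"topic0": 0.5192, "topic1": 0.7676, "topic2": 0.5016,
                    "topic3": 0.5136, "topic4": 0.6551},
}


def closest_topic(scores):
    """Return the (topic, score) pair with the highest score."""
    return max(scores.items(), key=lambda item: item[1])


def association_chain(layers):
    """Pick the highest-scoring topic at every layer, in layer order."""
    chain = []
    for layer, scores in layers.items():
        topic, score = closest_topic(scores)
        chain.append((layer, topic, score))
    return chain


if __name__ == "__main__":
    for layer, topic, score in association_chain(FIGURE_6_SCORES):
        print(f"{layer}: {topic} ({score:.4f})")
```

Under these assumptions the sketch selects first-layer topic3, second-layer topic4, and third-layer topic1, which is the chain reported in the text.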

First-layer topic: Topic3 “足球” (football); Second-layer topic: Topic4 “足球赛事” (football game); Third-layer topic: Topic1 “国内赛事” (domestic match)

加油 (fighting) 0.0053 足球 (football) 0.0196 比赛 (competition) 0.0369

足球 (football) 0.0051 比赛 (competition) 0.0191 球迷 (football fan) 0.0211

比赛 (competition) 0.0050 加油 (fighting) 0.0161 足球 (football) 0.0191

回家 (go home) 0.0031 球迷 (football fan) 0.0088 球队 (football team) 0.0102

妈妈 (mother) 0.0027 俱乐部 (football club) 0.0054 广州 (Guangzhou) 0.0073

睡觉 (sleep) 0.0025 球队 (football team) 0.0054 联赛 (league) 0.0069

孩子 (child) 0.0025 训练 (train) 0.0053 北京 (Beijing) 0.0060

球迷 (football fan) 0.0023 球员 (footballer) 0.0048 青岛 (Qingdao) 0.0054

爱心 (love) 0.0019 兄弟 (brother) 0.0047 主场 (home court) 0.0049

兄弟 (brother) 0.0019 体育 (sports) 0.0042 上海 (Shanghai) 0.0048

Table 5. Relevant topics of community topic0.

Topic0 “世界赛事” (international match); Topic1 “国内赛事” (domestic match); Topic2 “球员训练” (player training); Topic3 “足球节目” (football show); Topic4 “风云人物” (person)

比赛 (competition) 0.0167 比赛 (competition) 0.0369 足球 (football) 0.0288 体育 (sports) 0.0108 加油 (fighting) 0.0122

球员 (footballer) 0.0085 球迷 (football fan) 0.0211 教练 (coach) 0.0060 足球 (football) 0.0080 国安 (Guoan) 0.0059

进球 (goal) 0.0076 足球 (football) 0.0191 孩子 (child) 0.0058 比赛 (competition) 0.0059 足球 (football) 0.0032

球队 (football team) 0.0075 球队 (football team) 0.0102 训练 (train) 0.0048 上海 (Shanghai) 0.0046 女足 (women’s soccer) 0.0025

西班牙 (Spain) 0.0061 广州 (Guangzhou) 0.0073 运动 (sport) 0.0039 球迷 (football fan) 0.0045 青岛 (Qingdao) 0.0023

冠军 (champion) 0.0055 联赛 (league) 0.0069 球员 (footballer) 0.0038 直播 (broadcasting) 0.0041 王永珀 (Wang Yongbo) 0.0022

皇马 (Real Madrid) 0.0054 北京 (Beijing) 0.0060 俱乐部 (football club) 0.0036 参加 (participate) 0.0038 李帅 (Li Shuai) 0.0017

巴萨 (Barça) 0.0049 青岛 (Qingdao) 0.0054 比赛 (competition) 0.0035 俱乐部 (football club) 0.0037 王晓龙 (Wang Xiaolong) 0.0016

意大利 (Italy) 0.0048 主场 (home court) 0.0049 运动员 (athlete) 0.0032 现场 (scene) 0.0031 邵佳一 (Shao Jiayi) 0.0016

决赛 (final) 0.0047 上海 (Shanghai) 0.0048 体育 (sports) 0.0032 节目 (show) 0.0028 徐云龙 (Xu Yunlong) 0.0015

Table 6. The third-layer topic distribution of second-layer Topic4.

community. In addition, other second-layer topics can be recommended to the community to help users access more relevant information. In short, users can easily obtain a comprehensive, in-depth picture of their topics of interest.

4.2.5 Comparisons between different organization patterns

There are three basic elements for information organization in this paper: topic hierarchy, user, and community. As Table 7 shows, we analyze different combinations of the main elements with respect to their information organization methods and associated advantages and disadvantages. The table also provides examples of relevant social media platforms.

As described in Table 7, for organization models using only one of the three elements (user, hierarchical topic, or community), it is more costly for users to access information from the other two sources. For models using two elements, shortcomings might arise when information organization needs to build on the missing element. For instance, user interaction is not sufficiently

Columns: Basic Elements; Organization Model; Advantages; Disadvantages; Social Media Cases

Basic elements: User
Organization model: Users in a friend relationship on social media can send all types of messages and be fully connected.
Advantages: Information is usually shared through a one-to-one connection and with known friends, which guarantees privacy.
Disadvantages: Obtained information resources are limited to the range of friends.
Social media cases: Communication software, e.g., Tencent QQ, Messenger

Basic elements: Hierarchical Topic
Organization model: Information content is based on entries and is organized in a hierarchical manner according to a certain classification system or topic.
Advantages: Information organized in a hierarchical topic structure can help users to quickly search for and find needed knowledge.
Disadvantages: Classification system is not clear enough; the data need to be maintained and updated in real time.
Social media cases: Internet encyclopedia projects, e.g., Wikipedia, Baidu Encyclopedia

Basic elements: Community
Organization model: Users with the same information needs are integrated in the same virtual space and communicate with one another within this community.
Advantages: Users within the same community can quickly share information.
Disadvantages: Accessible information resources are limited by community themes.
Social media cases: Online communities, e.g., Google Groups

Basic elements: User + Hierarchical Topic
Organization model: Users who seek specific information will consult their friends or look for relevant information based on the topic.
Advantages: Interaction between users and information acquisition can be efficient.
Disadvantages: User interaction is not sufficiently strong.
Social media cases: Socialized question and answer platforms, e.g., Zhihu, Stack Overflow

Basic elements: User + Community
Organization model: Users make friends gradually and communities are basically groups of users with a common interest or goal, who are also highly likely to be friends.
Advantages: Convenient for users to effectively achieve their social goals and maintain personal connections.
Disadvantages: Community construction is affected by users’ social activities over social media.
Social media cases: Business social platforms, e.g., LinkedIn, Dajie Network

Basic elements: Community + Hierarchical Topic
Organization model: Users within each community have highly personalized interactions; different communities have different topics that can be organized in a hierarchical structure.
Advantages: Users in the same community can share information in real time and find corresponding sub-communities according to their own information needs.
Disadvantages: Information sharing within social media communities is mainly in the form of text and pictures.
Social media cases: Web forums, e.g., Tianya

Basic elements: User + Hierarchical Topic + Community
Organization model: Users can obtain information through friendships, communities to which they belong, and hierarchical topics on social media.
Advantages: Users can engage in social behaviors, access rich information sources, and obtain information effectively from different sources.
Disadvantages: There might be information security risks.
Social media cases: Social networking platforms, e.g., Facebook, Douban

Table 7. Comparisons between different information organization patterns.

strong when community information is not considered. Therefore, the most efficient solution is to incorporate all three elements in the information organization model, enabling users to obtain information via friends, communities to which they belong, and hierarchical topics on social media.

5.0 Conclusion

We have described our investigation into how to organize information on social networks in a user-centered way to meet personalized needs. Our proposed method linked UGCs, users, and user communities in a multi-dimensional framework.

First, we constructed a three-layer topic hierarchy based on UGCs. Second, we developed a user interests model using UGCs and relationship data, fusing the user, follower, and following models. Third, inspired by previous work on community detection, we proposed a new approach for detecting the topics of each community. Finally, we derived a multi-dimensional information organization pattern through similarities among three dimensions: the UGC topic hierarchy, user interest model, and user communities.

Our results show that the topic hierarchy is effective in providing supplementary and recommended information. Problems of sparse data in user modeling can be partly solved by integrating it with the relational model. We can also help users find communities of interest using community detection. As the user modeling in this paper is not evaluated, we propose to conduct a follow-up study, in which the evaluation will involve both expert scoring and user assessment, such as user satisfaction with information recommended through platforms.


Appendix: The sample of UGC topic hierarchy construction

The first layer

“文学”(literature) “法律”(law) “医疗”(medical) “足球”(football) “互联网”(internet)

Topic0:0.2707 Topic1:0.1894 Topic2:0.1238 Topic3:0.2557 Topic4:0.1604

人生(life) 0.0049 律师(lawyer) 0.0203 孩子(child) 0.0118 加油(fighting) 0.0053 手机(phone) 0.0076

生活(live) 0.0048 法律(law) 0.0049 医院(hospital) 0.0111 足球(football) 0.0051 活动(activity) 0.0053

作家(writer) 0.0041 社会(society) 0.0049 治疗(treat) 0.0110 比赛(competition) 0.0050 公司(company) 0.0052

故事(story) 0.0038 国家(country) 0.0044 宝宝(baby) 0.0080 回家(back home) 0.0031 发布(release) 0.0044

小说(fiction) 0.0034 美国(America) 0.0040 患者(sufferer) 0.0075 妈妈(mother) 0.0027 体验(experience) 0.0041

电影(film) 0.0034 政府(government) 0.0035 手术(operation) 0.0058 上海(Shanghai) 0.0026 产品(product) 0.0034

作品(works) 0.0030 新闻(news) 0.0031 检查(examination) 0.0050 睡觉(sleep) 0.0025 升级(upgrade) 0.0032

作者(author) 0.0026 媒体(media) 0.0026 女性(female) 0.0038 孩子(child) 0.0025 微信(wechat) 0.0032

出版(publish) 0.0020 法院(court) 0.0021 门诊(clinic) 0.0038 球迷(football fans) 0.0023 互联网(internet) 0.0031

文学(literature) 0.0019 法官(judge) 0.0018 病人(patient) 0.0035 馋嘴(greedy) 0.0021 功能(function) 0.0030

The second layer

“专家”(expert) “日常休养”(daily maintenance) “外科”(surgery) “女性医疗”(women’s health care) “饮食”(diet)

Topic0: 0.2269 Topic1: 0.2759 Topic2: 0.1451 Topic3: 0.1894 Topic4: 0.1627

医院(hospital) 0.0219 宝宝(baby) 0.0089 治疗(treat) 0.0180 治疗(treat) 0.0213 宝宝(baby) 0.0149

北京(Beijing) 0.0082 妈妈(mother) 0.0063 效果(effect) 0.0058 检查(check) 0.0130 食物(food) 0.0126

患者(sufferer) 0.0079 运动(sport) 0.0055 皮肤(skin) 0.0055 患者(sufferer) 0.0124 维生素(vitamin) 0.0060

病人(patient) 0.0070 生活(live) 0.0050 疼痛(pain) 0.0053 医院(hospital) 0.0093 饮食(diet) 0.0060

医疗(medical) 0.0061 身体(body) 0.0049 患者(sufferer) 0.0053 女性(female) 0.0089 营养(nutrition) 0.0054

大夫(doctor) 0.0050 家长(patriarch) 0.0040 眼睛(eye) 0.0045 疾病(disease) 0.0077 食品(food) 0.0042

教授(professor) 0.0049 父母(parents) 0.0039 针灸(acupuncture) 0.0038 子宫(womb) 0.0076 水果(fruit) 0.0039

专家(expert) 0.0047 慰问(condole) 0.0032 中药(Chinese herb) 0.0033 症状(symptom) 0.0075 母乳(breast milk) 0.0038

门诊(clinic) 0.0041 生病(ill) 0.0030 脱发(alopecia) 0.0030 手术(operation) 0.0073 牛奶(milk) 0.0032

协和(Concord hospital) 0.0041 睡眠(sleep) 0.0029 按摩(massage) 0.0030 孩子(child) 0.0072 蔬菜(vegetable) 0.0031

The third layer

“备孕”(pregnancy) “富贵病”(affluenza) “就诊”(vis.) “育婴”(infant-raising) “孕检”(pregnancy test)

Topic0: 0.1594 Topic1: 0.1974 Topic2: 0.2447 Topic3: 0.1993 Topic4: 0.1992

女性(female) 0.0294 糖尿病(diabetes) 0.0116 治疗(treat) 0.0311 治疗(treat) 0.0225 胎儿(fetus) 0.0218

子宫(womb) 0.0255 疾病(disease) 0.0113 患者(sufferer) 0.0284 孩子(child) 0.0171 宝宝(baby) 0.0217

治疗(treat) 0.0156 高血压(hypertension) 0.01 门诊(clinic) 0.0235 感染(infect) 0.0152 孩子(child) 0.0178

医院(hospital) 0.0151 患者(sufferer) 0.0084 手术(operation) 0.0204 药物(medicine) 0.0120 孕妇(gravida) 0.0160

月经(menstruation) 0.0134 治疗(treat) 0.0079 检查(check) 0.0125 症状(symptom) 0.0117 怀孕(pregnant) 0.0158

宫颈(cervix) 0.0119 饮食(diet) 0.0079 加号(plus) 0.0121 感冒(cold) 0.0110 检查(check) 0.0156

肌瘤(myoma) 0.0118 因素(factor) 0.0073 申请(apply) 0.0111 咳嗽(cough) 0.0098 发育(growth) 0.0137

检查(check) 0.0115 控制(control) 0.0072 医院(hospital) 0.0100 宝宝(baby) 0.0097 分娩(childbirth) 0.0095

症状(symptom) 0.0112 血压(blood pressure) 0.0067 病人(patient) 0.0096 医院(hospital) 0.0089 妊娠(gestation) 0.0089

不孕(sterility) 0.0093 预防(prevent) 0.0058 诊断(diagnose) 0.0076 疫苗(vaccine) 0.0080 出生(birth) 0.0059

104 Knowl. Org. 46(2019)No.2 K. Golub. Automatic Subject Indexing of Text

Automatic Subject Indexing of Text†

Koraljka Golub

Linnaeus University, School of Cultural Sciences, Department of Library and Information Science, Faculty of Arts and Humanities, 351 95 Växjö, Sweden

Koraljka Golub is an associate professor in library and information science at Linnaeus University, Sweden. Her research interests focus on knowledge organization, primarily in the context of information retrieval. Research projects she has worked on have explored the potential of social tagging when enhanced by suggestions from controlled vocabularies, automatic subject indexing, and evaluation of subject indexing in the context of retrieval. She would like to examine to what degree automatic full-text indexing, end-user tagging, author tagging, professional subject indexing, and automatic assigned indexing, or any combination thereof, contribute to successful retrieval.

Golub, Koraljka. 2019. “Automatic Subject Indexing of Text.” Knowledge Organization 46(2): 104-121. 126 references. DOI:10.5771/0943-7444-2019-2-104.

Abstract: Automatic subject indexing addresses problems of scale and sustainability and can be at the same time used to enrich existing metadata records, establish more connections across and between resources from various metadata and resource collections, and enhance consistency of the metadata. In this work, automatic subject indexing focuses on assigning index terms or classes from established knowledge organization systems (KOSs) for subject indexing like thesauri, subject headings systems and classification systems. The following major approaches are discussed, in terms of their similarities and differences, advantages and disadvantages for automatic assigned indexing from KOSs: “text categorization,” “document clustering,” and “document classification.” Text categorization is perhaps the most widespread, machine-learning approach with what seems generally good reported performance. Document clustering automatically both creates groups of related documents and extracts names of subjects depicting the group at hand. Document classification re-uses the intellectual effort invested into creating a KOS for subject indexing and even simple string-matching algorithms have been reported to achieve good results, because one concept can be described using a number of different terms, including equivalent, related, narrower and broader terms. Finally, applicability of automatic subject indexing to operative information systems and challenges of evaluation are outlined, suggesting the need for more research.

Received: 27 September 2018; Revised: 31 October 2018; Accepted: 7 December 2018

Keywords: indexing, subject, terms, document, documents, automatic, classification

† Many thanks to Birger Hjørland and two anonymous reviewers who kindly provided detailed feedback that helped improve this article.

1.0 Introduction

Increasingly, different types of information resources are being made available online. Current search engines yield good results for specific search tasks but are unsuited to the conceptual or subject-based searches requiring high precision and recall, common in academic research or serious public inquiry (for a discussion on (dis)advantages of automatic full-text indexing, see Keyser 2012, chapter 2). Differences in terminology between various communities and even individuals lead to the fact that literal string search in many cases cannot deliver effective search. This is exacerbated in cross-system and cross-lingual search and retrieval where integrated subject access is probably the hardest challenge to address. Subject index terms taken from knowledge organization systems (KOSs) such as thesauri, subject headings systems and classification systems provide numerous benefits compared to the free-text indexing of commercial search engines: consistency through uniformity in term format and assignment of terms, provision of semantic relationships among terms, and support for browsing through consistent and clear hierarchies (see Mazzocchi 2018).

However, such subject index terms require substantial resources to produce. Because of the ever-increasing number of documents, there is a risk that recognized objectives of bibliographic systems, such as finding all documents on a given subject, would get left behind. As an example, a recent exploratory study of Swedish library catalogs indicates that subject access is not addressed systematically, that in new digital collections KOSs are applied to a very limited degree, and in integrated library and commercial databases the mappings between the different KOSs do not exist, therefore preventing quality search across them (Golub 2016). Automatic means could be a solution to preserve recognized objectives of bibliographic systems

(Svenonius 2000, 30). Apart from addressing problems of scale and sustainability, automatic subject indexing can be used to enrich existing bibliographic records, establish more connections across and between resources, and enhance consistency of bibliographic data (Golub et al. 2016). Further, automatic indexing is used today in a wide variety of applications such as topical harvesting, personalized routing of news articles, ranking of search engine results, sentiment analysis (see, e.g., Hu and Li 2011) and many others (Sebastiani 2002).

Research on automatic subject indexing began with the availability of electronic text in the 1950s (Luhn 1957; Baxendale 1958; Maron 1961) and continues to be a challenging topic, for the reasons and purposes outlined above. For a historical overview of automatic indexing, see Stevens (1965) and Sparck Jones (1974) covering the early period of automatic indexing and Lancaster (2003, 289-292) for the later one. A related term is machine-aided indexing (MAI) or computer-assisted indexing (CAI) where it is the human indexer who decides, based on a suggestion provided by the computer (see, for example, Medical Text Indexer (U.S. National Library of Medicine, 2016)). A similar approach is applied by Martinez-Alvarez, Yahyaei, and Roelleke (2012) who propose a semi-automatic approach in which only those predictions likely to be correct are processed automatically, while more complex decisions are left to human experts to decide.

There are different approaches to automatic indexing, based on the purpose of application but also coming from different research fields and traditions. The terminology is, therefore, varied. Further, research of automatic indexing tools in operating information environments is usually conducted in laboratory conditions, excluding the complexities of real-life systems and situations. The remainder of this entry reflects upon these issues and is structured as follows: the next section (2) discusses major terms and provides a definition of automatic subject indexing as used for the purposes of this work. Section 3 discusses approaches to automatic subject indexing as to their major similarities and differences. Section 4 contains a discussion on how good the addressed automatic solutions are today, and Section 5 contains concluding remarks.

2.0 Definition and terminology

According to the current ISO indexing standard (ISO 5963:1985, confirmed in 2008, International Organization for Standardization 1985), subject indexing performed by the information professional is defined as a process involving three steps: 1) determining the subject content of a document; 2) a conceptual analysis to decide which aspects of the content should be represented; and, 3) translation of those concepts or aspects into a controlled vocabulary (CV). Automatic subject indexing is, then, a machine-based subject indexing where human intellectual processes of the above three steps are replaced by, for example, statistical and computational linguistics techniques, which will be discussed in further detail below.

The terminology related to automatic subject indexing is inconsistently used in the literature. This is probably because this research topic has been addressed by different research fields and disciplines, grounded in various epistemological traditions. In order to clarify the differences, major terms used are briefly discussed and defined below.

In information science, the terminology of subject indexing involves several important concepts. Subject index terms may be derived either from the document itself, which is known as derived indexing (e.g., keywords taken from title), or from indexing languages that are formalized and specifically designed for describing the subject content of documents, which is known as assigned indexing or classification. In assigned indexing, index terms are taken from alphabetical indexing languages (using natural language terms with terminology control such as thesauri and subject headings); in classification, classes are taken from classification systems (using symbols, operating with concepts). The main purpose of assigned indexing using alphabetical indexing languages is to allow retrieval of a document from many different perspectives; typically, three to twenty elemental or moderately pre-combined subject terms are assigned. The main purpose of classification, assigning classes from classification schemes, is to group similar documents together to allow browsing (of library shelves in the traditional environment and directory-style browsing in the online environment); a few, typically one, highly pre-combined subject class(es) are assigned. (See also Lancaster (2003, 20-21) concerning the similarities between indexing and classification).

In computer science, the distinction between different types of indexing languages is rarely made. While a common distinction made is the one between formal ontologies, light ontologies (with concepts connected using general associative relations rather than strict formal ones typical of the former) and taxonomies, at times the term ontology is used to refer to several different knowledge organization systems. For example, Mladenić and Grobelnik use the term to refer to hierarchical web directories of search engines and related services as well as subject headings systems (2005, 279):

“Most of the existing ontologies were developed with considerable human efforts. Examples are Yahoo! and DMOZ topic ontologies containing Web pages or MESH ontology of medical terms connected to Medline collection of medical papers.”

Also, derived indexing may be variously termed, for example keyword assignment, keyword extraction, or noun phrase extraction (referring to noun phrases specifically).

In related literature, other terms for automatic subject indexing are used. Subject metadata generation is one general example. The terms text categorization and text classification are common in the machine learning community. Automatic classification is another example of a term used to denote automatic assignment of a class or a category from a pre-existing classification system or taxonomy. However, this phrase may also be used to refer to document clustering, in which groups of similar documents are automatically discovered and named.

Here the term automatic subject indexing is used as the primary term. It denotes non-intellectual, machine-based processes of subject indexing as defined by the information science community: derived and assigned indexing using both alphabetical and classification indexing systems for the purposes of improved information retrieval. The rationale for combining them into one entry is the fact that the underlying machine-based principles are rather similar, especially when it comes to application to textual documents. However, the major focus in this entry is on assigned indexing because of the added value provided by indexing systems for information searching as perceived in information science, such as increased precision and recall ensuing from natural language control of, e.g., homonymy, synonymy, word form, and advantages for hierarchical browsing, e.g., when the end-user does not know which search term to use because of unfamiliarity with the topic or when not looking for a specific item. Further, the term subject indexing assumes applying both alphabetical and classification indexing systems, because similar principles apply when it comes to automatic processes; although, it is also common to refer to the process of using the former as subject indexing and the latter as subject classification. Finally, while the word automated more directly implies that the process is machine-based, the word automatic is more commonly used in related literature and has, therefore, become the term of choice here, too.

Further, terminology to distinguish between different approaches to automatic subject indexing is even less consistent (see also Smiraglia and Cai 2017). For example, Hartigan (1996, 2) writes: “The term cluster analysis is used most commonly to describe the work in this book, but I much prefer the term classification.” Or: “classification or categorization is the task of assigning objects from a universe to two or more classes or categories” (Manning and Schütze 1999, 575). In this entry, the terms text categorization and document clustering are chosen, because they tend to be the prevalent terms in the literature of the corresponding communities. The term document classification is used in order to consistently distinguish between the three approaches. These approaches are described and discussed in the following section.

3.0 Approaches to automatic subject indexing

This section first describes the underlying methodology common to the different specific approaches (Section 3.1). Section 3.2 provides a brief overview of addressing various document types. Section 3.3 discusses the major approaches: text categorization, document clustering, and document classification.

3.1 Basic approach

Generally speaking, automatic subject indexing typically follows a course of several major steps. The first one is a preparation step in which documents to be indexed are each processed in order to create suitable representations for computer manipulation in what follows. This process is comparable to preparation of documents for information retrieval.

3.1.1 Pre-processing

A list of words appearing in the document is created based on tokenization, the process of automatically recognizing words. Also, all punctuation is taken away. Further, words that tend to carry less meaning are taken out, such as conjunctions, determiners, prepositions, and pronouns, all of which are known as stop-words. This resulting representation of documents is known as a bag-of-words model. A more advanced representation is the n-gram model of words, which is used, for example, when noun phrases need to be extracted in derived indexing or when string matching is conducted against terms comprising more than just one word (see Section 3.3.3, Document classification, below). Word n-grams may be unigrams (individual words), bigrams (any two adjacent words), trigrams (any three adjacent words), etc. Further, more advanced natural language processing techniques may be performed; in stemming, each word is reduced to its stem, which means removal of its affixes. For example, illegally may be reduced to its stem legal, whereby its prefix il- and its suffix -ly are removed. The rationale behind this is that words with the same stem bear the same meaning. In addition, part-of-speech taggers and syntactic parsers can also be applied. For an overview of text processing, see Manning and Schütze (1999) and Weisser (2015).
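The pre-processing steps described in Section 3.1.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the stop-word list is a deliberately tiny, hypothetical one, the function names are invented for this sketch, and stemming is left out. Whether n-grams are built before or after stop-word removal is a design choice; here they are built afterwards.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "it", "for"}


def tokenize(text):
    """Lower-case the text and return its word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple (n=1: unigrams, n=2: bigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "Automatic indexing of the documents is comparable to information retrieval."
tokens = tokenize(text)
content_words = remove_stop_words(tokens)   # the bag-of-words input
bigrams = ngrams(content_words, 2)          # candidate two-word phrases
```

The list `content_words` corresponds to the bag-of-words representation, while `bigrams` gives the adjacent word pairs that could later be matched against multi-word index terms.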

3.1.2 Term weighting

The following major step is determining the importance of each term for describing the aboutness of the document at hand. The term can be either an individual word or a compound phrase, depending on the given task. For each term, a weight expressed as a number is calculated and assigned. Here different statistical and other heuristic rules can be applied. As an example of a statistical rule, words that appear very many times both in the document at hand and in all other documents in the collection are probably not particularly indicative of the subject matter of the document, and vice versa. This is known as term frequency-inverse document frequency weight (tf-idf, Salton and McGill 1983, 63, 205): it combines 1) term frequency (Luhn 1957), where the weight of the term at hand is considered to be proportional to the number of times it appears in the document, with 2) inverse document frequency (Sparck Jones 1972), where the weight of the term is an inverse fraction of the documents that contain the word. An overview of term weighting measures can be found in Roelleke (2013).

Features such as the location of the term, or the font size or font type, may also be included in determining the importance of a term. In web pages, for example, words that appear in titles, headings or metadata may be considered more indicative of the topicality than those written in normal font size elsewhere. A known example is Google, which owes much of its success to the PageRank algorithm (Page et al. 1998) that ranks higher those web pages which have more external web pages linking to them. Gil-Leiva (2017) pointed out that generally there is less use of location heuristics rules than of statistical rules (outlined in the previous paragraph) and conducted an experiment comparing the two sets of rules, which showed that best results are achieved with location heuristics rules. A number of other principles have also been investigated. A co-occurrence, or citation-based, principle applies the idea that if publication A cites publication B, A may include text that indicates what B is about (Bradshaw and Hammond 1999). Chung, Miksa, and Hastings (2010) compared how sources that human indexers normally resort to in order to determine the subject of the document at hand, such as conclusion, abstract, introduction, title, full text, cited works, and keywords of scientific articles, contribute to automatic indexing performance. Using the SVM implementation in Weka (Witten and Frank 2000), they gained results that indicated keywords outperformed full text, while cited works, source title (title of journal or conference), and title were all as effective as the full text.

Rules can be of different types. Driscoll et al. (1991) matched the document text against over 3,000 phrases and a set of deletion and insertion rules. These rules were used to transform the list of terms from the document to the list of index phrases; for example, if “time,” “over,” and “target” appeared within a certain number of words from each other, an index phrase “air warfare” would be generated. Fuhr and Knorz (1984) created about 150,000 rules for matching physics documents to KOS terms. Jones and Bell (1992) extracted index terms based on matching terms from the document against several lists: a stop-word list, a list of terms of interest, a list to aid in the disambiguation of homographs, a list to conflate singular and plural forms, and a list of word endings to allow simple parsing. Ruiz, Aronson, and Hlava (2008) claim that rule-based approaches dominated in the 1970s and 1980s and that machine learning or statistical approaches picked up in the 1990s. Rule-based approaches are based on manually created rules, while in machine learning sets of examples are required for training the algorithm to learn concepts. Hlava (2009) describes rule-based indexing as better and states that the majority of rules are simple and can be automatically created, while complex rules are added by editors. On the other hand, in the domain of medical documents, Humphrey et al. (2009) compared a rule-based and a statistical approach and showed that the latter outperformed the former. Approaches combining the best of the two worlds may be superior.

3.1.3 Further representations

Based on the two aforementioned major commonly applied processes, each original document is now transformed into a list of (stemmed, parsed) terms and their assigned term weights. There seem to be two possible ways to continue from here: a) vector representation; or, b) string matching.

a. Vector representation is the dominant approach in which the result of the first two steps is now transformed into vectors in a vector space of terms. In this vector space, each term with its weight is represented as one dimension in that space (term space). When features like location are added, each feature becomes a dimension in the vector space called feature space which could then contain the term space. Many terms and features will lead to the challenge of high dimensionality; research has been suggesting dimensionality reduction methods such as: choosing only terms with highest weights, selecting clusters of closest terms instead of terms, taking only parts of documents like summaries or web page snippets. Vector space representation allows for advanced mathematical manipulations beyond what would be possible with just strings of text.
b. Less commonly applied is a string-matching approach between terms from the document and terms describing concepts from an indexing language.
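The tf-idf weighting described in Section 3.1.2 can be illustrated with a small worked sketch. Many tf-idf variants exist; the one below assumes the raw term count multiplied by log(N / df), where N is the number of documents and df the number of documents containing the term. That formula, the function name `tf_idf`, and the toy documents are assumptions made for illustration, not the formulation of any cited source.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy collection of already pre-processed documents (hypothetical data).
docs = [
    ["subject", "indexing", "thesaurus", "indexing"],
    ["subject", "classification", "library"],
    ["subject", "clustering", "vectors"],
]
weights = tf_idf(docs)
best_term = max(weights[0], key=weights[0].get)
```

In this toy collection "subject" occurs in every document, so its weight is zero, whereas "indexing" occurs twice in the first document and nowhere else and therefore receives the highest weight there. Each resulting dictionary can be read as a sparse vector in the term space described in Section 3.1.3.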

In assigned automatic indexing, a parallel process is taking place to represent target index terms (e.g., classes from a classification system, descriptors from a thesaurus). For example, in subject indexing languages such as thesauri, one concept can be represented by a certain number of synonymous terms, related terms, narrower and broader terms. Or, each concept can be represented by terms extracted from documents that have been manually indexed by the term representing that concept. These representations need to be transformed into vectors when the documents are represented as vectors in order to allow the comparison.

3.1.4 Assignment of index terms

In this final step, either a) vector-based comparisons and calculations (when vectors are used), or b) string matching between terms from the documents and terms representing target index terms are conducted. Usually a list of candidate terms is the first result, from which the best candidates are then selected, also applying various statistical and heuristic rules. One example is to assign the candidate term if it is among the top five and appears in the title of the document, or, more simply, select the top, say, three candidates with the highest weight.

As seen from the four steps above, the dominant basic approach takes into account only terms, rather than concepts or semantic relationships between terms. Taking advantage of relationships in indexing languages like thesauri and ontologies to identify concepts is another possibility (see Section 3.3). Also, there are examples that try to approach this problem in other ways; e.g., Huang et al. (2012), who experimented with a measure for identifying concepts by first mapping words from documents to concepts from Wikipedia and WordNet.

Apart from using KOS, other approaches have been suggested. In latent semantic indexing (LSI), perhaps the best-known example, it is assumed that terms that are used in semantically related documents tend to have similar meanings. Based on this assumption, associations between terms that occur in similar documents are calculated, and then concepts for those documents extracted. LSI was first applied in information retrieval for comparing search query terms to documents, at the conceptual rather than literal level (Deerwester et al. 1988; Meng, Lin, and Yu 2011). LSI has been further developed into related approaches, such as probabilistic LSI (pLSI) (Hofmann 2001) and Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003). Statistical approaches also try to identify concepts, in particular the ones based on the distributional hypothesis (Harris 1954). According to the hypothesis, words that appear in the same contexts tend to have similar meanings. This has been applied in word2vec models (Mikolov et al. 2013; Goldberg and Levy 2014), which apply neural networks to reconstruct contexts of words. Each unique word is assigned a vector and positioned close to vectors representing words which often appear in similar contexts.

3.2 Document types

While this encyclopedia entry focuses on automatic subject indexing of textual documents, automatic indexing of non-textual or heterogeneous documents shares principles basic to those presented here. For example, multimedia documents like images, sound and video could also be represented by vectors and processed similarly. However, how exactly features of multimedia such as shapes and color distribution need to be selected and processed is beyond the scope of this entry. For automatic indexing of non-textual resources, the readers may want to refer to Rasmussen Neal (2012).

Another common document type today is data, where automatic categorization is typically applied for prediction purposes (e.g., weather forecast, medical diagnosis, marketing) as opposed to our context of description. Still, many of the principles are similar to ours. For further information, please refer to Kelleher, Mac Namee, and D’Arcy (2015).

When it comes to textual documents, there also are many different sub-types, and while the basic approach described above tends to be applied in most cases, special challenges may arise, as well as special features that could be beneficially explored. For example, web documents have specific characteristics such as hyperlinks and anchors, metadata, and structural information, all of which could serve as complementary features to improve automatic classification. In addition, geographic location, use profiles, citation, and linking, like in PageRank mentioned previously, may be utilized. On the other hand, they are rather heterogeneous; many of them contain little text, metadata provided are sparse and can be misused, structural tags can be misapplied, and titles can be general (“home page,” “untitled document”) (see, e.g., Gövert, Lalmas, and Fuhr 1999; Golub and Ardö 2005; Klassen and Paturi 2010). Apart from web pages, the following is a non-exhaustive list of textual document examples where research in automatic indexing has been conducted (in no particular order): archival records (e.g., Sousa 2014), doctoral theses (e.g., Hamm and Schneider 2015), clinical medical documents (e.g., Stanfill et al. 2010), e-government (e.g., Svarre and Lykke 2013), business information (Flett and Laurie 2012), online discussions (e.g., Mu et al. 2012), parliamentary resolutions (De Campos and Romero 2008), political texts on the web (Dehghani et al. 2015), grey literature (e.g., Mynarz and Skuta 2011), written documents from businesses like invoices, reminders, and account statements (e.g., Esser et al.
2012), legal documents for litigation (e.g., Roitblat, Kershaw, and Oot 2010), documents from construction industry such as meeting minutes, claims, and correspondences (e.g., Mahfouz 2012), and documents related to research data such as questionnaires and case studies (El-Haj et al. 2013).

3.3 Approaches to automatic subject indexing

As described in Section 3.1 above, methods to automatically index or classify are, at their foundational level, effectively the same: applying heuristic principles to computationally determine the subject of a document, and then assigning an appropriate index term based on that. Approaches and differences between them may be grouped based on various criteria, and still the distinction will not always be clear-cut. The criteria followed here are based on the general context set out for this entry, that is, assigned subject indexing for purposes of information retrieval. The criteria are: a) application purposes; b) a more-or-less coherent body of published research following the approach; and, c) general approach: supervised learning, unsupervised learning, or string matching. The division that follows is also in line with previously published bibliometric analysis of the identified approaches (Golub and Larsen 2005) and a discussion on the same approaches as applied to web pages (Golub 2006b). Each approach is described via its definition, differences within the approach, application, and evaluation.

3.3.1 Text categorization

Text categorization or text classification are two terms that most often refer to automatic indexing of textual documents where both manually (intellectually) assigned documents and the target KOS exist. This is a machine-learning approach employing supervised learning whereby the algorithm “learns” about characteristics of target index terms based on characteristics of documents that had been manually pre-assigned those index terms. One of the commonly used characteristics is word frequency; for example, words that often occur in documents assigned to the same index term as opposed to those that occur in documents assigned to other index terms.

The process comprises three major steps. First, a collection of documents manually (intellectually) indexed using a pre-defined KOS is chosen or created for the text categorization process. The documents in this collection are called training documents. In the second step, for each category a classifier is built, most often using the vector-space model. The classifiers are tested with a new set of documents from the collection; these are called test documents. Finally, the third step is the actual categorization where the classifier is applied to new documents.

The literature reports on a range of different ways to build classifiers, for example support vector machines (SVM) (e.g., Lee et al. 2012), artificial neural networks (e.g., Ghiassi et al. 2012), random forest learning (Klassen and Paturi 2010), adaptive boosting (AdaBoost) (Freund and Schapire 1997), to name a few considered to be state-of-the-art today. For an overview of different classifiers, see Mitchell 1997; for comparisons between them, see Yang (1999) and Sebastiani (2002). Also, two or more different classifiers and ways to build them can be combined to make a classification decision; these are known as classifier committees or metaclassifiers (e.g., Liere and Tadepalli 1998; Wan et al. 2012; Miao et al. 2012).

Text categorization approaches can be divided into hard and soft; in hard, a decision is made as to whether the document does or does not belong to a category; in soft, a ranked list of candidate categories is created for each document and one or more of the top-ranked are chosen as the appropriate categories (Sebastiani 2002). The soft approach better reflects reality (cf. Section 4 where aboutness is discussed).

Text categorization has been applied to KOSs that incorporate hierarchies of concepts, such as Wikipedia, Open Directory Project, and Yahoo’s Directory (for an overview, see, e.g., Ceci and Malerba 2007 and a workshop by Kosmopoulos et al. 2010). When compared to a flat approach, many have reported that including features based on the hierarchy structure in the classifier improves classification accuracy (e.g., McCallum et al. 1998; Ruiz and Srinivasan 1999; Dumais and Chen 2000). Li, Yang, and Park (2012) combined text categorization algorithms with WordNet and an automatically-constructed thesaurus and gained high effectiveness as measured by precision, recall, and F-measures (see below). Maghsoodi and Homayounpour (2011) have extended the feature vector of the SVM classifier by Wikipedia concepts and gained improved results (for the Farsi language). This is in line with research in document classification (see Section 3.3) where other features from existing KOSs have been used to improve the algorithm results.

Examples of test collections specially designed for use in text categorization include Reuters newswire stories (e.g., Reuters-21578), OHSUMED with metadata from MEDLINE, and WebKB for web pages, to name a few. However, for many document collections, there will be no training documents available to train and test the classifier. If there are no resources or possibilities to create one manually, approaches like semi-supervised learning and unsupervised learning can be adopted instead. For an overview of semi-supervised learning, see Mladenić and Grobelnik (2014). Unsupervised learning is basically document clustering described in the following section.
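The supervised process outlined in Section 3.3.1 (training documents, one representation per category, then categorization of new documents) can be sketched with a nearest-centroid classifier over term-frequency vectors. This is a deliberately simple stand-in chosen for illustration; it is not one of the classifiers cited above (SVM, neural networks, random forests, AdaBoost). The category names, training data, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train(training_docs):
    """training_docs: list of (tokens, category). Returns {category: centroid vector}."""
    sums = defaultdict(Counter)
    counts = Counter()
    for tokens, category in training_docs:
        sums[category].update(tokens)  # accumulate raw term frequencies
        counts[category] += 1
    # Centroid: average term frequency over the training documents of a category.
    return {c: {t: f / counts[c] for t, f in vec.items()} for c, vec in sums.items()}


def rank_categories(tokens, centroids):
    """Soft categorization: all categories ranked by similarity, best first."""
    doc_vec = Counter(tokens)
    scored = [(cosine(doc_vec, cen), cat) for cat, cen in centroids.items()]
    return [cat for _, cat in sorted(scored, reverse=True)]


# Hypothetical, manually indexed training documents (already pre-processed).
training_docs = [
    (["heart", "surgery", "patient"], "Medicine"),
    (["patient", "drug", "trial"], "Medicine"),
    (["thesaurus", "indexing", "catalogue"], "Library science"),
    (["catalogue", "classification", "library"], "Library science"),
]
centroids = train(training_docs)
ranking = rank_categories(["drug", "patient", "dosage"], centroids)
hard_decision = ranking[0]  # hard categorization keeps only the top category
```

The ranked list corresponds to the soft approach, and taking only its first element corresponds to a hard decision. In practice the term-frequency vectors would usually be replaced by weighted vectors such as tf-idf, and part of the manually indexed collection would be held back as test documents.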

Evaluation in text categorization is often conducted by comparison against pre-assigned categories in test collections created for that task. Evaluation generally excludes deeper considerations of contexts like real-life end-user tasks and information practices. Furthermore, problems of using existing test collections for text categorization have been reported. Yang (1999) claims that the most serious problem in text categorization evaluations is the lack of standard data collections and shows how different versions even of the same collection have a strong impact on the performance. This corresponds to the well-established knowledge from inter-indexer consistency studies that human indexing is very inconsistent, and that inconsistency is an inherent feature of indexing, rather than a sporadic anomaly. Therefore Hjørland (2018, Section 3.2) concluded: “That human indexing is sometimes taken as the golden standard to which computer-indexing is adjusted is of course problematic in the light of the large degree of inconsistency found in empirical investigations and the uncertainty about how indexing should be evaluated.”

Comparison between automatically and manually assigned categories is calculated using performance measures such as precision and recall used in information retrieval evaluation (see, for example, Manning, Raghavan, and Schütze 2008, chapter 8). In information retrieval, precision is defined as the fraction of retrieved documents that are relevant to the query and recall as the fraction of documents relevant to the query that are successfully retrieved.

Translated to automatic subject indexing, recall is calculated as the number of correct automatically assigned index terms divided by the number of manually assigned index terms. Precision is the number of correct automatically assigned index terms divided by the number of all automatically assigned index terms.

\[ \text{Precision} = \frac{\text{number of correct automatically assigned index terms}}{\text{number of all automatically assigned index terms}} \]

\[ \text{Recall} = \frac{\text{number of correct automatically assigned index terms}}{\text{number of manually assigned index terms}} \]

Macroaveraging and microaveraging are then used to obtain average performance over all index terms. Other aspects of algorithm performance may be evaluated, such as the speed of computation across the different steps of the process. For a detailed overview of these and other evaluation measures in text categorization, see Sebastiani (2002, 32-39). For further information on text categorization in terms of technical detail please refer to Sebastiani (2002) and for a more general overview to Mladenić and Grobelnik (2014).

3.3.2 Document clustering

Document clustering is the term most often used to refer to automatic construction of groups of topically related documents and automatic derivation of names for those groups of documents. Also, relationships between the groups of documents may be automatically determined, such as those that are hierarchical. No training documents are used from which the algorithm can “learn” to assign similar documents to the same topics. Therefore, this approach is known as unsupervised learning whereby the algorithm learns from existing examples without any “supervision.”

The document clustering approach is best suited for situations when there is no target KOS at hand and no training documents, but the documents need to be topically grouped. It traditionally has been used to improve information retrieval, for example, when grouping search engine results into topics. On the other hand, automatic derivation of names and relationships is still a very challenging aspect of document clustering. “Automatically-derived structures often result in heterogeneous criteria for category membership and can be difficult to understand” (Chen and Dumais 2000). Further, the clusters and relationships between them change as new documents are added to the collection; frequent changes of cluster names and relationships between them may not be user-friendly, for example, when applied for hierarchical topical browsing of a document collection. Koch, Zettergren and Day (1999) suggest that document clustering is better suited for organizing web search engine results.

The process of document clustering normally involves two major steps. First, documents in the collection at hand are typically each represented by vectors. The vectors are then compared to one another using vector similarity measures such as the cosine measure. A variety of heuristic principles may be applied when deriving vectors, as outlined in Section 3.1. Second, the chosen clustering algorithm is applied to group similar documents, name the clusters, and, if decided, derive relationships between clusters.

Similar to text categorization, there are two different approaches to clustering, hard and fuzzy (or soft). In hard clustering, one document may be a member of one cluster only, while in fuzzy clustering, any document may belong to any number of clusters. Hard clustering is the most common approach to clustering. Its subtypes are partitional (also called flat) and hierarchical clustering. A typical example of partitional clustering is the k-means algorithm whereby the first step is to randomly create a k number of clusters and then new documents are added to the different clusters based on their similarity. As the document is added to the cluster, the clusters and their centroids (centre of a cluster) are re-computed. In hierarchical clustering, there are divisive and agglomerative algorithms. Divisive hierarchical clustering is a top-down approach in which, at start, all documents are grouped into one cluster that is then subdivided into smaller and smaller clusters up until each cluster comprises one document. Agglomerative hierarchical clustering is a bottom-up approach, starting from a set of clusters each comprising a single document, and gradually merging those with most similar vectors. Examples of less common approaches to clustering include self-organizing maps (see, e.g., Paukkeria et al. 2012; Lin, Brusilovsky, and He 2011; Saarikoski 2011), and genetic algorithms (see, e.g., Song, Yang, and Park 2011).

Bibliometrics also applies document clustering, to map research fields or represent subject categories. It does so by linking documents through establishing relations between documents that cite each other (co-citation), or that share the same reference sets (bibliographic coupling), for example. The underlying assumption is that the more connections are established, the more the documents have in common scientifically, which can also be interpreted as different research specializations, research areas or subject categories. In order to assign topical words to clusters instead of author or journal names from references, co-word analysis of titles, keywords or abstracts may be performed. Combining reference/citation analysis with co-word analysis is another approach. For more detail on these matters, see Åström (2014).

Evaluation in document clustering is often conducted by comparison to an existing manually created KOS or manually pre-assigned classes. Measures used include the number of correct decisions compared to all decisions (Rand index); precision, recall, and related. These are called external validity measures. There are also internal validity measures that estimate compactness, i.e., how close the documents are to each other in each cluster (the closer the better as this indicates better similarity), and separability, i.e., how distant two clusters are from one another (the more distant the better) (Frommholz and Abbasi 2014).

For further detail on similarity measures and other aspects of document clustering, please see chapters 16 and 17 of Manning, Raghavan, and Schütze (2008) and Frommholz and Abbasi (2014).

3.3.3 Document classification

A perhaps less established approach that we identify in this entry is that which tends to arise more specifically from the library and information science community whereby the purpose is to apply quality-controlled KOSs more directly to typical subject indexing (including classification) tasks in library catalogues or closely related information retrieval systems, in order to improve searching and browsing. For the purposes of this work and to distinguish between the previous two approaches, as well as to follow the line of previously published research (cf. Golub 2006a), we name this approach document classification. However, because this approach seems less established than the previous two, the community around it being less coherent, principles and methods applied may not be as homogeneous.

Apart from using quality-controlled KOSs for subject indexing and classification, this seems to be the only approach using string-matching between terms from the documents to be indexed and target index terms. As in text categorization and document clustering, the pre-processing of documents to be classified typically includes stop-words removal; stemming can be conducted; words or phrases from the text of documents to be classified are extracted and weights are assigned to them based on different heuristics; while vector representations and manipulations are not necessary. Furthermore, examples using machine learning exist as seen from below. However, as to supervised machine learning, research points to scenarios where it may not work due to the lack of training documents, especially for large KOSs; Wang (2009) and Waltinger et al. (2011) argue that Dewey Decimal Classification’s deep and detailed hierarchies lead to data sparseness and thus skewed distribution in supervised machine learning approaches.

While this approach is obviously different from document clustering in that here we have a target KOS, it shares this particular feature with the text categorization approach. Following the criteria to distinguish between approaches set out at the start of Section 3.3, the document classification approach is different from text categorization in that:

– its application tends to be tightly related to applying quality-controlled KOSs directly to typical subject indexing and classification tasks in library catalogues or related operative information retrieval systems;
– this seems to be the only approach using string-matching between terms from the documents to be indexed and target index terms, although examples using machine learning also exist, the latter being problematic due to training data sparseness especially for large KOSs.
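As a rough illustration of the string-matching approach that distinguishes document classification, the sketch below matches the terms of a tiny, entirely hypothetical KOS against the title and body of a document, counts title matches double (a location heuristic of the kind mentioned in Section 3.1.2), and then scores the result with the precision and recall definitions given in Section 3.3.1. The concept identifiers, terms, weights, and function names are all assumptions made for this example; nested terms such as "indexing" inside "subject indexing" are both counted, which is a simplification.

```python
import re

# Hypothetical mini-KOS: concept identifier -> terms that describe the concept.
KOS = {
    "C1": ["subject indexing", "indexing"],
    "C2": ["library catalogue", "cataloguing"],
    "C3": ["machine learning", "neural networks"],
}


def tokens_of(text):
    """Lower-cased word tokens, punctuation removed."""
    return re.findall(r"\w+", text.lower())


def count_phrase(tokens, phrase):
    """Number of positions at which the token sequence `phrase` occurs in `tokens`."""
    n = len(phrase)
    return sum(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def assign_terms(title, body, kos, title_weight=2, top_n=2):
    """Score concepts by string matches (title hits weigh more); return the top_n."""
    title_toks, body_toks = tokens_of(title), tokens_of(body)
    scores = {}
    for concept, terms in kos.items():
        score = 0
        for term in terms:
            phrase = tokens_of(term)
            score += title_weight * count_phrase(title_toks, phrase)
            score += count_phrase(body_toks, phrase)
        if score:
            scores[concept] = score
    # Highest score first; ties broken by concept identifier for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


def precision_recall(assigned, manual):
    """Precision and recall of automatically assigned terms against manual ones."""
    correct = len(set(assigned) & set(manual))
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(manual) if manual else 0.0
    return precision, recall


assigned = assign_terms(
    "Automatic subject indexing",
    "Machine learning and string matching are used for indexing in a library catalogue.",
    KOS,
)
precision, recall = precision_recall(assigned, manual=["C1", "C3"])
```

No training documents are needed here, which is the practical appeal of string matching for large KOSs; the cost is that matching quality depends entirely on how well the KOS terms cover the vocabulary of the documents.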

However, as in many classifications, there are grey zones, which are discussed below.

The focus of research is often on publicly available, operative information systems using well-known KOSs. Examples include universal KOSs: Dewey Decimal Classification (DDC); Universal Decimal Classification (UDC); Library of Congress Classification (LCC); FAST (Faceted Application of Subject Terminology); German subject headings (Schlagwortnormdatei (SWD)); as well as subject-specific systems: Medical Subject Headings (MeSH), National Library of Medicine (NLM) classification system, Engineering Index classification system and thesaurus (used by the Compendex database), Inspec classification system and thesaurus, Fachinformationszentrums Technik (FIZ Technik) thesaurus and classification system, AGROVOC thesaurus, Humanities and Social Science Electronic Thesaurus (HASSET), and EuroVoc thesaurus. As the predicted relevance of this approach to the readers of the ISKO Encyclopedia is high, a more detailed, albeit non-exhaustive, overview of research is provided in this section for illustration purposes. The overview is structured around the specific KOSs.

Online Computer Library Center’s (OCLC) project Scorpion (OCLC Research 2004) built tools for automatic subject recognition, using DDC. The main idea was to treat a document to be indexed as a query against the DDC knowledge base. The results of the “search” were treated as subjects of the document. Larson (1992) used this idea earlier, for books. In Scorpion, clustering was also used, for refining the result set and for further grouping of documents falling in the same DDC class (Subramanian and Shafer 1998). Another OCLC project, WordSmith (Godby and Reighart 2001), was to develop software to extract significant noun phrases from a document. The idea behind it was that the precision of automatic indexing could be improved if the input to the classifier were represented as a list of the most significant noun phrases, instead of as the complete text of the raw document. However, the results showed no significant differences. Wolverhampton Web Library was a manually maintained library catalogue of British web resources, within which experiments on automating DDC classification were conducted (Jenkins et al. 1998). Resorting to already assigned DDC, Joorabchi and Mahdi (2011) extracted references from the document to be classified, compiled a list of publications that cite either the document to be classified or one of its references, and discovered their corresponding DDC numbers from existing library catalogues in order to then assign the most probable match to the document at hand. Similarly, Joorabchi and Mahdi (2013) assigned DDC and FAST by first identifying Wikipedia concepts in the document to be indexed/classified and then by searching WorldCat for records that contain those concepts. Then they compared the retrieved records against the document and assigned DDC and FAST to it from those with the highest matching score. Khoo et al. (2015) attempted to solve the problem of cross-searching unrelated libraries. To that end, they created DDC terms and numbers from pre-existing Dublin Core metadata. Their results indicate that the best performance is achieved when title, description, and subject terms are used in combination. Further, they demonstrate how taking advantage of DDC hierarchies for disambiguation in simple string-matching can achieve results that are competitive with machine learning approaches, yet without the need for training documents.

In the Nordic WAIS/World Wide Web Project, 1993-1996 (Ardö et al. 1994; Koch 1994), automatic indexing of the World Wide Web and Wide Area Information Server (WAIS) databases using UDC was experimented with. A WAIS subject tree was built based on the two top levels of UDC, i.e., fifty-one classes. UDC was also used by GERHARD, a robot-generated web index of web documents in Germany (Möller et al. 1999) that employed a multilingual version of UDC in English, German, and French.

Wartena and Sommer (2012) experimented with automatic indexing of articles in academic repositories using German subject headings (SWD). German subject headings have a thesaurus-like structure with synonyms, superordinate, and related terms. Also, about 40,000 terms have been enhanced with DDC classes. Like Khoo et al. (2015) (see above), they conclude that good results are achieved when applying string-matching, which they attribute to the enriched version of German subject headings. Junger (2014) reports on experiments run by the German National Library with the aim of using automatic indexing for online publications which it has no resources to catalogue manually. The library acquired Averbis, commercial machine-learning software that had previously specialized in automatic indexing of medical publications. With catalogue librarians as evaluators, recall was considered high but precision too low to be satisfactory, which was attributed to a lack of disambiguation mechanisms; co-occurrence analysis and related techniques were proposed for future implementation.

Frank and Paynter (2004) applied machine-learning techniques to assign Library of Congress Classification (LCC) notations to resources that already have an LCSH term assigned. Their solution was applied to INFOMINE (a subject gateway for scholarly resources at the time), where it was used to support hierarchical browsing.

One of the most well-researched automatic indexing software applications, known as Medical Text Indexer (MTI), was created in 1996 by the National Library of Medicine (many publications and other resources about it can be found at its website, https://ii.nlm.nih.gov/Publications/). It is semi-automatic software aimed at assigning MeSH. The general approach combines the intellectual work built into the rich Unified Medical Language System (UMLS) Metathesaurus and MeSH terms extracted from related citations with comprehensive indexing rules and machine learning. In one of the most recent articles, titled “12 Years On—Is the NLM Medical Text Indexer Still Useful and Relevant?,” Mork, Aronson, and Demner-Fushman (2017) show how indexers have continually increased their use of the MTI, from 15.75% of the articles indexed with it in 2002 to 62.44% in 2014, at the same time also spreading to new subject areas of use, indicating its usefulness. Furthermore, the MTI performance statistics show significant improvement in precision and F-measure, while they also point to the need to improve recall. One point for further research and development is to resort more to machine learning while keeping the existing components. Of other medical document types, Pratt (1997) experimented with organizing search results into MeSH categories. Lüschow and Wartena (2017) applied the k-nearest-neighbour (kNN) algorithm to a collection of medical documents with pre-assigned classes from several classification systems, with the aim of using them as a basis on which to automatically assign the National Library of Medicine classification system, thus using already assigned classes from other classification systems instead of using, e.g., book titles or keywords as the content representation for each document.

“All” Engineering was a robot-generated web index of about 300,000 web documents, developed as an experimental module of the manually created subject gateway Engineering Electronic Library (EELS) (Koch and Ardö 2000). The Engineering Index (Ei) thesaurus was used; in this thesaurus, terms are enriched with their mappings to the Ei classification scheme. The project demonstrated the importance of applying a good KOS for achieving automatic indexing accuracy: 60% of documents were correctly classified, using only a very simple string-matching algorithm based on a limited set of heuristics and simple weighting. Another robot-generated web index, Engine-e, used a slightly modified version of the automatic indexing approach developed in “All” Engineering (Lindholm, Schönthal, and Jansson 2003). Engine-e provided subject browsing of engineering documents based on Ei terms, with six broader categories as starting points. Golub, Hamon, and Ardö (2007) applied string-matching where the Ei thesaurus terms were enriched with automatically extracted terms from bibliographic records of the Compendex database, using multi-word morpho-syntactic analysis and synonym acquisition, based on the existing preferred and synonymous terms (as they gave the best precision results). Golub (2011) worked with Ei to automatically organize web pages into hierarchical structures for subject browsing, achieving results suggesting that a KOS with a sufficient number of entry terms designating classes could significantly increase the performance of automatic indexing algorithms. Further, if the same KOS had an appropriate hierarchical structure, it would provide a good browsing structure for the collection of automatically classified documents.

Plaunt and Norgard (1997) applied a supervised training algorithm based on extracting lexical terms from bibliographic records and associating them with manually assigned INSPEC thesaurus terms. Project BINDEX (Bilingual Automatic Parallel Indexing and Classification) (Maas et al. 2002) applied automatic indexing to abstracts in engineering available in the English and German languages. It used the English Inspec thesaurus and classification system, as well as FIZ Technik’s bilingual thesaurus and classification system. Morpho-syntactic analysis of a document was performed. It involved identification of single and multiple-word terms, tagging and lemmatization, and homograph resolution. Keywords were extracted and matched against the thesauri, and then classification codes were derived. Keywords above a certain threshold which were not in the thesaurus were assigned as free index terms. Enriching records with terms other than those from the KOS at hand might lead to improved retrieval. To that end, Joorabchi and Mahdi (2014) experimented with adding Wikipedia concepts to existing library records.

Lauser and Hotho (2003) applied a support vector machines (SVM) algorithm to index a collection of agricultural documents with the AGROVOC thesaurus. The algorithm improved when they made use of the semantic information contained in AGROVOC. Similarly, Medelyan and Witten (2008) used KEA, a Naïve Bayes algorithm for extracting both derived and assigned index terms, and achieved good performance with little training data, because they also made use of the AGROVOC semantic information.

Of other examples, De Campos and Romero (2009) used machine learning to classify parliamentary resolutions from the regional Parliament of Andalucía in Spain using EuroVoc. El-Haj et al. (2013) experimented with applying HASSET terms to the UK Data Archive/UK Data Service data-related document collection. Their approach was based on applying an open source, machine-learning keyphrase extractor, KEA (Keyphrase Extraction Algorithm).

As we see from the examples above, in many of the cases, the relationships built into KOSs are exploited with favorable results. Willis and Losee (2013) specifically experimented with just that. They employed four thesauri in order to determine to what degree the in-built relationships may be used to the advantage of automatic subject indexing. Their results indicate great potential, although the degree of success seems to depend on the thesaurus as well as the collection.
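The simple string-matching with heuristics and weighting reported above for projects such as “All” Engineering can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the algorithm of any cited system: the tiny vocabulary, the class notations (`C1`, `C2`), the term weights, the title boost, and the threshold are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical vocabulary: thesaurus term -> (class notation, term weight).
# Real systems would load these mappings from the KOS itself.
VOCABULARY = {
    "heat transfer": ("C1", 2.0),
    "thermal conductivity": ("C1", 1.0),
    "bridge": ("C2", 1.0),
}


def classify(text, title="", vocabulary=VOCABULARY,
             title_boost=2.0, threshold=1.0):
    """Return (class, score) pairs whose score reaches the threshold."""
    text, title = text.lower(), title.lower()
    scores = Counter()
    for term, (notation, weight) in vocabulary.items():
        # Whole-word, case-insensitive match of the term string.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        body_hits = len(re.findall(pattern, text))
        title_hits = len(re.findall(pattern, title))
        # Heuristic: a match in the title counts more than one in the body.
        score = weight * (body_hits + title_boost * title_hits)
        if score:
            scores[notation] += score
    return [(c, s) for c, s in scores.most_common() if s >= threshold]


doc_title = "Heat transfer study"
doc_text = ("Heat transfer in steel: thermal conductivity measurements "
            "and heat transfer models.")
print(classify(doc_text, doc_title))  # [('C1', 9.0)]
```

No training documents are involved: the outcome depends entirely on the entry terms and mappings supplied by the vocabulary, which is why the coverage and quality of the KOS matter so much for this approach.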

A major advantage of this approach is that it does not require training documents, while still maintaining a pre-defined structure of the KOS at hand. If using a high-quality KOS, e.g., a well-developed classification scheme, it will also be suitable for subject searching and browsing in information retrieval systems. Apart from improved information retrieval, another motivation to apply a KOS in automatic classification is to re-use the intellectual effort that has gone into creating such a KOS. It can be employed with vocabularies containing uneven hierarchies or sparse distribution across a given collection.

As for evaluation methods, measures such as precision, recall, and F-measure are commonly used. This seems to be the only approach where the discussion is at least occasionally raised of the need to attend to the complexities of evaluation closer to real-life needs and scenarios. Even aspects such as automatic indexing warrant are taken on; Chung, Miksa, and Hastings (2010) conclude that literary warrant is better suited to automatic indexing of scientific articles than user warrant.

4.0 Application in operative systems

The discussion of how applicable automatic subject indexing is today calls for looking into several connected issues. Theoretically, automating subject determination belongs to logical positivism: a subject is considered to be a string that occurs above a certain frequency, is not a stop word, and is in a given location such as a title (Svenonius 2000, 46-49). In algorithms, inferences are made such as: if document A is on subject X, then if document B is sufficiently similar to document A (e.g., they share similar words or references), then document B is on that subject. Another critique is the lack of theoretical justification for vector manipulations, such as the cosine measure that is often used to obtain vector similarities (Salton 1991, 975). Further, it is assumed that concepts have names, which can be more common in, for example, natural sciences, but much less so in humanities and social sciences, although attempts to address this have been undertaken more recently (see Section 3.1).

A variety of factors contribute to the challenge of automatic subject indexing. Texts are a complex cognitive and social phenomenon, and cognitive understanding of text engages many knowledge sources, sustains multiple inferences, and involves a personal interpretation (Moens 2000, 7-10). Morris (2010) investigated individual differences in the interpretation of text meaning using lexical chains (groups of semantically related words) based on three texts and with twenty-six participants; the results showed about 40% difference in interpretation. Research in automatic understanding of text covers the linguistic coding (vocabulary, syntax, and semantics of the language and discourse properties), domain world knowledge, shared knowledge between the creator and user of the text, and the complete context of the understanding at a specific point in time, including the ideology, norms, and background of the user, and the purposes of using the text. In 2003, Lancaster claimed that existing automatic subject indexing tools were far from being able to handle these complexities, and that in applications it is seldom possible to go much further beyond vocabulary and syntax analysis. The difficulty of dealing with semantics and more advanced levels is reflected in the fact that the methods used today are not particularly new, although not as rudimentary as when first used (Lancaster 2003, 330-331).

Still, software vendors and experimental researchers speak of the high potential of automatic indexing tools. While some claim to entirely replace manual indexing in certain subject areas (e.g., Roitblat, Kershaw and Oot 2010), others recognize the need for both manual (human) and computer-assisted indexing, each with its advantages and disadvantages (e.g., Anderson and Perez-Carballo 2001; Svarre and Lykke 2014). Reported examples of operational information systems where machine-aided indexing is applied include NASA’s MAI software, which was shown to increase production and improve indexing quality (Silvester 1997), and the Medical Text Indexer at the US National Library of Medicine, which, by 2017, was consulted by indexers for over 60% of the articles indexed (Mork, Aronson and Demner-Fushman 2017).

However, hard evidence on the success of automatic indexing tools in operating information environments is scarce; research is usually conducted in laboratory conditions, excluding the complexities of real-life systems and situations. The practical value of automatic indexing tools is largely unknown due to problematic evaluation approaches. Having reviewed a large number of automatic indexing studies, Lancaster concluded that the research comparing automatic versus manual indexing is flawed (2003, 334). One common evaluation approach is testing the quality of retrieval based on the assigned index terms. But retrieval testing is fraught with problems, too; the results depend on many factors, so retrieval testing cannot isolate the quality of the index terms. Another approach is to measure indexing quality directly. One method of doing so is to compare automatically assigned metadata terms against existing human-assigned terms or classes of the document collection used (as a “gold standard”), but this method also has problems. When indexing, people make errors, such as those related to exhaustivity (too many or too few subjects assigned) or specificity (usually because the assigned subject is not the most specific available); they may omit important subjects or assign an obviously incorrect subject (see also Hjørland 2017 for a detailed discussion on different aspects of aboutness). In addition, it has been reported that different people, whether users or professional subject indexers, assign different subjects to the same document. One reason for this is differences in approach: on the one hand, the rationalist idea that there is one correct way to index a document (or a collection), and on the other, the pragmatic idea that different purposes and users may need different indexing (Hjørland 2018). Therefore, existing metadata records cannot be used as “the gold standard”; the classes assigned by algorithms (but not human-assigned) might be wrong, or might be correct but omitted during human indexing by mistake or by abiding by a certain indexing policy.

In order to address the complexities surrounding the problem of aboutness, Golub et al. (2016) propose a comprehensive framework involving three major steps: evaluating indexing quality directly through assessment by an evaluator or through comparison with a gold standard, evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow, and evaluating indexing quality indirectly through analyzing retrieval performance. The framework still needs to be tested empirically, and it is expected that much more research is required to develop appropriate evaluation designs for such complex phenomena involving subject indexing and retrieval and information interaction in general.

While evaluation approaches often assume that human indexing is best, and that the task of automatic indexing is to meet the standards of human indexers, more serious scholarship needs to be devoted to evaluation in order to further our understanding of the value of automatic subject assignment tools and to enable us to provide fully informed input for their development and enhancement. Hjørland (2011) points to the problems of evaluating indexing using the example of an empirical study, discussing it from a theory of knowledge point of view while analyzing its epistemological position. He concludes by proposing that the ideal formula for the future of indexing is that the human indexer takes what automatic indexing is good at (once this is understood) and invests their resources in the value-added indexing that requires human judgment and interpretation. This may be in line with machine-aided indexing in operative systems like the Medical Text Indexer mentioned at the start of this section.

5.0 Conclusions

The basic principles applied in various approaches to automatically assigning index terms are, at their foundational level, effectively the same. The focus is still largely at the level of words rather than concepts and commonly includes punctuation and stop-word removal, stemming, heuristic rules, and vector representations and manipulations. While attempts to determine concepts rather than words exist and include LSI and word2vec as well as exploiting relationships from existing KOSs, much more research is needed in this respect.

Approaches to automatic subject indexing may be grouped based on various criteria; those followed in this work are based on the general context set out for this entry, that is, assigned subject indexing for purposes of information retrieval. The named approaches are also in line with previous research and include: text categorization, document clustering and document classification. Major differences between them include application purposes and presence or absence of machine learning, as well as whether machine learning is supervised or unsupervised. The document classification approach employs, more than others, subject indexing languages such as classification schemes, subject headings systems, and thesauri, which are also suitable for subject searching and browsing in an information retrieval system (although often suggested improvements such as being more up-to-date, end-user friendly, etc. should be addressed). Not least, exploiting the intellectual work that has been invested into creating such subject indexing languages in order to improve automatic indexing has been shown to be a worthwhile path to explore more extensively in the future.

Due to the complexities of aboutness, existing experimental systems and approaches have not been adequately tested, and therefore knowledge about their usefulness for operational systems seems to be flawed. A recently proposed comprehensive evaluation framework involves three major steps: evaluating indexing quality directly through assessment by an evaluator or through comparison with a gold standard, evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow, and evaluating indexing quality indirectly through analyzing retrieval performance. Further research is needed to empirically test it as well as to devise the most appropriate evaluation approaches for different specific contexts.

References

Anderson, James D. and Jose Perez-Carballo. 2001. “The Nature of Indexing: How Humans and Machines Analyze Messages and Texts for Retrieval. Part II: Machine Indexing, and the Allocation of Human versus Machine Effort.” Information Processing and Management 37, no. 2: 255-77.
Ardö, A. et al. 1994. “Improving Resource Discovery and Retrieval on the Internet: The Nordic WAIS/World Wide Web Project Summary Report.” NORDINFO Nytt 17, no. 4: 13-28.
Åström, Fredrik. 2014. “Bibliometrics and Subject Representation.” In Subject Access to Information: An Interdisciplinary Approach, ed. Koraljka Golub, 107-17. Santa Barbara: Libraries Unlimited.

Baxendale, Phyllis B. 1958. “Machine-made Index for Technical Literature—An Experiment.” IBM Journal of Research and Development 2: 354-61.
Blei, David M., Andrew Y. Ng and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993-1022.
Bradshaw, Shannon and Kristian Hammond. 1999. “Constructing Indices from Citations in Collections of Research Papers.” In Knowledge, Creation, Organization and Use: ASIS '99: 62nd ASIS Annual Meeting, Washington, DC, October 31-November 4, 1999, ed. Marjorie K. Hlava and Larry Woods. Silver Spring, MD: American Society for Information Science, 741-50.
Ceci, Michelangelo and Donato Malerba. 2003. “Hierarchical Classification of HTML Documents with WebClassII.” In Advances in Information Retrieval: 25th European Conference on IR Research, ECIR 2003, Pisa, Italy, April 14-16, 2003, ed. Fabrizio Sebastiani. Berlin: Springer, 57-72. doi:10.1007/3-540-36618-0_5
Chen, Hao and Susan Dumais. 2000. “Bringing Order to the Web: Automatically Categorizing Search Results.” In Proceedings of the ACM CHI 2000 Human Factors in Computing Systems Conference, The Hague, The Netherlands, April 1-6, 2000, ed. Thea Turner, Gerd Szwillus, Mary Czerwinski, Fabio Peterno, and Steven Pemberton. New York: ACM Press, 145-52.
Chung, EunKyung, Shawne Miksa and Samantha K. Hastings. 2010. “A Framework of Automatic Subject Term Assignment for Text Categorization: An Indexing Conception-based Approach.” Journal of the American Society for Information Science and Technology 61: 688-99.
De Campos, Louis M. and Alfonso E. Romero. 2009. “Bayesian Network Models for Hierarchical Text Classification from a Thesaurus.” International Journal of Approximate Reasoning 50: 932-44.
Deerwester, Scott, Susan T. Dumais, Thomas K. Landauer, George W. Furnas and Louis Beck. 1988. “Improving Information Retrieval with Latent Semantic Indexing.” In ASIS '88: Proceedings of the 51st ASIS Annual Meeting, Atlanta, Georgia, October 23-27, 1988, ed. Christine L. Borgman and Edward Y. H. Pai. Proceedings of the ASIS Annual Meeting 25. Medford: Learned Information, 36-40.
Dehghani, Mostafa, Hosein Azarbonyad, Maarten Marx and Jaap Kamps. 2015. “Sources of Evidence for Automatic Indexing of Political Texts.” In Advances in Information Retrieval: 37th European Conference on IR Research, ECIR 2015, Vienna, Austria, March 29 - April 2, 2015; Proceedings, ed. Allan Hanbury, Gabriella Kazai, Andreas Rauber, and Norbert Fuhr. Lecture Notes in Computer Science 9022. Cham: Springer, 568-73. doi:10.1007/978-3-319-16354-3_63
Driscoll, James R. 1991. “The Operation and Performance of an Artificially Intelligent Keywording System.” Information Processing & Management 27, no. 1: 43-54.
Dumais, Susan T. and Hao Chen. 2000. “Hierarchical Classification of Web Content.” In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 24-28, 2000, Athens, Greece, ed. Emmanuel Yannakoudakis, Nicholas J. Belkin, Mun-Kew Leong, and Peter Ingwersen. [S. l.]: ACM 2000, 256-63.
El-Haj, Mahmoud, Lorna Balkan, Suzanne Barbalet, Lucy Bell, and John Shepherdson. 2013. “An Experiment in Automatic Indexing Using the HASSET Thesaurus.” In 2013 5th Computer Science and Electronic Engineering Conference (CEEC), 17th-18th September 2013, University of Essex, UK: Conference Proceedings. [Piscataway, NJ]: IEEE, 13-8. doi:10.1109/CEEC.2013.6659437
Esser, Daniel, Daniel Schuster, Klemens Muthmann, Michael Berger and Alexander Schill. 2012. “Automatic Indexing of Scanned Documents: A Layout-based Approach.” In Document Recognition and Retrieval XIX: Part of the IS&T/SPIE 24th Annual Symposium on Electronic Imaging, 22-26 January 2012, San Francisco, CA USA, ed. Christian Viard-Gaudin and Richard Zanibbi. SPIE Proceedings 8297. Washington: SPIE. doi:10.1117/12.908542
Flett, Alan and Stuart Laurie. 2012. “Applying Taxonomies through Auto-Classification.” Business Information Review 29, no. 2: 111-20.
Frank, Eibe and Gordon W. Paynter. 2004. “Predicting Library of Congress Classifications from Library of Congress Subject Headings.” Journal of the American Society for Information Science and Technology 55: 214-27.
Freund, Yoav and Robert E. Schapire. 1997. “A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting.” Journal of Computer and System Sciences 55: 119-39.
Frommholz, Ingo and Muhammad Kamran Abbasi. 2014. “Automated Text Categorization and Clustering.” In Subject Access to Information: An Interdisciplinary Approach, ed. Koraljka Golub. Santa Barbara: Libraries Unlimited, 117-31.
Fuhr, Norbert and Gerhard Knorz. 1984. “Retrieval Test Evaluation of a Rule-based Automated Indexing (AIR/PHYS).” In Research and Development in Information Retrieval: Proceedings of the Third Joint BCS and ACM Symposium, King's College, Cambridge, 2-6 July 1984, ed. Cornelis Joost van Rijsbergen. Cambridge: Cambridge University Press, 391-408.
Ghiassi, Manoochehr, Michael Olschimke, Brian Moon and Paul Arnaudo. 2012. “Automated Text Classification Using a Dynamic Artificial Neural Network Model.” Expert Systems with Applications 39: 10967-76.

Gil-Leiva, Isidoro. 2017. “SISA – Automatic Indexing System for Scientific Articles: Experiments with Location Heuristics Rules Versus TF-IDF Rules.” Knowledge Organization 44: 139-62.
Godby, C. Jean and Ray R. Reighart. 2001. “The WordSmith Indexing System.” Journal of Library Administration 34, nos. 3-4: 375-85.
Goldberg, Yoav and Omer Levy. 2014. “Word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722
Golub, Koraljka. 2006a. “Automated Subject Classification of Textual Web Documents.” Journal of Documentation 62: 350-71.
Golub, Koraljka. 2006b. “Automated Subject Classification of Textual Web Pages, Based on a Controlled Vocabulary: Challenges and Recommendations.” New Review of Hypermedia and Multimedia 12, no. 1: 11-27.
Golub, Koraljka. 2016. “Potential and Challenges of Subject Access in Libraries Today on the Example of Swedish Libraries.” International Information & Library Review 48: 204-10.
Golub, Koraljka and Anders Ardö. 2005. “Importance of HTML Structural Elements and Metadata in Automated Subject Classification.” In Research and Advanced Technology for Digital Libraries: 9th European Conference, ECDL 2005, Vienna, Austria, September 18-23, 2005; Proceedings, ed. Andreas Rauber, Stavros Christodoulakis and Min A. Tjoa. Berlin: Springer, 368-78.
Golub, Koraljka, Thierry Hamon and Anders Ardö. 2007. “Automated Classification of Textual Documents based on a Controlled Vocabulary in Engineering.” Knowledge Organization 34: 247-63.
Golub, Koraljka and Birger Larsen. 2005. “Different Approaches to Automated Classification: Is There an Exchange of Ideas?” In Proceedings of ISSI 2005: The 10th International Conference of the International Society for Scientometrics and Informetrics, Stockholm, Sweden, July 24-28, 2005, ed. Peter Ingwersen and Birger Larsen. Stockholm: Karolinska University Press, 270-4.
Golub, Koraljka, Dagobert Soergel, George Buchanan, Douglas Tudhope, Marianne Lykke, and Debra Hiom. 2016. “A Framework for Evaluating Automatic Indexing or Classification in the Context of Retrieval.” Journal of the Association for Information Science and Technology 67: 3-16.
Grobelnik, Marko and Dunja Mladenić. 2005. “Simple Classification into Large Topic Ontology of Web Documents.” Journal of Computing and Information Technology 13: 279-85.
Gövert, Norbert, Mounia Lalmas, and Norbert Fuhr. 1999. “A Probabilistic Description-oriented Approach for Categorising Web Documents.” In CIKM99: Eighth International Conference on Information and Knowledge Management, Kansas City, MO, USA – November 02 - 06, 1999, ed. Susan Gauch. New York, NY: Association for Computing Machinery, 475-82.
Hamm, Sandra and Kurt Schneider. 2015. “Automatische Erschliessung von Universitätsdissertationen.” Dialog mit Bibliotheken 27, no. 1: 18-22.
Harris, Zellig S. 1954. “Distributional Structure.” Word 10, no. 23: 146-62.
Hartigan, John A. 1996. “Introduction.” In Clustering and Classification, ed. Phipps Arabie, Lawrence J. Hubert and Geert de Soete. Singapore: World Scientific, 3-5.
Hjørland, Birger. 2011. “The Importance of Theories of Knowledge: Indexing and Information Retrieval as an Example.” Journal of the American Society for Information Science and Technology 62: 72-7.
Hjørland, Birger. 2017. “Subject (of Documents).” Knowledge Organization 44: 55-64.
Hjørland, Birger. 2018. “Indexing: Concepts and Theory.” Knowledge Organization 45: 609-39.
Hlava, Marjorie K. 2009. “Understanding ‘Rule Based’ vs. ‘Statistics Based’ Indexing Systems: Data Harmony White Paper.” Reprinted from Information Outlook with permission, updated April, 2009. https://web.archive.org/web/20090417210346/http://www.dataharmony.com:80/library/whitePapers/auto_indexing_rule-based_vs_statistics-based.htm
Hofmann, Thomas. 2001. “Unsupervised Learning by Probabilistic Latent Semantic Analysis.” Machine Learning 42, no. 1: 177-96.
Hu, Yi and Wenjie Li. 2011. “Document Sentiment Classification by Exploring Description Model of Topical Terms.” Computer Speech & Language 25: 386-403.
Huang, Lan, David Milne, Eibe Frank, and Ian H. Witten. 2012. “Learning a Concept-based Document Similarity Measure: Report.” Journal of the American Society for Information Science and Technology 63: 1593-1608.
Humphrey, Susanne M., Aurélie Névéol, Allen Browne, Julien Gobeil, Patrick Ruch and Stéfan J. Darmoni. 2009. “Comparing a Rule-Based Versus Statistical System for Automatic Categorization of MEDLINE Documents According to Biomedical Specialty.” Journal of the American Society for Information Science and Technology 60: 2530-9.
Hwang, San-Yih, Wan-Shiou Yang and Kang-Di Ting. 2010. “Automatic Index Construction for Multimedia Digital Libraries.” Information Processing & Management 46: 295-307.
International Organization for Standardization. 1985. Documentation, Methods for Examining Documents, Determining their Subjects, and Selecting Index Terms. International Standard ISO 5963. [Geneva]: International Organization for Standardization.

Jenkins, Charlotte, Mike Jackson, Peter Burden and Jon Wallis. 1998. “Automatic Classification of Web Resources Using Java and Dewey Decimal Classification.” Computer Networks & ISDN Systems 30: 646-8.
Jones, Kevin P. and Colin L. M. Bell. 1992. “Artificial Intelligence Program for Indexing Automatically (AIPIA).” In Online Information 92: 16th International Online Information Meeting Proceedings, London, 8-10 December 1992, ed. David I. Raitt. Medford, NJ: Learned Information, 187-96.
Joorabchi, Arash and Abdulhussain E. Mahdi. 2013. “Classification of Scientific Publications According to Library Controlled Vocabularies: A New Concept Matching-based Approach.” Library Hi Tech 31: 725-47.
Joorabchi, Arash and Abdulhussain E. Mahdi. 2014. “Towards Linking Libraries and Wikipedia: Automatic Subject Indexing of Library Records with Wikipedia Concepts.” Journal of Information Science 40: 211-21.
Junger, Ulrike. 2014. “Can Indexing Be Automated? The Example of the Deutsche Nationalbibliothek.” Cataloging & Classification Quarterly 52, no. 1: 102-9.
Kelleher, John D., Brian Mac Namee, and Aoife D'Arcy. 2015. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. Cambridge, MA: MIT Press.
Keyser, Pierre de. 2012. Indexing: From Thesauri to the Semantic Web. Oxford: Chandos.
Khoo, Michael John, Jae-wook Ahn, Ceri Binding, Hilary Jane Jones, Xia Lin, Diana Massam and Douglas Tudhope. 2015. “Augmenting Dublin Core Digital Library Metadata with Dewey Decimal Classification.” Journal of Documentation 71: 976-98.
Klassen, Myungsook and Nikhila Paturi. 2010. “Web Document Classification by Keywords Using Random Forests.” Communications in Computer and Information Science 88, no. 2: 256-61.
Koch, Traugott. 1994. “Experiments with Automatic Classification of WAIS Databases and Indexing of WWW.” In Internet World & Document Delivery World International 94, held in London in May 1994: Proceedings of the Second Annual Conference, ed. John W. T. Smith. London: Mecklermedia, 112-5.
Koch, Traugott and Anders Ardö. 2000. “Automatic Classification.” https://web.archive.org/web/20050301133443/http://www.lub.lu.se:80/desire/DESIRE36a-overview.html
Koch, Traugott, Ann-Sofie Zettergren and Michael Day. 1999. “Provide Browsing Using Classification Schemes.” https://web.archive.org/web/20050403233258/http://www.lub.lu.se/desire/handbook/class.html
Kosmopoulos, Aris, Eric Gaussier, Georgios Paliouras and Sujeevan Aseervatham. 2010. “The ECIR 2010 Large Scale Hierarchical Classification Workshop.” ACM SIGIR Forum 44, no. 1: 23-32.
Lancaster, Frederick W. 2003. Indexing and Abstracting in Theory and Practice. 3rd ed. London: Facet.
Larson, Ray R. 1992. “Experiments in Automatic Library of Congress Classification.” Journal of the American Society for Information Science 43, no. 2: 130-48.
Lauser, Boris and Andreas Hotho. 2003. “Automatic Multi-label Subject Indexing in a Multilingual Environment.” In Research and Advanced Technology for Digital Libraries: 7th European Conference, ECDL 2003, Trondheim, Norway, August 17-22, 2003; Proceedings, ed. Traugott Koch and Ingeborg T. Sølvberg. Lecture Notes in Computer Science 2769. Berlin: Springer, 140-51.
Lee, Lam, Hong Wan, Chin Rajkumar, and Heng Isa. 2012. “An Enhanced Support Vector Machine Classification Framework by Using Euclidean Distance Function for Text Document Categorization.” Applied Intelligence 37: 80-99.
Liere, Ray and Prasad Tadepalli. 1998. “Active Learning with Committees: Preliminary Results in Comparing Winnow and Perceptron in Text Categorization.” In Conference on Automated Learning and Discovery, June 11-13, 1998, Carnegie Mellon University, Pittsburgh, PA. S.l.: s.n., 591-6.
Lin, Yi-ling, Peter Brusilovsky and Daqing He. 2011. “Improving Self-organising Information Maps as Navigational Tools: A Semantic Approach.” Online Information Review 35, no. 3: 401-24.
Lindholm, Jessica, Tomas Schönthal and Kjell Jansson. 2003. “Experiences of Harvesting Web Resources in Engineering using Automatic Classification.” Ariadne no. 37. http://www.ariadne.ac.uk/issue37/lindholm/
Liu, Rey-Long. 2010. “Context-based Term Frequency Assessment for Text Classification.” Journal of the American Society for Information Science and Technology 61: 300-9.
Lösch, Mathias, Ulli Waltinger, Wolfram Horstmann and Alexander Mehler. 2011. “Building a DDC-annotated Corpus from OAI metadata.” Journal of Digital Information 12, no. 2. https://journals.tdl.org/jodi/index.php/jodi/article/view/1765
Luhn, Hans P. 1957. “A Statistical Approach to Mechanized Encoding and Searching of Literary Information.” IBM Journal of Research and Development 1: 309-17.
Lüschow, Andreas and Christian Wartena. 2017. “Classifying Medical Literature Using k-Nearest-Neighbours Algorithm.” In NKOS 2017, 17th European Networked Knowledge Organization Systems (NKOS) Workshop, co-located with the 21st International Conference on Theory and Practice of Digital Libraries 2017 (TPDL 2017), Thessaloniki, Greece, September 21st, 2017; Proceedings, ed. Philipp Mayr, Douglas Tudhope, Koraljka Golub, Christian Wartena and Ernesto William De Luca. http://ceur-ws.org/Vol-1937/paper3.pdf
Maghsoodi, Nooshin and Mohammad Mehdi Homayounpour. 2011. “Using Thesaurus to Improve Multiclass

Text Classification.” In Computational Linguistics and Intel- Mladenić, Dunja and Marko Grobelnik. 2014. “Machine ligent Text Processing: CICLing 2011, ed. Alexander F. Gel- Learning on Text.” In Subject Access to Information: An In- bukh. Lecture Notes in Computer Science 6609. Berlin: terdisciplinary Approach, ed. Koraljka Golub, 132-8. Santa Springer, 244-53. Barbara, CA: Libraries Unlimited. Mahfouz, Tarek. 2011. “Unstructured Construction Doc- Moens, Marie-Francine. 2000. Automatic Indexing and Ab- ument Classification Model through Support Vector stracting of Document Texts. Boston: Kluwer. Machine (SVM).” In Computing in Civil Engineering: Pro- Mork, James, Alan Aronson and Dina Demner-Fushman. ceedings of the 2011 ASCE International Workshop on Com- 2017. “12 Years on – Is the NLM Medical Text Indexer puting in Civil Engineering, June 19-22, Miami, Florida, ed. Still Useful and Relevant?” Journal of Biomedical Semantics Yimin Zhu and Raymond R. Issa. Reston, VA.: Ameri- 8, no. 8. doi:10.1186/s13326-017-0113-5 can Society of Civil Engineers, 126-33. Morris, Jane. 2010. “Individual Differences in the Interpre- Manning, Christopher and Hinrich Schütze. 1999. Founda- tation of Text: Implications for Information Science.” tions of Statistical Natural Language Processing. Cambridge, Journal of the American Society for Information Science and MA: MIT Press. Technology 61: 141-9. Manning, Christopher D., Prabhakar Raghavan, and Hin- Mynarz, Jindřich and Ctibor Škuta. 2010. “Integration of rich Schütze. 2008. Introduction to Information Retrieval. Automatic Indexing System within the Document Flow New York: Cambridge University Press. in Grey Literature Repository.” In: Twelfth International Maron, Melvil E. 1961. “Automatic Indexing: An Experi- Conference on Grey Literature: Transparency in Grey Litera- mental Inquiry.” Journal of the Association for Computing ture, 6-7 December 2010, ed. Dominic J. Farace and Jerry Machinery 8: 404-17. Frantzen. 
Amsterdam: TextRelease. http://www.grey Martinez-Alvarez, Miguel, Sirvan Yahyaei and Thomas net.org/images/GL12_S3P,_Mynarz_and_Skuta.pdf Roelleke. 2012. “Semi-automatic Document Classifica- Möller, Gerhard, Kai-Uwe Carstensen, Bernd Diekman, tion: Exploiting Document Difficulty.” In Advances in and Han Wätjen. 1999. “Automatic Classification of the Information Retrieval: 34th European Conference on IR Re- WWW Using the Universal Decimal Classification.” In search, ECIR 2012, Barcelona, Spain, April 1-5, 2012; Pro- Online Information 99: Proceedings, London, 7-9 December ceedings ed. Richard Baeza-Yates. Lecture Notes in Com- 1999, ed. David Raitt. Oxford: Learned Information puter Science 7224. Berlin: Springer, 468-71. Europe, 231-8. Mazzocchi, Fulvio. 2018. “Knowledge Organization System Mu, Jin, Karsten Stegmann, Elijah Mayfield, Carolyn Rose, (KOS): An Introductory Critical Account.” Knowledge Or- and Frank Fischer. 2012. “The ACODEA Framework: ganization 45: 54-78. Developing Segmentation and Classification Schemes McCallum, Andrew, Ronald Rosenfeld, Tom Mitchell and for Fully Automatic Analysis of Online Discussions.” Andrew Y. Ng. 1998. “Improving Text Classification by International Journal of Computer-Supported Collaborative Shrinkage in a Hierarchy of Classes.” In Machine Learning: Learning 7: 285-305. Proceedings of the Fifteenth International Conference (ICML Maas, Dieter, Rita Nuebel, Catherine Pease, and Paul ‘98), Madison, Wisconsin, July 24-27, 1998, ed. Jude W. Schmidt. 2002. “Bilingual Indexing for Information Re- Shavlik. San Francisco, CA: Morgan Kaufmann, 359-67. trieval with AUTINDEX.” In LREC 2002: Third Interna- Medelyan, Olena and Ian H. Witten. 2008. Domain-Inde- tional Conference on Language Resources and Evaluation, 29th, pendent Automatic Keyphrase Indexing with Small 30th & 31st May, Las Palmas de Gran Canaria (Spain); pro- Training Sets. Journal of the American Society for Information ceedings, ed. 
Manuel González Rodríguez and Carmen Paz Science and Technology 59: 1026-40. Suárez Araujo. Paris: European Language Resources As- Meng, Jiana, Hongfei Lin and Yuhai Yu. “A Two-stage Fea- sociation, 1136-49. ture Selection Method for Text Categorization.” Com- National Library of Medicine. 2016. "NLM Medical Text puters and Mathematics with Applications 62, no. 7: 2793- Indexer (MTI)." https://ii.nlm.nih.gov/MTI/ 800. OCLC Research. 2004. Scorpion. http://www.oclc.org/re- Miao, Duoqian, Qiguo Duan, Hongyun Zhang and Na search/software/scorpion/default.htm Jiao. 2009. “Rough Set Based Hybrid Algorithm for Page, Larry, Sergey Brin, Rajeev Motwani, and Terry Text Classification.” Expert Systems with Applications 36: Winograd. 1998. The Pagerank Citation Ranking: Bringing 9168-74. Order to the Web. http://ilpubs.stanford.edu:8090/422/ Mikolov, Tomas, Kai Chen, Greg Corrado and Jeffrey Dean. 1/1999-66.pdf 2013. “Efficient Estimation of Word Representations in Paukkeria, Mari-Sanna, Alberto Pérez García-Plazab, Vector Space.” The Computing Research Repository (CoRR), Víctor Fresnob, Raquel Martínez Unanueb, and Timo- January 2013. https://arxiv.org/abs/1301.3781 Honkela. 2012. “Learning a Taxonomy from a Set of Text Documents.” Applied Soft Computing 12: 1138-48. 120 Knowl. Org. 46(2019)No.2 K. Golub. Automatic Subject Indexing of Text

Perry, James W., Allen Kent, and Madeline M. Berry. 1955. Salton, Gerard. 1991. “Developments in Automatic Text “Machine Literature Searching X: Machine Language; Retrieval.” Science 253: 974-9. Factors Underlying its Design and Development.” Sebastiani, Fabrizio. 2002. “Machine Learning in Auto- American Documentation 6: 242. mated Text Categorization.” ACM Computing Surveys 34, Plaunt, Christian and Barbara A. Norgard. 1998. “An As- no. 1: 1-47. sociation‐based Method for Automatic Indexing with a Silvester, June P. 1997. “Computer Supported Indexing: A Controlled Vocabulary.” Journal of the American Society for History and Evaluation of NASA’s MAI System.” In Information Science 49: 888-902. Encyclopedia of Library and Information Science, ed. Allen Pratt, Wanda. 1997. “Dynamic Organization of Search Re- Kent. New York: Dekker, 61: 76-90. sults Using the UMLS.” In The Emergence of 'Internetable' Smiraglia, Richard P. and Xin Cai. 2017. “Tracking the Health Care: Systems that Really Work; 1997 AMIA Annual Evolution of Clustering, Machine Learning, Automatic Fall Symposium, formerly SCAMC; A Conference of the Indexing and Automatic Classification in Knowledge American Medical Informatics Association, October 25-29, Organization.” Knowledge Organization 44: 215-33. 1997, Opryland Hotel, Nashville, TN.; Proceedings, ed. Dan- Song, Wei, Jucheng Yang, Chenghua Li, and Sooncheol iel R. Masys. Philadelphia: Hanley & Belfus: 480-4. Park. 2011. “Intelligent Information Retrieval System Rasmussen Neal, D., ed. 2012. Indexing and Retrieval of Non- Using Automatic Thesaurus Construction.” International text Information. Berlin: De Gruyter Saur. Journal of General Systems 40: 395-415. Roelleke, Thomas. 2013. Information Retrieval Models: Foun- Sousa, Renato Tarciso Barbosa de. 2014. “A representação dations and Relationships. San Rafael, CA: Morgan & Clay- da informação: classificação e indexação automática de pool. 
documentos de arquivo [The Representation of Infor- Roitblat, Herbert L., Anne Kershaw, and Patrick Oot. mation: Automatic Classification and Indexing of Ar- 2010. “Document Categorization in Legal Electronic chives Records].” In XV Encontro Nacional de Pesquisa em Discovery: Computer Classification vs. Manual Re- Ciência da Informação: além das nuvens, expandindo as fron- view.” Journal of the American Society for Information Science teiras da Ciência da Informação, 27-31 de outubro em Belo and Technology 61: 70-80. Horizonte, MG, ed. Isa M. Freire, Lilian M. A. R. Álvares, Ruiz, Miguel E. and Padmini Srinivasan. 1999. “Hierarchical Renata M. A. Baracho, and Maurício B. Almeida. Belo Neural Networks for Text Categorization.” In Proceedings Horizonte: ECI; UFMG, 798-811. http://enancib2014. of SIGIR '99: 22nd International Conference on Research and eci.ufmg.br/documentos/anais/anais-gt2 Development in Information Retrieval, University of California, Souza, Renato Rocha and Koti S. Raghavan. 2014. “Extrac- Berkeley, August 1999; ed. Marti Hearst, Fredric C. Gey, tion of Keywords from Texts: An Exploratory Study us- and Richard Tong. Association for Computing Machin- ing Noun Phrases.” Informação & Tecnologia (ITEC) 1, no. ery. New York, N.Y.: Association for Computing Ma- 1: 5-16. chinery, 281-2. Sparck Jones, Karen. 1972. “A Statistical Interpretation of Ruiz, Miguel E., Alan R. Aronson and Marjorie Hlava. 2008. Term Specificity and Its Application in Retrieval.” Jour- “Adoption and Evaluation Issues of Automatic and nal of Documentation 28: 11-21. Computer Aided Indexing Systems.” In ASIST 2008: Stanfill, Mary H., Margaret Williams, Susan H. Fenton, Proceedings of the 71st ASIS&T Annual Meeting: People Robert A. Jenders, and William R. Hersh. 2010. “A Sys- Transforming Information - Information Transforming People, ed. tematic Literature Review of Automated Clinical Cod- Andrew Grove. 
Proceedings of the ASIS & T Annual ing and Classification Systems.” Journal of the American Meeting 45. Silver Spring, MD: American Society for In- Medical Informatics Association 17: 646-51. formation Science and Technology. doi:10.1002/meet. Stevens, Mary E. 1965. Automatic Indexing: A State of the Art 2008.1450450143 Report. National Bureau of Standards Monograph 91. Saarikoski, Jyri, Jorma Laurikkala, Kalervo Järvelin, and Washington, D.C.: U.S. Government Printing Office. Martti Juhola. 2011. “Self-organising Maps in Docu- Subramanian, Srividhya and Keith E. Shafer. 1998. “Clus- ment Classification: A Comparison with Six Machine tering.” https://web.archive.org/web/2004051408033 Learning Methods.” In Adaptive and Natural Computing 1/http://digitalarchive.oclc.org/da/ViewObject.jsp? Algorithms: 10th International Conference, ICANNGA 2011, objid=0000003409 Ljubljana, Slovenia, April 14-16, 2011; Proceedings, ed. Da- Svarre, Tanja and Marianne Lykke. 2014. “The Role of Au- vid Hutchison, Andrej Dobnikar, Uros ̌ Lotric,̌ and tomated Categorization in E-government Information Branko Ster.̌ Lecture Notes in Computer Science 6593. Retrieval.” Knowledge Organization 41: 76-84. Berlin: Springer: 260-9. Svenonius, E. 2000. The Intellectual Foundations of Information Salton, Gerard and Michael McGill. 1983. Introduction to Organization. Cambridge, MA.: MIT Press. Modern Information Retrieval. Auckland: McGraw-Hill. Knowl. Org. 46(2019)No.2 121 K. Golub. Automatic Subject Indexing of Text

Waltinger, Ulli, Alexander Mehler, Mathias Lösch, and Semantic Digital Archives; Proceedings of the 2nd International Wolfram Horstmann. 2011. “Hierarchical Classification Workshop on Semantic Digital Archives, Paphos, Cyprus, Sep- of OAI Metadata Using the DDC Taxonomy.” In Digi- tember 27, 2012; ed. Annett Mitschick, Fernando Loiz- tal Libraries: Achievements, Challenges and Opportunities: 9th ides, Livia Predoiu, Andreas Nürnberger. and Seamus International Conference on Asian Digital Libraries, ICADL Ross. http://ceur-ws.org/Vol-912/paper3.pdf 2006, Kyoto, Japan, November 27-30, 2006; proceedings, ed. Weisser, Martin. 2015. Practical Corpus Linguistics: An Intro- Shigeo Sugimoto. Lecture Notes in Computer Science duction to Corpus-Based Language Analysis. Hoboken, NJ: 6699. Berlin: Springer, 9-40. Wiley. Wan, Chin Heng, Lam Hong Lee, Rajprasad Rajkumar, Willis, Craig and Robert M. Losee. 2013. “A Random Walk and Dino Isa. 2012. “A Hybrid Text Classification Ap- on an Ontology: Using Thesaurus Structure for Auto- proach with Low Dependency on Parameter by Inte- matic Subject Indexing.” Journal of the American Society for grating K-nearest Neighbor and Support Vector Ma- Information Science and Technology 64: 1330-44. chine.” Expert Systems with Applications 39: 11880-8. Witten, Ian H. and Eibe Frank. 2000. Data Mining: Practical Wang, Jun. 2009. “An Extensive Study on Automated Machine Learning Tools and Techniques with JAVA Implemen- Dewey Decimal Classification.” Journal of the American tations. San Diego, CA: Academic Press. Society for Information Science and Technology 60: 2269-86. Yang, Yiming. 1999. “An Evaluation of Statistical Ap- Wartena, Christian and Maike Sommer. 2012. “Automatic proaches to Text Categorization.” Journal of Information Classification of Scientific Records Using the German Retrieval 1, nos. 1/2: 67-88. Subject Heading Authority File (SWD).” In SDA 2012:

122 Knowl. Org. 46(2019)No.2 M. L. Zeng. Interoperability

Interoperability†

Marcia Lei Zeng

Kent State University, School of Information, PO Box 5190, Kent, OH 44242-0001, USA

Marcia Lei Zeng is Professor of information science at Kent State University. She holds a PhD from the School of Computing and Information at University of Pittsburgh (USA) and an MA from Wuhan University (China). Her major research interests include knowledge organization systems (taxonomy, thesaurus, ontology, etc.), linked data, metadata and markup languages, smart data and big data, database quality control, semantic technologies, and digital humanities.

Zeng, Marcia Lei. 2019. “Interoperability.” Knowledge Organization 46(2): 122-146. 70 references. DOI:10.5771/0943-7444-2019-2-122.

Abstract: Interoperability refers to the ability of two or more systems or components to exchange information and to use the information that has been exchanged. This article presents the major viewpoints of interoperability, with the focus on semantic interoperability. It discusses the approaches to achieving interoperability as demonstrated in standards and best practices, projects, and products in the broad domain of knowledge organization.

Received: 22 August 2018; Revised: 21 January 2019; Accepted: 22 January 2019

Keywords: controlled vocabularies, data, interoperability, knowledge organization systems (KOSs), inter-concept mapping

† The author would like to thank the two anonymous referees and Julaine Clunis for providing their valuable feedback.

1.0 Introduction

“Interoperability” had been a topic discussed in information processing and exchange communities long before the arrival of the internet, yet it has never been so critical or of such great concern among so many communities as it is in today’s digital information environment. The digital age has encouraged the emergence of many knowledge organization systems (KOS) and new KOS types. It has also brought a demand for interoperability to underpin activities along with emerging technologies, such as web services, the publishing, aggregation, and exchange of KOS data via multiple media and formats, and behind-the-scenes exploitation of controlled vocabularies in navigation, filtering, and expansion of searches across networked repositories (Clarke and Zeng 2012). On a much broader landscape, systems that provide or support data and information management have been created everywhere. They are built based on the prevailing needs of a domain, organization, or application, embedding different contexts, purposes, and scope decisions by different institutional sponsors. Integration has become a way of life for many organizations, and interoperation of systems across departments and organizations has become essential (Ontolog 2018).

Fundamentally, the ability to exchange services and data with and among components of distributed systems or silos is contingent on agreements between requesters and providers who need to have a common understanding of the meanings of the requested services and data (Heiler 1995). A receiver of information needs to be able to interpret or understand the contents in a manner relatively consistent with the sender’s intended interpretation or meaning to meet (common) operational objectives (i.e., the context for and of the information) (Fritzsche et al. 2017). Such cooperative agreements are sought after at three levels:

– Technical agreements cover, among other things: formats, protocols, and security systems so that messages can be exchanged.
– Content agreements cover data and metadata and include semantic agreements on the interpretation of information.
– Organizational agreements cover group rules for access, preservation of collections and services, payment, authentication, and so on (Arms et al. 2002).

In the digital information environment, interoperability between systems remains a ubiquitous need and expectation, not only for professions dealing with information resources, but also businesses, organizations, research groups, and individuals who seek to create optimal experiences, minimize operational overhead, reduce costs, and drive future innovations utilizing new technologies and resources (Fritzsche et al. 2017).

Knowledge organization systems and services have been the key for understanding and bridging these contextual differences. Taking cases of information retrieval across different systems, the expression in question may be either a search query or part of the metadata associated with a document. In both cases, inter-concept mapping is the fundamental step. An expression formulated using one KOS vocabulary would need to be converted to (or supplemented by) a corresponding expression in one or more other vocabularies (ISO 25964-1:2011). “Vocabularies can support interoperability by including mappings to other vocabularies, by presenting data in standard formats and by using systems that support common computer protocols” (ISO 25964-2:2013 Section 3. Terms and definitions). The need can be seen from any of these situations (NISO Z39.19-2005 Appendix A 10.1):

– Metasearching of multiple content resources using the searcher’s preferred query vocabulary;
– Indexing of content in a domain using the controlled vocabulary from another domain;
– Merging of two or more databases that have been indexed using different controlled vocabularies;
– Merging of two or more controlled vocabularies to form a new controlled vocabulary that will encompass all the concepts and terms contained in the originals; and
– Multiple language searching, indexing, and retrieval.

Meanwhile, existing KOS vocabularies differ with regard to structure, domain, language, or granularity. The incompatibilities that occur at structural, conceptual, and terminological levels of KOS directly impact multiple resource searching (Iyer and Giguere 1995). KOS interoperability forms an essence for overall interoperability research and practices, as determined by the ISO 25964 Thesauri and Interoperability with other Vocabularies and other standards prior to it.

This article presents the major viewpoints of interoperability, with the focus on semantic interoperability. After presenting the definitions and introducing standards, approaches to achieving interoperability as demonstrated in standards and best practices, projects, and products in the broad domain of knowledge organization are discussed. Figures and tables are created and used to help the interpretation of the major interoperability approaches. Additional examples are used with sources provided.

2.0 Definitions

ISO 25964 Thesauri and Interoperability with other Vocabularies defines interoperability as the “ability of two or more systems or components to exchange information and to use the information that has been exchanged” (ISO 25964-2:2013). Addressing the involved components and results, NISO (2004) states that “interoperability is the ability of multiple systems with different hardware and software platforms, data structures, and interfaces to exchange data with minimal loss of content and functionality.” Other definitions note “use” in addition to “exchange”; thus, interoperability is considered as the ability of two or more systems or components to exchange information and data and use the exchanged information and data without special effort by either system or without any special manipulation (CC:DA 2000; Taylor 2004).

Witnessing the web and distributed computing infrastructures gaining in popularity as a means of communication, Sheth (1998) brought attention to the changing focus on interoperability in information systems since the mid-1980s: from system, syntax, and structure to semantics. Ouksel and Sheth (1999) further laid out four types of heterogeneity corresponding to four types of potential interoperability issues:

– System: incompatibilities between hardware and operating systems.
– Syntactic: differences in encodings and representation.
– Structural: variances in data models, data structures, and schemas.
– Semantic: inconsistencies in terminology and meanings.

In the KO-related domains, interoperability panoramas are normally highlighted on three of these four types: syntactic, structural, and semantic (Moen 2001; Obrst 2003). In the book The Organization of Information, Joudrey and Taylor (2017, 189) expressed that “without interoperability on all three levels, metadata cannot be shared effortlessly, efficiently, or profitably.” Without syntactic interoperability, data and information cannot be handled properly with regard to formats, encodings, properties, values, and data types; and therefore, they can neither be merged nor exchanged. Without semantic interoperability, the meaning of the language, terminology, and metadata values used cannot be negotiated or correctly understood (Koch 2006). Varying degrees of semantic expressivity can be matched with different types: low at syntactic interoperability, medium at structural interoperability, and high or very high at semantic interoperability (Obrst 2003).

With semantic interoperability, the expanded notion of data includes semantics and context, thereby transforming data into information. This transition both broadens and deepens the foundation for all other integration approaches, blending semantic interoperability within various levels of interoperability: data, process, services/interface, applications, taxonomy, policies and rules, and social networks (Pollock and Hodgson 2004). Based on the characterizations specified by previous researchers, semantic interoperability can be defined as the ability of different agents, services, and applications to communicate (in the form of transfer, exchange, transformation, mediation, migration, integration, etc.) data, information, and knowledge, while ensuring accuracy and preserving the meaning of that same data, information, and knowledge (Zeng and Chan 2010/2015).

3.0 Standards and recommendations

Standards and best practice recommendations have been developed globally. This section presents selected standards and recommendations that address interoperability issues, corresponding to the four layers that have been highlighted by researchers and communities (Figure 1) (refer to “Section 2 Definitions” above; Ouksel and Sheth 1999; Adebesin et al. 2013; Obrst 2003; Ontolog 2018).

Figure 1. Standards and recommendations addressing interoperability issues.

3.1 System layer

Starting from the base, interoperability at the system layer addresses issues on incompatibilities between hardware and operating systems for the technical exchange of data through networks, computers, applications, and web services. A few examples of recommendations developed in the 2010s and that are closely connected to information providers are introduced here:

– The IIIF (International Image Interoperability Framework) Consortium has defined a set of common API (application programming interface) specifications that support interoperability between image repositories. The four APIs developed in the 2010s are: 1) Image API; 2) Presentation API; 3) Authentication API; and, 4) Search API (http://iiif.io/technical-details/). Major image viewer and image server software supports the IIIF APIs.
– The Research Data Alliance (RDA) is a community-driven international organization established in 2013. RDA’s Research Data Repository Interoperability Working Group released its “Research Data Repository Interoperability WG Final Recommendations” recently (review period 26 June, 2018 to 26 July, 2018; DOI:10.15497/RDA00025). The major components are: 1) a general exchange format based on the well-known specification of BagIt (a hierarchical file packaging format for storage and transfer of arbitrary digital content) and complemented with BagIt Profiles; and, 2) a specification defining how to describe the internal structure of BagIt-based packages.
– “Data on the Web Best Practices” (W3C Recommendation 31 January 2017) presents the best practices related to the publication and usage of data on the web. “Interoperability” is one of the benchmarks of benefits that data publishers will gain. The recommendation aligns interoperability with eight best practices (summarized at Section 11 in this W3C Recommendation). It also emphasizes that, to promote the interoperability among datasets, it is important to adopt data vocabularies and standards (https://www.w3.org/TR/dwbp/).

3.2 Syntactic layer

Syntactic issues that directly impact any interoperability effort are the differences in encoding, decoding, and representation of data. The most important data language standards that enable the exchange of data through common data formats are the W3C (World Wide Web Consortium) official recommendations developed for the semantic web.

The widely applied W3C standards in KOS vocabularies and data exchanges in the semantic web include:

– Resource Description Framework (RDF), a standard model for data interchange on the web (https://www.w3.org/RDF/).
– RDF Schema (RDFS), a semantic extension of RDF. RDFS provides mechanisms for describing groups of related resources and the relationships between these resources (https://www.w3.org/TR/rdf-schema/).
– Web Ontology Language (OWL), a semantic web language designed to represent rich and complex knowledge about things, groups of things, and relations between things. OWL documents are known as ontologies (https://www.w3.org/OWL/).
– Simple Knowledge Organization System (SKOS), a common data model for sharing and linking KOS via the semantic web. SKOS became a W3C recommendation in 2009. Its development was based on the ISO guidelines for developing monolingual thesauri [ISO 2788 (1974 and 1986)] and multilingual thesauri [ISO 5964 (1985)]. In developing SKOS, “Correspondences between ISO 2788/5964 and SKOS constructs” were developed (https://www.w3.org/2004/02/skos/).
– SKOS eXtension for Labels (SKOS-XL), released in 2009, defines an extension for SKOS, providing additional support for describing and linking lexical entities (https://www.w3.org/TR/skos-reference/skos-xl.html).
– After the publishing of ISO 25964 Thesauri and Interoperability with Other Vocabularies Part 1 in 2011, an ISO 25964 SKOS extension (iso-thes) was initiated. A correspondence table between ISO 25964 (which replaces ISO 2788 and 5964) and SKOS/SKOS-XL models was generated. In addition, a set of extensions (e.g., a class of “iso-thes:CompoundEquivalence,” a number of sub-classes of “skos-xl:Label,” and properties for provenance and management use) were proposed (Isaac 2013; ISO 25964) (https://www.niso.org/schemas/iso25964).

3.3 Structural layer

Information architectural variances in data frameworks, data models, data structures, and schemas add another layer of interoperability challenges. In the efforts to enable the exchange of data through pre-defined structures, conceptual models have been established by LAM (library, archive and museum) communities in the digital age. Conceptual models are independent of any particular encoding syntax and application systems. These are best seen from community standards developed for creating structured data and providing access to information resources in various LAM communities.

– IFLA LRM (Library Reference Model), a model formally adopted by IFLA Professional Committee in Aug. 2017 that consolidated three IFLA FRBR family models and covers all aspects of bibliographic data (refer to Žumer 2018).
– DCMI Abstract Model, a Dublin Core Metadata Initiative (DCMI) Recommendation (2007) specifying the components and constructs used in Dublin Core metadata (Powell et al. 2007).
– BibFrame (Bibliographic Framework), a new model (version 2.0, 2016) initiated by the Library of Congress for describing bibliographic data.
– CIDOC-CRM, the Conceptual Reference Model (CRM) produced by the International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM) for describing the implicit and explicit concepts and relationships used in cultural heritage documentation (version 6.2.3, 2018).
– RiC-CM (Records in Context conceptual model), first draft released in September 2016 by the International Council on Archives’ Experts Group on Archival Description (refer to Bountouri 2017 Chapter 6).
– Many domain models and profiles have been developed under the umbrella of a conceptual model in order to ensure consistency and comprehension as well as interoperability across domains of LAMs and go beyond the restricted silos. A significant effort is a streamlined profile of CIDOC-CRM, named Linked Art Profile of CIDOC-CRM.

3.4 Semantic layer

Semantic interoperability/integration is basically driven by the communication of coherent purpose. In the practice of integration and achieving interoperability, multiple contexts (including but not limited to time, spatial frame, trust, and terminology) have to be addressed. Their similarities, differences, and relationships must be understood. In general, a context is commonly understood to be the circumstances that form the setting for an event, statement, process, or idea, and in terms of which the event, statement, process, or idea can be understood and assessed. While diverse and complicated in defining what “contexts” means, there could be guidelines to help make explicit some aspects of a context when using a particular development methodology (Ontolog 2018).

ISO 25964 Thesauri and Interoperability with other Vocabularies contains two parts; Part 2 (ISO 25964-2:2013) is devoted to the interoperability of thesauri with other types of KOS (Table 1). The principles and practice of mapping are its prime focus. The scope includes interoperability of thesauri with classification schemes, taxonomies, subject heading schemes, ontologies, terminologies, name authority lists and synonym rings. The specification covers the details of the features and functions of thesauri and other common types of KOS, which then lead to the best practice guides of mapping between thesauri and other types of KOS vocabularies.

global consistency. Achieving a balance between ensuring semantic interoperability and addressing particular (e.g., local) information needs is a reality. Confronting the global consistency vs. a multiplicity of modules, John Sowa (2006) explained Wittgenstein’s theory of “language games,” which allow words to have different senses in different contexts, applications, or modes of use. Meanwhile, Sowa indicated newer developments in lexical semantics based on the recognition that words have an open-ended number of dynamically changing and context-dependent “microsenses.” Thus, a lattice of theories would be able to accommodate both: supporting modularity by permitting the development of independent modules, while including all possible generalizations and combinations. The resulting flexibility enables natural languages to adapt to any possible subject from any perspective for any humanly conceivable purpose (Sowa 2006). Comparably, Svenonius (2004) looked at the epistemological foundations of knowledge representations embodied in retrieval languages and considered questions such as the validity of knowledge representations and their effectiveness for the purposes of retrieval and automation. The representations
In the standard, semantic compo- of knowledge she considered were derived from three the- nents and relationships of each of these types are com- ories of meaning, all have dominated twentieth century pared with thesaurus components (refer to ISO 25964-2, philosophy: operationalism, the referential or picture the- Section 17 to 24). ory of meaning, and the contextual or instrumental theory Having standards and best practice recommendations of meaning. Her conclusion is that, in the design of a re- does not imply that every KOS would be created with a trieval language, a trade-off exists between the degree to

Table 1. Coverage of ISO 25964-2’s recommendations for interoperability between thesauri and other types of KOS. Knowl. Org. 46(2019)No.2 127 M. L. Zeng. Interoperability which the language is to be formalized and the degree to 4.1 Derivation which it is to be reflective of language use (Svenonius 2004). Focusing on thesauri’s structure and functional de- 4.1.1 Derived vocabularies signs, Mazzocchi (2016) provided a comprehensive analy- sis of different theoretical approaches to meaning, as pre- A new vocabulary may be derived from an existing vocab- sented by important scholars, ranging from Wittgenstein ulary, which is seen as a “source” or “model” vocabulary. to Svenonius, Sowa, Hjørland, Soergel, and others. The This ensures a similar basic structure and context while al- different perspectives on the nature and representation of lowing different components to vary in both depth and meaning could lead to different ways of designing the se- detail for the individual vocabularies. Specific derivation mantic structures of thesauri. methods may include adaptation, modification, expansion, The Ontology Summit 2016 Communiqué pointed out partial adaptation, and translation. In each case, the new that “both syntactic and semantic interoperability across vocabulary is dependent upon the source vocabulary (see systems and applications are necessary. In practice, how- Figure 2). ever, Semantic Interoperability (SI) is difficult to achieve” For example: (Fritzsche et al. 2017). There is a need to maximize the amount of semantics that can be utilized and to make it – Faceted Application of Subject Terminology (FAST) http:// increasingly explicit (Obrst 2003). Knowledge organiza- fast.oclc.org/, a joint vocabulary effort of OCLC and Li- tion systems have been recognized as prerequisites to en- brary of Congress, derives its contents from the Library hanced semantic interoperability (Patel et al. 2005). 
The of Congress Subject Headings (LCSH) and modifies the syn- approaches to be discussed in the following sections tax to enable a post-coordinate mechanism (Chan et al. mainly aim at the semantic interoperability level. 2001). – In addition to the multilingual Art and Architecture Thesau- 4.0 Interoperability approaches in KOS vocabulary rus (AAT) master version, multiple translated versions development are hosted at different countries (Refer to 藝術與 建築索引典 (http://aat.teldap.tw); The Dutch Art & Ar- Establishing and improving semantic interoperability in chitecture Thesaurus (AAT-Ned) (http://website.aat-ned. the whole information life cycle always requires the use of nl/home); AAT Deutsch (http://www.aat-deutsch.de/), KOSs (Tudhope and Binding 2004). Sometimes new vo- and Tesauro de Arte & Arquitectura (http://www.aatespanol. cabularies need to be created (or extracted) first. In other cl/taa/publico/buscar.htm)). Variations could exist in the cases, existing vocabularies need to be transformed, coverage, updates, candidate concepts and terms, as well mapped, or merged (Patel et al. 2005). KOS is a generic as semantic relationships. term used for referring to a wide range of types, including – The Dewey Decimal Classification (DDC) has been trans- classification schemes, gazetteers, lexical databases, taxon- lated into more than thirty languages and serves library omies, thesauri, ontologies and other types of schemes, all users in more than 135 countries worldwide. Some are designed to support the organization of knowledge and partial adaptations or partial translations (https://www. information in order to make their management and re- oclc.org/en/dewey/features/summaries.html). trieval easier (Mazzocchi 2018). Individual KOS instances are referred as “KOS vocabularies” (to differentiate from A variation might include the adaptation of an existing vo- “metadata vocabularies”) in this article. 
Using the termi- cabulary, with slight modifications to accommodate local nology of the linked open data (LOD) communities, KOS or specific needs (Figure 2). A derived vocabulary could vocabularies are used as “value vocabularies” (which are also become the source of a new vocabulary (as in the case distinguished from the “property vocabularies” like of some translated vocabularies). metadata element sets). This term refers to its usage in the RDF-based models where the “resource, property-type, 4.1.2 Microthesaurus property-value” triples benefit from a controlled list of al- lowed values for an element in structured data (Isaac et al. A designated subset of a thesaurus that is capable of func- 2011). tioning as a complete thesaurus is called a microthesaurus In the following sub-sections, KOS vocabulary devel- (ISO 25964-2:2013). It is different from the derived vocab- opment is the focus. The projects mentioned in this sec- ularies that are made through adaptation, modification, ex- tion are examples only (longer discussions can be found pansion, and translation, as discussed in the previous sec- from Zeng and Chan, 2004; 2010). The various approaches tion. Depending on the original design, the answer to the are not exclusive and, in fact, can be complementary. question, “Can a microthesaurus be made from an existing thesaurus?” could be: yes, maybe, no, or not directly. The 128 Knowl. Org. 46(2019)No.2 M. L. Zeng. Interoperability

Figure 2. Derivation of new vocabularies from a source vocabulary (Zeng and Chan 2010, 4651).

alphabetically organized vocabularies that were initially de- ered as microthesauri (https://www.canada.ca/en/her- signed with a flat structure, even having some broader and itage-information-network/services/collections-docu- narrower context, may not be easily used to generate mi- mentation-standards/chin-guide-museum-stand- crothesauri. In general, KOS vocabularies that are good ards/vocabulary-data-value.html). resources for generating microthesauri would have: a clas- – Since 2014, AAT’s linked data SPARQL endpoint sificatory structure (e.g., EuroVoc), a faceted structure (e.g., (http://vocab.getty.edu/) makes it possible for anyone AAT, FAST), or deep hierarchies (e.g., AAT, NASA The- to generate a microthesaurus dataset (e.g., “object gen- saurus, STW Thesaurus for Economics). res” or a smaller unit of “object genres by function”) For example: easily in just a few seconds, encompassing concept URIs, labels, and semantic relationships represented as – EuroVoc, the multilingual, multidisciplinary thesaurus linked data datasets and downloadable in multiple for- covering the activities of the EU, is split into twenty- mats. Such a function opens the door for any digital col- one domains and 127 microthesauri in its 4.4 (2015) lection that needs standardized controlled vocabularies version. A microthesaurus is considered by EuroVoc as in linked data format (instructions of how to obtain a concept scheme with a subset of the concepts that are data using SPARQL for creating a microthesaurus can part of the complete EuroVoc thesaurus (http://eu- be found in Zeng 2018). rovoc.europa.eu/100141). – The CHIN Guide to Museum Standards (2017) of the It should be acknowledged that, creating derived vocabu- Canadian Heritage Information Network comprises a laries and microthesauri have become widely used ap- section of “Vocabulary (Data Value Standards),” which proaches today as LOD gains its momentum. The demand lists dozens of recommended vocabularies. 
Individual can be seen from these scenarios. Many projects for digital vocabularies listed under various domains contain those humanities are able to create LOD datasets using existing from AAT facets (e.g., “objects;” “agents;” “styles and unstructured and structured data resources. The linking periods;” “materials;” and physical attributes”) and points are primarily the concepts and named entities, i.e., AAT hierarchies (e.g., “processes and techniques” hier- identifiable things including people, organizations, places, archy and “disciplines” hierarchy). These can be consid- events, objects, concepts, and virtually anything that can be Knowl. Org. 46(2019)No.2 129 M. L. Zeng. Interoperability represented in structured data, as demonstrated by the ex- “wetlands”) is in one thesaurus, and more specific subtop- amples from Binding and Tudhope (2016). In the RDF tri- ics of that concept exist in a specialized vocabulary or clas- ples (subject-predicate-object), these concepts and named sification system (e.g., “Wetlands Classification Scheme”), entities occupy the positions of “subject” and “object.” then the leaf node can refer to that specialized scheme Nevertheless, for a dataset to be qualified as LOD, identi- (Zeng and Chan 2004). On the other hand, a new vocabu- fied entities need to be named with URIs. Thus, KOSs that lary can be built on the basis of more than one existing have been released/published as LOD are popular re- vocabulary. A major task of the developers is to not be un- sources for the LOD dataset producers to use. Depending necessarily redundant. Rather, their primary role is to ex- on the situation, the usage of LOD KOSs might involve tend from the “nodes” and grow localized vocabulary multiple choices and steps. In general, making a new vo- “leaves” (see Figure 3). 
The leaf nodes approach can be cabulary through the derivation method or creating a mi- used in small vocabularies or very large domains, and the crothesaurus from a LOD KOS is common (Zeng and specialized portion can have different languages or nomens. Mayr 2018). By definition, nomen (defined as “an association between an entity and a designation that refers to it”) is the appellation 4.2 Expansion used to refer to an instance of res (defined as “any entity in the universe of discourse”) (Žumer 2018). To explain, a no- Even within a particular information community, there are men can be any sign or sequence of signs (alphanumeric different user requirements and distinctive local needs. The characters, symbols, sound, etc.) that a res is known by, re- details provided in a particular vocabulary may not meet ferred to, or addressed as (Zeng, Žumer and Salaba 2010). the needs of all user groups. Two practical approaches to For example: expansions are highlighted below (Figure 3): leaf nodes and satellite vocabularies. – DDC, when used by various countries, may have exten- sions in a non-English edition for certain classes in or- 4.2.1 Leaf nodes der to meet the local needs, as demonstrated in a figure in the article of Mitchell et al. (2014). Each subclass is In thesaurus and classification development, a method represented by edition-specific nomens. The nomens are known as leaf nodes has been used in which extended valid within the edition with the expanded hierarchy schemes for subtopics are presented as the nodes of a tree only (Figure 4). structure in an upper vocabulary. When a leaf node (e.g.,
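The leaf node linking described in this sub-section can be illustrated with a minimal sketch. It assumes a very simple in-memory tree model; the class name, the function name, the scheme key, and all concept labels other than “wetlands” and “Wetlands Classification Scheme” are hypothetical and are not taken from any actual vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    label: str
    narrower: List["Concept"] = field(default_factory=list)
    # Key of a specialized scheme that continues this leaf node, if any.
    extended_by: Optional[str] = None


def descendants(concept: Concept, schemes: Dict[str, Concept]) -> List[str]:
    """Return labels below `concept`, following leaf-node links into other schemes."""
    found: List[str] = []
    children = list(concept.narrower)
    if concept.extended_by is not None:
        # The leaf node hands over to the top level of the specialized scheme.
        children.extend(schemes[concept.extended_by].narrower)
    for child in children:
        found.append(child.label)
        found.extend(descendants(child, schemes))
    return found


# Hypothetical specialized scheme (subtopic labels invented for illustration).
wetlands_scheme = Concept("Wetlands Classification Scheme", [
    Concept("marshes"),
    Concept("swamps", [Concept("mangrove swamps")]),
])
schemes = {"wetlands-scheme": wetlands_scheme}

# Hypothetical upper vocabulary whose leaf node "wetlands" refers to that scheme.
upper = Concept("ecosystems", [
    Concept("forests"),
    Concept("wetlands", extended_by="wetlands-scheme"),
])

print(descendants(upper, schemes))
```

In this sketch the upper vocabulary stores only a reference to the specialized scheme, so the two can be maintained separately, which is one plausible reading of how a leaf node “refers to” an extended scheme.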

Figure 3. Leaf node linking and satellites (Zeng and Chan 2010, 4652 with updates).

Figure 4. Example of leaf nodes, as demonstrated by a non-English edition of DDC. (Mitchell, Zeng, and Žumer 2014, 95)

4.2.2 Satellite vocabularies

With careful collaboration and management, satellite vocabularies can be developed around a superstructure in order to meet the needs of managing specialized materials or areas. Satellites under a superstructure are usually developed deliberately as an integrated unit and require top-down collaboration for management.

For example:

– LCSH-based vocabularies include the Legislative Indexing Vocabulary (LIV), the Thesaurus for Graphic Materials (TGM), the Global Legal Information Network (GLIN) thesaurus, LC Medium of Performance Thesaurus for Music, LC Children’s Subject Headings, etc. A significant satellite vocabulary of LCSH is the Library of Congress Genre/Form Thesaurus for Library and Archival Materials (LCGFT), which assumed its title in June 2010 (http://id.loc.gov/).
– The Forum on Information Standards in Heritage (FISH) Thesauri of Historic England are composed of several separate online thesauri for monument types: “archaeological objects,” “building materials,” “defense of Britain,” “components,” “maritime place names,” “maritime craft types,” “maritime cargo,” “evidence thesaurus,” “archaeological sciences,” “event types,” “resource description thesaurus,” and “historic aircraft types.” These thesauri are displayed in an integrated space on the FISH Thesauri web site. Terms are grouped by classes rather than by broadest terms (top term) and are cross-linked within each thesaurus. The classes are not part of a normal thesaurus hierarchy structure (http://thesaurus.historicengland.org.uk/newuser.htm).

4.2.3 Open umbrella structure

An alternative approach that has products similar to satellite vocabularies moves from a different direction, aiming to plug different pieces into an existing open umbrella structure. The reason is that, in the example of ontology development, the upper level of an ontology (i.e., the more general concepts) is more fundamental for information integration. Automatic methods may be used for the semantic organization of lower-level terminology. The responsibility of ensuring interoperability is that of the developers who will create the plug-ins to coordinate under the umbrella.

Putting the ontologies of special interest aside, Patel et al. (2005) established a three-tier structure of upper, core, and domain ontologies:

1) Upper ontologies define basic, domain-independent concepts as well as relationships among them.
2) Core (or intermediate) ontologies are essentially the upper ontologies for broad application domains (e.g., the audiovisual domain). They may help in making real-world decisions for which upper ontologies may fall short for certain problem domains.
3) Domain ontologies, in which concepts and relationships used in specific application domains are defined (e.g., a “goal” in the soccer video domain). Patel et al. (2005) explain that the concepts defined in domain ontologies would correspond to the concepts and relationships established in both upper and core ontologies, which may be extended with the addition of domain knowledge (Figure 5).

The UK Digital Curation Centre’s Digital Curation Manual: Installment on “Ontologies” (Doerr 2008) recommends that the editors of KOSs first agree on a common upper-level ontology across disciplines in order to guarantee interoperability at the fundamental and functional levels. On the other hand, Tudhope and Binding (2008) advise that it is important to fully grasp the conditions and cost-benefit ratio of connecting an upper ontology and domain KOS: 1) the intended purpose (indexing and retrieval vs. automatic inferencing); 2) the alignment of the ontology and domain KOS; 3) the number of different KOS structures intended to be modeled; and, 4) the use cases to be supported.

Examples of upper ontologies:

– The Upper Cyc® Ontology was released in 1996 (http://goo.gl/3zhKfs) based on a giant knowledge base developed in the past two decades. It covers approximately 3,000 terms capturing the most general concepts of human consensus reality that satisfy two important criteria: universal and articulate (i.e., necessary and sufficient).
– Suggested Upper Merged Ontology (SUMO) (http://www.adampease.org/OP/), released in Dec. 2000 as an open source ontology, was focused on meta-level concepts (i.e., general entities that do not belong to a specific problem domain). It has been mapped to all of the WordNet lexicon and provides not only the largest open formal ontology but also multilingual language generation templates and navigation tools.
– Basic Formal Ontology (BFO) (http://ifomis.uni-saarland.de/bfo/), though considered a smaller genuine upper ontology, has been used by more than 250 ontology-driven endeavors throughout the world.

4.3 Integration/Combination

New KOS vocabularies with supporting services can be created with multiple resources combined in a new KOS while the original sources and definitions are maintained. Such an approach is bottom-up, as demonstrated by the following examples. The Unified Medical Language System® (UMLS) Metathesaurus has its scope determined by the combined scope of its source vocabularies. The TaxMeOn meta-vocabulary enables the management of heterogeneous biological name collections and is not tied to a single “authority” system (Figure 6).
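The bottom-up combination described above, in which source entries are unified under a new vocabulary while their original identifiers and names are retained, can be sketched as follows. This is only an illustrative model: the record layout, the “M0001”-style identifiers, the source names, and the sample entries are all assumptions made here, and none of them reproduces the identifier scheme or data of UMLS or TaxMeOn.

```python
from typing import Dict, List, NamedTuple


class SourceEntry(NamedTuple):
    source: str        # name of the source vocabulary
    source_id: str     # identifier used by the source, retained unchanged
    name: str          # concept name as given by the source
    concept_key: str   # editorial decision: which entries mean the same thing


def combine(entries: List[SourceEntry]) -> Dict[str, List[SourceEntry]]:
    """Group source entries under newly assigned meta-level concept identifiers."""
    key_to_meta_id: Dict[str, str] = {}
    meta: Dict[str, List[SourceEntry]] = {}
    for entry in entries:
        if entry.concept_key not in key_to_meta_id:
            # Assign the next meta-level identifier in order of first appearance.
            key_to_meta_id[entry.concept_key] = f"M{len(key_to_meta_id) + 1:04d}"
        meta.setdefault(key_to_meta_id[entry.concept_key], []).append(entry)
    return meta


# Invented entries from two hypothetical source vocabularies.
entries = [
    SourceEntry("Source A", "A-17", "heart attack", "myocardial-infarction"),
    SourceEntry("Source B", "B-203", "myocardial infarction", "myocardial-infarction"),
    SourceEntry("Source B", "B-511", "aspirin", "aspirin"),
]

meta = combine(entries)
for meta_id, members in meta.items():
    print(meta_id, [(m.source, m.source_id, m.name) for m in members])
```

Because every concept in the combined vocabulary comes from at least one source entry, the scope of the result is exactly the combined scope of the sources, and each source name and identifier remains recoverable.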

Figure 5. Intermediate and domain vocabularies plugged-in under an open umbrella structure (Zeng and Chan 2010, 4653).

Figure 6. Multiple source vocabularies lead to a meta-vocabulary.

4.3.1 Metathesaurus

A name used by the UMLS, a metathesaurus represents a kind of interoperability approach in which the scope of the KOS vocabulary and system is determined by the combined scope of its source vocabularies. An important step is to assign several types of unique and permanent identifiers to the concepts, concept names, and relationships between concepts, so that the meanings, concept names, and relationships from each vocabulary are preserved while being unified in the metathesaurus.

For example:

Metathesaurus of UMLS, started in 1986 at the U.S. National Library of Medicine (NLM), is one of the three major UMLS products: Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon and Lexical Tools. Metathesaurus is a large biomedical thesaurus with links to similar names for the same concept from more than 100 different KOS vocabularies in the world, including over 130 English and nineteen non-English KOS vocabularies as of June 2018. Many relationships (primarily for synonyms), concept attributes, and some concept names are added by the NLM during Metathesaurus creation and maintenance, but essentially all the concepts themselves come from one or more of the source vocabularies. Generally, if a concept does not appear in any of the source vocabularies, it will also not appear in the Metathesaurus (NLM 2009: section 2.1.1). It is a pioneer of using identifiers for the concepts and concept names it contains, in addition to retaining all identifiers that are present in the source vocabularies. It also identifies useful relationships between concepts and preserves the meanings, concept names, and relationships from each vocabulary. (Refer to the webpage https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/ and the UMLS Reference Manual on Metathesaurus for details from Chapter 2, National Library of Medicine 2009).

4.3.2 Heterogeneous meta-vocabulary

Similar to the metathesaurus discussed above, sometimes the situation involves creating a heterogeneous meta-vocabulary that supports the representation of changes and differing opinions of certain concepts. Taking biology as an example, the positions of species and the nomenclature in scientific taxonomies involve a lot of changes, which directly impact the access to the publications and data associated with them in different time periods.

For example:

TaxMeOn is a heterogeneous meta-vocabulary for biological names built by Tuominen, Laurenne, and Hyvönen (2011). The datasets utilized in the study consist of twenty published species checklists that cover mainly northern European mammals, birds, and several groups of insects, resulting in about 78,000 taxon names. The TaxMeOn ontology schema contains twelve ontological classes with forty-nine subclasses. The representation of the dataset encompasses these contents: 1) the different conceptions of a taxon; 2) the temporal order of the changes; and, 3) the references to scientific publications whose results justify these changes (refer to the TaxMeOn site for an example: http://onki.fi/onkiskos/cerambycids/).

The direct application of the taxon meta-ontology model that allows multilingual, different opinions for the biological taxonomy concept and nomenclature in a unified view can be beneficial to the researchers of biology. The detailed data can be further linked to other datasets with less taxonomic information, such as species checklists, and provide users with more precise information. The data model enables managing heterogeneous biological name collections and is not tied to a single database system (Tuominen et al. 2011). The modeling method and the model itself can be extended in a flexible way and integrated with other data sources. This design and product is another pioneer in the KOS vocabulary and service development embracing interoperability.

4.4 Interoperation/shared/harmonization

The functions of a shared concept scheme or bridge scheme will be discussed in this section. While somewhat overlapping with the integration/combination approach presented above, activities discussed below would usually lead to a new scheme that is not constrained by the details and coverages of the sources. A final product may have its own structure and scope and will function as an interoperation facilitator. This section also discusses virtual harmonization through linking, another kind of practice that became widely adopted along with the growth of the semantic web. The effective implementations rely on interoperation with other target resources outside of the base vocabulary itself, where each of the target resources is controlled and maintained by the original provider.

4.4.1 Shared bridge scheme

Open data is a trend that has resulted in an incredible number of high-quality open datasets from government and international institutions in various domains. Yet, open data needs common semantics for linking diverse information. One of the strategies is to create a shared bridge concept scheme via integrating existing standard vocabularies used by the dataset providers in related fields or domains. As usual, the existing KOS involved would vary in structure, language, scope, and culture, and would be maintained by different institutions. It would not be realistic to select any as the “hub” or “source” vocabulary and map others to it, nor applicable to create a new “authority” vocabulary to unify them. As reported by Baker et al. (2016a), in searching across large databases in agriculture and environment, a shared concept scheme would improve the semantic reach of these databases by supporting queries that freely draw on terms from any mapped vocabulary and achieving economies of scale from joint maintenance. In the ontology community, bridge ontologies (vs. reference ontologies) are typically used to mediate between specific concepts of multiple ontologies. They capture the commonalities between various applications and local ontologies within the same domain (Fritzsche et al. 2017) (Figure 7).

For example:

The Global Agricultural Concept Scheme (GACS) project of Agrisemantics (a community network of semantic assets relevant to agriculture and food security) initiated the creation of a shared concept scheme by integrating existing standard vocabularies in agriculture and environment (Baker et al. 2016a). GACS functions as a multilingual KOS hub that includes interoperable concepts related to agriculture from several large KOS, including AGROVOC of the Food and Agriculture Organization of the UN, the CAB Thesaurus by CAB International of UK, and the U.S. National Agricultural Library (NAL) Thesaurus, all maintained independently. The latest GACS beta version provides mappings for 15,000 concepts and over 350,000 terms in twenty-eight languages (Baker et al. 2016a). The processes’ unique points are: 1) the mappings focused on three sets of frequently used concepts (10,000) from each of the three partners (which are only a portion of an original vocabulary); 2) mappings were automatically extracted and then manually evaluated and corrected; 3) a classification scheme that was developed jointly in the 1990s was revised to tag concepts by thematic group (chemical, geographical, organisms, products, or topics); and, 4) alongside generic thesaurus relations to broader, narrower, and related concepts, organisms will be related to relevant products (Baker et al. 2016b).
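The benefit reported for a shared concept scheme, namely queries that draw freely on terms from any mapped vocabulary, can be sketched with a small lookup table. The hub identifiers and the term assignments below are invented for illustration and are not actual GACS, AGROVOC, CAB Thesaurus, or NAL Thesaurus data; the function name is likewise hypothetical.

```python
from typing import Dict, Set, Tuple

# (source vocabulary, term) -> shared hub concept identifier.
# Identifiers and term assignments are invented for illustration.
MAPPINGS: Dict[Tuple[str, str], str] = {
    ("AGROVOC", "maize"): "hub:0001",
    ("CAB Thesaurus", "maize"): "hub:0001",
    ("NAL Thesaurus", "corn"): "hub:0001",
    ("AGROVOC", "rice"): "hub:0002",
    ("NAL Thesaurus", "rice"): "hub:0002",
}


def expand_query(term: str) -> Set[str]:
    """Return every term mapped to the same hub concept(s) as `term`."""
    hubs = {hub for (_, t), hub in MAPPINGS.items() if t == term}
    return {t for (_, t), hub in MAPPINGS.items() if hub in hubs}


print(sorted(expand_query("corn")))
```

A searcher who enters a term from one vocabulary thereby also reaches records indexed with the equivalent terms of the other vocabularies, without any of the sources having to serve as the single “hub” or “authority” vocabulary.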

Figure 7. Shared/bridge scheme built on the selected sets of source vocabularies.

4.4.2 Reference ontologies

“Reference ontologies” is a term used in the formal ontology community. These ontologies are intended to be reused and are not rigidly tied to an application’s specific use cases and requirements (Fritzsche et al. 2017), which differentiates them from the shared bridge schemes discussed in the previous sub-section. As explained in the Ontology Summit 2016 Communiqué (Fritzsche et al. 2017), reference ontologies reflect the base-level knowledge of a broad domain or the semantic consensus of an industry sector. By design, they are created to facilitate integration across systems, repositories, and data sources. Rather than serving as an upper ontology that helps mediate between other ontologies, a reference ontology serves as a means for mapping the terminology of multiple information systems and data to a common set of shared concepts. Properly conceived, a collection of reference ontologies can be viewed as orthogonal (non-overlapping), interoperable resources.

For example:

– The Foundational Model of Anatomy (FMA) is an ontology that represents the structure of the human body and is one of the largest computer-based knowledge sources in the biomedical sciences. It is among the information resources integrated in the distributed framework of the anatomy information system developed and maintained by the Structural Informatics Group at the University of Washington. Anatomy is considered fundamental to all biomedical sciences. Comprising roughly 75,000 classes, 120,000 terms, and 168 relationship types, FMA represents a coherent body of explicit declarative knowledge about human anatomy. Its ontological framework can be applied and extended to all other species. The computer-based knowledge source distinguishes itself from other traditional sources of anatomical information, such as atlases, textbooks, dictionaries, thesauri or term lists (http://sig.biostr.washington.edu/projects/fm/AboutFM.html).
– Financial Industry Business Ontology™ (FIBO) is an ontology created by the Enterprise Data Management Council and the Object Management Group (OMG). FIBO® specifies the definitions, synonyms, structure, and contractual obligations of financial instruments, legal entities, and financial processes. It is an industry initiative to define financial industry terms, definitions and synonyms using semantic web principles, aiming to contribute to transparency in the global financial system to aid industry firms (https://www.omg.org/hot-topics/finance.htm).

4.4.3 Virtual harmonization through linking

The semantic web encourages the sharing and reuse of data, including the components of KOS vocabularies, such as a concept in the definition, a parallel appellation/nomen, a visual representation, etc. Around the world, activities of virtual harmonization through linking as well as generating multilingual labels by using SKOS-XL have proven to be successful.

For example:

Thesaurus of Plant Characteristics (TOP) (http://www.top-thesaurus.org/) is committed to the harmonization and formalization of concepts for plant characteristics widely used in ecology. An entry of TOP presents the definition with multiple sources of the concepts, coded with the URIs from the original namespaces (e.g., PO, PATO, EFO, and Mayr) (Figure 8.1).

Another example:

The Art & Architecture Thesaurus (AAT) Online includes links to representative images of a concept hosted by outside collaborators (Figure 8.2).

And finally, a third example:

Faceted Application of Subject Terminology (FAST) (http://experimental.worldcat.org/fast/) reported using “foaf:focus” to allow FAST’s controlled terms (representing instances of “skos:Concept”) to be connected to URIs that identify real-world entities specified at VIAF, GeoNames, and DBpedia. With the correct coding of properties, machines can understand (reason) that a FAST controlled term is related to a real-world entity, which allows humans to gather more information about the entity that is being described (O’Neill 2013; Mixter 2013) (Figure 8.3, which is based on O’Neill 2013 and Mixter 2013, with screenshots from FAST, 2019-01).

5.0 Mapping

Many KOS vocabularies have been independently developed or have already been applied to collections. Mapping, a process of establishing relationships between the concepts of one vocabulary and those of another, is a widely used approach to achieve the semantic interoperability of existing KOS vocabularies. The term “mapping” might be used to refer to a process of establishing relationships between the contents of one vocabulary and those of another, or to a product of the mapping process, e.g., a statement of the relationships between the terms, notations or concepts of one vocabulary and those of another.

Figure 8.1. An example from Thesaurus of Plant Characteristics showing the virtual harmonization of multiple sources of the concepts (http://www.top-thesaurus.org).

Figure 8.2. An example from AAT showing the links to representative images managed at collaborators’ sites (AAT ID: 300198841).
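The FAST linkage described in the third example above (and pictured in Figure 8.3) can be made concrete with a minimal sketch. It is only an illustration under stated assumptions: the triples are held in a plain Python list rather than an RDF library, every URI under example.org is a hypothetical placeholder rather than a real FAST, VIAF, or Wikipedia identifier, and attaching both properties directly to the term is a simplification of whatever structure the actual FAST records use. Only the property and class URIs (“skos:Concept”, “foaf:focus”, “schema:sameAs”) are the published ones.

```python
# Sketch: a controlled term typed as skos:Concept and linked to external identifiers.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"
FOAF_FOCUS = "http://xmlns.com/foaf/0.1/focus"
SCHEMA_SAME_AS = "http://schema.org/sameAs"

# Hypothetical URIs, used for illustration only.
term = "http://example.org/fast/term-1"
triples = [
    (term, RDF_TYPE, SKOS_CONCEPT),
    (term, FOAF_FOCUS, "http://example.org/wikipedia/Some_Entity"),
    (term, SCHEMA_SAME_AS, "http://example.org/viaf/0000"),
]


def is_concept(subject, triples):
    """True if the subject is asserted to be an instance of skos:Concept."""
    return (subject, RDF_TYPE, SKOS_CONCEPT) in triples


def linked_entities(subject, triples):
    """Group the external URIs attached to a subject by the linking property."""
    links = {FOAF_FOCUS: [], SCHEMA_SAME_AS: []}
    for s, p, o in triples:
        if s == subject and p in links:
            links[p].append(o)
    return links


if __name__ == "__main__":
    for prop, targets in linked_entities(term, triples).items():
        print(prop, "->", targets)
```

A consumer that follows the collected URIs could then retrieve the additional labels and descriptions held by the external source, which is the benefit the FAST example attributes to this coding.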

5.1 Major challenges

For achieving interoperability among existing KOS vocabularies, challenges of mapping arise when existing KOS vocabularies differ with regard to structure, domain, language, or granularity. The foremost problem might be the number and variety of problems to be encountered in any mapping process.

Special challenges and controversial opinions have always overshadowed the projects that have attempted to map multilingual vocabularies. For example, equivalence correlation must be dealt with not only within each original language (intra-language equivalence) but also among the different languages (inter-language equivalence) involved. Intra-language homonymy and inter-language homonymy are also problematic semantic issues (IFLA Classification and Indexing Section 2005). Hudon (1997) pointed out that while various textbooks and guidelines provided many details on the “conceptual equivalence” issue, when discussing semantic solutions, display options, management issues, or use of technology, the guidelines seldom go as far as commenting on whether or not a particular option is truly respectful of a language and its speakers. The issue of “language equality” must be taken into account in the analysis and eventual selection of a solution to a specific problem.

Further complications arise when perspectives of different cultures need to be integrated. With the assumption that all languages are equal in a crosswalk table, the central

Examples from Faceted Application of Subject Terminology (FAST)

1. Search FAST and find the item (http://fast.oclc.org/searchfast/).
2. The TERM DETAILS display includes “Source and Other Links” and has a link to the RDF record.
3. In the RDF record for ID 35588:
-- through “foaf:focus” (first line in the code box of the screenshot), the identifiers (such as a Wikipedia URI) allow FAST terms to include detailed information that is usually excluded in authority records; and
-- through “schema:sameAs” (also in the code box of the screenshot), the identifier of VIAF lets FAST take advantage of all of the various string values included in VIAF (containing dozens of multilingual name authorities) without having to manually include the values in the RDF triples for the specific entry in FAST.

Screenshots captured 2019-01-17 from http://experimental.worldcat.org/fast/35588/rdf.xml

Figure 8.3. Examples from Faceted Application of Subject Terminology (FAST) showing the machine understandable coding of linkages to external vocabularies.

question is whether the unique qualities of a particular culture expressed through a KOS vocabulary can be appropriately transferred during the mapping process. Gilreath (1992) suggests that there are four basic requirements that must be harmonized in terminology work: concepts, concept systems, definitions, and terms.

In addition to language and cultural variants, KOS vocabularies have different microstructures and macrostructures: they represent different subject domains or have different scope and coverage; they have semantic differences caused by variations in conceptual structuring; their degrees of specificity and use of terminology vary; and the syntactic features (such as word order of terms and the use of inverted headings) are also different. Discussing the unification of languages and the unification of indexing formulas, Maniez (1997) pointed out that paradoxically the information languages increase the difficulties of cooperation between the different information databases, confirming what Lancaster (1986) observed earlier: “Perhaps somewhat surprisingly, vocabularies tend to promote internal consistency within information systems but reduce intersystem compatibility” (Lancaster 1986, 181).

In reality, during the transforming, mapping, and merging of concept equivalencies, certain specific nomens that represent the concepts, formed with definite syntaxes, are sought. While experimenting with an expert system to map mathematics classificatory structures and test the “convertibility,” Iyer and Giguere (1995) identified several kinds of semantic relations between DDC and Mathematics Subject Classification, comprising these situations: exact matches, specific to general, general to specific, many to one, cyclic mapping strategies, no matches, and specific and broad class mapping. Different types of equivalencies have also been defined by various important manuals and standards, identified as: exact equivalence, inexact equivalence (or near-equivalence), partial equivalence, single to multiple equivalence, and non-equivalence (Aitchison, Gilchrist, and Bawden 1997; ISO 25964-2:2013) (Figure 9).

The complex requirements and processes for matching terms, which are often imprecise, may have a significant impact on several aspects of vocabulary mapping: browsing structure, display, depth, non-topical classes, and the balance between consistency, accuracy, and usability (Zeng and Chan 2010). Various levels of mapping or linking can coexist in the same project, such as those identified by the Multilingual Access to Subjects (MACS) project: terminological level (subject heading), semantic level (authority record), and syntactic level (application) (Freyre and Naudi 2003).

Even with the advancement of information technologies, there are still many mapping processes done at the

Figure 9. Degrees of equivalence (Aitchison, Gilchrist, and Bawden 1997).

syntax-level (word, phrase, and context), rather than at the semantics-level. The issues of incorrect mapping of homographs for concepts belonging to different domains can be found in the mapping services and individual vocabulary’s published mapping results (e.g., “recruitment” as a biological process and as a personnel management process). The concept mapping according to the semantics will be a major and much-needed service; it is still a challenge for those dependent on machine mapping.

New AI (artificial intelligence) with machine learning does present great potential to reduce all such conflicts and improve interoperability at all layers, especially semantic interoperability. In archaeological communities, ARIADNE, whose consortium consists of twenty-four partners in sixteen countries, has reported extensive research and development activities, including using rule-based and machine learning mechanisms. For example, rule-based techniques have been employed with available archaeological vocabularies from Historic England (HE) and Rijksdienst voor het Cultureel Erfgoed (RCE). A study on semantic integration of data extracted from archaeological datasets with information extracted via natural language processing (NLP) across different languages demonstrated the feasibility of connecting and semantic cross-searching of the integrated information. The semantic linking of textual reports and datasets opens new possibilities for integrative research across diverse resources (Aloia et al. 2017; Binding et al. 2018).

5.2 Models of mapping process

The direct mapping and hub structure mapping models, recommended by ISO 25964-2:2013 (see Table 2 based on ISO 25964-2:2013, 6.3 and 6.4), address the mapping of contents between two or more vocabularies that do not share the same structure, differ in scope or language, and may belong to different types (e.g., thesauri, classification schemes, name authority lists, etc.). Note that syntax differences of encoding languages or expressions are not addressed. The mappings should always be established between the concepts (i.e., not the appellations representing the concepts).

In both direct mapping and hub structure models, the double-headed arrows indicate that the mappings are intended to work in dual directions. Each double-headed arrow represents a pair of mappings, one in each direction. When two-way mappings are not necessary, alternative models can be used in which one of the vocabularies is used as the source and the other one as the target. The unmatched members of the vocabularies need to be treated with additional strategies. Between any pair of vocabularies, the mapping quality that can be achieved is best when the target vocabulary has equal specificity as well as the same breadth of coverage as the source vocabulary (ISO 25964-2:2013 Section 6).

Model: Direct-linked
Each double-headed arrow represents a pair of mappings, one in each direction. As more vocabularies become involved in direct mapping, the number of mapping processes will increase dramatically. For example, mapping among four vocabularies will require a total of twelve sets of mappings, represented in six pairs, as shown in the above figure.

Model: Hub Structure
When multiple vocabularies are involved, it is often convenient to designate one vocabulary as a “hub” (Voc B in this figure) to which each of the other vocabularies is mapped. Each concept in the hub vocabulary should be mapped to the corresponding concept(s) in the other vocabularies, and vice versa. When two-way mappings are not necessary, mappings can be in one direction only, “from” or “to” the hub.

Table 2. Mapping models recommended by ISO 25964.

The direct-linked model, as illustrated in Table 2, indicates that direct mappings should be established between the concepts of each vocabulary and those of each other’s vocabulary. The mapping may be initiated by either end of the involved vocabularies. This may be extended to any number of vocabularies by establishing direct mappings from each vocabulary to each other one. As more vocabularies become involved, the number of mapping processes will increase dramatically. A mapping cluster, defined as a “coordinated set of mappings between the concepts of three or more vocabularies” (ISO 25964-2:2013, Section 3.42), is generally maintained and published with a particular publishing or application objective. For example, a cluster of mappings between four different thesauri might be maintained so that a user of any one of them can easily search document collections indexed with any of the four.

With more KOS vocabularies publishing their whole datasets as linked data and opening for free use, more and more direct mapping results have become available. Often the mapping results of each concept can be found on its entry page. For example, each entry of the LCSH can be viewed and downloaded in various RDF formats, while the mappings of the subject heading to other national libraries’ subject headings (e.g., from The Bibliothèque nationale de France) are listed as “Closely Matching Concepts from Other Schemes” (see an example for “Smartphones” at http://id.loc.gov/authorities/subjects/sh2007006251.html).

The “hub structure” uses a cross-switching approach, normally applied to reconcile multiple vocabularies. In this model, one of the vocabularies is used as the switching mechanism between the multiple vocabularies. Such a switching system can be a new system (e.g., UMLS Metathesaurus, introduced in Section 4.3) or an existing system (e.g., AGROVOC). The hub needs to be comprehensive enough in the required subject area(s), at least at high levels. The following examples are, again, from the well-known KOS interoperability efforts:

– AGROVOC thesaurus, developed and maintained by the Agricultural Information Management Standards (AIMS) division of the Food and Agriculture Organization (FAO) of the United Nations, is the switching vocabulary of fifteen important KOS vocabularies used worldwide plus DBpedia (as seen June 2018). Global vocabularies such as LCSH, DDC, EuroVoc and specialized vocabularies in a variety of related domains (involving multiple natural languages) are all mapped through a machine-assisted human mapping process. AGROVOC and the mapping results are completely encoded with SKOS, with mapping degrees indicated by SKOS “exactMatch,” “closeMatch,” “broadMatch,” and “narrowMatch.” (Refer to http://aims.fao.org/standards/agrovoc/linked-data for the current status of the mapping; see an example for the concept of “tuna” at http://aims.fao.org/aos/agrovoc/c_8003.html.)
– Information Coding Classification (ICC) was designed by the founder of ISKO, Ingetraut Dahlberg, as a theoretical superstructure of a universal system. It consists of nine general object areas according to the principle of evolution. The over 6,500 knowledge fields were defined with the combination of concepts of ontical level objects (ontical refers to a particular area of being), categorical concepts, and its subdivisions (Dahlberg 2017). In 1996, the author proposed its use as a switching mechanism between the five widely used classification systems: Dewey Decimal Classification (DDC), Universal Decimal Classification (UDC), Library of Congress Classification (LCC), The Bliss Bibliographic Classification (BC), and Colon Classification (CC). The encouraging results of top-level comparison were reported in 1998, with DDC and UDC almost matched; the total matching to five classifications was fifty-two, among eighty-one subject groups of ICC. The types of mapping fall into four groups: “equivalence,” “inclusion,” “is about,” and “union” (Dahlberg 1998).

Selective mapping. The models shown in Table 2 all require significant work to build and maintain. In the circumstance that it is unnecessary to map the entire vocabularies, mappings can be established only for the concepts that have been used or are likely to be used within the application in question. This model could be applied when there are relatively few concepts common to two or more vocabularies. In such a case, only a limited number of mappings can and should be established. Another case is to conduct the mapping among the products that applied the vocabularies, e.g., in the indexes or catalogues. While this reduces the initial mapping effort, it can increase updating maintenance tasks when changes are made in the collection (ISO 25964-2:2013, Section 6.5).

For example:

– MACS (Multilingual Access to Subjects) was a pioneering project that aimed to enable users to simultaneously search the catalogues of the project’s partner libraries in the languages of their choice (English, French, German). It mapped subject headings used in three monolingual subject authority files: Schlagwortnormdatei/Regeln für den Schlagwortkatalog (SWD/RSWK), Répertoire d’autorité-matière encyclopédique et alphabétique unifié (Rameau), and Library of Congress Subject Headings (LCSH) (Freyre and Naudi 2003).
– SciGator (http://scigator.unipv.it) is a new tool developed at the University of Pavia, Italy, a well-known institution of medieval origins. Its nine main libraries have the tradition of using local schemes to organize their collections to satisfy their specific needs. With the efforts of standardizing the shelfmarks among a number of libraries by adopting a single scheme based on DDC, as well as the action of a number of other libraries to supplement DDC as additional subject access points, SciGator has been developed to allow users to browse the DDC classes used in different libraries at the University of Pavia. Besides navigation of DDC hierarchies, SciGator suggests “see-also” relationships with related classes and maps equivalent classes in local shelving schemes, thus allowing the expansion of search queries to include subjects contiguous to the initial one (Lardera et al. 2017).
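The growth in mapping effort noted for the direct-linked model in Table 2 (four vocabularies requiring twelve sets of mappings in six pairs) can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the hub figures are derived here from the description of the hub structure (every non-hub vocabulary mapped to and from the hub) rather than quoted from ISO 25964-2.

```python
def direct_mapping_sets(n: int) -> int:
    """Directed mapping sets when each of n vocabularies maps to every other one."""
    return n * (n - 1)


def direct_mapping_pairs(n: int) -> int:
    """Two-way pairs (double-headed arrows) among n directly linked vocabularies."""
    return n * (n - 1) // 2


def hub_mapping_sets(n: int, two_way: bool = True) -> int:
    """Mapping sets when one of the n vocabularies serves as the hub.

    Each of the other n - 1 vocabularies is mapped to the hub, and also
    from the hub when two-way mappings are required.
    """
    spokes = max(n - 1, 0)
    return 2 * spokes if two_way else spokes


if __name__ == "__main__":
    for n in (2, 3, 4, 8, 16):
        print(
            f"{n:>2} vocabularies: "
            f"{direct_mapping_sets(n):>3} direct sets "
            f"({direct_mapping_pairs(n)} pairs), "
            f"{hub_mapping_sets(n):>2} hub sets"
        )
```

For four vocabularies this gives twelve direct sets in six pairs, matching Table 2, against six sets through a hub. The direct count grows with the square of the number of vocabularies while the hub count grows linearly, which is one reason a hub is convenient once many vocabularies are involved.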

Co-occurrence mapping works at the application level, e.g., in metadata records that have assigned subject terms from more than one vocabulary (e.g., MeSH and LCSH subject headings assigned to the same publication). Instead of preparing a completely mapped work at the source vocabulary level (as illustrated in Table 2), the group of subject terms can actually result in loosely-mapped terms (Zeng and Chan 2004). A new study reported a different kind of co-occurrence mapping, using a social network approach that leverages online social platform information (i.e., research activities and social activities) for mapping. The underlying assumption behind the approach is that “two classes/terms in different KOS are related if their corresponding research objects are connected to similar researchers” (Du et al. 2017).

Blended mapping models. In mapping between two vocabularies, multiple models might be used in the same case, as summarized by the AAT-Taiwan team for a study on the conceptual structures of concepts for Chinese art in the Taiwan National Palace Museum (NPM) Vocabulary (treated as the source) and AAT (treated as target) (Chen et al. 2016). Patterns found in this project might represent many similar cases of vocabulary mapping, such as: concepts are completely covered, incompletely covered, or not covered by the target vocabulary; and a specific category can be found or does not exist in the target. Each model, simply interpreted in the following table, specifies whether a vocabulary is selected as the “base,” supplemented by the other vocabulary, or if the vocabulary is “fully adopted.” All depend on the situations encountered (Table 3, based on Chen et al. 2016).

Ontology matching is a global interest, as reflected by Otero-Cerdeira et al.’s 2015 paper, “Ontology Matching: A Literature Review” of more than 1,600 papers. A classification framework by Euzenat and Shvaiko (2013) was used in the study (Figure 10), which can be followed top-down (focusing on the interpretation that the different techniques offer to input information) or bottom-up (focusing on the type of input that the matching techniques use), while both meet at the “concrete techniques” tier.

On top of all the surveys conducted on complex ontology matching, Thiéblin et al. (2018) carried out a study of these surveys. The paper indicates that there still is no benchmark on which complex ontology matching approaches can be systematically evaluated and compared. With a proposed definition of complex correspondences and alignments, a classification of these approaches based on their specificities is proposed in the paper. The specificities of the complex matching approaches rely on their output (type of correspondence) and their process (guiding structure) (Thiéblin et al. 2018).

5.3 Encoding the alignment degrees

Encoding the alignment degrees is a prominent process called upon by the LOD movement. The mashups, crosswalks, and interlinking all rely heavily on the alignment of the components in RDF triples. The requirement for the precision of mapping is more important than that in the previous non-LOD environment. ISO 25964-2 enumerates scenarios of mappings and categorizes them in three groups: equivalence mappings (including simple equivalence (one-to-one) and compound (one-to-many) equivalence mappings), hierarchical mappings, and associative mappings. It also discusses in detail exact, inexact, and partial equivalence.

To encode and represent the mapping degrees when multiple vocabularies are involved, RDFS, OWL, and SKOS have provided guidance and properties, including:

– between ontological classes: “owl:equivalentClass” and “rdfs:subClassOf”;

Model | Source Vocabulary: National Palace Museum (NPM) Vocabulary | Target Vocabulary: Art & Architecture Thesaurus (AAT) | For situations such as:
Model A | [supplement] | base | plant
Model B | base | [supplement] | Chinese painting techniques
Model C | [-- none --] | fully adopted | When well-structured sub-facets and hierarchies are needed (e.g., animal, plant, people, building)
Model D | fully adopted | [-- none --] | For categories carrying strong cultural distinctions, and working only for the Chinese styles

Table 3. Blended mapping models: four models used in mapping the National Palace Museum (NPM) Vocabulary (treated as the source)

Figure 10. Matching techniques classification (Euzenat and Shvaiko, 2013; Otero-Cerdeira et al. 2015).

– between properties: “owl:equivalentProperty” and “rdfs:subPropertyOf”;
– between concepts from concept schemes: “skos:exactMatch,” “skos:closeMatch,” “skos:relatedMatch,” and the reciprocal pair “skos:broadMatch” and “skos:narrowMatch” (Figure 11 left); for transitive super-properties of “skos:broader” and “skos:narrower,” “skos:broaderTransitive” and “skos:narrowerTransitive” (Figure 11 right).

It should be remembered that, in mapping concepts, an exact match of two concepts found from different vocabularies is rare, even though sometimes the labels read the same and the scope notes or definitions are similar. Their pre-defined constraints should be able to reveal the equivalency or similarity of the two concepts or classes to be mapped. A “skos:closeMatch” is more appropriate than a “skos:exactMatch” in the majority of situations.

Other designed schemas are also available. UMBEL (Upper Mapping and Binding Exchange Layer, http://umbel.org/) is designed to help content interoperate on the web and has mapped OpenCyc, DBpedia, PROTON, GeoNames, and Schema.org. This is enabled through its UMBEL Vocabulary, which contains three classes and thirty-eight properties for describing domain ontologies, providing expressions of likelihood relationships distinct from exact identity or equivalence.

Figure 11. Demonstration of matching results encoded with SKOS properties (Isaac 2010; Isaac and Summers 2009). Panel text recovered from the figure: inferring a transitive hierarchy from asserted “skos:broader” statements.
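The two behaviours named in Section 5.3 and Figure 11, the reciprocal pair “skos:broadMatch”/“skos:narrowMatch” and the inference of “skos:broaderTransitive” from asserted “skos:broader” statements, can be sketched in a few lines. This is a minimal illustration under stated assumptions: triples are plain Python tuples rather than an RDF store, the two vocabularies (vocA, vocB) and their concepts are hypothetical, and the helper function names are invented for this example. Only the SKOS property URIs are taken from the SKOS namespace.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BROAD_MATCH = SKOS + "broadMatch"
NARROW_MATCH = SKOS + "narrowMatch"
BROADER = SKOS + "broader"
BROADER_TRANSITIVE = SKOS + "broaderTransitive"

# The reciprocal pair of hierarchical mapping properties.
RECIPROCAL = {BROAD_MATCH: NARROW_MATCH, NARROW_MATCH: BROAD_MATCH}


def add_reciprocals(triples):
    """Return the triples plus the reciprocal of each broadMatch/narrowMatch."""
    result = set(triples)
    for s, p, o in triples:
        if p in RECIPROCAL:
            result.add((o, RECIPROCAL[p], s))
    return result


def infer_broader_transitive(triples):
    """Infer skos:broaderTransitive statements from asserted skos:broader ones."""
    parents = {}
    for s, p, o in triples:
        if p == BROADER:
            parents.setdefault(s, set()).add(o)
    inferred = set()
    for start in parents:
        seen, stack = set(), list(parents[start])
        while stack:  # walk upwards; 'seen' guards against cycles
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        inferred.update((start, BROADER_TRANSITIVE, ancestor) for ancestor in seen)
    return inferred


# Hypothetical concepts in two made-up vocabularies.
A = "http://example.org/vocA/"
B = "http://example.org/vocB/"
asserted = {
    (A + "bluefin-tuna", BROADER, A + "tuna"),
    (A + "tuna", BROADER, A + "fish"),
    (A + "tuna", BROAD_MATCH, B + "marine-fish"),
}
with_reciprocals = add_reciprocals(asserted)
inferred = infer_broader_transitive(asserted)
```

In this sketch the asserted “skos:broader” statements are left untouched and the transitive statements are returned as a separate set, so a direct parent can still be told apart from a more distant ancestor.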

– UMBEL classes: “reference concept” (for more broadly understood concept), “super type” (for a higher-level of clustering and organization of “reference concepts”), and “qualifier” (a set of descriptions that indicate the method used to establish an “isAbout” or “correspondsTo” relationship between an UMBEL reference concept (RC) and an external entity).
– UMBEL properties: thirty-eight properties provide the mapping basis for the vocabulary: “correspondsTo,” “isAbout,” “isRelatedTo,” “relatesToXXX” (thirty-one variants), “isLike,” “hasMapping,” “hasCharacteristic,” and “isCharacteristicOf” (http://techwiki.umbel.org/index.php/UMBEL_Vocabulary)

6.0 Harmonization through terminology services

Thousands of KOS vocabularies have been developed and are in use in every field. Even with the same or similar concept scope, concepts in vocabularies that are in isolation from one another might be represented in different terms, use different formats and formalisms, and are published and stored with different access methods. Vocabulary harmonization is needed not only for those existing vocabularies, but also for the initiations of new vocabulary developments.

Terminology services is a broad term referring to the repositories and registries of vocabularies (including both value vocabularies and property vocabularies). A group of services is used to host and present vocabularies, member concepts, terms, classes, relationships, and detailed explanations of terms that facilitate semantic interoperability. Powered by semantic technologies and enabled by RDF, SKOS, and OWL, they have emerged during the last decade. In addition to registering, hosting, publishing, and managing diverse vocabularies and machine-processable schemas, these services also aim to enable searching, browsing, discovery, translation, mapping, semantic reasoning, automatic classification and indexing, harvesting, and alerting.

Terminology services can be interactive machine-to-machine or between human and machine. User-interfacing services can also be applied at all stages of the search process. For example, in supporting the needs of searching for concepts and the terms representing the concepts, services can assist in resolving search terms, disambiguation, browsing access, and mapping between vocabularies. As a search support for queries, services facilitate query expansion, query reformulation, and combine browsing and search. These can be applied as immediate elements of the end user interface, or they can act in underpinning services behind the scenes, depending upon the context. Technologically, web services can be used effectively to interact with controlled vocabularies. Terminology services represent an entirely new dimension in KOS research and development (Tudhope, Koch, and Heery 2006; Golub et al. 2014).

Variations of these terminology services can be found in terms of:

– The natural languages involved: monolingual, bilingual (mainly involving an original language and English), or multi-lingual;
– The number of KOS vocabularies contained in a service;
– Vocabulary version information’s availability: current version only or all versions, with or without descriptive, technical, and administrative metadata of a vocabulary;
– Provenance data’s availability: As more KOS vocabularies are released and updated online, the provenance data might be emphasized at a different level, e.g., vocabulary, entry, or even concept- and term-specific level;
– Scope of the services: registration of vocabularies only, or access to all KOS vocabularies hosted; onsite searching, browsing, displaying, and navigating, direct linking to data values, etc. The highest-level service would be the alignment among vocabularies (e.g., BioPortal).

The scope of the services mentioned above indicates two major types of terminology services to be referenced: repositories vs. registries (Zeng and Mayr 2018).

– Vocabulary repositories are used for those services hosting full content of a KOS vocabulary as well as the management data for each component updated regularly. A typical example, BioPortal (https://bioportal.bioontology.org/), is presented in Appendix 1 at Zeng 2018 (http://www.isko.org/cyclo/interoperability#app1).
– Vocabulary registries differ from repositories because they offer information about vocabularies (i.e., metadata) instead of the vocabulary contents themselves; they are the fundamental services for locating KOS products. The metadata for vocabularies usually contain both the descriptive contents and the management and provenance information. The registry may provide the data about the reuse of ontological classes and properties among the vocabularies, as presented by linked open vocabularies (LOV) (https://lov.linkeddata.es/dataset/lov/).

During the last twenty years, there have been well-funded projects that could be seen as pioneers in terminology services. Information about these experimental projects can be found in a previous encyclopedia article by Zeng and Chan (2010, 2015). A list containing truly functional services as of July 2018 can be found in Appendix 2 of Zeng (2018) (http://www.isko.org/cyclo/interoperability#app2). It is based on a recent review by Zeng and Mayr (2018) with updates after May 2018 (e.g., changes of the EU Vocabularies and LOV).

Registering KOS vocabularies and services would need to begin with a set of common attributes that describe them. Metadata of KOS vocabularies, including descriptions of a vocabulary’s data model, type, protocol, status, responsible body, available format, affectivity, and other features, are very important to terminology services, vocabulary users (machine or human), and retrieval systems. At a minimum, metadata for KOS resources will describe specific characteristics of a KOS, facilitate the discovery of KOS resources, assist in the evaluation of such resources for a particular application or use, and facilitate sharing, reusing, and collaboration of the KOS resources. A Dublin Core Application Profile for KOS Resources (NKOS AP) was released in 2014, which was developed based on the work begun in 1998 by members of the Networked Knowledge Organization Systems (NKOS). The specification, known as NKOS AP (http://nkos.slis.kent.edu/nkos-ap.html), defines the set of RDF classes and properties that can be used to describe any KOS resource.

7.0 Conclusion

This article represents an attempt to bring together the major approaches and standards in the semantic interoperability dimension through reported cases and available real services. The selected examples are only demonstrations of the approaches, and each actually could represent more than one method. Tools and technologies are not discussed in this entry but certainly can be labeled as groundbreaking at the current stage of the web. ISO 25964-2:2013 has a dedicated section (Section 14) on the techniques for identifying candidate mappings, including computer-assisted direct mapping.

Following the linked data principles that benefit from the data-driven, shared editing and publishing workflow, as well as an increasing number of KOSs published in machine-understandable formats, the reusability of any of the existing and new KOS vocabularies is greatly increased. Mapping the semantics promotes cooperation and reduces duplication. Coherent semantics benefit research, innovation systems, and value chains (Baker et al. 2016b).

The author would like to conclude this article by using a statement from the recent Ontology Summit 2018 Communiqué (Ontolog 2018, 5-6):

Each system, organization, community, database or message format is thus defined in its own, too often implicit, context which might, in turn, depend on other contexts … While these systems are defined and built independently, systematic integration of their information and processes is essential for collaboration, shared services, information sharing and analytics. These capabilities are not optional in today’s world; they are essential for the continued existence of commercial enterprises and the effectiveness of government.

References

Adebesin, Funmi, Paula Kotze, Rosemary Foster, and Darelle Van Greunen. 2013. “A Review of Interoperability Standards in E-health and Imperatives for their Adoption in Africa.” South African Computer Journal 50: 55-72. doi:10.18489/sacj.v50i1.176
Aitchison, Jean, Alan Gilchrist, and David Bawden. 1997. Thesaurus Construction and Use: A Practical Manual. 3rd ed. London: ASLIB.
Aloia, Nicola, et al. 2017. “Enabling European Archaeological Research: The ARIADNE E-Infrastructure.” Internet Archaeology 43. doi:10.11141/ia.43.11
Arms, William A., Diane Hillman, Carl Lagoze, Dean Krafft, Richard Marisa, John Saylor, Carol Terrizzi, and Herbert van de Sompel. 2002. “A Spectrum of Interoperability, The Site for Science Prototype for the NSDL.” D-Lib Magazine 8, no. 1. doi:10.1045/january2002-arms
National Library of Medicine. 2009. “Metathesaurus.” Chap. 2 in UMLS® Reference Manual [Internet]. Bethesda, MD: National Library of Medicine. https://www.ncbi.nlm.nih.gov/books/NBK9684/
Baker, Thomas, Caterina Caracciolo, Anton Doroszenko, Lori Finch, Osma Suominen, and Sujata Suri. 2016a. “The Global Agricultural Concept Scheme and Agrisemantics.” In DC-2016--The Copenhagen, Denmark Proceedings, 14-5. http://dcpapers.dublincore.org/pubs/article/view/3847
Baker, Thomas, Caterina Caracciolo, Anton Doroszenko, and Osma Suominen. 2016b. “GACS Core: Creation of a Global Agricultural Concept Scheme.” In Metadata and Semantics Research: 10th International Conference, MTSR 2016, Göttingen, Germany, November 22-25, 2016, Proceedings, ed. Emmanouel Garoufallou, Imma Subirats Coll, Armando Stellato, and Jane Greenberg. Communications in Computer and Information Science 672. Cham, Switzerland: Springer, 311-6. doi:10.1007/978-3-319-49157-8_27
Binding, Ceri, and Douglas Tudhope. 2016. “Improving Interoperability Using Vocabulary Linked Data.” International Journal on Digital Libraries 17: 5-21. doi:10.1007/s00799-015-0166-y
Binding, Ceri, Douglas Tudhope, and Andreas Vlachidis. 2018. “A Study of Semantic Integration Across Archaeological Data and Reports in Different Languages.” Journal of Information Science. doi:10.1177/0165551518789874

Bountouri, Lina. 2017. Archives in the Digital Age: Standards, Policies and Tools. Cambridge, MA: Chandos Publishing.
CHIN (Canadian Heritage Information Network). 2017. “CHIN Guide to Museum Standards.” https://www.canada.ca/en/heritage-information-network/services/collections-documentation-standards/chin-guide-museum-standards.html
CC:DA (Committee on Cataloging: Description & Access). 2000. “Task Force on Metadata: Final Report.” https://www.libraries.psu.edu/tas/jca/ccda/tf-meta6.html
Chan, Lois Mai, Eric Childress, Rebecca Dean, Edward T. O’Neill, and Diane Vizine-Goetz. 2001. “A Faceted Approach to Subject Data in the Dublin Core Metadata Record.” Journal of Internet Cataloging 4, no. 1-2: 35-47. doi:10.1300/J141v04n01_05
Chen, Shu-jiun, Marcia Lei Zeng, and Hsueh-hua Chen. 2016. “Alignment of Conceptual Structures in Controlled Vocabularies in the Domain of Chinese Art: A Discussion of Issues and Patterns.” International Journal on Digital Libraries 17, no. 1: 23-38. doi:10.1007/s00799-015-0163-1
Clarke, Stella G. Dextre and Marcia Lei Zeng. 2012. “From ISO 2788 to ISO 25964: The Evolution of Thesaurus Standards Towards Interoperability and Data Modeling.” Information Standards Quarterly 24, no. 1: 20-6.
Dahlberg, Ingetraut. 1998. “Classification Structure Principles: Investigations, Experiences and Conclusions.” In Structures and Relations in Knowledge Organization: Proceedings of the 5th International ISKO Conference 25-29 August 1998 Lille, France, ed. Widad Mustafa El-Hadi. Advances in Knowledge Organization 6. Würzburg: Ergon Verlag, 79-87.
Dahlberg, Ingetraut. 2017. “Brief Communication: Why a New Universal Classification System is Needed.” Knowledge Organization 44: 65-71. doi:10.5771/0943-7444-2017-1-65
“Data on the Web Best Practices: W3C Recommendation 31 January 2017.” 2017. https://www.w3.org/TR/dwbp/
Doerr, Martin. 2008. “DCC Digital Curation Manual: Instalment on Ontologies,” ed. Seamus Ross, Michael Day. [United Kingdom]: Digital Curation Centre. https://www.era.lib.ed.ac.uk/bitstream/handle/1842/3341/Doerr%20ontologies.pdf
Du, Wei, Xusen Cheng, Chen Yang, Jianshan Sun, and Jian Ma. 2017. “Establishing Interoperability Among Knowledge Organization Systems for Research Management: A Social Network Approach.” Scientometrics 112: 1489-1506. doi:10.1007/s11192-017-2457-0
Euzenat, Jérôme and Pavel Shvaiko. 2013. Ontology Matching. 2nd ed. Berlin: Springer.
Freyre, Elisabeth and Max Naudi. 2003. “MACS: Subject Access Across Languages and Networks.” In Subject Retrieval in a Networked Environment: Proceedings of the IFLA Satellite Meeting Held in Dublin, OH, 14-16 August 2001, ed. I.C. McIlwaine. München: K. G. Saur, 3-10.
Fritzsche, Donna, Michael Grüninger, Ken Baclawski, Mike Bennett, Gary Berg-Cross, Todd Schneider, Ram Sriram, Mark Underwood, and Andrea Westerinen. 2017. “Ontology Summit 2016 Communiqué: Ontologies Within Semantic Interoperability Ecosystems.” Applied Ontology 12: 91-111. doi:10.3233/AO-170181
Gilreath, C. T. 1992. “Harmonization of Terminology - An Overview of Principles.” International Classification 19: 135-9.
Golub, Koraljka, Douglas Tudhope, Marcia Lei Zeng, and Maja Žumer. 2014. “Terminology Registries for Knowledge Organization Systems: Functionality, Use, and Attributes.” Journal of the Association for Information Science and Technology 65: 1901-16. doi:10.1002/asi.23090
Heiler, Sandra. 1995. “Semantic Interoperability.” ACM Computing Surveys (CSUR) 27, no. 2: 271-3.
Hudon, Michèle. 1997. “Multilingual Thesaurus Construction: Integrating the Views of Different Cultures in One Gateway to Knowledge and Concepts.” Knowledge Organization 24: 84-91.
IFLA (International Federation of Library Associations and Institutions). Classification and Indexing Section. 2005. “Guidelines for Multilingual Thesauri.” http://www.ifla.org/VII/s29/pubs/Draft-multilingualthesauri.pdf
Isaac, Antoine. 2010. “SKOS and Linked Data.” Presentation at ISKO-UK Linked Data: The Future of Knowledge Organization on the Web Conference, 2010. https://www.slideshare.net/antoineisaac/skos-and-linked-data
Isaac, Antoine. 2013. “Correspondence Between ISO 25964 and SKOS/SKOS-XL Models.” https://groups.niso.org/apps/group_public/download.php/12351/CorrespondenceISO25964-SKOSXL-MADS-2013-12-11.pdf
Isaac, Antoine and Ed Summers. 2009. “SKOS Simple Knowledge Organization System Primer.” https://www.w3.org/TR/skos-primer/
Isaac, Antoine, William Waites, Jeff Young, and Marcia Zeng. 2011. “Library Linked Data Incubator Group: Datasets, Value Vocabularies, and Metadata Element Sets.” http://www.w3.org/2005/Incubator/lld/XGR-lld-vocabdataset-20111025/
ISO (International Organization for Standardization). 2011. Thesauri for Information Retrieval. Pt. 1 of Thesauri and Interoperability with Other Vocabularies. International Standards ISO 25964-1:2011(E). Geneva: ISO.
ISO (International Organization for Standardization). 2013. Interoperability with Other Vocabularies. Pt. 2 of Thesauri and Interoperability with Other Vocabularies. International Standards ISO 25964-2:2013(E). Geneva: ISO.

Iyer, Hermalata and Mark Giguere. 1995. “Towards Designing an Expert System to Map Mathematics Classificatory Structures.” Knowledge Organization 22: 141-7.
Joudrey, Daniel N. and Arlene G. Taylor. 2017. The Organization of Information. 4th ed. Santa Barbara, CA: Libraries Unlimited.
Koch, Traugott. 2006. “Electronic Thesis and Dissertations Services: Semantic Interoperability, Subject Access, Multilinguality.” Paper read at E-Thesis Workshop, Amsterdam 2006. http://www.ukoln.ac.uk/ukoln/staff/t.koch/publ/e-thesis-200601.html
Lancaster, F.W. 1986. Vocabulary Control for Information Retrieval. 2nd ed. Arlington, VA: Info Resources Press.
Lardera, Marco, Claudio Gnoli, Clara Rolandi and Marcin Trzmielewski. 2017. “Developing SciGator, a DDC-Based Library Browsing Tool.” Knowledge Organization 44: 638-43. doi:10.5771/0943-7444-2017-8-638
Maniez, Jacques. 1997. “Database Merging and the Compatibility of Indexing Languages.” Knowledge Organization 24: 213-24.
Mazzocchi, Fulvio. 2017. “Relations in KOS: Is It Possible to Couple a Common Nature with Different Roles?” Journal of Documentation 73: 368-83. doi:10.1108/JD-05-2016-0063
Mazzocchi, Fulvio. 2018. “Knowledge Organization System (KOS): An Introductory Critical Account.” Knowledge Organization 45: 54-78. doi:10.5771/0943-7444-2018-1-54
Mitchell, Joan S., Marcia Lei Zeng, and Maja Žumer. 2014. “Modeling Classification Systems in Multicultural and Multilingual Contexts.” Cataloging & Classification Quarterly 52, no. 1: 90-101.
Mixter, Jeff. 2013. “Benefit of Linked Data in Enriching and Managing FAST Linked Data.” In 76th Annual Meeting of the American Society for Information Science and Technology, Montreal, Canada, Nov. 2-6, 2013. Proceedings of the American Society for Information Science and Technology 50. doi:10.1002/meet.14505001019
Moen, William E. 2001. “Mapping the Interoperability Landscape for Networked Information Retrieval.” In Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries. New York: ACM, 50-1.
NISO (National Information Standards Organization). 2004. Understanding Metadata. Bethesda, MD: NISO Press. http://www.niso.org/publications/press/UnderstandingMetadata.pdf
NISO (National Information Standards Organization). 2005. Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. Bethesda, MD: NISO. ANSI/NISO Z39.19-2005 (R2010).
Obrst, Leo. 2003. “Ontologies for Semantically Interoperable Systems.” In Proceedings of the Twelfth International Conference on Information and Knowledge Management. New York: ACM, 366-9.
Ontolog. 2018. “Ontology Summit 2018 Communiqué. Contexts in Context.” http://ontologforum.org/index.php/OntologySummit2018
O’Neill, Edward. 2013. “FAST Authorities as Linked Data.” In 76th Annual Meeting of the American Society for Information Science and Technology, Montreal, Canada, Nov. 2-6, 2013. Proceedings of the American Society for Information Science and Technology 50. doi:10.1002/meet.14505001019
Otero-Cerdeira, Lorena, Francisco J. Rodríguez-Martínez, and Alma Gómez-Rodríguez. 2015. “Ontology Matching: A Literature Review.” Expert Systems with Applications 42: 949-71. doi:10.1016/j.eswa.2014.08.032
Ouksel, Aris M., and Amit Sheth. 1999. “Semantic Interoperability in Global Information Systems.” ACM Sigmod Record 28: 5-12. doi:10.1145/309844.309849
Patel, Manjula, Traugott Koch, Martin Doerr, and Chrisa Tsinaraki. 2005. “Semantic Interoperability in Digital Library Systems.” http://opus.bath.ac.uk/23606/1/SI_in_DLs.pdf
Pollock, Jeffrey T. and Ralph Hodgson. 2004. Adaptive Information: Improving Business Through Semantic Interoperability, Grid Computing, and Enterprise Integration. Hoboken, N.J.: Wiley-Interscience.
Powell, Andy, Mikael Nilsson, Ambjörn Naeve, Pete Johnston, and Tom Baker. 2007. “DCMI Abstract Model.” http://dublincore.org/documents/abstract-model/
RDA (Research Data Alliance) Research Data Repository Interoperability Working Group. 2018. “Research Data Repository Interoperability WG Final Recommendations.” https://www.rd-alliance.org/group/research-data-repository-interoperability-wg/outcomes/research-data-repository-0
Sheth, Amit P. 1999. “Changing Focus on Interoperability in Information Systems: From System, Syntax, Structure to Semantics.” In Interoperating Geographic Information Systems: Second International Conference, INTEROP'99, Zurich, Switzerland, March 10-12, 1999: Proceedings, ed. Andrej Včkovski, Kurt E. Brassel, and Hans-Jörg Schek. Lecture Notes in Computer Science 1580. Berlin: Springer, 5-29.
Sowa, John F. 2006. “A Dynamic Theory of Ontology.” In Formal Ontology in Information Systems: Proceedings of the Fourth International Conference (FOIS 2006), ed. Brandon Bennet and Christiane Fellbaum. Frontiers in Artificial Intelligence and Applications 150. Amsterdam: IOS Press, 204-13.
Svenonius, Elaine. 2004. “The Epistemological Foundations of Knowledge Representations.” Library Trends 52, no. 3: 571–87.
Taylor, Arlene G. 2004. The Organization of Information. 2nd ed. Westport, CT: Libraries Unlimited.

Thiéblin, Elodie, Ollivier Haemmerlé, Nathalie Hernandez, and Cassia Trojahn dos Santos. Forthcoming. “Survey on Complex Ontology Matching.”
Tudhope, Douglas and Ceri Binding. 2008. “Machine Understandable Knowledge Organization Systems.” Project no. 507618. Bath: UKOLN. http://hypermedia.research.southwales.ac.uk/media/files/documents/2008-07-05/Additional-report-wp5.pdf?
Tudhope, Douglas, and Ceri Binding. 2004. “A Case Study of a Faceted Approach to Knowledge Organisation and Retrieval in the Cultural Heritage Sector.” In Resource Discovery Technologies for the Heritage Sector, ed. Guntram Geser and John Pereira. DigiCULT Thematic Issue 6. Salzburg: Salzburg Research Forschungsgesellschaft, 28-33. https://www.digicult.info/downloads/digicult_thematic_issue_6_lores.pdf
Tudhope, Douglas, Traugott Koch, and Rachel Heery. 2006. “Terminology Services and Technology: JISC State of the Art Review.” http://core.ac.uk/download/pdf/2809596.pdf
Tuominen, Jouni, Nina Laurenne, and Eero Hyvönen. 2011. “Biological Names and Taxonomies on the Semantic Web–Managing the Change in Scientific Conception.” In The Semantic Web: Research and Applications, 8th Extended Semantic Web Conference, ESWC 2011, Heraklion, Crete, Greece, May 29 – June 2, 2011, Proceedings, Part II, ed. Grigoris Antoniou, Marko Grobelnik, Elena Simperl, Bijan Parsia, Dimitris Plexousakis, Pieter De Leenheer, and Jeff Pan. Lecture Notes in Computer Science 6644. Berlin: Springer, 255-69.
Zeng, Marcia Lei and Lois Mai Chan. 2004. “Trends and Issues in Establishing Interoperability Among Knowledge Organization Systems.” Journal of the Association for Information Science and Technology 55: 377-95.
Zeng, Marcia Lei and Lois Mai Chan. 2010. “Semantic Interoperability.” In Encyclopedia of Library and Information Sciences. 3rd ed., ed. Marcia J. Bates and Mary Niles Maack. Boca Raton, FL: CRC Press, 4645-62.
Zeng, Marcia Lei. 2018. “Hands-on (Details) Obtain Data Using SPARQL – Demo: Using AAT.” Paper read at Learning, Understanding, and Using Linked Data, Workshop at 2018 Digital Initiatives Symposium, April 23, 2018. San Diego, California. http://metadataetc.org/LOD/6hands-on-Microthesauri-from-AAT.pdf
Zeng, Marcia Lei and Philipp Mayr. 2018. “Knowledge Organization Systems (KOS) in the Semantic Web: A Multi-Dimensional Review.” International Journal on Digital Libraries. doi:10.1007/s00799-018-0241-2
Zeng, Marcia Lei, Maja Žumer, and Athena Salaba, eds. 2011. Functional Requirements for Subject Authority Data (FRSAD): A Conceptual Model. IFLA Series on Bibliographic Control 43. Berlin: De Gruyter Saur.
Žumer, Maja. 2018. “IFLA Library Reference Model (IFLA LRM): Harmonisation of the FRBR Family.” Knowledge Organization 45: 310-8. doi:10.5771/0943-7444-2018-4-310


Review

DOI:10.5771/0943-7444-2019-2-147

Semantic Perception: How the Illusion of a Common Language Arises and Persists, Jody Azzouni. New York: Oxford University Press, 2013. ISBN 9780199967407. 2015. ISBN 9780190275549.

Ontology Without Borders, Jody Azzouni. New York, NY: Oxford University Press. ISBN 9780190622558.

It is worth noting, at the outset, that classification and semantics share a type of genealogy. The founder of the modern study of semantics, Michel Bréal, in addition to inaugurating the modern marathon run in Olympic competition, worked with manuscripts in the National Library of France prior to his professorial appointment in linguistics at the Collège de France. Any engagement with semantics requires a commitment analogous to the stamina required of a marathon runner. Its complexity, its convoluted argument, is much like the hardest race athletes run (while there is no dehydration from the study of semantics it is just possible that blood pressure will rise in the course of inquiry).

The Greek term, sēmantikós, means “significant” and while we often think of semantics as the study of meaning, its relation to questions of why and how words, phrases and symbols denote a relationship of signifier and signified is crucial to appraising the broader questions the field encompasses. Study of semantics has been grounded in a range of approaches, from linguistic study (how sense, reference and truth relate through relationships of smaller and larger linguistic units or texts) to logical modalities (parsing sentences using rules of logic to arrive at a predicate) winding up in a veritable hotchpot of cognitive science orientations that look in some way to brain function to free the understanding of the use of language units from uncritical links to mechanisms (be they characteristics of languages-as-tools or of our socialization-as-speakers-and-hearers). Writing in 1995, the neurobiologist Gerhard Roth was able to help make clear that the search for a naturalistic theory of meaning, one that located “the origin of semantics in the brain,” was gaining traction (Roth 1995, 1 emphasis original):

Until very recently, such an endeavor was seen by almost every scientist and philosopher as vain from the very beginning. The brain was viewed by neurophysiologists and neurochemists as a purely physico-chemical system and the processes going on inside the brain as nothing but electrochemical events. What can be measured are action potentials and transmitter release, but no meaning. The behaviorist dogma was that this was sufficient to explain behavior and cognitive acts. On the other hand, psychologists, philosophers and computer scientists believed and to a large extent still believe that meaning or “information” constitutes a domain in itself, with its own laws and phenomena that can be described and understood independently of brain processes.

It might reasonably be said that the analytic tradition in philosophy has had a very close relationship with scientific inquiry, such as that alluded to by Roth above, but also that many of its broader characteristics have emerged out of the prioritization of the philosophy of language, which emerged in the nineteenth century from Gottlob Frege’s development of first-order logic in his Concept Script and as later adopted by Bertrand Russell. This first-order, or predicate, logic was thought to offer real promise in solving problems through substituting symbols for words and helping to uncover different aspects of meaning expressed through language. Of note here, also, is Frege’s Principle, the principle of semantic compositionality. This is “the principle that the meaning of a (syntactically complex) whole is a function only of the meanings of its (syntactic) parts together with the manner in which these parts were combined” (Pelletier 1994, 11).

Philosophy of language developed further, through Ludwig Wittgenstein’s Tractatus Logico-Philosophicus, and many readers will have come across the propositions below distilled therefrom:

1. The world is everything that is the case.
2. What is the case (a fact) is the existence of states of affairs.
3. A logical picture of facts is a thought.
4. A thought is a proposition with a sense.
5. A proposition is a truth-function of elementary propositions. (An elementary proposition is a truth-function of itself).
6. The general form of a proposition is the general form of a truth function, which is: [p, ξ, N(ξ)]. This is the general form of a proposition.
7. Whereof one cannot speak, thereof one must be silent.
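Proposition 6 can be made a little more concrete. On the usual reading, N(ξ) is joint denial: it yields a true proposition exactly when every proposition in the selection ξ is false. The Python sketch below is offered only as an illustration under that assumption; it is restricted to finitely many propositions, and the function names (`N`, `negation`, `disjunction`, `conjunction`, `truth_table`) are invented for the example rather than drawn from the Tractatus or from Azzouni.

```python
from itertools import product


def N(*propositions: bool) -> bool:
    """Joint denial: true exactly when every proposition passed in is false."""
    return not any(propositions)


# Familiar connectives rebuilt from N alone (illustrative helper names).
def negation(p: bool) -> bool:
    return N(p)


def disjunction(p: bool, q: bool) -> bool:
    # Denying the joint denial of p and q leaves "p or q".
    return N(N(p, q))


def conjunction(p: bool, q: bool) -> bool:
    # Jointly denying "not p" and "not q" leaves "p and q".
    return N(N(p), N(q))


def truth_table(connective):
    """Map each pair of truth values to the value the connective returns."""
    return {(p, q): connective(p, q) for p, q in product([True, False], repeat=2)}
```

Calling `truth_table(conjunction)` or `truth_table(disjunction)` gives the ordinary tables for "and" and "or", which is one way of seeing why a single operation might be proposed as the general form of a truth-function.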

According to G. H. Von Wright (1955, 538), Wittgenstein’s later work “abandoned the picture-theory of language, the doctrine that all significant propositions are truth-functions of elementary propositions, and the doctrine of the unspeakable.” Other scholars, such as James Conant, see a consistency at work from the earlier to the later work such that they advocate a reading in which (Conant 2006, 182):

the Tractatus has no general story about what makes something nonsense … [and] moments of recognition that a reader is called upon (in TLP 6.54) to attain must come one step at a time. This is contrary to the spirit of most standard readings, according to which there can be a possible moment in a reader’s assimilation of the doctrines of the book when the theory (once it has been fully digested by the reader) can be brought to bear wholesale on all of the (putatively nonsensical) propositions that make up the work.

This reading of Wittgenstein’s work, as a unified whole, rests on how so-called “resolute readers” interpret this phrase (at 6.54) (Wittgenstein 1922, 90):

My propositions are elucidatory in this way: he who understands me finally recognizes them as senseless, when he has climbed out through them, on them, over them. (He must so to speak throw away the ladder, after he has climbed up on it.) He must surmount these propositions; then he sees the world rightly.

The gap between this approach and Russell’s introduction to the Tractatus, in which he refers to Wittgenstein’s aim as being to elicit “the conditions which would have to be fulfilled by a logically perfect language” (1922, 7), should be obvious. Wittgenstein’s death in 1951 was a watershed in the sense that logical positivists continued to take a cue from the early work, seeking certainty through system building while at the same time a coterie of philosophers (led by J. L. Austin and Gilbert Ryle), working within the “meaning is use” premise articulated by Wittgenstein in Philosophical Investigations, began a systematic focus on how language is used in ordinary, everyday ways and how philosophical problems can be interpreted with regard to this knowledge base, but also how they interpolate on linguistic analysis, creating separate domains of meaning.

While we can easily see how relations of synonymy can be made to fit words “less often than not,” it is not quite as straightforward with regards to “the semantical equivalence of whole sentences” (Quine 1979, 2). So, while “a word is synonymous to a word or phrase if the substitution of the one for the other in a sentence always yields an equivalent sentence” we can only say that for sentences “they are equivalent if their use is the same … they are equivalent if their utterance would be prompted by the same stimulatory situations” (Quine 1979, 2).

Most book reviews do not require such a preface, and if they did, few would read them and even fewer would write them. In the case of Jody Azzouni’s Semantic Perception, it is respectful and necessary to take some time linking author to review reader. The work claims to show how we experience the meaning-properties of language independent of intentions and to reveal that (Azzouni 2015, 1 emphasis original):

human beings involuntarily experience certain physical items, certain products of human action, and certain human actions themselves, as having monadic meaning-properties: for example, as possessing meanings, as referring, or as having (or being capable of having) truth values … we (human beings) involuntarily see uttered words, among other things, as possessing certain monadic meaning-properties, and that we involuntarily see uttered sentences as possessing other (but related) monadic meaning-properties.

While many of us have an idea of what a monad might be, from the Greek monas, “singularity,” it is unlikely that most of us would see Azzouni’s technical meaning for “monadic meaning-properties” as straightforward. On my reading, meaning-properties is meaning amenable to analysis in chunks, so to speak, while monadic modifies this with regards to units of perceptual reality. In a sense, it would seem that he refers to “specifically real meaning units.” These are experienced “as properties of uttered words and sentences similar to how we perceive ordinary objects to have … shape and colour” (1) and they are neither contextual nor interactional; meaning is experienced independently of speaker intention. It is the agglomeration of “a large class of physical objects and human actions as possessing monadic meaning properties” (2) that forms what Azzouni calls the semantic perception view. Simply put, it is this view which Azzouni seeks to contrast against Paul Grice’s (H. P. Grice) work on meaning that relates to propositional-attitude psychology. Chris Daly provides an explanation of how we deal with words or sentences and psychological states in terms of meaning (2013, 7): “You have to read or hear words or sentences in order to understand them. But you do not have to visualize or hear thoughts in your head in order to understand them.” He points to John Searle’s distinction relating to “the derivative meaning of words and sentences and the intrinsic meaning of thoughts” and how it is that the quality of derivativeness is that they “get their meaning by being interpreted by someone.” Thoughts, on the other hand, “do not get their meaning by being interpreted by someone, and, more generally, they do not get their meaning from anything else.” What is of concern in the issue between Gricean approaches and Azzouni’s is how to better unravel the relations between different kinds of meaning and what, if any, priorities might take place between the two.

While the nuances of Grice’s work cannot be explored here, more should be said about what it is that Azzouni is seeking to clarify or reposition. Michael Morris is helpful here; he notes (Morris 2007, 249-250):

Grice’s ultimate aim is … to understand the everyday notion of meaning, which has much wider application than just to linguistic expressions. He begins by making a division within this general notion of meaning, between what he calls natural and what he calls non-natural meaning. As an example of natural meaning, we might suggest this: (1) Those spots mean that she has measles. And as an example of non-natural meaning, we might suggest this: (2) Three rings on the bell mean that the bus is full. Despite the similarity in form of these two statements of meaning, Grice thinks that there’s something fundamentally different going on in them .… Here, slightly differently put, are the basic marks of the difference Grice finds:

(i) In the case of natural meaning, “X means that p” implies that it is true that p (in our case, (1) implies that she really does have measles); this does not hold for non-natural meaning (so, in the case of (2), the bell might have been rung three times by mistake);
(ii) In the case of non-natural meaning, what follows “means that” could be put in quotation marks (the rings meant “the bus is full”); this is not possible with natural meaning;
(iii) Natural meaning can be understood as the significance of certain facts (such as the fact that she has spots), whereas non-natural meaning is concerned with the significance of certain objects or features of objects…
(iv) Statements of non-natural meaning of the form “X means that p” imply that somebody meant that p by X (in the case of (2), that somebody meant that the bus was full by three rings on the bell); but this is not the case with natural meaning.

Morris states that what we see here developing is teleological (goal-oriented); in the distinction is (2007, 250):

an intuitive argument for [Grice’s] account of linguistic meaning .… The kind of meaning involved in (2) is the meaning of something which is supposed to show something, in some sense: those three rings of the bell are there in order to show that the bus is full .… The fact that something is supposed to show that the bus is full allows that it can be faulty – in our example, that it can be produced even when the bus is not full. This same point explains why it is natural to express the meaning in quotation: the quotation isolates what seems to be shown from the actual facts. And it is objects, or features of objects, which have purposes – not facts.

As a non-expert, I hope I am right to say that it is at about this point that Azzouni begins to part company with the neo-Griceans who claim that public languages do not contain objects or events with monadic meaning-properties. He does not agree that language tokens used in communication are meaning inert and that they “are derived entirely from the intentions and mutual knowledge of their users” (2015, 3-4). Azzouni does not differ from the Griceans that “the apparent meaning-properties of public language entities must be derived from human psychology,” but he differs in how the derivation is explained, specifically “which psychological traits of human beings are relevant to understanding the effortless communication events we engage in”; both see the pure physicality of the written word or spoken sentence. Azzouni asks that we try to understand that the semantic perception view differs from all forms of Gricean analysis in that it holds that we involuntarily experience these things as having monadic meaning-properties and this sense of experience is “as items that refer, and that are meaningful.” Azzouni argues that it is not necessary to systematically deploy communicative intentions and expectations or notions of mutual knowledge “because if two people involuntarily experience an uttered sentence as monadically meaning something, then that perceived meaning is the default experience of what that uttered sentence means … the uttered sentence is experienced as meaning what it’s perceived to mean by virtue of its own meaning-properties.”

Azzouni self-declaredly seeks to modify the semantic-pragmatic apparatus, specifically “what is said” and “implicature” (2015, 4):

to show how Gricean assumptions about the centrality of mutual knowledge and communicative intentions to the phenomena of perceived meaning-properties, badly distort the ordinary folk-psychological attributions of intentions, beliefs and expectations, as well as, ordinary intuitions about what is said [and] what is not said.

The semantic perception view “explains and sustains the ordinary phenomenology of the experience of understanding language, whereas Gricean and neo-Gricean views instead consistently distort or attempt to explain away this phenomenology.” On top of this, intentions and expectations of speakers “play a constitutive role in [the] experience of meaning” for Griceans while Azzouni’s approach sees these as just “ancillary” (5).

Azzouni’s general metaphysical approach here is nominalist, and this leads to a denial of the common-sense proposition that we speak a common language, such as English. His view is that this would imply the existence of types and there are no types. As he makes clear, ontologically speaking “all they are – are specific communication events: actions taken by people during which they produce noises and experience an understanding of one another” (a range of artifacts accompanies this). He also buttresses his arguments with the notion that “there are no physical objects in the world with meaning-properties of any kind” or an object cannot mean “some other thing in the way that we experience words to so mean what they refer to”; reference relationships are projections “by persons who so experience words as so referring to things.”

The semantic perception view rests on at least a partial requirement to agree with Azzouni that ordinary people have a “psychologically involuntary misapprehension” (6); as a result of “how they involuntarily experience language phenomena, they are impelled to think the words and sentences of their language have an interlocked system of properties that words and sentences don’t have.” Our experience of objects and events is one “endowed with monadic properties that they don’t have.” Finally, Azzouni explains the “disconnect” he has identified between the experience of language objects and events and “the meaning properties that our subpersonal language faculties project onto those objects and events.” Both are projections (lacking meaning-properties), but the projections emerge from “different ‘faculties’ of mind” (7).

Interspersed within Semantic Perception are helpful methodological interludes. The first makes broader connections to the central question of how to, or how we might better, understand the notion of “what is said” and “what is implicated” in the transaction of coming to grips with what is said. Variations of Gricean theories that seek to “reveal the operations of certain psychological mechanisms in the participants of language transactions” (167) and the effect these have on a subject’s transactional experience and why they take on unique forms are supposed to be scientific. They should allow for changes in parameters to change resulting phenomena. For Azzouni, the variations on Gricean approaches that claim to deal adequately with “our experiences of what’s meant by our sentences and by us when we utter those sentences” need to locate “the mechanisms that make differences in those specific experiences” (168). Description of the phenomena in question is crucial, as is locating the relationship between measurable regularities and appropriate laws for physical effect, but the phenomena themselves, the appearance, also must be saved. This connects with the systematic regularities that are of concern to theorising; they save a given phenomenon’s appearance, which may include aspects that do not reinforce or sustain the theory. Azzouni’s example is of a phenomenon that is due to interaction of multiple forces, “as one gets a workable theory of the effect of certain forces, the remaining unexplained aspects of the phenomenon become evidence for the nature of the remaining forces” (169n). The Gricean literature on communicative transactions departs from the need for explanation of what “relatively robust regularities” require characterization in theory development, according to Azzouni. These regularities are “what speaker-hearers experience what they say to one another to mean as well as their various usage patterns” (169).

Azzouni claims that it is possible to show how meaningfulness can “be induced by the sheer shape of designs, despite the knowledge that no agents are involved” and that this undermines approaches that centralize speaker intentions as bases for “factors inducing meaningfulness” (170). So-called cherry-picking approaches to semantic warrant, where “what he said was …” approaches are taken as indicative of “what is said,” ignore that the latter is, in Azzouni’s view, always characterized as the locus of what is said, and it is not the individual who produces the event. Context shifting arguments (Cappelen and Lepore’s method that intuitions appealed to are semantic and that context sensitivity of expressions are a part of our language and are non-obvious) “ignore stark intuitive differences between what is said and what is implicated,” or, they see context in what is said as fairly uncontroversially “unbounded,” essentially ignoring empirical experience and phenomena.

For Azzouni, what is said is experienced involuntarily and while contextual factors play a part, the speaker-hearer is usually oblivious to this, their effects having a limited range. Speaker-hearers do not “experience what is said as due to the intentions of the speaker” but rather as something from which intentions can be inferred. Contextual factors, like gestures, aid recognition of speaker intentions “as a constitutive part of what is implicated” (171). In the consciously accessible of what is said and implicated, data associated with the evaluation must remain open; it is not only the favourable that should contribute to building the groundwork of theory on semantic transaction and meaning. Azzouni believes there are problems with how semantics and pragmatics relate to cognitive processing theories. Linguistic phenomena, he says, can fall under both categories and examples such as “John has had nine girlfriends” allow for readings associated with “at least nine” (semantic) or “exactly nine” (through conversational implicature a pragmaticist reading). This emerges from a tendency to semantically treat the sentence as ambiguous and to seek a parsimonious outcome. Whether semantic or pragmaticist in orientation there is a related cognitivist demarcation of territory that aligns with both, a further brain-body distinction emerges in turn and semantics and pragmatics demarcate different capacities.

deavor. Tractable theories of what is said have a tendency to derail at a level of analysis of, for instance, the semantic property of a particular class of language types or when restricted to truth conditional context; capturing regularities is problematic and indicative of how the linguistic framework struggles for explanatory fruitfulness. Azzouni transfers his wager to how cognitive science might better characterise “various modularized cognitive processes specific to language” (174).

We suffer from a tendency to allow our primitive referential intuitions, say to a stick figure or a smiley face, to be easily triggered as real “but not” socially ontological, as constitutively recognizable in “who’s smiling?” (that face) “but not” of “who’s that” (it’s not a being it’s a drawing). Azzouni asks we acknowledge how these intui-
tions conflict with the sensible treatment of reference Semantic approaches are not necessarily simpler, be- where it arises semantically-pragmatically or when “refer- cause Azzouni claims this assumes one predefines a result ence is always to items in domains of discourse of some that “can only be empirically established” (172); he sees sort (such as mental spaces or resource situations) that recourse to an Occamite “theoretical virtue” as emblem- are given contextually or otherwise.” It is not possible to atic of a more tainted approach, one that ignores how find the intuitional within the semantic-pragmatic model evolutionary theory (172): as something as simple as a shape can trigger the referen- tial intuition and can do so “in the absence of any con- has shown that the engineering virtues of simple textual or semantic elements that justify the involvement designs that straightforwardly handle various func- of a domain of discourse” (174). tions are almost never exemplified by biologically Azzouni argues that use content in conversation has evolved designs ... instead [we find] peculiarly com- an involuntary character (in rapid conversation) such that plex designs that manifest all sorts of unneeded what is said “appears as a monadic property of the ut- complications, the reasons for the presence of tered expression” (175) while the implication is a machi- which are explained by contingent historical devel- nation, a site of recognition and inference to our inter- opments. locutor’s “ingenuity” and “agenda.” Rather than being a semantic or pragmatics-oriented problem, Azzouni bets Demarcation over whether what is said is semantic or has on it having a basis in “central as opposed to cognitive a pragmaticist component is finding less neurophysiologi- processing” as context has so little influence on what is cal justification (novi scientiam), according to Azzouni, than said (in his construction). 
explanatory weight in theory-virtues which are, in turn, Azzouni claims that we can conceive of semantic- characterised by their own methodological inadequacy re- pragmatics theories as “top-down autonomous special- lated to their trade in simplicity and, by extension, foun- science theories,” which do not need us to “account for all dationalism. the phenomena that arise” but perhaps only to situations Azzouni does not object to the efficacy of these in which “the triggering of reference intuitions is accom- methodological tools. That they have clear warrant to be panied by an appropriate domain of discourse.” An ex- applied shows, it seems, that semantics and pragmatics re- planation is offered such that referential figures (like the flect the hermeneutic rather than the “genuine neurocog- stick man or smiley face) can be included in a semantic- nitive-neurophysiological aspects of us” (172). Azzouni pragmatics theory that allows these to be “referred to de- takes us on another journey, based on the assumption just spite the absence of a domain of discourse” in support. elucidated; we must consider how, when the same phe- Reference is likely to, as a theoretical construct, require at nomena are classified as either semantic or pragmatic, least a changed formulation from “a collection of objects “the only relevant factor to how a linguistic phenomenon referred to” (175). We then expunge such cases as refer- is to be classified is whether the resulting pair of theories ence (for ease) as reference has special (and appropriate) manifest certain user-friendly virtues.” Any attempt at a meaning in semantics. 
We hold still to the possibility common-sense attribution of a particular phenomenon though, in this framework, to how “this particular carving as a semantic or pragmaticist type from the neurocogni- out of the data is one that wouldn’t be respected as a cer- tive-neurophysiological point of view is a Quixotic en- tain cognitive processing level if it were discovered that 152 Knowl. Org. 46(2019)No.2 Review there is a kind of module that (on the basis of certain in- there” is reasonably answered as “everything” this is not puts) generates ‘reference’” (176). Azzouni claims that at to settle disagreement over cases. What has become this point reference becomes the term appropriate to the Quine’s “ontological commitment” (Bricker 2014, 1.1) underlying cognitive science. He claims that references might be processed, cognitively, with “referents” finding allowed one to measure the ontological cost of their place, as a result, in a domain of discourse. Where theories, an important component in deciding we cannot position or apportion to a domain of discourse which theories to accept; it thus provided a partial the “intuitions of reference would ‘idle’” he says, and this foundation for theory choice. Moreover, once one is similar to how questions of who is the stick man or had settled on a total theory, it allowed one to de- who is the smiley face languish as bizarre formulations termine which components of the theory were re- without resolution. Semantic notions of reference are, for sponsible for its ontological costs. [It also] played a Azzouni, “echoed by a corresponding term in the underly- polemical role. 
It could be used to argue that oppo- ing science … [they] can be characterised as reference plus nents' theories were more costly than the theorists additional conditions.” admitted … [and] it could be used to advance a tra- David Manley describes metametaphysics as inquiry in- ditional nominalist agenda because, as Quine saw it, to a range of unknowns, such as whether “questions of ordinary subject-predicate sentences carry no onto- metaphysics really have answers … are these answers sub- logical commitment to properties or universals. stantive or just a matter of how we use words? And what is the best procedure for arriving at them—common Bricker outlines Quine’s central claim made in “On What sense? Conceptual analysis? Or assessing competing hy- There Is”—it is that a “theory is committed to those and potheses with quasi-scientific criteria?” (2009, 1). Jody Az- only those entities to which the bound variables of the zouni’s latest work, Ontology Without Borders, is a work that theory must be capable of referring in order that the af- contrasts these questions with more traditional metaphysi- firmations made in the theory be true” (1948, 33). As cal questions relating to ontology. 
briefly as possible, Bricker’s explanation is summarised One of the more interesting things found ex vulgus sci- below (Bricker 2014, 1.1 emphasis original): entia in undertaking this review was that as of December 2018 there was no Google record for “relationship of on- The criterion should be understood as applying to tology to metaphysics” and neither is there one for “rela- theories primarily, and to persons derivatively by tionship of metaphysics to ontology.” The subject matter way of the theories they accept .… It is important of the search query is, however, not foreign to the world to note at the start that Quine’s criterion is descrip- encompassed by Google (just semantically distant); we tive; it should not be confused with the prescriptive get in search relata that ontology is a sub-field of meta- account of ontological commitment that is part of physics and that the former encompasses existence, while his general method of ontology. That method, the latter encompasses reality. Achille C. Varzi is helpful roughly, is this: first, regiment the competing theo- in laying out how Quine’s approach to this question has ries in first-order predicate logic; second, determine become standard, it says (2011, 407) which of these theories is epistemically best (where what counts as “epistemically best” depends in part ontology is concerned with the question of what on pragmatic features such as simplicity and fruit- entities exist (a task that is often identified with that fulness); third, choose the epistemically best theory. of drafting a “complete inventory” of the universe) We can then say: one is ontologically committed to those whereas metaphysics seeks to explain, of those en- entities that are needed as values of the bound vari- tities, what they are (i.e., to specify the “ultimate na- ables for this chosen epistemically best theory to be ture” of the items included in the inventory). true. 
Put like this, the account may seem circular: ontological commitment depends on what theories Azzouni expects readers to have significant knowledge of are best, which depends in part on the simplicity, the debates within philosophy of language and, as a re- and so the ontological commitments, of those the- sult, like Semantic Perception, this is not a work for novices ories. But there is no circularity in Quine’s ontologi- unless they intend to engage in some reasonably extensive cal method. The above account of ontological contextual background reading with the text. Readers are commitment is prescriptive, and applies to persons, asked, once more, to be patient with the reviewer for a not to theories. What entities we ought to commit our- short digression which, hopefully, will aid in a clearer selves to depends on a prior descriptive account of overview. In “On What There Is,” Quine (1948, 21) what entities theories are committed to. pointed out that while an answer to the question “what is Knowl. Org. 46(2019)No.2 153 Review

In an earlier work, “Freeing Talk of Nothing from the (1979, 1) noted ironically, “semantics … or the theory of Cognitive Illusion of Aboutness,” Azzouni claims that re- meaning, is a vitally important subject, despite the disrep- jecting Quine’s criterion “yields the neutralist interpreta- utable character of its ostensible subject matter.” We tion of the quantifiers” (2014, 443). Uzquiano describes would all be better off, or at least we should be less- what quantifiers are (Uzquiano 2018, Introduction): dogmatic reasoners, with a more attuned knowledge of semantics and its encompassing and derivative fields. This Quantifier expressions are marks of generality. cannot ever be a bad thing: “ ¬ (P →Q).” They come in a variety of syntactic categories in English, but determiners like “all,” “each,” “some,” References “many,” “most,” and “few,” provide some of the most common examples of quantification. In Eng- Aparecida Moura, Maria. 2014. “Emerging Discursive lish, they combine with singular or plural nouns, Formations, Folksonomy and Social Semantic Infor- sometimes qualified by adjectives or relative clauses, mation Spaces (SSIS): The Contributions of the Theo- to form explicitly restricted quantifier phrases such ry of Integrative Levels in the Studies carried out by as “some apples,” “every material object,” or “most the Classification Research Group (CRG).” Knowledge planets.” These quantifier phrases may in turn Organization 41: 304-10. combine with predicates in order to form sentences Azzouni, Jody. 2014. Freeing Talk of Nothing from the such as “some apples are delicious,” “every material Cognitive Illusion of Aboutness. The Monist 97: 443-59. object is extended,” or “most planets are visible to Azzouni, Jody. 2015. Semantic Perception: How the Illusion of the naked eye.” a Common Language Arises and Persists. New York, NY: Oxford University Press. Azzouni takes an opposing view to Quine’s criterion re- Azzouni, Jody. 2017. Ontology Without Borders. 
New York, sulting in what he says is the only possible conclusion, NY: Oxford University Press. that quantifiers have a neutral interpretation (in natural Bricker, Phillip. 2016. “Ontological Commitment.” In The language and formal senses). In this interpretation, “all Stanford Encyclopedia of Philosophy, ed. Edward N. Zalta. quantifiers, regardless of the semantics they are endowed https://plato.stanford.edu/entries/ontological-com with, are open to being additionally supplemented with mitment/ various metaphysical conditions” (2014, 443). Further- Budd, John. M. 2011. “Meaning, Truth, and Information: more, he states, “quantifier neutralism allows into logical Prolegomena to a Theory.” Journal of Documentation 67: space a position that describes our use of quantifiers (and 56-74. our accompanying thought) as operating in a more pure Conant, James. 2006. “Wittgenstein’s Later Criticism of metaphysically-deflated way. This is that, despite appear- the Tractatus.” In Wittgenstein: The Philosopher and His ances, there is nothing that terms and the quantifiers linked Works, ed. Alois Pichler and Simo Säätelä. Publications to those terms are about” (444). of the Austrian Ludwig Wittgenstein Society, New Se- For all of us with an interest in knowledge organiza- ries 2. Warschau/Berlin: De Gruyter Open, 172-204. tion, questions of aboutness sit high in the pantheon of Dahlberg, Ingetraut. 1992. “Knowledge Organization and topical subjects for investigation. Quantifier neutralism Terminology: Philosophical and Linguistic Bases.” In- and the associated nominalist positions that scholarship, ternational Classification 19: 65-71. such as Azzouni’s, seems to support offers a promising Daly, Chris. 2013. Philosophy of Language: An Introduction. area of interdisciplinary inquiry that with few exceptions London and New York: Bloomsbury. (Aparecida Moura 2014; Budd 2011; Dahlberg 1992; Hjørland, Birger. 1998. 
“Information Retrieval, Text Hjørland 1998, 2008; Holma 2005; Jaeneeke 1998; Composition, and Semantics.” Knowledge Organization Lingard 2012; Mazzocchi, Tiberi, De Santis and Plini 25: 16-31. 2007; Mazzocchi and Tiberi 2009; Silva Saldanha 2014; Hjørland, Birger. 2008. “Semantics and Knowledge Organ- Scheibe 1996) has not developed in any real autochtho- ization.” Annual Review of Information Science and Technology nous mode to date within information science. While the 41: 367-405. https://doi.org/10.1002/aris.2007.144041 detailed disputes of philosophers of language, logicians 0115 and their linguistics interlocutors will be a bridge too far Holma, Baiba. 2005. “The Reference System of Relation- (an act of overreach) for most IS theorists, there is a lot al Semantics in Knowledge Organizational Systems.” to be learned even at the foundational level that can help In La dimensió humana de l'organització del coneixement: So- to promote work across, within and through classifica- ciedad Internacional para la Organización del Conocimiento. tion, thesauri development, folksonomy, abstraction, Capítulo Español, Congreso (7. 2005. Barcelona), ed. Jesús knowledge representation and domain analysis. As Quine Gascón García, Ferrán Burguillos Martínez, Amadeu 154 Knowl. Org. 46(2019)No.2 Review

Pons i Serra. Barcelona: Universitat de Barcelona, Fac- by Peter Kruse and Michael Stadler. Springer Series in ultat de Biblioteconomia i Documentació, 296-308. Synergetics v. 64. Berlin and Heidelberg: Springer- https://dialnet.unirioja.es/servlet/articulo?codigo=29 Verlag, 1-2. 66303 Russell, Bertrand. 1922. “Introduction.” In Wittgenstein, Jaenecke, Peter. 1996. “Elementary Principles for Repre- Ludwig, Tractatus Logico-Philosophicus. London: Kegan senting Knowledge.” Knowledge Organization 23: 88-l02. Paul, Trench, Treubner & Co. Lingard, Robert. G. 2013. “Information, Truth and Scheibe, Erhard. 1996. “Calculemus! The Problem of the Meaning: A Response to Budd’s Prolegomena.” Journal Application of Logic and Mathematics.” Knowledge Or- of Documentation 69: 481-99. https://doi.org/10.1108/ ganization 23: 67-76. JD-01-2012-0010 Silva Saldanha, Gustavo. 2014. “The Philosophy of Lan- Manley, David. J. 2009. “Introduction: A Guided Tour of guage and Knowledge Organization in the 1930s: Metaphysics.” In Metametaphysics: New Essays on the Pragmatics of Wittgenstein and Ranganathan.” Know- Foundations of Ontology, ed. David. J. Chalmers, David ledge Organization 41: 296-303. Manley and Ryan Wasserman. New York: Oxford Varzi, Achille. C. 2011. “On Doing Ontology Without University Press, 1-37. Metaphysics.” Philosophical Perspectives 25: 407–23. Mazzocchi, Fulvia, Melissa Tiberi, Barbara De Santis and Von Wright, Georg Henrik. 1955. “Ludwig Wittgenstein: Paolo Plini. 2007. “Relational Semantics in Thesauri: A Biographical Sketch.” The Philosophical Review 64: Some Remarks at Theoretical and Practical Levels.” 527-45. Knowledge Organization 34: 197-214. Wittgenstein, Ludwig. 1922. Tractatus Logico-Philosophicus. Mazzocchi, Fulvia and Melissa Tiberi. 2009. “Knowledge London: Kegan Paul, Trench, Treubner & Co. Organization in the Philosophical Domain: Dealing Quine, Willard Van Orman. 1948. 
“On What There Is.” with Polysemy in Thesaurus Building.” Knowledge Or- The Review of Metaphysics 2: 21-38. ganization 36: 103-12. Quine, Willard Van Orman. 1979. “Use and Its Place in Morris, Michael. 2006. An Introduction to the Philosophy of Meaning.” In Meaning and Use, ed. Avishai Margalit. Language. Cambridge: Cambridge University Press. Dordrecht: D. Reidel, 1-8. Pelletier, Francis Jeffry. 1994. “The Principle of Semantic Compositionality.” Topoi 13: 11-24. https://doi.org/10. Matthew Kelly 1007/BF00763644 School of Information Technology and Mathematics Roth, Gerhard. 1995. “Introduction.” In Ambiguity in University of South Australia Mind and Nature: Multistable Cognitive Phenomena, edited [email protected]

ERRATUM:

Xu, Liwei and Jiangnan Qiu. 2019. “Unsupervised Multi-class Sentiment Classification Approach.” Knowledge Organization 46(1): 15-32. 64 references. DOI:10.5771/0943-7444-2019-1-15.

The authors appreciate the comments by anonymous reviewers and editors, which are helpful for improving the full paper. The work described in this paper has been supported by the National Natural Science Foundation of China (Nos. 71573030 and 71533001) and the Social Science Planning Fund Key Program, Liaoning Province (No. L15AGL017).


Books Recently Published Compiled by J. Bradford Young

DOI:10.5771/0943-7444-2019-2-155

Adorno, Theodor. 2019. Ontology and Dialectics: 1960/61, trans. Nicholas Walker. Cambridge, UK: Polity Press. Translation of Ontologie und Dialektik.
Akkach, Samer, ed. 2019. ʻIlm: Science, Religion and Art in Islam. [South Australia]: University of Adelaide Press.
Bangura, Abdul Karim. 2019. Falolaism: The Epistemologies and Methodologies of Africana Knowledge. Durham, NC: Carolina Academic Press.
Cesario, Marilina and Hugh Magennis. 2019. Aspects of Knowledge: Preserving and Reinventing Traditions of Learning in the Middle Ages. Manchester: Manchester University Press.
Corea, Francesco. 2019. An Introduction to Data: Everything You Need to Know About AI, Big Data and Data Science. Studies in Big Data 50. Cham, Switzerland: Springer.
Flanagan, Thomas and Craig H. Lindell, eds. 2019. The Coherence Factor: Linking Emotion and Cognition When Individuals Think as a Group. Charlotte, NC: Information Age Publishing.
Gendler, Tamar Szabó and John Hawthorne, eds. 2019. Oxford Studies in Epistemology. Volume 6. New York, NY: Oxford University Press.
Givre, Charles and Paul Rogers. 2019. Learning Apache Drill: Query and Analyze Distributed Data Sources with SQL. Sebastopol, CA: O’Reilly Media.
Hepworth-Sawyer, Russ, Jay Hodgson, Justin Paterson, and Rob Toulson, eds. 2019. Innovation in Music: Performance, Production, Technology, and Business. New York, NY: Routledge.
Hetherington, Stephen and Markos Valaris, eds. 2019. Knowledge in Contemporary Philosophy. London: Bloomsbury Academic.
Hu, Allen H., Mitsutaka Matsumoto, Tsai Chi Kuo, and Shana Smith, eds. Technologies and Eco-innovation towards Sustainability. II, Eco Design Assessment and Management. Singapore: Springer.
Jo, Taeho. 2019. Text Mining: Concepts, Implementation, and Big Data Challenge. Studies in Big Data 45. [Cham?]: Springer International
Kompatsiaris, Ioannis, Benoit Huet, Vasileios Mezaris, Cathal Gurrin, Wen-Huang Cheng, and Stefanos Vrochidis, eds. 2019. MultiMedia Modeling: 25th International Conference, MMM, Thessaloniki, Greece, January 8-11, 2019 Proceedings. Part I. Cham, Switzerland: Springer.
Kumar, Akshi. 2019. Web Technology: Theory and Practice. Boca Raton, FL: CRC Press.
Lagerlund, Henrik, ed. 2019. Knowledge in Medieval Philosophy. London: Bloomsbury Academic.
Lebanidze, Giorgi. 2019. Hegel’s Transcendental Ontology. Lanham, MD: Lexington Books.
Losee, Robert M. 2019. Predicting Information Retrieval Performance. [San Rafael, CA]: Morgan & Claypool.
Lowenthal, David. 2019. Quest for the Unity of Knowledge. Abingdon, Oxon.: Routledge.
Lozada, María Cecilia and Henry Tantaleán, eds. 2019. Andean Ontologies: New Archaeological Perspectives. Gainesville: University Press of Florida.
Marchesan, Eduardo and David Zapero, eds. 2019. Context, Truth, and Objectivity: Essays on Radical Contextualism. New York: Routledge.
Michel, Johann. 2019. Homo Interpretans: Towards a Transformation of Hermeneutics, trans. David Pellauers. London: Rowman & Littlefield International. Translation of Homo interpretans.
Nicholas D. Smith, ed. 2019. Knowledge in Ancient Philosophy. London: Bloomsbury Academic.
Nichols, Tom. 2019. The Death of Expertise: The Campaign Against Established Knowledge and Why it Matters. New York, NY: Oxford University Press.
Olshin, Benjamin B. 2019. Lost Knowledge: The Concept of Vanished Technologies and Other Human Histories. Leiden: Brill.
Shaowen, Wang and Michael F. Goodchild, eds. 2019. CyberGIS for Geospatial Discovery and Innovation. Dordrecht: Springer.
Tate, Marsha Ann. 2019. Web Wisdom: How to Evaluate and Create Information Quality on the Web. 3rd ed. Boca Raton, FL: CRC Press.
Vine, Angus. 2019. Miscellaneous Order: Manuscript Culture and the Early Modern Organization of Knowledge. Oxford: Oxford University Press.
Welch, Shay. 2019. The Phenomenology of a Performative Knowledge System: Dancing with Native American Epistemology. Basingstoke, Hampshire: Palgrave Macmillan.
Wilson-Hokowhitu, Nālani, ed. 2019. The Past Before Us: moʻokūʻauhau as Methodology. Honolulu: University of Hawaiʻi Press.


Publisher

Ergon – ein Verlag in der Nomos Verlagsgesellschaft mbH
Waldseestraße 3-5
D-76530 Baden-Baden
Tel. +49 (0)7221-21 04-667
Fax +49 (0)7221-21 04-27
Sparkasse Baden-Baden Gaggenau
IBAN: DE05 6625 0030 0005 0022 66
BIC: SOLADES1BAD

Editor-in-chief (Editorial office)

KNOWLEDGE ORGANIZATION
Journal of the International Society for Knowledge Organization
Richard P. Smiraglia, Editor-in-Chief
[email protected]

Instructions for Authors

Manuscripts should be submitted electronically (in Microsoft® Word format) in English only via ScholarOne at https://mc04.manuscriptcentral.com/jisko. Manuscripts that do not adhere to these guidelines will be returned to the authors for resubmission in proper form.

Manuscripts should be accompanied by an indicative abstract of approximately 250 words. Manuscripts of articles should fall within the range 6,000-10,000 words. Longer manuscripts will be considered on consultation with the editor-in-chief.

A separate title page should include the article title and the author’s name, postal address, and E-mail address. Only the title of the article should appear on the first page of the text. Contact information must be present for all authors of a manuscript.

To protect anonymity, the author’s name should not appear on the manuscript.

Criteria for acceptance will be appropriateness to the field of knowledge organization (see Scope and Aims), taking into account the merit of the contents and presentation. It is expected that all successful manuscripts will be well-situated in the domain of knowledge organization, and will cite all relevant literature from within the domain. Authors are encouraged to use the KO literature database at http://www.isko.org/lit.html.

The manuscript should be concise and should conform to professional standards of English usage and grammar. Authors whose native language is not English are encouraged to make use of professional academic English-language proofreading services. We recommend Vulpine Academic Services ([email protected]).

Manuscripts are received with the understanding that they have not been previously published, are not being submitted for publication elsewhere, and that if the work received official sponsorship, it has been duly released for publication. Submissions are refereed, and authors will usually be notified within 6 to 8 weeks.

Under no circumstances should the author attempt to mimic the presentation of text as it appears in our published journal. Instead, please follow these instructions.

In Microsoft® Word please set the language preference (“Tools,” “Language”) to “English (US)” or “English (UK).”

The entire manuscript should be double-spaced, including notes and references.

The text should be structured with decimally-numbered subheadings (1.0, 1.1, 2.0, 2.1, 2.1.1, etc.). It should contain an introduction, giving an overview and stating the purpose, a main body, describing in sufficient detail the materials or methods used and the results or systems developed, and a conclusion or summary.

Author-generated keywords are not permitted.

Footnotes are not allowed. Endnotes are accepted only in rare cases and should be limited in number; all narration should be included in the text of the article. Do not use automatic footnote formatting. Instead, insert a superscript numeral (Format, Font, Superscript) and create the text of the note manually in a separate list at the end of the manuscript, before the reference list.

Paragraphs should include a topic sentence, a developed narrative and a conclusion; a typical paragraph has several sentences. Paragraphs with tweet-like characteristics (one or two sentences) are inappropriate.

Italics are permitted only for phrases from languages other than English, and for the titles of published works. Bold type is not permitted.

Em-dashes should not be used as substitutes for commas. Dashes must be inserted manually (Insert, Advanced Symbol, Em-dash) with no spaces on either side.

Do not use automatic formatting of any kind. To indent, use the ruler. Do not use tabs under any circumstances. For a bulleted list, indent the list using the ruler, then insert bullets (Insert, Advanced Symbol, bullet). Do not use automatically-numbered paragraphs.

Illustrations should be embedded within the document. Photographs (including color and half-tone) should be scanned with a minimum resolution of 600 dpi and saved as .jpg files. Tables should contain a number and caption at the bottom, and all columns and rows should have headings. All illustrations should be cited in the text as Figure 1, Figure 2, etc. or Table 1, Table 2, etc.

Examples of classification arrays should be configured as figures and set into the document as jpgs; they should not be entered as editable text.

Remove all active hyperlinks, including those from reference formatting software (if hovering over the text with a mouse produces a gray highlight, the text is hyperlinked; remove the link “Insert,” “Hyperlink,” “Remove link”).

Reference citations within the text should have the form: (Author year). For example, (Jones 1990). Specific page numbers are required for quoted material, e.g. (Jones 1990, 100). A citation with two authors would read (Jones and Smith 1990); three or more authors would be: (Jones et al. 1990). When the author is mentioned in the text, only the date and optional page number should appear in parentheses: “According to Jones (1990), …” or “Smith wrote (2010, 146): ….” A subsequent page reference to the same cited work (e.g., to Smith 2010) should have the form “(229).” There is never a comma before the date. In-text citations should not be routinely placed at the end of a sentence or after a quotation, but an attempt should be made to work them into the narrative. For example:

“Jones (2010, 114) reported statistically significant results.
“Many authors report similar data; according to Matthews (2014, 94): “all seven studies report means within ±5%.”

In-text citations should precede block quotations, and never are placed at the end of a block-quotation.

References should be listed alphabetically by author at the end of the article. Reference lists should not contain references to works not cited in the text. Websites mentioned in passing in the text should be identified parenthetically with their URLs but not with references unless a specific page of a specific website is being quoted.

Author names should be given as found in the sources (not abbreviated, but also not fuller than what is given in the source). Journal titles should not be abbreviated. Multiple citations to works by the same author should be listed chronologically and should each include the author’s name. Articles appearing in the same year should have the following format: “Jones 2005a, Jones 2005b, etc.”

Proceedings must be identified fully by title, editor, and details of publication.

Journal issue numbers are given only when a journal volume is not through-paginated.

References for published electronic resources should be accompanied by either a URL or DOI but not in lieu of actual publication data; access dates are not allowed.

Unpublished electronic resources may use an access date in lieu of a date of publication. In cases of doubt, authors are encouraged to consult The Chicago Manual of Style 17th ed. (or online), author-date reference system (chapter 15).

Examples:

Dahlberg, Ingetraut. 1978. “A Referent-Oriented, Analytical Concept Theory for INTERCONCEPT.” International Classification 5: 142-51.
Howarth, Lynne C. 2003. “Designing a Common Namespace for Searching Metadata-Enabled Knowledge Repositories: An International Perspective.” Cataloging & Classification Quarterly 37, nos. 1/2: 173-85.
Pogorelec, Andrej and Alenka Šauperl. 2006. “The Alternative Model of Classification of Belles-Lettres in Libraries.” Knowledge Organization 33: 204-14.
Schallier, Wouter. 2004. “On the Razor’s Edge: Between Local and Overall Needs in Knowledge Organization.” In Knowledge Organization and the Global Information Society: Proceedings of the Eighth International ISKO Conference 13-16 July 2004 London, UK, edited by Ia C. McIlwaine. Advances in knowledge organization 9. Würzburg: Ergon Verlag, 269-74.
Smiraglia, Richard P. 2001. The Nature of ‘a Work’: Implications for the Organization of Knowledge. Lanham, Md.: Scarecrow.
Smiraglia, Richard P. 2005. “Instantiation: Toward a Theory.” In Data, Information, and Knowledge in a Networked World; Annual Conference of the Canadian Association for Information Science … London, Ontario, June 2-4 2005, ed. Liwen Vaughan. http://www.cais-acsi.ca/2005proceedings.htm.

Upon acceptance of a manuscript for publication, authors must provide a digital photo and a one-paragraph biographical sketch (fewer than 100 words). The photograph should be scanned with a minimum resolution of 600 dpi and saved as a .jpg file.

© Ergon – ein Verlag in der Nomos Verlagsgesellschaft, Baden-Baden 2019. All Rights reserved.

KO is published by Ergon.

Annual subscription 2019:
– Print + online (8 issues/ann.; unlimited access for your Campus via Nomos eLibrary) € 359,00/ann.
– Prices do not include postage and packing
– Cancellation policy: Termination within 3 months’ notice to the end of the calendar year


Scope

The more scientific data is generated in the impetuous present times, the more ordering energy needs to be expended to control these data in a retrievable fashion. With the abundance of knowledge now available the questions of new solutions to the ordering problem and thus of improved classification systems, methods and procedures have acquired unforeseen significance. For many years now they have been the focus of interest of information scientists the world over.

Until recently, the special literature relevant to classification was published in piecemeal fashion, scattered over the numerous technical journals serving the experts of the various fields such as:

philosophy and science of science
science policy and science organization
mathematics, statistics and computer science
library and information science
archivistics and museology
journalism and communication science
industrial products and commodity science
terminology, lexicography and linguistics

Beginning in 1974, KNOWLEDGE ORGANIZATION (formerly INTERNATIONAL CLASSIFICATION) has been serving as a common platform for the discussion of both theoretical background questions and practical application problems in many areas of concern. In each issue experts from many countries comment on questions of an adequate structuring and construction of ordering systems and on the problems of their use in opening the information contents of new literature, of data collections and survey, of tabular works and of other objects of scientific interest. Their contributions have been concerned with

(1) clarifying the theoretical foundations (general ordering theory/science, theoretical bases of classification, data analysis and reduction)
(2) describing practical operations connected with indexing/classification, as well as applications of classification systems and thesauri, manual and machine indexing
(3) tracing the history of classification knowledge and methodology
(4) discussing questions of education and training in classification
(5) concerning themselves with the problems of terminology in general and with respect to special fields.

Aims

Thus, KNOWLEDGE ORGANIZATION is a forum for all those interested in the organization of knowledge on a universal or a domain-specific scale, using concept-analytical or concept-synthetical approaches, as well as quantitative and qualitative methodologies. KNOWLEDGE ORGANIZATION also addresses the intellectual and automatic compilation and use of classification systems and thesauri in all fields of knowledge, with special attention being given to the problems of terminology.

KNOWLEDGE ORGANIZATION publishes original articles, reports on conferences and similar communications, as well as book reviews, letters to the editor, and an extensive annotated bibliography of recent classification and indexing literature.

KNOWLEDGE ORGANIZATION should therefore be available at every university and research library of every country, at every information center, at colleges and schools of library and information science, in the hands of everybody interested in the fields mentioned above and thus also at every office for updating information on any topic related to the problems of order in our information-flooded times.

KNOWLEDGE ORGANIZATION was founded in 1973 by an international group of scholars with a consulting board of editors representing the world’s regions, the special classification fields, and the subject areas involved. From 1974-1980 it was published by K.G. Saur Verlag, München. Back issues of 1978-1992 are available from ERGON-Verlag, too.

As of 1989, KNOWLEDGE ORGANIZATION has become the official organ of the INTERNATIONAL SOCIETY FOR KNOWLEDGE ORGANIZATION (ISKO) and is included for every ISKO-member, personal or institutional in the membership fee.

Annual subscription 2019: Print + online (8 issues/ann.; unlimited access for your Campus via Nomos eLibrary) € 359,00/ann. Prices do not include postage and packing. Cancellation policy: Termination within 3 months’ notice to the end of the calendar year

Ergon – ein Verlag in der Nomos Verlagsgesellschaft mbH, Waldseestraße 3-5, D-76530 Baden-Baden, Tel. +49 (0)7221-21 04-667, Fax +49 (0)7221-21 04-27, Sparkasse Baden-Baden Gaggenau, IBAN: DE05 6625 0030 0005 0022 66, BIC: SOLADES1BAD

Founded under the title International Classification in 1974 by Dr. Ingetraut Dahlberg, the founding president of ISKO. Dr. Dahlberg served as the journal’s editor from 1974 to 1997, and as its publisher (Indeks Verlag of Frankfurt) from 1981 to 1997.

The contents of the journal are indexed and abstracted in Social Sciences Citation Index, Web of Science, Information Science Abstracts, INSPEC, Library and Information Science Abstracts (LISA), Library, Information Science & Technology Abstracts (EBSCO), Library Literature and Information Science (Wilson), PASCAL, Referativnyi Zhurnal Informatika, and Sociological Abstracts.