Chapter K. Controlled Vocabulary (CV)

K.1 INTRODUCTION

Vocabulary is one of the main attributes of any language. In subject indexing, vocabulary plays a very important role, since the subject matter of each document is represented by words or terms drawn from the vocabulary of the language used in indexing. As indicated earlier, two main types of language are used in indexing, viz., uncontrolled or natural language and controlled (artificial) language. The difficulties faced while using natural language in indexing have been discussed in the previous chapter. The concept of controlled vocabulary emerged to obviate those difficulties.

K.2 DEFINITION OF CV

A controlled vocabulary is an authoritative list of terms to be used in indexing (human or automated) [1]. More precisely, it is “an organized arrangement of words and phrases used to index content and/or to retrieve content through browsing or searching” [2]. A controlled vocabulary essentially includes preferred terms and may or may not include variant terms for cross-reference. A controlled vocabulary has “a defined scope or describes a specific domain” [3].

The term “controlled” here signifies that only terms from the list (vocabulary) can be used for indicating the subject of a document while indexing. It also signifies that “if it is used by more than one person, there is control over who adds terms or how terms can be added to the list. The list could grow, but only under defined policies… The objectives of a controlled vocabulary are to ensure consistency in indexing, tagging or categorizing and to guide the user to where the desired information is” [2].

K.3 CHARACTERISTICS OF CV

The characteristics of different types of controlled vocabulary may vary slightly.
Broadly, however, the main characteristics of a controlled vocabulary are:

● It is based on a natural language vocabulary, but its size is always smaller than the vocabulary on which it is based;
● It allows only one term out of all synonyms and quasi-synonyms representing an idea for use in an index;
● It may allow use of variants of preferred terms for cross-referencing;
● It avoids use of homonyms, but where that is not possible, qualifiers are added to indicate the context;
● The scope of a term is sometimes deliberately restricted to a selected meaning which is best suited for an indexing system;
● Spellings, number (singular/plural), and other word forms are standardized;
● A definite rule is followed for compound terms.

Elements of Information Organization and Dissemination © 2017 Amitabha Chatterjee. DOI: http://dx.doi.org/10.1016/B978-0-08-102025-8.00011-9 Published by Elsevier Ltd. All rights reserved.

K.4 TYPES OF CV

Controlled vocabularies are structured to display the different types of relationships among the terms they contain. There are different types of controlled vocabulary, distinguished by their increasingly complex structure. The main types fall in the following sequence of increasing complexity.

[Figure: types of controlled vocabulary arranged by increasing complexity: Classification Scheme/Authority List, Synonym Ring, Taxonomy, Thesaurus, Ontology. The controls shown under the types are ambiguity control, synonym control, hierarchical relationships, associative relationships, and customized associations. Source: ANSI/NISO Z39.19-2005, Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, ISBN: 1-880124-65-3.]

(Note: The figure is based on the one proposed by Redmond-Neal [1].)

The different types of controlled vocabulary are introduced below.
However, since the thesaurus is the most widely used controlled vocabulary in alphabetical subject indexing, it is discussed in more detail.

K.4.1 Subject Authority List

The simplest form of controlled vocabulary is the subject authority list or file. This is a bare list of the subject headings consistently used by an indexing system, arranged primarily in alphabetical order. It is maintained to ensure that synonyms are avoided by indexers working simultaneously in an organization, by the same indexer working at different times, and by different indexers working at different times. Such a list often does not indicate any type of relationship that might exist between the terms and as such is shorter in size.

K.4.2 Taxonomy

The word taxonomy means the science of classifying things, and traditionally the classification of plants and animals, as in the Linnaean classification. It has now become a popular term for any hierarchical classification or categorization system [2]. In the field of information retrieval it denotes “a kind of controlled vocabulary that has a hierarchy (broader term/narrower terms), but not necessarily the related-term relationships and other features of a standard thesaurus” [2]. Taxonomies are often displayed in a tree structure. Terms within a taxonomy are often called “nodes.” A node may be repeated at more than one place within the taxonomy if it has multiple broader terms. This is referred to as a polyhierarchy. Another type of taxonomy, with a more limited hierarchy, comprises multiple sub-taxonomies or “facets,” whereby the top-level node of each represents a different type of taxonomy, attribute, or context. This is used in post-coordinated searching, whereby the user chooses a combination of nodes, one from each facet. Equivalent synonyms or see references may or may not be present in a taxonomy.
If a hierarchy is not too large and can be browsed, and especially if there are polyhierarchies, there is less need for non-preferred variants [4].

K.4.3 Subject Heading List

A subject heading list is “a standard list of terms to be used as subject headings, either for the whole field of knowledge or for a limited subject area, including references made to and from each term, notes explaining the scope and usage of certain headings, and occasionally corresponding class numbers” [5]. Such a list is normally arranged alphabetically. Both preferred and rejected terms are listed in the same sequence. The terms are linked by “See” and “See also” references. The best-known subject heading lists for the whole field of knowledge are Library of Congress Subject Headings and Sears List of Subject Headings, while Medical Subject Headings (MeSH) is an example of a subject headings list for a limited subject area. However, most subject headings lists have now adopted a thesaural structure. Further discussion of subject headings lists may be found in any book on library cataloguing or resource description.

K.4.4 Classification Scheme

A classification scheme is a list of class terms with corresponding notation, accompanied by an alphabetical index. There are mainly two types of classification schemes: enumerative and faceted. An enumerative classification scheme consists of a single list or schedule of all class terms representing the universe of subjects or a subject domain, while a faceted scheme consists of different schedules of class terms representing different facets of the domain concerned. A classification scheme contains a notational vocabulary, while its index represents an alphabetical vocabulary. Further discussion of classification schemes may be found in any book on library classification or knowledge organization.
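The distinction between the two types of scheme, and between the notational and alphabetical vocabularies, can be illustrated with a minimal Python sketch. All notations, class terms, and function names below are hypothetical and invented for illustration; they are not taken from any real classification scheme. In particular, combining facet notations by simple concatenation is an assumption made here for brevity, not a rule stated in this chapter.

```python
# Hypothetical notations and class terms, invented for illustration only.

# Enumerative scheme: a single schedule listing every class term.
enumerative_schedule = {
    "100": "Agriculture",
    "110": "Crops",
    "111": "Wheat",
    "112": "Rice",
}

# Faceted scheme: one schedule per facet of the domain.
faceted_schedules = {
    "crop": {"A": "Wheat", "B": "Rice"},
    "operation": {"1": "Harvesting", "2": "Irrigation"},
}


def build_index(schedule):
    """Invert a schedule (notation -> term) into an alphabetical index (term -> notation)."""
    return dict(sorted((term, notation) for notation, term in schedule.items()))


def synthesize(schedules, choices):
    """Build a compound notation from one chosen term per facet.

    Concatenation is an assumed combination rule for this sketch only.
    """
    parts = []
    for facet, term in choices.items():
        facet_index = build_index(schedules[facet])
        parts.append(facet_index[term])
    return "".join(parts)


# The schedule is the notational vocabulary; the index is the alphabetical one.
index = build_index(enumerative_schedule)
compound = synthesize(faceted_schedules, {"crop": "Rice", "operation": "Harvesting"})

print(list(index))     # class terms in alphabetical order
print(index["Wheat"])  # notation looked up through the index
print(compound)        # notation built from two facets
```

In the enumerative case every class term must already be listed in the single schedule, whereas in the faceted case a compound subject is expressed by choosing one term from each relevant schedule.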
K.4.5 Thesaurus

As mentioned, the thesaurus is the most widely used example of controlled vocabulary. The word “Thesaurus” is of Greek origin, meaning “treasury or storehouse of knowledge” [6]. In modern usage, it denotes a list of terms arranged according to their relationships of ideas [7]. It was Peter Mark Roget who first conceived the idea of such a compilation and brought out, in 1852, his Thesaurus of English Words and Phrases for the benefit of writers looking for appropriate words to express their ideas. Roget’s thesaurus had nothing to do with information retrieval, but his novel idea was profitably utilized in the compilation of modern IR thesauri. According to B.C. Vickery, Helen Brownson was the first person to use the term “Thesaurus” in the context of IR, in her paper presented at the Dorking Conference on Classification Research in 1957. Hans P. Luhn was possibly the first person to think about an information retrieval thesaurus; he suggested the compilation, for indexing purposes, of “families of notions” and a dictionary of “notional families,” very similar to the principles of Roget [8]. The first thesaurus used in information retrieval was developed at E. I. du Pont de Nemours and Company in the United States around 1959, and since then a large number of IR thesauri have been brought out in different subject fields.

K.4.5.1 Definition of Thesaurus

An IR thesaurus, from the point of view of function, is “a terminological control device used in translating from the natural language of documents, by indexer or users into a more constrained ‘system language’ (i.e., documentation language, information language).” From the point of view of structure it is “a controlled and dynamic vocabulary of semantically and generically related terms which covers a specific domain of knowledge” [9].
According to Kent, it is “a compilation of terms of a given information system’s vocabulary, arranged in some meaningful form and which provides information relating to each term that will enable a user of the information file to predict the relevance of responses to questions when this particular control mechanism is used” [10]. Briefly, it may be defined as a list of descriptors for use in an information retrieval system, arranged in a systematic order and manifesting the various types of relationship existing between them [11].

K.4.5.2 Difference from Subject Headings List

Both thesauri and subject headings lists control the use and form of index terms and summarize the relationships between terms in an indexing language.
