A new standard for controlled vocabularies

Emily Fayen

This article reviews the changes in the information industry that led NISO to propose a revision of ANSI/NISO Z39.19, Guidelines for the Construction, Format, and Management of Monolingual Thesauri, one of its most frequently requested Standards. In spite of its age – Z39.19 was first presented as a Standard in 1974 – it is still relevant in many parts of the information community. The Standard has been revised twice since its inception, most recently in 1993. The limitations of the existing Standard and the scope of the planned revisions are described. The article concludes with the status of the work in progress and plans for its release.

When I was fresh out of undergraduate school with a newly minted major in math and physics, I planned to continue my studies, but had no idea where to focus my work. As a result, I took a position as an Abstracter/Indexer with Documentation, Incorporated, or DocInc as it was familiarly known. At the time, the company had the contract to prepare the content for NASA’s Scientific and Technical Aerospace Reports (STAR). The original NASA Thesaurus, which predated ANSI/NISO Z39.19 by several years, supplied the terms to be used for indexing the reports that were included in the NASA database. Mortimer Taube, the founder of Documentation, Incorporated, was one of the early proponents of what later became known as post-coordinated retrieval; that is, where concepts are indexed using terms that may be combined during the search process to achieve the desired level of specificity. The NASA Thesaurus and others developed at about the same time became the core of work that led eventually to ANSI/NISO Z39.19. The current revision of this Standard is the subject of this paper.

Introduction

ANSI Z39.19-1974, Thesaurus Structure, Construction and Use, was first issued in 1974 and revised in 1980. In 1993, a second revision was issued under the title ANSI/NISO Z39.19-1993, Guidelines for the Construction, Format, and Management of Monolingual Thesauri. The 1993 revision draws heavily on the international (ISO 2788) and British (BS 5723) standards. Even since 1993, when the Standard was last revised, vast changes have occurred in the information industry. These have resulted from very rapid changes in computer and information processing technology and the global rise of the internet. Today, the expanding use of information databases in all aspects of business and commerce, government, and education, and the need to discover millions of sites on the internet, mean that there are thousands of applications in which controlled vocabularies of various types provide better ways to manage large amounts of content while at the same time making it easier for users to find the information they need. Thirty years after its introduction, ANSI/NISO Z39.19 is the Standard most frequently requested for download from the NISO site. The strong interest in this standard provides evidence of the importance it holds in the information community. Z39.19 is the primary source for guidance in the construction, format, and maintenance of this special type of controlled vocabulary.

Background

The first edition of this Standard, published in 1974, was prepared by Subcommittee 25 on Thesaurus Rules and Conventions of American National Standards Committee Z39 on Standardization in the Field of Library Work, Documentation, and Related Publishing Practices (later known as Library and Information Sciences and Related Publishing Practices). The subcommittee drew heavily on standards of practice developed by the Engineers Joint Council, the Committee on Scientific and Technical Information of the Federal Council for Science and Technology, and UNESCO.

When Z39.19 was first conceived, terms selected from a thesaurus were generally applied when indexing various collections of documents. These might be printed resources such as journal articles, technical reports, or newspaper articles. As new information storage and retrieval systems have emerged, the concept of ‘document’ has been extended to include materials such as patents, chemical structures, maps, music, videos, museum artifacts, and many other types of material that are not traditional documents. Furthermore, the display methods described were almost entirely for various sorts of printed products. In today’s online world, other methods of organization and display must be taken into account.

In 1998, the Standard was reviewed and reaffirmed. At this time, however, the review confirmed a need for a fresh look at the Standard, updating it for use in the rapidly evolving electronic information environment. In response, NISO organized a national Workshop on Electronic Thesauri, held November 4–5, 1999, to investigate the desirability and feasibility of developing a standard for electronic thesauri. The workshop was co-sponsored by the American Psychological Association (APA), the American Society of Indexers (ASI), and the Association for Library Collections and Technical Services (ALCTS), a division of the American Library Association. The current project to revise Z39.19 grew out of the recommendations developed by consensus at the Workshop. The report on the Workshop on Electronic Thesauri, November 4–5, 1999 is available on the NISO web site at http://www.niso.org/news/events_workshop/thes99rpt.html.

The Workshop identified a number of limitations of the existing Standard:

• It is difficult for non-lexicographers to understand. Many potential users who expressed interest in the Standard had no background in library science or related fields, and thus found the concepts difficult to apply to their particular applications even though many recognized the need to do so.
• It is focused on construction and maintenance. The Standard assumed knowledge of the underlying principles of information science that promoted the use of controlled vocabularies.
• It is limited to document indexing applications. Although the original context for controlled vocabularies was for indexing and retrieval of documents, in the intervening years it became highly desirable to apply the underlying discipline to many different types of materials including websites.
• It is limited to post-coordinate retrieval. The Standard assumed that the controlled vocabularies within its scope were to be used in post-coordinate retrieval systems. This assumption limited its applicability to other types of retrieval, including browse and navigation systems.
• It is limited to printed products. The display formats for the controlled vocabularies that were recommended included only printed presentations of the controlled vocabularies. Because of the date the work was first conceived (and even at the time of its last revision in 1993), virtually no controlled vocabularies were being used in a web-enabled environment.
• It uses outdated technology. Finally, although the principles presented in the original Standard are still relevant, many of the examples were based on outdated technology. This needed to be updated to make the Standard relevant to contemporary users.

Relying on feedback from the community and extensive internal discussions, NISO launched an initiative to revise Z39.19. The work has been made possible by generous support from the H.W. Wilson Company, the Getty Foundation, and the National Library of Medicine. With the aim of achieving the best possible result, and making sure the major stakeholders were involved, NISO assembled an advisory group to guide the work. The Thesaurus Advisory Group, or TAG as it has become known, consists of members from many segments of the information industry. The members are:

Vivian Bliss (Microsoft)
Carol Brent (ProQuest)
Dave Clark (Synapse Corporation)
John Dickert (US Department of Defense, DTIC)
Lynn El-Hoshy
Patricia Harpring (Getty Foundation)
Stephen Hearn (American Library Association)
Marjorie Hlava (Access Innovations, Inc.)
Sabine Kuhn (Chemical Abstracts Service)
Pat Kuhr (H.W. Wilson Company)
Diane McKerlie (DMA Consulting)
Peter Morville (Semantic Studios)
Stuart Nelson (National Library of Medicine)
Diane Vizine-Goetz (OCLC, Inc.)
Trish Yancey (Synapse Corporation)
Marcia Lei Zeng (Special Libraries Association)

Cynthia Hodgson (NISO) and Emily Gallup Fayen (MuseGlobal, Inc. and NISO SDC Liaison) are preparing the revision.

NISO’s goal for the revised Z39.19 Standard

In February 2003 NISO conducted a survey to learn more about how Z39.19 was being used. The survey results showed that the respondents wanted several things from the revision:

• The revised standard should provide a better, more inclusive way to represent content; that is, the standard should be applicable to a broader array of materials than documents.
• The revision must take into account a changing audience as well as a vastly different information environment.
• As the number of information resources that use controlled vocabularies grows, there is increasing need for interoperability and sharing across applications.

The scope of the revision

As a first step, the Advisory Group discussed several ways in which the scope of the standard could be broadened and changed to meet the changing needs of implementers.

• Expand the scope beyond thesaurus to include controlled vocabularies. This change in scope reflected the need to make the Standard applicable to controlled vocabularies other than those that had been used so extensively in the indexing of documents by the various abstracting and indexing (A&I) services. The title of the revision will be Guidelines for the Construction, Format, and Management of Controlled Vocabularies.
• Make the Standard more accessible to users. The original Standard had been developed by lexicographers for lexicographers. It assumed that readers were familiar with the underlying concepts and principles of vocabulary control. This is no longer true for the greatly expanded audience for the Standard, which is resulting in the frequent requests for download from the NISO website.
• Explain important concepts. Because many potential users of this Standard do not have a background in library science or information science, it is important to explain the important concepts so that users will understand the reasons behind some of the rules and guidelines.
• Explain principles of vocabulary control. As above, many new users of the Standard are unfamiliar with the basic principles of vocabulary control. Consequently, it is important to explain these ideas without getting too technical, and to provide good examples to illustrate the points being made.


• Include the electronic information environment. At the time Z39.19 was first proposed, little information was available in electronic form. Thirty years later, many content resources are available in electronic form and many controlled vocabularies are available in electronic form. Consequently, the revision must include information on display formats for print, electronic, and web-enabled environments.
• Include additional user access methods. The 1993 revision of the Standard assumed that the predominant search mode would be post-coordinated searching using Boolean operators. In today’s information environment, the Standard must also provide for controlled vocabularies that are suitable for browsing and navigation as well as keyword searching.
• Expand beyond the abstracting and indexing (A&I) applications. As the information industry has grown and matured, other applications have realized that the formalism and discipline inherent in controlled vocabularies might be useful in their applications as well.
• Include web applications. As information resources and the tools to manage them have moved inexorably to the internet, it has become imperative to include web implementations, of both controlled vocabularies and their target databases, within the scope of the revision.

The revised Standard – extended coverage

For the revision now underway, the concept of a ‘document’ has been extended by use of the term ‘content object.’ A content object is any information-bearing entity. It could exist in virtually any physical or electronic form. Content objects may be contained in databases or archives or other information repositories, or they may simply be one or more websites on the internet. Further, the metadata describing a content object is also itself a content object.

The revised Standard – extended types of controlled vocabularies

Similarly, for this revision, the notion of a thesaurus is being expanded to include other types of controlled vocabularies such as lists, synonym rings, and taxonomies. These are subclasses of thesauri, because they have some of the properties of a thesaurus, but not all.

• A list consists only of preferred terms.
• A synonym ring consists only of terms that have an equivalence relationship.
• A taxonomy consists only of preferred terms that have a hierarchical relationship.
• A thesaurus has all these properties plus additional types of relationships.

Lists

A list is a simple group of terms. All are preferred terms. A preferred term is selected for inclusion in a controlled vocabulary from among synonyms or near-synonyms, for example Extra Large might be selected as the preferred term to be included in a list rather than XL.

Example:
Alabama
Alaska
Arkansas
California
Colorado

Lists are frequently used in website pick lists and pull-down menus. The terms included in a list are generally presented in alphabetical order, although other logical sequences (e.g. small, medium, and large) may also be used. As far as the work on the revision for Z39.19 is concerned, we have not allowed for the possibility that one list could be embedded within another. This would imply a structure that, we have stipulated, lists do not have.

Synonym rings

A synonym ring is a list of synonyms or near-synonyms that are used interchangeably for retrieval purposes.

Example:
Speech disorders
Speech defects
Speech, disorders of
Defective speech

Synonym rings are frequently used to enhance retrieval in systems where the content is not indexed or where the indexing vocabulary is not controlled.

A synonym ring may be generated automatically from clusters of co-occurring terms in full text or developed by subject specialists. Synonym rings are not used in indexing, only in retrieval. Therefore, a preferred term is not designated. The terms in a synonym ring have equal status for retrieval.

Taxonomies

A taxonomy is a set of preferred terms, all connected by a hierarchy or poly-hierarchy. Each term in a taxonomy belongs to at least one hierarchical structure. A term may, however, belong to more than one hierarchical structure. In this case, the structure is called a poly-hierarchy.

Example:
chemistry
organic chemistry
polymer chemistry

Taxonomies are widely used in classification schemes and for web navigation systems.

Thesauri

A thesaurus is a controlled vocabulary with multiple types of relationship. The types of relationship allowed in the thesaurus are generally defined at the time the thesaurus is constructed. However, other types of relationships may be added at any time if required.

Example:
rice
  UF paddy
  BT cereals
    BT plant products
  NT brown rice
  RT rice straw

where:
UF = Used for
BT = Broader term
NT = Narrower term
RT = Related term
bold type face = a preferred term

Note: Various display techniques are used to indicate the levels of hierarchical relationships among the terms. In the example above, indentation is used to show that cereals is a broader term to rice and plant products is a broader term to cereals.

Three relationship types are permitted in a thesaurus. These are:

• Equivalence (i.e. use/used for) – indicates the preferred term.
• Hierarchy – indicates broader and narrower term relationships.
• Association – indicates other types of relationships among terms.

A thesaurus generally consists of an alphabetical listing of all the terms and entry terms, together with the three types of relationship for each term (equivalence, hierarchy, and association). In addition to the alphabetical display, a thesaurus also shows the terms as hierarchically arranged trees or some other type of hierarchical representation. These formats are essential to a thesaurus and set it apart from other, simpler controlled vocabularies such as pick lists, synonym rings, and taxonomies. A thesaurus has the strongest and most complex structure of all types of controlled vocabularies.

The revised Standard – display formats

Being able to make controlled vocabularies available in electronic format and on the web allows users much greater flexibility to move around the collections of terms. Today’s web-enabled controlled vocabularies take advantage of navigation tools such as keyword search, browse, and hyperlinks to additional types of displays and specific information such as scope notes, history notes, tree structures, and so forth. The Art & Architecture Thesaurus, developed by the Getty Research Institute (http://www.getty.edu/research/conducting_research/vocabularies/aat/) shows many of these types of links. The MeSH Browser, provided by the National Library of Medicine (http://www.nlm.nih.gov/mesh/MBrowser.html) has other good examples.

The revised Standard – interoperability

As the number of information resources and controlled vocabularies used to index them has grown, the need for tools that will enable cross-database and cross-system access has grown in importance. However, there are no general solutions to the problem. The revised standard identifies the critical issues so that controlled vocabulary designers and users will be aware of the potential problems.

A special case of interoperability involves both indexing and searching content across multiple languages. The issues involved in developing and maintaining multilingual controlled vocabularies are also being addressed in the revision.

As noted previously, the title of the revision will be Guidelines for the Construction, Format, and Management of Controlled Vocabularies. Consequently, there is no limitation to monolingual vocabularies. The special requirements of multilingual controlled vocabularies are considered in the section on Interoperability.

Progress

The group has been working hard toward completing the revision in 2004. The work proceeds via preparation of draft copy for review by the team, followed by monthly conference calls and emails to discuss changes that need to be made in the drafts.

Plans call for 11 or 12 sections in the new version. As I write this (July, 2004) eight sections are largely completed, two are being reviewed and revised, and the remaining sections are scheduled for completion during the third quarter of 2004. The goal is to have the revised standard complete and ready for review by others outside the Advisory Group by November, 2004.

Information about the work is available on the NISO website at www.niso.org. A search at that site for Z39.19 will retrieve all of the relevant materials about this work in progress, including the notes from the Advisory Group discussions.

Emily Gallup Fayen is Vice President, Digital Content and Access for MuseGlobal, Inc. She has extensive experience in information systems. During her career she served as Director for Library Automation at two large academic institutions. She has developed systems to provide access to electronic content, ranging from online catalogs to web-based digital information delivery systems. Before joining MuseGlobal, she was the Product Manager for Information Quest, RoweCom/divine's very large online database of scientific, technical, and medical journal literature. Email: emily@museglobal.com

The Indexer Vol. 24 No. 2 October 2004 65