OCLC ONLINE COMPUTER LIBRARY CENTER, INC.

www.oclc.org

The purpose or purposes for which this Corporation is formed are to establish, maintain, and operate a computerized library network and to promote the evolution of library use, of libraries themselves, and of librarianship, and to provide processes and products for the benefit of library users and libraries, including such objectives as increasing availability of library resources to individual library patrons and reducing the rate-of-rise of library per-unit costs, all for the fundamental public purpose of furthering ease of access to and use of the ever-expanding body of worldwide scientific, literary, and educational knowledge and information.

ANNUAL REVIEW OF OCLC RESEARCH 1996

©1997 by OCLC Online Computer Library Center, Inc.
6565 Frantz Road
Dublin, Ohio 43017-3395

Printed in the United States of America

ISSN 0894-198X

OCLC users are hereby granted permission to reproduce this publication for their internal use. Reproductions of substantial portions of the publication must contain the OCLC copyright notice.

Product and service names are trademarks or service marks of their respective companies.

OCLC® is a registered trademark of OCLC Online Computer Library Center, Inc.

The Annual Review of OCLC Research 1996 is available on the Internet:

Via the World Wide Web www.purl.org/oclc/review1996

Via FTP (File Transfer Protocol) Logon to ftp.rsch.oclc.org with anonymous (as your username) and your E-mail address as the password. Find the Review in the /pub/documentation/review96 directory.

Via Listserv Address an Internet E-mail message to: [email protected]. Type the get command in the body of your E-mail message: get review96 [file-name]. The file names are: front, part1, part2, part3, part4, part5, appendix

CONTENTS

LETTER OF INTRODUCTION

FROM THE DIRECTOR

THE RESEARCH ADVISORY COMMITTEE

THE RESEARCH ENVIRONMENT

REMEMBERING MARK CROOK

1 OCLC SPECIAL REPORT
Four-Figure Cutter Tables
  Edward T. O'Neill, Brian F. Lavoie, Jeffrey A. Young, and Patrick D. McClain

2 OCLC PROJECT REPORTS
Automatic Subject Assignment via the Scorpion System
  Keith E. Shafer
Characteristics of Articles Requested through OCLC Interlibrary Loan
  Chandra G. Prabha
Characteristics of Book Collections in Academic Research Libraries
  Chandra G. Prabha
Classification Research at OCLC
  Diane Vizine-Goetz
Enhancing the Indexing Vocabulary of the Dewey Decimal Classification
  C. Jean Godby
Evaluating a Multiprocessor NT Server for Z39.50 Use
  Thomas B. Hickey, Richard Bennett, and Thomas L. Terrall
FirstSearch Next Generation: Another Look at FirstSearch
  Thomas B. Hickey, Jenny Colvard, and Thomas L. Terrall
Image Description on the Internet: Summary of CNI/OCLC Image Metadata Workshop
  Stuart L. Weibel and Eric J. Miller
Kilroy: An Internet Research Project
  Keith E. Shafer
A Metalanguage for Describing Internet Resources
  C. Jean Godby and Eric J. Miller
Mr. Dui's Topic Finder
  Mark W. Bendig
Use of the OCLC PURL Service
  Keith E. Shafer
Visualizing Spatial Relationships between Internet Objects
  Eric J. Miller

3 EXTERNAL AND COLLABORATIVE RESEARCH
The Bosnian National Library: Building a Virtual Collection
  Edward T. O'Neill, Jeffrey A. Young, and Robert Bremer
Feasibility of a Computer-Generated Subject Validation File Based on Frequency of Occurrence of Assigned LC Subject Headings: Phase II. Nature and Patterns of Invalid Headings
  Lois Mai Chan and Diane Vizine-Goetz
The Monticello Project: Design Considerations for a Virtual Library
  Eric J. Miller, Tod Matola, Pat Stevens, and Jay Hayden
The Warwick Metadata Workshop: A Framework for the Deployment of Resource Description
  Lorcan Dempsey and Stuart L. Weibel

4 LIBRARY AND INFORMATION SCIENCE RESEARCH GRANT PROGRAM
Analyzing the Viability of Using Peer Group Holdings as an Evaluation Tool for Public Library Adult Fiction
  James H. Sweetland and Judith J. Senkevitch
An Experimental Study on Graphical Tables of Contents
  Xia Lin
The Impact of Electronic Journals on Scholarly Communication: A Reference and Citation Study
  Stephen P. Harter
A Relational Thesaurus: Modeling Semantic Relationships Using Frames
  Rebecca Green

5 DISTINGUISHED SEMINAR SERIES
Cataloging Rules and Conceptual Models
  Barbara B. Tillett
The Copyright Dilemma: Legal Tensions and Information Networks
  Kenneth Crews

PROGRAMS

OCLC STAFF

PATENT, PUBLICATIONS, PRESENTATIONS

PRINCIPAL INVESTIGATORS

As an organization, OCLC has a certain character. It comes from our unique mission, governance, member services, and research efforts. Second only to member services, research is perhaps OCLC's most defining quality. How many times have you read, "OCLC is a nonprofit computer library service and research organization..."? It is more than just a tag line; it is who we are.

This Annual Review spells out who we are by describing the many and varied research projects that have been conducted on behalf of, and often in conjunction with, OCLC member libraries and the information management community generally. Much has been written recently about the "virtual library." OCLC is working to make the virtual real. And we are having a real effect.

At OCLC, we work to see that our research is focused, timely, relevant, and useful in the near term. Not every research project results directly in a new or enhanced product or service. It does not need to. But every research project contributes significantly to making a real difference in library operations and member services. It has to. The rapidly changing environment in which OCLC, libraries, and users operate demands that much of us.

Once, shared cataloging, interlibrary loan, and online reference service were just ideas. Today, they are the bases of library operations for thousands of libraries worldwide. We expect that the "next great idea" will be shaped by the work of the OCLC Office of Research. When you read the reports that follow, we think you will agree.

K. Wayne Smith, President and Chief Executive Officer
Donald J. Muccino, Executive Vice President and Chief Operating Officer

FROM THE DIRECTOR

Rapid, technology-driven change in the library field continues to provide the Office of Research with a wealth of topics worthy of study. The challenge is not finding projects that will yield value for OCLC members, but selecting from the broad range of opportunities. Our priority in selecting projects focuses on areas at the intersection of information technology and libraries. We believe that OCLC's research into these areas will continue to inform libraries on how to best use this technology in providing services.

In keeping with that mission, a significant amount of our effort has focused on the description of electronic information. This can be seen in the continuing series of workshops on metadata which OCLC has sponsored: the fourth will be held in Australia, March 3-5, 1997. We have also been working with the World Wide Web Consortium to include the necessary capabilities in the Platform for Internet Content Selection (PICS) standard to support encoding metadata. These efforts can assist libraries in describing electronic information and can promote development of standard protocols that support this information.

The Office of Research has also been exploring the automatic creation of metadata. The first step has been to automatically generate subject classification of electronic resources collected from the Internet. Multiple scientists are approaching this project from several directions. The ultimate goal is to provide a tool that can be used to bring order (i.e., authority control, vocabulary control, subject classification) to electronic resources. In simple terms the project aims to bring discipline to the Internet.

We are also exploring how the delivery of information to patrons is being affected by the electronic age. The availability of full text online is beginning to have a significant effect on the traditional sources of supply, library ownership, and interlibrary loan.
Finally, but certainly of vital importance to OCLC members, we continue to enhance WorldCat (the OCLC Online Union Catalog). Projects to correct uniform titles and add Dewey Decimal Classification to WorldCat records will improve their quality. In fact this edition of the Review features a special report about expanded cutter tables and their versatility. These efforts are important to protect and enhance the value of this database that has been cooperatively developed by the OCLC membership. It is indeed an interesting time to be working in the library field. We hope that the information in this document will help you chart your library's path into the electronic future.

Terry R. Noreault, Director
Office of Research

THE RESEARCH ADVISORY COMMITTEE

Edward Emil David, Jr. Clifford A. Lynch

Carol A. Mandel Joseph Hardin

The Research Advisory Committee guides and evaluates the activities of the OCLC Office of Research. Twice each year committee members come to Dublin, Ohio, to engage in extended dialogues with staff, to learn about current projects, and sometimes to suggest entirely new avenues of inquiry. They frequently affirm the value of what we are doing and often suggest improvements and changes in direction. This open interaction with leaders in the fields of library and computer science directly supports the OCLC mission of service to libraries. Committee members are appointed to a three-year term; those who served during 1996 appear above.


Edward Emil David, Jr., Sc.D., is an industrial consultant, with a research specialty in electrical engineering. He previously served as president, Exxon Research and Engineering Company, vice president of Exxon Corporation, and executive vice president of Gould, Inc. Dr. David also served as U.S. representative to the NATO Science Committee, a science advisor to President Nixon, and a member of the White House Science Council. His publications are extensive, and include, in co-authorship, Man's World of Sound, Waves and the Ear, and The Man-Made World. Dr. David received his B.S. degree from Georgia Institute of Technology and his M.S. and Sc.D. from MIT. He also holds numerous honorary doctorate of engineering degrees from institutes of higher learning throughout the country.

Clifford A. Lynch, Ph.D., is Director, Division of Library Automation, University of California, Oakland. The Division's mission is to create, maintain, and provide access to a database containing the bibliographic holdings of more than 100 libraries of the University of California system via the public access catalog, MELVYL®. Lynch obtained his Ph.D. in computer science from the University of California at Berkeley, has published extensively, and is an active member of the Internet Engineering Task Force (IETF). He is active as a consultant for a variety of government, academic, and commercial organizations, and he serves on a variety of national standards committees.

Carol A. Mandel, M.S.L.S., M.A., is Deputy University Librarian for Columbia University, New York, NY. The Deputy's office has leadership and direction responsibilities for the 11 divisions of the University Library, as well as for achieving the library's strategic initiatives. Mandel obtained her M.S.L.S., and her M.A. in Art History, from Columbia University. She serves as chair of the Bibliographic Services Study Committee for the Council on Library Resources, is active as a consultant to several organizations, and has served on a variety of editorial and advisory boards. She has published and given presentations extensively, especially in the area of cataloging and technical services.

Joseph Hardin, B.A., is Associate Director, Software Development Group, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Champaign, IL. The Software Development Group is known world-wide as the developers of a variety of significant scientific software tools, including NCSA Telnet, the Hierarchical Data Format (HDF), and NCSA Mosaic. Hardin is quite active and influential in the World Wide Web (WWW) community regarding the evolution of standards within the WWW. For example, he was instrumental in the organization and conducting of the two WWW Metadata Workshops that have been held to date. He has published on a variety of topics concerning the scientific application of computers, particularly in the areas of visualization and hypermedia. In relation to his work at NCSA, he has been the recipient of numerous grants and awards.

THE RESEARCH ENVIRONMENT

The mission of the OCLC Office of Research is to expand knowledge that advances the goal of OCLC's commitment to improving access to the world's information resources, whatever their form, substance, subject, language, or location. This mission is pursued through the integrated employment of the computer, library, and information sciences in research activities such as performing experiments, building prototypes, advancing standards, undertaking studies, and participating in research collaborations.

Working for a better WorldCat

[Figure: two diagrams. "The Research Process" shows the stages Project Proposal, Project Accepted, Project Execution, and Project Concluded, with feedback from the Research Advisory Committee, professional contacts, and internal OCLC feedback. "The Effects of Research" shows standards proposals (856 field influence), scholarly journals and papers, the Government Printing Office (PURL technology), research prototypes (FirstSearch Next Generation), and products.]

REMEMBERING MARK CROOK

Mark A. Crook
Marketing and Sales Support Manager, OCLC Distributed Systems
May 15, 1953-August 17, 1996

For the last six months of his life, I was privileged to work directly with Mark Crook in Distributed Systems as his manager. He joined Distributed Systems and put in place a formal marketing plan in short order. Prior to that, Mark had been the key researcher of the business plan that put the OCLC SiteSearch product in business and a periodic business advisor and researcher. To say the least, Mark was a key contributor to the work we've done with SiteSearch. Perhaps more important than the volume of work Mark did was the way he went about his work. He brought a great passion for learning and teaching to his work. As he encountered uncharted waters, he investigated them and shared his knowledge with our group and others.

Mark's example gave me insight into being a better manager, leader, and human being. Mark was always disciplined in his approach to work, but he balanced it with an easygoing attitude and a willingness to assist others. Even with the workload in his new job, he continued to find time to further the research of others in the library community outside OCLC. In addition, he balanced all this work for OCLC with a strong family life. It was clear when you talked to him for any length of time that his family was never far from his thoughts.

As many people I've talked to have said, I will miss the friendship, advice, and counsel that Mark so selflessly provided.
-- Taylor Surface

Mark Crook lived the spirit and ideals of an Eagle Scout.
-- Chandra Prabha

I thought about Mark as I was reading The Best of Sheehan. George stated that we are all seekers and that in every race, we reach for the most honest effort we can muster. Truly, Mark reached for the most honest effort that life had to offer...
-- Keith A. Carter
A member of the OCLC afternoon running gang!

Mark was one of the most intelligent and kindest people I have known. Mark was there to help whether it be co-worker, fellow athlete, stranger, or friend. Mark was truly a great man.
-- Dave Morris

It has been an honor and a pleasure to have worked with Mark Crook over the last ten years. The breadth and depth of his contribution to OCLC are a tribute to his ability, dedication, and work ethic. Regardless of his job title, Mark continued throughout his career to assist people from all over the company in analyzing corporate data. He was an expert in working with this information and freely shared that ability with anyone who needed help. He did that because it was the right thing to do, even though it often was not even remotely part of his job description.

In the last year, Mark assumed the leadership role in marketing OCLC SiteSearch. He was truly shining in that position. He focused his ability and experience in defining and executing a marketing plan for the product. In a short time, he established himself as a leader of the OCLC SiteSearch group.

Mark was, as many of you may know, a dedicated and gifted runner. Six years ago he and I were both in a five-mile race. After about two miles, I was shocked to see Mark about 50 feet in front of me because, frankly, he was much faster. Then I noticed that he was carrying a kitten that had wandered into the street. He had moved it to the side so it would not be trampled by the runners. That was so representative of Mark: always taking time out from his own tasks to help.

On a personal level, I will miss the advice and counsel Mark offered. He had a strong sense of what was right, and I counted on him for insight into how to manage most effectively. I will miss Mark greatly.
-- Terry Noreault

Ralph Waldo Emerson (1803-1882) wrote: "To laugh often and love much; to win the respect of intelligent persons and the affection of children; to earn the approbation of honest citizens and endure the betrayal of false friends; to appreciate beauty; to find the best in others; to give of one's self; to leave the world a bit better, whether by a healthy child, a garden patch or a redeemed social condition; to have played and laughed with enthusiasm and sung with exultation; to know even one life has breathed easier because you have lived. This is to have succeeded."

This is what we all strive for, and this is what Mark Crook achieved!
-- Susanne Krouse

1 OCLC SPECIAL REPORT

FOUR-FIGURE CUTTER TABLES

Edward T. O'Neill, Consulting Research Scientist, Brian F. Lavoie, Research Associate, Jeffrey A. Young, Consulting Systems Analyst, and Patrick D. McClain, Systems Analyst

Abstract

The resolution power of the cutter tables has diminished due to growth in library collections and the increasing incidence of corporate name and title main entries. Heavily used cutter numbers routinely require adjustment to prevent shelflist conflicts and to establish unique call numbers. OCLC's expanded cutter tables balance the distribution of main entries over the cutter table entries and eliminate some of the manual processing required to complete call numbers in Dewey classes with extensive holdings. The expanded cutter tables are compatible with existing two-figure or three-figure schemes, but also include new features that enhance their versatility and facilitate algorithmic cuttering.

Introduction

Classification schemes such as the Dewey Decimal Classification (DDC) or the Library of Congress Classification (LC) provide an orderly arrangement of materials by subject. The use of a book number appended to the class number creates a useful subarrangement of materials within classes. Taken together, the class number and book number provide a brief code or call number that organizes, sequences, and uniquely identifies cataloged materials.

The term cuttering has become common usage for any shelflisting device intended to order materials within classes. Cuttering can be as simple as appending the first three letters of the author's last name to the class number or as complex as including work marks and dates. In creating book numbers, "we seek to convey the maximum information using the minimum number of letters and figures." [Comaromi 1981, 64] The LC classification scheme uses its own special table for assigning book numbers, while DDC users select between one of two different tables: the Cutter table or the Cutter-Sanborn table. Entries retrieved from the cutter tables are called cutter numbers. In referring to the Cutter and Cutter-Sanborn tables collectively, the convention in this paper is to use the term cutter tables with a lowercase c.

The method for finding the appropriate cutter number is the same for both tables. For example, when assigning the cutter number for an author name, the table is searched, and the table entry corresponding to the name becomes the cutter number. If the name falls between two entries, the lower cutter number is used.

Many items classified with the same DDC class number also share the same cutter number. In these circumstances, the cutter tables only provide the "root" of the cutter number; the rest of the number must be manually assigned by the cataloger to maintain an ordered shelflist and/or to create a unique call number. For example, within a given class, a library collection might contain a book by George Prescott with the cutter number P931, and a book by Michael Prescott with the cutter number P932. If the library later acquires a book by John Prescott, its cutter number according to the table is P931, which is already in use. To resolve this conflict, the cataloger must assign an adjusted cutter number (for example, P9315) to fit the new book correctly between the two previously cataloged items. The numeric portion of the cutter number is interpreted as a decimal, which permits infinite interpolation.

Shelflist conflicts may also occur within a subject class when multiple works by the same author or multiple editions of the same work are cataloged. All works within a given class by the same author are assigned the same cutter number. This cutter number is then made unique by adding work marks, usually consisting of one or two letters from the title. Multiple editions of the same work are treated similarly, except that the year of publication is used as the distinguishing element. The cutter number and any supplementary work marks or dates are collectively referred to as the book number.

To illustrate these concepts, consider the book number for Charles Dickens' A Tale of Two Cities: D5485ta 1981. In this example, D548 is the cutter number (from the Cutter-Sanborn table), the next 5 is a local adjustment to fit the book correctly in the shelflist, ta is the work mark (from the first significant word in the title), and 1981 is the publication date. The entire alphanumeric sequence is the book number.

The cutter number is the only element of current copy cataloging practice which routinely requires adjustment. As such, cuttering is an expensive, time-consuming and error-prone operation. O'Neill and McClain [1996] find that while 78% of surveyed DDC class numbers assigned by LC were accepted either in whole or in part by local libraries, "[v]irtually every record used by a Dewey library undergoes a review by cataloging staff to assign some type of book number." [p. 15] They found that the DDC class number assigned by LC was accepted in 78% of the records. This suggests that cuttering is the primary obstacle to automatically downloading records into the local system.

This research identifies heavily used cutter numbers in the Cutter and Cutter-Sanborn tables (those table entries that are especially inadequate in generating unique class/cutter combinations) and expands these entries by one, or in rare cases, two digits. Since the expansion follows decimal interpolation, a series of new cutter numbers is created between existing ones. These new numbers are allocated to an appropriate sequence of alphanumeric strings to form complete table entries. The expansion increases the resolution power of the cutter tables and, therefore, should reduce the amount of manual adjustment required to maintain uniqueness in the shelflist. The expanded tables include entries for cuttering numerals and were designed according to a fixed set of cuttering guidelines. These two features should facilitate algorithmic cuttering.

The expansion of the cutter tables involved four discrete steps:
• Editing the cutter tables
• Creating a test file of DDC-classified main entries
• Using the test file to analyze the distributional efficiency of the tables
• Expanding the cutter tables

Background

Lehnus [1980] and Comaromi [1981] both provide detailed histories of cuttering; the following history is adapted from their works. In 1880, Charles Ammi Cutter published a preliminary version of his two-figure author table, which was well-received by the library community. Building on earlier work by Jacob Schwartz and John Edmands, the table used the then-novel scheme of combining the first letter or letters of the author's surname with a decimal number. This number could then be expanded as necessary to incorporate new materials in the shelflist, without disturbing the existing arrangement. Cutter's approach enjoyed the favor of Melvil Dewey himself, who promoted the use of the Cutter table with his own classification scheme. Cutter updated the table in 1886 and again in 1888. By the turn of the century, the use of cutter numbers was common practice.

Although the two-figure table demonstrated the efficacy of book numbers as a means of ordering materials within classes, its limited size soon proved too restrictive for libraries with large collections. Cutter recognized this problem and resolved to expand his two-figure table to three digits. This task was delegated to Kate Sanborn (later Kate Emery Jones), an associate of Cutter's. The new table was developed in two stages: first, expanded entries were created for the vowels and the letter S, and later, others were added for the consonants.

Two features of Sanborn's table are noteworthy. First, the table abandoned Cutter's convention of including more than one letter in the cutter number. Sanborn used a single letter in combination with two or three digits throughout the table. Second, the Sanborn table was not compatible with the two-figure Cutter table: cutter numbers from the new table could not be integrated with the old two-figure scheme. The full table was published circa 1896 and is commonly known as the Cutter-Sanborn table.

The incompatibility of the Cutter-Sanborn table with Cutter's original two-figure table moved Cutter to develop his own three-figure table. His table, published in 1901, retained the use of multiple letters in the cutter number for the vowels and S, and was compatible with the two-figure table. Unfortunately for Cutter, the five years during which the Cutter-Sanborn table was the only three-figure table available proved to be a costly delay: an installed base had been created which was difficult to penetrate. By the time Cutter's three-figure table was published, many of the library community's prominent figures had already adopted the Cutter-Sanborn table as the standard. Dewey used the Cutter-Sanborn table in his Simplified Library School Rules, while the Library of Congress used it as a guide for subdividing classes in the new LC classification scheme. The period between the publication of the two tables (1896-1901) also saw many new libraries established, chiefly through the beneficence of Andrew Carnegie. "The establishment of so many new libraries with the need for library tools," notes Lehnus [1980, 42], "made the Cutter-Sanborn table a basic library tool, whose popularity has not diminished in the twentieth century."

Empirical evidence reflects the Cutter-Sanborn table's popularity in comparison to the Cutter table. Analysis of survey data on DDC-classified call numbers [O'Neill and McClain 1996] reveals the following user distribution among the various cutter tables (fig. 1).

Fig. 1 Types of Cutter Numbers Used

Of the 110 book numbers reported in the survey, more than half (52%) were derived from the Cutter-Sanborn table, compared with only 37% from the Cutter table. The remaining 11% is allocated between cutter numbers that are derived from the LC table (3%), and those whose source cannot be resolved (8%). Unresolved book numbers include those that could have been derived from both tables, and those that could not have been derived from either. Many of the unresolved book numbers appear to be erroneous. For example, in one case, the book number G8295 was reported for the main entry "Gruss"; the closest equivalent in the cutter tables is G892, from the Cutter-Sanborn table. The last digit (5) in the reported cutter number is a local adjustment; the discrepancy between 829 and 892 is probably a transcription error.

The apparent preference today for the Cutter-Sanborn table may be a residual effect from its status as the first three-figure table made available to library catalogers. However, it is likely that this preference also reflects the "cleaner" set of cutter numbers the Cutter-Sanborn table provides. Specifically, only a single letter is used in each cutter number, resulting in a more uniform and compact shelflisting method. Lehnus [1980, 47] concedes this point, noting that "[t]here is no doubt that the basic reason for the popularity of Sanborn's table is due to the fact that librarians have preferred her idea of using only the initial letter of the name with digits."

The most obvious difference between the cutter tables is their size. With 20,879 entries, the Cutter table is significantly larger than the Cutter-Sanborn table, which has only 12,361 entries. (These figures are from the OCLC versions of the cutter tables, which are described in detail in the next section.) The two-figure version of the Cutter table has 2,727 entries. The comparatively large size of the Cutter table can be traced to its use of two letters for the vowels and S, and three letters for Sc, resulting in a far greater number of entries in these portions of the table. In addition, the Cutter table allocates more entries to the consonants J, K, Q, and X. For the remaining consonants, the number of entries in the Cutter and Cutter-Sanborn tables is similar.

No definitive evidence indicates how either Cutter or Sanborn determined the distribution of entries across the alphabet in their tables. Lehnus notes that Cutter studied the catalogs of the Bodleian Library, the Boston Public Library, and the Boston Athenaeum in compiling his table, but seemed more concerned with identifying all possible letter combinations than with ensuring that their distribution was appropriate. No records have been found that explain Sanborn's distribution scheme.

In 1969, Esther Swift and Paul Swanson jointly produced valuable revisions of both the Cutter and Cutter-Sanborn tables, the first since their original publication. Much of the revision was confined to cosmetic changes in the appearance of the tables and the correction of typographical errors. No substantive changes were made to either table. In the instruction books accompanying the two tables, Swanson and Swift describe their contribution:

    The table appears in a new typeface. Rearrangement into a single, consecutive alphabet should make the table easier to use. This edition may be employed in conjunction with earlier editions, because individual letter and figure combinations have not been changed, save for the correction of a few typographical errors.

The Swanson-Swift editions, as these tables are known, are the source of most cutter numbers today.

Although the Swanson-Swift revision may seem modest, it is a significant contribution to the field of librarianship. While the original cutter tables were indispensable cataloging tools, their printed format was clumsy and often difficult to interpret. The Swanson-Swift convention of matching only one entry to each cutter number, in conjunction with the alphabetical re-ordering of the entries so that the table runs consecutively, improves readability and the ease with which cutter numbers can be generated. The example in table 1 illustrates this point.
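With the table arranged as a single consecutive alphabet, the lookup rule described in the Introduction (use the lower entry when a name falls between two entries) reduces to a binary search, and the decimal reading of cutter numbers gives the shelf order directly. The following is only a minimal sketch under stated assumptions, not the procedure OCLC used: the function names `lookup_cutter` and `shelf_key` are hypothetical, the data are the nine B entries from the Table 1 excerpt, and name normalization (punctuation, forenames, corporate names) is ignored.

```python
from bisect import bisect_right
from decimal import Decimal

# B-column entries copied from the Table 1 excerpt (Swanson-Swift layout):
# (alphabetic string, digits of the cutter number), in alphabetical order.
TABLE_B = [
    ("Bayn", "361"), ("Baz", "362"), ("Bazi", "363"), ("Bazo", "364"),
    ("Be", "365"), ("Beal", "366"), ("Bean", "367"), ("Bear", "368"),
    ("Beat", "369"),
]


def lookup_cutter(name, table=TABLE_B):
    """Hypothetical helper: return the cutter number for `name`.

    If the name falls between two entries, the lower entry is used.
    Only valid for names inside the excerpt's range; a name sorting
    after "Beat" would need the rest of the table.
    """
    keys = [entry.lower() for entry, _ in table]
    # Index of the last entry that sorts at or before the name.
    i = bisect_right(keys, name.lower()) - 1
    if i < 0:
        raise ValueError(f"{name!r} sorts before the first entry in this excerpt")
    entry, digits = table[i]
    return entry[0] + digits


def shelf_key(cutter):
    """Hypothetical sort key: letters first, then digits read as a decimal
    fraction, so that P931 < P9315 < P932."""
    letters = cutter.rstrip("0123456789")
    digits = cutter[len(letters):]
    fraction = Decimal("0." + digits) if digits else Decimal(0)
    return (letters, fraction)


print(lookup_cutter("Beard"))  # falls between Bear and Beat -> B368
print(sorted(["P932", "P9315", "P931"], key=shelf_key))
# -> ['P931', 'P9315', 'P932']
```

Reading the digits as a decimal fraction rather than as an integer is what lets the adjusted number P9315 from the Prescott example file between P931 and P932; an integer comparison would place 9315 after 932.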

3 1. OCLC SPECIAL REPORT

Table 1 Excerpt from the Cutter-Sanborn Table

Original Format          Swanson-Swift Revision
Bayn   361   Cath        Bayn   361
Baz    362   Cathc       Baz    362
Bazi   363   Cathe       Bazi   363
Bazo   364   Cati        Bazo   364
Be     365   Catl        Be     365
Beal   366   Cato        Beal   366
Bean   367   Catr        Bean   367
Bear   368   Catt        Bear   368
Beat   369   Catto       Beat   369

Editing the Cutter Tables

Before OCLC could expand the cutter tables, the sources upon which the expansion is based (the ca. 1896 Cutter-Sanborn table and the 1901 Cutter table) had to be edited to ensure that they were complete and error-free. The editing process sought to:

1. Identify and correct all errors appearing in the original versions of the cutter tables.
2. Resolve any ambiguity surrounding entries in the original tables caused by poor printing quality (i.e., blurred or illegible entries, incompletely stamped letters, etc.).

It should be emphasized that the OCLC expansion uses the original versions of the cutter tables as its sources. The entries in each table were keyed and verified. Each entry consisted of the cutter number with the alphabetic string to which it corresponds: e.g., [Ab26 Abbott]. The entries were sorted by cutter number and then sorted again by alphabetic string. Any irregularity in these sequences represented either an error in the table or a keying error, usually caused by illegible printing. After these errors were compiled and corrected, the original table was manually compared with the Swanson-Swift edition to identify any errors that sorting could not filter. This two-stage editing process identified 19 errors in the Cutter-Sanborn table and 23 errors in the Cutter table which were not attributable to keying error.

The OCLC editions required some revisions to the original cutter tables:

1. The original Cutter-Sanborn table lacks entries for single letters (e.g., A, B, C, etc.). The OCLC edition accommodates these entries by truncating the original first entry for each letter (which is the letter followed by a) to just the single letter. For example, the first entry for the letter C, [Ca C111], was changed to [C C111]. Main entries consisting of corporate names, acronyms, etc., require these single-letter entries to avoid illogical cutter numbers. Without them, a strict interpretation of cuttering rules implies a main entry beginning with a single letter should be assigned the last cutter number of the preceding letter.

2. A number of entries were added to the Cutter table to provide for letter combinations that were not accommodated in the original table. The problem stems from the fact that the Cutter table uses more than one letter for the vowels and S. For example, no entry exists for the letter combination Iw. Since the Cutter table uses two letters for vowels, the cutter number for a main entry beginning with Iw would be Iv9, an illogical result. Cutter numbers for the combinations Ii, Iw, Ix, Iy, Oo, Uo, Uq, Uu, Scb, Scd, Scf, Scg, Scj, Sck, Scq, Scv, Scw, and Scx were formed by combining the letter combination with the numeral 1. The combinations In, Io, Ip, Iv, and Iz all corresponded to existing entries beginning with that combination, followed by the letter a. The new cutter number was formed by dropping the a from the existing entries. For example, the entry [Ina In1] was changed to [In In1]. Finally, the entry for Scz was created by dropping the e from the existing entry [Scze Scz96]. Inclusion of these new entries closed the gaps in the table. These entries can be important for corporate names or acronyms, which often use a combination of initials that are cuttered as a single word. Some of the entries are also important for personal names. For example, Kazimierz Sczaniecki would have been assigned the cutter number Scy95 using the old table, rather than the more appropriate Scz95.

3. Eight new entries were added to each table to provide the option of cuttering numerals as-is. For the Cutter table, the entry [A A1] was expanded to A1 through A9, with A1 through A8 representing cutter numbers for numerals, and A9 serving as the cutter number for A. For the Cutter-Sanborn table, the entry [Aa A111] was expanded to A1111 through A1119, with A1111 through A1118 representing cutter numbers for numerals, and the entry [A A1119] replacing the original entry [Aa A111].

Table 2 Numeric Cuttering Extensions

Cutter Table                    Cutter-Sanborn Table
Old           New               Old           New
A    A1       0     A1          Aa   A111     0     A1111
Aa   Aa11     17    A2          Aal  A112     17    A1112
Aab  Aa12     1975  A3          Aar  A113     1975  A1113
              1990  A4                        1990  A1114
              2     A5                        2     A1115
              20    A6                        20    A1116
              21    A7                        21    A1117
              5     A8                        5     A1118
              A     A9                        A     A1119
              Aa    Aa11                      Aal   A112
              Aab   Aa12                      Aar   A113

Completion of the editing produced up-to-date, accurate tables that were the basis for the expansion.


Creating a Test File

To determine which portions of the cutter tables should be expanded, the likely distribution of main entries across the cutter numbers in the tables needs to be predicted. An uneven distribution would suggest that some cutter numbers experience a disproportionately high frequency of use. Therefore, expansion should focus on these areas of the tables. The main entry distribution can be approximated by cuttering a sample database of main entries. Cutter numbers that are characterized by relatively heavy usage would be expected to generate a high conflict rate in a typical shelflist.

Distribution analysis requires a representative sample of names and titles. Although there is no definitive documentation on the data that either Cutter or Sanborn used to develop their tables, they probably consulted author catalogs. The differences between the two tables suggest that the Cutter and Cutter-Sanborn tables were based on different catalogs. For example, the Cutter-Sanborn table has 21 entries for various permutations of Saint, whereas the Cutter table has only 4 entries. By current standards, the catalogs available when the tables were originally developed were relatively small and probably too parochial to be representative of a typical library.

The ideal source of cuttering data would be a large union catalog. A large catalog minimizes random patterns, and a union catalog reduces the influence of a particular collection. WorldCat (the OCLC Online Union Catalog) closely matches these criteria: it contains over thirty million records and represents the collections of over seven thousand libraries. To ensure that the sample data is representative of the type of material commonly assigned a DDC call number, only records for books with a DDC class (either a 082 or a 092 field) were considered. The DDC class number, the main entry, and other related information were extracted from each WorldCat bibliographic record meeting these criteria. If the record contained alternate class numbers, only the primary class number was extracted and any segmentation marks were removed. The cuttering guidelines described in appendix 1 list the specific information included for main entries. The resulting file contained 7,700,517 main entries. The file was sorted by main entry and class number. An example constructed to illustrate the process is shown in table 3.

Table 3 Example of Data File

  Class          Main Entry
* 284.9          Jones, John A.
* 343.4104       Jones, John A.
  343.4104       Jones, John A.
* 359.3          Jones, John A.
  359.3          Jones, John A.
* 808.51         Jones, John A.
* 821.5          Jones, John A.
* 378            Jones, John Abraham
* 808.51         Jones, John Abraham
* 016.9706572    Jones, John Alan
* 970.3          Jones, John Alan
* 359.3          Jones, John Albert
  359.3          Jones, John Albert
* 362.10973      Jones, John Albert
* 811.49         Jones, John Alfred
  811.49         Jones, John Alfred
* 352.1          Jones, John Alfred
  352.1          Jones, John Alfred
* 378            Jones, John Arthur
  378            Jones, John Arthur

At this stage, the file contained entries that would not have received a unique class/cutter combination, since only the first occurrence of a particular main entry within a class would be assigned a new cutter number. Subsequent occurrences would be assigned the same cutter number, with the call number being made unique by either the use of a work mark if the author has multiple titles in the class, or by the addition of a date if there are multiple editions. In the above example, John Albert Jones would be cuttered twice: once in 359.3 and once in 362.10973. The second occurrence in the class 359.3 would have the same cutter number as the first, and would be distinguished by either a work mark or by a date. In view of this, the file was refined by the removal of duplicate class/main entry pairs. Only the first occurrence of each class/main entry combination (marked with an asterisk in table 3) was retained. The resulting data file contained 5,546,655 unique class/main entry combinations.

Each main entry in the test file was cuttered algorithmically according to the machine-readable OCLC versions of the cutter tables. This required the compilation of a set of standard cuttering practices so that the cutter numbers would be derived according to uniform criteria. Unfortunately, there is no universally accepted body of cuttering practices to consult; hence, the rules used for this project do not represent a definitive statement on "proper" cuttering. On the other hand, the rules are not ad hoc: rather, they are a synthesis of several cuttering and filing guides and the suggestions of skilled catalogers. In this sense, they represent a reasonable standard for cuttering practice. The rules are detailed in appendix 1.
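The two mechanical steps in this section, dropping duplicate class/main entry pairs and cuttering each main entry against a machine-readable table, might look something like the sketch below. All names (`dedupe_pairs`, `normalize`, `cutter_for`, `TABLE`) are hypothetical. The normalization is a simplification of the appendix 1 guidelines, the table holds only the nine Swanson-Swift rows from table 1 (written with an assumed initial letter B), and the lookup assumes the usual convention of taking the last table entry that does not file after the main entry.

```python
from bisect import bisect_right


def dedupe_pairs(records):
    """Keep only the first occurrence of each (class, main entry) pair."""
    seen = set()
    unique = []
    for pair in records:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def normalize(text):
    """Upper-case the text and drop punctuation (simplified assumption)."""
    kept = "".join(ch for ch in text.upper() if ch.isalnum() or ch == " ")
    return " ".join(kept.split())


# Excerpt only: (boundary text, cutter number), sorted by boundary text.
TABLE = [("BAYN", "B361"), ("BAZ", "B362"), ("BAZI", "B363"),
         ("BAZO", "B364"), ("BE", "B365"), ("BEAL", "B366"),
         ("BEAN", "B367"), ("BEAR", "B368"), ("BEAT", "B369")]


def cutter_for(main_entry, table):
    """Return the cutter number of the last boundary <= the main entry."""
    keys = [boundary for boundary, _ in table]
    i = bisect_right(keys, normalize(main_entry)) - 1
    if i < 0:
        raise ValueError("main entry files before the first table entry")
    return table[i][1]


records = [("359.3", "Jones, John Albert"),
           ("359.3", "Jones, John Albert"),
           ("362.10973", "Jones, John Albert")]
print(dedupe_pairs(records))
print(cutter_for("Beale, John", TABLE))
```

A binary search keeps each lookup logarithmic in the table size, which matters when several million main entries are cuttered in one pass.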


Analyzing the Distribution of Main Entries

Growth in the size of library holdings virtually guarantees increased usage of a fixed pool of cutter numbers and, therefore, a higher incidence of conflicts in the shelflist. Figure 2 shows holdings data for 12 major academic research libraries for the years 1908, 1969, and 1995 [Association of Research Libraries].

In 1908, approximately ten years after the Cutter tables were developed, average holdings for the 12 sample libraries were 107,425 volumes. By 1969, the year the Swanson-Swift revisions were released, this figure had risen to 2,391,567 volumes. In 1995, average holdings for these libraries were 5,224,025 volumes. It is useful to recall that prior to 1908, libraries had already found the two-figure Cutter table unsuitable for large collections. Given the fact that the holdings in the sample are on average almost 50 times larger today than in 1908, library collections have probably once again exceeded the tables' resolution capacity.

Adjustment of the cutter number occurs within classifications. Therefore, shelflist conflicts will occur when two or more unique works share both the same class number and the same cutter number. The WorldCat test file contains 5,546,655 unique class number/main entry combinations. Although each combination represents a unique work, it does not necessarily represent a unique class/cutter combination since two unique main entries can generate the same cutter number. When this happens, their class/cutter combinations will be the same, and adjustment of one or both cutter numbers is required to resolve the discrepancy.

Fig. 2 Holdings Data


Figure 3 illustrates the rate at which new conflicts appear in library collections of varying size. Random samples of class/cutter combinations were extracted from the test file to create the sample collections. Cutter numbers were generated using the Cutter table.

The rate at which conflicts appear in the shelflist increases sharply as collection size increases. A 15% increase in collection size from 10,880 volumes to 12,553 volumes implies that 0.36% of the new acquisitions will conflict with existing class/cutter combinations. However, the conflict rate grows dramatically as the initial size of the collection increases: a 15% increase from 4,759,868 volumes to 5,464,135 volumes results in 23.28% of the new acquisitions generating conflicts. Clearly, the rate at which new conflicts enter the shelflist accelerates as the size of the collection grows. Calculations for the Cutter-Sanborn table produced almost identical results.

As more shelflist conflicts occur, the amount of manual processing required to generate cutter numbers increases. However, the severity of this problem will not be uniform across cutter numbers. The sample data from WorldCat illustrates how cutter number usage is asymmetric across a selection of entries from the Cutter table (fig. 4).
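The conflict rate discussed above can be illustrated with a small sketch. The function name `new_conflict_rate` and the sample data are hypothetical, and the sketch assumes that a new acquisition counts as a conflict when its class/cutter combination is already present, either in the existing collection or among acquisitions processed before it. The report does not spell out this detail.

```python
def new_conflict_rate(existing, acquisitions):
    """Fraction of acquisitions whose class/cutter pair is already taken.

    existing, acquisitions: sequences of (class_number, cutter_number).
    """
    taken = set(existing)
    conflicts = 0
    for combo in acquisitions:
        if combo in taken:
            conflicts += 1      # would need manual adjustment
        else:
            taken.add(combo)    # first use of this class/cutter pair
    return conflicts / len(acquisitions) if acquisitions else 0.0


# Hypothetical miniature shelflist and a batch of new acquisitions.
existing = [("359.3", "J71"), ("378", "J71")]
acquisitions = [("359.3", "J71"), ("808.51", "J71"),
                ("808.51", "J71"), ("970.3", "J72")]
print(new_conflict_rate(existing, acquisitions))
```

Because the pool of class/cutter pairs is fixed while the set of taken pairs only grows, the rate returned for equally sized batches rises as the existing collection gets larger, which is the effect the figure shows.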

Fig. 3 Rate of New Conflicts (horizontal axis: collection size, in thousands)

Fig. 4 Cell Distribution Example


Disaggregation of the data into individual table entries, or cells, reveals that some cells are rarely used (e.g., Aj19, Ak39, Ao32) while others show heavy usage. In particular, 15,618 items correspond to the cutter number Am35. The alphabetic string associated with this cutter number, Amer, suggests that many of these items have corporate main entries, such as American Library Association or American Heart Association. The case of Am35 clearly illustrates the problems that can develop when many unique main entries correspond to the same cutter number. Extensive local adjustment of the cutter numbers will be required to avoid shelflist conflicts. Some of this burden will be alleviated once the main entries are distributed among individual class numbers; however, the same problem will arise, albeit on a smaller scale, within heavily used classifications. If unique call numbers are a concern within a library, the problem is further magnified.

Using the complete data set in the test file, the distribution of main entries across the Cutter and Cutter-Sanborn table cells was analyzed. Figure 5 summarizes the number of cells in each table capturing various main entry aggregations.

According to fig. 5, the Cutter table contains 5,086 cells containing ten or fewer entries. More than half of these contain no entries at all. In other words, nearly a quarter of the Cutter table (24.4%) is either unused or rarely used. In contrast, the Cutter-Sanborn table has only 282 cells with 9 or fewer entries, only 40 of which are unused. These cells reflect only 2.32% of the table's entries. This comparison suggests that the greater size of the Cutter table supplies a less-than-proportionate increase in resolution power.

Figure 5 also illustrates the crowding which occurs within certain segments of the tables. Each table has more than 1,000 cells capturing 1,000-9,999 main entries, and 9 cells with 10,000 or more entries. It is interesting to note that while the Cutter and Cutter-Sanborn distributions are virtually identical for cells with 100 or more entries, they are widely divergent for cells with relatively few entries (99 or fewer). This can be interpreted as further evidence that the extra cells in the Cutter table do little to improve resolution power.

Further analysis of the sample data indicates that the most severe crowding occurs in virtually the same areas in each table (table 4). Table 4's comparison of the six most crowded cells in each table reveals a close correspondence between the Cutter and Cutter-Sanborn tables. In all cases, most or all of the entries for a cell in one table fall into the common range shared by the corresponding cell in the other table. This reinforces the idea that despite the Cutter table's larger size, little improvement is gained in resolution power since the extra entries are not distributed appropriately among the most heavily used portions of the table.

This analysis illustrates the extent to which crowding occurs in the Cutter and Cutter-Sanborn tables. Another issue is the source of this crowding: in particular, is crowding more prevalent among certain types of main entries? Table 4 shows that in the cells with the most severe crowding, the main entries frequently start with words such as American or International, which are common initial words in corporate names. Figure 6 shows the distribution across nonpersonal-name main entry types for cells containing more than 5,000 entries.

Fig. 5 Cell Distribution (number of cells by number of entries per cell; series: Cutter, Cutter-Sanborn)

Table 4 Six Most Crowded Cells in Each Cutter Table

Cutter                          Cutter-Sanborn                  Common                        Frequent
No.   Range        Entries      No.   Range        Entries      Range           Entries       Starting Words
In8   Int-Inz      41,772       I61   Int-Inv      39,298       Int-Inv         39,298        International; Introduction
Un3   Ung-Unk      41,346       U58   Uni-Uns      41,277       Uni-Unk         40,680        United; University
N213  Natio-Nato   21,770       N277  Nati-Nativ   21,417       Natio-Nativ     21,406        National
N42   New-Newall   19,953       N532  New-Newb     20,226       New-Newall      19,953        New
Am35  Amer-Ames    15,618       A512  Amer-Ames    15,618       Amer-Ames       15,618        American
G798  Grear-Grebn  13,630       G786  Gre-Greav    13,286       Grear-Greav     13,217        Great


Fig. 6 Main Entry Types (percentage of main entries that are titles, uniform titles, conferences, and corporate names; bars: Total, Cutter, Cutter-Sanborn)

Seventy percent of all the main entries in the test file are personal names. However, only 10.1% (Cutter) and 16.3% (Cutter-Sanborn) of the main entries in cells with 5,000 or more items are personal names. As fig. 6 indicates, nonpersonal-name main entries (i.e., corporate names, conferences, uniform titles, and titles) constitute the vast majority of entries cuttered in the most heavily used cells. While only 7.9% of the main entries in the test file are corporate names, 40.1% and 35.4% of the main entries in crowded cells are corporate names in the Cutter table and Cutter-Sanborn table, respectively. Skewed results are also obtained for other types of nonpersonal-name main entries.

Entropy Analysis

The resolution power of the Cutter and Cutter-Sanborn tables can be quantified with entropy calculations. Entropy is an information theoretic concept that translates the average information embodied in a random variable into a measure of uncertainty. Uncertainty in this context "represents 'potential information' in the sense that when a random variable takes on a value we gain information and lose uncertainty." [Applebaum 1996, 98] In other words, a potential outcome with a high probability conveys little information if it occurs; on the other hand, a potential outcome with a low probability conveys much information if it occurs. Since entropy is a measure of uncertainty, it is maximized at the point of greatest uncertainty, i.e., when all possible outcomes are equiprobable.

In the context of cuttering, the issue can be posed as follows: in cuttering an arbitrary main entry, what is the probability that it will be matched to any given cell in the table? Ideally, the answer would be the inverse of the number of cells in the table (1/N); in other words, there is an equal probability that it will be matched to any cell in the table. This scenario corresponds to the case in which main entries are uniformly distributed across the cutter table cells. In practice, however, a uniform distribution cannot be obtained. For example, an arbitrary main entry probably has a greater chance of being matched to Am35 than Aj25 (Ajcu - Ajd). The nature of language implies that some cells are more likely or less likely to be used than others, and therefore, resolution power (as measured by entropy) will be less than the ideal.

Lower entropy values can be interpreted as meaning that the cutter table has a less than perfect resolution power. There should be an inverse relationship between entropy values and the frequency of conflicts. Entropy (H) is defined [Applebaum 1996, 95] as:

$$H = -\sum_{i=1}^{N} p_i \log_2 p_i$$

where N is the number of entries in the cutter table, and p_i is the probability of cell i being matched to an arbitrary main entry. Operationally, p_i is estimated from the relative frequency of use derived from the sample data.

Based on this formula, entropy was calculated to be 12.7 for the Cutter table, and 12.6 for the Cutter-Sanborn table. These entropy values indicate that the Cutter table has a slightly higher resolution power than the Cutter-Sanborn table. However, the difference in resolution power is not significant. The Cutter-Sanborn table is more efficient in that it embodies more resolution power in proportion to its size than the Cutter table. The similarity of the entropy values reinforces the point that the extra cells in the Cutter table do little to improve resolution power.

Four-Figure Cutter Tables

(Note: Many entries in the expanded cutter tables have more or less than four figures. This is similar to the old cutter tables, which are often described as three-figure cuttering schemes even though many of the cutter numbers have fewer than three figures.)

The preceding analysis suggests that the Cutter and Cutter-Sanborn tables lack the necessary resolution power to serve today's large libraries adequately. This is not surprising; the tables were originally developed to meet the needs of nineteenth century libraries. Since the publication of the Cutter-Sanborn table in 1896 and the Cutter table five years later, the size of the average library collection has grown steadily. As a result, conflicts requiring manual resolution are now common.


Approaches to expansion

There are two approaches to expanding the cutter tables: (1) uniform expansion: adding an extra figure to all entries; or (2) selective expansion: expanding only the heavily used cells.

When Cutter expanded his two-figure table, he chose uniform expansion. This involved expanding most cells by adding nine new entries between each existing entry. Examples of Cutter's expansion from two to three figures are shown in table 5. The entries with two-figure numbers (C51 through C56) are from the two-figure table.

Table 5 Cutter's Expansion

C51   Cl          C53   Clap         C55   Clarke
C511  Cla         C531  Clapa        C551  Clarke, C.
C512  Clab        C532  Clape        C552  Clarke, F.
C513  Clad        C533  Claph        C553  Clarke, H.
C514  Clado       C534  Clapi        C554  Clarke, K.
C515  Clae        C535  Clapp        C555  Clarke, N.
C516  Claess      C536  Clapp, F.    C556  Clarke, Q.
C517  Claey       C537  Clapp, M.    C557  Clarke, T.
C518  Claf        C538  Clapp, S.    C558  Clarke, W.
C519  Clag        C539  Clappe       C559  Clarki
C52   Clah        C54   Claq         C56   Clarks
C521  Clai        C541  Clarar       C561  Clarkso
C522  Clair       C542  Claret       C562  Claro
C523  Claira      C543  Clari        C563  Claroi
C524  Clairau     C544  Clarin       C564  Clase
C525  Clairb      C545  Claris       C565  Clasen
C526  Clairi      C546  Clariss      C566  Clasi
C527  Clairm      C547  Clark        C567  Claso
C528  Clairo      C548  Clark, H.    C568  Class
C529  Claj        C549  Clark, R.    C569  Classe

One problem with Cutter's approach is that many of the new entries created in the expansion are rarely used. This is illustrated in figs. 4 and 5. For example, the two-figure table entry Aj2 (covering the interval Ajc-Aje) was expanded despite the fact that it represents seldom-used letter combinations (only 11 entries corresponded to this cell). After expansion to three figures, seven of the ten new cells corresponding to Aj2 were empty. Cuttering the test data resulted in 2,583 empty cells in the Cutter table; in contrast, the Cutter-Sanborn table had only 40 empty cells. These findings indicate that uniform expansion would serve little purpose and would result in very large tables with a very high proportion of unused cells.

Selective expansion results in a cutter table with significantly higher resolving power than that of a table of equivalent size created by uniform expansion. One disadvantage of selective expansion, however, is that the resulting cutter numbers will be of variable length, and therefore, some readability is sacrificed. If the section of the two-figure table shown in table 5 had been expanded selectively, the result might look like the example shown in table 6.

Table 6 Example of Selective Expansion

C51   Cl
C52   Clah
C53   Clap
C54   Claq
C541  Clarizio, Jerry
C542  Clark, Carolyn
C543  Clark, Dick
C544  Clark, G. M.
C545  Clark, J. B.
C546  Clark, Joyce
C547  Clark, Mary
C548  Clark, Robert
C549  Clark, Sylvester
C55   Clarke
C56   Clarks
C57   Clat
C58   Cle
C59   Clem

In table 6, only one cutter number, C54, was expanded; the other cells were unchanged. The nine two-figure cells (C51 through C59) captured 13,717 entries, with the single cell C54 capturing just more than half of the total (6,904). Selective expansion of C54, while leaving the other cells unchanged, serves the purpose of alleviating the congestion in C54 without creating a host of superfluous cells.

Level of expansion

In Cutter's expansion, each cell from the two-figure table was replaced with ten in the expanded table. This requires the implicit use of the digit zero, which has traditionally been avoided because its use restricts expansion and can easily be mistaken for the letter o in a work mark. Cutter, in his Explanation of the Alphabetic Order-Marks [p. 4], stresses that "zero should be used only in extreme cases." However, his ten-fold expansion method forces the use of zero when expanding an entry from the two-figure table to three figures.

Referring to table 5, if C55 had previously been assigned to Adam Clarke and a cutter number is needed for Brian Clarke, the new cutter number is derived by appending another digit. Using this procedure, a cutter number for Brian Clarke could be derived by appending a 4 to form the number C554. However, C554 is already assigned to Clarke, K. The only way to resolve a conflict of this type is to explicitly add the implied zero prior to adjustment: the correct cutter number for Brian Clarke requires a zero to be inserted between the cutter number from the table (C55) and the appended digit (4) to form the new cutter number, C5504. Adjustment of a number originally from the two-figure table must use a zero to avoid conflicts.


If Cutter had used a nonary (base 9) expansion rather than a decimal (base 10) expansion, the use of the digit zero could have been avoided. For example, the two-figure cell C54 might be expanded as shown in table 7.

Table 7 Selective Expansion

C541  Claq
C542  Clark, Alan
C543  Clark, Charles T.
C544  Clark, Edward
C545  Clark, Jan
C546  Clark, John A
C547  Clark, Louis
C548  Clark, Ralph
C549  Clark, Sue

In this expansion, the original two-figure entry [C54 Claq] was modified by appending the digit 1 and eight new entries were added. By adding eight new entries rather than nine, all of the cutter numbers can be expanded without using zeroes.

Four-figure expansion

The four-figure expansion was done using a selective nonary expansion. Any existing table cell with more than 500 entries was expanded into nine cells (the original cell plus eight new cells). Very large cells, with more than 4,500 entries, were expanded into 81 cells (the original cell plus 80 new cells; 9 times 9 = 81). The cuttering data was re-created prior to expanding the tables. The new data file contained 5,727,090 unique class/main entry combinations. This number is slightly larger than the original test file used for the distributional analysis, because the number of records available in WorldCat increased during the time between extractions.

New cell boundaries were truncated so that only the portion of the entry necessary to distinguish it from the preceding entry was retained. Using the entries in table 8 as an example, only the leading section of each entry had to be retained to distinguish it from the preceding entry:

Table 8 Truncation of Cell Boundaries

-7  Clark, Edmund Roy
-6  Clark, Edna Hume
-5  Clark, Edna Maria
-4  Clark, Edna Nichols
-3  Clark, Edson L.
-2  Clark, Edson L.
-1  Clark, Edson L.
 0  Clark, Edson L.
 1  Clark, Edward
 2  Clark, Edward,
 3  Clark, Edward A.
 4  Clark, Edward B.
 5  Clark, Edward Brayton
 6  Clark, Edward Brayton
 7  Clark, Edward Dale

If Clark, Edna Hume is selected as a boundary, the boundary text could be truncated to Clark, Edn without disturbing the distribution of entries among the cells.

When there are multiple occurrences of an entry, only the first instance of the entry is a candidate to be a cell boundary. For example, suppose that the entry at Line 0, Clark, Edson L., would be an ideal boundary in terms of producing the desired cell size. Since all occurrences of an entry must be cuttered identically, this entry must be eliminated as a valid cell boundary. The entry at Line -3 (the first occurrence of Clark, Edson L.) or the entry at Line 1 (the next entry after all the occurrences of Clark, Edson L.) are the closest valid candidates. In this case, Clark, Edw would be selected since it is closer to the ideal position.

While the primary criterion in selecting the new cell boundaries is to produce uniform cell sizes, brief entries are also desirable. Therefore, some additional variation in cell size is permitted to shorten the entries. If Clark, Edward A. had been in the ideal position in terms of cell size, the entry for Clark, Edward would be preferred since it results in a shorter text with minimal impact on cell size. Since case and punctuation are not significant for cuttering, all text is treated as upper case and all punctuation is dropped in accordance with the guidelines outlined in appendix 1. The actual text retained for the cell boundary associated with Clark, Edward would be "CLARK EDW".

Most entries in both the Cutter and the Cutter-Sanborn tables did not require expansion. For the Cutter table, 2,993 cells were expanded once (i.e., the original entry plus 8 new ones) and 53 cells were expanded twice (i.e., the original entry plus 80 new ones). The four-figure edition of the Cutter table has 49,061 entries. For the Cutter-Sanborn table, 3,124 cells were expanded once and 70 were expanded twice. The four-figure edition of the Cutter-Sanborn table has 42,938 entries.

Samples of the Four-Figure Cutter and Cutter-Sanborn tables are provided in appendixes 2 and 3, respectively. The capitalization in the tables is not significant and is used only to improve readability.

Analysis of the four-figure tables

The expansion of the cutter tables significantly improved the uniformity of cell sizes. The distribution of various main entry aggregates in the four-figure tables is illustrated in fig. 7. This figure corresponds to fig. 5, which summarizes the same information for the three-figure tables.
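The boundary rules of the four-figure expansion can be sketched as follows, using the entries of table 8. Every function name is hypothetical, the normalization is a simplified stand-in for the appendix 1 guidelines, and preferring the earlier candidate when two are equally close is an assumption made only for this sketch.

```python
def normalize(text):
    """Upper-case the text and drop punctuation (simplified assumption)."""
    kept = "".join(ch for ch in text.upper() if ch.isalnum() or ch == " ")
    return " ".join(kept.split())


def expand_nonary(number):
    """Nine child numbers with digits 1-9, so no zero is ever needed."""
    return [f"{number}{d}" for d in range(1, 10)]


def nearest_valid_boundary(entries, ideal):
    """Index closest to `ideal` that is the first occurrence of its entry."""
    def is_first(i):
        return i == 0 or entries[i] != entries[i - 1]
    for distance in range(len(entries)):
        for i in (ideal - distance, ideal + distance):
            if 0 <= i < len(entries) and is_first(i):
                return i
    raise ValueError("no entries")


def shortest_distinguishing_prefix(previous, entry):
    """Shortest prefix of `entry` that still files after `previous`."""
    for n in range(1, len(entry) + 1):
        if entry[:n] > previous:
            return entry[:n]
    return entry


names = ["Clark, Edmund Roy", "Clark, Edna Hume", "Clark, Edna Maria",
         "Clark, Edna Nichols", "Clark, Edson L.", "Clark, Edson L.",
         "Clark, Edson L.", "Clark, Edson L.", "Clark, Edward",
         "Clark, Edward,", "Clark, Edward A.", "Clark, Edward B.",
         "Clark, Edward Brayton", "Clark, Edward Brayton",
         "Clark, Edward Dale"]
entries = [normalize(n) for n in names]   # index 7 is Line 0 of table 8

chosen = nearest_valid_boundary(entries, ideal=7)
print(chosen, shortest_distinguishing_prefix(entries[chosen - 1],
                                             entries[chosen]))
print(shortest_distinguishing_prefix(entries[0], entries[1]))
print(expand_nonary("C54"))
```

With Line 0 as the ideal position, the sketch settles on Line 1 and the boundary text CLARK EDW, and it truncates Clark, Edna Hume to CLARK EDN, matching the two worked cases in the text.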


Fig. 7 Four-Figure Cell Distribution (number of cells by number of entries per cell; series: Cutter, Cutter-Sanborn)

The primary difference between fig. 7 and fig. 5 is the shortening of the left-hand tail of the distribution and the lengthening of the right-hand tail. This corresponds to an increase in the number of cells capturing between 10 and 499 main entries, with a simultaneous reduction in the number of cells containing 500 or more main entries. The few cells with 500 or more entries that remain are the result of individual main entries which occur many times (e.g., General Accounting Office, which occurred 1,529 times) in the test file. Although a set of identical main entries cannot be split, the cutter number need not be adjusted. Rather, work marks or dates can be used to distinguish each main entry.

Figure 8 illustrates the entropy calculations for the four-figure tables and compares them with the results from the previous versions of the cutter tables.

Both four-figure cutter tables have entropy values of 15.1. This result was expected since a maximum cell size of 500 was used for both tables. The entropy values for the three-figure tables and for the two-figure Cutter table are also shown in fig. 8. A comparison shows that the increase in entropy resulting from the expansion from two to three figures is almost identical to that obtained from expanding from three to four figures.

Algorithmic Cuttering Using the Four-Figure Tables

The original and Swanson-Swift editions of the cutter tables were designed for and are primarily used in print form. The four-figure tables are designed to be compatible with machine use. This includes algorithmic cuttering, which can take a number of forms.

One approach is an independent cuttering routine which accepts as input an alphanumeric string, and outputs the appropriate cutter number from the machine-readable tables. The OCLC Office of Research has developed a Windows-based application which performs this task. This form of machine-assisted cuttering requires the highest level of intervention on the part of the cataloger. It is the responsibility of the cataloger to find the appropriate text to be cuttered and pass it to the cuttering application. Once the cutter number is generated, the cataloger must perform any necessary adjustments and then transfer the result to the cataloging record.

A second approach to algorithmic cuttering is an application which operates within the computerized cataloging environment, such as OCLC Passport for Windows software. A Passport macro was designed which identifies the main entry in a bibliographic record, passes it to a cuttering routine, retrieves the cutter number from the routine, inserts the cutter number into the 092 field, and then reformats the record to form a complete call number. The macro also accepts highlighted text in the bibliographic record to override main entry cuttering, a necessary feature for cases where main entry cuttering is not appropriate. With this cuttering package, the cataloger's role is minimized to identifying exceptions to

12 1. OCLC SPECIAL REPORT

In developing cuttering algorithms, the adoption of a definitive set of guidelines governing cuttering practice becomes especially important. The guidelines developed by the OCLC Office of Research are described in appendix 1. However, local cuttering practices may conflict with these rules. One solution to this problem is to tailor cuttering algorithms to local practice. A more efficient solution, however, might be the emergence of a universally accepted set of cuttering practices. Widespread adoption of the OCLC four-figure tables, which are designed to be used with the cuttering guidelines in appendix 1, might be a step toward standardizing cuttering practice.

Conclusion

The cutter tables have survived virtually unchanged for a century. While this is testimony to their utility as a shelflisting device, it is also a reminder that their original design arose from catalogs markedly different, both in size and composition, from current standards. Today's library catalogs are considerably larger and contain numerous corporate name and title main entries. The three-figure cutter tables have significant limitations when applied to catalogs of this kind. In particular, analysis of the Cutter and Cutter-Sanborn tables indicates that a sample distribution of main entries across cutter numbers is highly uneven. Heavily used cutter numbers are likely to generate shelflist conflicts requiring manual resolution.

Expansion of the cutter tables to four figures smooths the distribution of main entries across cutter numbers, thereby improving resolution power and reducing shelflist conflicts. The four-figure tables are compatible with the older two-figure and three-figure cuttering schemes. Since the expansion included a well-defined set of cuttering guidelines, the test data used to derive the expanded tables reflects a consistent and logical approach to cuttering practice and provides a firm foundation for algorithmic cuttering. The four-figure tables also supply the means to handle numeric cuttering directly.

Although the three-figure Cutter table possesses a slight advantage over the three-figure Cutter-Sanborn table in terms of resolution power, the four-figure Cutter-Sanborn table has approximately the same resolution power as the four-figure Cutter table. The Cutter-Sanborn table achieves this level of resolution power with fewer table entries and is a simpler and more compact cuttering scheme due to its use of a single alphabetic character. For libraries that are considering switching to a cutter table from some other type of book numbering scheme, the Cutter-Sanborn table is the preferable choice. However, the advantages are not significant enough to warrant switching from the Cutter table to the Cutter-Sanborn table.

Appendix 1 Cuttering Guidelines

1. Main Entry: The USMARC fields used to form the main entry are as follows:

   Personal Names (100 field)
      Personal name (a subfield)
      Numeration (b subfield)
      Titles and other words associated with name (c subfield)
      Dates associated with name (d subfield)
      Fuller form of name (q subfield)

   Corporate Names (110 field)
      Corporate name or jurisdiction name (a subfield)
      Subordinate unit (b subfield)
      Location of meeting (c subfield)
      Date of meeting or treaty signing (d subfield)

   Conference Names (111 field)
      Meeting name or jurisdiction name (a subfield)
      Location of meeting (c subfield)
      Date of meeting (d subfield)
      Subordinate unit (e subfield)

   Uniform Titles (130 field)
      Uniform title (a subfield)
      Date of treaty signing (d subfield)
      Number of part/section of a work (n subfield)
      Name of part/section of a work (p subfield)

   Title Statement (245 field)
      Title proper (a subfield)
      Remainder of title (b subfield)

   For each main entry type, the elements appearing in bold form the text string to be cuttered. The remaining elements are used to determine uniqueness. For example, the main entries Amherst, John and Amherst, John 1940- would be cuttered on the same text (Amherst, John) but are nevertheless unique main entries. The date in the second main entry serves as the distinguishing element.

2. Nonfiling Characters: Initial characters identified as "nonfiling characters" by the appropriate indicator are dropped.

3. Bracketed Data: If the data within brackets is the phrase sic or begins with i.e., the bracketed entry is ignored. For all other cases, the bracketed entry is included.


4. Prefixes: Entries beginning with Mc, M', or Mac are all treated as beginning with MAC and any spaces following the prefix are ignored. Any space following the initial prefixes de, De, del, Del, di, Di, du, Du, el, El, la, La, san, San, van, Van, von, and Von is ignored. Likewise, a dash following the prefixes el or El is also ignored. For Mc or M' to be converted to MAC, the M must be uppercase. Otherwise, the entry will be cuttered "as is."

5. Standard Characters: The only characters which are used directly for cuttering are the space, the digits 0 through 9, and the letters A through Z. The space has the lowest collating order, followed by the digits, and then the letters. All other characters are either converted to one of these standard characters or ignored. Characters which are ignored are treated as if they did not exist. Uppercase and lowercase letters are considered equivalent.

6. Special Letters: The special letters are converted to similar standard characters, as follows:

   Name of character         Character   Convert to
   Greek alpha               α           a
   Ligature ae               Æ, æ        ae
   Greek beta                β           b
   Eth                       ð           d
   Crossed d                 Đ, đ        d
   Greek gamma               γ           g
   Turkish i (undotted)      ı           i
   Polish l                  ł           l
   Script l                  ℓ           l
   Scandinavian slashed o    Ø, ø        o
   Hooked o                  Ơ, ơ        o
   Ligature oe               Œ, œ        oe
   Icelandic Thorn           Þ           th
   Hooked u                  Ư, ư        u

7. Spaces: Hyphens -, slashes /, and periods . are treated as spaces. Multiple spaces and their equivalents are compressed to a single space. All spaces or their equivalents which precede the first alphanumeric character are ignored.

8. Superscript and Subscript Numerals: Superscript and subscript numerals are treated identically to "on-the-line" characters.

9. Other Characters: All other characters, symbols, diacritics, and/or marks of punctuation are ignored.

These rules have a number of implications. The characters included and their filing order are identical to those of the ALA filing rules. All abbreviations are cuttered exactly as written. Words connected by a hyphen will be treated as separate words. Initials separated by spaces or periods are treated as separate words. Acronyms, initials not separated in any way, and/or initials separated by marks of punctuation without spaces are treated as a single word regardless of capitalization. There are no special provisions for Roman numerals, which are simply treated as letters. Only the text of the entries is considered; entries with different MARC tags are ordered solely on the text of the entry. Finally, there are no explicit rules for treating initial articles, or other leading characters which should be ignored. Since the assumption here is that cutter numbers are derived from a USMARC record field, it follows that nonfiling characters will be correctly specified by the appropriate indicator in the MARC record.

A major departure from accepted practice is in the treatment of numbers, both Arabic and Roman. While neither Comaromi, Lehnus, nor the instruction booklets accompanying the cutter tables explicitly discuss numeric cuttering, the prevailing practice is to spell out the number in the language of the title. However, this approach is less than definitive: for example, should 1845 be spelled out as one thousand, eight hundred and forty five, or as eighteen forty five? While it is difficult to spell out numbers in English consistently, spelling out numbers for the more than 3,000 languages is virtually impossible. Neither the ALA filing rules nor LC shelflisting practice spell out numbers, preferring the "file-as-is" principle instead.

The major differences from the ALA filing rules are:

Ampersand: The ALA filing rules provide the option (but do not require) of spelling out the ampersand (&) in the language of the entry. Under this option, the German entry "A&O" would be filed as "A und O". In contrast, the rules utilized in this study ignore the ampersand: i.e., this entry would be cuttered as "A O". Thus, these rules treat the ampersand identically to the ALA filing rules, but without the "spelling out" option.

Numbers: ALA filing rules (Rule 8) provide detailed procedures for the treatment of punctuation within numerals, while these rules do not provide for any special treatment of numeric punctuation. For example, under ALA rules the entry 5.000 anos de historia is filed as 5000 ANOS DE HISTORIA, since the period is not used as a decimal point, but rather to increase readability. Under these rules, the entry would be cuttered as 5 000 ANOS DE HISTORIA,


since the period would be converted to a space. It would be very rare for these differences to produce different cutter numbers.

Roman Numerals: Under ALA filing rules, Roman numerals are interfiled with their Arabic equivalents; under these rules, Roman numerals are not distinguished from their equivalent letters.

Superscript and Subscript Numerals: Under ALA filing rules, superscript and subscript numerals are filed as normal numerals preceded by a space; no preceding space is used under these rules.

The major differences from the LC filing rules are:

Ampersand: Under LC filing rules, the ampersand (&) has a filing value between the space and the digit zero. Therefore, LC filing rules would file the entry A & O between A. O. International and A and O Company.

Numbers: LC filing rules specify that numbers precede letters, are arranged according to their value, and are assigned a cutter number in the A1-A19 range. These rules are similar to the LC rules in that all numbers are cuttered together. However, the order within the numbers is significantly different.

Roman Numerals: Under LC filing rules, Roman numerals are treated identically to their Arabic equivalents, and are arranged according to their value. Under these rules, Roman numerals are not distinguished from their equivalent letters.

The following example illustrates how titles starting with the digit 5 would be ordered:

   5
   5-14
   5.9L cummins diesel truck
   5/appel
   5 architectures
   5-ASA
   5 B.C. poets
   #5 bazaar bestsellers
   5 card stud
   5 [cinq] piliers
   5 continents
   5-day diets
   5 Day Workshop on
   5 [i.e. Cinco] dias de junho
   5-M Co.
   5-minute jungle tales
   5 Minute kitten tales
   "5 to 1"
   5 to 15
   5 unearthly visions
   (5) year guide
   50 best new cars
   500 days
   5000 helpful pictures
   5,000 years of Korean art
   501 sewing hints
   50th anniversary
   51 Reasons
   56.1 program
   561 from the north
   58F Plaza
   5BX plan
   5th plan
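The character-level guidelines of appendix 1 (rules 3 and 5 through 9) can be sketched as a normalization function whose output sorts in the order of the example titles starting with the digit 5. This is only an illustrative sketch, not the OCLC routine: the name `cutter_key`, the partial `SPECIAL_LETTERS` map, and the use of Unicode decomposition to drop diacritics are assumptions, and the prefix rule (rule 4) and nonfiling indicators (rule 2) are left out.

```python
import re
import unicodedata

# Partial, illustrative subset of the rule 6 conversion table.
SPECIAL_LETTERS = {"æ": "ae", "œ": "oe", "ø": "o", "ð": "d",
                   "đ": "d", "ł": "l", "ı": "i", "þ": "th"}


def cutter_key(entry: str) -> str:
    """Reduce an entry to the standard characters: space, 0-9, A-Z."""
    # Rule 3: bracketed "sic" or "i.e. ..." is ignored; other bracketed text stays.
    text = re.sub(r"\[\s*(?:sic\s*|i\.e\.[^\]]*)\]", "", entry, flags=re.IGNORECASE)
    # Rules 8 and 9: fold superscript/subscript digits and strip diacritics.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Rules 5 and 6: convert special letters; case is not significant.
    text = "".join(SPECIAL_LETTERS.get(ch, ch) for ch in text.lower()).upper()
    # Rule 7: hyphens, slashes, and periods are treated as spaces.
    text = re.sub(r"[-/.]", " ", text)
    # Rule 9: every remaining non-standard character is ignored.
    text = re.sub(r"[^A-Z0-9 ]", "", text)
    # Rule 7: compress spaces; strip() also drops trailing spaces (an assumption).
    return re.sub(r" +", " ", text).strip()


# A subset of the example titles, in the order given in the text.
TITLES = [
    "5", "5-14", "5.9L cummins diesel truck", "5/appel", "5 B.C. poets",
    "#5 bazaar bestsellers", "5 [cinq] piliers", "5 [i.e. Cinco] dias de junho",
    "(5) year guide", "50 best new cars", "5,000 years of Korean art",
    "501 sewing hints", "56.1 program", "561 from the north", "5BX plan",
]

# ASCII order already places space before digits before letters (rule 5),
# so a plain string sort on the keys gives the collating order.
ordered = sorted(reversed(TITLES), key=cutter_key)
```

Because the key contains only the space, digits, and uppercase letters, no custom comparison function is needed; the collating order of rule 5 coincides with ordinary ASCII ordering.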


Appendix 2 Section of the Four-Figure Cutter Table

Ami Am Am3523 American Association Of Am3583 American Society Of Civil Amll Ama University "Women J Engineers A Aml2 Amad Am3524 American Automobile Association C Am3584 American Society Of In Aml31 Amal Am3525 American Bap Am3585 American Society Of Mechanical Aml32 Amalgamated U Am3526 American Bar Association Mi Engineers O Aml33 Aman Mu Am3527 American Bar C Am3586 American Society Of Mi Ami 34 Amann Am3528 American Book S Am3587 American Spor Aml35 Amans Am3529 American Cancer Society A Am3588 American Ster Ami 36 Amar V Am3531 American Chemical Society Co Am3589 American Telev Aml37 Amaral Go Am3532 American City Am3591 American Tw Aml38 Amaras Am3533 American College Of T Am3592 American Vi Aml39 Amaris Am3534 American Cong Am3593 American Welding Society Aml4 Amas Am3535 American Council On Education A Committee On G Aml51 Amat Am3536 American Cul Am3594 American Wr Ami 52 Amateur At Am3537 American Dir Am3595 Americans R Ami 53 Amateur R Am3538 American Electr Am3596 Americas Con Aml54 Amati Am3539 American Entr Am3597 Americas Re Aml55 Amato Giu Am354l American Far Am3598 Amerika No Ami 56 Amatur Am3542 American Flan Am3599 Amerique Le Ami 57 Amaya T Am3543 American Foundrymens Society A Am36 Ames Aml58 Amazing Lig Am3544 American Gas F Am37 Ames J Aml59 Amazona Am3545 American He Am38 Amet Aml6 Amb Am3546 American Heritage Fo Am39 Amf Ami 7 Ambo Am3547 American Ho Am4l Amg Aml81 Ambr Am3548 American Host Am42 Amge Aml82 Ambroi Am3549 American Indian R Am43 Amgo Aml83 Ambrose A Am3551 American Institute Of Architects D Am44 Amgw Aml84 Ambrose Edn Am3552 American Institute Of Certified Am45 Amh Aml85 Ambrose K Public Accountants Co Am46 Amhe Am186 Ambrose Ste Am3553 American Institute Of Con Am47 Amherst Aml87 Ambrosi Am3554 American Institute Of Pr Am48 Amherst J Aml88 Ambrosini M Am3555 American Iss Am49 Amhu Aml89 Ambroz Am3556 American Juv Am51 Ami Aml9 Ambu Am3557 American Legion Au Am52 Amie Am21 Amc Am3558 American Library Association Boo Am53 Amien 
Am22 Amcc Am3559 American Library Association Res Am54 Amin Am23 Amcl Am356l American Lux Am55 Amint Am24 Amco Am3562 American Management Am56 Amir Am25 Amer Association A Am57 Amis Am26 Amcu Am3563 American Medic Am58 Ami Am27 Amd Am3564 American Meter Am59 Amlo Am28 Amdo Am3565 American Nat Am6l Amm Am29 Amdr Am3566 American National Standard For Ph Am62 Amme Am31 Ame Am3567 American National Standards I Am63 Ammi Am32 Amel Am3568 American New Am64 Ammo Am33 Amelo Am3569 American Oi Am65 Amn Am34 Amen Am3571 American Pet Am66 Amo Am3511 Amer Am3572 American Pho Am67 Amon Am3512 America Hi Am3573 American Polity Am68 Amor Am3513 America Ph Am3574 American Psychiatric G Am69 Amos Am3514 American Academy Of Op Am3575 American Public We Am71 Amp Am3515 American Academy Of Ps Am3576 American Re Am72 Amper Am35l6 American Antique Am3577 American Revolution C Am73 Ampf Am3517 American Ass Am3578 American School Of Co Am74 Amph Am3518 American Association For Th Am3579 American Society For Co Am75 Ampu Am3519 American Association Of Co Am3581 American Society For S Am76 Amr Am3521 American Association Of N Am3582 American Society For Testing Am77 Amrap Am3522 American Association Of St Materials C
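A lookup against a machine-readable form of this table might be sketched as follows. The sketch assumes the conventional cutter-table rule that an entry takes the number of the last table row that does not file after it; the report does not show its own routine, so `lookup`, `TABLE`, and `KEYS` are hypothetical names, and the six rows are copied from the excerpt above.

```python
from bisect import bisect_right

# Rows copied from the table excerpt: (cutter number, table text).
TABLE = [
    ("Am45", "Amh"),
    ("Am46", "Amhe"),
    ("Am47", "Amherst"),
    ("Am48", "Amherst J"),
    ("Am49", "Amhu"),
    ("Am51", "Ami"),
]
# Table text in filing form; the rows are already in filing order.
KEYS = [text.upper() for _, text in TABLE]


def lookup(normalized_entry):
    """Return the number of the last row that does not file after the entry.

    Returns None when the entry files before the first row of the excerpt.
    """
    i = bisect_right(KEYS, normalized_entry.upper()) - 1
    return TABLE[i][0] if i >= 0 else None
```

With these rows, the text Amherst John falls between "Amherst J" and "Amhu" and so receives Am48, while Amherst College receives Am47. Because only an excerpt is loaded, entries filing after "Ami" would also return Am51 here; a full table would resolve them to later rows.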


Appendix 3 Section of the Four-Figure Cutter-Sanborn Table

Elll E E184 Echeno El967 Economic Opt E112 EHT E185 Echeverria EI968 Economic Policy E113 EMM E186 Echeverria To EI969 Economic Profiles E114 E W G E187 Echo Du El971 Economic Reg El 15 Eadie Dou El88 Echoes I EI972 Economic Reviv El 16 Eagan M E189 Echols Je El973 Economic Stre El 17 Eagle L EI91I Eck EI974 Economic Survi El 18 Eaglest EI912 Eckard R El975 Economic Vi El 19 Eakin Ro EI913 Eckardt Wo EI976 Economics An E121 Earn E1914 Eckbo El977 Economics As E122 Earl E EI915 Eckelme E1978 Economics O El 23 Earle Mau EI916 Ecker Br El979 Economics Of E El 24 Early Am EI917 Eckermann H E1981 Economics Of lo E125 Early Cu EI9I8 Eckert All E1982 Economics Of Pu El26 Early lo EI919 Eckert Hel El983 Economics Of The P El27 Early Recor EI92I Eckert Ru E1984 Economid El 28 Earnhart El922 Eckhardt D E1985 Economic Et Po E129 Earth Rh El923 Eckhardt U E1986 Economies O E131 Eas EI924 Eckholt El987 Economist L E132 East Asia A EI925 Eckly El988 Economy A E133 East Hart EI926 Eckstein Bu E1989 Economy Of L EI34 East Pot El927 Eckstein Maxw EI99I Ecopl El35 East Yo El928 Eclairc EI992 Ecotou El36 Eastern Bl EI929 Eco El993 Ecrits De Lo El37 Eastham R EI93I Ecob EI994 Ecrivain Er E138 Eastman Lo EI932 Ecole De Ph El995 Ecu El 39 Eastop EI933 Ecole Dete De P EI996 Ecuador A E141 Eat EI934 Ecole J EI997 Ecuador Inf E142 Eating Q EI935 Ecole Normale S E1998 Ecuador Mu E143 Eaton Ch EI936 Ecoles EI999 Ecumenical Con El44 Eaton E EI937 Ecological Ap E211 Ed El45 Eaton Jane E1938 Ecological Gr E212 Edc S El46 Eaton Le EI939 Ecological Su E213 Eddiso El47 Eaton Ro El941 Ecology And Co E214 Eddy Dar El48 Eatw EI942 Ecology I E215 Eddy No El49 Eau Po EI943 Ecology Of Q E216 Edeja EI5I Eb EI944 Econometric Mo E217 Edelg El 52 Ebas EI945 Economia Dell E218 Edelman Mark E153 Ebbat E1946 Economia Y H E219 Edelson M El 54 Ebbi El947 Economic Analysis Of En E22 Eden EI55 Ebdo El948 Economic And D E231 Edg El 56 Ebel 0 EI949 Economic And Sc E232 Edgar Pe El57 Ebeling F EI951 Economic And T 
E233 Edgecombe G El58 Ebeling W El952 Economic Be E234 Edgeworth J EI59 Ebenf El953 Economic Con E235 Edica EI6I Eber EI954 Economic Cou E236 Edinburgh H EI62 Eberhardt I EI955 Economic Development And I E237 Edins El63 Eberle G EI956 Economic Development Committee E238 Editions E164 Eberlin F For The E E239 Editorial Research Reports On T El65 Ebersole My EI957 Economic Development Fo E2411 Edm El66 Ebert S E1958 Economic Development Pot E2412 Edme EI67 Eble El959 Economic Effects Of N E2413 Edmiston R El68 Ebong El961 Economic Facto E2414 Edmonds Ann El 69 Ebsw EI962 Economic Gu E2415 Edmonds Els E17 Ec EI963 Economic Impact Of N E2416 Edmonds I E181 Ech E1964 Economic Impact Of Tv E2417 Edmonds J S E182 Echao EI965 Economic Impo E2418 Edmonds Mart E183 Echea El966 Economic J


References

American Library Association. 1980. ALA Filing Rules. Chicago: American Library Association.

Applebaum, D. 1996. Probability and Information: An Integrated Approach. Cambridge: Cambridge University Press.

Association of Research Libraries. "ARL Statistics" available at: http://arl.cni.org/stats/statistics/stat.html.

Comaromi, J. P. 1981. Book Numbers: A Historical Study and Practical Guide to Their Use. Littleton, CO: Libraries Unlimited.

Cutter, C. A. 1901. C.A. Cutter's Three-Figure Author Table. Chicopee, MA: The H.R. Huntting Company.

Cutter, C. A. [date unknown]. Explanation of the Alphabetic Order-Marks. Chicopee, MA: The H.R. Huntting Company.

Cutter, C. 1969. C.A. Cutter's Three-Figure Author Table (Swanson-Swift revision). Littleton, CO: Libraries Unlimited.

Lehnus, D. J. 1980. Book Numbers: History, Principles, and Application. Chicago: American Library Association.

Library of Congress. 1986. Subject Cataloging Manual: Shelflisting. Washington, D.C.: Subject Cataloging Division, Library of Congress.

O'Neill, E. and P. McClain. 1996. "Copy Cataloging Practices: Use of the Call Number by Dewey Libraries." In Annual Review of OCLC Research 1995. Dublin, OH: OCLC Online Computer Library Center, Inc.: 11-15.

Sanborn, K. E. 1896. C.A. Cutter's Alphabetic-Order Table: Altered and Fitted with Three Figures by Miss Kate E. Sanborn. Chicopee, MA: The H.R. Huntting Company.

Sanborn, K. 1969. Cutter-Sanborn Three-Figure Author Table (Swanson-Swift revision). Chicopee, MA: The H.R. Huntting Company.

Acknowledgment

The authors gratefully acknowledge the comments and suggestions of Joan Mitchell and Linda Gabel.

18 2 OCLC PROJECT REPORTS

In a recent meeting of the OCLC Office of Research staff, Terry Noreault, Director, mentioned that he had been asked to speak to another group within OCLC about "The Future." Staff quickly noted that he should tell them that the future is indeed coming, that it too will pass, and that it will always be with us. Just the kind of advice a room full of futurists might be expected to give, but apt when one considers these OCLC Project Reports. What was once a research topic anticipating the future, full-text retrieval combined with graphical display of the text (Thomas Hickey, Graph-Text), became a present product (OCLC Electronic Journals Online), which in turn will give way in the future to a new generation, OCLC FirstSearch Electronic Collections Online. Yes, the future comes, it passes, yet it is always coming.

As you read these reports, consider how they might impact your future. Somewhere in them is surely the seed of the next OCLC EPIC service, OCLC FirstSearch service, OCLC CatCD for Windows software, or OCLC NetFirst, all of which have made (and are still making) a difference in how libraries do their work, serve their patrons, and support their communities. If you have any suggestions on where these efforts might or should go in that always coming, always disappearing, always present future, please contact the researchers involved. Viewpoints from beyond Dublin, Ohio, are always welcome!

AUTOMATIC SUBJECT ASSIGNMENT VIA THE SCORPION SYSTEM

Keith E. Shafer, Senior Research Scientist

Abstract

The Scorpion system is a research project at OCLC that explores indexing and cataloging electronic resources. Since subject information is key to advanced retrieval, browsing, and clustering, the primary focus of Scorpion is building tools for automatic subject recognition based on well-known schemes like the Dewey Decimal Classification.

Electronic Chaos

The recent explosion of electronic information has far outpaced the availability of automated tools to effectively manage it. This phenomenon has gained visibility due to the popularity of the World Wide Web, but the problem is not restricted to the Web. For instance, some organizations have more electronic information available online than all of the publicly available Web pages combined. Similarly, most printed material now exists in electronic format long before it is published.

As the amount of accessible electronic information increases, the cost of accessing this information will increase. Even though users can now use free search services to find items of interest, they will increasingly spend their valuable time wading through masses of irrelevant documents to get the information they need. Accordingly, communities unfamiliar with library science are beginning to grapple with the problem of metadata and the organization of large collections of data.

As the Web has grown, there has been a general push to apply and develop techniques to make Web resources searchable and more widely accessible. For instance, there are now many free search services like Yahoo! (at http://www.yahoo.com) and AltaVista (at http://www.altavista.com). There is even wide acceptance of information retrieval techniques like ranked retrieval to help users sift through the volumes of Web information now available. Yet searching the raw content of every document still seems to be an unacceptable solution since it is common to retrieve hundreds of documents for a given search.

Order Makers

Historically, librarians have organized the world's information. For centuries, they have successfully managed, classified, and filtered information of many types. This has been accomplished via the creation of surrogates to manage the items of interest. For instance, to provide access to a book, a catalog entry is created to describe the book. The potential reader need not have the book in hand to know what it is about, who wrote it, where it can be found, etc. This is an effective scheme. Creating metadata about an object makes searching, filtering, organizing, and retrieving the item efficient. This is true of traditional materials as well as new electronic resources.

These two worlds, the seemingly unorganized Web and the organized world of libraries, have much to offer one another. The Web offers automated tools for searching raw information and the library world offers experience organizing and understanding information of all types. By combining their talents and techniques, these two communities can bring powerful resources to bear on the problems of accessing, maintaining, and supplying electronic information.

If every electronic item had a catalog entry or its equivalent, then interfaces could provide the best of both worlds: access to the raw content (the free search services model) and access to the filtered information (the library model). However, librarians can barely keep up with their traditional workload, let alone begin to catalog and organize the vast amounts of information available electronically.

All electronic resources will never be humanly cataloged. The process is just too expensive. Clearly, automated tools to apply library science ideas such as classification and filtering to electronic resources at high speed and low cost are needed.

Automation

While traditional catalog entries contain a wealth of metadata, the subject portion is arguably the most important when it comes to building advanced search and retrieval interfaces. If there were a way to automatically assign subject headings or concept domains to electronic items, then interesting filtering tools could be built. That's where Scorpion comes in.

Scorpion is a research project attempting to combine indexing and cataloging, based on the observation that these are complementary activities. Scorpion specifically focuses on building tools for automatic subject recognition by combining library science and information retrieval techniques. For instance, to assign subject codes to a document, the document can be treated as a query against a Dewey Decimal Classification database using ranked retrieval. The results of the search can then be treated as the subjects of the document. Subject assignment in this manner provides clear differentiation from the traditional computer indexing behind the currently available free search services.


Scorpion cannot replace human cataloging. Many aspects of human cataloging are difficult if not impossible to automate. However, Scorpion should produce tools that help reduce the cost of traditional cataloging by automating subject assignment when items are available electronically. For instance, a list of potential subjects could be presented by Scorpion to a human cataloger who could then choose the most appropriate subject.

Additional Information

For more information, please visit the Scorpion site at http://purl.oclc.org/scorpion where additional documentation explaining the Scorpion tools and experimental results is posted. Local clients or users with passwords and IDs can see the Scorpion tools in action. Anyone desiring a password and ID should contact Keith Shafer ([email protected]).

Scorpion contributors: Roger Thompson, Vincent M. Tkac, and Jonathan R. Fausey

CHARACTERISTICS OF ARTICLES REQUESTED THROUGH OCLC INTERLIBRARY LOAN

Chandra G. Prabha, Senior Research Scientist

Abstract

As the number of periodical titles in electronic editions increases, and as libraries are faced with new restrictions for sharing articles from site-licensed periodical literature, copyright issues surface. Understanding the characteristics of articles users now request can facilitate discussion. This paper presents the attributes of articles sought through OCLC Interlibrary Loan. Nearly two-thirds of the requested articles were published within five years. Over 80% of the periodical source titles were sought five or fewer times a year.

Introduction

Interlibrary loan of materials is inextricably linked to issues of copyright. The copyright issues tend to surface especially when a library requests articles from the periodicals of sister libraries. This paper presents characteristics of articles that libraries request from each other with the hope that a better understanding of these characteristics will help librarians and publishers address problems and facilitate scholarly endeavors. The findings are based on analysis of a sample of articles requested through OCLC Interlibrary Loan in 1995.

Selection of Sample Requests

Approximately 8.5 million requests were processed through OCLC Interlibrary Loan between December 1994 and November 1995. OCLC maintains a database of all online transactions performed on 1% of the bibliographic records in the WorldCat (the OCLC Online Union Catalog). From this 1% sample, 2,000 bibliographic records which were used by libraries for interlibrary loan were selected randomly. The cataloging codes in fixed fields (bibliographic level, type of serial, regularity of publication, and frequency) of matching bibliographic records in the WorldCat were used as filters to limit the sample to requests for articles from periodicals.

Figure 1 shows the distribution of the 2,000 Interlibrary Loan/Document Delivery requests by format. Requests for monographs (47%) were removed. From the remaining 1,058 requests, monographic series and newspapers were eliminated. The technical definition of periodical includes annuals, semiannuals, and irregular serials. In common usage, the term periodical usually refers to journals and magazines which are published at a regular interval several times a year. Because the serial crisis is most pronounced for periodicals, requests for annuals, semiannuals, and irregularly published serials were further disqualified. The remaining 734 requests (37% of 2,000 Interlibrary Loan/Document Delivery requests) were for periodical articles.


[Fig. 1 Interlibrary Loan Requests by Format. Flow diagram: 8.5 million ILL requests; study sample (2,000); serial requests (53%); periodicals (49%); "periodicals" (journals and magazines) (37%, n=734). Excluded groups: monographs and monographic serials; newspaper requests; annuals, semi-annuals, and irregulars.]

The objective was to examine manually about 300-400 requests for periodical articles. As noted, 734 of these requests were obtained. For analysis, 390 were chosen randomly. From this working set of 390 requests, 4% of the requests were disqualified because citations were incomplete (no volume, year, and/or pages). The findings presented here are based on the remaining 373 requests.

Profile of Requesting Libraries

Figure 2 shows that academic libraries generated 52% of the 373 requests. Of these requests, 14% were from major academic research libraries; junior college and medical libraries contributed 7% each; federal and public libraries accounted for 5% each. All other types of libraries together accounted for the remaining 10% of the requests. Four-fifths of the 373 requests examined originated in academic environments.

[Fig. 2 Article Requests by Type of Library (percentage of requests by library type, n=373)]

ILL request forms require library staff to enter the library's preferred method of delivery and the maximum cost the library is willing to pay. Analysis of these data indicates libraries' current practices in obtaining articles via interlibrary loan.


Delivery method
Libraries preferred to obtain 66% of the requested articles via U.S. mail at library rate; 16% via courier service; 8% via U.S. mail, first class; 5% via fax; 1% via air mail. The libraries were willing to pay for the fastest method possible for another 1% of the articles and did not indicate a preference for the remaining 3%.

Maximum cost
How much are libraries willing to pay? Twenty-nine percent wanted the article at no cost. While 14% of the libraries were willing to pay $1-$5, 16% were willing to pay $6-$10. A quarter of the libraries were willing to pay $11-$20. Seven percent indicated $21-$50, while the remaining 9% were willing to pay any amount. Overall, nearly 60% of the libraries wanted an article at $10 or less.
The fact that two-thirds of the libraries preferred article delivery via U.S. mail at library rate and nearly 60% wanted the cost to be $10 or less suggests that while libraries may be trying to operate Interlibrary Loan/Document Delivery units efficiently, they are not really allocating money for faster modes of delivery.

Requested Articles
Articles were categorized by three attributes: publication date, requests per periodical, and subject matter.

Publication date
Table 1 presents percentages of articles for selected intervals and the corresponding cumulative percentages. One-fifth of the 373 articles were published in the year preceding the year of the request. This percentage drops gradually as the articles age. Note that 64% of requests were for articles published within the previous five years, 88% within the previous 15 years, and more than 95% within the previous 25 years.

Table 1 Percentage of Article Requests by Publication Year

Year               Percentage   Cumulative Percentage
1995                    8                 8
1994                   21                29
1993                   14                43
1992                   12                55
1991                    9                64
1990                    6                70
1989-1980              18                88
1979-1970               9                97
1969 or earlier         3               100
n=373

Periodical scatter
Requests were widely scattered with respect to periodical sources. Table 2 shows that 48% of the 120 periodicals received one request each. Eighty-four percent of the periodicals were requested by member libraries five or fewer times during a 12-month period.

Table 2 Number of Requests per Periodical

Number of Requests   Percentage   Cumulative Percentage
1                        48                48
2-3                      23                71
4-5                      13                84
6-9                      10                94
10 or more                6               100
n=120

Subject dispersion
The articles were categorized according to the subject area of the source periodical. Dewey Decimal Classification (DDC) was chosen over the Library of Congress Classification scheme because of its simplicity. Table 3 shows that nearly one-third (32%) of the articles came from periodicals classed in social sciences (DDC 300). Nearly one-fourth (23%) pertained to applied sciences and technology (DDC 600), while another one-eighth (12%) were from mathematics and pure sciences (DDC 500). If the articles classed in DDC 500 are grouped with those classed in DDC 600, then the percentages of requests in sciences and social sciences are roughly equal.

Table 3 Subject Dispersion of Articles

DDC Classes                               Percentage*
000s Generalities                              6
100s Philosophy & Psychology                   9
300s Social Sciences                          32
500s Natural Sciences & Mathematics           12
600s Technology (Applied Sciences)            23
700s The Arts                                  4
800s Literature & Rhetoric                     1
900s Geography & History                       2
Unknown                                       11
All                                          100
*No article requests fell in either the 200s (religion) or 400s (languages).
n=373

While a higher demand for recent publications is not surprising, the extent of this skew is illuminating. Were it not for the phenomenon of article dispersion among a vast array of periodicals ancillary to the subject's core, patron requests could probably be fulfilled to a greater degree from local collections.
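The cumulative percentages reported alongside the publication-year percentages are running sums, which makes the "64% within the previous five years" and "88% within the previous 15 years" figures easy to check. The following is a minimal sketch, assuming only the percentages printed in Table 1; the function name with_cumulative is illustrative and not part of the study.

```python
from itertools import accumulate

# Percentages per publication year, copied from Table 1 of the report.
TABLE_1 = [
    ("1995", 8), ("1994", 21), ("1993", 14), ("1992", 12), ("1991", 9),
    ("1990", 6), ("1989-1980", 18), ("1979-1970", 9), ("1969 or earlier", 3),
]


def with_cumulative(rows):
    """Return (label, percentage, cumulative percentage) triples."""
    running_totals = accumulate(pct for _, pct in rows)
    return [(label, pct, total)
            for (label, pct), total in zip(rows, running_totals)]


if __name__ == "__main__":
    for label, pct, total in with_cumulative(TABLE_1):
        print(f"{label:<16}{pct:>4}{total:>5}")
```

The same helper can be applied to the rows of Table 2 to confirm that periodicals requested five or fewer times account for 84% of the 120 titles.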


Periodical Titles
Periodicals can be characterized in several ways. The study analyzed the following attributes: age of periodical, type of publisher, price, country of publication, and language of publication. Examining these attributes could explain why it may be necessary to obtain the periodical article through the Interlibrary Loan/Document Delivery service. Country and language of publication attributes do not appear to be significant barriers in subscribing to periodicals. Major publishers tend to be multinationals, and the English language is predominant in scientific and scholarly communication. Because this is well recognized, percentage breakdowns are not included. Findings pertaining only to the attributes age, publisher, and price are presented.

Age
Age of a periodical is obtained by subtracting the year the periodical began from the current year. Older periodicals, especially those launched 50 or more years ago, are likely to pertain to classical disciplines. An example is the Transactions of the American Entomological Society periodical, which began publication in 1879. An older periodical may be of general interest, too, such as Donahoe's Magazine, which also began in 1879. The former ceased in 1889, and the latter ceased in 1908. Both appeared in the study sample.
As new knowledge is gained, new specialties are born, which in turn gives rise to periodicals on new topics. New periodicals are also started sometimes as vehicles for communicating social issues or concerns. In this study, two articles were requested from two new periodicals that originated in the same year as the Interlibrary Loan/Document Delivery requests were made. These two periodicals, Narcotics Enforcement and Prevention Digest and Violence Against Women, began in 1995.
As is the case with many bibliometric distributions, the age distribution of the periodical titles is not normal (table 4). Note that nearly 50% of the periodicals were started in the most recent 20 years; 25% of the titles began publication in the most recent 10 years. This data seems to support the fact that emerging periodicals add to the financial strain libraries face from upwardly spiraling serial prices.

Table 4 Percentage of Periodicals by Starting Years

Start Years   Percentage   Cumulative Percentage
1995-1990         12                12
1989-1985         13                25
1984-1975         25                50
1974-1950         30                80
1949-1900         14                94
<1900              3                97
Unknown            3               100
n=120

Publishers
Publishers of periodicals were grouped into four categories: for-profit (e.g., Blackwell, Academic Press), professional or trade associations and societies (e.g., Central States Speech Association, American Entomological Society), universities and colleges (e.g., Harvard University, National College of Teachers of the Deaf), and civic or cultural institutions (e.g., The Whale Museum). For-profit publishers comprised the largest category, producing more than 50% of the periodical titles in the sample. One-third of the periodical titles (33%) were published by professional associations or societies and 11% by universities. The remaining 5% were from civic or cultural institutions.

Price
Subscription prices of periodicals were obtained from reference sources, especially Ebsco. The median subscription price of a periodical is $98.50. The price distribution is not normal as revealed by the average subscription price, which is $289. Periodical prices by the three major publisher types are shown in fig. 3. Note that the median price of periodicals from for-profit publishers ($175) is three times the median price of periodicals from professional or trade associations ($55).

Fig. 3 Median and Mean Subscription Prices (by publisher type: For-Profit, Associations, Universities; n=120)

Conclusions
In summary, an overwhelming majority of the periodicals are published in the English language in a handful of western countries. Most periodicals needed for article references were recently published and more than half were produced by for-profit publishers. The average periodical subscription price varies by publisher type, with the for-profit publisher price being substantially above the others.


CHARACTERISTICS OF BOOK COLLECTIONS IN ACADEMIC RESEARCH LIBRARIES

Chandra G. Prabha, Senior Research Scientist

Abstract
In recent years, the library profession has been addressing issues pertaining to cataloging electronic publications. In sheer dollar terms, most of the library resources are still spent on procuring materials in paper format. Characteristics of research libraries' collections are analyzed on the premise that a general awareness of the scope of collections will help libraries segment collections into those which are likely to become predominantly electronic in the near future and those which may not. The findings are based on cataloging records contributed by 95 academic research libraries in the United States in 1995.

Introduction
Libraries will manage collections in both paper and electronic publications in the near future. An increasing number of reference books are now available in electronic and paper editions. The number of periodical titles that are being published in both electronic and paper form is increasing at an impressive rate. We cannot predict how soon collections will become predominantly electronic. We can, however, differentiate segments of collections that are already being published exclusively as electronic editions or accompanying paper editions (e.g., abstracting and indexing services) from those that may continue to be published just on paper for the foreseeable future (e.g., publications from technologically less advanced countries). Understanding the scope of our collections in terms of cataloging requirements might help libraries to plan and budget resources for managing the collections.
Collections in research libraries are the focus of this paper since a significant number of publications for which research libraries create an "original" cataloging record are procured from outside the United States. Many of these non-U.S. publications are not in the English language. The time frame in which library collections transform to being predominantly electronic may depend on several factors, the scope of the collection being an important one. With this premise, this paper presents characteristics of materials that research libraries collectively acquire and catalog.

Segment Collections
An operational approach to segmenting collections is by the level of technical expertise required for cataloging incoming materials. Most libraries, including research libraries, do not create cataloging data (original cataloging) for book titles from well-known publishers in the United States. The infrastructure that provides cataloging data in advance of publication from these publishers is generally known as the Cataloging-in-Publication (CIP) program. Research libraries, like other library types, use CIP data to record and represent domestic titles that are acquired widely. As stated, cataloging experts at research libraries as well as at the Library of Congress and other national libraries create cataloging data (original cataloging) for many book titles that are selected for their research potential. Where are these publications produced? In what languages are they published? Are they recent publications? Who publishes these titles? Answers to questions such as these might help identify which segments of collections are likely to become electronic in the near future and which might not. An awareness of the attributes of these books might also indicate the range of language skills cataloging staff collectively bring to bear. This paper is limited to books for which academic research libraries produce machine-readable cataloging data.

Method
Cataloging data reveals the characteristics of the source publications it represents (describes). We studied cataloging data contributed by academic research libraries to WorldCat (the OCLC Online Union Catalog). (At the outset, it is worth noting that the number of unique cataloging records may not show a one-to-one correspondence with unique publications. This discrepancy arises for several reasons, some of which reflect interpretation and application of cataloging policies and practices. Despite this discrepancy in numbers, cataloging records, especially those created according to accepted standards, serve as an excellent surrogate for surmising characteristics of the publications they represent.) Serials and other publication formats for which research libraries create and contribute cataloging records were excluded from this analysis. The proportion of items in these formats was too small to permit reasonable confidence in the findings.
In 1995, academic research libraries in the United States (95 members of the Association of Research Libraries) together contributed approximately half a million new cataloging records to WorldCat. Of these contributed records, 5,482 (1%) were analyzed using statistical software. In addition, we examined and categorized a random subsample of 500 records manually.

Findings
Of the 5,482 catalog records, 4,885 (89%) were created for books (fig. 1). A book is roughly defined as a one-time publication printed on paper or microform. In all, 67 languages were represented in the sample. Fifty percent (50%) of the 4,885 books were published in languages


other than English (fig. 2). Books in French, Spanish, German, Chinese, Arabic, and Russian languages each accounted for between 7% and 3% of the sample.

Fig. 1 Original Cataloging by Format of Publication (Monographs 89%, Serials 4%, Other 7%; n=5,482)

Fig. 2 Original Cataloging of Books by Language (categories: English, French, Spanish, German, Chinese, Arabic, Russian, Other; horizontal axis: Percent; n=4,885)

A total of 124 countries were represented in the sample. Of the 4,885 books, just 18% were published in the United States (fig. 3). The percentages for books from the United Kingdom, France, Germany, Russia, and Italy ranged from 8% to 3%.

Fig. 3 Original Cataloging of Books by Country of Publication (categories include United Kingdom, France, West Germany, Russia, Italy; horizontal axis: Percent; n=4,885)

Note that 60% of books originally cataloged were published at least six years prior to the contribution to WorldCat (fig. 4); fewer than 30% of the titles were published in the current or immediately preceding years.

Fig. 4 Original Cataloging of Books by Year of Publication (categories: Pre-1800, 1800-1899, 1900-1959, 1960-1979, 1980-1989, 1990-1995; horizontal axis: Percent; n=4,885)

Discussion
The publication date skew to older books suggests that libraries are perhaps producing machine-readable records also for older titles, that is, titles that may have been cataloged before machine-readable cataloging became the norm. If this is the case, in what time frame will research libraries catch up with creating machine-readable records (retrospective conversion)? We do not know. It appears, however, that recent publications (current and previous two years) are only a small segment of what these cataloging experts describe and analyze.
Research libraries can expect that publications in non-English languages may continue to be collected in paper form in the foreseeable future. Although 50% of the books were in English, fewer than 20% were published in the United States. Some of the English-language titles in the sample were from African countries. Many books published abroad most likely will be produced on paper for several decades irrespective of language of publication.

Footnote
An analysis of the characteristics of materials libraries now acquire may alert the libraries to the continuing need for staff with language skills, if we assume that these libraries will continue to acquire materials from many countries.


CLASSIFICATION RESEARCH AT OCLC

Diane Vizine-Goetz, Consulting Research Scientist

Abstract
In summer 1996, OCLC Forest Press published the 21st edition of the Dewey Decimal Classification (DDC 21). For the first time, a new edition of the classification was published in two formats: the traditional four-volume print format and an electronic version on CD-ROM (Dewey for Windows). The publication of Dewey for Windows completed several years of collaborative effort by OCLC and Forest Press staff. The goal was to produce an electronic version of the DDC that employs the electronic form to maximum advantage but remains sensitive to traditional features that carry over to the new media. In addition to the publication of Dewey for Windows, the collaboration between OCLC researchers and Dewey classificationists has resulted in an active classification research program. This article provides an update on the Dewey research program described by Vizine-Goetz and Mitchell [1996] and introduces new research springing from that work.

The research areas are:
1. Developing custom views of the DDC
2. Enhancing links to other thesauri
3. Improving links to DDC editions in other languages
4. Transforming the captions into end-user language
5. Decomposing numbers and using component parts for improved access

Important progress has been made in research areas 1, 2, and 4. Developments in research areas 1 and 4 are addressed in the first section of this report, Custom Views of the DDC. The next section is devoted to advancements in linking the DDC to other thesauri (research area 2). A new project, ExTended Concept Trees, is also introduced.

Custom Views of the DDC
During the past year, research staff members have developed two custom views of the DDC: the OCLC NetFirst browsing capability and a research prototype called Mr. Dui's Topic Finder. Although both systems use the Dewey summaries to provide access to Internet-accessible resources in the OCLC NetFirst database, the two interfaces focus on different approaches to customizing Dewey for end-user access. The NetFirst browse feature is an operational capability that uses adapted DDC summaries to provide end-user access to the complete NetFirst database, while the Mr. Dui prototype provides multilingual access to a restricted set of records.
In addition to providing custom views, this prototype could also be regarded as a first step in improving links to Dewey in other languages (research area 3). See "Mr. Dui's Topic Finder" by Mark Bendig for a complete description of the Mr. Dui prototype.

Browsing NetFirst by Dewey
The NetFirst browsing capability allows users to click subject categories, topics, and subtopics to view records grouped by DDC numbers. For instance, a NetFirst user interested in finding resources containing information about health concerns for travelers can browse to the second level topic Health and Medicine under the category Health, Home, Technology and then search for items in this topic area about travel and tourism (fig. 1).

Fig. 1 NetFirst Browse Display

Browsing and filtering the database records in this way (using the structure of DDC but not its class numbers) enables users to retrieve relevant items that may not be as easily discovered using traditional keyword searching capabilities. The categories, topics, and subtopics used in the NetFirst Browse feature represent concepts from the first, second, and third summaries of Dewey. In table 1, captions from the first Dewey Summary are shown alongside the first-level browse categories from NetFirst.
The headings in these two groups differ in terminology and in emphasis. The Dewey caption headings use broad, descriptive terminology coextensive with the classes, whereas the NetFirst categories employ current vocabulary that highlights selected aspects of the classes at the expense of others. The Dewey summaries were recast for use in NetFirst by research staff who used statistics on class number assignments in the database to


guide their efforts. The captions associated with frequently used class numbers were primary candidates for review and transformation. For example, special attention was given to Dewey ranges 000-099 Generalities, 300-399 Social sciences, 500-599 Natural sciences & mathematics, and 600-699 Technology, which together account for more than 80% of class numbers in NetFirst records. The Dewey relative index was a major source of replacement terms. Figure 2 provides an example of another prototype system, NetBrowse, that uses edited captions for browsing by Dewey classes.

Table 1 Comparison of Captions

NetFirst Browse Categories            First Dewey Summary
Arts, Recreation, Sports              700 The arts; Fine and decorative arts
Books, Computers, Internet            000 Generalities
Economics, Education, Society         300 Social sciences
Genealogy, Geography, History         900 Geography & history
Health, Home, Technology              600 Technology (Applied sciences)
Language & Linguistics                400 Language
Literature                            800 Literature & rhetoric
Philosophy, Psychology, Paranormal    100 Philosophy & psychology
Religion                              200 Religion
Sciences & Mathematics                500 Natural sciences & mathematics

Fig. 2 NetBrowse Browse Display (Education hierarchy, including the edited caption "School choice")

In this example, the term "School choice" is used in place of the Dewey caption "Educational vouchers." This revised caption is a term from the Dewey Relative Index, but substitute captions need not and should not be limited to the index terminology of the classification system at hand, as long as they are coded and linked appropriately in the underlying database. Mitchell [1996] describes how Library of Congress Subject Headings are being mapped to Dewey numbers in the DDC 21 database. These headings are a source of additional entry vocabulary in the Dewey for Windows CD-ROM database. In the future, they may be used to support customized views of the DDC.

Enhancing Links to Other Thesauri
Headings or terms from other associated thesauri or subject term lists serve as a source of useful index terms not found in terminology used in Dewey. Linking the DDC with other subject access systems has several other advantages. Linking provides:
• A mechanism for associating new topics with the classification
• Knowledge structures on which navigation and retrieval tools can be based
• Expansion of the Dewey knowledge base for electronic resource description and discovery
Two OCLC-sponsored efforts are under way to augment such linkages; one is an Office of Research project, ExTended Concept Trees, and the other is an ongoing service of OCLC Forest Press. In the latter, the Dewey editorial staff review newly approved Library of Congress Subject Headings (LCSH) and pair them with candidate DDC numbers. These new headings represent topics of current interest not specifically mentioned in the latest edition of the DDC. (See LCSH/DDC Numbers of Current Interest on the OCLC Forest Press DDC Web site for details. See also "Enhancing the Indexing Vocabulary of the Dewey Decimal Classification" by Jean Godby for a report on research to link terminology from full text to the DDC.)

ExTended Concept Trees
ExTended Concept Trees is a new project devoted to expanding the Dewey knowledge base. Its goal is to enhance Dewey concept trees with supplemental vocabulary and to extend these structures through associations with other subject-oriented knowledge bases. The imported terminology and other associations are then combined with the Dewey knowledge base to automatically assign subjects to electronic documents. An example of a Dewey ExTended Concept Tree is shown in fig. 3.
The ExTended Concept Trees project is largely directed toward exploiting technology to link subject-access systems like LCSH and the Library of Congress Classification with the DDC. Linking is accomplished by mining WorldCat (the OCLC Online Union Catalog) and electronic versions of other subject access systems for relationships between subject-oriented data in these files and the Dewey knowledge structure. The techniques for making these associations include use of the OCLC Scorpion system, a research prototype that employs a series of ranked retrieval databases built from the machine-readable version of DDC 21. The system generates ranked lists of Dewey numbers that function as possible subject descriptors for documents. The Scorpion databases can be accessed via a Web interface that is capable of retrieving an electronic document and generating a database query from its content.


Fig. 3 ExTended Concept Tree (Integration of DDC with Library of Congress Subject Headings: the class 001.942 Unidentified flying objects (UFOs, Flying saucers) linked to terms including Unidentified flying objects; Sightings of unidentified flying objects; Sightings and encounters (UFO phenomenon); Encounters with unidentified flying objects; UFO encounters; Close encounters of the first kind; Close encounters of the second kind; Men in black; MIB (UFO phenomenon); MIBs (UFO phenomenon); Three men (UFO phenomenon); with LCSH control numbers sh 85139679 and sh 96011191)

The following examples will illustrate. When a Web document, in this case "M.I.B. (MEN IN BLACK)" (fig. 4), is processed by the Scorpion system, results like those shown in fig. 5 are produced. The highest ranked class assigned to this document is 001.94 Mysteries (fig. 5). The Scorpion system record for this class number is shown in fig. 6. The highlighted terms indicate matches between terminology in the input document and in the record. A note at the end of the record instructs DDC users to apply this class number to items about nonastronomical extraterrestrial influences on earth. The two related class numbers, 001.942 Unidentified flying objects (UFOs, Flying saucers) and 001.944 Monsters and related phenomena, are also among the top 20 classes assigned by the system. This example illustrates the potential value of the Scorpion system to automatically generate subject-oriented metadata for electronic documents.

Fig. 4 Web Document Processed by Scorpion System

Fig. 5 Scorpion Subject Assignment Results

Fig. 6 Scorpion Record for DDC Class Number 001.94
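The idea of returning a ranked list of Dewey numbers for a document can be illustrated with a toy scorer. The sketch below is not the Scorpion algorithm, which this report does not specify; it assumes a simple term-overlap score with an inverse-frequency weight, and the four class records are hypothetical stand-ins built from captions quoted in this report rather than entries from the DDC 21 database.

```python
import math
import re
from collections import Counter

# Hypothetical miniature "class records": captions quoted in the report,
# NOT the machine-readable DDC 21 database that Scorpion uses.
CLASS_RECORDS = {
    "001.94": "Mysteries nonastronomical extraterrestrial influences on earth",
    "001.942": "Unidentified flying objects UFOs flying saucers",
    "001.944": "Monsters and related phenomena",
    "133.88": "Psychokinesis",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_classes(document, records=CLASS_RECORDS):
    """Rank class numbers by weighted term overlap with the document.

    Assumed scoring (illustrative only): each record term found in the
    document contributes its document frequency count times
    log(1 + N / df), where df is the number of records containing it.
    """
    doc_counts = Counter(tokenize(document))
    record_terms = {num: set(tokenize(text)) for num, text in records.items()}
    n_records = len(records)
    df = Counter(term for terms in record_terms.values() for term in terms)

    scores = {}
    for num, terms in record_terms.items():
        score = sum(doc_counts[t] * math.log(1 + n_records / df[t])
                    for t in terms if t in doc_counts)
        if score > 0:
            scores[num] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    text = "Reports of unidentified flying objects and flying saucers"
    for number, score in rank_classes(text):
        print(f"{score:8.2f}  {number}  {CLASS_RECORDS[number]}")
```

Because the sketch has no stop-word handling, a function word such as "and" can produce a weak spurious match, which is a small-scale version of the problem of filtering out spurious classes from a ranked result set.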

Staying with the topic Men in black, one additional example shows a technique being explored to effect automatic associations between the DDC knowledge base and other subject access systems. Since this topic corresponds to the LC subject heading Men in black (UFO phenomenon), it is possible to generate a "concept record" for the topic from information in the OCLC Authority File:

OCLC Authority Record
[Fixed fields and indicators omitted]
010 sh 96011191
040 DLC $c DLC
005 19970127145656.2
053 BF2055.M45
150 Men in black (UFO phenomenon)
450 MIB (UFO phenomenon)
450 MIBs (UFO phenomenon)
450 Three men (UFO phenomenon)
550 Life on other planets $w g
550 Men $w g
550 Unidentified flying objects $x Sightings and encounters

Concept Record
Men in black (UFO phenomenon)
MIB (UFO phenomenon)
MIBs (UFO phenomenon)
Three men (UFO phenomenon)
Life on other planets
Men
Unidentified flying objects Sightings and encounters

An HTML version of the concept record is generated and then sent in turn to the Scorpion system for processing, with the following top three classifications being returned:

Dewey Number   Caption (Heading)
001.942        Unidentified flying objects (UFOs, Flying saucers)
133.88         Psychokinesis
001.94         Mysteries

These and similar results are quite promising (the candidate DDC class paired with this heading by the Dewey editors is 001.942), but many research questions remain:
1. How should information from discrete knowledge bases be integrated?
2. What are the relationships among mapped concepts and how should they be coded?
3. How can Scorpion results sets be post-processed to filter out spurious classes and collocate valid ones?

Conclusion
This report describes several efforts to enhance the usefulness of the DDC as a knowledge structuring tool, including (1) developing custom views of the DDC, (2) transforming selected DDC captions into end-user language, and (3) linking Dewey to other subject thesauri and knowledge structures. Through these projects, OCLC research staff are helping to forge the Dewey Decimal Classification into a next generation knowledge organization tool, thereby ensuring that classifiers have advanced classifying tools and that classification-based schemes play a role in the conceptual ordering of electronic document collections.

References
Mitchell, Joan S. 1996. "The Dewey Decimal Classification at 120: Edition 21 and Beyond." Knowledge Organization and Change: Proceedings of the 4th International ISKO Conference, 15-18 July 1996. Rebecca Green, ed. Frankfurt/Main: INDEKS Verlag.
Vizine-Goetz, Diane and Joan S. Mitchell. 1996. "Dewey 2000." Annual Review of OCLC Research July 1995- Dublin, OH: OCLC Online Computer Library Center, Inc.: 16-19. Available at: http://www.oclc.org/oclc/research/publications/review95/partl/vizine.htm

ENHANCING THE INDEXING VOCABULARY OF THE DEWEY DECIMAL CLASSIFICATION

C. Jean Godby, Research Scientist

Abstract
This work explores the possibility of using unrestricted machine-readable full text to augment the indexing vocabulary of the Dewey Decimal Classification and make it more accessible to end users. It is based on the hypothesis that contemporary text is more likely to contain up-to-date vocabulary than a reference work that must undergo a rigorous editorial review process.

Introduction
Like any sophisticated reference work, the Dewey Decimal Classification (DDC) must keep pace with cultural and linguistic change. Ten years ago, there was little or no talk of repetitive strain injuries, clickable maps, screen savers, fathers' rights, inline skating, seasonal affective disorder, or hypertext markup language, but now it is difficult to read a newspaper or magazine or browse a database of World Wide Web resources without encountering a reference to one of these concepts. The task of keeping a resource like the DDC up to date is usually a manual effort, but we are currently testing the hypothesis that some of this work can be automated by extracting vocabulary from machine-readable text.
The DDC is a controlled vocabulary that is used to standardize the language for indexing or classifying a collection of works. As part of a classification scheme that is usually a published reference work maintained by subject-matter experts, editors, and lexicographers, the DDC must achieve a balance among the sometimes incompatible goals of economy, consistent editorial style, and accurate representation of the subject. The result can appear cumbersome to users who are not classification experts. For example, fig. 1 shows the DDC entry for common neuroses. Users who are interested in particular


neuroses may find the caption too broad. The index terms are more helpful but they are either incomplete or they identify idiosyncratic facets of a subject such as depression (Mental state)—social welfare.

Class Number: 362.2/5
Caption: Neuroses
Notes: Including anorexia nervosa, compulsive gambling, depression
Index Terms:
    Affective disorders
    Affective disorders—social welfare
    Anorexia nervosa
    Anorexia nervosa—social welfare
    Compulsive gambling
    Compulsive gambling—social welfare
    Depression (Mental state)
    Depression (Mental state)—social welfare
    Neuroses
    Neuroses—social welfare

Fig. 1 The DDC Entry 362.2/5

Many articles in contemporary popular magazines also discuss seasonal affective disorder, a type of mild depression caused by the short days and light deprivation commonly experienced in the fall and winter. Since this phrase can be extracted from texts about affective disorders, it might reasonably be added to the index for 362.2/5. It was missed, however, by the techniques that have been used so far to create and maintain the DDC.

This example suggests that the following steps are involved in enhancing the DDC indexes with vocabulary imported from free text.

• From machine-readable free text, identify current terms and phrases in a given subject to add to the DDC. Depending on the intended purpose of the enhanced version of the DDC, they may include technical terms, new senses of common words, company and product names, names of new inventions, or slang.
• Once extracted, attach these terms or phrases to the appropriate entries in the DDC database, perhaps in a separately indexed field that identifies them as uncontrolled vocabulary. Ideally, the processing should be automatic and general.

Importing Terminology from Free Text

The essential problem is that the words of interest in the corpus of free text cannot, by definition, be found in the schedules, notes, and indexes of the DDC. They must be added through some measure of association, such as mutual information [Church and Hanks 1990], a simple statistic used by corpus linguists to measure the degree of co-occurrence among terms in a coherent text. In this text, for example, simple inspection can verify that there is a high degree of co-occurrence among the words index, vocabulary, controlled, DDC, Dewey, and classification. These are not random words, but are instead a reasonably coherent cluster of words that give clues to the subject of the paper. In a large corpus of texts like this one, pairs of words such as controlled and vocabulary or Dewey and classification would receive high mutual information scores. Conversely, pairs of words that co-occur only accidentally in this text, such as and and problem or DDC example and purpose, would receive low mutual information scores.

Direct Associations

A computationally tractable way to obtain associations is to use vocabulary in the DDC captions or indexes as probes and identify the words and phrases in a relevant corpus of free text that are highly associated with them. For example, laptop is a term in the Relative Index of DDC 21 that can be used to access the entry 004.16, Digital Microcomputers. The "Class here" note suggests that this entry should be used to classify works on laptop, notebook, palmtop, pen, personal, and pocket computers.

A corpus of news articles about the computer industry is also likely to contain articles about laptops and words related to laptops that could be imported into the DDC. For example, consider the abstract of an article from the June 5, 1995 issue of Business Week:

    Dell Computer Corp. won a silver medal in the 1995 Industrial Design Excellence Awards competition for its design of the Dell Latitude XP family of laptop computers. In search of a machine that could keep going through an entire coast-to-coast flight, Dell utilized lithium ion batteries used in camcorders....

This text contains the related phrase laptop computers and the name of a laptop-computer manufacturer. In a sufficiently large corpus of such texts, there are many other terms that are highly associated with laptop as shown in fig. 2. (The data for this example was obtained by extracting words that appeared in the same paragraph as laptop in a 400,000-word corpus of text from ABI/Inform, a database of business, economics, and computer science articles accessible for browsing and searching through the OCLC FirstSearch service.)

subnotebooks    MIT
Wincomm         carrier
portables       monopoly
Stacker         frame relay
Norton          chips
NEC             Japan
terminal

Fig. 2 Words Associated with Laptop


Associations like these solve the problem of linking DDC vocabulary to new vocabulary in a corpus of free text, but two challenges must be met before they can be useful on a large scale. First, the words in fig. 2 include synonyms for laptop, such as subnotebooks and portables, that would be useful additions to the index of the DDC entry 004.16. They also refer to PC hardware and software and the companies that create them, a country that is the home of several PC manufacturers, and a university with a prestigious computer science department. These words constitute a semantic network that requires closer analysis, perhaps with expert human intervention, before the words would be useful as index terms that increase the accessibility of DDC's controlled vocabulary without sacrificing precision. The major problem presented by the words in fig. 2 is that, with the exception of the synonyms, the terms should be attached to DDC entries other than 004.16 because they are broader, narrower, or related. However, the solution developed so far is too impoverished to identify the appropriate DDC entries for the terms that are not synonymous with laptop.

Second, many of the DDC captions and index terms are unsuitable as probes. Not only are they occasionally written in archaic language, but they may also be part of a structure that cannot be used in its literal form. Sometimes a caption can be understood only by reference to those higher in the same hierarchy. For example, the entry 621.197 Accessories is designed to be used for the classification of works about power plants and component structures such as cooling towers and smokestacks. The caption for 621.197 is ambiguous and its meaning becomes clearer only in the context of the captions above it: 621.19 Central stations and 621.1 Steam engineering.

But even with this context, it still is not obvious how to construct a probe from the captions that could be used to find associated terms in a corpus of texts about power plants. Should it be the literal phrase accessories in central stations, or the Boolean query accessories with central stations? Or should the term accessories be ignored altogether because it has a special meaning as a controlled-vocabulary term in the DDC that deviates from the usage of civil engineers? Similar problems occur with index terms for faceted subjects, such as in the relative index phrases computers—engineering and computers—music.

These problems suggest that the task of automatically importing free text into the DDC must be restated. Not only does a given corpus of free text contain words and phrases that are not in the DDC but, because of the design and style of the Dewey schedules and indexing vocabulary, the converse is also frequently true.

Indirect Associations

These problems suggest that it is necessary to use a more indirect method to import new terms and phrases from free text into the DDC. Consider the task of associating the phrase seasonal affective disorder and the DDC caption Neuroses. These phrases cannot be associated directly because of stylistic differences between popular writing and classification schedules, but they may still have similar contexts. For seasonal affective disorder, the context is the set of paragraphs in a corpus containing this phrase or, more precisely, the words in those paragraphs that are highly associated with this phrase. For neuroses, the context is the complete text of the Dewey entry 362.2/5 (fig. 1), including the caption, index terms, and associated Library of Congress Subject Headings extracted using the processes described in Vizine-Goetz and Mitchell [1996]. When these contexts are compared, the following common words and phrases are identified: mood, therapy, depression, and psychiatry.

Cast in these terms, the problem of importing vocabulary from free text into the DDC is an information-retrieval task. The query is the new term or phrase to be imported, enhanced with the words in a representative corpus that are highly associated with it; the database is the DDC; and the result set is a ranked list of DDC entries.

To test the feasibility of this idea, we conducted a pilot study. The free text was a 6.9 million-word sample of records from OCLC NetFirst, a database of pointers to World Wide Web documents available for browsing and searching through the OCLC FirstSearch service. The sample contained approximately 53,000 abstracts about business, economics, computer science, liberal arts, and popular culture from 1995 and 1996. Using familiar shallow parsing techniques for extracting words and phrases from free text [Godby 1994], we identified 520 targets to be classified. Queries were constructed for these targets by collecting terms appearing in the same paragraph whose mutual information score with the target was higher than four. All 520 queries were submitted to the Scorpion system for classification. In the Scorpion database, terms were stemmed, and each DDC category was enhanced with the text of the categories in the same downward hierarchy.

Table 1 gives the top five categories for seasonal affective disorder retrieved by Scorpion.

Table 1 Classifications for seasonal affective disorder

616.8527    Depressive neuroses
615.716     Heart depressants
362.2/5     Neuroses
362.2       Mental and emotional illnesses and disturbances
616.8528    Neurasthenia (such as chronic fatigue syndrome)
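The information-retrieval formulation described above (query = target plus highly associated terms, database = DDC entries, result = ranked list) can be sketched as follows. This is not the Scorpion system and makes no claim about how Scorpion ranks. It assumes a simple term-overlap score, omits stemming and the hierarchy enhancement mentioned in the text, and uses the hypothetical names `build_query` and `rank_entries`. Only the threshold of four comes from the pilot study.

```python
def build_query(target, associations, threshold=4.0):
    """Combine the target phrase with its highly associated terms.

    `associations` maps a term to its mutual information score with the
    target; only terms scoring above `threshold` are kept.
    """
    query = set(target.lower().split())
    query.update(term for term, score in associations.items() if score > threshold)
    return query


def rank_entries(query, entries, top_n=5):
    """Rank classification entries by how many query terms their text shares.

    `entries` maps a class number to the text attached to that entry
    (caption, notes, index terms). Entries with no overlap are dropped.
    """
    scored = []
    for number, text in entries.items():
        overlap = len(query & set(text.lower().split()))
        if overlap:
            scored.append((overlap, number))
    # Highest overlap first; ties broken by class number for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [number for _, number in scored[:top_n]]
```

Returning the top five entries mirrors the shape of table 1. A production system would likely weight terms rather than count them, so the overlap score here should be read only as a placeholder.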


In other cases, the classifications seem to define technical terms, connect a trade or company name with a product, or identify the terms or phrases that describe new inventions or cultural trends. These effects are shown in fig. 3. All classifications are taken from the top five categories returned by the Scorpion system.

Technical terms
    voice recognition
        006.454 Speech recognition
        621.399 Devices for special computer methods
    hypertext markup language
        025.04 Information storage and retrieval systems
        004.678 Internet
    clickable maps
        912.09 Historical and persons treatment of maps and map-making
        912.014 Map reading
Proper names
    Appalachian Trail
        333.782 Wilderness areas
        796.51 Walking
    Netscape Navigator
        025.04 Information storage and retrieval systems
        004.678 Internet
    Denver Broncos
        796.3328 Variants of football
        796.3326 Specific types of American football
Cultural trends
    fathers' rights
        323.4 Specific civil rights; limitations and suspension
        306.8742 Father-child relationships
    gourmet coffee
        641.514 Gourmet cooking
        633.73 Coffee
    screen savers
        005 Computer programming, programs, data
        004.77 Output peripherals

Fig. 3 Classifications of Phrases Extracted from a Sample of NetFirst Documents

Future Work

Much work remains to transform the results of this pilot study into useful indexing vocabulary for subsequent versions of the DDC, but the preliminary results are promising. Unlike the solution based on direct association discussed earlier, the information-retrieval solution successfully places terms imported from free text in the DDC hierarchies. These results are clean enough to serve as the basis for a tool that assists the editorial staff in updating the DDC by skimming text in a given subject for new vocabulary and tentatively classifying it.

We are also actively developing fully automatic classification processes that do not depend on a human mediator to select the best classification from a list. Such processes programmatically evaluate the quality of the complete result set. For example, table 1 shows a good result set because seasonal affective disorder could be a reasonable index term for four of the five categories in the list. The categories in table 1 are cohesive in a way that we are currently formalizing.

References

Church, K. and P. Hanks. 1990. "Word Association Norms, Mutual Information and Lexicography." Computational Linguistics 16, 22-29.

Godby, J. 1995. "Two Techniques for Identifying Noun Phrases in Full Text." In Annual Review of OCLC Research 1994. Dublin, OH: OCLC Online Computer Library Center, Inc.: 28-31. Available at: http://www.oclc.org/oclc/research/publications/review94.htm

Vizine-Goetz, D. and J. Mitchell. 1996. "Dewey 2000: Cataloging Productivity Tools." In Annual Review of OCLC Research 1995. Dublin, OH: OCLC Online Computer Library Center, Inc.: 16-19. Available at: http://www.oclc.org/oclc/fp/research/dwy2000/dwy2000.htm

EVALUATING A MULTIPROCESSOR NT SERVER FOR Z39.50 USE

Thomas B. Hickey, Chief Scientist,
Richard Bennett, Senior Systems Analyst, and
Thomas L. Terrall, Senior Systems Analyst

Abstract

OCLC's FirstSearch Next Generation research project is investigating the use of Java and Microsoft's Windows NT operating system as a platform for a Java-based Z39.50 retrieval system. We have conducted a series of low-level benchmark tests of both random I/Os and socket throughput under varied parameters. The maximum disk read rate for each server is from 410 to 1,350 random reads/second for block sizes of between 32K and 4K, respectively. Each NT server can provide a connection rate of 110 transactions per second for transaction sizes up to 32K and a maximum connection rate of 190/second for smaller transaction sizes.

Approach

As PC hardware and software become more capable, PCs can support applications that previously required a mainframe or UNIX system. The computer system we have evaluated consists of two Hewlett-Packard NetServer-5/166 machines. Each has a full complement of memory


(768MB), 60GB of disk space split into three disk arrays, and four processors. The processors are 166MHz Pentiums with 1MB of cache.

The cycle speed, memory, and disk space available on these machines make them look formidable. We wanted to determine the throughput the systems can provide under a load requiring large numbers of short connections and random disk reads to rate the machines' capabilities in relation to the demands of our current systems.

The current OCLC FirstSearch service consists of several IBM AIX UNIX machines communicating with a 10 CMOS-processor IBM 9672. At this writing, FirstSearch supports nearly 1,500 simultaneous users with the expectation that usage will grow substantially over the next year.

The prototype system we are developing is quite different from FirstSearch. One of the largest differences is the movement of the user interface from our front-end machines (which currently format either TTY or HTML screens) to the user's client. This change is possible by programming the interface in Java and using the Z39.50 protocol for communications to the server. A second difference is the integration of the Z39.50 protocol directly with the retrieval engine. Both of these changes offer the possibility of substantially reducing the load on the central system while providing an enhanced interface to the user.

Another change is the use of multi-threading to provide responsiveness and protection to users within a small number of processes on the machine. Because each active user will require at least one thread on the server, the prototype is designed to process commands and close connections as quickly as possible to reduce the number of simultaneously active threads.

To supplement instrumentation within the test tools, we used the Windows NT Performance Monitor for collection of data and interpretation of the results. The NT Performance Monitor examines and reports many NT system parameters while a process runs.

Disk Access Benchmarks

Random reads

The Newton search engine uses significant disk read resources. The read tests focused on read sizes of between 4096 and 32768 bytes, which are typical of Newton.

The primary test tool was a Java program that can be configured to generate an arbitrary number of threads that do random reads from several files at once. A fixed read length was used, which is set at runtime.

Six 2GB files were typically used for testing, two on each RAID disk. The combined size of these files is large enough to ensure that only a small portion could be in memory at any one time, minimizing the impact of caching.

Read rate

Figure 1 summarizes the maximum read rate for the desired range of block sizes, as well as the related total disk throughput. The values plotted on the graph were obtained by adding the Performance Monitor maximum disk read rate values for each disk. The values reported by the Java test code are slightly higher, since they include reads from the small amount of each file that is already in memory. Since memory is 768MB, and the total file size is 12GB, the typical amount of any file in memory is about 6%. The Java code output was not reported since the memory access reads do not reflect physical disk performance.

[Figure: reads/sec and MB/sec plotted against block size, 4096 to 32768 bytes]

Fig. 1 NT Random Reads

Response to number of process threads

Driving the system at the maximum read rate requires multiple simultaneous processes. The read rate as a function of the total number of process threads is shown in fig. 2.

[Figure: read rate plotted against number of threads for block sizes 4096, 8192, 16384, and 32768]

Fig. 2 NT Disk Read Rate
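The random-read test described above can be approximated with a short sketch. The original tool was a Java program whose source is not given here. The version below is a Python stand-in that assumes the same shape: a configurable number of threads, a fixed read length chosen at run time, and reads at random offsets across several files. The names `make_test_file` and `random_read_rate` are hypothetical, and each file is assumed to be at least one block long.

```python
import os
import random
import tempfile
import threading
import time


def make_test_file(size_bytes):
    """Create a temporary file of the given size filled with random bytes."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


def random_read_rate(paths, block_size, n_threads, reads_per_thread):
    """Issue fixed-length reads at random offsets from several files.

    Returns (completed_reads, reads_per_second). Each thread opens its own
    handles so that seeks in one thread do not disturb another.
    """
    sizes = [os.path.getsize(p) for p in paths]
    counts = [0] * n_threads  # one slot per thread, so no lock is needed

    def worker(index):
        rng = random.Random(index)
        handles = [open(p, "rb") for p in paths]
        try:
            for _ in range(reads_per_thread):
                i = rng.randrange(len(handles))
                offset = rng.randrange(sizes[i] - block_size + 1)
                handles[i].seek(offset)
                if len(handles[i].read(block_size)) == block_size:
                    counts[index] += 1
        finally:
            for handle in handles:
                handle.close()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    total = sum(counts)
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, rate
```

As the text notes, the files must be much larger than memory for the result to reflect physical disk performance. With small files such as those `make_test_file` would normally produce, most reads are served from the operating system cache and the measured rate is correspondingly optimistic.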


Communications Benchmarks

The primary test tool was a Java client program that, for each transaction, connected to the server, transmitted a command, and read and verified the returned data. The server program waited for connections and started a thread for each connection. Each thread then parsed a command and returned a record of the requested length. Trials were made with both single-threaded and multi-threaded versions of the driver. The clients were run on the second NT server, on PCs, and on UNIX machines.

Connection rate with one client

The basic connection processing rate is shown in fig. 3 for one client process. The single process was capable of about 110 connections per second for small transfer sizes and a consistent five transactions per second at larger transfer sizes. The transition is at about 1,300 bytes. The server CPU utilization (fig. 4) dropped according to what was actually being accomplished. We do not have a satisfactory explanation for this abrupt transition.

[Figure: transaction rate plotted against read size, 100 to 100,000 bytes, logarithmic axes]

Fig. 3 Transaction Rate One Client Process

[Figure: resource utilization plotted against read size, 100 to 100,000 bytes]

Fig. 4 Resource Utilization One Client Process

Maximum connection rate with multiple clients

The number of concurrent client processes was increased until the addition of new clients either did not increase the transaction rate or we exhausted client CPU resources. Both the multi-threaded client and single-client processes showed similar results, and after about five processes the multi-threaded client was used exclusively. For small transaction sizes, the transaction rate peaked at about 190 connections/second and used only four client processes (fig. 5). For larger transaction sizes, a consistent 110 transactions/second was found, using 30 to 40 client threads (fig. 6). For both transaction rate plateaus, about 60% of the server CPU was used.

[Figure: maximum rate plotted against read size, 100 to 100,000 bytes]

Fig. 5 Maximum Read Rate

[Figure: maximum throughput plotted against read size, 100 to 100,000 bytes, logarithmic axes]

Fig. 6 Maximum Throughput

Conclusions

At a connection rate of 110 connections/second at 4KB/transaction, each of the servers can transmit approximately 450,000 bytes/second. One thousand concurrent users on the current FirstSearch system generate approximately 100,000 bytes/second and 25 commands/second, so the communication benchmarks correspond to about 4,000 concurrent FirstSearch users.

The random read rate of 1,000/second for 8KB blocks implies a sustained disk transfer rate of 8MB/second. A 1,000 user load on the current system is about 400 reads/second, implying that each server can do the disk I/O needed for 2,500 FirstSearch users.

Given the simplicity of the benchmarks reported here, it is fair to say that these figures are probably an overestimation of the number of users the servers can support. For example, the production search engines consume substantial numbers of cycles in addition to requiring I/O. Over the next few months, we hope to do load testing with large databases to measure performance of a system more representative of a production information retrieval server.
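The capacity figures in the Conclusions follow from simple proportional scaling, which the sketch below reproduces. All input numbers come from the text. The function name `capacity_estimates`, the dictionary keys, and the choice of 1KB = 1,024 bytes are assumptions made for illustration.

```python
def capacity_estimates():
    """Reproduce the back-of-envelope capacity figures by proportional scaling."""
    connections_per_sec = 110          # plateau for larger transaction sizes
    bytes_per_transaction = 4 * 1024   # 4KB per transaction (1KB taken as 1,024 bytes)
    server_bytes_per_sec = connections_per_sec * bytes_per_transaction

    # Load quoted for 1,000 concurrent users on the current system.
    users = 1000
    load_bytes_per_sec = 100_000
    load_commands_per_sec = 25
    load_reads_per_sec = 400

    server_reads_per_sec = 1000        # random reads/second for 8KB blocks
    block_bytes = 8 * 1024

    return {
        "server_bytes_per_sec": server_bytes_per_sec,
        "users_by_bytes": users * server_bytes_per_sec / load_bytes_per_sec,
        "users_by_commands": users * connections_per_sec / load_commands_per_sec,
        "disk_bytes_per_sec": server_reads_per_sec * block_bytes,
        "users_by_disk_reads": users * server_reads_per_sec / load_reads_per_sec,
    }
```

Scaling by bytes and scaling by commands both give somewhat more than 4,000 users, which is consistent with the rounded "about 4,000" in the text. The disk estimate comes out at exactly 2,500, making disk I/O the tighter of the two limits under these assumptions.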


FIRSTSEARCH NEXT GENERATION

ANOTHER LOOK AT FIRSTSEARCH

Thomas B. Hickey, Chief Scientist,
Jenny Colvard, Consulting Systems Analyst, and
Thomas L. Terrall, Senior Systems Analyst

Abstract

The OCLC FirstSearch service encompasses several components from the user interface to the search engine. In this project, we have tried to step back from FirstSearch and design an experimental system from scratch. Our design takes into account the current usage of FirstSearch and the changes in computing and communications that have occurred since FirstSearch was introduced.

Approach

FirstSearch has been a great success for OCLC over the last five years. Usage is now approaching 1,500 simultaneous users during peak periods and it continues to grow. FirstSearch itself has changed substantially since its introduction, in types and numbers of databases, in purchase options available to libraries, and in the interface itself, as a Web interface has been introduced and refined. [...] changes and investigating new approaches, almost as though we were starting anew.

Of course, we are not starting over. We now have a good idea of the popularity of the service, how and by whom it is used, what users want, and what we can provide. FirstSearch Next Generation is concentrating in two areas: the client interface and the retrieval server. Our goal is to develop a prototype that is both more attractive to users and more efficient than the present system.

Client Interface Software

The client is the most visible part of any retrieval system. Our goal has been to provide a powerful yet simple interface to the user. We assume that users are not information retrieval experts; they need as much guidance as possible to discover the structure and content of the databases available. At the same time, we want the interface to be as pleasant as possible to use, and not to become burdensome as users become more familiar with it. To accomplish these goals we have used several approaches:

• Search multiple databases simultaneously.
• Rely on structured subject headings and author names for browsing.
• Use a tree structure to display headings and search results.
• Offer ranked retrieval as an alternative access mechanism.
• Embed the client within a Web browser.
• Follow the Web's "point and click" interaction model as much as possible.

Figure 1 shows a typical screen browsing the Author/Subject index of all databases. The main heading Monroe County NY has been opened, showing a list of the subheadings and a summary of all records. The main heading is displayed by database. When one of the databases is selected, the first few records are displayed in the right window. These brief records can be selected and displayed in a separate Netscape window (fig. 2). Authors and subject headings in the full records are "hot" links. Selecting a link jumps the user back into the browse window at the appropriate spot for further browsing.

[Figure: screen of the prototype client, with the heading tree for Monroe County NY in the left window and brief records in the right window]

Fig. 1 Prototype Client within Netscape Navigator


ACCESSION: 31536652
AUTHOR: Erastus Darrow & Co.
TITLE: Monroe County.
MAP DATA: Scale [1:205,920]
PLACE: Rochester, N.Y. :
PUBLISHER: Erastus Darrow & Co.,
YEAR: 1997
PUB TYPE: Map
FORMAT: 1 map ; 32 x 30 cm.
NOTES: Shows post offices. Oriented with north toward the upper left.
SUBJECT: Postal service -- New York (State) -- Monroe County -- Maps.
         Monroe County (N.Y.) -- Maps.

Fig. 2 Full Record Display

In addition to viewing the full set of records after finding a subject heading, records within a database can be further categorized. Figure 3 is a close-up of a browse for United States, showing the various record types in the 6,046 records in the test WorldCat database. Users could then select a subcategory if they were interested in, for example, Music Scores, and view only those records.

[Figure: browse tree for United States with a (Summary by Database) node and breakdowns by language, by record type (Books, Maps, Sound Recordings, Serials, Manuscripts, Visual Materials, Computer Files, Music Scores), by year, and by author/title/publisher]

Fig. 3 Analysis of United States Records

Retrieval Server Software

We decided to use Java to replace much of the server software, including communications and retrieval. Java is more often associated with programming client software within Web browsers than with programming servers, but it offers several advantages for server applications:

• Reliability. The extensive type, bounds, and exception checking inherent in Java makes Java programs reliable with less effort.
• Multi-threading. Efficient operation of a server practically demands multi-threading. Java's built-in support makes this much easier to program.
• Good software libraries. Many of the communications and Internet protocol routines needed already exist for use in Java programming.
• Portability. The ability to run the same software on multiple platforms with virtually no porting overhead increases the likelihood that research prototype code can form the basis of production code.

Although Java tends to execute more slowly than equivalent C programs, its advantages outweigh this speed disadvantage in most applications.

An advantage of using Java to program the client is that the client can undertake much more of the processing needed to interact with the retrieval server. By using the ANSI/NISO Z39.50 Information Retrieval protocol, we are able to connect the client directly with the retrieval server, eliminating a great deal of protocol translation that our current system is forced to do. This requires users to have a recent version of the Web browser, but has the advantage that many interactions occur on the user's machine, and are therefore more responsive than those that require network access. We have found the Z39.50 standard remarkably complete, offering a standard protocol into which we can fit virtually all of our client/server interaction.

Currently the client is only partially multi-threaded, but the server makes extensive use of multi-threading to do concurrent searches across several databases. We have not yet run the system under a full load with large databases, so actual throughput of the server is still unknown.

Retrieval System Hardware and System Software

The platform we selected for implementing the FirstSearch Next Generation server is Microsoft Windows NT running on two Hewlett-Packard 4-processor NetServer 5/166 machines. Each server has 768MB of memory and 60GB of disk in three RAID disk arrays. We chose this configuration because we wanted to explore the possibility of supporting a large application under Windows NT. We wanted a multiprocessor system to better support concurrent threads and operations, and almost all NT software becomes available first (and sometimes only) in Intel X86 versions. This platform has worked out well,


since the major Java environments have followed the pattern of becoming available for Intel processors first.

Platform independence is one of the major benefits of using Java for the implementation language. Although we selected the HP servers with the possibility of running Java on them, we have not tailored the information server and retrieval code particularly to the platform. We did some platform-specific coding to reduce the number of concurrent threads, which our experience suggests is beneficial to other environments as well. We have actually run the retrieval software on three different platforms without modification, indeed without recompilation.

Database Structure

Since the debut of OCLC's reference products, our text retrieval engine has been an OCLC-developed system called Newton. Although we rewrote the search software, the database structure and most of the building software remain unchanged. We were able to do this because the present system offers a flexible indexing and database structure.

Currently, we have three major indexes to the data. The most extensive index consists of individual terms from all indexable fields of the records. This index is used for ranked retrieval. Subject headings and author names are also added to hierarchical indexes for browsing. For subject headings, each level of the hierarchy is defined by the logical structure set out in the MARC record, each subheading becoming a new level. For display, if there is only one child below a node in the hierarchy, it is brought up to the next level. Figure 1 shows an example: "Maps" has been added to the heading "Cities and towns" because it was the only subheading under "Cities and towns." At any level in the hierarchy, the (Summary by Database) node can be selected to get a breakdown and display of records by database.

One of the more vexing tasks facing FirstSearch users is the requirement that they first select a database. Choosing from approximately 60 databases can be difficult for even experienced searchers. Our design attempts to alleviate this problem by offering searching on groups of databases selected by subject coverage and types of resources indexed by the database. This approach has ramifications for the server, requiring merging of arbitrary collections of databases for searching and browsing. Performance may also degrade as more records are searched at a time. At this early stage of the project, we lack the experience with large databases required to determine whether these issues can be satisfactorily resolved to provide acceptable performance at reasonable cost.

Conclusion

We think that the graphical display possible using Java, Z39.50, and Newton will be an appealing and useful interface for users accessing our databases. Providing both subject heading browsing and ranked retrieval based on keywords in the records makes a powerful and natural retrieval combination. By adopting the Z39.50 protocol throughout the system, we have gained both simplicity and efficiency. In the coming months, we plan to bring up larger databases, complete user testing to improve the interface, and test the system under realistic loads.

IMAGE DESCRIPTION ON THE INTERNET

SUMMARY OF CNI/OCLC IMAGE METADATA WORKSHOP

Stuart L. Weibel, Consulting Research Scientist, and
Eric J. Miller, Associate Research Scientist

Abstract

The CNI/OCLC Image Metadata Workshop, held September 24-25, 1996, in Dublin, Ohio, was the third workshop in a series being sponsored by OCLC in conjunction with a number of partners. This workshop focused on the application of the Dublin Core element set to image resource description. A consensus was reached supporting the Dublin Core and the related Warwick Framework, both of which were products of earlier workshops, as viable models for image resource description supporting network-based discovery of images.

Seventy practitioners in the area of networked image description attended a two-day workshop September 24-25, 1996, sponsored by the Coalition for Networked Information (CNI) and the OCLC Online Computer Library Center in Dublin, Ohio. This third in the series of metadata workshops addressed the application of the Dublin Core element set to image resource description (see the Dublin Core Home page for more detailed information about this workshop and others in the series).

The two-day workshop reached consensus supporting the notion that the Dublin Core, within the context of the Warwick Framework, affords a foundation for the development of a simple resource description model to support network-based discovery of images. As Charles Rhyne, Chair of Art History at Reed College, observed:


    I was not especially surprised that we concluded that the elements needed to discover text and images on the Internet are similar. The text and the images themselves are radically different and require different types of expertise to study and interpret them, but most of the primary categories under which we classify and search for them are similar.

Given the original objective of the Dublin Core element set (to define a simple, easily understood semantic core for network resource discovery), satisfying core description requirements for both textual and visual information with a single element set is attractive indeed. The enthusiasm for settling on a single set was tempered by a strong recommendation to make the labels for existing elements more amenable to the dual purpose of text and image description.

Is an Image a Document-like-Object?

The abstraction of a document-like-object emerged in the first workshop as a way of sidestepping differences in individual notions of what constitutes a discrete object worthy of separate description. One of the first issues addressed in the Image Metadata Workshop: is an image a document-like-object or is it so different that an alternative framework for description is required?

Consensus emerged around the idea that images are not so different from the document-like-objects of the first workshop. The expectation that a set of image-specific elements (an Image Core) would emerge from the workshop gave way to the idea that the application of a slightly modified Dublin Core element set might serve as well. As Jennifer Trant, of the Arts and Humanities Data Service in the U.K., wrote after the workshop:

    That images are "document-like" was to me one of the more significant contributions of the meeting. We went into the discussion assuming that there would be an "image core," expressed as a separate box within . . . the initial model for our discussions. We emerged from our two days of discussion with only one, slightly extended, set of core elements to support the discovery process, a set which seems to me to reflect the various conceptual categories researchers bring to their search for information. These categories did not change based on the media of the information resource [visual or otherwise] that might satisfy the query. After spending so long thinking that images were "special" [to use museum-like assumptions] it was fascinating for me to have a group of image specialists say that in most content terms fixed/static/bounded images really are a lot like text-based document-like objects.

The single, slightly extended set of core elements for image discovery emerged as a set that seems to reflect the various conceptual categories researchers bring to their search for information. These categories were judged by the image specialists in attendance not to differ significantly based on the media (visual or otherwise) of the information resources that might satisfy the query.

The defining characteristic of a document-like-object is not its textual versus graphical content, but rather whether the resource is bounded, or fixed, in the sense that the resource looks the same to all users. Thus, images, movies, musical performances, speeches and other information objects that are characterized by being fixed (i.e., having identical content for each user) can also be thought of as document-like objects.

Nondocument-like-objects, on the other hand, include such resources as virtual experiences, databases (including ones that generate document-like outputs), business graphics, CAD/CAM or geographic information generated from database values, and interactive applications that might have different content for each user. In the context of image discovery, these sources do not "contain" images as much as they "generate" images. These images may be described as fixed document-like-objects, but the metadata required to describe them (the systems doing the generating) are distinct.

Consider the example of the Visible Human Project, available at: http://www.nlm.nih.gov/research/visible_human.html (described in a workshop plenary talk by Earl Henderson of the National Institutes of Health). More than a collection of fixed images, the Visible Human Project is a collection of applications unified by a data set that is nothing if not visual in character. The scope of the project itself is dynamic and evolving rapidly. The character of the visual outputs of any of the many applications growing up around this data set defy simple description and certainly are not bounded in the sense understood in this workshop. Such applications are systems, rather than collections of images.

[Photograph: Workshop Attendees]


Model for Metadata

Much of the consensus-building surrounding the Dublin Core has involved accommodating stakeholder pragmatic concerns arising from experience with legacy description models. It is helpful to have a conceptual model to guide this pragmatism, and just such a model developed in the course of the workshop. This model is an outgrowth of previous work of Bearman (see Metadata Requirements for Evidence) and provides conceptual support for both the Dublin Core and Warwick Framework by illustrating the transactional relationship between metadata and the research process.

Research can be thought of as a series of interactive processes, which can provisionally be described as including:
• Discovery: the identification of relevant resources
• Retrieval: the transfer of resources to a local site
• Collation: the aggregation and organization of selected resources
• Analysis: the intellectual and/or computational analysis of resources
• Re-presentation: the formulation of derivative intellectual artifacts based on the resources and previous processes in the sequence

These processes involve events and resources distributed among institutions, machines, networks, and the minds of individuals. Metadata, then, becomes any one set of elements drawn from the many kinds of information necessary for decision-making within this matrix of minds, machines, and networks.

For example, access to discovery metadata may lead to the return of terms and conditions elements necessary for retrieval. Retrieval metadata might include the network address of a resolver from which the resource may be accessed or the publisher of an item with whom a usage agreement must be transacted. Collation metadata might include data about an image collection schema or the provenance of an item. Analysis might require a color map for the item. Re-presentation could involve information validating credit to rights holders, and might well require a link to update use history of the source object.

A variety of metadata will be needed to satisfy the requirements of each stage, and hence the functional requirements of metadata packages might be defined by these requirements. To be used effectively, elements of metadata must be readily available as required by each stage in which the user is engaged (though different implementations might deliver some metadata at stages prior to its being needed). It is recognized that the pragmatics of collection and management of metadata will likely compromise this ideal, but the model can nonetheless inform our thinking and design.

One need not imagine all possible linkages to recognize the complexity of such a model, nor is it necessary to accommodate at the outset all possible elements, packages, and necessary infrastructure. But in the search for appropriate compromises, it is helpful to see the larger picture that this model attempts to capture. What is necessary, though, is an agreement as to the notion of assembling sets of descriptive elements, which enables extensibility and forward compatibility.

Like the Warwick Framework, this model explicitly recognizes that metadata will be created and managed by a variety of agents, for different reasons, at different times in the life of the object. This implies an infrastructure and architecture that does not now exist, but that will evolve, driven by the marketplace. The modest achievement of this workshop is to reaffirm the semantic characteristics of a single variety of metadata package (the core elements of a resource discovery element set) and to assert its suitability for both textual and visual resources.

How Are Images Different?

It is gratifying that the workshop reached agreement that text and images could be classified using similar categories, but just as clearly, images offer a number of technological and descriptive challenges peculiar to themselves. Textual materials can be indexed, often simplifying or partially automating the task of description, whereas most of the descriptive elements of images are extrinsic to the work (or are not easily extracted from the work).

Encoding schemes are critical for using images. This can be true for textual materials as well, but there are fewer varieties of textual representation, and at least for some of them, there is some graceful failure (HTML or SGML, for example, are hard, but possible to read without a suitable rendering program).

Rendering images is unforgiving and the variant forms are combinatorially overwhelming. Commonly encountered Web graphics display by default and presumably tolerate wide differences in display characteristics. As more sophisticated imaging applications populate the Internet, metadata will play a more important role in discovery and selection. Information necessary for rendering may include:
• Type (bit-mapped, vector, video)
• Format (TIFF, GIF, JFIF, PICT, PCD, Photoshop, EPS, CGM, TGA . . .)
• Compression schemes and ratios (JPEG, LZW, QuickTime . . .)
• Dimensions
• Dynamic range
• Color lookup tables
• Related metrics (CMYK, RGB . . .)
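
The rendering information listed above could be carried in a small structured record. The following Python sketch is an illustration only: the class name RenderingInfo, its field names, the choice of units, and the sample values are all hypothetical and are not part of the workshop report or of the Dublin Core.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RenderingInfo:
    """Hypothetical container for the rendering-related items in the list above."""

    image_type: str                                 # "bit-mapped", "vector", or "video"
    file_format: str                                # e.g. "TIFF", "GIF", "JFIF"
    compression: Optional[str] = None               # e.g. "JPEG", "LZW"
    compression_ratio: Optional[float] = None       # assumed: original size / stored size
    dimensions: Optional[Tuple[int, int]] = None    # assumed: (width, height) in pixels
    dynamic_range_bits: Optional[int] = None        # assumed: bits per channel
    color_lookup_table: Optional[str] = None        # assumed: name or location of a palette
    color_metric: Optional[str] = None              # e.g. "CMYK", "RGB"

    def missing(self) -> List[str]:
        """Return the names of optional items that have not been recorded."""
        return [name for name, value in vars(self).items() if value is None]


# Invented sample values, for illustration only.
scan = RenderingInfo(
    image_type="bit-mapped",
    file_format="TIFF",
    compression="LZW",
    dimensions=(640, 480),
)
print(scan.missing())
```

Only the type and format are required in this sketch; everything else is optional, which mirrors the point that a casual user may need far less of this information than an archival site. The missing() helper simply reports which items a description has left blank.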


Characteristics of original image capture, while less critical for the casual user, may be of overwhelming importance to the archival or research significance of the image or collection. This sort of information is also, for the most part, irrecoverable if not recorded at the time of capture. Categories of information about the scanning process include:
• Light source (full spectrum or infrared, for example)
• Resolution
• Dynamic range
• Type of scanner
• Date of scan
• Journal/audit trails
• Digital signatures for authentication

Variant forms of the image content are also important. Identifying and managing versions is a problem with all content; the number of variant forms of a given image is likely to be particularly problematic.
• Source image
• Different views of the same object
• Different scans of the same object
• Different resolutions of the same image
• Details of the same image, source ID
• Responsible institution

All these categories may be critical elements of metadata for a particular image or collection.

The complexity of adequately capturing and encoding such information conflicts with one of the original design goals of the Dublin Core: simplicity. If the Dublin Core is to be applied in some useful way to the domain of images, it will be necessary to isolate the essential core of information appropriate to a simple description record and to identify a graceful extension mechanism that supports encoding of the richer array of descriptors hinted at in the preceding paragraphs.

Modifications to the Dublin Core

The workshop consensus and subsequent animated discussions on the META2 list (the primary forum for discussion of Dublin Core issues) resulted in several changes to the Dublin Core element set (table 1). Several element names were modified slightly to make them less text-centric, and 2 elements were added to the original 13. The reference description of the elements resides at http://purl.org/metadata/dublin_core_elements.

Table 1. Abbreviated Description of Dublin Core Elements

1. TITLE: The name given to the resource by the CREATOR or PUBLISHER.
2. AUTHOR OR CREATOR: The person(s) or organization(s) primarily responsible for the intellectual content of the resource.
3. SUBJECT AND KEYWORDS: The topic of the resource, or keywords, phrases, or classification descriptors that describe the subject or content of the resource.
4. DESCRIPTION: A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.
5. PUBLISHER: The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity.
6. OTHER CONTRIBUTORS: Person(s) or organization(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element.
7. DATE: The date the resource was made available in its present form.
8. RESOURCE TYPE: The category of the resource, such as home page, novel, poem, working paper, technical report, essay, dictionary. It is expected that RESOURCE TYPE will be chosen from an enumerated list of types.
9. FORMAT: The data representation of the resource, such as text/HTML, ASCII, Postscript file, executable application, or JPEG image. FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types).
10. RESOURCE IDENTIFIER: String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented).
11. SOURCE: The work, either print or electronic, from which this resource is derived, if applicable.
12. LANGUAGE: Language(s) of the intellectual content of the resource.
13. RELATION: Relationship to other resources. Formal specification of RELATION is currently under development.
14. COVERAGE: The spatial locations and temporal durations characteristic of the resource. Formal specification of COVERAGE is currently under development.
15. RIGHTS MANAGEMENT: The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way.
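
To make the shape of a simple description record concrete, the fifteen elements of table 1 can be sketched as a small data structure. This Python sketch is illustrative only and is not a syntax defined by the workshops: the class name DublinCoreRecord, the attribute names, the decision about which elements are repeatable, and the sample values (including the example.org address) are assumptions made for this example. Only the element labels are taken from table 1.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional, Tuple


def _single(label: str):
    """Single-valued element tagged with its table 1 label."""
    return field(default=None, metadata={"label": label})


def _repeatable(label: str):
    """Repeatable element tagged with its table 1 label (repeatability is assumed)."""
    return field(default_factory=list, metadata={"label": label})


@dataclass
class DublinCoreRecord:
    """Hypothetical in-memory form of the 15 elements in table 1."""

    title: Optional[str] = _single("TITLE")
    creator: List[str] = _repeatable("AUTHOR OR CREATOR")
    subject: List[str] = _repeatable("SUBJECT AND KEYWORDS")
    description: Optional[str] = _single("DESCRIPTION")
    publisher: Optional[str] = _single("PUBLISHER")
    contributors: List[str] = _repeatable("OTHER CONTRIBUTORS")
    date: Optional[str] = _single("DATE")
    resource_type: Optional[str] = _single("RESOURCE TYPE")
    media_format: Optional[str] = _single("FORMAT")
    identifier: Optional[str] = _single("RESOURCE IDENTIFIER")
    source: Optional[str] = _single("SOURCE")
    language: List[str] = _repeatable("LANGUAGE")
    relation: Optional[str] = _single("RELATION")
    coverage: Optional[str] = _single("COVERAGE")
    rights: Optional[str] = _single("RIGHTS MANAGEMENT")

    def pairs(self) -> List[Tuple[str, str]]:
        """List (label, value) pairs for every element that has been filled in."""
        out: List[Tuple[str, str]] = []
        for f in fields(self):
            value = getattr(self, f.name)
            values = value if isinstance(value, list) else [value]
            out.extend((f.metadata["label"], v) for v in values if v)
        return out


# Invented sample record, for illustration only.
record = DublinCoreRecord(
    title="Example scanned photograph",
    subject=["harbor", "sailing ships"],
    media_format="image/jpeg",
    identifier="http://example.org/images/42",
)
for label, value in record.pairs():
    print(f"{label}: {value}")
```

Every element is optional in this sketch, so a record can be as sparse or as full as the describer chooses, which is in keeping with the design goal of simplicity.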


Subject and description separated

SUBJECT and DESCRIPTION are now separate elements in the core, partly because of the judgment among the image specialists that these are quite distinct concepts for images. Other participants in the metadata discussions on META2 agreed that such a distinction is also useful for other media. Thus, SUBJECT is intended to include keywords, controlled vocabulary terms, and formal classification designators, while DESCRIPTION is to be used for descriptive prose or content description (in the case of images) and affords a natural place for abstracts in the case of textual documents.

Rights-management field

A simple rights-management field is perceived by many as a necessary component of a core description record. While arguably not an intrinsic dimension of discovery, it is of such importance to the use of images that failure to include such an element may hinder wide deployment. This is a good example of the imprecise lines of demarcation between different varieties of metadata that inevitably will blur the idealized functional boundaries one might hope for among metadata packages. Resource description is a messy business; ask any cataloger.

The digital world requires a sophisticated language for expression and negotiation of intellectual property rights; the evolution of the supporting infrastructure is well under way. This element should not be construed as a substitute for such a language or metadata structure, but rather as a means for communicating simple terms and conditions where they exist or providing a link to more complex information as it evolves.

One proposed application of the field is:
1. Null: there may or may not be restrictions on use, and users must figure it out independently, outside the context of this particular collection of metadata.
2. The string "No Restrictions on Reuse": there are no restrictions on re-use.
3. URL or other pointer: there are restrictions on use, and users can follow the link to find out more information.

This approach addresses several implementation issues. The metadata could be used to retrieve materials with no restrictions on use at a top-level search, without getting into any subsidiary packages of metadata. Second-level packages of rights-management metadata could be retrieved automatically or presented as links within the search results. All records in a single collection could share a single value in the rights-management field. Additionally, managers who do not fill in the rights-management field, or who have rights issues but have no online access to that information, enjoy the presumption that a null response means there might be restrictions.

Open Issues

Surrogates and objects

Among the most important impediments to coherent deployment of a metadata element set is the confusion between description of the object versus description of the digital surrogate of that object. This can be a problem with text, but in general, it is the intellectual content rather than its presentation that is of primary importance with text, and increasingly the primary version of a text is its electronic form.

With images, the variety of forms an image may assume in its life cycle is liable to be greater than for a piece of text, and the relationships among these versions are intrinsically more complex. The degree to which such information is captured and the means of encoding it are difficult problems whose solutions must evolve in tandem with the pragmatics of implementation.

Collection versus item-level description

Collection descriptions and the schemas that account for the aggregation of images in such collections are essential for effective collection discovery. Early discussions embraced the possibility of a separate element for addressing this, though ultimately the consensus was to capture this sort of information in existing fields. The simplest possibility is to include a Resource Type or RELATION flag (COLLECTION | ITEM). Further explorations are necessary to determine whether this is sufficient or whether there might be other sensible values for this subelement. This is part of the larger relation problem, which requires elaboration for visual and textual materials alike.

SOURCE dangerously recursive

SOURCE information is potentially recursive (and probably complex) for any object, but especially with images. How can such object-surrogate-derivative relationships be expressed both to aid in discovery and to explicate intellectual property lineage?

Mapping the Dublin Core to other element sets

One of the first tangible outcomes of the first metadata meeting was mapping Dublin Core elements to MARC fields by Rebecca Guenther of the Library of Congress. Her discussion paper contributed substantively to the community awareness of the emergence of the Dublin Core as a model for network resource description and fed back into the change process for MARC. Similar mapping between Dublin Core elements and existing image description standards will clarify the role of the various elements and provide guidance for applying the Dublin Core to image collections. As suggested on the workshop discussion list, existing standards or practices can serve as


templates for developing guidelines, thereby promoting interoperability and reducing the effort required to develop description standards.

Viewing requirements

The bandwidth and time penalties for retrieving images are often high, making it desirable to have some indication of usability prior to retrieval. It was agreed that an existing element (the FORMAT element, previously called FORM) could probably be used to express this information, but a standard of best practice needs to evolve through real-world implementation. The most problematic aspect of this issue is where to stop. An archival site using the Dublin Core to describe items at a deep level might want to include a large set of image-related descriptors but this would hardly be expected to be the norm for broad deployment. A flexible means for including such information in the FORMAT element should be proposed in user guidelines, with an eye toward the evolution of a Warwick Framework-style package to support such needs in the future.

Paul Evan Peters
December 12, 1948 - November 18, 1996

The OCLC Office of Research lost a good friend in November 1996, with the death of Paul Evan Peters, 48, Executive Director of the Coalition for Networked Information. He became involved with the Office of Research in early 1996 when we decided to hold the CNI/OCLC Metadata Workshop September 24-25, 1996, at OCLC. While working with Paul on this workshop, we grew to know a man whose breadth of vision for the future of networked resources excited us. He will be missed.

Following is the announcement of Paul's passing made by the Coalition for Networked Information:

    We are very saddened to report the death of Paul Evan Peters, 48, Executive Director of the Coalition for Networked Information. He died suddenly on November 18, 1996 while he walked on a beach with his wife while on a trip to Florida.

    Paul was the founding director of the Coalition for Networked Information and served as its head since March, 1990. Highly respected in the library, information technology, and scholarly communities, he sought common ground for many constituencies in order to develop global networked information resources. A true imagineer, his vision and his ability to pull people together to build new realities were unique.

    Paul led CNI through two cycles of formal evaluations by the sponsoring organizations and as recently as September saw it move from the status of a sunset enterprise to one of an ongoing nature, recognizing the achievement of its essential role in the North American dialogue to advance scholarship and intellectual productivity.

    Before founding the Coalition in March 1990, Paul was Systems Coordinator at the New York Public Library from 1987 through 1989, and was Assistant University Librarian for Systems at Columbia University, where he also earned a master's degree in sociology in 1986. From 1970 until 1978, Paul was a principal in a variety of research and development projects and he earned a master's degree in library and information science at the University of Pittsburgh. Paul worked briefly as a Retail Systems Engineer for the National Cash Register Corporation immediately following the completion of his undergraduate studies in computer science and philosophy at the University of Dayton in 1969. Paul was a former president of the Library and Information Technology Association, was a former chair of the National Information Standards Organization, and served on the editorial boards of a number of networking, networked information, and library technology journals. He also served on the Council of the American Library Association.

    He is survived by his wife Rosemarie Kozdron, his parents Austin and Mary Peters, and a brother Philip Peters.

Future Developments

The Dublin Core is a high-level reference model. In and of itself, it does not provide guidance for cataloging or searching, nor is it a blueprint for system development. Rather, it provides guidance for the semantic content of a simple resource description model that may profitably be applied to visual as well as textual resources. The consensus developed around this model is the major product of the workshops. This consensus is the achievement of many prominent people who have used their own experiences and the collective intelligence of the communities they represent to arrive at a common foundation for networked resource description.

These workshops are but tentative steps in bringing this collective intelligence to bear on the difficult problem of resource discovery on the Internet. The consensus is important but incomplete. It requires integration of detail, elaboration and extension, and building a small community into a larger one. Most importantly, it requires sharing this vision with the system designers, authors, and information managers who must, through application and use, turn the model into applications that will help real users solve real problems.

The path forward will be charted through the collective means of the workshop mailing lists and subsequent workshops that will refine and elaborate this work-in-progress (see http://www.dstc.edu.au/DC4/ for information on the upcoming fourth Dublin Core Workshop). Slightly less than two years after the original workshop, prototype applications of the Dublin Core are emerging, and recognition of the Dublin Core as a


foundation for discovery-oriented resource description is growing. Building on these prototypes and refining this consensus will provide the foundation for a networkwide body of practice that can help rationalize resource description across domains and make Internet resources more accessible.

References

Bearman, David and Ken Shoats. Metadata Requirements for Evidence. Available at: http://www.sis.pitt.edu/~nhprc/BACartic.html

Burnard, Lou, Eric Miller, Liam Quin, and C.M. Sperberg-McQueen. A Syntax for Dublin Core Metadata: Recommendations from the Second Metadata Workshop. Available at: http://purl.oclc.org/net/eric/DC/syntax/metadata.syntax.html

Dempsey, Lorcan and Stuart L. Weibel. 1996. "The Warwick Metadata Workshop: A Framework for the Deployment of Resource Description." D-Lib Magazine, July 1996. Available at: http://www.dlib.org/dlib/july96/07weibel.html

The Dublin Core Home page. The definitive source of information on historical and current developments concerning the Dublin Core and related metadata initiatives. Available at: http://purl.org/metadata/dublin_core

Knight, Jon and Martin Hamilton. Dublin Core Qualifiers. Available at: http://www.roads.lut.ac.uk/Metadata/DC-SubElements.html

Lagoze, Carl, Clifford A. Lynch, and Ron Daniel, Jr. 1996. "The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata." D-Lib Magazine, July 1996. Available at: http://www.dlib.org/dlib/july96/lagoze/07lagoze.html

The META2 Mailing List, [email protected]. The primary mailing list forum for the discussion of Dublin Core issues. To subscribe, send a message to [email protected] with a body of: subscribe meta2 your-name-here

Miller, Paul. 1996. "Metadata for the Masses." Ariadne, September. Available at: http://www.ukoln.ac.uk/ariadne/issue5/metadata-masses/

A Proposed Convention for Embedding Metadata in HTML. A position paper from the May 1996 W3C Workshop on Distributed Indexing and Searching. Available at: http://www.oclc.org:5046/~weibel/html-meta.html

KILROY
AN INTERNET RESEARCH PROJECT

Keith E. Shafer, Senior Research Scientist

Abstract

Kilroy is an OCLC research project that is building an Internet harvester, full-text databases, and metadata databases of Internet resources. Via Kilroy, we hope to study several aspects of the changing state of Internet resources including end-user access to Internet search services, link analysis, and the automatic generation of metadata.

The past few years have seen an explosive growth of free Internet search services. These services provide extremely useful tools to the information community by making the vast amounts of information on the Internet freely searchable. As a result, there is increased interest in understanding the quality of information available on the Internet, how it is structured, and how it is being used.

The purpose of the Kilroy research project is to build an Internet database similar to those available from current free Internet search services. Via Kilroy, we hope to study many aspects of Internet resources including link analysis, the changing use of metadata, the developing use of the Dublin Core, and the changing state of Internet resource content.

It is a significant task to harvest, maintain, and make available a large Internet database. For instance, imagine a system that could harvest and process a document a second. At this rate, it would take nearly a year to build a 30,000,000 record database. By building a system to support this kind of database, OCLC staff will be able to understand the costs associated with maintaining a general Internet database in addition to the OCLC NetFirst database.

Several research projects at OCLC require the harvesting of documents from the Internet. Kilroy will contain the infrastructure to provide this information quickly and centralize the tools to maintain and collect this raw information. For instance, we expect to use Kilroy as a means of evaluating Scorpion, available at: http://purl.oclc.org/scorpion. By collecting large resource samples and creating metadata via Scorpion, we will be able to analyze how well automatically assigned subjects would help Internet search services. The Kilroy database could also be used as a large test collection for FirstSearch the Next Generation.

Once we have a large Kilroy database, we may open Kilroy access to a restricted set of end users. Doing so would let us evaluate the types of queries users issue to other free search services and the responses they get. This information would help OCLC plan future Internet database offerings and benefit our patrons since this information is currently not shared by the free search services. More importantly, we hope to be able to


understand better how library science ideas like controlled vocabulary and metadata could be efficiently applied to such a rapidly changing data collection.

[Photograph: Kilroy project participants: Jonathan R. Fausey, Vincent M. Tkac, and Roger Thompson]

A METALANGUAGE FOR DESCRIBING INTERNET RESOURCES

C. Jean Godby, Research Scientist, and Eric J. Miller, Associate Research Scientist

Abstract

This paper describes a tool that can be used to create customized descriptions of Internet resources. When these descriptions adhere to established standards such as MARC, TEI, and FGDC, they can be collected into databases and searched with a common, easy-to-use interface using the OCLC Spectrum System.

Scholarly Information on the World Wide Web

Scholars and professional information providers are beginning to address the problem of creating descriptions and catalogs that identify electronic resources accessible on the Internet. In many cases, these resources consist of important primary data, theses and dissertations, computer software, prepublication drafts of papers, or electronic versions of rare or difficult-to-access books and manuscripts. Though they may be included in Web indexes such as Lycos or Yahoo, they are inadequately described by the automated methods used by these systems to identify Internet resources. As a result, they are lost in noise, effectively inaccessible to all but the most patient searcher. Until automatically generated indexes can provide a clear idea of the logical structure, quality, and subject of the work, they will not replace indexes created by human experts.

Current options

When information providers create descriptions of an electronic resource, they have two choices. They can use the standard formats for the creation of descriptive records, or metadata that is optimized for the precise characterization of the unique data under their control. For example, if they are describing texts that might be of interest to linguists, they might use the Text Encoding Initiative [McQueen and Burnard 1994], which contains guidelines for transcribing conversations and for recording the social context of an utterance. If they are describing maps, they might use the Federal Geographic Data Committee [FGDC 1994] standard, which has guidelines for coordinating geospatial data with maps. If they are describing a resource that is traditionally handled by libraries, they might use the MARC [MARC 1994] standard. An alternative is to register the resource with an Internet directory service such as Yahoo, Excite, or AltaVista.

These alternatives have tradeoffs. Metadata standards that have been developed to serve the needs of particular scholarly communities can produce the best descriptions, but software that manipulates these formats is expensive to develop and not freely available. Moreover, users are required to have extensive knowledge of these formats if they have a multidisciplinary topic. For example, a user who is interested in the geographic distribution of a dialect feature might search databases encoded in the TEI and FGDC formats, both of which have many pages of documentation and are still evolving. On the other hand, Internet directory software is accessible and easy to use, but it cannot be used to create descriptions that are detailed enough to support scholarly inquiry.

For example, consider the description in the appendix to this report. It is suitable for encoding in FGDC for a Web site describing a research project sponsored by the United States Geological Survey that analyzes the distribution and characteristics of polar sea ice in a specified latitude/longitude range. The Web site has pointers to technical reports as well as raw geospatial data. Some of the technical reports that are available on the Internet are based on conventionally published papers.

What happens when someone tries to register this resource with Yahoo? To create a description that has a chance of being found again by the target audience, it is necessary to supply the publisher, date, geographic coverage, and relationships among the raw data and the technical reports. Since this information must go in Yahoo's unstructured Comments field, there is no guarantee that the description will be in a standardized form that would enable an automatic process to group similar records.

Once created, the description must be placed in Yahoo's subject hierarchy. Although Yahoo has several categories for geography, the classification scheme is alien to a professional geographer, so it is unclear where the

new record belongs. To find a suitable location, it is necessary to scroll to the end of hierarchies such as science/geography/maps/institutes or regional/regions/arctic to find similar objects, essentially searching a subtree that is constantly growing. The user must repeat much of this process to find the resource. As long as the number of resources in a subtree is small, the tasks of registering and finding Internet resources may be manageable, but it is an open question how well this scheme works on a larger scale.

New tools
An improved tool is needed to create and access structured records that are now available only in the standard metadata formats. The tool must be as easy to use as the current Internet directory services. To do this, it is necessary to define a bridge across these standards, acknowledging that it is neither practical nor desirable to unify them under a single model. This involves extracting semantic overlap, especially in fields that identify, classify, and point to the location of a resource. Once extracted, a core record could be mapped or linked to FGDC, MARC, or TEI records.

A computer system that is designed around the alternative view of metadata might serve the scholarly community better than the current Internet registry services, with only a slight increase in user-apparent complexity. To register the research project described in the appendix, the geographer would fill out a form based on the core record that asks for author, title, publisher, date, subject, electronic location, and other information. Optionally, the geographer would indicate, in a simple scripting language, that this record has an FGDC flavor that interprets the subject as an FGDC subject and requires FGDC's field for recording latitude and longitude. This information would be used to update two cross-linked databases: one of simple core records and one of FGDC records. The user would search for the record through an easy-to-use interface to the database of core records, with the option of linking to the FGDC database for additional detail. With a system like this, it would be possible to enlist the power of established metadata standards to create more precise descriptions of Internet resources than those currently available, while hiding the complexity of these standards from the casual user.

The Spectrum Solution
Our implementation is an extension and generalization of the OCLC Spectrum System. As described in detail elsewhere [Vizine-Goetz et al. 1995], the Spectrum System allows users to register and describe a resource for inclusion in a Web-accessible database of Internet resources. The Spectrum System has three major components: a record-creation subsystem, a record-conversion subsystem, and a record-retrieval subsystem (fig. 1). Interacting with a series of HTML forms, the user creates a simple but useful description that is based on the Dublin Core [Weibel et al. 1995a]. The record-conversion subsystem converts this record to the MARC format and creates a database of records in a compatible format. At the user's request, the Spectrum System can also convert the input record to the TEI format. The record-retrieval subsystem presents an HTML interface to a database that is accessible from the Internet via OCLC WebZ [Weibel et al. 1995b], an HTTPD server that maintains a database session

Fig. 1 Spectrum System Architecture (components shown: Web browser for record entry and record display; NCSA HTTPD server; CGI script accessing OCLC's SGML Document Grammar Builder; OCLC's WebZ server; database preparation with preprocessing utilities and build utilities; Newton database; record retrieval)


and bridges the gap between the HTTP protocol and the Z39.50 information retrieval protocol.

Spectrum has two design features that make it suitable as a basis for this work. First, all components except the user interface are written using industry-wide standards and software that is available for license or purchase, primarily OCLC SiteSearch software¹. An immediately apparent result of Spectrum's design is that the database of Internet resources created by the interaction with the user interface contains sophisticated structured records that support highly specific queries. For example, a user can request all electronic texts about Shakespeare written in English, French, or German but not Portuguese, accessible by FTP, with dates no later than 1994. This degree of specificity is beyond the scope of popular Internet searching tools [Courtois et al. 1995].

Second, the Spectrum System uses OCLC Document Grammar Builder² software to map from the Spectrum input record to the TEI and MARC formats. Since the intellectual mappings among the record types are recorded in the Document Grammar Builder's fourth-generation scripting language, the interoperability that already exists between the Spectrum input record and TEI and MARC records is easily extensible to other metadata standards. The mappings can also be changed as the standards evolve, without changing Spectrum's source code.

Spectrum's design, however, has some limitations. The most important problem is that the user is limited to a single input record. The Spectrum data entry form was designed to gather information that could be mapped to a minimal record in several metadata standards, using the Dublin Core as a starting point [Vizine-Goetz et al. 1995]. Though this form may be valuable to some user communities, the geographer who wishes to register the research on polar sea ice will have problems similar to those encountered in the attempt to register the resource on Yahoo. The Spectrum record cannot indicate unambiguously that the subject is from a classification scheme used by geographers. The Spectrum record gives no place to record the geographic coordinates that precisely identify the distribution of the data. As a result of the fixed input record, Spectrum achieves only limited interoperability among metadata standards.

Generalization of the Spectrum System
In the revised Spectrum system, the backend processes map among metadata standards and build a Web-accessible database (fig. 2).

Instead of a fixed record-entry form, the user interface is built with a simple but extensible metalanguage coded in SGML that generates HTML markup appropriate for entering and submitting descriptive records. This language is the Spectrum Cataloging Markup Language (SCML). It is flexible enough to allow user communities to build customized HTML interfaces. Though SCML currently generates only HTML, it could, in principle, be extended to generate scripts in Java and Visual Basic.

Trivially, SCML can be used to generate Spectrum's original record-entry form. For example, the Author field can be created with the code fragment in fig. 3. After this fragment is processed with a CGI script, the result is the HTML form. If a user creates a record by entering names in the HTML text field, Spectrum's CGI scripts produce the SGML record.
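The round trip described here (a field declaration becomes an HTML form control, and the submitted value becomes an SGML element) can be sketched as follows. This is a minimal illustration in Python, not SCML itself: the `FieldSpec` class, the `render_html_field` and `to_sgml_record` functions, and the element names are hypothetical, and Spectrum's actual SCML syntax and CGI scripts may differ.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class FieldSpec:
    """Hypothetical stand-in for one field declaration in the form language."""
    name: str        # form control name, e.g. "author"
    label: str       # text shown next to the control
    element: str     # SGML element emitted for the submitted value
    size: int = 40   # width of the text box


def render_html_field(spec: FieldSpec) -> str:
    """Generate the HTML form markup for one declared field."""
    return (f'{escape(spec.label)}: '
            f'<input type="text" name="{escape(spec.name)}" size="{spec.size}">')


def to_sgml_record(specs, form_values) -> str:
    """Turn submitted form values into a simple SGML record."""
    lines = ["<record>"]
    for spec in specs:
        value = form_values.get(spec.name, "").strip()
        if value:  # fields the user left blank are omitted
            lines.append(f"  <{spec.element}>{escape(value)}</{spec.element}>")
    lines.append("</record>")
    return "\n".join(lines)


author = FieldSpec(name="author", label="Author", element="author")
print(render_html_field(author))
print(to_sgml_record([author], {"author": "Schweitzer, Peter N."}))
```

Keeping the declaration separate from the two generators is what would let a user community add a field, such as geographic coordinates, without touching the generating code.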

Fig. 2 System Design for a Generalized Spectrum (components shown: Spectrum back-end CGI scripts; Spectrum API; Document Grammar Builder engine; database building; database creation)
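The backend mapping step can be pictured as a set of declarative tables, one per target standard, so that a mapping changes when a table changes rather than when program code changes. The sketch below is an illustration in Python under stated assumptions: the core field names, the `MAPPINGS` table, and the `convert` function are hypothetical, and the target labels are illustrative rather than a verified crosswalk. The production mappings are recorded in Document Grammar Builder scripts, not in Python.

```python
# Core field -> target field, one table per metadata standard.
# These tables are illustrative assumptions, not a verified crosswalk.
MAPPINGS = {
    "MARC": {"author": "100", "title": "245", "publisher": "260",
             "location": "856"},
    "TEI": {"author": "author", "title": "title", "publisher": "publisher"},
    "FGDC": {"author": "Originator", "title": "Title",
             "publisher": "Publisher", "location": "Online_Linkage",
             "west": "West_Bounding_Coordinate",
             "east": "East_Bounding_Coordinate"},
}


def convert(core_record: dict, standard: str) -> dict:
    """Map a core record into one target standard.

    Fields with no entry in the target's table are dropped.
    """
    table = MAPPINGS[standard]
    return {table[field]: value
            for field, value in core_record.items()
            if field in table}


# Values taken from the appendix record; the signed-degree encoding of
# the coordinates is an assumption made for this sketch.
core = {
    "author": "Schweitzer, Peter N.",
    "title": "Modern Average Global Sea-Surface Temperature",
    "publisher": "U.S. Geological Survey",
    "west": "-180",
    "east": "180",
}

for standard in MAPPINGS:
    print(standard, convert(core, standard))
```

In this sketch the coordinates survive only in the FGDC output, which mirrors the point that a single fixed record cannot carry community-specific detail.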


Fig. 3 SCML, HTML, and SGML Code for Spectrum's Author Field (panels: SCML Fragment; HTML Fragment; SGML Code, the SGML definition of the TEI author field; sample author value shown: Jean Godby and Eric Miller)

The HTML fragment is created by using two tags in [...] is renamed to the Dublin Core [...].

The generalized version of Spectrum achieves some measure of interoperability among metadata standards while hiding its complexity from casual users. Because the records in the Spectrum system can support descriptions


that are deemed adequate by the scholarly communities that generate them, users can expect search results of higher quality than those obtained from the current generation of Internet search services.

Appendix
Description of a Research Project on Polar Sea Ice

Keyword: Sea Surface Temperatures
Title: Modern Average Global Sea-Surface Temperature
Author: Schweitzer, Peter N.
Author's email address: [email protected]
Publisher: U.S. Geological Survey
Publication Date: 1993
Object Type: technical report
Object Form (IMT): text/html
Electronic Location: http://geochange.er.usgs.gov/pub/magsst/magsst.html
Relationship (child of): http://geochange.er.usgs.gov/pub/info/holdings.html
Relationship (sibling of): file://geochange.er.usgs.gov/pub/sea_ice/README.html
Source (book): NOAA Advanced Very High Resolution Radiometer Multichannel Sea Surface Temperature data set produced by the University of Miami/Rosenstiel School of Marine and Atmospheric Science
  Originator: Jet Propulsion Laboratory
  Date: 1991
Coverage:
  Keywords: North Atlantic Ocean, South Atlantic Ocean, Indian Ocean, Pacific Ocean, Mediterranean Sea
  Spatial:
    West: Orientation = W; Deg = 180
    East: Orientation = E; Deg = 180
    North: Orientation = N; Deg = 72
    South: Orientation = S; Deg = 66
  Temporal:
    Begin (YYYYMMDD): 19811001
    End (YYYYMMDD): 19891231

Notes
1 Documentation for OCLC SiteSearch is available from Victoria Mueller, OCLC Product Support Specialist; [email protected].
2 Fred: The SGML Grammar Builder is available at: http://www.oclc.org/fred/
3 The SCML template that creates an FGDC record is available at: http://www.oclc.org:5046/~emiller/scml/fgdc.scml. The HTML interface created by this template is available at: http://purl.oclc.org/net/spectrum/screen=fgdc.scml

References
AltaVista. http://www.altavista.com
Courtois, Martin P., William M. Baer, and Marcella Stark. 1995. "Cool Tools for Searching the Web: A Performance Evaluation." Online (19) 15-32.
Excite Netsearch. http://www.excite.com
Federal Geographic Data Committee. 1994. Content Standards for Digital Geospatial Metadata. Washington: Federal Geographic Data Committee.
Guenther, Rebecca. 1995. Mapping the Dublin Core Metadata Elements to USMARC. Library of Congress. MARBI discussion paper No. 86.
Sperberg-McQueen, C. M. and Lou Burnard, eds. 1994. Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford: Text Encoding Initiative.
USMARC Format for Bibliographic Data. Network Development and MARC Standards Office. 1994. Washington: Library of Congress Cataloging Distribution Service.
Vizine-Goetz, Diane, Jean Godby, and Mark Bendig. 1995. "Spectrum: A Web-based Tool for Describing Electronic Resources." Computer Networks and ISDN Systems 27:985-1001.
Weibel, Stuart, Jean Godby, Eric Miller, and Ron Daniel. 1995a. The OCLC/NCSA Metadata Workshop Report, available at: http://www.oclc.org:5047/oclc/research/conferences/metadata/dublin_core_report.html
Weibel, Stuart, Eric Miller, Jean Godby, and Ralph Levan. 1995b. "An Architecture for Scholarly Publishing on the World Wide Web." Computer Networks and ISDN Systems 28:239-245.
Yahoo! http://www.yahoo.com

MR. DUI'S TOPIC FINDER

Mark W. Bendig, Consulting Systems Analyst

Abstract
OCLC's Office of Research is engaged in several projects with the objective of leveraging the Dewey Decimal Classification (DDC) as a tool to facilitate end-user searching of databases of Internet resources. To experiment with different approaches to DDC-assisted searching, we created a prototype browser called Mr. Dui's Topic Finder. A further objective of the Topic Finder is to experiment with the storage and display of non-English versions of the DDC.

The Office of Research has several projects under way that focus on leveraging the Dewey Decimal Classification (DDC) as a tool to facilitate end-user searching of databases of Internet resources. One such database is


OCLC NetFirst, a database whose records have been classified using the DDC. Each NetFirst record has been assigned one or more DDC numbers representing its topic(s). These assigned numbers have been indexed to allow the database to be searched by DDC number.

To experiment with different approaches to DDC-assisted searching, we have created a prototype browser called "Mr. Dui's Topic Finder". (Melvil Dewey was a strong advocate of spelling reform, and for several years spelled his name "Dui" to demonstrate his commitment to this goal.) The Topic Finder, which is implemented as a World Wide Web application, allows the user to browse the first three levels of the DDC hierarchy in search of a topic of interest. Once the topic is located, the user can initiate a search of the NetFirst database which will retrieve all records assigned the DDC number corresponding with that topic. (In this prototype, database access is simulated.)

A further objective of the Topic Finder is to experiment with the storage and display of non-English versions of the DDC. The initial prototype can display the DDC hierarchy in Spanish, French, and Russian in addition to English. The Russian display presented special challenges in that it requires the user's Web browser to be configured with a Cyrillic font of the proper encoding in order for the hierarchy entries to display correctly.

The Topic Finder prototype does not actually search the NetFirst database. Instead, one of a collection of ten "canned" results lists is presented to the user following the activation of the search function. These results lists are based on real results of real searches conducted against the NetFirst database via the OCLC FirstSearch service. Choosing an entry from the returned list results in a display of the associated record from the NetFirst database. An actual record retrieval is not performed; the records which occur in the ten canned results lists have been downloaded from NetFirst in advance. Any URLs occurring in these records are "live." Clicking a URL (i.e., a WWW link) appearing in the record's Location field takes the user directly to the Internet resource described in that record.

Implementation
The user interface for the Topic Finder is based on the concept of frames as implemented in Netscape 2.0. As a result, the site can only be properly accessed with Netscape 2.0 (or later version) or with a compatible frame-capable browser, such as Microsoft's Internet Explorer. This was not considered to be a serious liability inasmuch as the required software may be obtained for free from a plethora of download sites on the Web.

The Topic Finder is implemented as a collection of static HTML files, organized as shown in fig. 1. The files are divided into several groups:
• The Topic Finder Home Page and several pages linked directly to it
• Files displaying the DDC, 10 entries per file (111 files for each language)
• Files containing "canned" Search Results (10 files)
• Files containing "canned" Records (100 files)

Fig. 1 System Diagram

Using the Topic Finder
The Home Page provides a choice of languages in the form of a set of clickable links, as can be seen in fig. 2. The currently supported languages are English, Spanish, French, and Russian. Selecting one of these links causes the Main Page to be displayed (fig. 3). All subsequent use of the Topic Finder takes place within this page. All Browsing Menus, Search Results, and Records are displayed here, each in sight at all times.


Upon initial access, the Main Page displays a Browsing Menu containing the ten Main Classes of the DDC in the specified language. Placeholder text appears where the Search Results and Records will later be displayed (right side of screen). Each entry in the top-level Browsing Menu (e.g., 000, 100, 200, etc.) is itself a hypertext link. Clicking on one of these links replaces the current Browsing Menu with a new one containing subdivisions of the chosen Class. For example, clicking on the "100" link (Philosophy and Psychology) would display a Browsing Menu containing entries for 100, 110, 120, etc. Of course, these entries are also hypertext links, and clicking on one of them (say 170, for Ethics) displays another Browsing Menu containing a final level of subdivisions (in this case, 170, 171, 172, etc.). Provision is also made for the user to go back "up" the hierarchy at any time.

Fig. 2 Home Page (screenshot of the Topic Finder Home Page: welcome text, the four language links, and a note that the Topic Finder requires Netscape 2.0 or later)

The Topic Finder does not allow further topical subdivision beyond the third level, although the DDC itself does include several additional levels in most topic areas. If the user clicks on one of the links in a third-level Browsing Menu (say 172, for Political Ethics), the Topic Finder would issue a search against the NetFirst database for all records which have been assigned that DDC number or any of its subdivisions. In the prototype system, an actual search is not performed; instead, a preloaded, canned search results list is presented. There are ten such preloaded search results lists, one for each of the ten Main Classes. Each search results list contains ten hits. These hits are derived from real NetFirst search results and correspond to actual records in the NetFirst database.

The search results are displayed in the frame in the upper right portion of the browser window. Entries in the search results list appear as hypertext links. Clicking on one of these links will cause the corresponding NetFirst record to be displayed in the frame immediately below the search results. The 100 NetFirst records specified in the preloaded search results lists (ten lists specifying ten records each) were themselves preloaded, each in its own file. When a search results list entry is clicked on, the appropriate record file is accessed and displayed.

Fig. 3 Main Page (screenshot: a Browsing Menu for 170 Ethics with entries 170 through 179, a search results list, and a record display for the mailing list "EJAP The Electronic Journal of Analytic Philosophy")

NetFirst records typically contain URLs corresponding to the described Internet resources and, on occasion, related sites or documents. These URLs appear in the Topic Finder's record display frame as clickable hypertext links. Because the Topic Finder records are in fact "real" NetFirst records, clicking one of these links will display the contents of the associated URL. (The target document is displayed using the full browser window. Returning to the Topic Finder re-establishes the frame-based appearance of the Main Page.)
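The three-level browsing scheme and the simulated search can be sketched as below. This is a minimal Python illustration under stated assumptions: the function names and the results file-naming pattern are hypothetical (the prototype consists of static HTML files whose names are not given here), and only the arithmetic on DDC numbers follows the description above.

```python
def submenu(ddc: int, level: int) -> list[str]:
    """Entries displayed after clicking `ddc` in a Browsing Menu.

    level 1: a Main Class (e.g. 100) expands to 100, 110, ..., 190
    level 2: a second-level entry (e.g. 170) expands to 170, 171, ..., 179
    """
    if level == 1:
        base, step = ddc - ddc % 100, 10
    elif level == 2:
        base, step = ddc - ddc % 10, 1
    else:
        raise ValueError("third-level entries start a search, not a menu")
    return [f"{base + i * step:03d}" for i in range(10)]


def canned_results_file(ddc: int) -> str:
    """Pick the preloaded results list for a third-level entry.

    There is one list per Main Class, so only the hundreds digit matters.
    The file name pattern is a hypothetical choice for this sketch.
    """
    return f"results_{ddc // 100}00.html"


# Menu files per language: 1 top-level + 10 second-level + 100 third-level.
MENU_FILES_PER_LANGUAGE = 1 + 10 + 100

print(submenu(100, 1))            # subdivisions of Philosophy and Psychology
print(submenu(170, 2))            # subdivisions of Ethics
print(canned_results_file(172))   # simulated search for Political Ethics
print(MENU_FILES_PER_LANGUAGE)
```

The level has to be passed explicitly because an entry such as 100 appears at every level of the hierarchy and expands differently at each one.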


Creating the Topic Finder files
As can be seen from the System Diagram in fig. 1, the Topic Finder requires a great many files. For example, the Browsing Menus for any one of the supported languages are contained in 111 files: one top-level file, 10 second-level files, and 100 third-level files. In addition to these files, 10 files contain preloaded search results lists and 100 files contain NetFirst records. The sheer number of files imposed the requirement that most of them be created automatically. Automatic file creation also facilitates modifications in the wording or presentation of the DDC captions found in the Browsing Menu entries. Such changes can be conveniently made in a single source file, leaving the file creation program to create the necessary Topic Finder files containing the modified data.

During the development of the prototype, three file creation programs were written:
1. FileMaker, which reads a 1000-line file containing the DDC summary (000 through 999) and creates the 111 Browsing Menu files.
2. ListMaker, which reads files containing NetFirst search results and converts them to the form necessary for display by the Topic Finder.
3. RecMaker, which reads files containing NetFirst records and converts them to the form necessary for display by the Topic Finder.

Each of these programs was written in Visual Basic. In each case, the files created are in standard HTML format. The 111 Browsing Menu files are given specially coded names and are interlinked in such a way as to permit browsing up and down the DDC hierarchy as previously described. The main screen for FileMaker, the most complex of the three programs, is shown in fig. 4.

Fig. 4 FileMaker Main Window (controls shown: Input Language: English (ENG), Spanish (SPA), French, Russian (RUS), Auto (determine from file name); Input Format: SM3 (as used in DFW), VG1 (4 elements, space delimited), JM1 (Russian transliteration), JM2 (Russian encoded), Auto (determine from file name); Output Format: 111 files, 3 files; Input File; Go; Exit)

Using non-Roman character sets
The inclusion of Spanish and French versions of the Topic Finder's Browsing Menus was relatively straightforward in that the special characters and diacritics required for these languages are defined within the standard ASCII character set and appear in all standard fonts used by Netscape and other browsers. The Russian version, by contrast, requires that a Cyrillic font be installed on the user's system to display the Browsing Menus correctly. A further complication is that there are a number of standard character mappings for Cyrillic text in current use (e.g., KOI8-R for the Internet, CP1251 for Windows, CP866 for DOS, etc.). These character mappings determine which codes correspond to which characters, just as ASCII does for Roman-character text. Separate Cyrillic fonts are available for each character mapping. The font installed in the user's machine must match the character mapping used to generate the Browsing Menus.

Future Directions
As always, there are various ways in which the prototype could be improved or enhanced. One important change would be to have the system do actual live searches and retrievals against the NetFirst database, rather than relying on preloaded responses to simulate database access. This would allow us to analyze the ability of the chosen browsing approach to satisfy real information needs of real users. Another possible improvement would be to implement the Browsing Menus using CGI scripts on the server side to dynamically create each menu as needed, based on a single source file. Alternatively, JavaScript could be used to implement all Browsing Menus and their interconnections on the client side (i.e., within the browser). Under this scenario, no contact would be made with the server until a search or record retrieval (or [...] bandwidth requirements.

Beyond the technical implementation improvements described above, it should be possible to use the completed prototype to experiment with different approaches to the DDC itself as a front-end topic selector. DDC captions that have been revised to reflect end user language could be displayed instead of the original summaries. Some topics that appear at a relatively low level in the DDC (e.g., Computer Networks) could be "promoted" to a higher level, reflecting the presence of many resources on that topic in the target database. It may be desirable to have more than ten subtopics listed in some Browsing Menus. But how many are too many? These issues could be addressed by running usability tests on various versions of the prototype.

In keeping with the DDC's status as an internationally recognized classification system, additional languages could be included in the Topic Finder. Although the


search results and retrieved records would appear in the language of the target database, the user interface components of the Topic Finder (i.e., the Browsing Menus, user prompts, Home Page text, etc.) should all appear in the language chosen by the user. (In the prototype, only the Browsing Menus themselves appear in different languages.) Inclusion of other languages using non-Roman character sets, such as Arabic, would allow experimentation with processing and display techniques appropriate for those languages.

USE OF THE OCLC PURL SERVICE

Keith E. Shafer, Senior Research Scientist

Abstract
OCLC has been running the OCLC PURL Service since the beginning of January 1996. During this past year, we have seen increased use of the OCLC PURL Service and adoption of PURLs outside OCLC. This article presents an overview of OCLC PURL Service use through mid-October 1996.

A PURL is a Persistent Uniform Resource Locator. Functionally, a PURL is a URL. However, instead of pointing directly to the location of an Internet resource, a PURL points to an intermediate resolution service. The PURL resolution service associates the PURL with the actual URL and returns that URL to the client. The client can then complete the URL transaction in the normal fashion. In Web parlance, this is a standard HTTP redirect.

To What Are PURLs Assigned?
The OCLC PURL Service contains over 9,000 PURLs. While we cannot easily analyze the content of the resources pointed to by PURLs, we can determine where they point. Figure 1 displays the most pointed-to top-level DNS domains; that is, it shows the distribution of sites to which the OCLC PURL Service would redirect clients. Currently 55% of the PURLs in the OCLC PURL Service point to educational domains.

Fig. 1 Most Pointed to Top-level DNS Domains: Education (edu) 55%; Government (gov) 12%; Commercial (com) 9%; Organizations (org) 8%; Other 6%; Misc (net) 4%; Canada (ca) 2%; United States (us) 1%; United Kingdom (uk) 1%; Italy (it) 1%; Spain (es) 1%

Registered Users
As reported in the Annual Review of OCLC Research 1995, while anyone can resolve PURLs in the OCLC PURL Service, only registered users can freely create and maintain the PURLs in the OCLC PURL Service. The Service has over 280 registered users. The distribution of registered users is presented in fig. 2. It is interesting to note that the distribution of registered users does not match the distribution of pointed-to domains in fig. 1.

Fig. 2 Registered User Distribution by Top-level DNS Domains: Education (edu) 24%; Commercial (com) 21%; Organizations (org) 16%; Other 9%; Misc (net) 8%; Australia (au) 4%; Canada (ca) 4%; Spain (es) 4%; Government (gov) 3%; Netherlands (nl) 2%; United Kingdom (uk) 2%; United States (us) 2%; France (fr) 1%

PURL Resolution Requests
The PURLs in the OCLC PURL Service have already been resolved over 660,000 times by over 42,000 unique clients. While most PURLs point to educational domains (fig. 1), more PURL resolution requests come from commercial domains than from educational domains (fig. 3). Note that we removed OCLC access statistics from the organizations domain in fig. 3. Had we not, the


organizations domain would have dominated the chart because OCLC is by far the heaviest user of the Service.

Fig. 3 PURL Resolution Requests by Top-level DNS Domains: Commercial (com) 27%; Education (edu) 19%; Misc (net) 19%; Other 12%; Unresolved 5%; Canada (ca) 3%; Germany (de) 3%; Australia (au) 3%; United Kingdom (uk) 3%; Government (gov) 3%; Organizations (org) 1%; Netherlands (nl) 1%; United States (us) 1%

Other PURL Resolvers
Although a PURL service is being run and maintained at OCLC, the PURL model lends itself to distribution across the net. PURL servers can easily be run by organizations with a commitment to maintaining persistent naming schemes such as libraries, government organizations, and publishers. Accordingly, OCLC now freely distributes its PURL source code to aid in rapid, wide adoption of this enabling technology. Since mid-March 1996, the OCLC PURL software has been downloaded by more than 270 institutions.

Fig. 4 PURL Software Downloads by Top-level DNS Domains: Commercial (com) 20%; Education (edu) 17%; Other 14%; Misc (net) 8%; Government (gov) 6%; South Korea (kr) 6%; Local Access 5%; Germany (de) 4%; Australia (au) 4%; Canada (ca) 4%; France (fr) 4%; United Kingdom (uk) 4%; Unknown 2%; Organizations (org) 2%; Sweden (se) 1%

For further PURL information, please see the OCLC PURL Service and documentation at: www.url.oclc.org.

PURL contributors: Stuart L. Weibel, Vincent M. Tkac, Jonathan R. Fausey, Eric J. Miller, Roger Thompson, and Erik Jul.

VISUALIZING SPATIAL RELATIONSHIPS BETWEEN INTERNET OBJECTS

Eric J. Miller, Associate Research Scientist

Abstract
This paper focuses on several projects investigating the applicability of conceptual frameworks of spatial relationships developed by cartographers interested in mapping physical spaces to the ever-changing information landscapes of the Internet.

"Maps! Maps! We don't need no stinking maps!"
B. Miller, lost somewhere in Arizona, 1987

Introduction
The Internet is a myriad of complex, dynamic, multi-faceted linkages. This complexity makes it extremely difficult to determine overall structures and/or relationships between Internet objects. Geographical maps traditionally have provided a practical way to travel in our physical world. From handwritten maps to satellite imaging, geography has played a major role in the analysis and in the expansion of various activities. Maps have become more than practical records of locations; they have provided us valuable concepts regarding perception and location of space [Gould and White 1986]. People are easily able to explore cities and navigate through countries they have never visited before thanks to geographic and cartographic representations. Despite the fact that a map distorts reality to portray meaningful relationships [Monmonier 1991], for a complex world, it is an essential component to our understanding of various physical landscapes. Maps provide a picture of the world to help us understand spatial patterns, relationships, and complexities of the environment in which we live [Robinson et al. 1995]. This paper focuses on several projects investigating the applicability of these conceptual frameworks of spatial relationships to the dynamically changing information landscapes of the Internet.

Representations of Space: Absolute and Relative
Effective spatial representations must take the intrinsic properties of space into account and these representations can be viewed in many ways [Peuquet 1994]. Views of space can be classified into what have historically been termed absolute and relative [Hawking 1988]. Peuquet [1994] defines absolute space as objective; spatial metrics

are fixed and corresponding attributes are measured. This view assumes an immutable structure that is rigid, purely geometric, and that serves as the backcloth upon which objects are draped. Relative space is defined as subjective; attributes are fixed and space and time are measured. This view assumes a flexible structure that is more topological in nature, defined in terms of relationships among objects.

Visualizing Relative Relationships

To analyze complex patterns of interaction produced by humans on the Earth, geographers increasingly looked at questions of relative location: questions that consider places not in terms of their absolute latitude-longitude, but in terms of their cost and times to all other places with which they might exchange goods, money, people, and messages [Gould and White 1986]. This relative representation of interconnectivity is in many ways directly applicable to the Internet. The dominant paradigm for World Wide Web navigation is through following links in hypertext browsers.

A visual representation of relations between networked objects can communicate large amounts of information to the human visual system. If relational information can be identified between Internet objects and an emergent topology of the World Wide Web can be found, various projections can be built in a dimension appropriate for visualization. The following sections identify several current projects attempting to visualize these relational interactions.

The Geometry Center

The Geometry Center at the University of Minnesota (available at: http://www.geom.umn.edu/) specializes in the use of technologies to visualize and communicate mathematics and related sciences. One project that is currently active at the Center is in the integration of three-dimensional graphics and the Web. The WebOOGL (Web Object-Oriented Graphics Library, available at: http://www.geom.umn.edu/docs/research/webviz/) is an attempt to use three-dimensional graphic primitives to represent objects and corresponding links as defined in a three-dimensional hyperbolic space.

Apple's HotSauce

HotSauce (available at: http://hotsauce.apple.com) is a "3D information navigation system that allows users to effortlessly explore Internet or Intranet websites and desktop content." Relationships between Internet resources are defined both by explicit relationships (e.g., URLs) and through relationships based on similar content (e.g., the same author). While HTML (HyperText Markup Language) provides a syntax for the layout of Web documents, few provisions describe content-based information. This lack of descriptive, content-based information is the major focus of several of the "metadata" discussions currently active on the Internet, including the W3C's PICS (Platform for Internet Content Selection) Initiative (available at: http://www.w3.org) and the Dublin Core/Warwick Framework initiatives (available at: http://purl.oclc.org/metadata/dublin_core).

Another descriptive content format, Apple's Meta Content Format, is the underlying framework of the HotSauce interface. The MCF is "an open standard (file) format used to represent a wide range of information about content." The MCF can be used to describe various resources including Web documents, gopher and FTP files, E-mail, and structured databases, making their content available through a variety of views.

HyperSpace

HyperSpace (available at: http://www.cs.bham.ac.uk/~amw/hyperspace/) is an attempt at displaying organizational areas of the Web based on topological information. It arranges the information according to a user-defined structure in an attempt to relate topical areas by spatial proximity.

In HyperSpace, each page on the Web is represented as a sphere and corresponding hyperlinks are represented as lines between the spheres. These spheres and lines are randomly placed into a three-dimensional lattice. The chaotic and unstructured mesh of nodes and links is then allowed to self-organize according to some imposed physics within the reality. Nodes repel each other, links (as lines) provide an attractive force and thus provide a three-dimensional environment that shows clusters of highly interrelated documents.

Viewing Networked Resources

Another example of similar work can be found at the Viewing Networked Resources home page (available at: http://purl.oclc.org/net/graph). Similar to the HyperSpace research, this work also attempts to visualize relationships between resources based on URLs with similar self-organizing concepts. Unlike the HyperSpace research, however, this work has reduced the visual interface to two dimensions. This reduction of space, while perhaps not as aesthetically pleasing, provides near real-time graphical visualization of resources and their corresponding linkages in a Java application. Additionally, this work allows any user to seed the process with any URL and show the graph up to five links away.

Visualizing Absolute Relationships

Absolute spatial views, the ability to measure corresponding attributes upon fixed metrics, provide additional insights toward the relationships between networked objects. The following sections identify several current projects attempting to visualize these relational interactions based upon an absolute spatial framework.
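As an aside on the relative views described above for HyperSpace and Viewing Networked Resources, the two mechanisms they mention (expanding a graph from a seed URL up to a fixed number of links, then letting nodes repel while links attract) can be illustrated with a minimal sketch. This is not the code of either project: the function names, the force constants, the capped step size, and the sample link table are all assumptions made for illustration, and the layout is two-dimensional, as in Viewing Networked Resources.

```python
import random
from collections import deque


def pages_within(links, seed_url, max_links=5):
    """Breadth-first walk: map each page reachable from seed_url to its link distance."""
    distance = {seed_url: 0}
    queue = deque([seed_url])
    while queue:
        page = queue.popleft()
        if distance[page] == max_links:
            continue  # do not expand pages already at the link limit
        for target in links.get(page, ()):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance


def self_organize(nodes, edges, steps=200, repulsion=0.05, attraction=0.1, rng_seed=0):
    """Toy 2D layout (assumed constants): every pair of nodes repels, linked nodes attract."""
    rng = random.Random(rng_seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}  # random starting positions
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        for a in nodes:  # pairwise repulsion, weaker at larger distances
            for b in nodes:
                if a == b:
                    continue
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d2 = dx * dx + dy * dy + 1e-9
                force[a][0] += repulsion * dx / d2
                force[a][1] += repulsion * dy / d2
        for a, b in edges:  # attraction along each link, like a spring
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += attraction * dx
            force[a][1] += attraction * dy
            force[b][0] -= attraction * dx
            force[b][1] -= attraction * dy
        for n in nodes:  # move each node by a small, capped step
            pos[n][0] += max(-0.1, min(0.1, force[n][0]))
            pos[n][1] += max(-0.1, min(0.1, force[n][1]))
    return pos


# Hypothetical link table, invented for illustration only.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["e"]}
reachable = pages_within(LINKS, "a", max_links=5)
edges = [(p, t) for p in reachable for t in LINKS.get(p, ()) if t in reachable]
layout = self_organize(list(reachable), edges)
```

In this sketch, pages that link to one another settle close together while unlinked pages drift apart, which is the clustering effect the HyperSpace description attributes to its imposed physics.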


The SPIRE Project

The SPIRE (Spatial Paradigm for Information Retrieval and Exploration) project (available at: http://multimedia.pnl.gov:2080/showcase/?it_content/spire.node) from the Pacific Northwest National Laboratory, Information Technology Department, is a suite of software that allows users to explore complex relationships between text documents. Relationships between documents are defined based on word similarities and themes in text. Visual representations of these relationships are organized into visual/interactive maps that allow the user to explore and discover relationships between text documents. Two technologies within SPIRE, "Galaxies" and "Themescapes," are used to convey these relationships.

Galaxies computes word similarities and patterns in documents and then displays the documents on a computer screen to look like a universe of "docustars." Closely related documents cluster together in a tight group, while unrelated documents are separated by large spaces.

In Themescape, themes within the document spaces appear on the computer screen as a relief map of natural terrain. The mountains indicate where themes are dominant; valleys indicate weak themes. Their shapes (a broad butte or high pinnacle) reflect how the thematic information is distributed and related across documents. Themes close in content will be close visually, based on the many relationships within the text spaces.

Cyberspace Geography Visualization

Luc Girardin from the Graduate Institute of International Studies in Switzerland (available at: http://heiwww.unige.ch/girardin/cgv/) is also involved in mapping resources in an absolute framework. In this research, cartographic representation based on Kohonen neural networks [Kohonen 1995] and a distance matrix method have been used to produce two-dimensional maps with a distance-representing relief structure for regions of the World Wide Web. Metrics for this space are defined by the maximum distances (or dissimilarity) between any two resources. This distance metric is defined by the length of the shortest path between them. This metric makes use of the inherent knowledge contained in the hypertext and hypermedia structure.

A topologically organized map was deemed insufficient for the visualization of this information. The problem resided in the fact that only the order of distances among elements was conserved and, thus, better methods of visualizing the structure of the distances were required. In this research, the unified matrix method [Ultsch and Siemon 1989] was used to determine the weighting vector as input to the Kohonen neural networks [Kohonen 1995] for the reorganization of the map. As in geographical maps, this representation provides a way to differentiate the distances between locations. In this case "mountains" and "ravines" represent the property of increased "travels."

Relative to Absolute Visualization

An additional area of interest regarding the visualization of networked objects lies in the mapping back of relative topological information on an absolute framework. Much of the Internet traffic can be represented by networks with nodes corresponding to objects and links representing relationships among the objects. The relationships represent raw physical measurements, such as the number of packets sent between routers; computed aggregates, such as the mean link utilization; or abstract quantities, such as the probability that two items are purchased on a single grocery receipt. The relationships are directed, undirected, time varying, or static [Eick 1996]. The following sections identify several current projects attempting to visualize these topological interactions and map them in an absolute, global coordinate system.

Planet Multicast: Visualization of the Mbone

The Mbone is the Internet's multicast backbone. Multicast is the most efficient way of distributing data from one sender to multiple receivers with minimal packet duplication. Developed and initially deployed by researchers within the Internet community, the Mbone has been extremely popular for efficient transmission across the Internet of real-time video and audio streams such as conferences, meetings, congressional sessions, and NASA shuttle launches. The Mbone, like the Internet itself, grew exponentially with no central authority. The resulting suboptimal topology is of growing concern to network providers and the multicast research community. The visualization of this network on a global backdrop is an important function for understanding this topology.

[Fig. 1: caption text ending in "Global Coordinate System"]


The Planet Multicast Visualization research (available at: http://www.nlanr.net/Viz/Mbone/) is attempting to provide geographic representations of the Mbone traffic as arcs on a globe by resolving the latitude and longitude of Mbone routers. A snapshot of this process is shown in figs. 1 and 2.

Fig. 2 Global Mbone Topological Traffic Mapped Back to a Global Coordinate System

Network Visualization

General three-dimensional global network displays are often confusing and difficult to navigate. An open area of research is whether mapping the display to a sphere captures many of the advantages of a general 3D network layout while simultaneously helping the user maintain context. Network Visualization (available at: http://www.computer.org/pubs/cg&a/report/g20069.htm) attempts to address these issues. Additionally, this research begins to address issues of user comprehension and analysis regarding the enormous amounts of visual data. Node and link displays effectively visualize small, sparse networks. In visualizing larger networks, Eick [1996] identifies three general problems:

• Display clutter. Displays are easily overwhelmed and become cluttered and visually confusing when they display too much information.
• Node positioning. Interpretation of the display depends heavily on the node positions. The same network drawn with different node-positioning algorithms often leads to quite different interpretations of the data.
• Perceptual tension. Viewers interpret closely positioned nodes as related; conversely, they perceive distantly positioned nodes as unrelated. Yet lines connecting distant nodes dominate visually because they cover proportionally more screen real estate. The most effective network displays exploit this perceptual tension where possible.

To address these concerns, Eick provides tools for filtering, analyzing, thresholding, opacity, etc., to allow the user not only to see the global topological relationships of global networks, but to begin to analyze them as well.

The data set contains the packet counts, by two-hour period, transmitted between each pair of countries. Each country is represented by a box-shaped glyph scaled and colored to encode the total packet count for all links emanating from the country. The glyphs are positioned at the countries' capitals and extend perpendicular to the surface of the globe. The color-coded arcs between the countries show the inter-country traffic, with the higher and redder arcs indicating the larger traffic flows. The globe is illuminated by a light positioned to indicate, via the angle of the sun, the time for the frame of the time-series data displayed.

Conclusions

One of the most common misconceptions about discovery and retrieval in a distributed environment is that it is a single event instead of a complex series of iterative events. The complexity of the Internet makes it extremely difficult to determine overall structures and/or relationships between networked resources and thus makes navigation in this environment extremely difficult. Maps help us understand spatial patterns, relationships, and complexities of environments and thus provide powerful mechanisms for facilitating navigation through this new medium. The creation of these maps and the accompanying visualization is challenging but increasingly vital to managing and understanding these relationships between networked information.


Note

Figures 1 and 2 are based on work partially sponsored by the National Science Foundation under NSF Cooperative Agreement NCR-9415666 and the NSF Graduate Research Fellowship Program. Used with permission.

References

Eick, S. G. 1996. "Aspects of Network Visualization. Special Report: Computer Graphics and Visualization in the Global Information Infrastructure." Published online in IEEE Computer Graphics and Applications, 16:2.
Gould, P. and R. White. 1986. Mental Maps, Second edition. Boston: Allen & Unwin.
Hawking, S. 1988. A Brief History of Time. New York: Bantam Books.
Kohonen, T. 1995. Self-Organizing Maps. Berlin; Heidelberg; New York: Springer.
Monmonier, M. 1991. How to Lie with Maps. Chicago, London: The University of Chicago Press.
Peuquet, D. 1994. "It's About Time: A Conceptual Framework for the Representation of Temporal Dynamics in Geographic Information Systems." Annals of the Association of American Geographers, 84(3):441.
Robinson, A. H. et al. 1995. Elements of Cartography, Sixth Edition. New York: John Wiley & Sons, Inc.
Ultsch, A. and H. P. Siemon. 1989. Exploratory Data Analysis: Using Kohonen Networks on Transputers.

3 EXTERNAL AND COLLABORATIVE RESEARCH

OCLC is committed to the proposition that collaborative ventures, in all areas of its mission, are well worth the effort. The OCLC Office of Research has found this to be true many times over the years. The current collaborative works reported here add to these proofs a large exclamation point. First is the report of the effort to rebuild the collection and catalog of the Bosnian National and University Library. The next report discusses work on the feasibility of generating a subject validation file based on LC headings. Then, the Monticello project involves staff from OCLC and the Southeastern Library Network in designing a virtual library. The final report by Lorcan Dempsey of the United Kingdom and Stuart Weibel focuses on the second in a series of workshops that OCLC is cosponsoring with a number of partners about the issue of metadata in the World Wide Web arena. The global impact these workshops are having on the various concerned communities should not be underestimated.

These reports reflect OCLC's reasons for reaching out to individuals and organizations around the world to undertake collaborative research. Positive results, such as those reported here, add to the wealth of knowledge and capabilities of millions of people around the world. An end worth pursuing, without a doubt.

THE BOSNIAN NATIONAL LIBRARY
BUILDING A VIRTUAL COLLECTION

Edward T. O'Neill, Consulting Research Scientist, Jeffrey A. Young, Consulting Systems Analyst, and Robert Bremer, Database Specialist

Abstract

Artillery attacks during the siege of Sarajevo in August 1992 destroyed the collection and catalog of the Bosnian National and University Library. The effort to rebuild the library depends, at least in part, on the reconstruction of its catalog. We examined WorldCat, the OCLC Online Union Catalog, to determine the criteria that would be most effective in selecting records appropriate for rebuilding a Bosnian catalog. For the first cut extraction, we applied the following seven criteria to WorldCat: (1) works of Bosnian authorship, (2) works in Serbo-Croatian languages, (3) works about Bosnia or Yugoslavia, (4) works classified by LC or Dewey as Bosnian, (5) titles with Bosnian keywords, (6) works published in Bosnia, and (7) works with Bosnian-related subjects. Examination of the results suggests that adding selected geographic subject headings and personal name subject headings will result in better recall.

The artillery attack on the National and University Library of Bosnia and Herzegovina in August 1992 resulted in the massive destruction of the library and the culture it represented. When the four days of bombardment ended on August 27, all that remained were burned ruins. The flames "engulfed almost 50,000 feet of wooden bookshelves" [Perlez 1996, A4], destroying both the collection and the catalog. "A burnt-out skeleton was all that was left of this once-beautiful and renowned cultural institution; fire had raced through most of its three million volumes" [Lorkovic 1992, 736, 816]. Zeco provides a detailed account of the library from an insider's perspective [Zeco 1996, 294-301].

Recovery Effort

The library community was horrified by the level of destruction, and librarians throughout the world have generously offered to assist in the rebuilding effort. The recovery has several distinct aspects: reconstruction of the building, recreation of the catalogs, and rebuilding of the collections. There are different views on whether the library should be restored to its former splendor or "left in its distressed state as a memorial to the three-and-a-half-year siege of Sarajevo that led to the near-total destruction" of the library [Kniffel 1996, 17]. The Sabre Foundation is among those coordinating efforts to restock libraries in Bosnia, and several American universities are donating materials [Mazmanian 1996, 14]. Harvard University Library and Harvard University Press announced that they will play a leading role in this effort to rebuild the collection [Kniffel 1996, 21].

Rebuilding the Catalog

Efforts to rebuild the collections are hampered by the destruction of the catalog: no complete record of the library's holdings remains. The effort to rebuild the library generated strong support and offers of assistance from the international library community. The first step is to generate a database of bibliographic information pertaining to Bosnia along with an indication of what institutions hold the material. Representatives from Harvard University Library, Yale University Library, and OCLC met in summer 1995 and agreed that OCLC should assume technical leadership in support of the project within the United States.

Bosniaca is defined as all documents in any format written in any language on or about the territory of Bosnia and Herzegovina. It also includes all items published within the territory of Bosnia. Initial efforts will concentrate on producing a subset of the OCLC database. In the future, it may be desirable to add records from international sources. The resulting database will eventually be made available to the National and University Library of Bosnia and Herzegovina and other interested parties.

Bosniaca records in WorldCat (the OCLC Online Union Catalog) will form the basis of the Bosniaca Catalog. Bosniaca records were identified by searching WorldCat for records meeting at least one of the following criteria:

1. Published by a Bosnian author. The record must contain either a 100 or 700 field for a Bosnian author identified through a comprehensive list of possible Bosnian authors. This list of 3,404 authors was obtained by scanning the personal names file for those authors who have published more in Serbo-Croatian than in any other languages except English. (As part of the OCLC Control service, a comprehensive file of the personal names and their attributes including language of publication is maintained. The file was used to identify authors who publish primarily in Serbo-Croatian.)
2. Published in Serbo-Croatian. A fixed field language code (field 008 / 35-37) of either scc or scr.
3. Published about Bosnia or Yugoslavia. The geographic area code (field 043) contains the code e-bn--- or e-yu---.
4. Classified as Bosnian (LC Class). Field 050 and 090 LC call number ranges DB231-250 (obsolete class for Bosnian history), DR1652-1785 (Bosnian history), or PG1400-1798 (Serbo-Croatian literature).
5. Classified as Bosnian (Dewey Class). Field 082 and 092 Dewey Decimal call numbers 914.9742 (Bosnia description/travel), 949.742 (Bosnian history), or 891.82 (Serbo-Croatian literature).
6. Keywords. The title (fields 245, 246, and 740) containing the words Bosanski, Bosansko-Hercegovacki, Bosne, Bosnia, or Bosnian.
7. Published in Bosnia. The place of publication (field 260 subfield a) is Banja Luka, Bihac, Bosnia, Mostar, Sarajevo, Travnik, Trebinje, Tuzla, or Zenica.
8. Bosnian subject. A subject heading (fields 600-651) contains the words Bosnia or Bosnian.

Many of the criteria overlap; many of the selected records matched on several criteria. The criteria were selected to provide high recall; precision is a secondary consideration. It is expected that many of the retrieved records will fall outside the definition of Bosnian materials. If precision becomes a problem, the nonrelevant materials can later be removed from the catalog. The extraction of Bosnian material will be done twice: the initial set of Bosnian records will be used to review and refine the criteria. The revised criteria will then be used to extract the records for the Bosnian catalog. The revised criteria will also be used by other libraries willing to contribute records to the catalog.

WorldCat Record Retrieval

WorldCat was initially searched in November 1996 for records matching the criteria. At the time of the search, WorldCat contained almost 36 million records; 103,983 matched one or more of the criteria. The percentage of records retrieved by each criterion is shown in fig. 1. The percentages sum to more than 100 since many of the records matched on multiple criteria.

Criterion               Percent of records
Subject                 15.6%
Place of Publication    5.1%
Title Keywords          11.4%
Dewey Class             11.0%
LC Class                (value illegible)
Geographic Area         27.8%
Bosnian Author          33.7%

Fig. 1 WorldCat Matches

The language code produced the greatest number of matches; 78,766 (75.7%) records have the language code for Serbo-Croatian. The next most productive criterion was authorship; 35,003 records were retrieved with a Bosnian author as either the main entry or as an added entry. The geographic area code generated 28,931 matches. The Library of Congress classification was the only other selection criterion which generated a large number of matches.

Characteristics of the Records Retrieved

For records with either an LC or Dewey classification, the class number was converted to one of 29 broad subject categories. The resulting subject distribution is shown in fig. 2. The set was dominated by Language and literature and History. Together, these two groups accounted for almost 55% of all retrieved records. Science, mathematics, and technology are underrepresented. These fields do not usually have a geographic dimension and therefore are less likely to be retrieved by the geographically oriented selection criteria.

Subject category              Percent of records
Physics                       0.3%
Astronomy                     0.1%
Mathematics and statistics    1.4%
Science and technology        0.4%
Performing arts               0.8%
Language and literature       (value not legible)
Art and photography           4.6%
Music                         1.8%
Education                     1.1%
Law                           2.4%
Political science             3.3%
Sociology                     3.2%
Business                      7.4%
Life sciences                 0.8%
Medicine and dentistry        4.8%
Geology                       0.3%
Recreation                    0.7%
Anthropology                  1.0%
Agriculture                   1.0%
Engineering                   1.1%
Geography                     1.3%
History                       (value not legible)
Religion                      2.8%
Home economics                0.2%
Military science              0.5%
Chemistry                     0.1%
Psychology                    0.2%
Philosophy                    0.8%
General works                 3.4%

Fig. 2 Subjects
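The selection rule described above (a record is kept if it meets at least one criterion, and the per-criterion percentages can therefore sum to more than 100) can be sketched in a few lines. This is only an illustration under stated assumptions, not the extraction software used in the project: records are represented as simplified dictionaries rather than real MARC records, the field names (`authors`, `language`, `geo_codes`, `lc_class`, `dewey`, `title`, `place`, `subjects`) are hypothetical, the language codes `scc`/`scr` and the `---` fill in the geographic area codes are a reading of the listed criteria, and the sample records and author list are invented.

```python
import re
from collections import Counter

# Values taken from the eight criteria listed in the text.
LC_RANGES = [("DB", 231, 250), ("DR", 1652, 1785), ("PG", 1400, 1798)]
DEWEY_PREFIXES = ("914.9742", "949.742", "891.82")
TITLE_WORDS = {"bosanski", "bosansko-hercegovacki", "bosne", "bosnia", "bosnian"}
PLACES = ("banja luka", "bihac", "bosnia", "mostar", "sarajevo",
          "travnik", "trebinje", "tuzla", "zenica")


def in_lc_range(call_number):
    """True if an LC call number such as 'DR1700.5' falls in one of the listed ranges."""
    m = re.match(r"([A-Z]+)\s*(\d+)", call_number or "")
    if not m:
        return False
    letters, number = m.group(1), int(m.group(2))
    return any(letters == cls and lo <= number <= hi for cls, lo, hi in LC_RANGES)


def matched_criteria(rec, bosnian_authors):
    """Return the names of all criteria that a simplified record satisfies."""
    words = set(re.findall(r"[a-z\-]+", rec.get("title", "").lower()))
    subjects = " ".join(rec.get("subjects", [])).lower()
    place = rec.get("place", "").lower()
    tests = {
        "author": any(a in bosnian_authors for a in rec.get("authors", [])),
        "language": rec.get("language") in {"scc", "scr"},
        "geographic area": any(c in {"e-bn---", "e-yu---"} for c in rec.get("geo_codes", [])),
        "lc class": in_lc_range(rec.get("lc_class")),
        "dewey class": (rec.get("dewey") or "").startswith(DEWEY_PREFIXES),
        "title keywords": bool(words & TITLE_WORDS),
        "place": any(p in place for p in PLACES),
        "subject": "bosnia" in subjects,  # also matches "bosnian"
    }
    return {name for name, hit in tests.items() if hit}


def tally(records, bosnian_authors):
    """Count selected records and the percentage of them matching each criterion."""
    counts, selected = Counter(), 0
    for rec in records:
        hits = matched_criteria(rec, bosnian_authors)
        if hits:  # "at least one of the following criteria"
            selected += 1
            counts.update(hits)
    percent = {name: 100.0 * n / selected for name, n in counts.items()} if selected else {}
    return selected, percent


# Invented sample data, for illustration only.
AUTHORS = {"Andric, Ivo"}
RECORDS = [
    {"authors": ["Andric, Ivo"], "language": "scr", "lc_class": "PG1418",
     "title": "Na Drini cuprija", "place": "Sarajevo"},
    {"language": "eng", "geo_codes": ["e-bn---"], "dewey": "949.742",
     "title": "Bosnia: a short history", "subjects": ["Bosnia and Hercegovina -- History"]},
    {"language": "eng", "lc_class": "GV1450", "title": "Chess openings"},
]
selected, percent = tally(RECORDS, AUTHORS)
```

Because each selected record is counted once for every criterion it satisfies, the values in `percent` add up to more than 100 whenever records match on several criteria, which is the effect noted for fig. 1.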


Serbo-Croatian is the dominant language, accounting for approximately 75% of the records. The distribution of languages other than Serbo-Croatian is shown in fig. 3. In this group, English was by far the most common. Almost 80 languages are included in the "other" group.

Language      Percent of records
Other         3.6%
Russian       0.7%
Italian       0.8%
Macedonian    1.1%
French        1.8%
German        2.3%
Slovenian     2.9%

Fig. 3 Common Languages

Ninety-two percent of the records were for books, 5% for serials, and the remaining 3% for all of the other formats. The publication dates for the books are shown in fig. 4. The figure shows a steady growth except during the decade of World War II. The drop in the 90s is due to both the war and the lag in libraries acquiring current publications.

Publication date    Percent of books
Pre-1900            2.9%
1900-1909           1.8%
1910-1919           2.5%
1920-1929           3.8%
1930-1939           4.3%
1940-1949           3.1%
1950-1959           8.3%
1960-1969           (value not legible)
1970-1979           (value not legible)
1980-1989           (value not legible)
1990-               (value not legible)

Fig. 4 Publication Dates for Books

Nearly 75% of the material retrieved was published in Yugoslavia or in what was formerly Yugoslavia. The country code bn has been used since 1992 and is present in only 628 WorldCat records. Materials published in Bosnia and Hercegovina prior to 1992 generally are identified with country code yu (Yugoslavia). For material published outside the former Yugoslavia, the regions of publication are shown in fig. 5. Almost equal numbers of items were published in North America and Western Europe, with most of the remainder coming from Eastern Europe.

Region                    Percent of records
Unknown                   (value not legible)
South America             0.3%
Middle East and Africa    0.3%
Asia and Oceania          0.7%
Eastern Europe            2.2%
North America             (value not legible)
Western Europe            (value not legible)

Fig. 5 Country of Publication

Enhancing the Search

The primary purpose of this initial retrieval is to test and evaluate the effectiveness of the retrieval criteria. It is expected that other libraries will also contribute to this rebuilding effort by using similar retrieval criteria to search their catalogs to identify materials which are not included in WorldCat. Searching for Bosnian authors is complex, requiring a Boolean OR of several thousand author names. To greatly simplify the search, it would be desirable to drop this criterion unless there is strong evidence that it retrieved a large number of records that would not have been retrieved otherwise.

The analysis of the retrieval consists of two phases: (1) the statistical analysis of the records retrieved and (2) the manual review of individual records to assess the precision and recall. The manual relevance assessment has not been started and its scope will depend, in part, on the availability of subject specialists. The statistical analysis requires far less time and effort but is expected to assist in refining the criteria.

The topical and geographic subject headings were examined to see if additional terms could be identified that would enhance the recall. The most frequently used topical subject headings found in the retrieval set are:

World War, 1939-1945
Serbo-Croatian language
Serbian literature
Yugoslav War, 1991
Croatian literature
World War, 1914-1918
Communism
English language
Serbian poetry
Yugoslav literature
Serbs
Eastern question (Balkan)
Chess

As might be expected, many of the headings are generic without any connection to Bosnia. Some headings such as Chess have little regional meaning while other
headings such as Communism may have strong ties to the region but are still too general to use for retrieval. After reviewing the common subject headings used, several additional keywords were identified. Adding Yugoslavia, Serbo-Croatian, Balkan, Adriatic, Yugoslavs, Slavs, and Slavic to Bosnia and Bosnian, the two keywords used originally, would significantly improve the recall.

Prior to the breakup, Yugoslavia included six republics: Bosnia and Hercegovina, Slovenia, Croatia, Macedonia, Serbia, and Montenegro. Since 1992, Yugoslavia refers only to a federation of two republics: Serbia and Montenegro. As a result, it is difficult to distinguish between Bosnian materials and materials associated with one of the other five republics. The recall, which is the primary focus of this project, would be improved by adding the terms Serbs, Serbian, Croats, Croatian, Slovenes, and Slovenian. However, much of the additional material would relate to Yugoslavia in general rather than Bosnia in particular.

Personal name subject headings could also enhance the recall. The 12 most frequent personal name subject headings in the retrieved set were:

Tito, Josip Broz, 1892-1980
Mary, Blessed Virgin, Saint
Karadzic, Vuk Stefanovic, 1787-1864
Petar II, Prince Bishop of Montenegro, 1813-1851
Andric, Ivo, 1892-1975
Krleza, Miroslav, 1893-
Sava, Saint, 1169-1237
Mihailovic, Draza, 1893-1946
Marx, Karl, 1818-1883
Markovic, Svetozar, 1846-1875
Jesus Christ
Tesla, Nikola, 1856-1943

As can be observed, many of the frequent personal name subject headings would be useful in retrieving relevant material. Some frequently used headings, such as Jesus Christ, the Virgin Mary, and Karl Marx, rank high due to their high general use rather than any unique regional interest. Most of the others had strong regional ties.

A large number of relevant LC class numbers were not originally included in the criteria. Some lacked any regional specificity or included regional specificity only as part of the subject cuttering. However, many frequently occurring classes were identified that appeared relevant, including:

DR1214       History -- Balkan Peninsula -- General Works
HC407        Social Sciences -- Economic History and Conditions -- Europe -- Balkan States -- Yugoslavia
AS346        General Works -- Academies and Learned Societies -- Europe -- Turkey and the Balkan States -- Yugoslavia
DR301-396    History -- Balkan Peninsula -- Yugoslavia (Obsolete)

Adding additional Library of Congress class numbers would significantly enhance the recall. As with the subject headings, it will be difficult to differentiate between Yugoslav and Bosnian materials.

As expected, analysis of the retrieved records indicated that revising the selection criteria will improve the recall. Manual review of the retrieved records should result in further refinement. Since the number is relatively small (five times as many records would easily fit on a CD-ROM), the focus will remain on improving the recall. WorldCat will be searched again using the revised criteria to extract the records for the Bosnian catalog.

Adding Material from Other Sources

WorldCat contains one of the largest, if not the largest, single collections of Bosnian records in the world. It represents a large proportion of the Bosnian materials held by American libraries. However, American libraries, even collectively, would not have acquired everything held by the National and University Library of Bosnia and Herzegovina. It is expected that European libraries hold a large number of Bosnian materials that are not in WorldCat. The materials from European libraries and other libraries that are not OCLC members will need to be included in the effort to rebuild the catalog.

OCLC will accept machine-readable records from other libraries willing to participate in the rebuilding effort and add them to the collection of Bosnian materials found in WorldCat. OCLC has developed comprehensive record-matching algorithms [O'Neill 1990, 13-14] to merge bibliographic records from external sources with those from WorldCat. The software can determine reliably which of the new records duplicate existing records and which are for new bibliographic items. New bibliographic records and holdings symbols will be added to the database. Only the holding information from incoming matching records will be retained.

Future Plans

This project is in its early phase. OCLC's contributions are but a small part of a global effort to assist in the rebuilding of the National and University Library of Bosnia and Herzegovina. Many of the resources lost in the fire resulting from the artillery attack were unique and will never be recovered. However, collectively, the collections of OCLC members combined with those of other libraries should contain most of the nonunique material lost in the fire. Building the catalog, a virtual collection of Bosnian materials, is a small but important step in the rebuilding of the National and University Library of Bosnia and Herzegovina.


References

Kniffel, L. 1996. "National Library of Bosnia: To Restore or Memorialize?" American Libraries 27:17 (February).

Kniffel, L. 1996. "Harvard Library and Press Contribute to Bosnian Recovery Efforts." American Libraries 27:21 (February).

Lorkovic, T. 1992. "National Library in Sarajevo Destroyed; Collections, Archives Go Up in Flames." American Libraries 23:736, 816.

Mazmanian, A. 1996. "U.S. Helps Bosnian Libraries Rebuild." Library Journal 12:14 (October 15).

O'Neill, E. T. 1990. "Duplicate Detection." Annual Review of OCLC Research, July 1989-June 1990. Dublin, OH: OCLC Online Computer Library Center, Inc.: 13-14.

Perlez, J. "Sarajevo Journal: A Library Is in Ruins, but with a Will to Live." New York Times (national edition), August 12, 1996: A4.

Zeco, M. 1996. "Research Notes: The National and University Library of Bosnia and Herzegovina during the Current War." Library Quarterly 66:294-301.

FEASIBILITY OF A COMPUTER-GENERATED SUBJECT VALIDATION FILE BASED ON FREQUENCY OF OCCURRENCE OF ASSIGNED LC SUBJECT HEADINGS
PHASE II, NATURE AND PATTERNS OF INVALID HEADINGS

Lois Mai Chan, Professor, School of Library and Information Science, University of Kentucky, and Diane Vizine-Goetz, Consulting Research Scientist

Abstract

The first phase of this project studied the feasibility of automatically generating a subject validation file with complete strings but few errors and obsolete elements from LC-assigned subject headings appearing in bibliographic records. It analyzed the distribution and density of subject headings by frequency of use and calculated the rate of errors and obsolete elements in subject headings that have been used at least twice in the LCMARC database. The second phase analyzes the nature and patterns of incorrect and obsolete subject headings. Although errors in subject headings assigned by LC occur at a relatively low rate, an awareness of the nature of errors and the rate of obsolete headings will help to avoid or minimize such errors in the future. Understanding the recurring patterns of errors can also help to improve mechanisms for identifying and correcting errors and for updating obsolete headings automatically.

Nature, Scope, and Hypotheses

Consistency and accuracy of access points contribute to effective retrieval. In authority control, a major goal is to achieve consistency and accuracy in the form of access points by adhering to standards and guidelines. In the case of subject authority control, the standards and guidelines include Anglo-American Cataloging Rules, second edition, 1988 revision (AACR2R), Library of Congress Subject Headings (LCSH), and Subject Cataloging Manual: Subject Headings.

Because only about 15% of subject headings assigned to bibliographic records come directly from LCSH, catalogers seldom derive complete subject headings from authority records or LCSH. Errors do occur. Furthermore, because of constant changes in subject headings and heading assignment policies, many headings become obsolete. Important operations in subject authority control include detecting and correcting errors and maintaining currency.

The first phase of this project studied the feasibility of automatically generating a subject validation file with complete strings but few errors and obsolete elements from LC-assigned subject headings appearing in bibliographic records. It included an analysis of the distribution and density of subject headings by frequency of use and a calculation of the rate of errors and obsolete elements in subject headings that have been used at least twice in the LCMARC database. The results were presented in an earlier report [Chan and Vizine-Goetz 1995].

The second phase consists of an analysis of the nature and patterns of incorrect and obsolete subject headings. Although errors in subject headings assigned by the Library of Congress occur at a relatively low rate, an awareness of the nature of errors and the rate of obsolete headings will help to avoid or minimize such errors in the future. An understanding of the recurring patterns of errors can also help to improve mechanisms for identifying and correcting errors and for updating obsolete headings automatically.

Method

Study sample

The Subject Heading Corrections database, developed to correct subject heading errors in WorldCat (the OCLC Online Union Catalog), was the source of headings for this project. The database contains an entry for each unique subject heading used in bibliographic records loaded into the OCLC cataloging system through November 1992 (more than four million headings). A 1% sample of the headings assigned by the Library of Congress, consisting of 20,473 headings, was extracted from the database for further processing and examination.

Phase I of the project analyzed the distribution and density of headings by frequency of use and identified all the incorrect or obsolete headings among headings with a frequency of use greater than one, a total of 5,970 headings. Among these, 76 incorrect headings and 218 obsolete headings were found.

Because of the large number (a total of 14,503) of headings that were used only once each and because of time and budgetary constraints, a subset of the sample was used in the second phase: 23.93% of the headings with a frequency of use of one, comprising 3,472 headings beginning with the letters A, F, M, T, and Z. These headings, together with the 5,970 headings with a frequency of use of two or greater, for a total of 9,442 headings, formed the basis of the second phase (table 1).

Data analysis

Evaluation of headings

Details regarding subject heading evaluation are given in the report for phase I of the project. Briefly, the sample headings were sorted and evaluated for validity according to the following procedures:

Categorization of headings

The headings were categorized according to the following types:

600   Personal name headings
610   Corporate name headings
611   Headings for meetings
630   Uniform titles
650   Topical and form headings
651   Geographic name headings

Tools used in validation of headings

Each of the 9,442 headings in the study sample was checked for correct MARC tagging, terminology, syntax, spelling, punctuation, capitalization, etc., according to the standards and/or authority files listed here:

1. For MARC coding:
   - USMARC Formats for authority data [USMARC 1987] and for bibliographic data [USMARC 1988]
   - Authority records in the NAMES file in LOCIS (Library of Congress Information System)
   - Subject authority records in the LCXR (SUBJECTS) file in LOCIS
2. For forms of personal name headings (600), corporate name headings (610), meeting names (611), uniform titles (630), and jurisdictional geographic names (651):
   - Authority records in the NAMES file
   - Anglo-American Cataloging Rules, Second edition, 1988 revision (AACR2R) [AACR2 1988]
3. For topical headings (650), for nonjurisdictional geographic names in main headings (651) and in geographic subdivisions ($z), and for enumerated subdivisions:
   - Library of Congress Subject Headings (both the print version and the LCXR file in LOCIS)
   - Revised Library of Congress Subject Headings (1991) [LCSH 1991]
4. For free-floating subdivisions and phrases:
   - Free-Floating Subdivisions: An Alphabetical Index [FFS 1989]
   - Subject Cataloging Manual: Subject Headings [SCM 1984]

Table 1 Distribution of the Sample Headings by Tag

Frequency   Headings Cum Count       600       610       611       630       650       651     Total
> 500              3         3         -         -         -         -         2         1         3
401-500            3         6         -         -         -         -         3         -         3
301-400            4        10         1         -         -         -         3         -         4
201-300           10        20         -         -         -         -         8         2        10
101-200           28        48         -         -         -         -        20         8        28
51-100            74       122         1         -         -         -        59        14        74
46-50             17       139         -         -         -         -        15         2        17
41-45             22       161         2         -         -         -        17         3        22
36-40             28       189         -         -         -         -        23         5        28
31-35             35       224         1         -         -         -        27         7        35
26-30             47       271         1         1         -         -        38         7        47
21-25             82       353         6         1         -         1        62        12        82
16-20            115       468         1         1         -         2        89        22       115
11-15            246       714        14         5         -         -       184        43       246
6-10             792     1,506        42        11         -         7       563       169       792
3-5            1,876     3,382       144        59         -        17     1,311       345     1,876
2              2,588     5,970       250       103         -        12     1,809       414     2,588
1*             3,472     9,442       375       182         5        23     2,424       463     3,472
Total          9,442                 838       363         5        62     6,657     1,517     9,442

*Based on 23.93% of sample. A dash marks a cell left empty in the original table.
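The subsample arithmetic described under the study sample can be checked with a short Python sketch. The figures are those quoted in the text and in table 1; the variable names are illustrative only and are not part of the original study.

```python
# Figures quoted in the text and in table 1; variable names are illustrative.
freq_one_headings = 14_503      # sample headings used only once
freq_one_subsample = 3_472      # those beginning with A, F, M, T, or Z
freq_two_plus_headings = 5_970  # sample headings used two or more times

# Share of the once-used headings retained for phase II, in percent.
subsample_share = 100 * freq_one_subsample / freq_one_headings

# Size of the phase II study sample.
phase2_total = freq_two_plus_headings + freq_one_subsample

print(f"subsample share: {subsample_share:.2f}%")
print(f"phase II sample: {phase2_total:,} headings")
```

The computed share agrees with the reported 23.93% to within rounding, and the total matches the 9,442 headings shown in table 1.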


Examples of heading verification

The following examples illustrate the procedures of analyzing headings for errors and obsolete elements:

• Example 1   600 Gandhi, $c Mahatma, $d 1869-1948 $x Political and social views

  Element in heading                  Authority tools
  600, $c, $d, $x                     USMARC Formats
  Gandhi, $c Mahatma, $d 1869-1948    Name authority file
  Political and social views          Free-Floating Subdivisions; Subject Cataloging Manual

  In this case, the heading is valid.

• Example 2   610 Society of Friends $x Biography

  Element in heading                  Authority tools
  610, $x                             USMARC Formats
  Society of Friends                  Name and subject authority files; LCSH
  Biography                           LCSH; Free-Floating Subdivisions; Subject Cataloging Manual

  In this case, the combination is incorrect, because the subdivision $x Biography should be used under a class-of-persons heading, i.e., 650 Quakers.

• Example 3   630 Bible. $p O.T. $p Habakkuk $x Criticism, interpretation, etc.

  Element in heading                  Authority tools
  630, $p, $p, $x                     USMARC Formats
  Bible. $p O.T. $p Habakkuk          Name authority file
  Criticism, interpretation, etc.     Free-Floating Subdivisions; Subject Cataloging Manual

  In this case, the heading is valid.

• Example 4   650 Salvage archaeology $z South Dakota $z Lake Sharpe Region

  Element in heading                  Authority tools
  650, $z, $z                         USMARC Formats
  Salvage archaeology                 LCSH
  South Dakota                        Name and subject authority files; Subject Cataloging Manual
  Lake Sharpe Region                  LCSH; Subject Cataloging Manual

  In this case, the form of the geographic subdivision $z Lake Sharpe Region is incorrect. The correct form is: $z Sharpe, Lake, Region.

• Example 5   651 Latin America $x Foreign economic relations $z Great Britain $x History

  Element in heading                  Authority tools
  651, $x, $z, $x                     USMARC Formats
  Latin America                       LCSH
  Foreign economic relations          LCSH
  Great Britain                       LCSH
  History                             Subject Cataloging Manual

  In this case, the use of the subdivision $x History is incorrect; it may not follow the subdivision $x Foreign economic relations.

Types of errors and obsolescence

After verification, headings that were determined to be invalid were analyzed and categorized as either incorrect or obsolete. A heading that was both incorrect and obsolete was placed in the first category. Each invalid heading was placed in one of the following categories:

1. Incorrect headings. Headings containing one or more incorrect elements are labeled incorrect headings. The following codes are used to identify various types of errors (see note 1):

   Er-c: errors in MARC coding (Error-coding)
   Er-m: incorrect main heading (Error-main)
   Er-s: incorrect subdivisions (Error-subdivision), including incorrect subdivisions and improper combinations of either heading/subdivision or subdivision/subdivision. The code Er-s is followed by a letter indicating the type of the subdivision:
         x = form and topical subdivision
         y = chronological subdivision
         z = geographic subdivision
         A second letter indicates the nature of the error, e.g., Er-s(za) (incorrect application of geographic subdivision):
         t = terminology and syntax
         a = application
   Er-p: errors in punctuation, capitalization, spacing, etc.

2. Obsolete headings. Headings that were not valid at the time the sample was generated but were valid according to previous (15th or earlier) editions of LCSH and previous (4th or earlier) editions of Free-Floating Subdivisions were considered obsolete.

   A large number of invalid headings, particularly those with lower frequencies of use, contain personal, corporate, or geographic names, meeting names, and uniform titles that do not have corresponding name authority records in either the NAMES file or LCSH. These constitute a special type of obsolete heading, since current policy requires that each name, except those formed by free-floating phrases, used in or as a subject heading be established in the authority files. While some of these headings may be correct according to AACR2R, they cannot be used for validation purposes until established in the authority files. The following codes are used to identify various types of obsolete elements (see note 3):

   Ob-c: the heading was coded Ob-c (Obsolete-code) if it contains an obsolete MARC tag or subfield code.
   Ob-m: the heading was coded Ob-m (Obsolete-main) if it matched an earlier heading according to information provided in LCSH (16th or an earlier edition) (see note 4) or, in the case of names, if it matched a see-reference (in field 4xx) coded as an earlier heading in the name authority record.
   Ob-s: obsolete subdivision, including subdivisions in obsolete forms and obsolete combinations: a heading that contained one or more obsolete subdivisions or showed previously authorized combinations was coded Ob-s (Obsolete-subdivision). The code Ob-s is followed by a letter indicating the type of subdivision:
         x = form and topical subdivision
         y = chronological subdivision
         z = geographic subdivision
         A second letter indicates the nature of obsolescence, e.g., Ob-s(zt) (obsolete terminology or form of name in geographic subdivision):
         t = terminology and syntax
         a = application
   Ob-p: the heading was coded Ob-p (Obsolete-punctuation, etc.) if it contains obsolete practice in punctuation, capitalization, etc., for example, use of a comma instead of parentheses for qualifiers and hyphenated words instead of words without hyphens.
   Uv-m: a heading containing an unverified name in the main heading.
   Uv-s: a heading containing an unverified name in a subdivision.

3. A heading containing more than one type of invalid element was placed in the first category according to the order listed previously, based on the seriousness of invalidity: incorrect, obsolete. Thus, a heading that was both incorrect and obsolete was placed in the category of incorrect headings. A heading that contains both an obsolete element and an unverified name was placed in the category of Ob-m or Ob-s.

Results

Analysis of invalid headings by tag

Table 2 shows the distribution of invalid headings by type of headings. The overall rate of invalid headings, including all types of headings in all frequencies of use, is 9.16%. The percentage of invalid headings by type, in descending order, is: 611 (60.00%), 630 (33.87%), 610 (25.07%), 651 (13.12%), 600 (12.65%), and 650 (6.68%).

The overall rate of incorrect headings, including all types of headings in all frequencies of use, is 3.24%. The percentage of incorrect headings by type, in descending order, is: 611 (20.00%), 630 (12.90%), 610 (10.74%), 600 (3.94%), 651 (3.76%), and 650 (2.52%).

The overall rate of obsolete headings, including all types of headings in all frequencies of use, is 5.92%. The percentage of obsolete headings by type, in descending order, is: 611 (40.00%), 630 (20.97%), 610 (14.33%), 651 (9.36%), 600 (8.71%), and 650 (4.16%).

The high percentage of invalid headings among headings for meetings (611) may be due to the small size of the sample (a total of five).

Analysis of invalid headings by frequency of use

Table 3 shows the distribution of invalid headings by frequency of use. As discussed in the report for phase I of the project, the rate of invalid headings is in inverse relationship to the frequency of use; the rate of invalid headings increases as the frequency of use decreases. This trend is particularly prominent among headings with frequencies of one to three. The rate of invalid headings among those with a frequency of two shows an increase of 68% (from 4.18% to 7.03%) over headings with a frequency of three, and the rate of invalid headings among those with a frequency of one is more than double (16.45% against 7.03%) that among headings with a frequency of two.
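The coding scheme and precedence rule described under "Types of errors and obsolescence" can be sketched in Python as follows. This is a minimal illustration, not the procedure the authors used: the code strings (Er-c, Ob-s(zt), Uv-m, and so on) come from the text, but the names `InvalidityCode`, `parse_code`, `base_code`, and `primary_code` are hypothetical, and the sketch assumes that the order in which the codes are listed is the order of precedence, as notes 1 and 3 and item 3 of the category list suggest.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional

# Codes in the order they are listed in the text. Assumption: a heading with
# several invalid elements is filed under whichever of its codes comes first.
PRECEDENCE = ["Er-c", "Er-m", "Er-s", "Er-p",
              "Ob-c", "Ob-m", "Ob-s", "Ob-p",
              "Uv-m", "Uv-s"]

KINDS = {"Er": "incorrect", "Ob": "obsolete", "Uv": "unverified name"}
ELEMENTS = {"c": "coding", "m": "main heading",
            "s": "subdivision", "p": "punctuation, capitalization, etc."}
SUBDIVISIONS = {"x": "form and topical", "y": "chronological", "z": "geographic"}
NATURES = {"t": "terminology and syntax", "a": "application"}

# Optional two-letter suffix such as "(za)": subdivision type, then nature.
CODE_RE = re.compile(r"^(Er|Ob|Uv)-([cmsp])(?:\(([xyz])([ta])\))?$")


@dataclass(frozen=True)
class InvalidityCode:
    kind: str
    element: str
    subdivision: Optional[str] = None
    nature: Optional[str] = None


def base_code(code: str) -> str:
    """Strip the optional suffix: 'Er-s(za)' becomes 'Er-s'."""
    return code.split("(", 1)[0]


def parse_code(code: str) -> InvalidityCode:
    """Expand a code string into its descriptive parts."""
    match = CODE_RE.match(code)
    if match is None or base_code(code) not in PRECEDENCE:
        raise ValueError(f"unrecognized code: {code!r}")
    prefix, element, subdivision, nature = match.groups()
    if subdivision is not None and element != "s":
        raise ValueError(f"only subdivision codes take a suffix: {code!r}")
    return InvalidityCode(KINDS[prefix], ELEMENTS[element],
                          SUBDIVISIONS.get(subdivision), NATURES.get(nature))


def primary_code(codes: Iterable[str]) -> str:
    """Pick the code under which a heading with several problems is filed."""
    return min(codes, key=lambda c: PRECEDENCE.index(base_code(c)))
```

Under these assumptions, `primary_code(["Ob-s(zt)", "Er-m"])` selects `"Er-m"`, which mirrors the rule that a heading both incorrect and obsolete is counted as incorrect, and `parse_code("Er-s(za)")` describes an incorrect application of a geographic subdivision.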

Table 2 Analysis of Invalid Headings by Tag

                                                                           Total      Total
          Sample  Incorrect  Incorrect   Obsolete   Obsolete    Invalid    Invalid
Tag     Headings   Headings Headings %   Headings Headings %   Headings Headings %
600          838         33       3.94         73       8.71        106      12.65
610          363         39      10.74         52      14.33         91      25.07
611            5          1      20.00          2      40.00          3      60.00
630           62          8      12.90         13      20.97         21      33.87
650        6,657        168       2.52        277       4.16        445       6.68
651        1,517         57       3.76        142       9.36        199      13.12
Total      9,442        306       3.24        559       5.92        865       9.16
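The percentages in table 2, and the descending order quoted in the text, follow directly from the counts. The Python sketch below recomputes them; the counts are copied from table 2, while the names `TABLE_2`, `pct`, `invalid_rate`, and `ranked_tags` are illustrative only.

```python
# (sample headings, incorrect headings, obsolete headings) per MARC tag,
# copied from table 2.
TABLE_2 = {
    "600": (838, 33, 73),
    "610": (363, 39, 52),
    "611": (5, 1, 2),
    "630": (62, 8, 13),
    "650": (6657, 168, 277),
    "651": (1517, 57, 142),
}


def pct(part: int, whole: int) -> float:
    """Percentage rounded to two decimals, as printed in the table."""
    return round(100 * part / whole, 2)


# Rate of invalid (incorrect plus obsolete) headings for each tag.
invalid_rate = {tag: pct(inc + obs, n) for tag, (n, inc, obs) in TABLE_2.items()}

# Tags ordered from the highest to the lowest rate of invalid headings.
ranked_tags = sorted(invalid_rate, key=invalid_rate.get, reverse=True)

# Overall rate across all tags.
total_headings = sum(n for n, _, _ in TABLE_2.values())
total_invalid = sum(inc + obs for _, inc, obs in TABLE_2.values())
overall_rate = pct(total_invalid, total_headings)

for tag in ranked_tags:
    print(tag, invalid_rate[tag])
print("overall", overall_rate)
```

Because the 611 row rests on only five headings, its 60% rate carries far less weight than the rates for the larger categories, which is the caveat the text itself raises.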


Table 3 Distribution of Invalid Headings by Frequency of Use

                           Sample     Sample                                                  Total      Total
              Sample     Headings   Headings  Incorrect  Incorrect   Obsolete   Obsolete    Invalid    Invalid
Frequency   Headings       Cum Ct      Cum %   Headings Headings %   Headings Headings %   Headings Headings %
>25              271          271       1.32          -          -          1       0.37          1       0.37
21-25             82          353       1.72          -          -          1       1.22          1       1.22
16-20            115          468       2.29          -          -          5       4.35          5       4.35
11-15            246          714       3.49          -          -          3       1.22          3       1.22
10                96          810       3.96          1       1.04          2       2.08          3       3.13
9                124          934       4.56          1       0.81          2       1.61          3       2.42
8                138        1,072       5.24          2       1.45          8       5.80         10       7.25
7                175        1,247       6.09          -       0.00          4       2.29          4       2.29
6                259        1,506       7.36          1       0.39          5       1.93          6       2.32
5                352        1,858       9.08          2       0.57         14       3.98         16       4.55
4                542        2,400      11.72          4       0.74         15       2.77         19       3.51
3                982        3,382      16.52          7       0.71         34       3.46         41       4.18
2              2,588        5,970      29.16         58       2.24        124       4.79        182       7.03
1*             3,472        9,442      46.12        230       6.62        341       9.82        571      16.45
Total          9,442                                306       3.24        559       5.92        865       9.16

*Based on 23.93% of the sample. A dash marks a cell left empty in the original table.
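The inverse relationship between frequency of use and the rate of invalid headings can be verified from the last three frequency rows of table 3. The Python sketch below recomputes the rates and the two comparisons quoted in the text; the counts come from table 3, and the variable names are illustrative only.

```python
# (sample headings, invalid headings) for the three lowest frequencies of use,
# copied from table 3.
TABLE_3_LOW = {3: (982, 41), 2: (2588, 182), 1: (3472, 571)}

# Rate of invalid headings at each frequency, in percent (unrounded).
rate = {freq: 100 * invalid / n for freq, (n, invalid) in TABLE_3_LOW.items()}

# Relative increase in the rate going from frequency 3 to frequency 2, in percent.
increase_3_to_2 = 100 * (rate[2] - rate[3]) / rate[3]

# How many times higher the rate is at frequency 1 than at frequency 2.
ratio_1_to_2 = rate[1] / rate[2]

for freq in (3, 2, 1):
    print(f"frequency {freq}: {rate[freq]:.2f}% invalid")
print(f"increase from 3 to 2: {increase_3_to_2:.0f}%")
print(f"ratio of 1 to 2: {ratio_1_to_2:.2f}")
```

The relative increase is computed on the rates rather than on the raw counts, since the three frequency groups differ greatly in size.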

Analysis of patterns of incorrect headings

Table 4 shows the distribution of the 306 incorrect headings by type of errors. Overall, the largest number of errors occur in the main heading and in subdivisions (with 109 each), followed by errors in coding (73), and mechanical errors (15).

Among incorrect personal name headings (600), a total of 33, the most frequently occurring errors are found in main headings (23). Errors in punctuation, etc. (5), constitute the next largest group, followed by errors in coding (4) and in subdivision (1).

Among incorrect corporate name headings (610), a total of 39, the most frequently occurring errors are found in the main heading and in coding, with 14 each. Errors in subdivision (8) constitute the next largest group, followed by errors in punctuation (3).

There is only one incorrect heading for meetings. The error was found in the main heading.

Among incorrect uniform title headings (630), a total of 8, the most frequently occurring errors are found in coding (4), followed by errors in main heading (3) and subdivision (1).

Among incorrect topical headings (650), a total of 168, the most frequently occurring errors are found in subdivision (76). Errors in main heading (45) and those in coding (43) constitute the next largest groups, followed by mechanical elements (4).

Among incorrect geographic headings (651), a total of 57, the most frequently occurring errors are found in main heading and in subdivision (with 23 each), followed by those in coding (8), and errors in punctuation, etc. (3).

Table 4 Analysis of Patterns of Incorrect Headings

          Coding      Main    Form &    Form &     Chron     Chron       Geo       Geo       Sub   Punct./     Total
                   Heading   Topical   Topical                                                    Capital. Incorrect
Tag       (Er-c)    (Er-m)   Sub/App  Sub/Term   Sub/App  Sub/Term   Sub/App  Sub/Term     Total    (Er-p)  Headings
600            4        23         1         -         -         -         -         -         1         5        33
610           14        14         4         2         -         -         2         -         8         3        39
611            -         1         -         -         -         -         -         -         -         -         1
630            4         3         -         1         -         -         -         -         1         -         8
650           43        45        26         6         5         -        22        17        76         4       168
651            8        23        13         3         1         4         2         -        23         3        57
Total         73       109        44        12         6         4        26        17       109        15       306

Key: App = Application; Term = Terminology; Chron = Chronological; Geo = Geographic; Sub = Subdivision; Punct. = Punctuation; Capital. = Capitalization. A dash marks a cell left empty in the original table.


Table 5 Analysis of Patterns of Obsolete Headings

          Coding      Main    Form &    Form &     Chron     Chron       Geo       Geo       Sub   Capital    Unverified     Total
                   Heading   Topical   Topical                                                        etc.   Names Total  Obsolete
Tag       (Ob-c)    (Ob-m)   Sub/App  Sub/Term   Sub/App  Sub/Term   Sub/App  Sub/Term     Total    (Ob-p)  (Uv-m; Uv-s)  Headings
600            -        14        12         1         -         -         -         -        13         2            44        73
610            1         9        10         -         -         -         -         1        11         1            30        52
611            -         2         -         -         -         -         -         -         -         -             -         2
630            -         2         6         -         -         -         -         -         6         -             5        13
650            -        41        33        12         -         2        41       127       215         8            13       277
651            -        49        31         3         1         4         2         3        44         6            43       142
Total          1       117        92        16         1         6        43       131       289        17           135       559

Key: App = Application; Term = Terminology; Chron = Chronological; Geo = Geographic; Sub = Subdivision; Capital = Capitalization. A dash marks a cell left empty in the original table.

Analysis of patterns of obsolete headings

Table 5 shows the distribution of the 559 obsolete headings by category of headings and type of obsolete elements. Overall, the largest number of obsolete elements occur in subdivision (289), followed by unverified names (135), obsolete elements in the main heading (117), obsolete mechanical elements (17), and obsolete coding (1).

Among obsolete personal name headings (600), a total of 73, the most frequently occurring obsolete elements are unverified names (44). Obsolete elements in main heading (14) constitute the next largest group, followed by obsolete elements in subdivision (13), and obsolete punctuation, etc. (2).

Among obsolete corporate name headings (610), a total of 52, the most frequently occurring obsolete elements are also unverified names (30). Obsolete elements in subdivision (11) constitute the next largest group, followed by obsolete main heading (9), mechanical elements (1), and coding (1).

There are only two obsolete headings for meetings (611). In both cases, the obsolete elements were found in the main heading.

Among obsolete uniform title headings (630), a total of 13, the most frequently occurring obsolete elements are found in subdivision (6), followed by unverified uniform titles (5), and obsolete elements in main heading (2).

Among obsolete topical headings (650), a total of 277, the most frequently occurring obsolete elements are found in subdivision (215). Obsolete elements in main heading (41) constitute the next largest group, followed by unverified names (13), and obsolete mechanical elements (8).

Among obsolete geographic headings (651), a total of 142, the most frequently occurring obsolete elements are found in main heading (49). Obsolete elements in subdivisions (44) constitute the next largest group, followed by unverified names (43), and obsolete mechanical elements (6).

Analysis of invalid subdivisions

Table 6 displays data regarding invalid subdivisions only. Columns B-F present data regarding form or topical subdivisions ($x), columns G-K regarding chronological subdivisions ($y), and columns L-P regarding geographic subdivisions ($z). Within each type of subdivision, invalid elements are divided into incorrect or obsolete subdivisions, and each is further divided into application and terminology/syntax.

Table 6 Analysis of Invalid Subdivisions by Tag

        Form or Topical ($x)          Chronological ($y)            Geographic ($z)
         Incorrect    Obsolete         Incorrect    Obsolete         Incorrect    Obsolete
A          B     C     D     E     F     G     H     I     J     K     L     M     N     O     P     Q
Tag     Er-s  Er-s  Ob-s  Ob-s Total  Er-s  Er-s  Ob-s  Ob-s Total  Er-s  Er-s  Ob-s  Ob-s Total Total
        (xa)  (xt)  (xa)  (xt)    $x  (ya)  (yt)  (ya)  (yt)    $y  (za)  (zt)  (za)  (zt)    $z
600        1     -    12     1    14     -     -     -     -     -     -     -     -     -     -    14
610        4     2    10     -    16     -     -     -     -     -     2     -     -     1     3    19
611        -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     0
630        -     1     6     -     7     -     -     -     -     -     -     -     -     -     -     7
650       26     6    33    12    77     5     -     -     2     7    22    17    41   127   207   291
651       13     3    31     3    50     1     4     1     4    10     2     -     2     3     7    67
Total     44    12    92    16   164     6     4     1     6    17    26    17    43   131   217   398
App       44     -    92     -   136     6     -     1     -     7    26     -    43     -    69   212
Term      12     -    16     -    28     4     -     6     -    10    17     -   131     -   148   186

Key: App = Application (second code letter a); Term = Terminology/Syntax (second code letter t); Chron = Chronological; Geo = Geographic; Sub = Subdivision; Top = Topical. In the Term row, values are shown under the corresponding incorrect and obsolete groups. A dash marks a cell left empty in the original table.

For personal name headings (600) and uniform title headings (630), all invalid subdivisions appear in form or topical subdivisions. A possible explanation is that these types of headings are rarely subdivided chronologically or geographically. There is no invalid subdivision among headings for meetings (611). For corporate name headings (610), invalid elements occur in form/topical subdivision ($x) (16 of 19, or 84%) and geographic subdivision ($z) (3 of 19, or 16%). For topical headings (650), invalid subdivisions involve all three types of subdivisions, with the majority (207 of 291, or 71%) being geographic subdivisions ($z). For geographic headings (651), the majority of invalid subdivisions involve form/topical subdivision ($x) (50 of 67, or 75%), followed by chronological subdivisions ($y) (10 of 67, or 15%), and geographic subdivisions ($z) (7 of 67, or 10%).

Taken together, 164 of the invalid headings involve form/topical subdivisions ($x). There appear to be few problems with chronological subdivisions ($y). Only 17 invalid headings involve period subdivisions. More than half (217) of the 398 invalid headings pertain to geographic subdivisions.

The two rows at the bottom of table 6 show a summary of errors and obsolete elements in terms of application and form (terminology/syntax) of subdivisions. In all, 212 (53%) of the 398 invalid subdivisions involve application and 186 (47%) pertain to terminology/syntax.

Among the 164 invalid form/topical subdivisions, 136 (83%) involve application and 28 (17%) pertain to terminology/syntax. Forty-four (32%) of the 136 invalid applications represent improper combination of main-heading/subdivision or subdivision/subdivision, and 92 (68%) reflect obsolete practice. Among the 28 invalid terms or syntax, 12 (43%) are errors and 16 (57%) reflect obsolete terminology/syntax.

Among the 17 invalid chronological subdivisions, 7 (41%) involve application and 10 (59%) terminology/syntax. Six of the 7 invalid applications represent improper combination of main-heading/subdivision or subdivision/subdivision, and 1 reflects obsolete practice. Among the 10 invalid terms, 4 are errors and 6 reflect obsolete dates or terms.

Among the 217 invalid geographic subdivisions, 69 involve invalid application, and 148 contain invalid geographic names. Twenty-six (38%) of the 69 invalid applications represent improper combination of main-heading/subdivision or subdivision/subdivision, and 43 (62%) reflect obsolete geographic subdivision practice. Among the 148 invalid geographic names, 17 (11%) are errors and 131 (89%) reflect obsolete geographic names.

Conclusion

Based on the data collected and analyzed, we may draw the following conclusions:

1. The rate of invalid headings varies among different types of headings. Name headings (600, 610, 611) and uniform title headings (630) show the highest rate of invalid headings, both in the category of incorrect headings and obsolete headings. Geographic headings (651) show a low rate of incorrect headings and a moderate rate of obsolete headings. Topical headings (650), which account for the largest percentage of assigned headings, show the lowest rate of invalid headings in both categories (incorrect and obsolete headings).

2. The rate of invalid headings also varies among headings of different frequencies of use. A detailed analysis of headings with a frequency of two or higher and this conclusion were included in the report on the first phase of this project. An analysis of a subsample of headings with a frequency of use of one further supports the conclusion that the rate of invalid headings is in inverse relationship to frequency of use.

3. Errors in assigned headings include: coding, main heading, subdivision (topical and form, chronological, and geographic), and mechanics (punctuation, capitalization, etc.). Errors in subdivision may involve terminology and syntax or application.

   Different types of headings show different patterns of errors. The most common errors in personal name headings (600), in descending order of occurrence, are: main heading, punctuation, coding, and subdivision. For corporate name headings (610) there is an equal number of errors in main heading and coding, followed by subdivision and punctuation. The sample of headings for meetings (611) and uniform titles (630) is too small to draw meaningful conclusions. Among topical headings (650), the greatest number of errors involve subdivisions, particularly in application, followed by errors in main heading, in coding, and rarely in mechanics. Among geographic headings (651), there are an equal number of errors in main heading and in subdivision, particularly in application. The rate of errors in coding and mechanics is relatively low. Overall, the rate of errors in all types of heading, in descending order, is: main heading and subdivision (occurring with equal frequency), coding, and mechanics.

4. Obsolete elements in assigned headings include: coding, main heading, subdivision (topical and form, chronological, and geographic), mechanics (punctuation, capitalization, etc.), and unverifiable names.

Obsolete elements in subdivision may involve terminology and syntax or application. Different types of headings show different patterns of obsolete elements. The most common obsolete elements in personal name headings (600), in descending order of occurrence, are: unverified names, main heading, subdivision, and mechanics. Among corporate name headings (610), they are: unverified names, subdivision, main heading, coding, and mechanics. The sample of headings for meetings (611) and uniform titles (630) is too small to draw meaningful conclusions.

Among topical headings, by far the greatest number of obsolete elements involve subdivisions, particularly in application, followed by obsolete elements in main heading, in unverifiable names, and in mechanics. Among geographic headings (651), the greatest number of obsolete elements involve main heading, followed by subdivision (also heavily in application), unverified names, and mechanics.

Overall, the rate of obsolete elements in all types of headings, in descending order, is: subdivision, unverifiable names, main heading, mechanics, and coding.

The purpose of this research project is to collect and analyze data regarding assigned subject headings, which would help to improve the quality and efficiency of subject authority control. Another possible use of the data is in the development or enhancement of automatic error correction mechanisms. Predictable errors lend themselves to automatic correction. These include errors in mechanics, such as punctuation and capitalization, and in terminology, provided there is a reliable subject authority file. Almost all obsolete elements are predictable and amenable to automatic updating, with the exception of names used in subject headings that do not have corresponding name authority records. These occur pervasively among headings with low frequencies of use, particularly those with a frequency of one or two. Correction of errors in application and coding requires sophisticated mechanisms.

The results of this study may also be useful for those engaged in the training of subject catalogers or subject authority control. Instructors of subject cataloging may also find the data useful. Being aware of the frequency and patterns of errors is a step towards improved quality in subject heading assignment.

Notes

1 A heading containing more than one error is placed in the category coming first in the list. For example, a heading containing errors in both the main heading and a subdivision is coded Er-m.

2 Many of the headings in the sample contain an extraneous period (.) before the subfield $x, $y, or $z and lack a period before the subfield $t. Initially, these periods were considered errors. Upon further investigation, it was discovered that this anomaly was introduced during the processing of the sample. The irregular punctuation did not originate in LC MARC records. As a result, the extraneous and missing periods are not considered errors in the results, except one instance where the LC MARC record shows the extraneous period before subfield $x.

3 A heading containing more than one obsolete element is placed in the category coming first in the list. For example, a heading containing obsolete elements in both the main heading and a subdivision is coded Ob-m.

4 A name heading matching a see-reference in a 4xx field in the name authority record not coded as an earlier heading was considered an incorrect heading.

References

Anglo-American Cataloging Rules, 2nd ed., 1988 revision. Prepared under the direction of the Joint Steering Committee for Revision of AACR, a committee of: the American Library Association, the Australian Committee on Cataloging, the British Library, the Canadian Committee on Cataloging, the Library Association, the Library of Congress. Michael Gorman and Paul W. Winkler, eds. Chicago: American Library Association.

Chan, Lois Mai and Diane Vizine-Goetz. 1995. "Feasibility of a Computer-generated Subject Validation File Based on Frequency of Occurrence of Assigned LC Subject Headings." In Annual Review of OCLC Research, 1995. Dublin, OH: OCLC Online Computer Library Center, Inc.: 46-52. Available at: http://www.oclc.org/oclc/research/publications/review95/part2/chan.htm

Library of Congress, Office for Subject Cataloging Policy. 1991. Revised Library of Congress Subject Headings: Cross-References from Former to Current Subject Headings, compiled from the online subject authority file of the Library of Congress, 1st ed. Washington: Cataloging Distribution Service.

Library of Congress, Subject Cataloging Division. 1989-. Free-Floating Subdivisions: An Alphabetical Index. Washington: Library of Congress Cataloging Distribution Service.

Library of Congress, Subject Cataloging Division. 1984-. Subject Cataloging Manual: Subject Headings. Washington: Library of Congress Cataloging Distribution Service.

USMARC Format for Authority Data, Including Guidelines for Content Designation. Prepared by Network Development and MARC Standards Office. 1987. Washington: Library of Congress Cataloging Distribution Service.

USMARC Format for Bibliographic Data, Including Guidelines for Content Designation. Prepared by Network Development and MARC Standards Office. 1988. Washington: Library of Congress Cataloging Distribution Service.


THE MONTICELLO PROJECT
DESIGN CONSIDERATIONS FOR A VIRTUAL LIBRARY

Eric J. Miller, Associate Research Scientist, Tod Matola, Systems Analyst, Pat Stevens, OCLC SiteSearch Product Planning, and Jay Hayden, Southeastern Library Network (SOLINET)

Abstract
This paper describes a design for a system that facilitates the discovery and retrieval of resources in a distributed, heterogeneous information environment. The primary elements of this design are database and interface tools and a high-level semantic description based on the Dublin Core element set. This framework provides a model of navigation that supports the process of discovery with predictable access points into differing collections of information. This design provides the basis of SOLINET's Monticello Electronic Library Project, designed to link distributed regional resources regardless of the source or type of information.

Introduction
The explosive growth of the World Wide Web has created the opportunity to offer unprecedented access to information. Valuable collections of media (e.g., texts, images, and sounds) from many scholarly communities exist in distributed electronic collections. The challenge is to use technology to build systems that provide multiple communities of users with divergent requirements practical means to discover these cross-disciplinary heterogeneous resources.

A request from SOLINET, an OCLC-affiliated regional network, led to OCLC's involvement in SOLINET's Monticello Project. SOLINET requested a set of tools that provided access to heterogeneous information resources for both naive and skilled users in a single interface. These distributed collections include Government Information Locator Service (GILS), the Encoded Archival Description (EAD) for special collection finding aids, and machine-readable cataloging (MARC) records. For a brief history of the Monticello Project, see appendix 1.

High-level Design Considerations
Navigation in an expanding distributed medium of information requires at least two components: finding records of interest (discovery) and evaluating records to determine how useful the object will be (analysis). One possible way to view these distributed environments is as a virtual library. The choice of this model makes sense since physical libraries have spent much effort to accomplish this with collections of diverse material.

Today's electronic tools take two different approaches to discovery and analysis. The approach used in online catalogs is well suited to naive and infrequent users as it offers a structured navigational framework. The second approach, derived from early command-based systems and used in many CD-ROM based products, offers the sophisticated searcher more power and control. Neither approach offers a complete solution for the diverse collections and user requirements of a virtual collection like Monticello. Our goal was to provide a framework that could serve the full spectrum of users, that is, to support sophisticated searchers in a way that did not overwhelm the naive user. The resulting navigation design uses the online catalog, immediately familiar to most searchers, as a starting point. Options that allow the sophisticated searcher to work more efficiently are integrated into this framework. Given this navigational approach to discovery and analysis, some elements we considered in completing the design were:
• Integrating both local and distributed information collections into a logical whole (e.g., a "virtual database")
• Maintaining the contiguity of a search session including refinement, history, and navigation
• Supporting both simple and complex Boolean searching including proximity, keyword, and/or phrase searching
• Configuring display of content dependent on the information resource and/or search session
• Coherent and predictable navigation
  - across search sessions
  - for each search state
  - for disparate information collections
• Supporting information in multiple formats: text, sound, video, and images

Metadata and the Search Process
Early in the design process it became evident that a high-level semantic description of metadata was necessary to provide predictable access points into differing collections of information. For this reason the Dublin Core element set was the logical choice. The Dublin Metadata Workshop in March 1995, the Warwick Metadata Workshop in April 1996, and the Image Metadata Workshop in September


1996 were convened to promote consensus concerning network resource description and a supporting architecture among diverse groups including geospatial researchers, computer scientists, text markup experts, and librarians.

The Dublin Core element set is, in part, about semantics. Its major significance lies in the consensus that was achieved across the many disciplines regarding this context of discovery. For the complete reference definition of the Dublin Core Metadata element set, December 1996, see appendix 2.

The role of such a description model that supports discovery is expressed by Ricky Erway's Digital Tourist metaphor [Erway 1996], allowing a "digital traveler" to locate resources and to access such resources in the way a tourist might accomplish simple communication in unfamiliar cultures. Richer, domain-specific description models will be necessary for the domain specialist or researcher, but a simple resource description set such as the Dublin Core might provide wider accessibility to many resources that could otherwise escape the attention of the nonspecialist searcher.

For these reasons, the Dublin Core element set was a logical model for supplying the interoperable framework necessary for the Monticello Project. A crosswalk between the Dublin Core and the richer descriptive models associated with the Monticello Project (GILS, MARC, and EAD) is defined to provide the digital traveler with the ability to navigate domain-specific collections.

Intellectual access points are defined between each of the Dublin Core elements and the corresponding collections. The following example represents elements of the crosswalk between the Dublin Core and the GILS, MARC, and EAD collections:

DublinCore: Title
  EAD: <eadheader> ... <title> ... <subtitle>
  GILS: Title
  USMARC: 245 (Title Statement) $a Title Proper $b Remainder of Title (subtitle)

DublinCore: Creator
  EAD: <origination>
  GILS: Originator
  USMARC: 100 (Main entry - Personal Name); 110 (Main entry - Corporate name)

A simple mapping between these descriptive schemes is found in appendix 3. This mapping is a work-in-progress and thus is subject to change based on feedback from the various user communities as well as phased implementation of the Monticello Project. An up-to-date map reflecting ongoing modifications is available from http://purl.oclc.org/metadata/dublin_core.

Navigation Model
To facilitate the discovery model in a heterogeneous environment, a set of navigational tools that build on predictable access points is desired. The framework for this model should also integrate distributed collections in a logical whole. We refer to this framework as a virtual database. The navigational tools which build on a virtual database will allow users to coherently and predictably interact with distributed databases (fig. 1). We also identified two desirable features for controlling the focus of the virtual database: expand and reduce. Expand, available from each of the virtual database's components, broadens the focus to searching against the entire virtual database. Reduce narrows the searching focus to the current component.

Fig. 1 The Conceptual Overview of the Navigation Tool [diagram: Servers A, B, and C, each with Results and Record elements]

Additionally, design considerations identified a feature missing from online catalogs and many command-based systems: a filter. The filter allows the user to specify what kinds of records should be allowed to pass through for viewing. All other records are trapped behind the filter (fig. 2). For instance, the searcher can ask to see only records in French written after 1985. All other records are blocked.

Fig.
2 Stateful Searching with an Optional Filter

We propose to use a hybrid stateful HTTP-Z39.50 gateway that would support the iteration and navigation necessary for retrieving information from large and complex information resources (fig. 3). A rationale for this is to provide Web-based access to a standard protocol for supporting both distributed searching across databases and access to legacy collections in a rich and robust fashion.

Fig. 3 Conceptual View of the HTTP-Z39.50 Server for Simultaneous Searching

The Monticello Project
SOLINET's request led to the design described. The implementation of the Monticello Project is the next step in the process of linking distributed regional resources regardless of the source or type of information. Record types include documents produced with Standard Generalized Markup Language (SGML) Document Type Definition (DTD) tags for Government Information Locator Service (GILS), documents based on the Encoded Archival Description (EAD) DTD for special collection finding aids, and machine-readable cataloging (MARC) records.

Access to valuable collections of resources from many communities is increasingly necessary for cross-disciplinary research. Effective discovery and retrieval of distributed, heterogeneous information is a formidable task. High-level semantic description to define predictable access points into differing collections of information and tools based on models of the discovery or navigation process provide a sound foundation for this type of access.

Appendix 1 The Monticello Project History
The Southeastern Library Network (SOLINET) received a grant for $420,000 from the U.S. Department of Commerce, through the Telecommunications and Information Infrastructure Assistance Program (TIIAP) of the National Telecommunications and Information Administration (NTIA). The grant will fund a two-year demonstration project to link and integrate distributed regional information resources, focusing on library special collections and state government information. This demonstration project builds upon the 1994 NTIA-funded joint SOLINET/SURA (Southeastern Universities Research Association) project, "Planning for Access to Government Information and Educational Resources in the Southeast." SURA is a partner in the new grant award, as are the Southern Growth Policies Board (SGPB) and the OCLC Online Computer Library Center. SOLINET, SURA, SGPB, and OCLC will provide $432,500 in matching funds.

More information on this project is available at http://www.solinet.net/monticello/monticel.htm

Appendix 2 Reference Definition of the Dublin Core Metadata Element Set
The following comprises the reference definition of the Dublin Core Metadata element set as of December 1996:
• Title: The name given to the resource.
• Creator: The person(s) or organization(s) primarily responsible for the intellectual content of the resource.
• Subject: The topic of the resource: keywords or phrases that describe the subject or content of the resource.
• Description: A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.
• Publisher: The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity.
• Contributors: Person(s) or organization(s) in addition to those specified in the Creator element who have made significant intellectual contributions to the resource but whose contribution is secondary to those of the individuals or entities specified in the Creator element.
• Date: The date the resource was made available in its present form.
• Type: The category of the resource, such as: home page, novel, poem, working paper, technical report, essay, or dictionary.
• Format: The data representation of the resource, such as: text/HTML, ASCII, PostScript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data.
• Identifier: String or number used to uniquely identify the resource.
• Source: The work, either print or electronic, from which this resource is derived, if applicable.
• Language: Language(s) of the intellectual content of the resource.
• Relation: Relationship to other resources. The intent of specifying this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves.
• Coverage: The spatial locations and temporal durations characteristic of the resource.
• Rights: The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources.

Appendix 3 Dublin Core Crosswalk
The following is a crosswalk between the Dublin Core element set and GILS, EAD, and MARC. This crosswalk is a work-in-progress and thus subject to change based on feedback from various user communities as well as phased implementation on this project. For an up-to-date crosswalk reflecting ongoing modifications, see the Dublin Core Home Page: http://purl.oclc.org/metadata/dublin_core.

DublinCore: Title
  EAD: <eadheader> ... <title> ... <subtitle>
  GILS: Title
  USMARC: 245 (Title Statement) $a Title Proper $b Remainder of Title (subtitle)

DublinCore: Creator
  EAD: <origination>
  GILS: Originator
  USMARC: 100 (Main entry - Personal Name); 110 (Main entry - Corporate name)

DublinCore: Other Agent
  EAD: <author> ... <sponsor>
  GILS: Not Applicable
  USMARC: 700 (Added Entry - Personal Name); 710 (Added Entry - Corporate Name)

DublinCore: Publisher
  EAD: <imprint> ... <publisher> ... <repository> ... <address>
  GILS: Distributor; Point of Contact
  USMARC: 260 (Publication, Distribution, etc.) $b Name of Publisher; 270 (Primary Address); 550 (Issuing Body Note)

DublinCore: Date
  EAD: <unitdate> ... <creation>
  GILS: Available Time Period-Structured; Available Time Period-Textual
  USMARC: 260 (Publication, Distribution, etc.) $c Date of publication, distribution, etc.; 362 (Dates of Publication and/or Volume Designation)

DublinCore: Subject
  EAD: <controlaccess> ... <index><indexterm> ... <subject>
  GILS: Index Terms - Controlled; Thesaurus; Local Subject Term
  USMARC: 6XX (Subject added entries - Personal name, corporate name, etc.); 650 (Subject added entry - Topical Term); 653 (Index Term - Uncontrolled)

DublinCore: Description
  EAD: <scopecontent> ... <profiledesc> ... <note>
  GILS: Abstract
  USMARC: 520 (Summary, etc. Note)

DublinCore: ObjectType
  EAD: <do> ... <genreform>
  GILS: Not applicable
  USMARC: 006 (Type of Record); 516 (Type of Computer File or Data Note)

DublinCore: Format
  EAD: <add> ... <dodesc> ... <extent> ... <fileplan>
  GILS: Format
  USMARC: 007 (Physical Description Fixed Field); 256 (Computer File Characteristics); 300 (Physical Description); 340 (Physical Medium); 538 (System Details Note); 551 (Entity and Attribute Information Note); 753 (System Details Access to Computer Files)

DublinCore: Identifier
  EAD: <unitid>
  GILS: Control Identifier; Original Control Identifier
  USMARC: 010 (LC Control Number); 020 (ISBN); 022 (ISSN); 024 (Other Standard Identifier); 035 (System Control Number); 856 (Electronic Location and Access) $b Access Number

DublinCore: Language
  EAD: <archdesc> ... <langmaterial> ... <langusage>
  GILS: Not applicable
  USMARC: 041 (Language Code); 546 (Language Note)

DublinCore: Geographic Coverage
  EAD: <chronlist> ... <chronitem> ... <event> ... <geogname>
  GILS: Spatial Domain; Bounding Coordinates, Place; Time Period - Structured; Time Period - Textual
  USMARC: 033 (Date/Time and Place of an Event); 034 (Coded Cartographic Mathematical Data); 043 (Geographical Area Code); 045 (Time Period of Content); 255 (Cartographic Mathematical Data); 306 (Playing Time); 310 (Current Publication Frequency); 342 (Geospatial Reference Data); 343 (Planar Coordinate Data); 352 (Digital Graphic Representation); 513 (Type of Report and Period Covered Note); 518 (Date/Time and Place of Event Note); 522 (Geographic Coverage Note); 651 (Subject Added Entry - Geographic Name); 742 (Added Entry - Hierarchical Place Name)

DublinCore: Relationship
  EAD: <relatedmaterial> ... <separatedmaterial>
  GILS: Not applicable
  USMARC: 772 (Parent Record Entry); 773 (Host Item Entry); 776 (Additional Physical Form Entry); 780 (Preceding Entry); 785 (Succeeding Entry); 787 (Nonspecific Relationship Entry); 580 (Linking Entry Complexity Note)

DublinCore: Source
  EAD: <acqinfo> ... <admininfo>
  GILS: Record Source; Sources of Data
  USMARC: 786 (Data Source Entry)

References
CNI/OCLC Metadata Workshop on Metadata for Networked Images. September 24-25, 1996. Available at: http://purl.oclc.org/metadata/dublin_core
Dempsey, Lorcan and Stuart L. Weibel. 1996. The Warwick Metadata Workshop: A Framework for the Deployment of Resource Description. Available at: http://www.dlib.org/dlib/july96/07weibel.html
The Dublin Core Element Set. Available at: http://purl.oclc.org/metadata/dublin_core_elements
Erway, Ricky. Personal communication with authors at the CNI/OCLC Metadata Workshop on Metadata for Networked Images, September 24-25, 1996. Available at: http://purl.oclc.org/metadata/image
OCLC's InterCat Project. Available at: http://www.oclc.org:6990/
OCLC/NCSA Metadata Workshop. 1995. Available at: http://purl.oclc.org/OCLC/RSCH/MetadataI
The UKOLN/OCLC Metadata Workshop II. 1996. Available at: http://purl.oclc.org/OCLC/RSCH/MetadataII
Weibel, Stuart, Eric J. Miller, C. Jean Godby, and Ralph LeVan. 1995a. "An Architecture for Scholarly Publishing on the World Wide Web." Computer Networks and ISDN Systems 28:239-5.
Weibel, Stuart, C. Jean Godby, Eric J. Miller, and Ron Daniel. 1995b. OCLC/NCSA Metadata Workshop Report: The Essential Elements of Network Object Description. Available at: http://purl.oclc.org/metadata/dublin_core_report

THE WARWICK METADATA WORKSHOP
A FRAMEWORK FOR THE DEPLOYMENT OF RESOURCE DESCRIPTION

Lorcan Dempsey, UKOLN, University of Bath, U.K., and Stuart L. Weibel, Consulting Research Scientist

Abstract
This paper presents information regarding work done during the OCLC/UKOLN Warwick Metadata Workshop held in April 1996 at Warwick University, Warwick, U.K. The focus of this workshop was on the problem of the deployment of metadata for networked information resources. The Dublin Core metadata elements, defined at the OCLC/NCSA Metadata Workshop held in March 1995 in Dublin, Ohio, formed the basis for the discussions. The final deliverable was the Warwick Framework, a description of a container architecture for aggregating metadata objects of interchange, as well as descriptions for how to implement this architecture.

Introduction
The first week of April 1996 found 50 representatives of libraries, Internet standards, text markup, and digital library projects converging at Warwick University for the OCLC/UKOLN Warwick Metadata Workshop. The conferees came from 3 continents, 11 countries, and numerous perspectives to apply their collective experience to the clarification of issues surrounding the effective deployment of metadata for networked information resources.

Planning for the second workshop began with informal discussions between the U.K. Office for Library and Information Networking (UKOLN) and OCLC's Office of Research in the summer of 1995. The agenda for the meeting gradually crystallized around the theme of identifying and resolving impediments to deployment of a Dublin Core-style record for resource description. The meeting exceeded expectations as conferees worked toward related conclusions about the Dublin Core metadata element set, about the need for a wider set of metadata types, and about an extensible framework for interchange of metadata of different types. A consensus about these issues emerged from the workshop and a set of concrete proposals for moving forward was produced. The areas of consensus include the following:
• Dublin Core
  - A concrete syntax for the Dublin Core expressed as a Document Type Definition (DTD) in Standard Generalized Markup Language (SGML).
  - A mapping of this syntax to existing HyperText Markup Language (HTML) tags to enable a consistent means for embedding author-generated description metadata in Web documents. Other mappings will be performed in the future (to enable Dublin Core descriptions to be embedded in various image file formats and in PostScript's Structured Comments, for example).
• Warwick Framework
  - The Warwick Framework: a container architecture for aggregating metadata objects for interchange.
  - Descriptions of how to implement this architecture in MIME, SGML, and CORBA environments.
• Guide to creation and maintenance of metadata
  - Guide to authors for generating resource descriptions.
  - Guide to administrators of collections.

The meeting followed last year's OCLC/NCSA Metadata Workshop, which convened a similarly diverse collection of stakeholders and resulted in consensus on a simple resource description record that has come to be known as the Dublin Core. Indeed, the consensus itself may well have been the first workshop's most important deliverable. The 13 elements of a Dublin Core record contain few surprises, focusing largely on what might be thought of as network resource bibliography and a little bit more [Weibel 1995a].

The Dublin Core received considerable attention as a simple resource description record in the year since the first meeting. While the first workshop helped to focus discussion of the topic in many communities, the implementation of such a description record requires a formal syntax and deployment strategy that were beyond the scope of that first meeting.

This paper provides an overview of the issues discussed at the workshop. "Moving the Dublin Core Forward" discusses the Dublin Core and the proposals for taking it forward. "The Warwick Framework: An Architecture for Metadata" discusses the rationale for the Warwick Framework.

Moving the Dublin Core Forward

Scope and constraints of the Dublin Core
The Dublin Core metadata element set is a set of 13 metadata elements proposed by the first workshop as a core description record to facilitate discovery of document-like objects in a networked environment. To facilitate progress, constraints were imposed on the discussions at the workshop of March 1995:
• Only descriptive data elements required to support resource discovery were considered. The goal was to develop something very much like a bibliographic description for electronic resources, a simple resource discovery record that could be generated by authors of the data without extensive training. Data elements covering terms and conditions, archival status, and other types of metadata were not included.
• Discussion was restricted to elements required for the discovery of document-like objects (DLOs), largely understood by example (for example, an electronic version of a newspaper article is a DLO, while an unannotated collection of numerous 2"x2" slides is not).
• A widely understandable semantics was the goal; syntax was left unspecified to avoid becoming bogged down in the tar pits of implementation minutiae.
• Extensibility was judged a key characteristic. The Dublin Core was not intended to replace other well-known resource description sets, but rather to act as a simple record with elements of commonly understood meaning that could help unify other more complex description schemes. Thus, it was judged essential to develop means for extending the Dublin Core set of elements and for linking it to other, richer description models.
• Elements were defined to be optional, repeatable, and modifiable. Elements can be modified by qualifiers. For example, an element can include a specified schema to identify a controlled vocabulary or rule set governing the element.

The 1995 Dublin Metadata Workshop is described in greater detail in [Weibel et al. 1995a] and [Weibel 1995b]. The reference description of the element set can be found at: http://purl.org/metadata/dublin_core_elements

Table 1 The Dublin Core Elements

Element       Description
Subject       The topic addressed by the work
Title         The name of the object
Author        The person(s) primarily responsible for the intellectual content of the object
Publisher     The agent or agency responsible for making the object available in its current form
Other agent   The person(s), such as editors, transcribers, and illustrators who have made other significant intellectual contributions to the work
Date          The date of publication
Object type   The genre of the object, such as novel, poem, or dictionary
Form          The physical manifestation of the object, such as PostScript file or Windows executable file
Identifier    String or number used to uniquely identify the object
Relation      Relationship to other objects
Source        Objects, either print or electronic, from which this object is derived, if applicable
Language      Language of the intellectual content
Coverage      The spatial location and/or temporal duration characteristics of the object

Target uses for the Dublin Core
The development of the Dublin Core is motivated by several intended uses:
1. A simple interchange format for descriptive metadata
2. Content self-description for networked objects
3. Semantic interoperability across domains

It is clear from early implementation experience that projects have adopted the semantic flavor of the Dublin Core to develop simple resource description formats. The Dublin Core is intended to fill the niche between the terseness of the unstructured full-text Web indexes and the structured description of more complex models such as MARC. It is intended to be sufficiently rich to support useful fielded retrieval but simple enough not to require specialist expertise or extensive manual effort to create.

Simplicity is especially important in the context of author-generated metadata. Conferees at both the 1995 and 1996 workshops recognized the importance of embedded metadata in Web documents to be harvested by software robots. The key to success is balancing the need for well-structured metadata with the requirement that the creation of the description is manageable by authors.

Future applications will have to work with different types of metadata from different sources. The Dublin Core was positioned to provide a common set of tags that would have recognizable meaning across description models, and in that way provide a unifying semantics among many disciplines. The National Document and Information Service, a joint project of the National Library of Australia and the National Library of New Zealand (described later), is one example of such a use.

Early Dublin Core pilot projects
Even absent a clearly defined syntax, the Dublin Core element set attracted the interest of a number of early adopters who developed projects that built on the consensus that emerged from the Dublin Metadata Workshop. Some of these include:

Nordic Metadata Project
Among Nordic countries, there is a special need for a shared metadata system to facilitate further the already active use of Interlibrary Loan (ILL) and document delivery services within Scandinavia. The Dublin Core is one of several resource description models under consideration for this purpose.

A preliminary plan for the Nordic Metadata Project has been written by Juha Hakala from the Helsinki University Library. The NORDINFO management group accepted the plan in its meeting in spring 1996, and a full project plan will be written for the management group's September 1996 meeting [Hakala et al. 1996].

TURNIP Project
The Uniform Resource Name Interoperability Project (TURNIP), initiated by the Distributed Systems Technology Centre (DSTC) in Australia, has produced a URN Resolution service that uses the Dublin Core element set for URC metadata. The Dublin Core elements are used to describe DSTC Technical Reports and are supplemented with administrative metadata elements (e.g., URC-Type, Date-Creation, Owner). Three issues arising from this deployment of the Dublin Core are the need to group elements, a common syntax for the exchange of URCs, and standards for element qualifiers. More information on the TURNIP project can be found at: http://www.dstc.edu.au/RDU/TURNIP/

OCLC
The OCLC Office of Research is investigating several applications of the Dublin Core element set:
• A preliminary evaluation of the Dublin Core element set as a search interface into WorldCat (the Online Union Catalog). Similar experiments in a distributed environment are planned between OCLC map collection MARC records and the University of California at Santa Barbara's Project Alexandria database.
• The Spectrum project is exploring user interface issues associated with user-based resource description of electronic information based on the Dublin Core element set.
• The Scorpion project is an OCLC research project in automatic subject assignment based on the Dewey Decimal Classification system. A Spectrum-Scorpion dovetail provides both user-described and automatically generated classification metadata based on the Dublin Core element set. An extension of this project involves the automatic harvesting of Internet resources to explore the feasibility of using the Dublin Core element set as a descriptive framework for distributed indexing of networked information.

Information regarding these projects is available at: http://purl.oclc.org/metadata/dublin_core

The National Document and Information Service
• The NDIS project (National Document and Information Service) is a joint development of the National Library of Australia and National Library of New Zealand aimed at providing a sophisticated search service to Australian and overseas databases, collection management services, and state-of-the-art document delivery services. The first phase of the project will implement a search and document request service across an integrated information resource of MARC-based bibliographic data and a suite of indexing, directory, and thesauri databases in a variety of encoding formats. Further information about the project is available at: http://www.nla.gov.au/2/NDIS
• The NDIS project used the Dublin Core as a tool in determining generic metadata for bibliographic data, with extensions of the core element set or adoption of other metadata standards for nonbibliographic data. The creation of additional metadata can be viewed as extensions or separate core element sets.
• The Dublin Core serves as a useful model for the generic storage and access requirements in cross-database searching. Its concept of qualification offers a model for normalizing disparate data types and search precision at the individual database level via specific schema or types. The NDIS implementation uses many principles of the Dublin Core, such as extensibility and modifiability, but differs on optionality, as only those metadata elements that intersect across data types are core information resource elements. Metadata elements intersecting a grouping of data or item types are considered "common" metadata elements.

Mapping between the Dublin Core and MARC records
After the first OCLC/NCSA Metadata Workshop in March 1995, the Library of Congress drafted two discussion papers for review by the USMARC Advisory Group at its June 1995 meeting. DP86 was called "Mapping the Dublin Core Metadata Elements to USMARC" [Guenther 1995a]. The purpose was to publicize the Dublin Core, to encourage a standard mapping to USMARC, and to point out problems in mapping to the current format.

One of the biggest problems was that there was no valid place in MARC to put names from the "Author" or "OtherAgent" elements when you might not know the type of name or its relation to the object. Discussion Paper 88 [Guenther 1995b] and, subsequently, Proposal 96-2 [Library of Congress 1996] proposed a field for "generic author," which was added to USMARC in January 1996 as the Uncontrolled Name field, tag 720.

A full discussion of data interchange between MARC and Dublin Core appears in Cataloging and Classification Quarterly, "Metadata for Internet Resources: The Dublin Core Metadata Elements Set and Its Mapping to USMARC," by Priscilla Caplan and Rebecca Guenther [Caplan and Guenther 1996]. See Mapping the Dublin Core Metadata Elements to USMARC for further information.

Deployment of Dublin Core records in the Alexandria Project
The Alexandria Digital Library (ADL) project is one of six NSF/NASA/ARPA-funded Digital Library Initiatives. ADL focuses on online access to spatial data. Given that an estimated 90% of all spatial data is available only in hard-copy form, metadata is of prime importance. At the same time, ADL recognizes that a full cataloging record is not needed by most users. ADL translated Dublin Core fields into ADL fields, and added fields required specifically for spatial data and for hard-copy items. This combined set of fields is the default display set that general users see when they perform a search and display resulting metadata.

Other simple resource description models
Among the factors that motivated the Warwick Framework is the certainty that a variety of resource description models will emerge from different communities. A successful architecture of network resource description must accommodate such diversity.

Examples of other simple resource description models discussed at the workshop include the following RFC 1807 and IAFA templates:

RFC 1807 (Format for Bibliographic Records)
This RFC (available at: http://ds.internic.net/rfc/rfc1807.txt) defines a format for bibliographic records describing technical reports. This format is used by the Cornell University Dienst protocol and the Stanford University SIFT system.

RFC 1807 is a bibliographic record tailored to the needs of the Networked Computer Science Technical Report Library (NCSTRL) project (available at: http://www.ncstrl.org) and is targeted specifically to the description of computer science technical reports. As such it has many characteristics appropriate to a resource description record for document-like objects.

IAFA templates and ROADS (Resource Organisation and Discovery in Subject-based services)
ROADS (available at: http://ukoln.bath.ac.uk/roads/) is an eLib-funded project to implement software for resource organization and discovery in subject-based services. The aims are to develop a sharable resource discovery system and to fulfill the requirements of the eLib subject-based services. The intention is to involve information providers in resource description, which is essential to a sustainable service. There are two subject services currently in production (OMNI and SOSIG) using a prototype version of ROADS.

The choice of standards for ROADS was based on the criteria of simplicity and availability to allow for speedy startup of the subject services. To this end, a simple attribute-value record structure based on the IAFA/whois++ template definition was chosen. A later version of ROADS will be based on implementation of the Common Indexing Protocol (CIP) to allow for a distributed system of shared indexing. Initial experience of deployment of the IAFA/whois++ template has generated statistical information on the frequency of use of both bibliographic and administrative attributes. It is expected

Impediments to wider deployment
One of the major goals of the Warwick Workshop was to identify impediments to successful deployment of a simple Internet resource description format such as the Dublin Core. Early workshop discussions identified four areas requiring substantive progress:
• Specification of a transfer syntax
• Development of user guides
• Identification of extensibility mechanisms
• Specification of a framework to accommodate different varieties of metadata

Specification of a transfer syntax
Discussions of syntax are often difficult, burdened as they are with the biases of familiarity and competing methodologies. The earlier Dublin Workshop made progress partly because such discussions were ruled out of scope. However, consensus concerning semantics cannot be deployed without a concrete syntax (or syntaxes). In pilot implementations, the absence of a common model led to different syntax and structuring choices. Clearly, any widespread deployment of Dublin Core (or any similar description scheme) hinges on reaching consensus about a transfer syntax.

Since the Web is currently the primary medium of the Internet, it was further recognized that deployment of metadata in the Web is the primary strategic application; successful deployment of metadata in HTML is necessary, though almost certainly not sufficient.

A working group on syntax formed around this issue and elaborated a position paper describing a formal syntax for Dublin Core Metadata. A Syntax for Dublin Core Metadata [Burnard et al. 1996] includes:
1. A concrete syntax expressed as an SGML DTD
2. A mapping of this DTD into existing HTML tags using the meta element of HTML 2.0
3. A proposal for "keeping the metadata at arm's length" by allowing metadata consumers to recognize references to external metadata using the LINK element

In related developments, a convention for embedding metadata in HTML was proposed in a breakout group at the W3C Distributed Indexing and Searching Workshop, May 28-29, 1996. This breakout group included attendees of the Dublin Core/Warwick Framework Metadata meetings, representatives of several major Web search vendors (Lycos, Microsoft, WebCrawler), various other software vendors, and the W3 Consortium.
that this will provide useful feedback for further The problem is to identify a simple means of development of the whois++ template structure. embedding metadata within HTML documents without requiring additional tags or changes to browser software, and without compromising current practices for robot collection of data. Although metadata is intended for display in some situations, it is undesirable to display embedded metadata on browser screens as a side effect of </p><p>80 3. EXTERNAL AND COLLABORATIVE RESEARCH </p><p> displaying a document. Therefore, any solution requires Of perhaps greater importance is the need to link encoding information in attribute tags rather than as Dublin Core records to other, richer description schemes container element content. (for example, MARC). The ability to link a simple The goal was to agree on a simple convention for description record to a richer description model provides a encoding structured metadata information of a variety of way to promote one record type to a more complete types (which may or may not be registered with a central description as warranted, and also affords a more registry analogous to the MIME Type registry). It was continuous axis of resource description (from simple to judged that a registry may be a necessary feature of the complex) to suit a variety of user or system needs. metadata infrastructure as alternative schema are Additionally, Dublin Core data addresses only one elaborated, but that deployment in the short-term could go type of metadata (resource description for search and forward without such a registry, especially in light of the retrieval). Other types are necessary: terms and conditions proposed use of the LINK tag to link descriptions to a (who must pay what to whom, for example), archival standard schema description as described later. status, administrative metadata, and others. 
The solution agreed upon is to encode schema Finally, competing models of resource description elements in META tags, one element per META tag, and as overlap the Dublin Core to one degree or another. RFC many META tags as are necessary. Grouping of schema 1807 and IAEA templates discussed previously are elements is achieved by a prefix schema identifier examples of such formats. Workshop discussions on associated with each schema element. extensibility merged with this recognition of the need to A convention for linking resource description tags to accommodate different description models. No single the reference definition of the metadata schema (or format for resource description will fill all the needs, nor schemata) used in a document was also proposed. Doing could such a monolithic model be maintained easily. The so serves as a primitive registration mechanism for consensus of the workshop converged on a need for an metadata schemata, and lays the foundation for a more architecture that would accommodate the diversity of formal, machine-readable linkage mechanism in the future models and levels of description that characterize the [Weibel 1996]. heterogeneous world of electronic resources. The proposed conventions are described more fully in The proposal that emerged from these discussions is http://www.oclc.org:5046/~weibel/html-meta.html known as the Warwick Framework, discussed in detail in a Development of user guides companion article in D-Lib Magazine [Lagoze et al. 1996]. Resource descriptions might be created by several different It is an architecture for the aggregation and interchange of agents in the metadata chain: authors, collection discrete metadata packages. Such an architecture will administrators, and third-party catalogers. Guidelines for afford the opportunity to mix and match metadata sets, creating metadata are needed. 
A guide for authors would allowing rational deployment of many existing and be especially useful in supporting a move to document- emergent description models. embedded descriptions, and at least one producer of HTML authoring tools (SoftQuad, Ltd.) has committed to An Architecture for IVIetadata; embedding Dublin Core resource description templates in The Warwick Framework their products when the syntax and guidelines are sufficiently stable. Need for the Warwick Framework Extensibility—mixing and matching metadata No single element set will satisfy all metadata The Dublin Core addresses one niche of the metadata requirements. Different communities of users or different ecology. It is a simple resource description format that is application areas will require data of different elements intended to be extensible in at least two ways. As its name and levels of complexity. The Workshop took as its implies, it is intended to provide a commonly starting point the Dublin Core, a simple scheme for what understandable core of elements that will help unify might be thought of as electronic bibliography. However, different models of resource description. Its simplicity is other application areas might require the fullness and among its major virtues, but users may well wish to structure provided by a MARC-type record, for example, or augment description of their resources with additional data. might have domain-specific descriptive requirements not Original concepts of extensibility for the Dublin Core addressed in the Dublin Core. At the same time, other assumed a mechanism for local extensions—additional types of data exist that were outside the scope of the elements added at the discretion of authors or collection Dublin Core: terms and conditions and evaluative data, for maintainers. Such local information may be critical to the example. 
effective use of a particular collection, though the local Satisfying the need for competing, overlapping, and character of such elements may not be of general interest complementary metadata models requires an architecture or usefulness. that will accommodate a wide variety of separately maintained metadata models. It was concluded that an </p><p>81 3. EXTERNAL AND COLLABORATIVE RESEARCH architecture for the interchange of metadata packages was Requirements for Implementation required. A package is conceived as a metadata object specialized for a particular purpose. A Dublin Core-based Concrete implementations record might be one package, a MARC record another, The architecture needs to be realized in one or more terms and conditions another, and so on. Such discrete concrete implementations. Proposals for MIME- and SGML- packages might be numerous and varied in content and based implementations have been prepared, as well as a even source. Users or software agents would need the discussion of the architecture in a distributed object ability to aggregate these discreet metadata packages in a environment. conceptual container (a metadata basket of sorts), hence Registration the notion of a container-package architecture. A registry agency for metadata object types needs to be This architecture should be modular, to allow for established. Early implementation pilot projects should not differently typed metadata objects; extensible, to allow for be hampered by the lack of such an agency, but as more new metadata types; distributed, to allow external metadata sets are elaborated by various stakeholders, a metadata objects to be referenced; and recursive, to allow formal means for managing changes will be important. metadata objects to be treated as "information content" and have metadata objects associated with them. Packages are typed objects. 
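As a rough illustration of the ideas above (typed packages, references to external metadata objects, containers that may hold other containers, and a registry of metadata object types), the following Python sketch models a small metadata basket. It also renders a Dublin Core package using the one-META-tag-per-element convention with a schema prefix. All class, function, and type names, the example URL, and the exact NAME="prefix.element" spelling are assumptions made for illustration; they are not taken from the Warwick Framework documents or from the proposed HTML convention.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Dict, Iterator, List, Union


@dataclass
class Package:
    """A primitive package: typed metadata carried inline."""
    type_name: str                 # illustrative label, e.g. "dublin-core"
    elements: Dict[str, str]


@dataclass
class IndirectPackage:
    """A typed reference to a metadata object held elsewhere."""
    type_name: str
    location: str                  # URL of the external object (hypothetical)


@dataclass
class Container:
    """An aggregate of packages; members may themselves be containers."""
    members: List[Union[Package, IndirectPackage, "Container"]] = field(
        default_factory=list
    )


# Hypothetical stand-in for a registry of known metadata object types.
REGISTERED_TYPES = {"dublin-core", "marc", "terms-and-conditions"}


def walk(container: Container) -> Iterator[Union[Package, IndirectPackage]]:
    """Yield every package, descending into nested containers."""
    for member in container.members:
        if isinstance(member, Container):
            yield from walk(member)
        else:
            yield member


def unregistered_types(container: Container) -> List[str]:
    """List the type names that are absent from the registry."""
    return sorted({p.type_name for p in walk(container)} - REGISTERED_TYPES)


def to_meta_tags(package: Package, prefix: str) -> List[str]:
    """Render one META tag per element, grouped by a schema prefix."""
    return [
        f'<META NAME="{escape(prefix)}.{escape(name)}" CONTENT="{escape(value)}">'
        for name, value in package.elements.items()
    ]


dc = Package("dublin-core", {"title": "Annual Review", "author": "OCLC"})
basket = Container([
    dc,
    IndirectPackage("marc", "http://example.org/record.mrc"),
    Container([Package("evaluative-data", {"score": "5"})]),
])
```

In this sketch, `walk(basket)` visits the Dublin Core package, the indirect MARC reference, and the nested evaluative-data package in turn, while `unregistered_types(basket)` reports only `evaluative-data`. A client could use such a check to skip packages it cannot interpret rather than fail on them.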
Packages may be primitive (a package is one of a number of separately defined, primitive metadata formats); indirect (a package may be a reference to an external object); or a container (a container is a collection of metadata objects, which may in turn be packages or other containers).

Several benefits flow from a container-package approach:

• It provides a framework in which metadata objects can be aggregated and exchanged in a consistent way.
• It avoids the need to reinvent wheels or do redundant design work. The modular approach means that packages can be specialized for their particular function and that existing formats and best practice can be accommodated readily.
• The particular aggregation of metadata objects can be optimized for particular content types. It can also be optimized for particular user groups: the user as client or agent, the user as end-user, the user as customer, and so on.
• The architecture is extensible and can accommodate unanticipated requirements. It allows metadata objects to be treated as information objects with associated metadata, allowing, for example, terms and conditions to be applied to some or all of the metadata packages.

The Warwick Framework is a high-level container architecture; it makes no assumptions about the contents of the packages. Nor can it be assumed that clients (or agents) will be able to interpret all packages. Conferees agreed that packages should be strongly typed and that a registry for metadata types will probably be required, perhaps along the same lines as the IANA registry for Internet Media Types (also known as MIME types).

Moving Forward

The Warwick Framework was enthusiastically welcomed at the workshop as a practical approach to the effective integration of metadata into a global information infrastructure. The realization of such an architecture will require great effort on many fronts, in many communities. The great hope is that the consensus achieved at this meeting will have provided the foundation for coordination and sufficient freedom in the proposed architecture to allow progress without an undue burden of close coordination.

Conclusions and Directions

Conferees left Warwick convinced that progress had been made in important areas. This conviction is corroborated by the rapid appearance of a number of documents supporting key decisions and recommendations.

The consensus concerning embedding metadata in HTML reached at the W3C workshop on Distributed Indexing and Searching provides an encouraging impetus to rapid deployment of richer resource description techniques on the Web along the lines developed in the Warwick Workshop. The recent appearance of a Dublin Core implementation based on these developments (available at: http://archaeology.ahds.ac.uk/project/metadata/dublin.html) is a promising indicator of the need and demand for better resource description on the Internet, and the speed with which such ideas can be promulgated when community consensus emerges.

It is hoped that the Warwick Workshop has galvanized such a consensus and provided an important signpost for the development of more effective networked resource description.

</p><p>

Acknowledgments

The authors are indebted to many organizations and individuals that paved the way for this work and contributed substantively to the success achieved.

• Hazel Gott, whose able organizational skills provided a superb working environment conducive to our task, and whose amiable hospitality made us all feel more at home.
• UKOLN and OCLC, for providing staff time for organizational and travel support for many attendees.
• JISC (the Joint Information Systems Committee of the Higher Education Funding Councils in the U.K.) for their support of UKOLN's MODELS project, through which U.K. conferee attendance was supported.
• The Corporation for National Research Initiatives (CNRI) and European Research Consortium for Informatics and Mathematics (ERCIM), whose contributions of staff time and effort were key factors in bringing together the ideas and people that made the workshop a success.
• Finally, and most importantly, the attendees of this workshop, whose good faith and commitment to progress during and after the workshop are the bedrock on which this effort is founded.

Note: This abbreviated report is used with permission of D-Lib Magazine, July/August 1996.

References

Burnard, Lou, Eric Miller, Liam Quinn, and C. M. Sperberg-McQueen. 1996. A Syntax for Dublin Core Metadata. Available at: http://purl.oclc.org/net/eric/DC/syntax/metadata.syntax.html
Caplan, Priscilla, and Rebecca Guenther. 1996. "Metadata for Internet Resources: The Dublin Core Metadata Elements Set and Its Mapping to USMARC." Cataloging and Classification Quarterly, In Press.
Guenther, Rebecca. 1995a. Discussion Paper No. 86: Mapping the Dublin Core Metadata Elements to USMARC. Submitted to the USMARC Advisory Group, June 1995. Available at: gopher://marvel.loc.gov:70/00/.listarch/usmarc/dp86.doc
Guenther, Rebecca. 1995b. Discussion Paper No. 88: Defining a Generic Author Field in USMARC. Submitted to the USMARC Advisory Group, May 1995. Available at: gopher://marvel.loc.gov:70/00/.listarch/usmarc/dp88.doc
Hakala, Juha, Ole Husby, and Traugott Koch. 1996. Warwick Framework and Dublin Core Set Provide a Comprehensive Infrastructure for Network Resource Description. Available at: http://www.bibsys.no/warwick.html
Knight, Jon and Martin Hamilton. 1996. A MIME Implementation for the Warwick Framework. Available at: http://weeble.lut.ac.uk/MIME-WF.html
Lagoze, Carl, Clifford A. Lynch, and Ron Daniel. 1996. "The Warwick Framework: A Container Architecture for Aggregating Metadata Objects. An Overview of the Warwick Framework Architecture." D-Lib Magazine, July 1996. Available at: http://www.dlib.org/dlib/july96/lagoze/07lagoze.html
Lasher, Rebecca and D. Cohen. June 1995.
RFC 1807: A Format for Bibliographic Records. Available at: http://ds.internic.net/rfc/rfc1807.txt
Miller, Paul. 1996. Archaeology Data Service. Graphics & GIS Advisor, University Computing Service, University of Newcastle, Claremont Road, Newcastle upon Tyne NE1 7RU. Available at: http://www.ncl.ac.uk/~napml/ads/metadata
Proposal No. 96-2: Define a Generic Author Field in the Bibliographic, Authority, Classification, and Community Information Formats. 1996. Library of Congress, Network Development and MARC Standards Office. Washington, Library of Congress.
Weibel, Stuart. 1996. A Proposed Convention for Embedding Metadata in HTML. A working group report from the W3C Distributed Indexing Workshop, May 28-29, 1996. Available at: http://www.oclc.org:5046/~weibel/html-meta.html
Weibel, Stuart. 1995a. OCLC/NCSA Metadata: Workshop Report: The Essential Elements of Network Object Description. Available at: http://purl.oclc.org/oclc/rsch/metadata
Weibel, Stuart. 1995b. Metadata: The Foundation of Resource Description, D-Lib Magazine. Available at: http://www.dlib.org/dlib/July95/07weibel.html
</p><p>4 LIBRARY AND INFORMATION SCIENCE RESEARCH GRANT PROGRAM </p><p>The OCLC Office of Research Library and Information Science Research Grant program has been a fruitful means for OCLC to support high-quality research by faculty in Library and Information Science schools for many years. Every year, several project grants are awarded to researchers through a two-tier, competitive evaluation process. In return for the grants, the researchers publish their nonproprietary results in the library literature. Recipients of the 1996 awards are:

• Myke Gluck, Assistant Professor, Florida State University
  A Descriptive Study of the Usability of Geospatial Metadata
• Gregory H. Leazer, Assistant Professor, University of California at Los Angeles
  A Demonstration System for the Explicit Control of Bibliographic Works and Relationships
• Charles R.
McClure, Professor, Syracuse University
  Quality Criteria for Evaluating Information Resources and Services Available from Federal Websites Based on User Feedback

When announcing these awards Director Terry Noreault said, "While we always speak of recipients as 'winning' an award, in truth it is we who are members of the library community who win, because the results of these projects by front-line researchers serve well to further the ends of our community. For that reason, OCLC is happy to be able to aid in such a win-win process."

The following reports prepared by last year's winners certainly support the contention that the library community stands to win much from their efforts.

</p><p>

ANALYZING THE VIABILITY OF USING PEER GROUP HOLDINGS AS AN EVALUATION TOOL FOR PUBLIC LIBRARY ADULT FICTION

James H. Sweetland, Associate Professor, and Judith J. Senkevitch, Associate Professor, School of Library and Information Science, University of Wisconsin-Milwaukee

Abstract

This project examined the extent of change over a one-year period in the adult fiction titles most held by OCLC member public libraries. The purpose of the study was: to contribute to the understanding of the nature of adult fiction collections in public libraries; to provide further insight into the nature of a "classic" work of fiction; and to clarify the potential utility of an OCLC-generated list of adult fiction classics for public library collection evaluation.

Background

While providing popular fiction to adult readers remains a primary role for most small and medium-sized public libraries, there has been little research on how to improve this important aspect of library service. Similarly, while librarians agree on the need to evaluate and weed fiction collections, they have received little information on how to do it. One aspect of the problem identified in earlier research [Senkevitch and Sweetland 1994] is the lack of reliable ways to identify fiction "classics" when selecting and weeding. The need to recognize classic works is particularly acute for smaller libraries with very limited resources.

Using WorldCat (the OCLC Online Union Catalog) of over 35 million records, this project compared a list of the approximately 400 adult fiction titles most widely held by OCLC member public libraries in 1995 with a similar listing prepared in 1994 to examine the relative stability of the listing. The study builds on findings from the authors' previous research funded by OCLC in 1993-1994 [Sweetland and Senkevitch 1995; Senkevitch and Sweetland 1996]. It is part of a larger research effort to increase understanding of the nature and management of adult fiction collections in public libraries and to develop practical models to help librarians evaluate their fiction holdings and improve service to adult patrons.

Research Questions

The principal research questions addressed by the study are:

1. Does the emphasis on public libraries providing current popular reading appear to lead to relatively rapid changes in the titles held by those libraries?
2. Is it feasible to consider using a core listing of widely held titles generated from WorldCat to assist public librarians in making decisions about the management of adult fiction holdings?

Hypotheses tested include:

Hypothesis 1. The listing of adult fiction titles most widely held by OCLC member public libraries will show significant change within one year, with many titles dropping from the list and new titles taking their places.

Hypothesis 2. Titles with more recent publication dates will show a larger increase in number of holding libraries than titles with older publication dates.

Hypothesis 3. Titles with older publication dates will be more likely to drop from the list than those with recent dates.

Hypothesis 4. Titles added to the list from one year to the next will be those with recent publication dates.

Hypothesis 5. Titles with more recent publication dates will show a greater increase in the number of different OCLC records (as a result of new editions and different formats) than titles with older publication dates.

Methodology

Using WorldCat, the authors worked with the OCLC Office of Research staff to generate a list of the approximately 400 adult fiction titles most widely held by OCLC member public libraries. Following the parameters used in the 1994 analysis, the study isolated the records of public libraries from those of other libraries; eliminated non-book materials, serials, and government publications by use of MARC document type fields; and identified fiction by use of the fixed field code. Multiple records for various editions of the same title were then merged under a single title, and duplicate library holding codes were eliminated.

Experience in the previous project suggested that because of varying editions and cataloging practices, juvenile material cannot effectively be removed during initial list creation. Therefore, juvenile titles were identified by checking for "juvenile" coding in Books in Print Plus on CD-ROM. Titles flagged as juvenile were eliminated from further consideration.

This list, produced in August 1995, was compared with the previous one produced in August 1994 to examine the volatility of the core list of adult fiction as it applies to OCLC member libraries and to test the proposed hypotheses. Appropriate statistical tests were made using SPSS to analyze correlations and determine the validity of hypotheses.

</p><p>

Results

The final 1995 list of adult fiction most held by OCLC public libraries resulted in 409 titles, with 13 titles dropping from the earlier 1994 list and 16 additional titles appearing. The vast majority of titles are relatively recent, with publication dates since 1980. Further details of the results include:

1. Hypothesis 1, the listing of adult fiction titles would significantly change over one year, must be rejected. Only 13 titles from the 1994 list did not appear on the 1995 list, while only 16 new titles (not on the 1994 list) appeared on the 1995 list. This number represents a change of approximately 3.2% over one year. At this rate of change it would, in theory, take at least 30 years for the entire list to be replaced.

2. Of the 393 titles that remained on the list from 1994 to 1995, those with more recent publication dates did show a slightly larger increase in number of holding libraries than titles with older publication dates. The mean number of libraries holding a given title increased (by 39) from 915 in 1994 to 954 in 1995; the median increased (by 41) from 904 to 945. Depending on how "older" and "newer" are defined, analysis using t-tests and correlations suggests that there is a slight tendency for newer titles to show a greater increase in holdings (approximately three more per title) than older titles.

When titles published before 1968 are eliminated as outliers, there is a weak Pearson correlation of .1388 (p < .006) between increase in holdings and date of publication. However, this increase is small when compared with the overall growth in the number of public libraries using OCLC. The number of potential holding symbols for public libraries increased from approximately 4,000 to 4,700 in the same period (based on data from Patrick McClain, OCLC Office of Research, E-mail message October 5, 1995).

3. Titles with older publication dates are no more likely to drop from the list than those with recent dates. Of the 13 titles dropped from the 1994 list, the oldest was published in 1981 and the newest in 1990, for mean and median dates of ca. 1986. Titles added had a mean publication date of ca. 1983. In comparison, publication dates of the 393 titles that remained on the list ranged from 1886 to 1991, with a mean publication date of ca. 1985 and a median date of 1986. T-tests show no significant difference in dates between those titles that dropped from and those that did not drop from the list, nor any significant difference between those dropped and those added. However, caution must be exercised because so few titles are involved in this analysis.

4. Titles added to the list from one year to the next tended to be those with more recent publication dates; however, the difference was not statistically significant. Only one of the 16 titles added to the list was published before 1980, F. Scott Fitzgerald's The Great Gatsby (1925). Including that title, the mean date of titles added was ca. 1983, the median ca. 1987. On the other hand, only one of the new titles had a 1991 publication date, and one a 1992 date; there were no new titles more recent than 1992.

5. The study also examined whether titles with more recent publication dates would show a greater increase in the number of different OCLC records (as a result of new editions and different formats) than titles with older publication dates. This was not the case. The 393 titles in both lists showed a mean number of OCLC records (manifestations) in 1994 of 8.7, with a median of 8. In 1995, the typical title had a mean of 8.99 manifestations and a median of 8. However, 304 (77.4%) of the continuing titles showed no change in the number of manifestations; 64 titles showed an increase of only one manifestation. Also, among these titles with increased manifestations are two of the oldest titles (1886 and 1936), as well as one of the newest (1991). For the 83 titles that showed an increase in manifestations, the mean publication date was ca. 1984 (median 1986), showing no substantial difference from the titles as a whole.

Conclusions and Recommendations for Further Research

The authors' previous research tested the predictive validity of commonly used lists, including both quality lists, such as ALA Notable Books, and quantity lists, such as the Publishers Weekly bestseller lists, as predictors of the titles held by the greatest number of public libraries [Sweetland and Senkevitch 1995; Senkevitch and Sweetland 1996]. Those results show that, with the exception of the Wilson Fiction Catalog, the lists most commonly used do not represent a consensus core collection for public libraries in the United States. Consequently, their value as quality estimators for the purpose of collection evaluation is, at best, weak.

However, the findings of the present study suggest that the OCLC list of most held adult titles is relatively stable over time; that newly published titles do not automatically get purchased in large quantities; and that WorldCat could in fact be used as a consensus list of adult fiction suitable for public libraries. However, while the 1995 WorldCat contained, in theory, the holdings of approximately 4,700 public libraries (as defined by holding codes), it is notable that the most held work was owned by only 1,090 libraries, and that the five titles ranking 407th were held by only 901 libraries. When a list of the "top 400" contains titles owned by only 19% of public libraries in the system, one may question how meaningful the "consensus" is.

At the same time, descriptions of and research about the use of collective lists of library holdings suggest that the analysis of peer group collections to create a baseline for evaluating one's own collection may have value. While there is little literature regarding overlap among public library collections, analyses of academic library collections suggest that there is a small core of widely owned (nonfiction) titles. Thus, titles held by approximately one-fifth of all public libraries may well represent sufficient consensus to serve as a core list of public library adult fiction.

The issue of what constitutes a fiction "classic" requires further examination. The removal of those titles classified as juvenile from the list of most held adult fiction effectively eliminated most older works frequently regarded as "classics." The authors intend to examine these titles in the near future in hopes of explaining the phenomenon that appears to define "classics" as, in effect, older books written for adults that have become suitable for youthful reading.

In addition, given the surprising stability of the adult fiction list over one year and the lack of sufficiently large numbers of new titles for rigorous analysis, it would be useful to retain the programming used to generate these two lists in order to extend the analysis over a longer period, such as five or ten years. This would provide additional insights into the nature of public library adult fiction collections and into the longer term value of a list of most held works as a public library collection evaluation tool.

References

Fiction Catalog, 12th edition, 1991 (and annual supplements through 1995). Edited by Juliette Yaakov and John Greenfieldt. New York: H.W. Wilson Co.
Senkevitch, Judith J., and James H. Sweetland. 1996. "Evaluating Public Library Adult Fiction: Can We Define a Core Collection?" RQ 36(1): 103-17.
Senkevitch, Judith J., and James H. Sweetland. 1994. "Evaluating Adult Fiction in the Smaller Public Library." RQ 34(2): 78-89.
Sweetland, James H., and Judith J. Senkevitch. 1995. "Evaluating Public Library Fiction Collections: Is There a Core List of Classics?" Annual Review of OCLC Research 1994. Dublin OH: OCLC Online Computer Library Center, Inc.: 59-61.

</p><p>

AN EXPERIMENTAL STUDY ON GRAPHICAL TABLES OF CONTENTS

Xia Lin, Assistant Professor, School of Library and Information Science, University of Kentucky

Abstract

An experiment was conducted to compare a traditional table of contents (TOC) with a new graphical table of contents (GTOC). A GTOC is constructed as a graphical interface with its contents generated automatically from underlying documents. It is an associative representation of document contents. In this experiment, three versions of a GTOC for the same document collection were generated, each based on a different type of indexing. A total of 54 subjects participated in the experiment. The subjects performed the same retrieval tasks, first with one of the GTOCs, then with a conventional TOC. Analysis of subjects' performances and interactive patterns on different types of tables of contents shows that, while the subjects completed the tasks well with both the TOC and the GTOC, their browsing behaviors were different. Results also indicate that different types of indexing will affect the functions of a GTOC.

Introduction

The table of contents is a major component of documents. Its main functions are to (1) provide an overview for contents of a book or a collection of documents, (2) organize documents into semantic groups based on their contents, and (3) provide direct access to individual documents [Juhasz et al. 1973; Prabha et al. 1988]. These functions are essential to digital documents, yet little research has been done on how these functions should be supported and what formats of the table of contents are needed in the digital environment. In the printed environment, the table of contents is a product of linear documentation that organizes documents in one linear order (such as the order of physical pages). This linear organization is simple and easy to use. It is a much different environment in the digital world, where many documents will be nonlinear, dynamic, and lacking permanent order. The table of contents in this kind of environment may need to be different from the table of contents that we know.

</p><p>

This research investigates a graphical table of contents (GTOC) designed for the digital environment. The proposed GTOC is to be a graphical interface with its contents automatically generated from underlying documents. It self-organizes the documents into clusters or groups and labels the document groups with words selected from the documents or their titles. It also provides direct access to individual documents from keywords or titles. A prototype was built for such a GTOC. Issues related to GTOC design were discussed by Lin [1996]. This report focuses on the experiment that compared a GTOC with a printed table of contents (TOC).

[Figure: screen display of the GTOC prototype]

Research Questions

The experiment was designed to explore potential benefits of a GTOC as an information visualization and access tool.
Four research questions were explored in this study: • How does the proposed GTOC compare with the Fig. 1 Screen Display of GTOC-2 printed table of contents for information access? • Do different indexing procedures significantly change functionality of the proposed GTOC? Experimental Method • How does the user interact with a visual display for Fifty-four students were recruited as subjects for the hour- information access? long experiment. Subjects first spent about 10 minutes • What kind of table of contents may be needed in the learning about the experimental tasks and practicing with electronic environment? GTOC for two sample questions. The actual experiment began when subjects clicked a Start button. They then Experimental Design completed a three-stage experiment. The experimental system contained a "book" of 143 Stage 1. Subjects searched for 12 randomly selected articles that appeared in SIGIR (ASIS Special Interest titles on GTOC, one at a time. For each given title, they Group on Information Retrieval) proceedings from 1990 to needed to browse, explore, and click the map display 1993. Three versions of the GTOC were generated for until they found the title in the pop-up window. these documents, each based on a different type of Stage 2. Subjects searched for the page numbers of indexing. The first was based on title words only (GTOC-1), the same 12 titles in the TOC. The page numbers could using a binary version of the vector space model [Salton only be found by browsing through the given printed 1989]. The second was also based on title words, using the tables of contents. weighted vector space model where weights were Stage 3. Subjects searched for titles in the GTOC to computed by document frequencies and inverse document match two given abstracts. After reading each abstract, frequencies of all the words in a document's title, subjects first clicked on related keywords on the map keywords, and abstract (GTOC-2). 
The third was based on display to view related titles, and then clicked on the titles full-text indexing including all words in the documents' that they thought might be a good match for the abstracts. titles, keywords, and abstracts, using the weighted vector Each time a title was clicked, a full display of the space model (GTOC-3). document appeared. Only with the full display could the Figure 1 shows a screen of GTOC-2. Subjects can click subjects judge if they had found the original document. on any location or word on the display to pop up a The experiment was conducted in four sections in a window displaying the top ten titles most closely computer lab where all the computers had 17-inch associated with the word or the location. Subjects can also monitors. The success rate and the time each subject spent use the slider to adjust the number of words shown on on the tasks were automatically recorded by the the display. To find a title in the GTOC, subjects need to experimental system. When subjects interacted with browse keywords on the display and find a conceptual GTOC, their exact clicking locations and their sequence of location that associates with the target title. interactive actions were also recorded by the system. Subjects also completed a brief questionnaire before the experiment and two exit questions after the experiment. </p><p>89 4. LI BRARY AND INFORMATION SCIENCE RESEARCH GRANT PROGRAM </p><p>Results the given titles. They made a judgment on which category was the best match, and what other categories they should Title searching check if they did not find the title in the first category they With regard to title searches, the printed TOC had the selected. They processed a lot of semantic information. highest success rate of 94%, while the GTOCs had success Data shows that subjects' first clicking location consistently rates of 88%, 84%, and 77%, respectively. 
The difference reflected their decisions, but their subsequent clicking between the success rates for the printed TOC and for locations were much more diverse. While some subjects GTOC-1 or GTOC-2 is not statistically significant (p=0.09 could follow associative patterns of the visual display to and p= 0.07, respectively), but the difference between the continue their browsing, others were lost, without a clear success rates for TOC and GTOC-3 is significant (p=0.02). direction for their next move. This indicates that the subjects could use GTOC-1 and GTOC-2 for the task about as successfijlly as they used Conclusions TOC, while GTOC-3 was a little difficult. On the other The proposed GTOC is not so easy to use as a traditional hand, in their exit comments, most subjects stated that TOC because its use requires processing of semantic GTOCs were easier to use than the TOC. Subjects perceived information. However, because of its visual display, its that they did better with the GTOC than with the TOC. associative organization, and its interactive functions, a Time spent searching GTOC can be used as an alternative to perform tasks that Results on the time subjects spent on each search also people usually do with the table of contents. It might even revealed the same trend. On average, each title search be preferred by users as indicated by the subjects of this took about 63-5 seconds using the TOC. Title searches experiment. Because TOCs and GTOCs are organized required about 83.5, 83.5, and 93.2 seconds on GTOCs, differently, the way that users utilize them will also be respectively. The differences are significant. However, very different. In designing a GTOC, it is important to when successful searches only are compared, the average select indexing and mapping techniques that will make time spent on title searches with the TOC and GTOCs are the relationships transparent among visible keywords as 60.3, 63.9, 65.3, and 64.7 seconds. 
These differences are well as between the keywords and underlying documents. not significant. Therefore, the difference is in those failed Different types of indexing might also be needed for searches. When a search failed using the TOC, the subjects different types of search tasks as shown in the experiment. could simply go back to the beginning and try again. When a search failed on the GTOCs, the subjects had Future Research much more difficulty deciding what else to try, and thus Future research is needed to study what other functions a spent much more time on the searches. GTOC should provide to help users understand keyword Indexing methods categories and their relationships. GTOCs should also be Comparing results for the three types of GTOCs revealed tested in a large or networked information environment the impact of different indexing procedures. The map with an improved interface. Currently, GTOC has been display mainly based on title indexing shows its advantage implemented as a Java applet applying to web pages. It on title searches, while the map display based on can be accessed and tested at: title/abstract indexing shows its advantage for abstract http://lislin. gws .uky. edu/Sitemap/Sitemap.html. searches. For abstract searches, GTOC-2 had a much higher success rate (6l%) than GTOC-1 and GTOC-3, References which had success rates of 21% and 41%, respectively. Juhasz, Stephen et al. 1973. TOC: Table of Contents Practices of Since GTOC-1 and GTOC-2 use exactly the same Primary Journals—Recommendations for Monolingual, keywords on the displays, this result also indicates that Multilingual, and International Journals. (ERIC ED075042). map displays represent underlying information well. Lin, Xia. 1996. "The Graphical Table of Contents." Proceedings of User instructions the First ACM International Conference on Digital Libraries Data on users' interactions was analyzed to identify their (Digital Library'96, March 20-23, 1996. 
Bethesda, Maryland), browsing patterns. Significantly different browsing 45-53. behaviors were found. When browsing using the TOC, Prabha, Chandra G., Duane Rice, and David Cameron. 1988. subjects typically looked for the first two or three words of Nonfiction Book Use by Academic Library Users. Dublin, OH: the given titles from those tables of contents, page by OCLC Online Computer Library Center, Inc. page. It made little difference whether these words were Salton, Gerard. 1989- Automatic Text Processing: The familiar to them, or to what categories these words Transformation, Analysis, and Retrieval of Information by belonged. Therefore, they processed only syntactic Computer. Reading, MA: Addison-Wesley. information and conducted linear browsing with the TOC. When browsing on GTOCs, subjects visually scanned though keywords on the map display to find a match for </p><p>90 4. LIBRARY AND INFORMATION SCIENCE RESEARCH GRANT PROGRAM </p><p>THE IMPACT OF ELECTRONIC peer-reviewed electronic journals in science and technology, and there are many more in the <a href="/tags/Social_science/" rel="tag">social science</a> and JOURNALS ON SCHOLARLY humanities. These constitute the first wave of what are likely to be many more scholarly electronic journals to come. COMMUNICATION Electronic journals cannot play an important role in A REFERENCE AND CITATION STUDY scholarly communication unless they affect scholarship and research. There has been much discussion and speculation in the literature about perceived advantages Stephen P. Harter, Professor, School of Library and and disadvantages of electronic journals, and about Information Science, Indiana University problems and issues related to their development and use. Some observers have argued vigorously that electronic Abstract journals will transform the scholarly communication The goal of this research was to generate hard data about system. 
While such speculation is necessary and the impact of electronic journals on the scholarly important, the present study was aimed at gathering hard communities they serve. The main question addressed data regarding the actual use and impact of electronic was: To what extent are scholars and researchers aware journals by examining the artifacts of scholarly of, influenced by, using, or building their own work on communication—^the journal article and its references. As a research published in electronic journals? Demographic bibliometric method, citation analysis is a well-known features and access problems associated with electronic technique that has a long history in studies of scholarly journals were also studied. A sample of electronic journals communication [see, for example, Borgman 1990]. As was drawn and examined using techniques of citation artifacts of scholarly communication, citations can reveal analysis. The data shows that thus far the impact of formal communication patterns and scholarly impact. The electronic journals on scholarly communication has been major advantages of citation analytic methods include high minimal. reliability and relative unobtrusiveness. </p><p>Overview Goals and Objectives The scholarly journal has served as the primary medium The primary goal of this project was to gather hard data of scholarly communication among scientists and scholars about the impact of electronic journals on the scholarly for more than three centuries, and has remained communities they are serving. To what extent are scholars essentially unchanged in form and function over its and researchers aware of, influenced by, using, or building lifetime. As we know it, science is scarcely imaginable their own work on research published in electronic without the scholarly journal. However, despite its benefits journals? 
to science and scholarship, the paper journal system has The specific objectives of the study were: been subject to criticism, including perceived problems • Assess the accuracy and completeness of references to with the <a href="/tags/Peer_review/" rel="tag">peer review</a> process (that it suppresses new electronic journals. ideas, favors authors from prestigious institutions, and • Identify the extent to which scholars publishing in causes undue delays in the publication process), high both print and electronic journals cite electronic costs that are growing much faster than the rate of journals and other electronic publications. inflation, and lack of selectivity. Spiraling costs and long • Identify fields in which researchers actively use publication delays are perhaps the most serious of these scholarly electronic journals. criticisms. • Identify highly cited electronic journals and articles While the costs of producing paper journals have appearing in them. increased sharply, developments in computer and • Record and analyze interesting demographic communications technology have accelerated, characteristics and access problems of electronic accompanied by sharp drops in costs. These have led to, journals. among other developments, the dramatic explosion of the Internet, especially the World Wide Web. Technology Populations and Samples increasingly offers the possibility of using computers and There were three parts to this study. In the first phase of communication networks to create alternative electronic the research, the Access and Demographic Study, a sample forms of the conventional paper journal. 
of 131 electronic journals was drawn from two print The first peer-reviewed, full-text electronic journal directories of electronic journals—one published by the including graphics was Online Journal of Current Clinical Association of Research Libraries [Okerson 1995] and the Trials (OJCCT) [Keyhani 1993]- In a recently published list, other by MecklerMedia [MecklerMedia 1994]. Hitchcock, Carr, & Hall (1996) identified 115 scholarly. </p><p>91 4. LIBRARY AND INFORMATION SCIENCE RESEARCH GRANT PROGRAM </p><p>In the second phase of the research, the Reference descriptions of the methods and findings from these Study, the sample was refined by imposing some studies were reported [Harter & Kim 1996a, 1996b]. additional criteria. To be included in the Reference Study, an article must have reported the results of research or Citation Study scholarship and must have cited at least one reference. Citations are intimately connected with the reward system Seventy-seven electronic journals met these criteria. The of science, a subject about which Robert Merton and many sample of articles for the Reference Study was defined as others have written [Merton 1968]. The normative theory the last four available scholarly, peer-reviewed articles of citation holds that through their citations, scientists and from each of the 77 scholarly electronic journals. scholars acknowledge their intellectual debts to those from For the third phase, the Cited Work Study, the sample whom they have borrowed, on whose research they have used in the Reference Study was refined still further. For built, or by whom they have been influenced [Kochen an electronic journal article to be cited, especially in a 1987]. Although there are other reasons for citing [see, for print publication, the lag time in conventional print example. Brooks 1986; Gilbert 1977], clearly a major publishing and in the research process itself must be meaning of citations is that they reflect the use, impact, or considered. 
One must provide some time for articles to be even perhaps the quality of cited works, as assessed by read and to influence a researcher or scholar in some way, the members of a scholarly community. In this sense, it thus becoming part of a study in progress. Such articles has been said that citations are the currency of may eventually be cited in the published article reporting scholarship. the results of that study. For this reason, the Cited Work The Citation Study assessed the impact of electronic Study was conducted only on scholarly electronic journals journals on scholarly communication by using tools that began publication in 1993 or earlier. There were 39 developed by the Institute for Scientific Information (ISI). journals in this sample. Table 1 summarizes the After selecting the sample of electronic journals for study, characteristics of the three electronic journal samples used the following steps were taken: in the study. A detailed discussion of the procedures for • Search for each of the 39 electronic journals in selecting the three samples was reported [Harter and Kim Journal Citation Reports QCR) for citation data. 1996a]. • Identify all possible forms of the name of the cited journal. Table 1 Electronic Journal Sample Sizes for the Three • Search for journals not found in JCR in the cited work Studies field of Scisearch, Social Scisearch, and Arts & Origin of Sample Humanities Search, using DIALOG'S Onesearch. Purpose of Sample lUlembers Sample Size • Print and verify bibliographic citations for the citing Access and ARL directory and 131 electronic papers. Demographic Study MecklerMedia directory publications • Perform analyses on the citation data. </p><p>Reference Study Peer-reviewed or 77 scholarly A detailed description of this portion of the refereed scholarly electronic methodology is reported [Harter 1996]. 
journals in the initial journals sample Findings Cited Work Study Electronic journals 39 scholarly A great many interesting findings resulted from this in the Reference Study electronic research, far too many to attempt even a summary here. that began publication journals The findings of the demographic and reference studies are In 1993 or earlier reported [Harter & Kim 1996a]. An article describing the access problems and issues encountered was published [Harter & Kim 1996b]. Finally, a report of the citation study Methods appeared [Harter 1996]. Perhaps the most significant of all the findings of this Access and Demographic Study, Reference Study research was that the great majority of the electronic The methods followed in the first two phases of the journals studied have been cited seldom or not at all by research included defining and drawing appropriate the mainstream print journals that comprise the source samples of electronic journals, obtaining data from printed journals of the ISI databases. A summary of these findings, documentation and downloaded from the electronic represented as a frequency distribution, is given in table 2. journals themselves, conducting descriptive analyses, and reporting the findings in tables and prose. Detailed </p><p>92 4. LIBRARY AND INFORMATION SCIENCE RESEARCH GRANT PROGRAM </p><p>Finally, this study was not intended to attempt to Table 2 Frequency Distribution for Citations to Electronic predict the future impact of electronic journals, and its Journals in the Sample findings should not be interpreted in that sense. The study Citations to Electronic Number of Electronic is probably best regarded as a benchmark against which Journai^ Journals' the findings of future research can be compared. 0 15 1-5 13 6-10 3 References >11 7 Brooks, Terrence A. 1986. "Evidence of Complex Citer 1 For electronic journals with print counterparts, the number of Motivations." 
Journal of the American Society for citations includes all citations to articles in the journal, regardless Information Science 37: 34-36. of whether the author read the print or electronic version, a fact that in general cannot be known. Gilbert, G. Nigel. 1977. "Referencing as Persuasion." Social 2 The number of citations to one electronic journal, Psyche, could Studies of Science 7: 113-22. not be determined. Harter, Stephen P. and Hak Joon Kim. 1996. "Electronic Journals and Scholarly Communication: A Citation and Reference Analyses based on the citation data were also Study." Proceedings of the ASIS Midyear Meeting (San Diego, performed. Among the "pure" electronic journals—those CA: May, 1996) 299-315. Available at: with no established print counterpart—those having made http://ezinfo. ucs. indiana. edu/~harter/harter- the most impact were Online Journal of Current Clinical asis96midyear.html Trials, Public-Access Computer Systems Review, and Harter, Stephen P. 1996. "The Impact of Electronic Journals on Psycoloquy. The typical article published in these three Scholarly Communication: A Citation Analysis." Public-Access electronic journals has had a significant impact in terms of Computer Systems Review 7. Available at: the comparative number of times it has been cited (as http://info.lib.uh.edu/pr/v7/n5/hart7n5.html measured by the "<a href="/tags/Impact_factor/" rel="tag">impact factor</a>"). However, even these Hitchcock, Steve, Leslie Carr, and Wendy Hall. 1996. A Survey of journals publish so few articles annually that their impact STM Online Journals 1990-95: The Calm before the Storm. on scholarly communication has been slight. (Updated Febmary 14, 1996.) Available at: http://journaIs .ecs.soton.ac. uk/survey/survey. html Conclusions, Significance, and Limitations Keyhani, A. 1993. "The Online Journal of Current Clinical Trials: An Innovation in Electronic Journal Publishing." Database This research provides hard citation data concerning the 16: 14-23. 
impact of electronic journals on the conduct of research and scholarship, and concludes that thus far they have Kochen, Manfred. 1987. "How Well Do We Acknowledge Our made very little impact. There are other ways that impact Intellectual Journal of Documentation 43: 54-64. can be measured. Electronic journal publishers can gain MecklerMedia. 1994. Internet Worlds' on Internet 94-. An some information regarding use by recording the number International Guide to Electronic Journals, Newsletters, Texts, of subscriptions to an electronic journal or the number of Discussion Lists, and Other Resources on the Internet. times articles are accessed or downloaded from host Westport: MecklerMedia. servers. However, while these kinds of data provide useful Merton, Robert K. 1968. "The Matthew Effect in Science." Science indicators of one type of use, they do not measure the 159: 56-63. extent to which electronic journal articles are playing a Okerson, Ann, ed. 1995. Directory of Electronic Journals, role in the scholarly and research process, that is, in the Newsletters and Academic Discussion Lists. 4th ed. advancement of knowledge. This research used Washington, D.C.: Association of Research Libraries. bibliometric research techniques based on citation analysis as primary methods of data analysis. The validity of citations as a measure of journal impact has been criticized, and to the extent that this research depends on the meaning of citations, it is subject to the same criticisms. However, it seems clear that while the meaning of citations can certainly be debated, they do reflect an influence or impact of some kind on the author of the citing article, even while the precise nature of this influence may not be known. </p><p>93 4. LIBRARY AND INFORMATION SCIENCE RESEARCH GRANT PROGRAM </p><p>A RELATIONAL THESAURUS FRAME NAME Slot-a name [Value of slot-a] MODELING SEMANTIC RELATIONSHIPS Slot-b name [Value of slot-b] Slot-c name [Value of slot-c] USING FRAMES etc. 
Since verbs systematically represent relationships Rebecca Green, Associate Professor, College of Library between their arguments (participants in the relationship and Information Services, University of Maryland at designated by the verb), they constitute the single best College Park linguistic data source for identifying and characterizing relationships. Verbs used in this study come from the Brown corpus of American English, a body of 500 texts, Abstract each of about 2,000 words, representing more than a The study investigated the argument structure of mid- to dozen text types. In their analysis of this corpus, Francis high-frequency English verbs, grouped semantically, to establish a basic inventory of relational structures, and Kugera [1982, 465-532] give a ranked frequency listing represented as frames. Based on this investigation, 28 of English word classes (words sharing stem and part of general relational structures were identified. These speech are merged into a single class), accounting for relationship types are, in turn, related to each other those in about the 85th frequency percentile and above. through specialization (based on the hierarchy The verbs from this list were grouped into semantic fields relationship) and composition (based on the part-whole (where overall frame relationships are shared), using a relationship). Relational structures are also interrelated machine-readable version of Roget's Thesaurus [1911] to through the background knowledge inherent in their use. facilitate categorization. Analyzing the argument structure of verb groups and forming larger verb groupings with Introduction shared relational structure proceeded in tandem until a full array of major relationships was identified. Our conceptual world—^whether modeled through For example, commercial exchange is communicated classification schemes, databases, or natural language— using buy, sell, purchase, pay, and cost. 
comprises two major types of components: entities and relationships. To be effective and efficient, knowledge organization must have access to inventories of both component types. Since development of these inventories can proceed independently, this study, like related research that precedes it [see Vickery 1996], addresses only relationships. The current study differs from related work by rejecting the common assumption that all relationships are, or can be treated as, binary relationships. Consequently, the inventory identified differs significantly from other proposed inventories of relationships. [Green 1995 discusses the basic characterization of conceptual relationships.]

The current research goal is to develop a structured inventory of basic relationships as needed for knowledge organization. This entails both identifying a comprehensive set of general relationship types and analyzing their interrelationships. Since work on the second phase is ongoing, the results reported here on the structure of the relational inventory are preliminary.

Methodology

To capture the essence of a relationship, we need to identify the nature of the relationship, the entities involved in it, and the roles they play. The type of structure adopted here, capable of specifying relationship nature and identifying participant identities and roles therein, is known as a frame. The name of the frame identifies the nature of the relationship, while the slots within the frame tie the participants involved to their respective roles:

These verbs all presuppose a scenario in which four arguments interact with each other in a relatively standard way: possession of the Merchandise, originally owned by the Seller, is transferred to the Buyer in exchange for Money. A common relationship, captured by a single frame, operates across the semantic field. Each of the arguments identified as participants in the semantic field is included as a slot in the associated frame, as shown:

COMMERCIAL EXCHANGE
Merchandise [ ]
Money [ ]
Seller [ ]
Buyer [ ]

Inventory of the Basic Relational Thesaurus

Several phenomena combine to reduce the relational thesaurus to manageable proportions. Verbs within a common semantic field share relational structure. Relational structure is also shared across metaphor, where the structure of a more concrete domain is borrowed to structure our understanding of a more abstract domain. The specificity of individual verbs ranges broadly; we have both the extreme generality of go and make and the greater specificity of meander and manufacture. But the same basic relational structure of an agent moving from one point to another is present in both go and meander. The same basic relational structure of an agent engaging in a process to create an object, optionally using material and/or an instrument, is present in both make and manufacture. The convergence of these phenomena sharply reduces the number of general relational structures to be accounted for.

Figure 1 gives the inventory of basic relational structures identified in this study. Although small in number, this set of structures accounts for all general relationships underlying the 1,250+ verbs being investigated. This inventory correlates positively with verb types identified by Chafe [1970] and with basic image schemata discussed in Johnson [1987].

LOCATIVE
Patient [ ]
Landmark [ ]
Distance between Patient and Landmark [ ]
Directional location of Patient relative to Landmark [ ]

CONTAINER
Interior [ ]
Boundary [ ]
Exterior [ ]

ACTION
Agent [ ]
Activity [ ]

CENTER-PERIPHERY
Patient [ ]
Center [ ]
Periphery [ ]

JOURNEY
Traveler [ ]
PATH [ ]
Vehicle [ ]

GIFT
Gift [ ]
Giver [ ]
Receiver [ ]

COMMUNICATION
Message [ ]
Sender [ ]
Receiver [ ]
Language [ ]
Discourse [ ]
Illocutionary force [ ]

COMMERCIAL EXCHANGE
Merchandise [ ]
Money [ ]
Seller [ ]
Buyer [ ]

EXPERIENCE
Experiencer [ ]
Experience [ ]

PERCEPTION
Perceiver [ ]
Perception [ ]
Sense Modality [ ]

VOLITION
Agent of Choice [ ]
Choice [ ]

COGNITION
Knower [ ]
Known [ ]

EDUCATION
Subject Matter [ ]
Teacher [ ]
Learner [ ]

STATE
Patient [ ]
Attribute Type [ ]
Attribute value [ ]

PROCESS
Patient [ ]
Attribute Type [ ]
Initial value [ ]
Subsequent value [ ]

ACTION-PROCESS
ACTION [ ]
PROCESS [ ]

PRODUCTION
ACTION [ ]
PROCESS; result [ ]
Material [ ]

INSTRUMENT
Instrument [ ]
ACTION-PROCESS [ ]

LINK
Entity A [ ]
Entity B [ ]
Link [ ]

HIERARCHY
Superordinate [ ]
Subcategory [ ]
Principle of inclusion [ ]

FORCE
ACTION; force [ ]
PATH [ ]
PROCESS; effect [ ]
Impediment [ ]

COMPARISON
Entity A [ ]
Entity B [ ]
Comparative evaluation, A to B [ ]

WHOLE-PART
Whole [ ]
Part [ ]
Configuration [ ]

BENEFACTIVE
ACTION/PROCESS [ ]
Beneficiary [ ]
Patient [ ]

POSSESSION
Possessor [ ]
Possession [ ]

PATH
Source [ ]
Destination [ ]
Path [ ]

BALANCE
FORCE A [ ]
FORCE B [ ]
Point/axis/plane [ ]
Comparative distribution of forces [ ]

Fig. 1 Relational Inventory of Basic Conceptual Structures Represented as Frames
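As a rough illustration of the frame notation, the following Python sketch represents a frame as a named set of slots, where a slot may be left unfilled, filled by an entity, or filled by another frame. It is not part of the study: the class name `Frame`, the helper `make_frame`, and every filler value (for example "bookshop" or "courier") are hypothetical and chosen only for the example. The slot names are the ones listed for COMMERCIAL EXCHANGE, PATH, and JOURNEY.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# A slot filler is None (unfilled, written "[ ]" in the frame notation),
# an entity name, or another Frame (a frame name used as a slot name,
# such as PATH inside JOURNEY). These type names are illustrative only.
Filler = Union[None, str, "Frame"]


@dataclass
class Frame:
    """The name gives the nature of the relationship; slots tie participants to roles."""
    name: str
    slots: Dict[str, Filler] = field(default_factory=dict)

    def fill(self, role: str, filler: Filler) -> None:
        """Assign a participant to an existing role; unknown roles are rejected."""
        if role not in self.slots:
            raise KeyError(f"{self.name} has no slot named {role!r}")
        self.slots[role] = filler

    def unfilled(self) -> List[str]:
        """Return the roles that have no participant yet, in declaration order."""
        return [role for role, filler in self.slots.items() if filler is None]


def make_frame(name: str, *roles: str) -> Frame:
    """Create a frame whose slots all start out empty."""
    return Frame(name, {role: None for role in roles})


# A four-participant relationship held in one structure, not as binary pairs.
sale = make_frame("COMMERCIAL EXCHANGE", "Merchandise", "Money", "Seller", "Buyer")
sale.fill("Merchandise", "atlas")       # hypothetical fillers
sale.fill("Money", "40 dollars")
sale.fill("Seller", "bookshop")
sale.fill("Buyer", "library")

# A frame used as the filler of another frame's slot.
route = make_frame("PATH", "Source", "Destination", "Path")
route.fill("Source", "Dublin, Ohio")
route.fill("Destination", "Columbus, Ohio")

trip = make_frame("JOURNEY", "Traveler", "PATH", "Vehicle")
trip.fill("Traveler", "courier")
trip.fill("PATH", route)

print(sale.unfilled())   # []
print(trip.unfilled())   # ['Vehicle']
print(route.unfilled())  # ['Path']
```

Keeping all four participants of COMMERCIAL EXCHANGE in one structure mirrors the study's rejection of the assumption that every relationship can be treated as binary. Allowing a frame to fill a slot is one simple way to model a frame name appearing as a slot name.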
Structure within the Basic Relational Thesaurus

Frame relationships vary in their degree of generality; some relational structures are specifications and/or combinations of others. This means that an inventory of frame relationships is itself structured by relationships.

Two particular patterns of relationships between relational structures occur regularly in the relational thesaurus. In the first pattern, specificity is introduced when the identity of entities is unconstrained in one frame, but constrained to some degree in the other. It is in this sense that the frame for PATH is more specific than the frame for LINK. PATHs are constrained to involve LINKs that can be traversed. Specificity of this type can also be introduced when attributes of entity participants or of the overall relationship are constrained. Such relationships occur extensively within blocks of related frames (fig. 1 groups the frames into eight blocks; within a block, more general frames are given before more specific frames).

In the second pattern, one or more relational structures are incorporated within a more complex relational structure. ACTION-PROCESS, a straightforward merging of ACTION and PROCESS, is a clear example of such incorporation. FORCE likewise incorporates the whole of PATH, PROCESS, and ACTION. Although this second pattern has some surface similarities with the first pattern, it does not involve hierarchical inclusion; it is based instead on composition. The existence of such relationships between frame structures is reflected in fig. 1 by the use of the name of a frame as a slot name in another frame.

In addition to the patterns noted, frame structures are related through the knowledge inherent in the use of a frame. Frames are meant to capture relational scenarios about which we, as human beings, typically have some degree of knowledge. For example, implicit in the notion of a JOURNEY is a PROCESS of the Traveler of the JOURNEY changing location from being at the JOURNEY's Source to being at the JOURNEY's Destination. This PROCESS is paired with one of two possible types of ACTIONs. In one case, when the Traveler is moving under his/her own energy, the Agent of the ACTION is the Traveler of the JOURNEY and the Activity is the JOURNEY itself. In the other case, when the Traveler is being moved rather than moving, the Agent of the ACTION is the Agent of the ACTION of FORCE while the Activity, if literal, may be something like pushing or pulling. If metaphorical, it may be something like encouraging or threatening. Further, the JOURNEY's Vehicle, if present, is an Instrument of the ACTION-PROCESS. This relational knowledge is spelled out in fig. 2. (Incidentally, being able to express these knowledge relationships using the basic inventory of relational structures is one indication of the inventory's validity.)

JOURNEY-a
Traveler [ ]
PATH [PATH-a]
Vehicle [ ]

PROCESS-a
Patient [JOURNEY-a.Patient]
Attribute type [LOCATIVE]
Initial value [LOCATIVE-a]
Subsequent value [LOCATIVE-b]

LOCATIVE-a
Patient [PROCESS-a.Patient]
Landmark [JOURNEY-a.PATH.Source]
Distance [0]
Directional location [AT]

LOCATIVE-b
Patient [PROCESS-a.Patient]
Landmark [JOURNEY-a.PATH.Destination]
Distance [0]
Directional location [AT]

ACTION-a
Agent [JOURNEY-a.Traveler | FORCE-a.ACTION.Agent]
Activity [JOURNEY-a]

INSTRUMENT-a (opt)
Instrument [JOURNEY-a.Vehicle]
ACTION-PROCESS
ACTION [ACTION-a]
PROCESS [PROCESS-a]

FORCE-a
ACTION; force [ACTION-a]
PATH [PATH-a]
PROCESS; effect [PROCESS-a]
Impediment [X]

(Key: | = or; X = not applicable)

Fig. 2 Frame-based Representation of Knowledge Inherent in the JOURNEY Frame

Significance

Many enterprises could benefit from a basic inventory of relationships, including information retrieval, knowledge representation, data modeling, and hypermedia systems. While such an inventory might not be universal across cultures, it could be expected to generalize across applications within a linguistic culture.

The significance of such an inventory to knowledge organization is at least four-fold.

• The inclusion of specific nonhierarchical relationships within the syndetic structure of an index language would permit greater definition of the relationships between thesaurus descriptors or between subjects in a classification scheme, etc.
• The inclusion of frames as the basic structure of index units would afford the specification of seemingly ad hoc relationships that cannot be communicated presently by most index languages. This is especially important because relevance is often mediated by nonhierarchical relationships between the subject of the user's need and the subject of documents relevant to the need [Green and Bean 1995]. Frame-based indexing would permit searching for literature that contains information bearing on the user need in a certain way. Retrieval could then become more precise.
• Frame-based indexing would permit retrieval based on the presence of specific relationships, no matter which entities are involved. This type of searching would be more useful than the traditional entity-based search in, for example, analogically based search strategies. At present, relationship-based searches are almost impossible to perform in any systematic way.
• Some degree of reasoning could be performed on frame-based index entries based on both the hierarchical and the nonhierarchical knowledge inherent in frame structures. By this means, a degree of expertness or artificial intelligence could be built into retrieval systems. This, in turn, should have the potential of contributing further to recall and precision.

References

Chafe, Wallace L. 1970. Meaning and the Structure of Language. Chicago: University of Chicago Press.

Francis, W. Nelson and Henry Kučera. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin.

Green, Rebecca. 1995. "Syntagmatic Relationships in Index Languages: A Reassessment." Library Quarterly 65(4) 365-85.

Green, Rebecca. 1996. "Development of a Relational Thesaurus." In Knowledge Organization and Change: Proceedings of the 4th International ISKO Conference. Rebecca Green, ed. Frankfurt/Main: INDEKS Verlag, 72-79.

Green, Rebecca and Carol A. Bean. 1995. "Topical Relevance Relationships. II. An Exploratory Study and Preliminary Typology." Journal of the American Society for Information Science 46(9) 654-62.

Johnson, Mark. 1987. The Body in the Mind: The Bodily Basis of Meaning, Imagination, and Reason. Chicago: University of Chicago Press.

Roget, Peter. 1911. Roget's Thesaurus of English Words and Phrases. New York: Crowell.

Vickery, Brian. 1996. "Conceptual Relations in Information Systems [letter to the editor]." Journal of Documentation 52(2) 198-200.

5 DISTINGUISHED SEMINAR SERIES

The OCLC Office of Research Distinguished Seminar Series was founded with the express purpose of furthering and stimulating the exchange of ideas across the barriers of time, space, and disciplines. Each year, OCLC invites distinguished professionals to lead a half-day seminar reporting on recently completed or early-stage research that they have undertaken. The topics sometimes align closely with research directions within OCLC, but more frequently they represent areas of interest to the library and information science community that are not formally being studied by researchers at OCLC. Diversity of topics is essential to meeting the purpose of the Distinguished Seminar Series, and through the years, there has certainly been no shortage of diverse presentations.
The reports in this section reflect that principle of diversity. Barbara Tillett of the Library of Congress shared her experience of using conceptual models in determining future directions for cataloging and the MARC format. Two months later Kenneth Crews of the Copyright Management Center at Indiana University-Purdue University made a presentation on the tensions between United States copyright law and the emerging technologies in library information networks. Diverse topics, yes, but equally vital to OCLC's membership, and thus appropriate and timely topics for the Seminar Series.

CATALOGING RULES AND CONCEPTUAL MODELS

On January 30, 1996, Barbara B. Tillett, Ph.D., presented her views on the usefulness of conceptual modeling to determine future directions for cataloging and the MARC format. Nearly 100 persons from OCLC, local libraries, and universities attended her lecture, as part of the OCLC Office of Research Distinguished Seminar Series.

Dr. Tillett discussed a technique called conceptual modeling that can be tremendously useful in analyzing where we need to go with the future of cataloging rules and MARC formats. Dr. Tillett reviewed some of the evolutionary processes that have taken place in the bibliographic universe. She introduced the conceptual model that IFLA is preparing in its study of the functional requirements of the bibliographic record, and refreshed memories about the purpose of cataloging rules and how they have evolved in response to changes in technologies. She also examined where we are headed and where we might go with future rules and alternative communication formats.

Evolution of Publishing Technology

The impulse to record started with human beings wanting to document their ideas and share their intellectual and artistic creations. It may have begun earlier, but Dr. Tillett suggested that the urge to record ideas started with cave paintings in the Paleolithic era about 20,000-30,000 years ago. Although cave drawings were fixed in location, the ideas they conveyed could be replicated by other human beings in other places using similar symbols. Language was another medium of expression that evolved for communicating ideas. But it was writing, or capturing language in a more permanent form, that has been particularly significant, for with written language came the possibility of saving and disseminating ideas and artistic expressions. It is that "fixing" of artistic and intellectual content that allows us to define that content. That is an incredibly important step in any culture. Another important step is the reproduction of that content, which allows it to be shared with others.

During this century there has been even more rapid proliferation of types of materials that capture intellectual and artistic content. New technologies allowed us to capture previously undreamt-of wonders, preserving movement and sound in motion pictures and sound recordings. Thus we see how available technologies have given rise to whole new categories of intellectual and artistic effort, which were previously unimaginable.

The easy availability of online materials and the fact that digitized forms can be easily and cheaply created and altered by individuals have shaken some of our fundamental concepts of intellectual property rights, authorship, publishing, and bibliographic control. Since digital renditions can easily be corrected and updated, we need better ways of identifying which version we are seeing, which catalogers will want to describe, and which library selectors will want to obtain and preserve for the future.

Things are changing so quickly now that we cannot accurately predict what tomorrow may bring. But that does not mean that we can afford to sit on the sidelines waiting for the digital dust to settle. We need to take a hard look at the publishing world as it exists today, project as best we can what might happen in the future, and plan how to fit the products of the new technology into the considerable corpus of information we already have. In preparing ourselves for coping with this new world, we can profit from re-examining how well we have done in describing the bibliographic universe of the past.

Conceptual Modeling

A technique for such re-examination that has been used for over a decade now is conceptual modeling using entity-relationship models or object-oriented modeling. Dr. Tillett mentioned the article by Michael Heaney of the Bodleian Library, "Object-Oriented Cataloging."¹ There is also the modeling that was done in the 1980s by several institutions, such as in Western Australia, at the Library of Congress, and at the National Library of Canada. Yet another modeling exercise was started in the early 1990s under the auspices of the IFLA Study Group on the Functional Requirements for Bibliographic Records.² All of these activities use an entity-relationship model (E-R model) to identify the entities, attributes, and relationships that are most important for the users of bibliographic information.

In the world of entity-relationship modeling, entities are those things about which we keep information. For each relationship, there is also information (cardinalities) that describes whether the relationship is optional or mandatory, and how many times it can occur for each instance of the entity.

An interesting and significant aspect of the E-R modeling technique is that the entities are defined in terms of the real world as much as possible. In other words, in formulating an E-R model, you should first step back and ask "what are these objects in the real world, and how do they behave?" and then adjust the model to
reflect your particular business rules. A key point to remember about an E-R model is that it represents a particular view of reality. Another important point is that it should be "implementation neutral"; in other words, it should not reflect any particular implementation of the data in manual or automated form. This is why we talk about this kind of modeling as "conceptual modeling."

The concepts involved in the bibliographic world are much more abstract, and are therefore much more difficult to model. It is interesting and encouraging, however, that most of the recent bibliographic models, which have been arrived at somewhat independently, share the following types of entities for representing information about the materials that we catalog (these are from the IFLA Study Group on the Functional Requirements of the Bibliographic Record):

• Work, that is, the intellectual and artistic content
• Expression, that is, the intellectual or artistic content as realized in the form of software or textual, visual, musical, or choreographic notation
• Manifestation, that is, a physical embodiment of the intellectual content, but still an abstraction that applies to all instances of the work that share the same physical characteristics
• Item, that is, an individual copy you want to inventory, which in turn could be made up of several physical pieces

Future

We have used the conceptual model to try to step back from how we record information today and start thinking about what that information really means and why we provide it. So, where are we headed for the future?

Cataloging rules

We are all painfully aware of the effects that rapid technological advances are having on the kinds of materials we are cataloging. Isn't it now time to re-examine the basic principles upon which we base our Anglo-American Cataloging Rules? The Joint Steering Committee for the Revision of AACR has proposed an invited conference to be held in 1997 to reexamine the principles behind the rules to reaffirm them or revise them. Dr. Tillett said that conceptual modeling could be a useful tool for reexamining the rules.

Perhaps the biggest challenge as librarians is in determining when you have a different intellectual content. What was sometimes difficult in the past has become extremely problematic with the advent of the Internet. What markers do we have to make the determination that a change has occurred? We hope that conventions for presenting information online will include explicit data on what version we are viewing and when it is considered a new work. This may be tied to future copyright requirements. Stay tuned.

Self-describing manifestations

Models can also help us see where to draw lines between what we can capture by machine and what we need humans to provide. The source for information we provided in the past for books and other materials was based on relatively easy to find information on the "chief source of information" on the physical item.

MARC formats and future alternatives

Modeling also will help us re-examine the MARC formats and see how best they can evolve towards a future, improved structure for communicating bibliographic and authority information. The model helps us see that those record structures we created for catalogs which were primarily for bibliographic description and access to books in the traditional publishing world are not adequate for the bibliographic information we need to convey for other formats of materials. In fact, they were never totally successful for books; witness our problems with multiple versions. It also helps us to see where the current MARC formats need improving in how they express entities and relationships.

What if, instead of a flat-file view, we could look at bibliographic records as groupings of information about different kinds of bibliographic entities, with a variety of relationships among them? Instead of recording information redundantly about a work when describing the expression, for example, we should be creating a relationship between that expression and the work with which it is associated. Instead of writing a note to identify that something is an English translation from the original work in German, we should be able to establish the relationship between the translation and the original. Technology can help us to implement these relationships through links, and to help catalog users traverse the relationships that have been articulated. But the intellectual work of making those links will still need to be done by human beings.

Conclusions

While modeling can help us see the bibliographic universe more clearly, we need to remember that the universe is constantly changing. Our views of library catalogs and cataloging may also change. Catalogers are expert organizers, and have an immensely important role in providing bibliographic control over those portions of the bibliographic universe that we have allocated to the care of libraries, preserving information for future generations. Our jobs are facilitated by computer software, which lets computers do what they do best in processing large quantities of material quickly and reporting back results. However, human intelligence is essential for making the logical connections that express relationships, for the subject analysis that gives structure to our collections, and for the catalogs that serve up the surrogates for those collections.

In the evolution of cataloging rules, we see less and less of the rationale of why we provide the information we give in bibliographic records. The faster technology changes the characteristics of the things we are trying to catalog, the more important it is to examine our assumptions to be sure they still apply. We need to keep pace with what information is useful and frequently check whether our assumptions still hold. Modeling is a tool to help us see the universe more clearly, and it will help us in re-examining our assumptions about cataloging rules and MARC formats as we reinvent cataloging.

Notes

1 Heaney, Michael. 1995. "Object-Oriented Cataloging," Information Technology and Libraries. September: 135-53.

2 Draft Report of the IFLA Study Group on the Functional Requirements of the Bibliographic Record provides the following (paraphrased) background on their initiative. The 1990 Stockholm Seminar on Cataloging, sponsored by the IFLA UBCIM Program, grew out of an increasing awareness of:

• The mounting costs of cataloging and the concomitant desire for cataloging simplification
• The proliferation of electronic, multimedia, and computer-related materials and the challenges these pose for both description and access
• The increasing drive to economize in cataloging by reducing duplicate efforts, nationally and internationally, and the associated need to define a core-level standard to support the cooperative use of records
• The exploding bibliographic universe and the continual need to effect better universal bibliographic control
• The continuing pressures to adapt cataloging practices and codes to the machine environment

Following the 1990 Stockholm Seminar, a project was proposed to study the functional requirements for bibliographic records. The Terms of Reference for a Study of the Functional Requirements for Bibliographic Records states the following purpose:

The purpose of this study is to delineate in clearly defined terms the functions performed by the bibliographic record with respect to various media, various applications, and various user needs. The study is to cover the full range of functions for the bibliographic record in its widest sense (i.e., a record that encompasses not only descriptive elements, but access points [name, title, subject, etc.], other "organizing" elements [classification, etc.], and annotations).

Presented January 30, 1996. Barbara B. Tillett, Ph.D., is the Chief of the Cataloging Policy and Support Office at the Library of Congress.

THE COPYRIGHT DILEMMA: LEGAL TENSIONS AND INFORMATION NETWORKS

On March 19, 1996, Professor Kenneth Crews spoke on the topic of copyright law in relation to emerging library information networks. More than 60 persons from OCLC, local libraries, and universities attended his talk as part of the OCLC Office of Research Distinguished Seminar Series.

Dr. Crews focused on possible conflicts between the law of copyright and the deployment of new technologies for creating library networks. A fundamental objective of the United States copyright law is to protect the rights of authors and other owners of copyrightable works to allow them a fair return in the marketplace, while giving the American public reasonable access to those works and the information they contain through public institutions such as libraries. New technological advances, such as the World Wide Web, digital scanners, and optical character recognition, are putting some strain on this balance of rights. For example, these advances make it possible for copyrighted works to be copied easily and cheaply and disseminated from just one instance of the work, which, in turn, threatens the revenue stream of copyright owners. A new balance must be struck, perhaps through such institutions as the Conference on Fair Use and possibly through changes in the copyright law itself.

As his main organizing principle, Dr. Crews characterized the emerging knowledge network world by four illustrative examples:

• Interlibrary loans
• Distance learning
• Electronic reserves
• Electronic publications

Each represents the clash between opportunity brought by advancing technology and the demands of copyright law. Dr. Crews sought to demonstrate this clash through discussions of each activity in relation to the four major sources of public rights of use of copyrighted materials:

• Fair use (Section 107 of the Copyright Act)
• Library copying (Section 108)
• Distance learning (Section 110)
• Permission of owners

Interlibrary Loans

Taking interlibrary loans (ILL) first, Dr. Crews said that Section 108 of the current copyright law is "technologically neutral," and "is perhaps the most readable law passed by Congress, ever. It is clear, understandable." In order to further society's need for reasonable access to knowledge by all citizens, not just a few, Section 108 provides specific rights to certain kinds of libraries (e.g., they must be open to the public) to copy works for such purposes as preservation and patron use. In terms of patron use, the following conditions often apply:

• Allows copies of articles and other materials
• For user's private study, scholarship, research
• Display a warning of copyright
• Isolated and unrelated copies only
• No systematic reproduction of copies
• Specific allowance of interlibrary loans
• Quantity of copies received does not substitute for subscription or purchase

Despite the technologically neutral position of Section 108, concerns have arisen in all communities involved in ILL (the libraries, the publishers, the authors, the patrons) that can be traced to the burgeoning use of electronic resources to provide materials sought by patrons. Publishers fear uncontrollable proliferation of electronic copies of works, while libraries fear unreasonable costs being levied on them by publishers through royalty charges on electronic sources.

In that regard, in 1979 the National Commission on New Technological Uses of Copyrighted Works (CONTU) proposed guidelines for ILL. These guidelines, also known as the "Rule of 5s," allow a borrowing library to receive up to five copies per year from the most recent five years of a given periodical. This guideline does not address older materials, nor does it make any attempt to distinguish among technologically different renditions (e.g., electronic vs. paper).

Under the CONTU guidelines, libraries would not be limited in the number of loans they make. The limit is on the borrowing library. In practice, this should mean the publisher will be able to sell sufficient numbers of the original work to protect revenues. On the other hand, the availability of a nonroyalty-based ILL option to libraries should keep pressure on the publishers to keep subscription prices reasonable. Unfortunately, the guideline doesn't address the large collections of legacy articles and monographs held by libraries.

Distance Learning

The next problem area that Dr. Crews addressed was that of distance learning. Technology now allows for much wider dissemination of courses through remote means than ever before. For instance, it is possible for a student, through the technologies of the World Wide Web and its software clients (browsers and their plug-ins), to be sitting at a desk at home or work while listening to and viewing an instructor, either live or recorded. However, Section 110(2) of the copyright law puts limitations on distance learning that would most likely prohibit such an action, if what the student was viewing or listening to contained some form of copyrighted material. Section 110(2) has the following conditions concerning use of copyrighted works in a distance-learning situation:

Eligibility requirements for using copyrighted materials:

• Regular part of systematic instruction
• Directly related to teaching content
• Reception in classrooms or similar places; or reception by persons with disabilities or other "special circumstances"

Types of materials allowed:

• "Displays" of any work
• "Performances" of nondramatic literary works and nondramatic musical works
• "Literary" excludes "audiovisual works"

This statute means that an instructor could not transmit by television or computer network to students at remote locations even a portion of a copyrighted film or a "dramatic" musical work, even if all other requirements were met. Yet the technology clearly allows for such transmission to occur. This tension between the law and the possibilities afforded by technology in distance learning is yet to be addressed.

Electronic Reserves

Regarding electronic reserves, Dr. Crews explained the intricacies of Section 107 of the copyright law, which addresses the legal concept of "fair use." This doctrine currently supports traditional print reserve systems. These systems keep hard copies, one of each required item, in a particular place, limiting access to one student at a time, and sometimes only to students from the course for which the materials were requested. Electronic versions of such systems could easily offer better, more useful service, but doing so might easily reach beyond the limits of fair use.

Fair use, as defined in the copyright law, is based on a balancing of the following four factors:

• Purpose and character of the use
• Nature of the copyrighted work
• Amount and substantiality used
• Effect of the use on the market for or value of the original

The law offers guidelines for determining if a particular use of a copyrighted work satisfies each of these requirements, but provides no hard-and-fast rules. For example, according to Dr. Crews, there is no absolute cutoff in the law regarding the amount factor. Court rulings on these factors offer vague, split authority, which leads to tremendous uncertainty about the law. Dr. Crews says that there is little legal precedent to rely on to determine fair-use situations; we have only guidelines such as the proposed CONTU rules, which are not a panacea.

The emerging set of guidelines regarding electronic reserves will:

• Allow full text of articles and book chapters
• Allow simultaneous access by multiple students
• Anticipate access from remote locations
• Anticipate downloading and printing
• Allow the library to store digital version for future use
• Allow one-time use per course/per instructor
• Require including the copyright notice
• Require display warning notice on screens
• Permit access only to enrolled students
• Not allow electronic reserves to compete with course packs
• Require reserves to be small proportion of total reading for a course

As these are adopted, the tensions between the law and the technological possibilities inherent in the new electronic reserves systems will be eased. However, from an owner's perspective, fair use is hard to police, guidelines or no. Catching violators is difficult, and when potential violations are found, the letters used to inform the accused violator often end up having compromising language in them that complicates the case. Selective, ineffective enforcement does not help the author's rightful economic concerns.

Electronic Publications

As for electronic publications, Dr. Crews outlined the following issues regarding managing and sharing new works via electronic networks, such as the World Wide Web:

• NII bill intended to offer assurance
• Commercial works vs. scholarly works
• Maximizing revenue vs. advancing knowledge
• Copyright ownership and the growth of knowledge
• Consolidate ownership rights
  - Ease quest for permissions
  - Prepare for new formats
  - Identify public rights of use

The NII Copyright Protection Act of 1995 (HR 2441 and S 1284) was a bill introduced in the 104th Congress in September 1995 in both the House and the Senate. Its purpose was "to amend title 17, United States Code, to adapt the copyright law to the digital, networked environment of the national information infrastructure, and for other purposes." The bill did not become law during the 104th Congress, though hearings were held in February 1996.

Commercial works are not the same as scholarly works, for they have different purposes and different audiences. Maximizing revenues, including strong copyright enforcement, is important in the case of commercial works, while advancing knowledge is most crucial in the case of scholarly works. But copyright still plays a role for all works, by protecting the integrity of works and helping to preserve an author's right to credit for original contributions. Ideally, copyright ownership would enhance the growth of knowledge, not retard it.

Presented March 19, 1996. Kenneth Crews is Associate Professor in the Indiana University School of Law-Indianapolis and the IU School of Library and Information Science and Director of the Copyright Management Center, Indiana University-Purdue University at Indianapolis.

PROGRAMS

VISITING SCHOLAR

A SABBATICAL RESEARCH POSITION

The OCLC Visiting Scholar post brings experienced scientists, educators, and administrators with demonstrated research capabilities to OCLC to conduct research with OCLC staff, facilities, and data resources. The Visiting Scholars conduct research in areas of mutual interest to themselves and OCLC so that both benefit from a close working relationship.

Qualifications

The Visiting Scholar must have a Ph.D. degree or equivalent training and substantial experience in directing and conducting research in one or more fields of interest to OCLC, such as library, information, or computer science; applied mathematics; statistics; psychology; and human factors. The position is not restricted to citizens of the United States.

Nature of Appointment

The appointee is expected to conduct research that focuses on problems of significance to the library and information science community; the research need not be specific to OCLC's development and production activities. OCLC expects that research results will be published in the open literature. All publications stemming from the research conducted while at OCLC are attributed to the Visiting Scholar, acknowledgment of OCLC's support is required, and coauthorship is expected when OCLC staff make significant contributions to the research effort.

Terms of Appointment

Length of appointment is variable, and compensation is negotiable based on the length of term.

1992/93 Dik L. Lee, Associate Professor, Computer and Information Science, The Ohio State University
1991/92 Karen M. Drabenstott, Associate Professor, School of Library and Information Studies, The University of Michigan
1988/89 Roland Hjerppe, Group Leader, LIBLAB Library and Information Science Research Laboratory, Department of Computer and Information Science, and University Library, Linköping University, Linköping, Sweden
1987/88 Abraham Bookstein, Professor, Center for Information Studies, University of Chicago
1987/88 Elaine Svenonius, Professor, Graduate School of Library and Information Science, University of California, Los Angeles
1987/88 Paul B. Kantor, President, Tantalus, Inc., Cleveland, Ohio
1986/87 Francis L. Miksa, Professor, Library and Information Science, University of Texas at Austin
1985/86 Martin Dillon, Professor, Library Science, University of North Carolina, Chapel Hill
1985 Terry Noreault, Assistant Professor, Information Science, University of Pittsburgh
1984 Allan Pratt, Professor, Graduate Library School, College of Education, University of Arizona
1983/84 Mary Jo Lynch, Director, Office of Research, American Library Association, Chicago, Ill.
1982/83 Nancy B. Olson, Associate Professor, Audiovisual Cataloger, Mankato State University, Mankato, Minn.
1981/82 Donald J. Sager, Administrative Librarian, Elmhurst Public Library, Elmhurst, Ill.
1980/81 Pauline A.
Cochrane, Professor, School of Previous Visiting Scholars Information Studies, Syracuse University 1996/1997 John V. Richardson, Jr., Associate Professor, 1979/80 William E. McGrath, Dean of Library Services, the University of California at Los Angeles Graduate University of Lowell, Lowell, Mass. School of Education and Information Studies 1978/79 Edward T. O'Neill, Associate Professor, School 1993/94 Shoichi Taniguchi, Research Associate, of Information and Library Studies, State University of New University of Library and Information Science Ibaraki-ken, York at Buffalo Japan 1992/93 Mark T. Kinnucan, Associate Professor, School of Library and Information Science, University of Western Ontario, Elborn College, London, Ontario, Canada </p><p>105 PROGRAMS </p><p>Past Postdoctoral Fellows Terms of Appointment 1989/90 Michael J. Prasse, Ph.D., Psychology, The Ohio OCLC provides research assistants a monthly stipend and State University, 1987 pays tuition fees for each academic term that the position is held. The stipend and tuition payments are subject to 1986/88 Jeanette Drone, Ph.D., Library and Information applicable federal, state, and local taxes, and fringe Science, University of Illinois at Urbana-Champaign, 1984 benefits are not included. Generally, research assistants 1985/86 Cliandra Prabha, Ph.D., Library and work 20 hours per week, but working hours are flexible. Information Science, University of Illinois at Urbana- Champaign, 1984 Reappointment 1985 Padmini Das-Gupta, Ph.D., Information Transfer, Renewal of the position depends upon both academic and Syracuse University, 1985 OCLC performance. Reappointment to succeeding academic terms is considered prior to the end of a current term. 
1983/84 Diane Vizine-Goetz, Ph.D., Library and Information Science, Case Western Reserve University, 1983 ADDITIONAL PROGRAM RESEARCH ASSISTANTS INFORMATION The OCLC research assistant position enables full-time ENVIRONMENT, TRAVEL, AND graduate students to participate in research at OCLC in Dublin, Ohio, while pursuing an academic degree. APPLICATIONS INFORMATION Program participants may pursue independent research using OCLC facilities during nonwork hours to complete academic requirements. Environment Resources available to the Visiting Scholar include access Qualifications to WorldCat (the OCLC Online Union Catalog) and other online information providers as well as other data related Candidates must be enrolled full-time in a graduate degree to the OCLC database and communications network. program at an accredited university. Hardware such as Sun, Gateway 2000, and Apple® Degree programs may include, but are not limited to, the computers and software (including utility routines) apart following general subject areas; from the OCLC operations computer system are available • Computer science for research purposes. OCLC staff are available for • Information science consultation. Secretarial, editorial, and graphics support • Library science are also provided. • Statistics • Psychology • Linguistics Travel • Communications Relevant travel and business expenses connected with the • Industrial and systems engineering Visiting Scholar appointment are reimbursed by OCLC, when approved by the Director of the Research & Special Researcfi Assistant Activities Projects Division. 
Research assistants play an important and significant role in research activities and work closely with research Applications scientists to: Candidates for the Visiting Scholar appointment should • Assist in the design, implementation, and execution of submit a letter of interest, including a current curriculum research projects vitae, date of availability, and specific research interests to: • Conduct literature searches Director, Research and Special Projects • Collect and analyze data OCLC Online Computer Library Center, Inc. • Analyze and write computer programs 6565 Frantz Road • Assist the preparation of research proposals, reports, Dublin, Ohio 43017-3395 and publications The application deadline is February 1. Candidates are selected by April 1 for appointment in either July or September. Applications are accepted at any time. OCLC is an Equal Opportunity Employer. </p><p>106 OCLC STAFF </p><p>Mark W, Bendig, Consulting Systems Analyst Andrew P. Houghton, Consulting Systems Analyst B.S.E.E. (Electrical Engineering), The Ohio State A.A.S. Data Processing, State University of New York University Project; Cataloging Productivity Tools, Classification Projects: Cataloging Productivity Tools, Classification Research, ExTended Concept Tree, Research, Technical Services Workstations andrew_houghton@ oclc.org mbendig@oclc.org Susanne Krouse, Administrative Coordinator Richard Bennett, Senior Systems Analyst krouse@oclc.org B.S. (Engineering), Pennsylvania State University; M.S., Linda M. Kwiatkowski, Research Associate Ph.D., (Engineering), Georgia Institute of Technology B.A., Economics, Barnard College of Columbia Projects: FirstSearch Next Generation University; M.B.A., Finance, New York University rick_bennett@oclc.org Projects: Interlibrary Loan and Document Delivery, Jenny Colvard, Consulting Systems Analyst Current Cataloging Practices B.S. 
Computer Information Systems, kwiatkow@oclc.org Western Carolina University Suzanne Lauer, Administrative Secretary Projects: FirstSearch Next Generation lauers@oclc.org colvard@oclc.org Brian F. Lavoie, Research Associate Mark A. Crook, Sr. Consulting Systems Analyst B.S., Economics, M.A., Economics, Ohio University B.A., English, The Ohio State University; MLS, Kent Project: Cuttering State University; MBA, Franklin University lavoie@oclc.org 1953-1996 Elizabeth C. Marsh, Research Associate Jonathan R. Fausey, Senior Systems Analyst B.A., French Literature; M.L.S., University of California, B.S., Computer and Information Science, The Ohio Los Angeles State University; M.S., Computer Science, Wright State Projects: Interlibrary Loan and Document Delivery, University Current Cataloging Practices Projects; SGML Document Grammar Builder (FRED); marsh@oclc.org Persistent URL (PURL); Scorpion; Kilroy fausey@oclc.org Patrick D. McClain, Systems Analyst http://purl.oclc.org/net/fausey B.S., Computer Science, Bowling Green State University Projects; Cuttering, Authority Control C. Jean Godby, Research Scientist mcclain@oclc.org B.A., English, German, University of Delaware; M.A., Linguistics, The Ohio State University Eric J. Miller, Associate Research Scientist Projects: WordSmith, Networked Information B.S., Computer Science Engineering, M.S. Natural godby® oclc.org Resources, The Ohio State University Projects: Metadata, PURL Carol A. Hickey, Research Associate emiller® oclc.org B.S., Education, M.L.S., State http;//purl.oclc.org/net/eric University of New York at Geneseo Projects; Cataloging Productivity Tools, Classification Terry R. Noreault, Director, Research Research & Special Projects hickeyc@oclc.org B.A., Computer Science, State University of New York Oswego; Ph.D., Information Transfer, Syracuse Thomas B. 
Hickey, Chief Scientist University B.S., Physics, State University of New York at Stony noreault® oclc.org Brook; M.L.S., State University of New York at Geneseo; Ph.D., Library Science, University of Illinois, Edward T. O'Neill, Consulting Research Scientist Urbana-Champaign B.A., Liberal Arts, Albion College, Albion, Michigan; Projects: FirstSearch Next Generation B.S.I.E. (Industrial Engineering), M.S.I.E., Ph.D. hickey@oclc.org Purdue University http://purl.org/hickey/hickey Projects: Cuttering, Authority Control, Record Matching, Interlibrary Loan oneill@oclc.org 107 OCLC STAFF </p><p>Chandra G. Prabha, Senior Research Scientist Vincent M. Tkac, Senior Programmer Analyst B.S., Education, M.S., Library Science, University of B.S., Computer Science and Mathematics, Youngstown Wisconsin—Madison; Ph.D., Library and Information State University Science, University of Illinois, Urbana-Champaign Projects: Scholarly Publishing WWW, Persistent URL Projects: Interlibrary Loan and Document Delivery, (PURL); Scorpion; Kilroy Current Cataloging Practices tkac@oclc.org Chandra® oclc. org http://purl.oclc.org/net/tkac John V. Richardson, Jr., Visiting Distinguished Scholar Diane Vizine-Goetz, Consulting Research Scientist B.A., Sociology, The Ohio State University; MLS, B.A., English, M.L.S., Information and Library Science, Vanderbilt University, Peabody College; Ph.D., Indiana State University of New York at Buffalo; Ph.D., University Information and Library Science, Case Western Reserve Project; Intelligent Question Answering University richards® oclc.org Project: Cataloging Productivity Tools, Classificcation http;//PURL.oclc.org/net/jrichardson Research, ExTended Concept Tree vizine@oclc.org Keith E. Shafer, Senior Research Scientist B.A., Computer Science/Mathematics, Mount Vernon Bradley C. 
Watson, Consulting Systems Analyst Nazarene College; M.S., Ph.D., Computer Science, The B.A., English, University of Dayton; M.L.S., Library Ohio State University Science, George Peabody College for Teachers; M.A., Projects: SGML Document Grammar Builder (FRED); English, Wright State University; M.C.S., Computer Persistent URL (PURL); Scorpion; Kilroy Science, University of Dayton; Ph.D., English, shafer@oclc.org The Ohio State University http:// purl.oclc.org/keith Project: Natural Language, Newton Lite watsonb@oclc.org. Lisa Stickley, Administrative Secretary sticklel@oclc.org Stuart L. Weibel, Consulting Research Scientist B.S., Biology, Ohio Dominican College, Columbus, Thomas L. Terrall, Senior Systems Analyst Ohio; Ph.D., Pharmacology, The Ohio State University Ph.D., Physics, The Ohio State University Projects: Metadata Workshop Series; International Projects: SGML Document Grammar Builder, STORD, World Wide Web Conference Committee, Task Force TULIP on Archiving of Digital Information; Task Force on terrall@oclc.org Bibliographic Access to Electronic Resources Roger Thompson, Senior Systems Analyst weibel@oclc.org B.A., Computer Science, University of California— http:// purl, org/ net/weibel Berkeley; M.S. Computer Science, New Mexico State Jeffrey A. Young, Consulting Systems Analyst University; Ph.D., Computer Science, University of B.S., Computer and Information Science, Massachusetts-Amherst. The Ohio State University Projects: Natural Language, Newton Lite, Scorpion, Projects: Cuttering, Bosnia Database Reconstruction Classification Research, ExTended Concept Tree jeffrey_young@oclc.org thompson®oclc.org </p><p>108 PATENT, PRESENTATIONS, PUBLICATIONS </p><p>Patent . "Persistent Uniform Resource Locators" at the Shafer, Keith E. American Library Association Annual Meeting held in US Patent 5583762 New York City, July 1996. Generation and reduction of an SGML-defined grammar. . 
"A Perspective on the Geospatial Metadata Landscape" at the American Library Association Presentations Annual Meeting held in New York City, July 1996. Bendig, M. "Dewey for Windows" at 1996 Workshop for . "Dublin Core Implementation from the Trenches" at Technical Services Workstations held in Washington, the OCLC/UKOLN Metadata Workshop II held at D.C., April 1996. Warwick, UK, April 1996. Colvard, J. "Java at OCLC" at American Library Association Noreault, T. "Use of SGML in Digital Libraries" at the Annual Meeting held in New York City, July 1996. International Seminar on Digital Libraries and Godby, J. "Natural Language Processing at OCLC" at Information Services for the 21st Century held in Columbia University, February 1996. Seoul, Korea, September 10-13, 1996. . "Natural Language Processing at OCLC" at Chemical . "Electronic Publishing: Communication in a Abstracts, Columbus, Ohio, April 1996. Scholarly Environment." (co-authored with Bradley Watson) at the AUUG 96 & Asia Pacific World Wide . "A Metalanguage for Describing Internet Resources" Web Conference & Exhibition, held in Melbourne, (co-authored with Eric Miller) at INET Conference, Australia, September 18-20, 1996. Montreal, Canada, June 1996. . "Technologies for the Access and Delivery of . "Library Classification Schemes and Access to Information in a Networked Age" at the 1996 Joint Electronic Collections: Enhancement of the Dewey Meeting of the Medical Library Group of Southern Decimal Classification with Supplemental Vocabulary" California and Arizona and the Northern California (co-authored with Diane Vizine-Goetz) at 7th and Nevada Medical Library Group, held in Pasadena, Association for Information Science Classification California, February 6-9, 1996. Workshop at Baltimore, Maryland, October 1996. . "Publishing in an Electronic Age" at the Hickey, T. 
"Publishing Scholarly Journals on the Net" at Professional Engineers Meeting held in Washington, Internet Expo held in San Jose, California, February D.C., June 1996. 1996. . Keynote Address "Delivery of Library Services in a . "Java at OCLC" at American Library Association Networked Age" at the Access '96 Conference held in Annual Meeting held in New York City, July 1996. Vancouver, British Columbia, Canada, September 1996. . "Java and Z39.50" at Windows on the Web: The Shafer, K. "Tools for the Digital Library: Research projects Java Environment, a preconference to the ALA at OCLC" at OHIONET Director's Forum, Akron, Ohio, LITA/LAMA National Conference held in Pittsburgh, October 1996. Pennsylvania, October 1996. . "Scorpion's Use of Java" at Windows on the Web: Miller, E. "Persistent Uniform Resource Locators" at the The Java Environment, preconference workshop of the World Data Center on Terrestrial Ecosystems, Denver, ALA LITA/LAMA National Conference, Pittsburgh, Colorado, January 1996. Pennsylvania, October 1996. . "High Level Semantic Interoperability: The Dublin . "Tools for the Digital Library: Research Projects at Core Element Set" at the World Data Center On OCLC" at CAPCON 1996 Annual Membership Meeting, Terrestrial Ecosystems, Denver, January 1996. Silver Spring, Maryland, June 1996. . "Metadata Encoding Issues and SGML Standard . "Creating DTDs via Fred" at Lexis-Nexis, Generalized Markup Language" at the FGDC Metadata Miamisburg, Ohio, April 1996. Working Group, Denver, Colorado, February 1996. </p><p>109 PATENT, PRESENTATIONS, PUBLICATIONS </p><p>. "Persistent Uniform Resource Locators (PURLs)" at Weibel, S. "PURLs and Metadata,"at Technical Services Lexis-Nexis, Miamisburg, Ohio, April 1996. Directors Meeting, ALA Midwinter, January 12, 1996. . "Persistent Uniform Resource Locators (PURLs)." . "PURLs and Metadata," at the University of Naming Objects in the Digital Library Panel at First Michigan Colloquium, April 16, 1996. 
ACM International Conference on Digital Libraries, . " Mending Our Net: Gathering, Describing, and Bethesda, Maryland, March 1996. Preserving Information in the Digital World," invited . "Creating DTDs via Fred." TEI Workshop, First ACM presentation at the Fifth International WWW Conference, International Conference on Digital Libraries, May 7, 1996. Available at: http://www5conf.inria.fr Bethesda, Maryland, March 1996. /fich_html/invited/ISl/overview.html . "Creating DTDs via Fred" at Columbia University, . URN Panel Discussion for Fifth International WWW February 1996. Conference, May 9, 1996. Available at: http://www5conf. inria.fr/fich_html/panels/panel5. html . "Persistent Uniform Resource Locators (PURLs)" at Chemical Abstracts Service, Columbus, Ohio, February . PURL and Metadata discussions at the Library of 1996. Congress NDLF meeting, May 21-22, 1996. Vizine-Goetz, D. "Using Library Classification Schemes for . "Report on Metadata and PURLs" at American Internet Resources" at OCLC Internet Cataloging Library Association Annual Meeting held in New York Colloquium, at the American Library Association City, July 8, 1996. Midwinter Meeting, San Antonio, Texas, January 1996. . "Introduction to Persistent Uniform Resource . "Feasibility of a Computer-Generated Subject Locators" at the INET96 Annual Meeting of the Validation File" (with Lois M. Chan) at the Midwinter Internet Society. June 25, 1996. Available at: Meeting of the LITA Authority Control in the Online http://info.isoc.org:80/isoc/whatis/conferences/inet/96/ Environment Interest Group. American Library proceedings/a4/a4_l. html Association Midwinter Conference, San Antonio, . "Image Metadata Workshop" Keynote Presentation Texas, January 1996. on the Dublin Core at the INET96 Annual Meeting of . "Feasibility of a Computer-Generated Subject the Internet Society. September 24, 1996. Validation File Based on Frequency of Occurrence of . 
"State of the Dublin Core," ERICIM Metadata Assigned Library of Congress Subject Headings" at Workshop, Bonn, Germany, October 7, 1996. Authority Control in the 21st Century Conference held at OCLC, Dublin, Ohio, March 31-April 1, 1996. . "Libraries and the Internet, Metadata, and PURLs," Nordic Metadata Information Day, Lund, Sweden, . "Classification Research at OCLC" at the Dewey 21 October 11, 1996. Workshop for Library Educators, Dublin, Ohio, May 9-10, 1996. . "Libraries and the Internet," VTLS Library Director's Conference Blacksburg, Virginia, October 22, 1996. . "Online Classification: Implications for Classifying and Document [-like Object] Retrieval" at 4th . "Internet Resource Description," Southwest Special International ISKO conference, Washington, D.C., Library Association, Albuquerque, New Mexico, July 15-18, 1996. October 31, 1996. . "Technical Services Workstations" at OHIONET 19th Annual Program Meeting, Columbus, Ohio, September Publications 1996. Hickey, T. 1996. "The Impact of Electronic Publishing on . "Networked, Internetted Workstations: Productivity Academic Libraries." DESIDOC Bulletin of Information and Knowledge Enhancers in the Center Ring" at Technology, l6 (1): 9-16 (January 1996). preconference workshop held at LITA/LAMA National Noreault, T. and B. Watson. 1996. "Electronic Publishing: Conference, Pittsburgh, Pennsylvania, October 12, 1996. Communication in a Scholarly Environment." . "Library Classification Schemes and Access to Proceedings of the AUUG 96 & Asia Pacific World Wide Electronic Collections: Enhancement of the Dewey Web Conference & Exhibition. Melbourne, Australia. Decimal Classification with Supplemental Vocabulary" (September 18-20, 1996). (co-authored with Jean Godby) at 7th Association for . 1996. "Use of SGML in Digital Libraries." Information Science SIG/CR Classification Research Proceedings of the International Seminar on Digital Workshop at Baltimore, Maryland, October 20, 1996. 
Libraries and Information Services for the 21st Century, Seoul, Korea (September 10-13, 1996). </p><p>110 PATENT, PRESENTATIONS, PUBLICATIONS </p><p>Shafer, K. OCLC PURL Services. 1996. Available at: . 1996. "Feasibility of a Computer-Generated Subject http://purl.oclc.org Validation File Based on Frequency of Occurrence of Assigned Library of Congress Subject Headings." in . Persistent URL demo page. 1996. Available at: Proceedings of Authority Control in the 21st Century: http://purl.oclc.org/OCLC/PURL/demo An Invitation Conference, March 31-April 1, 1996. . The Scorpion Project. 1996. Available at: (Reprinted from Annual Review of OCLC Research http://purl.oclc.org/scorpion 1995). Available at: http://www.oclc.org/oclc/ research/publications/review95/part2/chan.htm . The Kilroy project. 1996. Available at: http;//purl.oclc.org/kilroy . 1996. "Online Classification: Implications for Classifying and Document [-like Object] Retrieval." . Fred: The SGML Grammar Builder. 1996. Available Electronic version of a paper published in Knowledge at: http://www.oclc.org/fred Organization and Change: Proceedings of the 4th . "Scorpion Project Explores Using Dewey to International ISKO Conference, 15-18 July, 1996, Organize the Web." OCLC Newsletter, (222):20, Washington, D.C., Rebecca Green, ed. Frankfurt/Main: July/August 1996. INDEKS Verlag. Available at: www. purl.org/net/vizine/4th_isko . A Brief Introduction to Scorpion. 1996. Available at: http://purl.oclc.org/scorpion/bintro.html . "The Technical Service Workstation and Emerging Technologies: A Researcher's Perspective" in Planning Shafer, K. and R. Thompson. 1996. Scorpion: Dewey and Implementing Technical Services Workstations, Database Design. Available at http://purl.oclc.org/ Michael Kaplan, editor. ALA Editions (in press). scorpion/dewey_db_design.html, 1996. Vizine-Goetz, D. and M. Bendig. "Dewey for Windows: Shafer, K., S. Weibel, J. Fausey, and E. Jul. 1996. 
Accessing the Dewey Decimal Classification from the Introduction to Persistent Uniform Resource Locators. Technical Services Workstation" in Planning and For members only: Available at: http://www.isc)c.org/isc)c/ Implementing Technical Services Workstations, Michael whatis/conferences/inet/96/proceedings/a4/ a4_l .html Kaplan, editor. ALA Editions (in press). Thompson, R., K. Shafer, and D. Vizine-Goetz. 1996. Weibel, S., E. Jul, and K. Shafer. 1996. PURLs: Persistent "Evaluating Dewey Concepts as a Knowledge Base for Uniform Resource Locators. Available at: Automatic Subject Assignment." 2nd ACM International http://www. dlib. org/dlib/july96/07weibel. html Conference on Digital Libraries. Available at: http://purl. oclc.org/scorpion/e val_dc. html Weibel, S. and L. Dempsey. 1996. "The Warwick Metadata Workshop: A Framework for the Deployment of Resource Vizine-Goetz, D. 1996. "Using Library Classification Description." D-Lib Magazine, July 1996. Available at: Schemes for Internet Resources." Proceedings of the http://www.dlib.org/dlib/july96/07weibel. html OCLC Internet Cataloging Colloquium, San Antonio, Texas Qanuary 19, 1996) Available at: http://www.oclc.org/oclc/man/colloq/v-g.htm </p><p>111 PRINCIPAL INVESTIGATORS </p><p>Rebecca Green, Assistant Professor Judith J. Senkevitch, Associate Professor College of Library and Information Services School of Library and Information Science University of Maryland at College Park University of Wisconsin—Milwaukee 2100 Lee Building Milwaukee, WI 53201 College Park, MD 20742-5141 Telephone (414) 229-4707 Telephone (301) 405-2050 Fax (414) 229-4848 Fax (301) 314-9145 rgreen@umd5.umd.edu James H. Sweetland, Associate Professor School of Library and Information Science Stephen P. 
Harter, Professor University of Wisconsin—Milwaukee School of Library and Information Science Box 413, Enderis Hall 1167 LIB 023 Milwaukee, WI 53201 Indiana University Telephone (414) 229-6840 Bloomington, IN 47405-1801 Fax (414) 229-4848 Telephone (812) 855-5113 sweetlnd® csd. u wm.edu Fax (812) 855-6166 harter® indiana.edu Xia Lin, Assistant Professor College of Communication and Information Studies School of Library and Information Science 502 King Library South University of Kentucky Lexington, KY 40506-0039 Telephone (606) 257-8876 Fax (606) 257-4205 xlin@ukcc.uky.edu </p><p>Co-editors: Bradley C. Watson, Joanne Murphy, and Lois Yoakam Graphic Designer: Rick Limes Desktop Publishing: Tammy Miller Editorial Assistance: Mary Faure Administrative Assistance: Suzanne Lauer </p><p>112 CLC is a nonprofit membership • computer library service and research Oorganization whose computer network and services link more than 24,000 libraries in 63 countries and territories. The OCLC Office of Research conducts mission-oriented research to provide the library and information science community with theoretical findings and practical applications. </p><p>For More Information We hope you have enjoyed reading the Annual Review of OCLC Research and invite you to direct comments or questions to the project managers or principal investigators, or contact the OCLC Office of Research. Office of Research OCLC Online Computer Library Center, Inc. ' " ; ; u 6565 Frantz Road Dublin, Ohio 43017-3395 - www.purl.org/oclc/research ''-'''.-'f - '•w.u ''•-.-r" > 1. "-I-- v.- J^--T - *- '£ •v'V . ,, sir ' -.4 '•f , '•-iV.fT-'''';-^,^"'.^ '.-v-.f * J ' •-•Si:-^'',•>' >'•?'"%•--•(- •*• ^•'-^• -,' J.> - J >"v »»4 ^ 7: € -'I•S"y >^^*7-' li?•». *• 1 ' i* i «H " ' , s ,«V ^ J V*-> - AiQi' </p><p>-' .' ••» .•<- • ^ ^ * f« --.'j^ '•i ^ .• -' r. >* V -^•»' , f .••i"A ••.r% •*« 1 •> '- !'•- i ^ i l •. '*' J f ^ 4 \ . ^ f- *• ^ > V' " s * ^ I T ^ . V ^ ^ . 
</p><p>OCLC Online Computer Library Center, Inc. 6565 Frantz Road Dublin, OH 43017-3395 USA 1-614-764-6000 www.oclc.org Product Code MAN8430 ISSN 0894-198X 9705/9911Q-8M. B&S </p><p>
</p> </div> </article> </div> </div> </div> </html>