Knowledge Organization for a Global Learning Society

Advances in Knowledge Organization, Vol. 10 (2006)

Knowledge Organization for a Global Learning Society

Proceedings of the Ninth International ISKO Conference 4-7 July 2006 Vienna, Austria

Edited by

Gerhard Budin Christian Swertz Konstantin Mitgutsch

ERGON VERLAG

Predocumentation

The volume contains: Introduction – Information Systems and Learning in a Global Society: Concepts, Design and Implementation – Global Society and Learning in Theories of Knowledge and Knowledge Organization – Multilingual problems of information retrieval – Representations of Educational and didactical knowledge – Theoretical basis of knowledge organization: universal vs. local solutions – Users and uses of knowledge organization – Ontologies – KO for non print multimedia – Linguistic and cultural approaches

Bibliographic information from Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.ddb.de.

© 2006 ERGON Verlag · Dr. H.-J. Dietrich, D-97080 Würzburg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in other ways, and storage in databanks. Duplication of this publication or parts thereof is only permitted under the provisions of the German Copyright Law; a copyright fee must always be paid.
Cover Design: Jan von Hugo
www.ergon-verlag.de

Printed in Germany
ISBN-10 3-89913-523-7
ISBN-13 978-3-89913-523-7
ISSN 0938-5495

Contents

Introduction ...... 9-10

1 Information Systems and Learning in a Global Society: Concepts, Design and Implementation

Claudio Gnoli
The meaning of facets in non-disciplinary classifications ...... 11-18
Jeff Gabel
Improving Information Retrieval of Subjects through Citation- : a Study ...... 19-26
Aaron Loehrlein, Richard Martin, and Edward L. Robertson
Integration of International Standards in the Domain of Manufacturing Enterprise ...... 27-34
Charles Abiodun Robert and Amos David
Annotation and its application to information research in economic intelligence ...... 35-40
Shawne D. Miksa, William E. Moen, Gregory Snyder, Serhiy Polyakov, and Amy Eklund
Metadata Assistance of the Functional Requirements for Bibliographic Records’ Four User Tasks: a report on the MARC Content Designation Utilization (MCDU) Project ...... 41-50
Dimitris A. Dervos and Anita Coleman
A Common Sense Approach to Defining Data, Information, and Metadata ...... 51-58
Markus F. Peschl
Knowledge-Oriented Educational Processes. From Knowledge Transfer to Collective Knowledge Creation and Innovation ...... 59-70
Ricardo Eíto Brun
Retrieval effectiveness in software repositories: from faceted classifications to software visualization techniques ...... 71-76
Olha Buchel
Uncovering Hidden Clues about Geographic Visualization in LCC ...... 77-84
Kerstin Zimmermann, Julika Mimkes, and Hans-Ulrich Kamke
An Ontology Framework for e-Learning in the Knowledge Society ...... 85-92
Mikel Breitenstein
Global Unity: Otto Neurath and the International Encyclopedia of Unified Science ...... 93-100
Athena Salaba, Marcia L. Zeng, and Maja Zumer
Functional Requirements for Subject Authority Records ...... 101-106

2 Global Society and Learning in Theories of Knowledge and Knowledge Organization

Jack Andersen
Social change, modernity and bibliography: Bibliography as a document and a genre in the global learning society ...... 107-114
Ágnes Hajdu Barát
Usability and the user interfaces of classical information retrieval languages ...... 115-122
Judith Simon
Interdisciplinary Knowledge Creation – Using Wikis in Science ...... 123-130
Alon Friedman
Concept Mapping a measurable sign ...... 131-140
Chaim Zins
Knowledge Map of Information Science: Issues, Principles, Implications ...... 141-150
Rebecca Green
Semantic Types, Classes, and Instantiation ...... 151-158
Clare Beghtol
The Global Learning Society and the Iterative Relationship between Theory and Practice in Knowledge Organization Systems ...... 159-164

3. Multilingual problems of information retrieval

Elaine Menard
Image Retrieval in Multilingual Environments: Research Issues ...... 165-172
Graciela Rosemblat and Laurel Graham
Cross-Language Search in a Monolingual Health Information System: Flexible Designs and Lexical Processes ...... 173-182
Susanna Keränen
Equivalence and focus of translation in multicultural thesaurus construction ...... 183-194
Marianne Dabbadie and Jean-Marc Blancherie
Alexandria, a multilingual dictionary for Knowledge Management purposes ...... 195-204

4. Representations of Educational and didactical knowledge

Elin K. Jacob, Hanne Albrechtsen, and Nicolas George
Empirical analysis and evaluation of a metadata scheme for representing pedagogical resources in a digital library for educators ...... 205-212
Nancy J. Williamson
Knowledge Structures and the Internet: Progress and Prospects ...... 213-224
B. H. Kwaśnik, Y.-L. Chun, K. Crowston, J. D’Ignazio, and J. Rubleske
Challenges in Creating a Taxonomy of Genres of Digital Documents ...... 225-232

Hur-Li Lee
Navigating Hierarchies vs. Searching by Keyword: Two Cultural Perspectives ...... 233-240
Maria Teresa Biagetti
Indexing and scientific research needs ...... 241-246
Babajide Afolabi and Odile Thiery
Using Users’ Expectations to Adapt Business Intelligence Systems ...... 247-254
Aaron Loehrlein, Elin K. Jacob, Seungmin Lee, and Kiduk Yang
Development of Heuristics in a Hybrid Approach to Faceted Classification ...... 255-262
Michèle Hudon
Structure, , and semantics for Web-based collections in education ...... 263-270
Catalina Naumis Peña
Evaluation of Educational Thesauri ...... 271-278

5. Theoretical basis of knowledge organization: universal vs. local solutions

Martin Thellefsen
The dynamics of information representation and knowledge mediation ...... 279-286
Jian Qin, Peter Creticos, and Wen-Yuan Hsiao
Adaptive Modeling of Workforce Domain Knowledge ...... 287-294
Julianne Beall and Diane Vizine-Goetz
Finding Fiction: Facilitating Access to Works of the Imagination Scattered by Form and Format ...... 295-302
Joseph T. Tennis
Function, Purpose, Predication, and Context of Information Organization Frameworks ...... 303-310
Edmund JY Pajarillo
A qualitative research on the use of knowledge organization in nursing information behavior ...... 311-322
Ia C. McIlwaine and Joan S. Mitchell
The new ecumenism: Exploration of a DDC/UDC view of religion ...... 323-330

6. Users and uses of knowledge organization

María J. López-Huertas
Thematic map of interdisciplinary domains based on their terminological representation. The Gender Studies ...... 331-338
Edmund JY Pajarillo
A classification scheme to determine medical necessity: A knowledge organization global learning application ...... 339-348

Steven J. Miller, Melodie J. Fox, Hur-Li Lee, and Hope A. Olson
Great Expectations: Professionals’ Perceptions and Knowledge Organization Curricula ...... 349-358
Kathrin La Barre
A multi-faceted view: Use of facet analysis in the practice of website organization and access ...... 359-366
Xia Lin, Serge Aluker, Weizhong Zhu, and Foster Zhang
Dynamic Concept Representation through a Visual Concept Explorer ...... 367-374
Victoria Frâncu
Subjects in FRBR and Poly-Hierarchical Thesauri as Possible Knowledge Organizing Tools ...... 375-382

7. Ontologies

Richard P. Smiraglia
Empiricism as the Basis for Metadata Categorisation: Expanding the Case for Instantiation with Archival Documents ...... 383-388
Carol A. Bean
Hierarchical Relationships Used in Mapping between Knowledge Structures ...... 389-394

8. KO for non print multimedia

Francisco Javier García Marco
Understanding the categories and dynamics of multimedia information: a model for analysing multimedia information ...... 395-404
Rob Hidderley and Pauline Rafferty
Flickr and Democratic Indexing: Disciplining Desire Lines ...... 405-412

9. Linguistic and cultural approaches

Blanca Rodríguez Bravo
The Visibility of Women in Indexing Languages ...... 413-422
Florian Kohlbacher
Knowledge Organization(s) in Japan – Empirical Evidence from Japanese and Western Corporations ...... 423-434
Ann Doyle
Naming and Reclaiming Indigenous Knowledges in Public Institutions: Intersections of Landscapes and Experience ...... 435-442

Proceedings

“Knowledge Organization for a Global Learning Society”

9th International Conference International Society for Knowledge Organization (ISKO-09)

Introduction

The 9th International Conference of the International Society for Knowledge Organization (ISKO-09) was held at the Center for Translation Studies, University of Vienna, July 4th-7th 2006. The general conference topic focused on the role of knowledge organization methods and processes in educational situations, including E-Learning and the didactical presentation and organization of knowledge. The intention of the conference was to actively contribute to the vision of a “Global Learning Society”. The refereed papers selected for this book provide interesting new thoughts, evidence, and insights.

This volume contains selected papers contributing to current research in the following topics:

– Information Systems and Learning in a Global Society: Concepts, Design and Implementation
– Global Society and Learning in Theories of Knowledge and Knowledge Organization
– Multilingual Problems of Information Retrieval
– Representations of Educational and Didactical Knowledge
– Theoretical Basis of Knowledge Organization: Universal vs. Local Solutions
– Users and Uses of Knowledge Organization
– Ontologies and Fundamentals for Knowledge Ordering
– Knowledge Organization for Non-print and Multimedia Information
– Linguistic and Cultural Approaches to Knowledge Organization in a Global Learning Society

We would like to thank a number of persons who helped us in preparing not only the conference but also this proceedings volume, in particular Professor Ia C. McIlwaine, who gave us a lot of important advice and who played a crucial role in the selection and review process. We are also grateful to the other members of the review committee, Prof. Marcia L. Zeng, Dr. Hanne Albrechtsen, and Dr. Peter Ohly, for their important work. As far as organizational matters are concerned, we would like to thank Ms Barbara Wallner and Dr. Margit Sandner for their tremendous help and excellent work.

Gerhard Budin Christian Swertz Konstantin Mitgutsch

Vienna, June 9th 2006

Claudio Gnoli

The meaning of facets in non-disciplinary classifications

Abstract: Disciplines are felt by many to be a constraint in classification, though they are a structuring principle of most bibliographic classification schemes. A non-disciplinary approach has been explored by the Classification Research Group, and research in this direction has recently been resumed by the Integrative Level Classification project. This paper focuses on the role and definition of facets in non-disciplinary schemes. A generalized definition of facets is suggested with reference to predicate logic, allowing facets of phenomena as well as facets of disciplines. The general categories under which facets are often subsumed can be related ontologically to the evolutionary sequence of integrative levels. As a facet can be semantically connected with phenomena from any other part of a general scheme, its values can belong to three types, here called extra-defined foci (either special or general) and context-defined foci. Non-disciplinary freely faceted classification is being tested by applying it to small bibliographic samples stored in a MySQL database and by developing Web search interfaces to demonstrate possible uses of the described techniques.

In memory of Douglas Foskett, whose writings are far greater than the sum of their words.

1: Introduction

Several authors in classification research have remarked at various times that disciplines are an arbitrary constraint on classification schemes, and produce obstacles to cross-disciplinary indexing and searching. Among them are Brown (1906, 8-11), Kyle (1959), Foskett (1961, 138-139; 1970), Austin (1969), Beghtol (1998), Williamson (1998), Szostak (2004, 221), and López-Huertas (pers. comm.). Defining classes as disciplines and their subdivisions is a top-down approach: first the universe of knowledge is cut into a certain number of fields, then each of them is further subdivided, and so on. An alternative bottom-up approach has been advocated by members of the Classification Research Group (CRG). In particular, Farradane (1961) claimed that classification should follow the inductive method of the sciences, starting from simple concepts (isolates) and combining them by relational operators. Foskett (1970), Austin (1969), and others explored the possibility of a new general classification scheme based on phenomena of the real world instead of disciplines: phenomena can be arranged in a sound order by the integrative level to which they belong, and class marks can be obtained by combining the constant notations of each compounding phenomenon (Gnoli & Poli, 2004). On the other hand, the main advancement in 20th-century classification theory, facet analysis (Vickery, 1960), has usually been applied to disciplinary schemes. In these, each facet is defined as belonging to one general category, such as Objects, Parts, Properties, Materials, Actions, Operations, Agents, Space, and Time. Within each discipline, a given category, e.g. Operations, takes a more specific meaning to give a facet, e.g. Therapies in medicine or Working in technology. Therefore, the meaning of a facet depends on its disciplinary context. What, then, is the meaning of facets when they no longer refer to discipline classes, but to phenomenon classes?
The question is even more relevant, as faceted classification is now becoming popular in information architecture as an effective model for structuring search interfaces and website menus. Several websites and software applications, including file management systems and email management systems, claim to adopt faceted classification, though often their interpretation of facet theory is partial or their terminology inappropriate (La Barre, 2004). In many cases, what is indexed in these sites and applications is not disciplinary knowledge but concrete objects, like wines or cars to be sold, or factual information, like that on working processes in an enterprise. In these situations, the meaning of facets cannot depend on a limited list of disciplines, but has to be understood in a more flexible sense.

2: A predicate logic model of facets

As an example, let’s suppose that a disciplinary classification uses botany as a main class, where a phenomenon classification uses plants as a corresponding main class. It is not only a matter of formal definition. Indeed, for the class botany some facets can be defined, such as Methods or History, which make no sense for the class plants: clearly dissection, a possible focus in the facet Methods, is a facet of botany, not of plants themselves, as plants alone do not dissect anything. The class plants will rather have facets like Organs, Growth stages, Diseases, Habitat, etc. The Web catalogue of a gardening shop could use facets of plants, while a Web directory of botanical resources could use facets of botany: some of the facets in the two sets can coincide (like Organs), while others can differ (like Methods). To describe this dependence of facets on their basic class, a useful model is predicate logic. Predicate logic describes the elements of a language in terms of predicates and their arguments. Each predicate can provide for a few arguments:

moving [P] of object [A1] towards destination [A2] from provenance [A3] by means [A4]

so that a sentence like “I go to Vienna from Pavia by train” is translated as “moving me Vienna Pavia train”. Arguments A1, A2, etc. can be identified by their position (so that Vienna will be understood as the destination because it is the second argument) or, in case arguments are not all expressed mandatorily, by a special marker. As “argument markers”, natural languages use prepositions, like towards, or cases, like the genitive ’s. Predicate logic is used to structure various kinds of languages, like programming languages or logical artificial languages such as Loglan and Lojban (Cowan, 1997), and I suggest that it can be usefully applied to indexing languages as well. Indeed, in a faceted classification scheme, any class can be considered as a predicate, and provided with a set of potential arguments which are its facets, giving a facet formula:

plants [P] with organ [A1] at growth stage [A2] affected by disease [A3] living in habitat [A4]

In this way, virtually any phenomenon, not only disciplinary main classes, can provide for facets. Disciplines can be seen just as a special kind of phenomena (Gnoli, 2005), providing for facets like Methods or History. The function of argument markers can be carried out by facet indicators, like punctuation marks in Colon Classification or letters in FATKS. Predicate logic also allows for nested predicates, which correspond to Ranganathan’s “rounds” and “levels”. Allowing any class to be a predicate with potential arguments agrees with the emphasis on relations and dynamics, rather than on “substance”, recently claimed in philosophical trends such as process ontology (Seibt, 2004). The word phenomenon is neutral enough to include both substantial and relational aspects. A phenomenon can be thought of as defined by a set of relationships, both internal and with other phenomena, of which the most relevant and typical are expressed by its facets. This idea is close to that of system, which was also considered by the CRG as a tool for framing classification schemes: indeed, according to Bunge (1979) a system consists of a composition, an environment, and a structure.
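This predicate-with-arguments view of a faceted class can be illustrated with a short sketch. Python is used here purely as a modelling notation, and the class and facet names are invented for the example; nothing below is taken from an actual ILC schedule.

```python
# A faceted class modeled as a predicate with named argument slots (facets).
# Illustrative sketch only: names are invented for the example.

class FacetedClass:
    def __init__(self, name, facets):
        self.name = name        # the predicate, e.g. "plants"
        self.facets = facets    # ordered facet slots (the "place structure")

    def statement(self, **foci):
        """Render a compound subject: the predicate followed by its
        expressed arguments, each introduced by its facet name acting
        as an 'argument marker'."""
        parts = [self.name]
        for facet in self.facets:   # slots are emitted in declared order
            if facet in foci:
                parts.append(f"{facet}:{foci[facet]}")
        return " ".join(parts)

plants = FacetedClass("plants", ["organ", "growth_stage", "disease", "habitat"])
s = plants.statement(disease="smut", organ="leaves")
# -> "plants organ:leaves disease:smut"
```

Note that the facet names play the role of argument markers, and that the declared slot order, not the order in which the caller supplies foci, determines the output sequence, just as a place structure does in Lojban.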

Though the arguments of a predicate can be defined freely, experience with logical languages suggests that it is useful to state some conventional sequences for the “place structure” of predicates conveying analogous meanings, like “moving”, “going”, “travelling” etc.: “The places tend to appear in decreasing order of psychological saliency or importance. There is an implication within the place structure of klama, for example, that lo klama (the one going) will be talked about more often, and is thus more important, than lo se klama (the destination), which is in turn more important than lo xe klama (the means of transport)” (Cowan, 1997, sect. 12.16, 295). Clearly, this is equivalent to what is called citation order in bibliographic classification. This brings us to the identification of semantic categories of the kind used in traditional faceted classification, like Parts, Materials, Operations, etc.: they can be introduced by a constant facet indicator, working as an approximation of their syntactical role. For example, all dynamical aspects, like “at growth stage...” or “by locomotion mechanism...”, can be represented by Process facets, and be introduced by a colon. The following section deals with the status of categories in a phenomenon scheme based on integrative levels: readers more interested in technical applications may want to skip it.

3: The ontological levels of general categories

The general scheme of integrative levels, as drafted by the CRG with reference to several philosophers, ideally reflects an evolutionary sequence from more primitive and simple phenomena to more and more organized and complex ones: particles, atoms, molecules, crystals, cells, organisms, societies, institutions, cultures, etc. (Foskett, 1970). We have seen how each facet can refer to a standard general category, like Properties or Operations. These categories are usually taken as a priori. However, in the perspective of integrative levels, many of them can refer to the specific level where they first appear, rather than being postulated as prior to any knowledge and valid throughout the whole scheme. This idea is also present in the ontology of Nicolai Hartmann (1952). Categories like Material, Beneficiary, Product and By-product (used e.g. in BC2) all imply some technological activity carried out with some purpose, at least in a broad sense. This can only apply from the level of complex organisms, like a bird building a nest, onwards. For phenomena at lower ontological levels, they make no sense: molecules or rocks are not products and have no beneficiary, unless one wants to introduce some metaphysical statement into classification. Even some human activities, like mental processes or language, have no material. Similarly, the categories Operation and Agent imply some voluntary activity by a living being, which can apply to human and animal behaviour but not to plants or simpler phenomena (Gnoli & Poli, 2004). The Process (or Energy in Ranganathan’s terms) category has a more general status, as it can be connected with any dynamical phenomenon, i.e. anything showing some change in time. Electron exchange between atoms can be described as a Process. Energy is a very universal concept; even matter is connected with energy by relativity theory.
At present, it looks like one of the most general concepts in human knowledge; maybe only information, taken in its physico-mathematical sense, can be said to be prior to it. Space and Time may look like the most universal categories, as they are kept unchanged in most classification schemes: they act as universal coordinates onto which any subject can be mapped, as in Kant’s a priori categories. While in practice this is often effective, it should be remembered that in physics space and time are not absolute concepts; in general relativity, space-time is an aspect of gravitation. Therefore, even these categories appear at a given moment in the cosmic evolution, however early that moment is, and some phenomena exist, like certain forms of energy, for which they do not apply. As noticed by Paul Davies, questions such as

“what was there before the Big Bang?” are nonsense, as time itself began with the Big Bang. Therefore, using Space and Time as absolute categories can be problematic in classifying high-energy physics and cosmology. All the categories examined so far refer to more or less general aspects of the physical world. We are left with Entity, Kind, Part and Property (or Personality and Matter/property in Ranganathan’s terms). Here we really seem to enter a universal realm, that of the logic by which any knowledge is organized. Property (also called Attribute) is a very basic category which is difficult to do without in any knowledge organization language. In mathematical terms, properties are operations having “true” or “false” as a result, so they refer to the basic concepts of relation and reality. Entities and their Kinds are dealt with by set theory, which is also fundamental in mathematics. Finally, entities and relations look like two basic notions which cannot be avoided in any language.

4: Free classification by integrative levels

Despite being invoked and studied, the phenomenon approach has not yet given birth to any complete bibliographic classification scheme. This also seems to be due to reasons internal to the CRG, where the disciplinary approach eventually prevailed and produced BC2. The ISKO Italian chapter has been applying non-disciplinary classification within the Integrative Level Classification (ILC) research project. The project aims at experimenting with phenomenon classification in several small bibliographical samples in various domains, each consisting of some hundreds of records, and at exploiting its structure and notation through search interfaces (ISKO Italia, 2004; Hong, 2005). ILC classmarks are built following the method already drafted by the CRG: notations for the phenomena occurring in a subject are combined, by simply listing them in inverted order of integrative levels. Hence “Italian vineyards” can be analyzed into cultivation : grapes : Italy and expressed by the corresponding notation S Mpfdg Kl^ei. Such a technique, already used by Brisch (Kyle, 1956), was referred to as free classification (Gardin, 1965; Gnoli & Hong, submitted). The first experiments have shown that free classification can work quite well for broad domains, like the whole culture and environment of a rural area (Gnoli & Merli, 2005). However, when documents are more specialized and their subjects become more complex, simple juxtaposition of all the component phenomena can yield combinations both cumbersome and ambiguous. To take a case really encountered in the tests, the expression villages : cultivation : words : grapes : Oltrepò pavese, where phenomena are listed freely in inverted order of integrative levels, is not a satisfying synthesis of a book entitled “The dialect of the village of Portalbera, and wine terminology in Oltrepò pavese”.
Indeed, when several component phenomena occur, the relations holding between them (cultivation of grapes, words referring to cultivation) and their order should be expressed more specifically, in order to yield the sense of the combination. These functions are usually carried out by facets and their place structure: therefore, for detailed indexing in a specific domain, facets are needed.
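The basic free-classification step described in this section can be sketched in a few lines, under the simplifying assumption that the alphabetical order of ILC notations mirrors the order of integrative levels (so that a reverse lexicographic sort approximates "inverted order of levels"). A real implementation would derive the ordering from the schedule itself; the sort here is only a stand-in.

```python
# Free classification sketch: combine the notations of the component
# phenomena in inverted order of integrative levels.
# Assumption (for illustration only): notation order mirrors level order,
# so reverse lexicographic sorting approximates inverted level order.

def free_classmark(notations):
    return " ".join(sorted(notations, reverse=True))

# "Italian vineyards" = cultivation : grapes : Italy
mark = free_classmark(["Mpfdg", "S", "Kl^ei"])
# -> "S Mpfdg Kl^ei", matching the example in the text
```

As the text notes, such juxtaposition works for broad domains but becomes ambiguous for complex subjects, which is exactly what motivates the facets of the next section.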

5: The place of definition of facets

On the basis of the predicate logic approach, we can understand facets as the typical relations of a phenomenon with other phenomena: plants typically have organs (roots, stem, leaves, etc.), are in some growth stage (seed, bud, adult, etc.), can be affected by some disease (smut, peronospora, oidium, etc.), live in some habitat (desert, meadow, forest, etc.); cars typically have some bodywork (saloon, station wagon, coupé, etc.), use some fuel (petrol, diesel oil, methane, etc.), are made by some firm, etc. These facets can be expressed by categories (organs are Parts of plants, firms are Agents of cars). In Austin’s draft of an integrative level classification, relations between phenomena could be combined by using operators, which expressed more precisely the syntactical relations holding between them. Austin speaks of a freely faceted classification scheme (though it can be discussed whether his operators resemble classical facets or phase relationships more closely); later, his draft evolved into the PRECIS subject heading system. He also argued that this kind of technique would be suited to machine processing of subjects but not to shelving, as it would produce very complex and long notations (Austin, 1976; 1979). However, I suggest that notation can be shortened by using extra-defined foci, in the way described below. Let’s take the integrative level of cultivation (i.e. the phenomenon consisting of all the agricultural activities performed by humans and their products), S. One obvious facet of it, S6, is the cultivated species (the ILC test schemes use digits as facet indicators). Now, what is cultivated is always a plant – it cannot be a lake or a skyscraper: therefore, notation for the foci of S6 can be borrowed from the subclasses of plants, Mp, as in the “divide like” instructions of traditional classification manuals, instead of being enumerated again in the schedule.
As the broad part Mp of notation is implied in the formal definition of facet S6 (written in square brackets in the test schedules, and stored in a special field in the database), it is omitted, and only notation for its subclass “grapes” is expressed:

Mp plants
  ...
Mpfdg grapes
  ...
S cultivation
S6 [Mp] cultivation of species...
S6fdg cultivation of grapes
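The compression and expansion of such special extra-defined foci amounts to simple prefix arithmetic on notations. The following sketch mirrors the S6 [Mp] example; the function names and the one-entry table are invented for the illustration.

```python
# Special extra-defined foci: the focus notation is borrowed from the
# subclasses of an implied base class, whose prefix is dropped.
# Sketch only; the table mirrors the S6 [Mp] example from the text.

FACET_BASES = {"S6": "Mp"}   # facet -> implied base class

def focus_notation(facet, full_class):
    """Compress a borrowed class into a focus of the given facet."""
    base = FACET_BASES[facet]
    assert full_class.startswith(base), "focus must be a subclass of the base"
    return facet + full_class[len(base):]

def focus_source(facet_mark):
    """Expand a compressed classmark back to (facet, borrowed class)."""
    for facet, base in FACET_BASES.items():
        if facet_mark.startswith(facet):
            return facet, base + facet_mark[len(facet):]
    raise KeyError(facet_mark)

mark = focus_notation("S6", "Mpfdg")   # cultivation of grapes -> "S6fdg"
```

The expansion direction is what a retrieval system would use: seeing S6fdg, it can recover that the focus is the class Mpfdg, so a search for grapes can also match records about their cultivation.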

The use of extra-defined foci throughout a classification scheme should produce a notation reasonably short, and at the same time expressive, therefore exploitable by a machine information retrieval system in order to produce relevant results and to display them in helpful sequences. Extra-defined foci will usually be special, that is they will be borrowed from a special class, like plants in the example above. Sometimes, however, they can be general, that is, they will be borrowed from any other class of the general scheme; in the latter case, of course, their notation must be copied entirely. Examples of general extra-defined foci are perceived objects as a facet of perception, meanings as a facet of language, subjects as a facet of documents; indeed, any phenomenon in the known universe can in principle be perceived, or spoken about, or dealt with:

T artefacts

Ne perception
Ne7 [] perception of object...
Ne7mp perception of plants
Ne7t perception of artefacts

Finally, some other foci need to be defined only within the facet itself, instead of being borrowed from other classes. This happens e.g. in music with the facet “played by instrument”: indeed, its foci are instruments, like guitar, fiddle, bagpipe, etc., which only exist as such in the context of music, and not as independent phenomena at different integrative levels; in Farradane’s terms, their “place of unique definition” is the facet itself. Another example is organs of animals: there is no arm or liver other than as a facet of some animal. These can be called context-defined foci:

Xi music
Xi6 [:] music played by instrument...
Xi6k music played by stringed instruments
Xi6n music played by wind instruments

To summarize, a facet can have

[A-Z] special extra-defined foci
[] general extra-defined foci
[:] context-defined foci

In disciplinary classifications, facets apply to all the subclasses included in the hierarchical tree of a discipline. In phenomenon classification, in a similar way, facets can apply to any subclass of the phenomenon for which they are defined: if Mq6 means “organs of animals”, and Mqvo are “birds”, then Mqvo6 will automatically mean “organs of birds”. However, in some cases conflicts may occur:

X arts
X5 arts produced by technique...

Xi music
Xi5 music with rhythm...

Xiv compositions
Xiv5 ??

Does Xiv5 mean “compositions produced by technique...”, or “compositions with rhythm...”? A reasonable solution seems to be that the latter is valid, on the basis of a principle of cascading facets: that is, the definition of a facet at a more specific class prevails over that at a more generic class, as happens in cascading style sheets for websites.
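A toy resolver for this cascading principle simply walks from the most specific class upward through its notational ancestors until a definition is found. The facet table below mirrors the X / Xi example above; everything else is invented for the sketch.

```python
# Cascading facets: a facet indicator is resolved against the most
# specific class that defines it, walking up the notation hierarchy.
# Toy sketch; the definitions mirror the X / Xi example in the text.

FACET_DEFS = {
    "X":  {"5": "produced by technique..."},
    "Xi": {"5": "with rhythm..."},
}

def resolve_facet(cls, indicator):
    """Return the facet definition for `cls`, letting the definition at
    the more specific class prevail (as in cascading style sheets)."""
    while cls:
        defs = FACET_DEFS.get(cls)
        if defs and indicator in defs:
            return defs[indicator]
        cls = cls[:-1]   # climb to the parent class by trimming notation
    raise KeyError(indicator)

meaning = resolve_facet("Xiv", "5")
# Xiv has no own definition of facet 5, so Xi's "with rhythm..." prevails
# over X's "produced by technique...".
```

The same walk also yields the automatic inheritance described earlier: since Mqvo (birds) falls under Mq (animals), a lookup for Mqvo6 would climb to Mq and find “organs of animals”.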

6: Ongoing work

The test bibliographical samples are currently under development in the framework of the ILC research project. A bibliography on local culture and another on the literature of faceted classification itself allow us to perform test searches through Web interfaces (ISKO Italia, 2004). The former bibliography is indexed mainly by free classification, with some experiments with facets, while the latter is indexed mainly by freely faceted classification. Bibliographic references are stored in a MySQL database, and the Web search interfaces are developed in PHP. More PHP code is currently being written in order to manage facets and foci of the three kinds described above, and to exploit them consistently in processing user queries (Gnoli & Hong, submitted). Another planned tool is a Web assistant for easy construction of classmarks through the selection and combination of terms from the schedule database: this could be useful to show users in a practical way how the whole system works, without forcing them to dive into theoretical explanations.
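One way an expressive notation of this kind can be exploited in query processing is prefix matching on classmarks: a query for a class (or a facet of it) matches every record filed under that notation or any of its subdivisions. The following sketch illustrates the idea in plain Python (the project itself uses PHP over MySQL; the records here are invented):

```python
# Prefix retrieval over classmarks: a query for a class or facet matches
# every record whose mark begins with that notation, so subdivisions are
# retrieved automatically. Sketch only; records are invented.

RECORDS = [
    ("S6fdg", "cultivation of grapes"),
    ("S6",    "cultivation of species in general"),
    ("Mpfdg", "grapes"),
]

def search(prefix):
    """Return titles of records classed under the given notation."""
    return [title for mark, title in RECORDS if mark.startswith(prefix)]

hits = search("S6")
# retrieves both the general facet record and its grape-specific focus
```

In a SQL setting the same effect is commonly obtained with a LIKE 'prefix%' condition on the classmark column.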

Each special domain to be indexed can be connected to the rough general scheme of integrative levels through the definition of a preferred class. Facing practical classification problems in fields of knowledge at different integrative levels will help us check and tune the solutions suggested by the general model. As can be seen, the ILC project involves several aspects: application of the theory of integrative levels to the arrangement of classes; syntactical issues in the definition of facets; building a notation suitable for exploitation in a digital environment; developing usable interfaces; etc. Apart from specific problems and details, the whole experimentation is expected to show the possibility of non-disciplinary classification, at once making full use of facet analysis and freeing classifiers and users from the limitations of the disciplinary approach.

Acknowledgements Many thanks go to Rick Szostak and Lorena Zuccolo for their helpful suggestions for improving previous drafts. This paper builds on experience gained with Hong Mei, Gabriele Merli and the other people in the ILC project, and on discussions with And Rosta about the syntax of logical languages.

References
Austin, D. (1969). Prospects for a new general classification. Journal of librarianship, 1, 149-169.
Austin, D. (1976). The CRG research into a freely faceted scheme. In A. Maltby (Ed.), Classification in the 1970s: A second look (pp. 158-194). London: Bingley.
Austin, D. (1979). Differences between library classifications and machine-based subject retrieval systems: Some inferences drawn from research in Britain, 1963-1973. In A. Neelameghan (Ed.), Ordering information systems for global information networks: Proceedings 3rd international study conference on classification research, Bombay, 1975 (pp. 326-340). Bangalore: FID CR, SRELS.
Beghtol, C. (1998). Knowledge domains: Multi-disciplinarity and bibliographic classification systems. Knowledge organization, 25, n. 1-2, 1-12.
Brown, J.D. (1906). Subject classification. London: Library Supply.
Bunge, M. (1979). Ontology 2: A world of systems. Dordrecht, Boston, London: Reidel.
Cowan, J.W. (1997). The complete Lojban language. Fairfax (Virginia): The Logical Language Group. Draft version retrieved February 11, 2006, from http://www.lojban.org/tiki/tiki-index.php?page=The+Lojban+Reference+Grammar&bl
Farradane, J.E. (1961). Fundamental fallacies and new needs in classification. In D.J. Foskett & B.I. Palmer (Eds.), The Sayers memorial volume (pp. 120-135). London: Library Association.
Foskett, D.J. (1961). Classification and integrative levels. In D.J. Foskett & B.I. Palmer (Eds.), The Sayers memorial volume (pp. 136-150). London: Library Association. Republished in Chan, L.M., Richmond, P.A. & Svenonius, E. (Eds.), Theory of subject analysis: A sourcebook (pp. 210-220). Littleton (Col.): Libraries Unlimited, 1985.
Foskett, D.J. (1970). Classification for a general index language: A review of recent research by the Classification Research Group. London: Library Association.
Gardin, J.C. (1965). Free classifications and faceted classifications: Their exploitation with computers. In P. Atherton (Ed.), Classification research: Proceedings of the international conference, Elsinore, 1964 (pp. 161-176). Copenhagen: Munksgaard.
Gnoli, C. (2005). BC2 classes for phenomena: An application of the theory of integrative levels. The Bliss classification bulletin, 47, 17-21. Also in DLIST, retrieved February 11, 2006, from http://dlist.sir.arizona.edu/920/
Gnoli, C., & Hong, M. (submitted). Freely faceted classification for Web-based information retrieval.
Gnoli, C., & Merli, G. (2005). Notazione e interfaccia di ricerca per una classificazione a livelli [Notation and search interface for a levels-based classification]. AIDA informazioni, 23, n. 1-2, 57-72. English abstract retrieved February 11, 2006, from http://www.aidainformazioni.it/2005/122005.html#articoli
Gnoli, C., & Poli, R. (2004). Levels of reality and levels of representation. Knowledge organization, 31, n. 3, 151-160.
Hartmann, N. (1952). New ways of ontology. Westport: Greenwood Press.
Hong, M. (2005). A phenomenon approach to faceted classification [in Japanese]. In 53rd Conference of the Japan Society of Library and Information Science, Keio University, October 22-23, 2005. English abstract in ISKO Italia, retrieved February 11, 2006, from http://www.iskoi.org/ilc/phenomenon.htm
ISKO Italia (2004). Integrative level classification: Research project. Retrieved February 15, 2006, from http://www.iskoi.org/ilc/
Kyle, B. (1956). E.G. Brisch: Something new in classification. Special libraries, 47, n. 3, 100-105.
Kyle, B. (1959). An examination of some of the problems involved in drafting general classifications and some proposals for their solution. Review of documentation, 26, n. 1, 17-21.
La Barre, K. (2004). Adventures in faceted classification: A brave new world or a world of confusion? In I. McIlwaine (Ed.), Knowledge organization and the global society: Proceedings 8th international ISKO conference, London, July 13-16, 2004 (pp. 79-84). Würzburg: Ergon.
Seibt, J. (Ed.) (2004). Process theories: Cross-disciplinary studies in dynamic categories. Dordrecht: Kluwer.
Szostak, R. (2004). Classifying science: Phenomena, data, theory, method, practice. Berlin: Springer.
Vickery, B. (1960). Faceted classification: A guide to construction and use of special schemes. London: Aslib.
Williamson, N.J. (1998). An interdisciplinary world and discipline based classification. In W. Mustafa el Hadi, J. Maniez & S. Pollitt (Eds.), Structures and relations in knowledge organization: Proceedings 5th ISKO conference, Lille, August 25-29, 1998 (pp. 115-124). Würzburg: Ergon.

Jeff Gabel
Long Island University, Brooklyn Campus
1 University Plaza
Brooklyn, NY 11201

Improving Information Retrieval of Subjects through Citation-Analysis: A Study

Abstract: Citation-chasing is proposed as a method of discovering additional terms to enhance subject-search retrieval. Subjects attached to OCLC records for cited works are compared to those attached to the original citing sources. Citing sources were produced via a subject-list search in a library catalog using the LCSH “Language and languages—Origin.” A subject-search was employed to avoid subjectivity in choosing sources. References from the sources were searched in OCLC where applicable, and the subject headings were retrieved. The subjects were ranked by citation-frequency and tiered into 3 groups in a Bradford-like distribution. Highly cited subjects were produced that were not revealed through the original search. A difference in relative importance among the subjects was also revealed. Broad extra-linguistic topics like evolution are more prominent than specific linguistic topics like phonology. There are exceptions, which appear somewhat predictable from the amount of imbalance in citation-representation between the 2 sources. Citation leaders were also produced for authors and secondary-source titles.

1. Introduction In an extensive review of research on subject-searching, Bates (2003) claims that subject-searching is simultaneously the most popular and the most problematic aspect of OPAC searching. She demonstrates that it has consistently produced low recall, both with pre-coordinate indexing and keyword searching. Graham (2004) notes the disappointment “that the subject-searching capabilities of Web catalogs appear to be much the same as those of pre-Web, second-generation systems,” despite the “significant expansions to the accessibility and content of library catalogs” with the emergence of Web interfaces in the mid-1990s. Bates makes a case for end-user entry vocabulary with expanded terms, whose basic designs could consist of human-made clusters of terms with computer support. This proposal is largely driven by the notion that people can recognize information far more easily than they can recall it (Bates, 2003). Though computers are better at memory than humans, humans are generally better decision-makers than computers (Pinker, 2000). Bates incidentally notes that information systems should also support users in other information-searching behaviors, noting, for example, that citation-chasing is extremely popular in the social sciences and humanities. Citation-chasing and other forms of citation-analysis can contribute to global learning. Such analytical methods have the quality of being blind to many of the restrictions or limitations posed by national, institutional, professional or language-specific standards. Citations are products of the act of research, created by researchers, whereas databases, thesauri, and other sources of keywords are retroactively applied to the research, created by needs external to those of the researchers. Bates discusses citation-chasing as a tool for locating works directly.
In the context of finding additional sources of subject terms, an obvious extension of this process would be the use of subject terms found on records produced through citation-chasing. A set of works is considered to be related by the fact that its members are cited by a common later work or group of works. The identification of these sets through citation-chasing therefore creates the potential for discovering previously unknown relationships among works, or in this case, subjects. Larsen (2002) has shown that following citations from a subject-search can improve recall. Following links from documents presupposes a relevance judgment. Using a subject heading rather than a title as a starting point allows for a set of citing sources produced by method rather than subjective decisions. This helps eliminate the heavy dependence on the user’s ability to supply “good seed documents” (Larsen, 2002). If LCSH is not a completely objective and reliable assignment system, its subjectivity is at least independent of the searcher. Bates also mentions the necessity of letting the user choose between legitimate and expanded keyword searching. LCSH might be considered a parallel example to this sought-after balance, in the sense that the terms are ‘legitimate,’ while the consistency with which headings are assigned to works is much more arbitrary, and therefore potentially useful for discovering additional terms. A Bradford-like model, where rank-frequency and frequency-size are compared, can determine a set of subjects with the highest citation-frequencies, a larger set with lower frequencies, yet a larger set with still lower frequencies, etc. This can show varying levels of relatedness of terms to the source term. For comparison, this process can be applied to the subjects found on records retrieved through the primary search, and then to the subjects attached to records of works retrieved through citation-chasing.

2. Methodology A subject-list search was performed using the subject “Language and languages—Origin” in the online catalog at Long Island University. The titles attached to this subject heading were retrieved, as well as those attached to all instances of this subject heading followed by subheadings. Thirty-seven monograph titles resulted. For a manageable study size, reasonable access, and currency, the results were restricted to works located at the local campus, Brooklyn Campus, and to those published in the last 10 years (1995 or later). Thirteen titles were produced. This study will be completed in 2 phases. This is a report on the first phase of this research, a case study of 2 of the 13 titles retrieved. The titles chosen for the case study were the two works published after 1999: “From Hand to Mouth: the Origins of Language” by Michael C. Corballis, 2002, and “The Evolutionary Emergence of Language: Social Function and the Origins of Linguistic Form,” 2000, edited by Chris Knight, Michael Studdert-Kennedy, and James R. Hurford. The latter is a monographic volume containing 23 works, but was treated as a single work, i.e., duplicate citations among the different works of the volume were deleted. The references in these works were searched in OCLC. Varying LCSH assignments were used from different records for the same works; for example, different editions are often cataloged with different choices of subject headings by different catalogers. Care was taken to avoid misleading variations (for example, when a record shows the work bound with another work). The subject headings were not checked for accuracy or obsolescence. To keep the retrieval set manageable, non-LCSH headings attached to records were not used.
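The tallying step of this methodology can be illustrated with a short sketch. All data below are invented sample values, not the study's actual headings or counts: headings found on records of cited works are counted per source, restricted to headings shared by both sources, and ranked by total frequency.

```python
from collections import Counter

# Invented sample data: subject headings found on OCLC records of
# works cited by each of the two source titles.
cited_subjects = {
    "Corballis": ["Human evolution", "Psycholinguistics",
                  "Human evolution", "Natural selection"],
    "Knight":    ["Human evolution", "Psycholinguistics",
                  "Psycholinguistics", "Sign language"],
}

per_source = {src: Counter(heads) for src, heads in cited_subjects.items()}

# Keep only headings cited by both sources, as in the study,
# then rank by total citation frequency.
common = set(per_source["Corballis"]) & set(per_source["Knight"])
totals = Counter({h: sum(c[h] for c in per_source.values()) for h in common})

for heading, n in totals.most_common():
    print(heading, n)
```

Single-source headings ("Natural selection", "Sign language" in this toy data) drop out exactly as the single-source subjects were excluded from the tiers below.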

3. Results The work by Corballis contains 426 citations. Of these, 311 were to works contained in secondary sources, and therefore not candidates for citation-chasing. The remaining citations were to 115 monographs found in OCLC. The records contained a total of 284 subject headings, cited by a number of sources ranging from 1 to 17. The work by Knight contains 609 citations. Of these, 404 were to works contained in other works and therefore not candidates for citation-chasing. The remaining citations were to 205 monographs found in OCLC. The records contained a total of 412 subject headings, cited by a number of sources ranging from 1 to 21. Of the 284 subject headings retrieved by chasing the citations from Corballis, 76.4% appeared only once, and a further 12.3% only twice. Only 2.5% appeared more than five times. Of the 412 retrieved from the citations in Knight, 74.5% appeared once, 11.9% appeared twice, and less than 5% appeared more than 5 times. Only 112 (19.2%) of the subject headings were cited by both Corballis and Knight. The following discussion of subjects is restricted to this group. There were a few subjects that received a moderate citation-frequency but did not make the cut because they were cited by only one source: “Grammar, Comparative and General—Phonology”; “Phonetics”; “Universals (Linguistics)”; and “Prehistoric peoples,” cited 9, 6, 4, and 4 times respectively. The first three are from Knight; the last is from Corballis. The higher number of citations in Knight makes it likely that this source will produce more citations to any given subject, and therefore more likely that some of its highly cited subjects will not appear in the Corballis citation chase. What is the difference in the keywords retrieved through citation-chasing versus those retrieved from traditional subject searching? The records representing Corballis’ work contain the subject heading “Language and languages—Origin” (in the library catalog, as well as in OCLC). The same subject is attached to the records for Knight’s work. In addition, “Anthropological linguistics” and “Human evolution” are attached to Knight’s records. In the rank-frequency list of total citations (see Table 1), “Language and languages—Origin” was in 5th place (21 citations), and “Human evolution” was in 1st place (31 citations).
“Anthropological linguistics” did not make the multiple-source list: it was cited only by the Knight source, receiving only 1 citation. In the LIU online catalog, “Human evolution” points to the subjects “Evolution (Biology)”, “Physical anthropology”, and “Human beings—Origin” via see-references. Similarly, “Anthropological linguistics” points to “Anthropology”, “Language and culture”, and “Linguistics.” These relationships would not have been picked up in a subject-list search or a subject-keyword search of “Language and languages—Origin.” However, the productivity of the references is limited compared to the results of citation-chasing. Furthermore, the see-reference subjects are not highly ranked in the citation-frequency tables (except for “Evolution (Biology)”). Conversely, the subject-list search that was performed on “Language and languages—Origin”, though high in retrieval, is not precise. The subject-list collocates subjects that begin with “Language and languages”, but a look at Tables 1 through 3 shows only a small number of these subjects: Table 1 has only two (besides the original search subject), Table 2 has none, and Table 3 has 2 (out of 83 subjects). Of the 112 headings, 11 were cited more than 10 times (Table 1). These 11 headings made up about a third (210 citations, 32.8%) of the citations. This can be called the 1st tier.

SUBJECT                                                          TOTAL  Corballis  Knight
Human evolution (tied for 2nd place in LIU catalog search)          31         17      14
Psycholinguistics                                                   29          8      21
Language acquisition                                                25          5      20
Language and languages                                              22         10      12
Language and languages—Origin (1st place in LIU catalog
  search; subject used for original search)                         21          9      12
Evolution (Biology)                                                 16          5      11
Natural selection                                                   16          5      11
Language and languages—Philosophy                                   15          7       8
Evolution                                                           13          4       9
Grammar, Comparative and general—Syntax                             11          1      10
Social evolution                                                    11          5       6

Table 1. Subject Citation-frequency Leaders (more than 10 total citations), with numbers of citations by each of the 2 source titles.

These 11 subjects were represented fairly evenly among the 2 sources except for one, which was cited by Knight 10 times and by Corballis only once. For the rest, no source accounted for over 80% of citations to a subject, and in most cases no source accounted for more than 58%. Knight accounted for more than Corballis in all cases except one. Considering that Knight had more citations than Corballis, this could be seen as a leveling factor that makes the distribution of subjects among the two sources closer to even. In this context, the cases that are least even are the 10-1 case above (“Grammar, Comparative and general—Syntax”), and a case where Corballis had more citations than Knight (“Human evolution”), the subject-frequency leader among the citation-chased subjects as well as among the 2 source titles. A second tier (152 citations, 23.8%) comprised 18 subject headings that were cited between 7 and 10 times; a third tier (278 citations, 43.4%) comprised 83 subject headings that were cited between 2 and 6 times. The number of subject headings receiving 6 citations was disproportionately high compared to the citation-frequency curve, making it difficult to divide the subjects into equal thirds. “Language and languages—Origin,” the target subject heading, places behind 4 other subject headings. Granted, one is “Language and languages,” which as a broader heading could include the former. Note that “Human evolution” and “Psycholinguistics,” the top 2, reflect certain broad topical directions in theories of language origin: evolution and psychology.
Five of these top 11 topics contain the word ‘evolution,’ or in one case represent a major component of evolution: “Natural selection.” It seems that subjects relating to grammar, phonology, semantics, i.e., specific formal components of human language, are not central to the degree that multi-disciplinary or extra-linguistic topics are, like evolution, psychology, or sociology. Note the exception “Language acquisition,” which places 3rd in the 1st tier. Conversely, note that “Anthropology” does not appear in any of the tiers, because it is cited by only one source (nor does the term or any of its morphological derivatives appear as part of a subject heading). Similarly, in the second tier, terms like behavior, primates, animal intelligence, and communication are more prevalent than strictly linguistic topics. An exception is “Sign language,” but this is unbalanced (all but one of the citations are from one source; see the discussion below on unbalanced citations). “Linguistics” is in the 3rd tier with 6 citations, and is also unbalanced. The linguistic topics that are prevalent appear as combinations with complementary fields (e.g., “Biolinguistics” and “Psycholinguistics”). Note that phonology-related topics do not even appear in the 3 tiers (as mentioned above, 2 phonology-related subjects had 9 and 6 citations each, but were cited by only one source). There are some strictly linguistic topics in the 3rd tier, but they are not in the majority.

A few of the subjects received unbalanced attention from the two sources. “Grammar, comparative and general—Syntax” is cited by Knight 10 times and by Corballis once. This ‘citation imbalance’ might be attributed to heavy use by one source. A larger body of source titles might help to confirm that this subject is an anomaly by showing that only one among many sources cites it; conversely, a larger body might show that many other sources cite it as well. Similarly, “Generative grammar” is tied for first place in the second tier among many unlike subjects (Knight 9, Corballis 1). Another 10-citation subject here is “Children—Language.” This subject is not a broad extra-linguistic topic, but it is broader and more general than “Generative grammar.” Its citation-imbalance is accordingly high (Knight 8, Corballis 2), but not as high as that of “Generative grammar.” The other 10-citation subjects in tier 2, like most subjects in tier 1, are much more balanced, and have much broader topics that at least partially fall outside linguistics proper. Some examples of imbalance further down the tiers are “Sign language” (Knight 1, Corballis 8) and “Linguistic change” (Knight 6, Corballis 1). As noted earlier, “Linguistics” in the third tier is unbalanced. Among the few other 6-to-1 subjects in the third tier, some (but not all) follow this pattern. At 6 total citations or less, the notion of balance vs. imbalance becomes tenuous. Measuring citation-imbalance follows the same line of logic as the deletion of single-source subjects, but with the potential for measuring progressive steps of irregularity rather than relying on the binary choice of inclusion/exclusion. A study using a much larger body of source titles would be required before drawing conclusions about balance in this context.
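One simple way to quantify this notion (our formulation, not a measure proposed in the paper) is the share of a subject's citations contributed by its dominant source, which runs from 0.5 (perfectly balanced) to 1.0 (single-source):

```python
def imbalance(corballis: int, knight: int) -> float:
    """Share of a subject's citations coming from its dominant source:
    0.5 means perfectly balanced; values near 1.0 mean near-single-source."""
    return max(corballis, knight) / (corballis + knight)

# Citation counts taken from the study's first-tier figures.
print(round(imbalance(1, 10), 2))   # Grammar, Comparative and general—Syntax: 0.91
print(round(imbalance(17, 14), 2))  # Human evolution: 0.55, nearly balanced
```

A graded measure like this would support the "progressive steps of irregularity" suggested above, rather than the binary inclusion/exclusion of single-source subjects.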

4. Authors There were 642 authors cited 1035 times. 465 of the 642 authors (72.4%) were cited once, and a further 90 were cited twice (14%). Only 5 were cited 10 or more times (0.8%). The citation leader (18 citations) accounted for 1.7% of all citations; together with the 2nd-place author (13 citations), they received 3% of all citations. Vihman was cited by only one source, Knight. Besides Vihman, there were many authors with moderate citation-frequencies who were also cited by only one source. As noted earlier, a study using more source titles is necessary before drawing conclusions about unbalanced treatment among the citing sources.

18  Chomsky, N         18  Chomsky, N
13  Tomasello, M       13  Tomasello, M
11  MacNeilage, P F    11  MacNeilage, P F
11  Dunbar, R I M      11  Dunbar, R I M
10  Vihman, M M        --  --

Table 2. Author Citation-frequency Leaders (10 or more citations). The right columns show the results after the removal of authors not cited by multiple sources.

5. Secondary-Sources There were 713 citations to 336 secondary-source titles (310 from Corballis, 403 from Knight). 234 (69.6%) of the titles were cited once, and a further 46 (13.7%) were cited twice. 317 (94.3%) were cited 5 times or less. Nine (2.7%) were cited 10 or more times. The titles cited 10 or more times accounted for 25.1% of the 713 citations. The citation leader (38 citations) accounted for 5.3% of the citations, and the top 2 (the 2nd-place title had 36 citations) accounted for 10.4%. Secondary-source leaders carried heavier impact than author leaders. As with the authors, there were many secondary-source titles with lower citation-frequencies that were cited by only one source. The removal of these titles does not affect Table 3; however, two titles with 9 citations, one with 7 citations, and two with 6 citations, as well as many less-frequently cited sources, would be removed from the full list of secondary-source titles.

38  Science
36
24  Behavioral and brain sciences
21  Approaches to the evolution of language: social and cognitive bases
15  Journal of human evolution
13  Language
11  Journal of theoretical biology
11  Psychological review
10  Current anthropology

Table 3. Secondary-source Citation-frequency Leaders. Secondary-sources cited 10 or more times (the removal of titles not cited by multiple sources did not affect this table).

6. Discussion This study proposes citation-chasing as a methodology for generating, ranking, and clustering alternative subjects for the potential enhancement of subject searching. The source documents were produced methodologically, to avoid subjectivity in choosing citing sources. Some of the most prominent headings retrieved through citation-chasing were not uncovered by viewing the records for the source documents, or by chasing LCSH see-references. Citation-chasing demonstrates the ability to cluster subjects as well as topic-types by relative importance. The most prominent subjects are broad topics that often form complementary research topics to the science of language, most notably evolution. The division of the subject headings into Bradford-like tiers involved somewhat arbitrary cutoff points. The general pattern was fairly continuous from the first to the second tier with regard to topic-types, as well as to the balance of citations these types received from the two original sources. This balance showed some ability to predict anomalies. Citation-chasing produced similar results for authors and secondary-source titles, showing productivity leaders for each of these categories. The secondary-source leaders carried more relative weight than the author leaders. The same process could be performed using database descriptor terms, title- or general-keywords from articles, or free-text keywords from web resources. This paper used LCSH terms to keep well-defined limits for the study. A further study on subject headings found in OCLC might include foreign, genre, or other types of headings for greater diversity. A full set of source titles should be employed. This might produce different results, and greatly affect the status of subjects currently considered anomalies due to single-source citations.
Regarding the topic “Language and languages—Origin”, a more involved study might analyze the fields of linguistics and complementary or related subjects, like evolution. The results might benefit from a more qualified investigation into the high placement of “Psycholinguistics” and “Language acquisition” in the first tier, where much of the remainder of the tier consists of subjects that fall squarely in the domain of evolution. There is a fair amount of literature discussing the scientific status of the field of linguistics, which is often placed in the humanities, the social sciences, or the sciences, depending on the source (Georgas & Cullars, 2005). There is also considerable discussion about the questionable adequacy of methods and results in linguistics, where explanatory adequacy often does not match the scientific mode of investigation (Yngve, 1986).

The author wishes to acknowledge the advice of Dr. Richard P. Smiraglia in the construction of this paper.

7. References
Bates, Marcia J. 2003. Task force recommendation 2.3 research and design review: Improving user access to library catalog and portal information: Final report (version 3). Report resulting from the Bicentennial Conference on Bibliographic Control for the New Millennium, Library of Congress, Washington, D.C. http://www.loc.gov/catdir/bibcontrol/2.3batesreport6-03.doc.pdf
Ding, Ying, Chowdhury, Gobinda G., Foo, Schubert. 2001. Bibliometric cartography of information retrieval research by using co-word analysis. Information Processing and Management 37: 817-842.
Georgas, Helen, Cullars, John. 2005. A citation study of the characteristics of linguistics literature. College & Research Libraries 66: 496-515.
Graham, Rumi Y. 2004. Subject no-hits in an academic library online catalog: An exploration of two potential ameliorations. College & Research Libraries 65: 36-54.
Larsen, Birger. 2002. Exploiting citation overlap for information retrieval: Generating a boomerang effect from the network of scientific papers. Scientometrics 54: 155-178.
Pinker, Steven. 2000. The language instinct. New York: Perennial.
Stegmann, Johannes, Grohmann, Guenter. 2003. Hypothesis generation guided by co-word clustering. Scientometrics 56: 111-135.
Yngve, Victor H. 1986. Linguistics as a science. Bloomington: Indiana University Press.

Aaron Loehrlein School of Library and Information Science, Indiana University Bloomington, USA

Richard Martin Tinwisle Corp., Bloomington, IN, USA

Edward L. Robertson School of Informatics, Indiana University Bloomington, USA

Integration of International Standards in the Domain of Manufacturing Enterprise

Abstract: This paper examines the use of formal terminologies in international standards, such as those generated by the International Organization for Standardization (ISO), to organize knowledge in the domain of manufacturing enterprise. It analyzes the terminological networks that are formed when standards implicitly or explicitly base their use of terms on the definitions provided by earlier standards. It describes the practical ramifications of inconsistencies in meanings between standards and the steps being taken by ISO working groups and other interested parties to bring those standards into alignment. In so doing, this paper explores some of the social aspects of knowledge organization that take place in global communities.

1. Introduction Manufacturing enterprise and other domains of experience can be organized according to a variety of characteristics, depending on the viewpoint of the person or group involved. In terms of an enterprise, viewpoints can be distinguished according to the framework proposed by Zachman (1987), such as the point of view of the Designer or the Builder. Standards that explicitly acknowledge different viewpoints facilitate the aligning of concepts between viewpoints so that their "points of contact and points of difference are both apparent" (Martin and Robertson, 2005). This helps the actors involved understand the viewpoints of others, and how their own viewpoint fits into the overall enterprise. There are currently many international standards that deal with various aspects of manufacturing enterprise. Many (though by no means all) of these standards are developed by working groups associated with the ISO. Historically, many of these working groups had limited awareness of the efforts made by other groups to develop standards within similar enterprise domains. In addition, manufacturing enterprise standards generally employ their own terminologies to represent the relevant concepts. Although different standards' terminologies may overlap considerably, a given term is likely to be interpreted in different ways across multiple standards. Frequently, these distinctions can simply be attributed to different accepted definitions of the given term. However, in some cases, two standards might provide interpretations of a given term that are similar enough to fall within the same general definition, but have enough subtle distinctions to make it difficult for their respective communities to cooperate. For example, Integrated Definition Methods (IDEF) interprets a Function Model as an abstract representation and a Process Model as a concrete realization.
In contrast, ISO 19439 does not separate Process and Function, but instead represents Process as occurring within the life-cycle of an enterprise functionality.

Communities utilizing IDEF might make distinctions between Process and Function that communities using ISO 19439 could consider irrelevant or even harmful (Martin, 2004). There is a concerted effort on the part of many working groups within the ISO to bring their standards into alignment with each other. This paper traces the evolution of the terminologies of twenty-one standards used in manufacturing enterprise in order to analyze the social and logical processes by which standards are integrated. Sixteen of these standards deal with the integration of industrial automation systems and/or management data in the domain of manufacturing. The remaining standards deal with integration and architecture in manufacturing or general enterprise. Twenty of the standards were published by the ISO. The remaining standard, International Electrotechnical Commission (IEC) Standard 62264-1, deals with enterprise-control system integration. Within these general domains, the standards deal with such issues as definitions, concepts, terminologies, frameworks, models, and basic principles.

Standards indexed: ISO 10746-2, ISO 15531-1, ISO 15704, ISO 15745-1, ISO 16100-2, ISO 18629-1, ISO 18629-11, ISO 19439, ISO 19440, IEC 62264-1

Term                           Interpretation codes across the standards above
Activity                       B, C
Enterprise engineering         A, A
Entity                         A1, B, A1, A1, A2, A2
Genericity                     A, A
Life cycle                     B, A3, A2
Model                          C, A2, B, A1, A1
Modelling language construct   A3, A2
Object                         A, B, A
Process                        A, A, B
Resource                       B, C1, D, B, A, A, C2

Table 1. A sample portion of the inverted index of terms that occur in multiple standards

2. Analysis of the terminologies of standards

Kosanke (2005) performed a manual meta-analysis of the terminologies defined in these standards. He collocated terms and definitions that appeared in many standards, including all the standards described in this paper. In many cases, term definitions were also included from the Collins Dictionary (1987 edition) and WordNet 1.7.1. From this meta-analysis, we created an inverted index of terms and the standards in which they are used (see sample entries in Table 1). For each term, the index records which standards define the term in exactly the same way, which standards provide definitions that are distinct but have some overlap of wording, and which standards define the term in a manner that is fundamentally different from the definitions provided by other standards.
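The construction of such an inverted index can be sketched as follows. This is an illustrative reconstruction, not the project's actual code: the function and variable names are our own, the sample entries are the Object definitions quoted in this paper, and this simple sketch assigns letters in order of first appearance rather than by plurality as in Table 1.

```python
from collections import defaultdict
from string import ascii_uppercase

# Hypothetical input: (standard, term, definition) triples drawn from the
# meta-analysis; these examples are the Object definitions quoted in this paper.
entries = [
    ("ISO 10746-2", "Object", "model of an entity"),
    ("ISO 16100-2", "Object", "model of an entity"),
    ("ISO 15745-1", "Object",
     "entity with a well-defined boundary and identity that "
     "encapsulates state and behaviour"),
]

def build_inverted_index(entries):
    """Map each term to its standards, labelling identical definitions
    with the same letter (A, B, C, ...) in order of first appearance."""
    index = defaultdict(dict)    # term -> {standard: label}
    labels = defaultdict(dict)   # term -> {definition text: label}
    for standard, term, definition in entries:
        if definition not in labels[term]:
            labels[term][definition] = ascii_uppercase[len(labels[term])]
        index[term][standard] = labels[term][definition]
    return dict(index)

index = build_inverted_index(entries)
# index["Object"] -> {"ISO 10746-2": "A", "ISO 16100-2": "A", "ISO 15745-1": "B"}
```

A fuller version would also need the numeric qualifiers (A1, A2, ...) for definitions that overlap without being identical, which requires a similarity judgment rather than exact string matching.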

In Table 1, the letters indicate distinct interpretations of the term. For example, standards ISO 10746-2 and ISO 16100-2 each define Object as a "model of an entity". Because the definitions are exactly the same, they are represented with the same letter ("A"). In contrast, ISO 15745-1 defines Object as an "entity with a well-defined boundary and identity that encapsulates state and behaviour". The wording of this definition is fundamentally different from the wording of the first two. It is therefore represented with the letter "B". In some cases, different standards define a term using some of the same wording, but also use wording that is distinct. For example, ISO 15745-1 defines Entity as "any concrete or abstract thing of interest", while ISO 19439 defines Entity as "any concrete or abstract thing in the domain under consideration". In these cases, the definitions are represented by the same letter, but are given different numeric qualifiers (e.g., "A1" and "A2").

Term         No. of Standards   No. of Definitions   Breakdown by Definition
Attribute    5                  3                    A1, A1, A1 | A2 | B
Capability   5                  3                    A, A, A | B | C
Component    5                  3                    B, B | C, C | D
Enterprise   6                  2                    A1, A1, A1, A1 | A2, A2
Entity       6                  3                    A1, A1, A1 | A2, A2 | B
Model        6                  5                    A1, A1 | A2 | B | C | D
Resource     8                  5                    A, A, A | B, B | C1 | C2 | D

Table 2. The seven most commonly-occurring terms, including the number of standards in which they appear (out of twenty), the total number of definitions used, and the breakdown of definitions

Definitions that appear in a plurality of standards are labeled "A". If no definition holds a plurality, the "A" label is not used; for example, the two definitions of Activity are labeled "B" and "C". The alphabetic (and alphanumeric) labels are not meaningful across terms: an "A" that describes Model has no relation to an "A" that describes Process. Sixty-eight terms and short phrases appeared in more than one of the standards in this analysis. These terms appeared in an average of 2.86 standards. Only seven terms (10.29%) occurred in more than four standards (see Table 2), and no term occurred in more than eight standards. Seventeen of the terms and phrases (25%) were used in exactly the same way across multiple standards. However, only five terms or phrases (7.35%) were used in exactly the same way by more than two standards, and no term or phrase was used in exactly the same way by more than four standards. Twenty-seven terms or phrases (39.71%) were described in fundamentally different ways by at least two standards. Overall, these results suggest that, while standards seem to frequently use other standards as sources for their terminologies, large networks of citations between definitions have yet to emerge.
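As a quick check, the percentages above follow directly from the raw counts; the trivial computation below is shown only to make the denominator (the 68 shared terms) explicit:

```python
# Counts reported in the text: of the 68 terms shared across standards,
# 7 occur in more than four standards, 17 are used identically across
# multiple standards, 5 identically by more than two standards, and 27
# are defined in fundamentally different ways by at least two standards.
total_terms = 68

def pct(n):
    """Share of the 68 shared terms, as a percentage rounded to 2 places."""
    return round(100 * n / total_terms, 2)

print(pct(7))    # 10.29 -> "more than four standards"
print(pct(17))   # 25.0  -> "exactly the same way"
print(pct(5))    # 7.35  -> "same way by more than two standards"
print(pct(27))   # 39.71 -> "fundamentally different ways"
```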

3. Interpreting modifications to definitions

We are currently developing an automated process to extract terminologies from the standards in which they appear and compile them into a master table. This will facilitate analysis of the similarities in terms' meanings across standards. However, interpreting the significance of changes in wording across standards can be problematic: different people attach different degrees of significance to a modification in wording. To explore this effect, we conducted a pilot study with three subjects (the authors of this paper). The first subject was a domain expert who was regularly involved in the development of many of the standards discussed in this paper (referred to as "Expert One"). The second subject was a domain expert who was only occasionally involved in the development of these standards (referred to as "Expert Two"). The third was a relative novice to the domain who had no direct experience with the development of standards (referred to as the "Novice"). Each subject was presented with a list of terms that were used across multiple standards, along with access to each of the definitions involved. If two or more standards used a term in exactly the same way, those standards were explicitly grouped together (a portion of the list is shown in Table 3). The subjects were asked to characterize how the meaning of each term changed from standard to standard, selecting from the six options in Table 4.

Code (Samples)   Term        Earlier Standard(s)   Later Standard(s)
6                Element     14258 (1999)          15531-1 (2002)
2                Enterprise  14258 (1999)          15704 (1999), 62264-1 (2002), 19439 (2003)
3                Mission     15704 (1999)          19439 (2003)
1                Model       15704 (1999)          19439 (2003), 19440 (2004)
6                Object      15531-31 (2002)       15745-1 (2000)

Table 3. For each term, the subjects were asked to provide codes representing the modifications that took place to the definition of the term between the earlier standard(s) and the later standard(s).

Code   Wording     Modification                                               Degree
1      Revised     general rephrasing                                         Small
2      Revised     avoids using terms that are defined in the new standard    Small
3      Different   (virtually) equivalent concepts                            Small
4      Different   (virtually) equivalent concepts, but with different        Moderate
                   aspects emphasized
5      Different   changes to context or specificity                          Large
6      Different   dissimilar concepts                                        Large

Table 4. The codes the subjects used.
Revised: except for a few terms or phrases, the definitions consist of the same words.
Different: the wordings of the definitions are fundamentally different.
Small, Moderate, or Large: a general description of the degree of conceptual difference between each standard's interpretation of the term.

The highest rate of agreement occurred between Expert Two and the Novice, who agreed on 47.5% of the forty terms. Expert One and the Novice agreed on only 22.5% of the terms, while Expert One and Expert Two agreed on only 12.5% of the terms (see Table 5). One explanation for this discrepancy is that Expert One significantly favored response 5 (indicating representations at different levels of specificity, or with different contexts) over other responses (p<0.001 on a chi-square test). In contrast, Expert Two and the Novice did not show a significant preference for any one response (see Table 6). However, Expert Two identified "small" changes 55% of the time, while Expert One identified "small" changes only 12.5% of the time. This suggests that a person who is frequently involved in the development of standards tends to attach a higher degree of significance to seemingly minor changes in wording than do people who have not actively developed the standards.
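The chi-square test behind the p<0.001 claim can be sketched as follows. The observed counts are Expert One's responses as printed in Table 6 (note that, as printed, they sum to 38 even though the table's total column reads 40), tested against a uniform distribution over the six codes; the critical value cited in the comment is the standard chi-square threshold for five degrees of freedom at the 0.001 level.

```python
# Expert One's response counts for codes 1-6, as printed in Table 6.
observed = [2, 1, 2, 7, 19, 7]

# Chi-square goodness-of-fit statistic against a uniform distribution.
expected = sum(observed) / len(observed)
statistic = sum((o - expected) ** 2 / expected for o in observed)

# The chi-square critical value for df=5 at alpha=0.001 is about 20.52,
# so the statistic (~35.9) comfortably supports p < 0.001.
print(round(statistic, 1))
```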

Pair                        Counts by modification code   Total
No agreement                —                             13
Expert One and Expert Two   1, 1                          2
Expert One and Novice       1, 5                          6
Expert Two and Novice       5, 4, 3, 1, 1, 2              16
All Subjects                1, 1, 1                       3

Table 5. The frequency with which the subjects agreed on a given term, by type of modification.

Overall, the results of this pilot study suggest that there is little consensus regarding how changes in the wording of a term's definition affect the meaning of the term. Although automated processes can compare similarities between the strings that comprise definitions, interpreting the semantic similarities between definitions appears to require a great deal of human judgment.
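A minimal example of such an automated string comparison, using Python's standard difflib on definitions quoted in Section 2 (the metric is illustrative only; the paper does not specify which similarity measure the project uses):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude string similarity in [0, 1] between two definitions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Definitions quoted in Section 2: the two Entity definitions overlap in
# wording (labelled A1/A2), while the two Object definitions do not (A/B).
entity_1 = "any concrete or abstract thing of interest"
entity_2 = "any concrete or abstract thing in the domain under consideration"
object_1 = "model of an entity"
object_2 = ("entity with a well-defined boundary and identity that "
            "encapsulates state and behaviour")

# String similarity separates the overlapping pair from the disjoint pair,
# but it cannot say whether the semantic difference is small or large.
assert similarity(entity_1, entity_2) > similarity(object_1, object_2)
```

This illustrates the pilot study's point: a score can flag candidate overlaps, but deciding whether a rewording is a "small" or "large" conceptual change still falls to human readers.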

Subject      Code 1   2    3   4   5   6   Skipped   Total
Expert One   2        1    2   7   19  7   0         40
Expert Two   10       10   6   3   5   6   2         40
Novice       5        9    8   4   9   5   0         40

Table 6. Each subject's preference for types of modification (codes 1-3 are "small", 4 "moderate", 5-6 "large", per Table 4)

4. Changes across revisions of a single standard

We also compared changes to terms' meanings across standards to changes that occur between multiple revisions of a single standard. We analyzed the changes made to ISO Standard 19439 over fifteen revisions from 25 April 2000 to 5 May 2005, and characterized the nature of those changes according to the same six types from Table 4. Examples of changes are shown in Table 7.

Date      Type   Original Passage → Revised Passage
11/2000   2      "means of execution" → "means of operation execution"
11/2000   2      "as described at the Requirements Definition Level" → "as specified at the Design Specification Modelling Level"
12/2000   5      "the means of domain operation execution" → "the means of operation execution"
2/2001    5      "are developed by the particular enterprise if they are not available on the market" → "are developed as new components by the particular enterprise if they are not available on the market"
4/2003    3      "The implementation description phase shall describe the information needed for all of the tasks that are to be carried out by the enterprise domain operational system" → "The implementation description phase shall capture all the information describing the final implementation of the domain operational system."

Table 7. Sample modifications to ISO 19439, the date the changes took place, and the type of changes that occurred.

5. Conclusions and future work

In this paper, we have described methods under development for analyzing changes to conceptual representations across standards and between revisions of a single standard. We found that standards in manufacturing enterprise frequently base their definitions on earlier standards, but that these references have not yet developed into a large network. Conceptual representations are more prone to significant changes as standards are revised than as new standards are created. However, the apparent lack of consensus in interpreting the significance of modifications to wording is a limitation to the validity of these conclusions. Future work will involve continued development of automatic methods for tracing the evolution of definitions across standards. Fully automated methods will be able to compare definitions of all terms across standards, not just the definitions of a single term. We will also use the results of the pilot study to collect data from a larger number of experts as to how they interpret changes to the wordings of definitions; we hope to find trends in these data that will further inform the development of automatic methods for analysis. Finally, we will take into account the effects of context. For example, ISO 14258 defines Behaviour as "how an element acts and reacts", while ISO 15704 defines Behaviour as "how the whole or part of the system acts and reacts". However, ISO 14258 defines Element as "a basic system part that has the characteristics of state, behaviour, and identification", while ISO 15704 does not define Element. With the additional context of Element in ISO 14258, the definition of Behaviour in that standard could be reinterpreted as "how an element (i.e., a basic system part that has the characteristics of state, behaviour, and identification) acts and reacts".
Therefore, the relationship between the meanings of Behaviour as it appears in ISO 14258 and in ISO 15704 depends partially on whether the term Element is also taken into account.
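The context expansion described above can be sketched as a simple term-glossing function. This is a hypothetical illustration (the function and dictionary names are our own); the dictionary holds only the two ISO 14258 definitions quoted in the text.

```python
import re

# The two ISO 14258 definitions quoted in the text, keyed by term.
iso_14258 = {
    "behaviour": "how an element acts and reacts",
    "element": ("a basic system part that has the characteristics of "
                "state, behaviour, and identification"),
}

def expand(term, definitions):
    """Rewrite a term's definition, glossing every other defined term
    inline, as in the Behaviour/Element example above."""
    text = definitions[term]
    for other, gloss in definitions.items():
        if other == term:
            continue
        text = re.sub(rf"\b{other}\b", f"{other} (i.e., {gloss})", text)
    return text

print(expand("behaviour", iso_14258))
# how an element (i.e., a basic system part that has the characteristics
# of state, behaviour, and identification) acts and reacts
```

A real implementation would need to handle recursive expansion and morphological variants ("elements", "behavioural"), both of which this one-pass sketch ignores.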

References
IEC/FDIS 62264-1; Enterprise-control system integration — Part 1: Models and terminology, voting start date: 2002-10-25, Ref. IEC/FDIS 62264-1:2002(E)
ISO FDIS 9506-1; Industrial automation systems — Manufacturing Message Specification — Part 1: Service definition, Ref. ISO/FDIS 9506-1:2002(E)
ISO IS 14258; Industrial automation systems — Concepts and rules for enterprise models, dated: 1999-04-14, Ref. TC184 SC5 WG1 Home Page
ISO IS 15704; Industrial automation systems — Requirements for enterprise reference architectures and methodologies, date: 1999-08-20, Ref. Web version WG1 N431, ISO/TC 184/SC5
ISO/DIS 15745-1; Industrial automation systems and integration — Open systems application integration frameworks — Part 1: Generic reference description, Ref. ISO/FDIS 15745-2000(E)
ISO/CD 15531-32; Industrial automation systems and integration — Manufacturing management data exchange: Resources usage management — Part 32: Conceptual model for resources usage management data, 2001-02-09, N228
ISO 15531-1; Industrial automation systems and integration — Industrial manufacturing management data — Part 1: General overview, 2002-06-28, N286
ISO 15531-31; Industrial automation systems and integration — Manufacturing management data exchange: Resources usage management data — Resources information model: basic concepts, 2002-07-01, N290
ISO/CD 15531-42; Industrial manufacturing management data: Time model, 2004-01-08, N386
ISO/CD 15531-43; Industrial automation systems and integration — Industrial manufacturing management data: Manufacturing flow management data — Conceptual model for manufacturing flow management, 2004-01-22, N390
ISO IS 16100-1; Industrial automation systems and integration — Manufacturing software capability profiling for interoperability — Part 1: Framework, date: 2002-11-01
ISO IS 16100-2; Industrial automation systems and integration — Manufacturing software capability profiling for interoperability — Part 2: Profiling methodology, date: 2003-01-22, Ref. N759 v1.0
ISO IS 16100-3; Industrial automation systems and integration — Manufacturing software capability profiling for interoperability — Part 3: Interface protocols and templates, date: 2003-09-19, ICS 25.040.01
ISO IS 16100-4; Industrial automation systems and integration — Manufacturing software capability profiling for interoperability — Part 3: Interface Protocols and Templates, date: 2004-03-20, N233
ISO/WD 18629-11; Industrial automation systems and integration — Process specification language — Part 11: PSL-Core, Ref. N 236, 2001-02-19
ISO/WD 18629-13; Industrial automation systems and integration — Process specification language — Part 13: Duration and ordering theories, Ref. N 417, 2004-07-06
ISO/WD 18629-43; Industrial automation systems and integration — Process specification language — Part 43: Definitional extension: Activity ordering and duration extensions, Ref. N 419, 2004-07-07
ISO/CEN DIS 19439; CIM Systems Architecture — Framework for enterprise modelling, voting start date: 2003-04-16, Ref. ICS 25.040.40, prEN ISO 19439:2002 E (2003-04-16)
ISO/CEN DIS 19440; Enterprise Integration — Constructs for enterprise modelling, date: 2004-06-10
Kosanke, K. (2005). Terms defined in ISO SC5 standardization work that relate to enterprise integration. CIMOSA Association e.V., 2005-05-16.
Martin, R. (2004). Personal correspondence. Bloomington, IN, USA.
Martin, R., & Robertson, E. (2005). Views in the enterprise domain. Unpublished manuscript. Bloomington, IN, USA.
Zachman, J.A. (1987). A framework for information systems architecture. IBM Systems Journal, 26(3).

Charles Abiodun Robert and Amos David LORIA, Vandoeuvre-Lès-Nancy, France Annotation and its application to information research in economic intelligence

Abstract: Annotation tools are becoming increasingly important in information research, information management and collaborative work. Annotation can be conceptualized to assist in the "collection, processing and distributing of useful information for the economic actors" (economic intelligence), with the aim of facilitating the integration of the two fields of information systems and decision making. This paper describes the theory and concept of applying annotation in the process of information research for decision making, and compares the specificities of this concept to the concepts behind other annotation tools. Our study considers annotation in the light of three parameters: document, user and time. We observe that (a) different documents require different annotations; (b) two or more users may not make the same type of annotation on the same document; and (c) a specific user may not annotate the same document the same way at different times. Information research for decision making that integrates an annotation database can be founded on these three parameters.

Key words: Annotation, economic intelligence, information research, document

1. Introduction

Annotation tools are becoming increasingly important in information management and collaborative work. Internet technology, one of the most prominent platforms for today's communication, favours the use of annotation tools to resolve informational problems. Several free annotation tools are available for co-operative work and for personal use. Some of these tools were developed for specific local needs; others were developed to cater for generalized application. We believe that an annotation tool can be conceived on the basis of a concept that permits its exploitation. One of the objectives of annotation is to evaluate the content of a document, and this content is nothing but the information the document contains. Information is a valuable possession: the value attributed to and/or perceived from a document containing information is never zero. The value may depend on economic considerations, political factors, access to other related information and other socio-ecological factors.

2. Annotation parameters

An annotation essentially consists of three main components: the annotator (the person making the annotation), the document being annotated and the resulting annotation itself. We will not give attention to the annotator in this study, because our concern here is not user modeling or profiling. We will consider a document as "a trace of human activities" (Prie, 1999, 23); in other words, a document is an attestation of human activities. A document may attest to cultural, historic, social, political, scientific, religious or biological events at one or several places at a point in time. A document is not just paper or electronic media with information content. If a piece of paper contains some writing, we consider that piece of paper a document. If two pieces of cloth, one from India and the other from Mali, are presented before us, we may be able to say which cloth is from India and which from Mali; in other words, we can infer from the pieces of cloth some of the cultural heritage of Mali and India. This means that the pieces of cloth are media of information.

Be it an architectural masterpiece, a sculpture, a food item, a drawing or a musical instrument, a document contains information. Documents essentially contain information meant for interpretation (to be read, viewed, heard or perceived) by a certain group of people. The audience may or may not be pre-determined. It is therefore imperative that a document be made available to its potential audience. A document may be in oral, graphic or textual form; it may be tangible or intangible. Henceforth, we will refer to documents as objects and vice versa. Annotation is both an action and an object. From the perspective of an action, annotation can be defined as an act of interpreting a document. The interpretation is made in a specific context and expressed on a host document, and it can be made by the producer of the document or by another person. Considered as an object, an annotation is a written, oral or graphic document usually attached to the host document, the host document being the document interpreted. Annotation cannot take place until the host document has been completed; it is therefore not a property of the original document. For instance, the plate number of a vehicle is not an annotation, even though it is attached to the document (the vehicle), because we consider a plate number a property of the vehicle: a vehicle cannot be said to be complete without one. "A document in the making" is generally not considered for annotation, and every annotation on an incomplete document is considered part of the initial document. Annotations will normally take an appearance different from the original document. The difference may be noticeable in the characters used, font, style, colour, or additional signs and images that are not characteristic of the original document.
In a textual document, the entities available for annotation include punctuation, words, images, artefacts, terminologies, phrases, sentences, passages, collections of homogeneous documents or collections of heterogeneous documents. These entities reflect the granularity of annotation on the document. A collection of articles can be considered a collection of homogeneous documents, because each article in the collection can be considered separately and relates to the other articles uniformly in its properties. A multimedia document can be seen as a collection of heterogeneous documents, in the sense that the individual members that form the collection may differ in their properties and features.

3. Annotation in economic intelligence

The term economic intelligence refers to the use of information for strategic decision making. It is a process covering the two fields of information systems and decision making (David et al., 2001). Economic intelligence is defined as "all the coordinated actions of collection, processing and distributing of useful information for the economic actors with the aim of its exploitation. These actions are taken legally with all the guarantees of protection necessary for the conservation of the company's patrimony, in the best conditions of quality, of delay and of cost" (Martre, 1994). From the perspective of economic intelligence (EI), the information system is viewed as a tool for a decision maker to make the best decision with regard to a particular problem of interest. Attention is given to human actors in EI. Two actors of importance are the decision maker and the watcher (Robert, 2003). These actors perform complementary activities to resolve decisional problems. It is of interest to us to see how some of these activities can be performed with the help of annotation. Most annotation tools available today were conceived without due consideration of either user or time. The attention given to the document in most annotation tools was based either on the medium or on the content of the document (Ovsiannikov, 1999; Denoue, 2000). Some tools were concerned with the convenience of web technology and its applicability to annotation (Yee, 2002; Sudhir, 2005). Others were concerned with information sharing and collaboration (Heck, 1999). The attention of the DEEP Annotation system was on ontology instantiation (Handschuh, 2003).

    Fixed parameters
#   User  Doc  Time   Annotation context                                              Representation
1   -     -    -      All annotations on all documents by all users                   ∫∫∫ dU dD dT
2   -     -    X      All annotations by all users on all documents at a specific time   T∫∫ dU dD
3   -     X    -      All annotations by all users on a document                      D∫∫ dU dT
4   -     X    X      All annotations by all users on a document at a specific time   DT∫ dU
5   X     -    -      All annotations by a user on all documents all the time         U∫∫ dD dT
6   X     -    X      All annotations by a user on all documents at a specific time   UT∫ dD
7   X     X    -      All annotations by a user on a document all the time            UD∫ dT
8   X     X    X      All annotations by a user on a document at a specific time      UDT

Table 1: Table of annotation contexts

Our approach is to present annotation as a function of the host document, the user and the time involved in the annotation. We observed that (a) different documents require different annotations; (b) two individuals will not necessarily make the same kind of annotation on the same document; and (c) under normal conditions, the same user will not annotate the same document the same way at different times. A document may be annotated several times by a particular individual. Several documents can be annotated by a user at a point in time or over a period of time. One user will annotate a document differently from another user, and a particular user may annotate the same document differently within a given time frame. In applying an annotation tool to economic intelligence, we consider annotation in terms of time, users and documents. Series of annotations over time, on one or more documents, by one or more users, can be used to evaluate the orientation and interest of individuals as they attempt to resolve a problem of interest. These problems generally take the form of using information to resolve a problem related to decision making in the course of economic intelligence. An annotation or a set of annotations can be represented as ∫∫∫ dU dT dD, where dU is the change in the user parameter, and dT and dD are the changes in the time and document parameters respectively. Specifically, we are signifying that annotation can be seen as a function of user (U), time (T) and document (D). One or more of these parameters can be kept constant while the others are varied, as in Table 1. When all three parameters are kept constant, we have a single annotation; when all three vary, we have every possible annotation on a set of documents of interest. We may be interested, for instance, in the annotations made by a particular user on a particular document over time, with the objective of seeing the user's reaction or disposition to an event. We can represent this as UD∫dT.

We can represent this in a three-dimensional graph, with each of the parameters on the X, Y and Z axes respectively, or with a table as in Table 1.

Application in economic intelligence

A production manager in a bottling company may be interested in the marketing of lemonade. We can collect the comments (annotations) made by the sales manager on the "sales report"; here, we consider the sales report a document. The annotations of the sales manager can reveal several factors involved in the sales of lemonade, and his comments will depict the factors that affect sales at each point in time. If there is a sharp drop in sales, it may be the result of a particular event that was remarked upon (annotated) by the sales manager at a particular annual general meeting (UDT). The document may be fixed (the sales report) and the user fixed (the sales manager), but the time not fixed (UD∫dT). We can expand the scope of the annotations to other managers, other documents and longer periods of time. The drop in sales may be the result of a particular event that was remarked upon (annotated) by all the managers in a particular meeting (DT∫dU). The drop may be the cumulative effect of the several comments (annotations) of the sales manager on a report he consistently but diversely presented in several meetings (UD∫dT). We may look beyond the comments of the sales manager and consider the comments of every stakeholder on the sales report in meetings (D∫∫dUdT). Lastly, it could be the comments (annotations) of everyone on all reports (documents) all the time (∫∫∫dUdDdT). Analysis of these annotations can be used to resolve problems relating to the drop in sales of lemonade.
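The (user, document, time) parameterization can be sketched as a simple filter over an annotation store. The class, field names and sample annotations below are our own illustrations (not from the paper); fixing any subset of the three parameters yields the eight contexts of Table 1.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    user: str
    document: str
    time: str      # e.g. a meeting date
    text: str

def select(annotations, user=None, document=None, time=None):
    """Fix any subset of the three parameters and vary the rest,
    as in the eight annotation contexts of Table 1."""
    return [a for a in annotations
            if (user is None or a.user == user)
            and (document is None or a.document == document)
            and (time is None or a.time == time)]

# Hypothetical annotation log for the lemonade example.
log = [
    Annotation("sales manager", "sales report", "2005-01", "promotion ended"),
    Annotation("sales manager", "sales report", "2005-06", "competitor entry"),
    Annotation("production manager", "sales report", "2005-06", "stock-outs"),
]

# UD∫dT: all annotations by the sales manager on the sales report, over time.
ud = select(log, user="sales manager", document="sales report")
# DT∫dU: all annotations by all managers on the sales report at one meeting.
dt = select(log, document="sales report", time="2005-06")
```

Leaving all three parameters unset corresponds to ∫∫∫dUdDdT (every annotation); setting all three pins down a single annotation (UDT).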

4. Perspective

We have considered annotation as a function of user, time and document. It may be interesting to look at these parameters in their subdivisions. For example, do we consider time in its totality, or only from one point in time to another as needed to resolve a problem? Will it be necessary to consider all annotations by a user, knowing full well that some annotations will fall outside our time frame of consideration? How do we fragment or group documents and time? In some cases we may need the annotations of more than one user; what criteria do we use to group users? This study presented a review, with some illustrative examples, of the use of annotation. We hope that these illustrations of annotation in economic intelligence can be adapted to a wide range of potential applications in both personal and collaborative contexts.

5. Conclusions

The concept of annotation was defined as both object and action, in a manner that distinguishes it from other easily confused terms. This study proposed a set of directions and applications for annotation that can form part of a useful definition of 'annotation' in the context of economic intelligence.

The importance of annotation to economic intelligence was illustrated with detailed examples. A practical application of this approach is in progress, using the bibliographic database of the research group SITE1 at LORIA2.

Notes 1 SITE is a research group in LORIA that is concerned with economic intelligence modelling. http://site.loria.fr SITE bibliographic database can be located at http://metiore.loria.fr 2 LORIA (Laboratoire Lorrain de Recherche en Informatique et ses Applications) http://www.loria.fr

Bibliography
1. BOUAKA N., & DAVID A. (2003). Modèle pour l'Explicitation d'un Problème Décisionnel : un outil d'aide à la décision dans un contexte d'intelligence économique. In Conférence "Intelligence Economique : Recherches et Applications", Nancy, 14-15 avril 2003. http://www.inist.fr/iera/fichiers/iera11.pdf, 30/06/2004
2. DAVID A., BUENO D., & KISLIN P. (2001). Case-Based Reasoning, User model and IRS. In The 5th World Multi-Conference on Systemics, Cybernetics and Informatics (SCI'2001), International Institute of Informatics and Systemics (IIIS), Orlando, USA, 2001. http://isdm.univ-tln.fr/PDF/isdm11/isdm11a98_amos.pdf, 31/01/2005
3. DENOUE L., & VIGNOLLET L. (2000). An annotation tool for Web browsers and its applications to information retrieval. In Proceedings of RIAO 2000, April 2000. http://www.fxpal.com/people/denoue/publications/riao2000.pdf, 31/03/2005
4. DESMONTILS E., JACQUIN C., & SIMON L. (2003). Vers un système d'annotation distribué. http://www.sciences.univ-nantes.fr/irin/Vie/RR/RR-IRIN2003-01.pdf, 17/07/2004
5. HANDSCHUH S., STAAB S., & VOLZ R. (2003). On deep annotation. In Proceedings of the International World Wide Web Conference, 2003, pp. 431-438.
6. HECK R. M., LUEBKE S. M., & OBERMARK C. H. (1999). A Survey of Web Annotation Systems. http://www.math.grin.edu/~rebelsky/Blazers/Annotations/Summer1999/Papers/survey_paper.html
7. MARTRE H. (1994). Intelligence économique et stratégie des entreprises. Rapport du Commissariat Général au Plan, Paris, La Documentation Française, 1994, pp. 17-18.
8. OVSIANNIKOV I., ARBIB M.A., & McNEILL T.H. (1999). Annotation Technology. Int. J. Human-Computer Studies, pp. 329-362.
9. PRIE Y. (1999). Modélisation de documents audiovisuels en Strates Interconnectées par les Annotations pour l'exploitation contextuelle. Thèse de doctorat, Université Claude Bernard Lyon 1, France, 1999, p. 27.
10. ROBERT, A.B.C. (2003). Représentation des activités du veilleur en contexte de l'intelligence économique. DEA en Sciences de l'Information et de la Communication, Université Nancy 2, Université de Metz, octobre 2003, p. 15.
11. SUDHIR A., SIEGFRIED H., & STAAB S. (2005). Annotation, Composition and Invocation of Semantic Web Services. Journal of Web Semantics. http://www.websemanticsjournal.org/ps/pub/2005-5, 21/03/05
12. YEE K. (2002). CritLink: Advanced Hyperlinks Enable Public Annotation on the Web. Demo at the CSCW 2002 conference, New Orleans, December 2002. http://zesty.ca/pubs/yee-critcscw2002-demo.pdf, 30/03/2005

Shawne D. Miksa, William E. Moen, Gregory Snyder, Serhiy Polyakov, and Amy Eklund Texas Center for Digital Knowledge, University of North Texas Denton, Texas, U.S.A. Metadata Assistance of the Functional Requirements for Bibliographic Records’ Four User Tasks: a report on the MARC Content Designation Utilization (MCDU) Project

Abstract: This paper describes the work of the MARC Content Designation Utilization (MCDU) Project, funded by a National Leadership Grant from the U.S. federal Institute of Museum and Library Services (IMLS). The MCDU Project is analyzing approximately 56 million MARC 21 Format for Bibliographic Data records from OCLC’s WorldCat database to identify actual use of the content designation available in the MARC bibliographic record. We consider bibliographic records as artifacts resulting from the overall cataloging enterprise, of which the encoding of the bibliographic data into MARC is only one part. Concepts from the Functional Requirements for Bibliographic Records (FRBR) can be used to examine and critically assess the availability of bibliographic data in these records, data meant to assist end users in finding, identifying, selecting, and obtaining relevant information resources. Overall, the MCDU Project will provide empirical data reflecting the actual use of MARC content designation structures in this set of records. Specifically, the data can be used to demonstrate how catalogers’ coding of bibliographic data may or may not assist end users in these four tasks. The project is using the mapping by Delsey of MARC data elements to FRBR user tasks in this analysis. These data are crucial for making decisions about the future of MARC and may inform current work on bibliographic rules reflected in the development of the next version of cataloging rules (i.e., Resource Description and Access) by the Joint Steering Committee for the Revision of the Anglo-American Cataloguing Rules.

1. Introduction

The successful use of any knowledge organization system by end users is profoundly influenced by the information professionals’ effective use of the metadata schema underlying such a system. Information professionals have long held the belief that proper organization of humankind’s recorded knowledge is key to the access of that knowledge. The library catalog in particular plays a unique role in libraries as it allows users to explore the holdings and the relationships those items have to one another within a particular collection and, in many cases, in collections across the globe.

There is very little empirical evidence demonstrating the extent of utilization of a major metadata encoding scheme by information professionals, especially catalogers, when creating bibliographic records. The MARC Content Designation Utilization (MCDU) Project (http://www.mcdu.unt.edu) seeks to address that lack of evidence. This analysis of the actual use of MARC content designation structures (CDS) can reflect on policies and practices of the whole enterprise, especially as it relates to the rethinking of the requirements for bibliographic data such as the new conceptual approaches suggested by the Functional Requirements for Bibliographic Records, or FRBR (IFLA, 1998). FRBR concepts can be used to examine and critically assess bibliographic data to assist end users’ tasks of finding, identifying, selecting, and obtaining relevant information resources. MCDU will show how catalogers’ coding of bibliographic data may or may not assist end users in these four tasks by expanding on Delsey’s mapping of FRBR user tasks to MARC data elements (Delsey, 2003) and by providing empirical data on the actual use of the elements in the entire OCLC WorldCat database.

2. MARC Content Designation Utilization Project

The MCDU Project, funded by a National Leadership Grant from the U.S. federal Institute of Museum and Library Services, is a systematic examination of MARC content designation use through a quantitative analysis of over 56 million bibliographic records from OCLC’s WorldCat database. The overarching research question for this project is: What is the extent of catalogers’ use of content designation available in MARC 21? In addition, a set of research questions more specifically guides the project:

- What does the empirical evidence of MARC 21 content designation use suggest about a set of common or frequently occurring elements in bibliographic records per format or type of material?
- What is the relationship between the availability of new MARC content designation and its subsequent adoption and use?
- What methodology is appropriate to identify and understand factors contributing to catalogers’ utilization of available content designation and the interplay between MARC and the entire cataloging enterprise?

OCLC provided a dump of the entire WorldCat database in spring 2005, at which time there were 56,177,383 records. This comprises the MCDU Project dataset. These records have been decomposed to facilitate analysis. The content designation structures were parsed and the resulting data were stored in a large relational database. These content designation structures are the basic units of analysis. Data preparation and management, software tools, and systematic methods and procedures developed for the project will ensure reliable and valid analyses of MARC 21 content designation use. The central component of this process was the development of parsing scripts for decomposing each MARC record’s content designation structures (CDS)—fields, subfields, indicators, etc.—for storage in a database that allows the structures to be retrieved and analyzed. The entire MCDU dataset was then divided into separate subsets based on ten material types. Moreover, in the interest of analyzing the practices of Library of Congress catalogers separately from those of other institutions and cataloging enterprises, the records were further segregated according to the agency responsible for creating them—Library of Congress or other OCLC member libraries (i.e., nonLC records). These efforts resulted in the creation of twenty separate databases, which in turn were populated according to each record’s creating agency (i.e., LC or non-LC) and the type of material described (books, cartographic materials, electronic resources, etc.). This data preparation was guided by the anticipated types of analyses and frequency counts the study team carried out.
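The decomposition and partitioning steps described above can be sketched as follows. This is an illustrative sketch only, not the MCDU Project's actual code: the simplified record layout, the field names, and the test for the creating agency are all assumptions introduced for the example.

```python
# Sketch: decompose simplified MARC-like records into content designation
# structure (CDS) rows and route them into subsets keyed by creating agency
# (LC vs. non-LC) and material type, mirroring the twenty-database split.
from collections import defaultdict

def decompose(record):
    """Yield one CDS row (tag, indicators, subfield code, value) per subfield."""
    for tag, field in record["fields"]:
        for code, value in field.get("subfields", []):
            yield (tag, field.get("indicators", "  "), code, value)

def subset_key(record):
    """Classify a record as LC or non-LC, and by material type (hypothetical test)."""
    agency = "LC" if record.get("cataloging_agency") == "DLC" else "non-LC"
    return (agency, record.get("material_type", "books"))

def load(records):
    """Group decomposed CDS rows into per-subset stores (dicts stand in for databases)."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[subset_key(rec)].extend(decompose(rec))
    return subsets

records = [
    {"cataloging_agency": "DLC", "material_type": "books",
     "fields": [("245", {"indicators": "10",
                         "subfields": [("a", "Example title"), ("c", "A. Author")]})]},
    {"cataloging_agency": "OCLC-member", "material_type": "maps",
     "fields": [("255", {"subfields": [("a", "Scale 1:50,000")]})]},
]
subsets = load(records)
```

In the real project the parsed CDS rows were stored in a large relational database rather than in-memory dictionaries; the grouping logic is the same in spirit.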

3. Relationship of MCDU Project Frequency Counts to the FRBR Model

One of the planned deliverables of the MCDU project is a list of frequently used MARC elements for bibliographic records representing different material formats, as indicated by the empirical evidence resulting from our analyses. The set of frequently used elements will be based in part on Delsey’s functional analysis of the MARC 21 bibliographic format (Delsey, 2003). Ultimately, identifying frequently used elements can have practical applications for catalogers, managers, and others involved in the cataloging enterprise by informing their decisions regarding potential changes in local cataloging policies and practices. In many ways this corresponds to the FRBR study group’s recommendation that the entity-relationship analysis resulting from their study “…might also serve as a useful conceptual framework for a re-examination of the structures used to store, display, and communicate bibliographic data” (IFLA, 1998, 6). Moreover, the identification of frequently used elements in MARC 21 bibliographic records may be used to inform the practices of and research within other metadata communities, as well as those involved in automatic metadata generation, metadata harvesting, or metadata transformation.

The FRBR study group began their investigation by making no “a priori assumptions about the bibliographic record itself, either in terms of content or structure” (IFLA, 1998, 3). The MCDU Project, on the other hand, directly addresses the content designation structures (CDS) of MARC 21, and with the data gathered from OCLC’s WorldCat database it can test this model of basic functionality and make some conclusions as to whether or to what extent catalogers are utilizing MARC to support the model.
To identify which CDS support the four user tasks—find, identify, select, and obtain— we rely on the results of Delsey’s functional analysis of the relationship between MARC 21 format data and FRBR’s user tasks, commissioned by the Library of Congress, Network Development and MARC Standards Office (Delsey, 2003). This research provides a detailed mapping of data elements (fields, subfields, character positions in fixed fields, and indicator positions) specified in the MARC bibliographic and holdings formats to FRBR user tasks. While Delsey’s analysis examines three categories of user tasks (resource discovery tasks, resource use tasks, and data management tasks), our interests focus on the category of resource discovery, which corresponds to FRBR’s four user tasks. Detailed definitions of the four resource discovery tasks are provided in Delsey’s functional analysis of MARC 21 shown here in Figure 1.

Resource Discovery

Search: Search for a resource corresponding to stated criteria (i.e., to search either a single entity or a set of entities using an attribute or relationship of the entity as the search criteria).

Identify: Identify a resource (i.e., to confirm that the entity described or located corresponds to the entity sought, or to distinguish between two or more entities with similar characteristics).

Select: Select a resource that is appropriate to the user’s needs (i.e., to choose an entity that meets the user’s requirements with respect to content, physical format, etc., or to reject an entity as being inappropriate to the user’s needs).

Obtain: Access a resource either physically or electronically through an online connection to a remote computer, and/or acquire a resource through purchase, licence, loan, etc.

Figure 1: The Four User Tasks in Resource Discovery (Delsey, 2003, 10)

4. Methodology, Threshold Identification and Matching to FRBR User Tasks

It should be noted that the task correlating to FRBR’s find task is referred to in Delsey’s analysis as “search”; as the data elements and resource discovery task correlations we use are drawn from Delsey’s data, we retain this convention. In addition to the identification of these user tasks, Delsey maps MARC data elements to their corresponding FRBR entities, along with the associated attributes and relationships. In our analyses, we provide frequency count data for these same elements, grouping them according to entity and user task to show how catalogers have used the CDS related to them. By providing the actual utilization of content designation for the MARC elements associated with the user tasks and entities of the FRBR model, this project’s analyses reveal the extent to which data in the bibliographic records are available to support end users’ activities when using library catalogs for resource discovery.

In the process of developing a methodology for examining the correlation between catalogers’ use of MARC content designation and the FRBR model, there are some fundamental issues regarding the nature and structure of the project’s analyses and the data provided in Delsey’s functional analysis of MARC 21 that need to be addressed. The intent of the Delsey study was “to link MARC 21 format data with models identified in major studies that have recently been developed in the area of bibliographic control,” including the FRBR model (Delsey, 2003, 3). As that research encompasses both the MARC 21 Format for Bibliographic Data and the MARC 21 Format for Holdings Data, the data resulting from the analysis understandably includes elements from both formats. However, the MCDU project’s analyses are concerned only with catalogers’ creation of bibliographic records, and therefore do not include any of the elements defined only in the MARC holdings format.
However, where there is a redundancy between the two (e.g. 022 $a -- International Standard Serial Number, or 852 $a Location, which are defined in both formats), those elements are included in our data.

Another issue involves the nature of the sets and subsets on which our analyses focus. The large number of bibliographic records in our dataset—more than 56 million—required the creation of twenty smaller subsets, based on the creating agency and type of material, or format, described in the records, in order to optimize processing and querying functions. To minimize discontinuities between MARC categories and the parameters of our own format-specific sets, we have chosen to focus the analyses of the elements that support the four FRBR user tasks by providing frequency counts only for the variable data fields and related subfields, as the structure of these elements is common to all MARC material types and to all of our format sets.

Finally, Delsey’s extensive analysis covers all of the content designators specified in the MARC formats, including indicators, the two character positions in the variable data fields whose values interpret or supplement the data found in the field (Library of Congress, 2001). In considering the efficiency of obtaining frequency counts for indicator positions from all of the 56 million records in our dataset, a comparison was made between the total numbers of elements designated by Delsey as supporting each of the four user tasks and the quantity of indicator positions that support the tasks. Judging from the small number of instances in which an indicator position supports a given user task (as shown in Table 1), the project team concluded that frequency counts for these specific content designators would not contribute significantly to the general understanding of catalogers’ utilization of MARC. Frequency counts for indicator positions are, therefore, not included in our data.
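The two filtering decisions just described can be illustrated against a hypothetical Delsey-style mapping table: keep elements defined in the bibliographic format (including those, like 022 $a and 852 $a, defined in both formats) and drop indicator positions. The column names and sample rows below are assumptions made for the sketch, not the actual structure of Delsey's data.

```python
# Hypothetical mapping rows: element, designator kind, formats defining it, task.
rows = [
    {"element": "022$a", "kind": "subfield",  "formats": {"bib", "holdings"}, "task": "identify"},
    {"element": "852$a", "kind": "subfield",  "formats": {"bib", "holdings"}, "task": "obtain"},
    {"element": "863$a", "kind": "subfield",  "formats": {"holdings"},        "task": "obtain"},
    {"element": "245#1", "kind": "indicator", "formats": {"bib"},             "task": "find"},
    {"element": "245$a", "kind": "subfield",  "formats": {"bib"},             "task": "find"},
]

kept = [r for r in rows
        if "bib" in r["formats"]        # holdings-only elements are excluded
        and r["kind"] != "indicator"]   # indicator positions are excluded

print([r["element"] for r in kept])  # ['022$a', '852$a', '245$a']
```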

                                             FRBR Resource Discovery Tasks
MARC 21 Bibliographic Format            Find (Search)   Identify   Select   Obtain
Total no. of elements (fields,
subfields, fixed field positions,
and indicator positions) that
support a given task                         454           972       375      468
Total no. of indicator positions
that support a given task                      0             3         7        6

Table 1. Number of Elements Supporting User Tasks, as of February 2006

An especially valuable product of Delsey’s functional analysis of MARC 21 is the detailed mapping of data elements specified in the MARC bibliographic format to the four user tasks of the FRBR model, including correspondences to the FRBR entities. All of the tabular data from the efforts of this analysis are provided in an Access 2000 database that has been updated by the Network Development and MARC Standards Office to reflect recent updates or additions to the MARC format. Providing the data in this format allows us to reorder and filter the data for our needs, as well as transfer relevant data elements into electronic tables to intersect with the results of our own analyses. The variable field data elements from the MARC bibliographic format (fields 010 through 999) and their related subfields that were mapped by Delsey to the four user tasks were extracted from the Access 2000 database and combined with frequency count data showing the utilization of each CDS, separated by each of the formats specified by the project (e.g., Books, Pamphlets, and Printed Sheets; Cartographic Materials; Electronic Resources; etc.). In order to highlight the most frequently occurring CDS, and because there are approximately 2,000 fields and subfields defined in MARC 21 Format for Bibliographic Data, we found it necessary to determine a threshold for the frequency with which elements occur at least once in a record. The threshold is based on a statistical calculation explained in a separate document available on the project website. Briefly, the threshold calculations are based on the frequency of use of each CDS, expressed as the number of records in which a particular CDS is used; the CDS were then listed in descending order of frequency.
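The published threshold formula is in the separate MCDU document mentioned above; the sketch below only illustrates the described procedure: express each CDS's frequency as the number of records using it, rank the list in descending order, and flag elements above a cutoff. The CDS names, counts, and the 10% cutoff are arbitrary assumptions for illustration, not the project's actual figures or formula.

```python
# Hypothetical record counts per CDS (number of records in which the CDS occurs
# at least once) for an imagined subset of 34.5 million records.
record_counts = {"245$a": 34_500_000, "100$a": 28_000_000,
                 "500$a": 12_000_000, "655$a": 900_000}
total_records = 34_500_000

# Rank CDS by descending frequency, then keep those above an assumed cutoff.
ranked = sorted(record_counts.items(), key=lambda kv: kv[1], reverse=True)
cutoff = 0.10  # hypothetical: present in at least 10% of records
threshold_elements = [cds for cds, n in ranked if n / total_records >= cutoff]

print(threshold_elements)  # ['245$a', '100$a', '500$a']
```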

5. Partial Results of the Analysis

Focusing on an analysis of only the dataset of non-LC Books records (all records where Leader 06 value is “a” and where Leader 07 value is “a”, “c”, “d”, or “m”, and where 008/23 is not value “s”), some results of the mapping of the variable field content designation structures (CDS) within the MCDU project thresholds that support the four user tasks (find (search), identify, select, and obtain) in the non-LC dataset for Books are given in the tables below. This series of cumulative tables will show data for all four user tasks as well as the associated FRBR entities. With these data we can test the FRBR model of basic level of functionality and make some conclusions as to whether or to what extent catalogers are utilizing MARC to support the model. Tables 2-5 show both the total number of variable field CDS, and the percentage of threshold variable field CDS within those totals, that support the user tasks as designated by Delsey’s analysis. The entities in these tables include both the primary entities as described in the FRBR model and the additional, or secondary, entities that Delsey (2003) defined as relating to work and item. For example, a work can result from the performance of a task which is part of a project which in turn is part of a program. The project and program can be funded through a contract or grant which is funded by a corporate body (Delsey, 2003, 58). The abbreviation “C/O/E/P” represents concept, object, event, and person, which can be subjects of a work as defined in FRBR (IFLA, 1998, 15).
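The Books-set selection rule quoted above can be expressed directly against the MARC Leader and 008 character positions. The record representation below (plain strings for the Leader and field 008) is an assumption for the sketch.

```python
def is_book(leader, field_008):
    """Apply the stated Books filter: Leader/06 = 'a' (language material),
    Leader/07 in {'a','c','d','m'} (bibliographic level), 008/23 not 's'."""
    return (leader[6] == "a"
            and leader[7] in {"a", "c", "d", "m"}
            and field_008[23] != "s")

# Example values (illustrative, not taken from WorldCat):
leader = "00000cam a2200000 a 4500"                      # positions 6-7: 'a', 'm'
f008   = "060101s2006    xxu           000 0 eng d"      # position 23: blank

print(is_book(leader, f008))  # True
```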

Find (Search)

                 No. of Variable Field CDS   Total No. of Variable     % of Variable Field CDS
Entity           that are Threshold          Field CDS Used in Set     That Are Threshold
                 Elements                                              Elements
Action                     0                           3                       0.0%
Any                        2                           3                      66.7%
C/O/E/P                    3                           8                      37.5%
Concept                    4                          12                      33.3%
Corp. Body                 9                          36                      25.0%
Curriculum                 0                           3                       0.0%
Event                      2                          13                      15.4%
Expression                 1                          28                       3.6%
Item                       5                          17                      29.4%
Manifestation             11                          64                      17.2%
Person                    10                          22                      45.5%
Place                      3                          22                      13.6%
Work                      11                         151                       7.3%
Total:                    61                         382                      16.0%

Table 2: Threshold Percentages That Support the Find (Search) User Task and Associated Entities in CDS Found Within Book Records Created by OCLC Member Libraries (nonLC).

The Find (Search) task (Table 2) is supported by 61 field/subfields above the calculated threshold. Stated another way, only 16% of the 382 variable fields utilized by catalogers support this user task within Book records created by OCLC member libraries (non-LC). Furthermore, taking all the entities that are specifically associated with the Find (Search) task and the secondary items associated with those (as designated in Delsey’s 2003 analysis) we can say that 12% of the 382 variable fields utilized by catalogers support this user task within Book records created by OCLC member libraries (non-LC). This set of Book records accounts for 34.5 million of the 56 million records in OCLC’s WorldCat database.
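The threshold percentages quoted for Table 2 can be checked with a line of arithmetic: the percentage is simply the number of threshold CDS over the total CDS for the entity or task, rounded to one decimal place.

```python
def pct(threshold_cds, total_cds):
    """Percentage of variable field CDS that are threshold elements."""
    return round(100 * threshold_cds / total_cds, 1)

print(pct(61, 382))  # Find (Search) total row: 16.0
print(pct(11, 64))   # Manifestation row: 17.2
print(pct(10, 22))   # Person row: 45.5
```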

Identify

                 No. of Variable Field CDS   Total No. of Variable     % of Variable Field CDS
Entity           that are Threshold          Field CDS Used in Set     That Are Threshold
                 Elements                                              Elements
Action                     0                           3                       0.0%
Any                        2                           3                      66.7%
C/O/E/P                    2                           6                      33.3%
Concept                    4                          12                      33.3%
Contract                   0                           1                       0.0%
Corp. Body                10                          50                      20.0%
Curriculum                 0                           3                       0.0%
Event                      2                          15                      13.3%
Expression                 3                          47                       6.4%
Grant                      0                           1                       0.0%
Item                       5                          23                      21.7%
Manifestation             29                         281                      10.3%
Person                    10                          22                      45.5%
Person/Corp.               0                           2                       0.0%
Place                      2                          16                      12.5%
Program                    0                           1                       0.0%
Project                    0                           1                       0.0%
Study program              0                           1                       0.0%
Task                       0                           1                       0.0%
Work                      13                         194                       6.7%
Total:                    82                         683                      12.0%

Table 3: Threshold Percentages That Support the Identify User Task and Associated Entities in CDS Found Within Book Records Created by OCLC Member Libraries (nonLC).

The Identify user task is supported by 683 variable field CDS, of which only 82 (12%) are above the calculated threshold. If we compare this with the FRBR model’s basic level of functionality, we can see that work, expression, and manifestation are supported by only 6.6%, a little more than half of the 12% of CDS above the threshold. The secondary entities associated with work—contract, curriculum, grant, program, project, study program, task—do not have significant numbers of CDS to include in this percentage.

Select

                 No. of Variable Field CDS   Total No. of Variable     % of Variable Field CDS
Entity           that are Threshold          Field CDS Used in Set     That Are Threshold
                 Elements                                              Elements
C/O/E/P                    1                           2                      50.0%
Corp. Body                 1                           1                     100.0%
Event                      0                           5                       0.0%
Expression                 6                          28                      21.4%
Item                       0                           1                       0.0%
Manifestation             16                         104                      15.4%
Place                      1                           7                      14.3%
Study program              0                           1                       0.0%
Work                       3                          23                      13.0%
Total:                    28                         172                      16.3%

Table 4: Threshold Percentages That Support the Select User Task and Associated Entities in CDS found Within Book Records Created by OCLC Member Libraries (nonLC).

The Select user task is supported by a total of 172 variable field CDS, of which only 28 (16.3%) CDS are above the calculated threshold. This user task is also associated with work, expression, and manifestation as outlined in the FRBR model’s basic level of functionality, and taken together these account for the majority of the 16.3% of above-threshold CDS.

Obtain

                 No. of Variable Field CDS   Total No. of Variable     % of Variable Field CDS
Entity           that are Threshold          Field CDS Used in Set     That Are Threshold
                 Elements                                              Elements
Action                     0                           1                       0.0%
Any                        2                           3                      66.7%
Corp. Body                 1                           9                      11.1%
Expression                 2                           7                      28.6%
Item                       5                          18                      27.8%
Manifestation             24                         250                       9.6%
Work                       1                           7                      14.3%
Total:                    37                         295                      11.9%

Table 5: Threshold Percentages That Support the Obtain User Task and Associated Entities in CDS Found Within Book Records Created by OCLC Member Libraries (nonLC).

Finally, in Table 5 we can see that in this set of Book records the Obtain user task is supported by a total of 295 variable field CDS, of which only 11.9% are above threshold. The concentration of this percentage centers on fields that are associated with the manifestation entity, which corresponds to the relationship between the Obtain task and manifestation as designated in the FRBR model’s basic level of functionality.

6. Conclusions and Summary

By pairing the MARC content designation structures associated with the four user tasks with the frequency count data from the MCDU Project’s analysis, we are able to add another layer to Delsey’s functional analysis of MARC 21, showing the correspondence of its actual utilization to the FRBR model. However, this raises the question of what these levels actually mean in the overall picture of cataloger utilization of MARC 21. For instance, do we know how many content designation structures are needed to support a user task? Does a higher percentage of CDS used in a record necessarily mean there is stronger support for a task? Further study is needed to explore these types of questions.

This paper has endeavored to show how catalogers’ coding of bibliographic data may or may not assist end users’ tasks of finding, identifying, selecting, and obtaining relevant information resources. The results from the current research are important contributions to discussions about the future of MARC and bibliographic rules such as the current work on Resource Description and Access (i.e., AACR3) by the Joint Steering Committee for the Revision of the Anglo-American Cataloguing Rules.

The MCDU Project team has developed a methodology to identify factors that influence utilization of MARC content designation, and query software that allows for deeper levels of analysis (e.g., the correlation between descriptive cataloging form, encoding levels and format; the distinction between Library of Congress created records and non-LC created records within formats, etc.). An understanding of these factors can point decision makers to focal areas of the cataloging enterprise for assessment, especially as it relates to FRBR. This understanding can, in turn, inform cataloging education and future catalogers, both nationally and internationally.

References

Delsey, Tom. (2003). Functional analysis of the MARC 21 bibliographic and holdings formats, Second revision. Prepared for the Network Development and MARC Standards Office, Library of Congress. Retrieved February 15, 2006, from http://www.loc.gov/marc/marc-functional-analysis/functional-analysis.html

International Federation of Library Associations, IFLA Study Group on the Functional Requirements for Bibliographic Records. (1998). Functional requirements for bibliographic records: final report. Retrieved February 15, 2006, from http://www.ifla.org/VII/s13/frbr/frbr.pdf

Library of Congress, Network Development and MARC Standards Office. (2001). MARC 21 Format for Bibliographic Data, Update no. 2. Washington, D.C.: Library of Congress Cataloging Distribution Service.

Library of Congress, Network Development and MARC Standards Office. (2006). Access 2000 database filename: FRBR_Web_Copy.mdb, updated 07 February 2006 [Data file]. Retrieved February 27, 2006, from http://www.loc.gov/marc/marc-functional-analysis/functional-analysis.html

Dimitris A. Dervos (1) and Anita Coleman (2)
(1) Information Technology Dept., Alexander Technology Educational Institute, Thessaloniki, Greece
(2) School of Information Resources & Library Science, University of Arizona, Tucson, USA

A Common Sense Approach to Defining Data, Information, and Metadata

Abstract: Many competing definitions for the terms data, information, metadata, and knowledge can be traced in the library and information science literature. The lack of a clear consensus in the way reference is made to the corresponding fundamental concepts is intensified if one considers additional disciplinary perspectives, e.g. database technology, data mining, etc. In the present paper, we use a common sense approach to selectively survey the literature and define these terms in a way that can advance the interdisciplinary development of information systems.

1. Introduction

Many competing definitions for the terms data, information, and knowledge can be traced in the library and information science literature (Farradane, 1979; Buckland, 1991; McCrank, 2002). The ALA Glossary of Library Terms (1943) does not mention these three terms, although the Online Dictionary of Library and Information Science (2005), modeled on the ALA Glossary of Library and Information Science (1983), does. A closer look at the definitions reveals that they appear to differ from the way they are agreed upon and used in the traditional computer science, database technology, and data mining communities. Similarly, in the library and information science literature the term metadata is defined differently from the way it is defined in the knowledge management literature.

In this paper, we assume that competing and divergent definitions get in the way of true interdisciplinary collaboration between computer scientists, library/information scientists, and all those who need to work together to solve problems. They not only prevent the accurate accumulation of data, information, or knowledge, but also hinder the development of systems that can truly move us into the information society era. Since many disciplines study information today, information, data, metadata, and knowledge also have current and emergent meanings. Definitions for these fundamental concepts in the many academic disciplines that study information and its variants (data, knowledge, wisdom, beliefs, etc.), however, continue to be widely divergent (Debons et al. 2005).

The common sense approach established in this paper is outlined in Section 2. In Sections 3, 4, and 5 we define the concepts of data, information, and metadata, respectively. Next, we consider a number of interesting implications in Section 6, and conclude in Section 7.

2. The Common Sense Approach

Debons (2005a) proposed that the basic notions of data, information, and knowledge can be defined by observing the following two preconditions (Figure 1):

PRECONDITION-1: A “living species” approach is adopted, i.e. social/organizational systems are not addressed at this stage, the focus being on the individual living organism (the human being comprising only one case of the many), and technology is ignored.

PRECONDITION-2: Reasoning builds upon a finite number of simple assumptions made initially.

Figure 1: A living species’ interaction with/understanding of the environment (Debons 2005a)

In compliance with Preconditions 1 and 2, the following two assumptions can be made:

ASSUMPTION-1: Information is at a higher level than data

ASSUMPTION-2: Knowledge is at a higher level than information

3. Data

McCrank (2002) defines data as follows:

Data are what is given in the smallest units, from digits to arrays and points to lines, and bits of information which are encountered, collected, or inferred, and manufactured, that are neither facts nor constitute evidence by themselves. These are the raw material for building information. (p. 627)

McCrank (2002) further defines facts and this definition provides a clue as to how disciplinary differences can sometimes be overcome:

Facts are things done, that is deeds or acts made into something known (from facer, to make, so: something made), which have had or do have actual existence, and are true and pertain to objective reality…facts are supposedly stable, actual, and real and can therefore be made evident (they are represented but must be presented).

Similarly, for those in the data mining field, data are distinguished from facts. “A fact is a simple statement of truth” (Roiger and Geatz, 2003, p. 5). Observing the world, the most primitive type of autonomous intellectual activity one can think of is the recording of facts that relate to the object/event upon which observation is focused. Facts can only be recorded (subsequently: processed) once they are represented properly in the appropriate model space, i.e. when data come into play.

DEFINITION: Data represent real world facts.

For example, data may comprise the outcome of measurements conducted in relation to real world phenomena (e.g. rainfall values for a given set of geographical locations over a period of time). Also, data may relate to values of attributes that characterize entities and/or relationships between entities in real world application model situations (Ramakrishnan 2004).

4. Information

Farradane (1979) defined information as any physical form of representation (or surrogate) of knowledge, or of a particular thought, used for communication. A little less than twelve years later, Buckland (1991) refined the definition further and distinguished between information-as-thing, information-as-process, and information-as-knowledge. He convincingly argued that information-as-thing is what is dealt with in modern day information systems. Machlup (1980), one of the earliest researchers who tried to measure the information economy, defined knowledge as content and information as process. However, his initial attempts to provide justification for these definitions were limited to the measurement of activities of scientists and researchers, and ignored the processing work done by librarians and records managers. Later, Machlup and Mansfield noted: “Information is not just one thing. It means different things to those who expound its characteristics, properties, elements, techniques, functions, dimensions, and connections.” (1983, p. 4). Thus, information is hard to define directly. However, it is felt to relate to the communication and interpretation of data: the processes of “becoming informed” and “informing”. Interestingly enough, despite lacking a direct formal definition, the concept is better understood through the influence that carriers said to be information-rich have on the environment. Thus, information may be defined as follows:

DEFINITION: Information is revealed each time data are interpreted successfully in the direction of increasing benefit, profit, or pleasure, as the latter are realized by some intellectual activity.

The human mind appears to favor savings on the overhead associated with processing data for the purpose of extracting useful information from them. In this respect, shortcuts that interpret data in a most concise and comprehensible way are regarded as clever. For example, a plot that occupies one third of an A4 page may comprise a shortcut to two A4 pages of tabular data in revealing the same information on how, say, a dependent variable responds to the way values are assigned to an independent variable. A one-page entity-relationship (ER) diagram comprises a shortcut revealing information in relation to a model application whose textual description may occupy, say, five A4 pages (Chen 1976). A rule like “heavy smokers have a high probability of developing lung cancer” comprises a shortcut to interpreting the data of thousands of patient records. Shortcuts of the type described in the previous paragraph are felt to be utilized in the intellectual activity that compiles/accumulates knowledge (Miller, 1956). As has already been stated, the latter is assumed to be one level up in the intellectual hierarchy chain when compared to information. Most people will agree that information is certainly something that helps produce knowledge. Additionally, the truth-value of information can be an important criterion in both the determination of information itself as well as in measuring it. Truth-value from mathematical logic may be a better criterion than accuracy1, because truth-values can be calculated based on two values only: “true” and “false”. No further attempt is made in the direction of realizing and defining the concept of knowledge at this stage.

Metadata

Metadata are often glibly and ambiguously defined as "data about data". A somewhat more involved explanation is useful before we define metadata: to facilitate the processing stage, data need to be organized in structures that group values in accordance with the semantics of the facts they relate to.

DEFINITION: Metadata are tags/labels assigned to data instances and structures that make them comprehensible and/or facilitate the processing that extracts information from data corpora.

For example, when observation targets the performance of students in an academic establishment, the corresponding set of metadata could include labels like: Student ID, Department, Year of Entry, Course ID, Grade, etc. In the case of an information resource or information package, metadata labels could include Format, Form, Creator, Title, etc. When the target model involves conceptual ideas expressed through language only, metadata that attempt to categorize and capture both data as well as information could include Process, Object, Phenomenon, etc.
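The student-performance example can be sketched in a few lines of code. The values below are invented for illustration; the point is only that the bare data tuple is incomprehensible until the metadata labels are attached, per the definition above.

```python
# Minimal sketch (values are hypothetical, not from the paper):
# metadata as tags/labels assigned to a data instance.

# A bare data instance is hard to interpret on its own...
data_instance = ("S1042", "Informatics", 2004, "CS101", 8.5)

# ...until metadata labels make it comprehensible and processable.
metadata_labels = ("Student ID", "Department", "Year of Entry",
                   "Course ID", "Grade")

# Pairing labels with values yields a self-describing record.
record = dict(zip(metadata_labels, data_instance))

# The labels now let processing extract information from the data.
print(record["Department"], record["Grade"])
```

The same pattern applies to an information package, with labels such as Format, Form, Creator, and Title in place of the student attributes.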

Implications/Discussion

Debons' approach to defining data, information, and knowledge may be extended in the direction of adopting a human-centric approach in order to identify the current stage of the forefront of human civilization. Everybody agrees that the latter is post-agricultural and post-industrial. Interesting issues to argue about relate to questions like:

- The information age: Is it nearing? Has it begun? Is it here?
- The knowledge age: Is it nearing? Has it begun? Is it here?
- What is it that comes after the knowledge age? The age of wisdom?

Confusion begins to develop when software product vendors use exotic terms to name their products for marketing purposes. How can one compete in the information management market today, when IBM have been marketing their premier transactional and hierarchical database management product under the name IMS (Information Management System) since the early 1960s (IBM 2006)? Once again, one needs to rely on common sense in order to find a way out of such a maze:

ASSUMPTION: Major transitions in human civilization relate to qualitative changes in the way people perceive the world and function in relation to their (external) environment in everyday life, not to quantitative ones.

Applying this to the case of data and data processing, one notes that:

1. The concept of data has been defined and is fully understood.

2. The concept of data has been quantified: the storage space a data corpus occupies remains invariant upon transportation from system to system, provided that the technology that materializes the representation remains invariant.

3. When it comes to storing data on digital media, the unit that measures data 'quantity' is well defined: the bit.

4. In the developed part of the world today, everyday human activity is shaped, to a great extent, by the data storing and data processing operations of digital devices: one considers a digital organizer to be an extension of one's memory (for as long as the corresponding data are registered with the device, one no longer cares about remembering them), and an alarm will sound when the time comes for that telephone call to be made, the name and phone number of the other party flashing on the screen of the organizer.

Considering the above (1-4), in order to identify whether information has come of age, one notes that:

1. Today, the concept of information is realized only indirectly, not directly. This is analogous, in a way, to the case of elementary particles in physics: one cannot measure their properties (i.e. mass/energy equivalent, wavelength, electric charge, etc.) directly, but only indirectly, via the influence they have on their environment.

2. The concept of information has not yet been quantified. The information inherent in a given dataset can usually be extracted in a most concise way (i.e. a shortcut) via, say, a graph, a rule, or (even) a pattern. However, one cannot be sure of having achieved the most concise way of extracting a given instance of information until (possibly) a better shortcut is invented.

3. There exists no comprehensive scheme for measuring information (or a way to model it) in everyday life. For example, we are far from being able to claim that "John called me last night and in five minutes of talk he revealed five infotrons worth of information to me, whereas Mary called a bit later and in just three minutes she revealed ten infotrons worth of information on the same topic".

4. Human activity is still far from utilizing technology that incorporates information processing in everyday life. Many visions, such as Bush's Memex (Bush, 1945), are indicative of how things will be when information processing becomes a routine part of everyday life, namely when it becomes possible for technology to: (a) model the user's profile/interests/preferences, (b) sense the human's current context, (c) process information relevant to (a) and (b), and (d) interact with the human by presenting the information in a subtle, non-intrusive way. Obviously, we have some way to go before our everyday activities are shaped to co-exist and co-function harmoniously with technology in such a mode.

Considering the above:

COROLLARY-1: Information remains to be quantified, modeled, and fully understood as a concept.

COROLLARY-2: The forefront of our civilization, with regard to the technological advances made and the way those advances shape the everyday life of humans, is still in the data processing age. More research and technological advances are required for the information society to come of age.

Humans make associations and relationships between concepts and ideas all the time, and thesauri are well-known knowledge organization schemes that capitalize on this: monolingual and multilingual thesauri choose one term as the preferred form of heading for a concept and describe the three major types of relationships between concepts - hierarchical, associative, and equivalence. Thesauri are used for 1) document indexing and representation in information systems, and 2) effective database searching by information intermediaries and end-users. Research in the area of semantic relations (which may be hierarchical, associative, or equivalence) shows that the precise application and uses of fine-grained semantic relations are as yet unknown. Semantic relations are important in information processing applications and for future information science research (Khoo and Na, 2005). Collocating the definitions of the fundamental terms data, information, and metadata from the various disciplines, such that the various groups who work with them are aware of the differences, is a first step towards transformational information systems (Neelameghan, 1972). By collocation is meant the process of bringing together concepts and the characteristic relationships amongst them, so as to form junctions across subject and discipline areas. In this respect, semantic relations could be utilized as they are defined in Khoo and Na (2005). More specifically, explicit, detailed semantic relations of the case- and lexical-type could be further identified in broad contexts for the concepts of interest (i.e. data, information, metadata). Once identified, the semantic relations would then be placed into multiple logical orders; multiple classifications and orders aid in-depth cumulative understanding of the corresponding abstract concept/idea.
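The three relationship types and the notion of a preferred heading can be sketched as a toy data structure. The terms and relations below are invented examples, not drawn from any published thesaurus; the relation labels follow the conventional BT/NT/RT/USE/UF abbreviations.

```python
# Hedged sketch: a toy thesaurus capturing the three major relationship
# types - hierarchical (BT/NT), associative (RT), and equivalence
# (USE/UF). All terms and links are illustrative assumptions.
thesaurus = {
    "information retrieval": {
        "BT": ["information science"],        # broader term (hierarchical)
        "NT": ["cross-language retrieval"],   # narrower term (hierarchical)
        "RT": ["indexing"],                   # related term (associative)
        "UF": ["document retrieval"],         # used for (equivalence)
    },
    "document retrieval": {
        "USE": ["information retrieval"],     # non-preferred -> preferred
    },
}

def preferred(term: str) -> str:
    """Resolve a term to its preferred form of heading."""
    entry = thesaurus.get(term, {})
    return entry["USE"][0] if "USE" in entry else term

print(preferred("document retrieval"))  # -> information retrieval
```

An indexer or search system would apply `preferred` to every incoming term, so that documents and queries collocate under one heading regardless of which equivalent term was used.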

Conclusion

Machlup and Mansfield suggested, as early as 1983, that "most of the confusion caused by the use of the term information science in its broadest sense could be avoided by the addition of the plurals". That is, many disciplines comprise the information sciences, like the social sciences and the natural sciences. While writings that have examined information and related phenomena are not exactly unique (Cana, 2003), the imperative of the activity-theoretical approach to information science (Hjorland, 1997), the interdisciplinary, multi-disciplinary nature of information systems work, and the "viewpoint warrant" (Beghtol, 1988) suggest that we first identify the terms and their competing definitions from the many branches of knowledge, and then work consensually towards acceptance of the fundamental ones, such that they become sharable and applicable across interdisciplinary domains. In the present paper, we have used the common sense approach to show how some definitions can be developed in a way that promotes the usage of a common vocabulary across many disciplines. We hope that our proposal can be fruitful in interdisciplinary work, thereby leading to the growth of the information sciences and the development of information systems that truly make possible the realization of the information society.

Acknowledgement

The authors are thankful to Richard J. Hartley for his prompt comments on various drafts of the present paper.

Notes

1. Accuracy is often only one criterion in multi-dimensional and complex models of information and data quality (Olaisen, 1990; Fox et al, 2004; Beall, 2006).

References

Beall, J. (2006). Metadata and Data Quality: Problems in the Digital Library. Journal of Digital Information 6 (3). Retrieved 27 February 2006. http://jodi.tamu.edu/Articles/v06/i03/Beall/

Beghtol, C. (1988). General Classification Systems: Structural Principles for Multidisciplinary Specification. In: Structures and Relations in Knowledge Organization: Proceedings of the 5th International ISKO Conference, Lille, 25-29 August 1998; eds. W. Mustafa el Hadi, J. Maniez, S. Pollitt. (Advances in Knowledge Organization 6). Würzburg: Ergon, 1998, pp. 89-96.

Buckland, M. (1991). Information as Thing. Journal of the American Society for Information Science 42 (5): 351-360.

Bush, V. (1945). As We May Think. Atlantic Monthly. Retrieved 24 February 2006. http://www.theatlantic.com/doc/194507/bush

Cana, M. (2003). The Understanding of Information and Information Science. Retrieved 4 November 2005. http://www.kmentor.com/socio-tech-info/archives/000050.html

Chen, P.P. (1976). The Entity-Relationship Model - Toward a Unified View of Data. ACM Transactions on Database Systems 1 (1): 9-36.

Debons, A., Zins, C., Beghtol, C., Harmon, G., Hawkins, D., Froehlich, T.J. et al. (2005). The Knowledge Map of Information Science. 2005 ASIS&T Annual Meeting, October 28 - November 2, Charlotte, N.C., U.S.A.

Farradane, J. (1979). The Nature of Information. Journal of Information Science 1: 13-17.

Fox, C., Levitin, A., and Redman, T.C. (1996). Data and Data Quality. In: Encyclopedia of Library and Information Science. New York: Marcel Dekker, pp. 100-122.

Hjorland, B. (1997). Information Seeking and Subject Representation: An Activity-Theoretical Approach to Information Science. Westport, Conn.: Greenwood.

IBM (2006). The IMS Family. Retrieved 1 February 2006. http://www-306.ibm.com/software/data/ims/

Khoo, C.S.G. and Na, J. (2005). Semantic Relations in Information Science. Annual Review of Information Science and Technology 40. Medford, N.J.: Information Today.

Machlup, F. (1980). Knowledge: Its Creation, Distribution, and Economic Significance. Vol. 1, Knowledge and Knowledge Production. Princeton, N.J.: Princeton University Press.

Machlup, F., and Mansfield, U. (eds.) (1983). The Study of Information: Interdisciplinary Messages. New York: Wiley.

McCrank, L. (2002). Historical Information Science. Medford, N.J.: Information Today.

Miller, G. (1956). The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. The Psychological Review 63: 81-97. Retrieved 27 February 2006. http://www.well.com/user/smalin/miller.html

Neelameghan, A. (1972). Systems Approach in the Study of the Attributes of the Universe of Subjects. Library Science with a Slant to Documentation 9 (4): 445-472.

Olaisen, J. (1990). Information Quality Factors and the Cognitive Authority of Electronic Information. In: Information Quality: Definitions and Dimensions, ed. I. Wormell. London: Taylor Graham, pp. 91-121.

Online Dictionary of Library and Information Science. Retrieved 4 November 2005. http://lu.com/odlis/odlis_s.cfm

Ramakrishnan, R., and Gehrke, J. (2004). Database Management Systems, 3rd edition. McGraw-Hill Science/Engineering/Math.

Roiger, R., and Geatz, M. (2003). Data Mining: A Tutorial-Based Primer. Boston, Mass.: Addison-Wesley.

Markus F. Peschl
Dept. of Philosophy of Science
University of Vienna, Austria

Knowledge-Oriented Educational Processes From Knowledge Transfer to Collective Knowledge Creation and Innovation

Abstract: Over recent years a general understanding has been established that knowledge is one of the main resources, not only in economics and science, but also in our everyday life. The whole movement of knowledge management (e.g., Nonaka et al. 1995; Holsapple 2003, 2003a, and many others) is just one expression of this new understanding. Nevertheless, the notion, focus, and importance of knowledge are acknowledged only very slowly in the area of (university) teaching. The goal of this paper is to develop an understanding of teaching and learning processes which are "knowledge-based/driven"; i.e., the process of teaching and learning will be re-interpreted in the light of individual and collective knowledge (co-)construction and knowledge creation. It will turn out that concepts from knowledge management and organizational learning, as well as from constructivism, play a central role in this approach.

Taking Seriously Knowledge and Cognition in Educational Processes

A vast number of studies, evaluations, reviews, etc. (e.g., PISA) have brought to light that our students seem to have deficiencies in a number of skills, such as reading, mathematics, problem solving, etc. Apart from such rather superficial statistical results, both personal and reported experiences reveal that these intellectual and cognitive problems are almost omnipresent, not only in the knowledge-intensive industrial context, but also in our universities. One has to ask what the reasons and causes behind such deficiencies are.

Shifting our Attention Towards Cognitive Processes and a Knowledge Perspective

From an epistemological and cognitive science perspective, one has to shift the focus of attention from particular skills or competencies to the underlying cognitive operations which are responsible for knowledge processes on diverse levels. One can identify a lack of intellectual capacities in at least the following domains: (i) the domain of observing, (ii) of making abstractions and induction/classification, (iii) the capacity of profound understanding, (iv) of developing creative knowledge and solutions, and (v) the ability of reflecting. These capacities cannot be seen as separate from each other, as they are mutually dependent: in order to understand a phenomenon profoundly it is necessary to perform a precise and careful observation of the phenomenon and to make an abstraction over these observations; in order to arrive at an abstraction it is necessary to have some understanding. Philosophically and cognitively speaking, both understanding and abstraction are core cognitive capacities which are located in the higher cognitive areas/mind ("intelligence") (e.g., Wilson et al. 1999; Aristoteles 1995; Clark 2001; Haegel 1999). Reflection is a meta-competence necessary for shifting the framework of reference and for questioning one's own knowledge, premises, etc. Hence, in order to increase performance on the (superficial) level of skills and competencies, it will be necessary to find ways of strengthening the intellectual capacities mentioned above on a more basic level. However, most approaches to teaching/learning aim at the domain of skills and competencies (even at the university level).

From an epistemological perspective it can be shown that these approaches do not really aim at what is the peak of human cognitive capacities: generally speaking, skills concern rather superficial knowledge on the level of functionalities, algorithms, "know-how", techniques, "systems", "recipes", guidelines, methods, etc. Yet the human mind is designed to penetrate much deeper into reality, into the phenomenon of our interest. Our intellect is not satisfied with being able to grasp the functional aspects of a phenomenon (e.g., the dynamics of a particular system) or to control certain aspects of reality. Rather, both our cognition and most complex tasks in almost every field (of science, economics, technology, etc.) call for a profound understanding of the object under investigation first; only then can one start making decisions or taking action (in a responsible manner).

Generating Qualitative Knowledge?

From a knowledge perspective, what are the main deficits that can be observed? Apart from the classical complaints about the decay of the level of education and of competencies in almost every domain, it seems that we have to focus on problems and their causes on a more general and, at the same time, more profound level of knowledge and cognition. Several classes of intellectual problems/deficiencies and their causes can be identified, both in the field of (university) education and in more economic settings:

- A lack of capacities to generate qualitatively new knowledge and approaches.
- As one implication, the approach and attitude of problem solving is given a higher priority than creating and constructing new knowledge or developing a new vision.
- A lack of capacities in discovering, constructing, as well as understanding complex and global relationships between a large number of seemingly unrelated events or phenomena.
- A lack of creative solutions to complex problems and questions. This is due to our conservative way of solving problems and answering questions by applying well-established and successful "recipes" rather than (taking the risk of) inventing new solutions and trying innovative approaches.
- Tangible realities are found to be more "real" than what is behind these realities.

Performing a qualitative induction over these deficiencies, it turns out that the majority of them have a common cause lying at their root: a lack in the capacity to deeply understand and to intellectually penetrate the meaning of a phenomenon or reality. From a learning perspective this means that we have to take a closer look at the processes leading to this understanding, namely the processes of knowledge construction and knowledge creation.

Knowledge Sharing as Foundation for Educational Processes

Almost every teaching/learning situation has to be seen as a situation of knowledge sharing in one sense or the other. The goal of this process of knowledge sharing is to individually and collectively construct and create new knowledge as well as to develop a deeper understanding of the phenomenon under investigation.

Collaborative Knowledge Construction and Knowledge Creation

Figure 1 shows the elements and the processes which are involved in this knowledge-sharing cycle. Knowledge sharing always takes place between the poles of (i) individual knowledge (including all the cognitive processes leading to this knowledge), (ii) actually shared knowledge (in the particular group working together at a concrete moment), and (iii) organizational knowledge (comprising artifacts, etc.). In the domain of university teaching, organizational knowledge is realized as scientific knowledge in most cases. Contrary to shared/collective knowledge, scientific/organizational knowledge remains "alive" after the process of knowledge sharing and negotiation between the members of the group has come to an end. This is realized via artifacts holding the potentiality to be interpreted as meaningful chunks of (explicit, externalized) knowledge.

Figure 1: Elements and processes involved in the process of collaborative knowledge sharing and knowledge construction/creation (compare also Huysman and de Wit 2003).

The whole process is organized as a knowledge-sharing cycle aiming at the construction of knowledge. As Huysman et al. (2003) point out, this process is similar to Nonaka and Takeuchi's (1995) knowledge-creation cycle of socialization, externalization, combination, and internalization. It also has similarities to Berger and Luckmann's (1966) social constructivist concepts and the phases which can be discerned during the institutionalization of knowledge. From the knowledge-sharing cycle of Figure 1, three types of knowledge sharing can be derived (see also Huysman and de Wit 2003; Scharmer 2000, 2001). They are crucial for any educational process:

(i) Knowledge retrieval/downloading: knowledge retrieval basically is "downloading" already existing knowledge in the form of artifacts (e.g., from books, e-learning lectures, etc.). It is a process of knowledge retrieval from external explicit knowledge (artifacts) to individual knowledge. Most of our teaching processes are located in that domain.

(ii) Knowledge exchange: knowledge exchange takes place in the negotiation and externalization processes between the individual cognitive systems, leading to a space of shared/collective knowledge. Individual knowledge is shared in a learning process being realized in a communication setting.

(iii) Knowledge creation: the ultimate goal of knowledge-sharing processes (i) and (ii) is to generate and to create new knowledge. The construction of new knowledge is based on the combination of different sources of individual, collective, and organizational knowledge.

The whole cycle is organized as a feedback process, where the result of one cycle is the basis for the next round of knowledge creation. As will become evident in the sections to come, these types of knowledge sharing determine how teaching and learning processes are organized.

Technologies for Knowledge Construction and Knowledge Sharing

Each of these types of knowledge sharing can be associated with knowledge-sharing and educational technologies (in a broad sense). These technologies give educational processes a special flavor and favor certain types of learning/teaching strategies and knowledge processes. Sharples (2005) introduces a differentiation relating strategies of knowledge sharing and learning/teaching to certain media (see Table 1).

A. Strategy of knowledge sharing and learning/teaching: learning as knowledge transfer; learning as downloading and repeating of (well-established) mental models.
   Medium & context: print, textbook; download-style (e)learning ("first-generation eLearning").

B. Strategy of knowledge sharing and learning/teaching: learning as (individual and collective) knowledge construction and modeling.
   Medium & context: eLearning + collaborative aspect (virtual cooperation, communication, etc.).

C. Strategy of knowledge sharing and learning/teaching: learning as dialogue/conversation in context + creating/changing context.
   Medium & context: being embedded in a (concrete, physical, etc.) context; embodiment in reality, in social context, and in technological context; immediate/direct experience with reality and changing reality; mobile learning (technologies); "Socrates-like learning", "peripatetic learning"; the concept of "Ba".

Table 1: The relationship between strategies of knowledge sharing (teaching/learning) and the medium applied in that process (see also Sharples 2005).

In the era of mass print literacy (Table 1; A) the textbook (and its related form in the domain of eLearning) was the medium of instruction. This implies an understanding of learning as knowledge transfer and the downloading of mental models. Whereas in the knowledge-transfer perspective the goal of the learner is to repeat these models, knowledge, etc., "advanced" learning strategies aim at individual and collective knowledge construction and modeling (see (B)). Learning is a process of "coming to know"; learners, in cooperation with peers and teachers, construct knowledge and models which can be interpreted as transiently stable interpretations of their world (e.g., Foerster 1972; Glasersfeld 1984, 1989, 1995; Sharples 2005). Most eLearning technologies presently in use support these processes by providing platforms for presenting knowledge and for enabling communication and virtual collaboration and cooperation.

Going one step further (Table 1; C), learning is extended "back to the roots": this mode of learning and knowledge sharing/construction takes into account that each learner is embedded not only in an intellectual framework and in a virtual and artifactual environment, but also in his/her physical and social context. Furthermore, this approach respects the fact that the learner is not only a more or less passive recipient of knowledge constructing his/her mental models, but also actively interacts with his/her environment (compare also approaches to situated cognition; e.g., Suchman 1987; Clark 1999, 2001, etc.). I.e., the learner is capable of actively changing environmental dynamics and structures (e.g., by conducting experiments, by creating artifacts, etc.). In that sense, learning becomes a kind of "conversation/dialogue in context"; dialogue and conversation on multiple levels: with other learners, with external reality, with external knowledge, etc. In such a context, mobile learning technologies become highly interesting and effective tools supporting this kind of situated learning process. Furthermore, dialogue is understood in this context in the specific meaning of D. Bohm's and others' concept of dialogue (cf. Bohm 1996; Schein 1993).

The Concept of “ba”: a Space for Knowledge Sharing and Knowledge Creation If learning and knowledge-sharing processes are understood in such a way, we have reached quite a sophisticated level of knowledge work: namely the domain of knowledge creation. It can be compared to a concept well known in knowledge management, which is referred to as “ba”; Nonaka et al. (2003) describe it as follows:

Ba is a continuously created generative mechanism that explains the potentialities and tendencies that either hinder or stimulate knowledge creative activities… The knowledge-creating process is necessarily context-specific in terms of time, space, and relationship with others. Knowledge cannot be created in vacuum, and needs a place where information is given meaning through interpretation to become knowledge… We define ba as a shared context in motion, in which knowledge is shared, created, and utilized… Ba is a phenomenological time and space where knowledge, as 'a stream of meaning' emerges. New knowledge is created out of existing knowledge through the change of meanings and contexts… Ba is an existential place where participants share their contexts and create new meanings through interactions. Participants of ba bring in their own contexts, and through interactions with others and the environment, the contexts of ba, participants, and the environment change. Ba is a way of organizing that is based on the meaning it creates, rather than a form of organization such as hierarchy or network. (p. 6f)

In that sense, the concept of "ba" goes far beyond purely technological or educational issues: it concerns the general question of knowledge construction and knowledge creation and, more specifically, the conditions enabling and facilitating these processes. If we start to understand university teaching, and educational processes in general, in such a way, the character of both the knowledge taught and the learning strategies and pedagogical means will change dramatically. The sections to come will develop what these changes look like in more concrete detail. Finally, it has to be said that these three modes of learning and knowledge sharing do not exclude each other. Rather, mode C ("learning as conversation in context") is, to some extent, based on knowledge downloading and construction processes.

Modes of Knowing and Knowledge

We have come to the point where knowledge and knowledge construction are considered to be at the heart of educational and knowledge-sharing processes. In order to understand and improve learning/teaching processes according to the concepts developed above, we first have to take a closer look at the modes of knowing and knowledge that are involved in these processes. Table 2 gives an overview of these modes. The table identifies three domains, describing (i) the level of knowledge (in the sense of which realm of reality this level refers to), (ii) the cognitive activities which are necessary to construct and explore this realm, and (iii) the characterization of the knowledge which results from these construction processes.

Level 1: Behavioral level
Process/Activity: observing; detecting & registering; describing.
Resulting knowledge: description of the observed object, its behavior(-al dynamics), its external and superficial properties (e.g., material, etc.); list of observed properties ("data", "facts", etc.).

Level 2: Level of (emerging) patterns of behaviors
Process/Activity: searching for, constructing, and "discovering" regularities, patterns, and relationships; projecting patterns; quantitative induction; constructing patterns; single-loop learning.
Resulting knowledge: "explanation" of the observed behavior by making use of internal mechanisms (which are postulated and "projected" into the observed behaviors); these mechanisms are said to be responsible for generating the constructed (behavioral) patterns (i.e., these mechanisms are the "pattern generators"); "recipe knowledge" & algorithmic knowledge.

Level 3: Level of causes and the "source"
Process/Activity: searching for, constructing, and discovering causes, meaning, finality, etc.; the activity of "radical questioning"; discovering/constructing the intangible dimensions of reality; discovering & constructing the "deeper source", the "substance".
Resulting knowledge: understanding the observed phenomenon; understanding its meaning and the "source"/causes which are behind the mechanisms; knowledge about the intangible dimension of the observed reality; "deep knowing/knowledge", knowledge about the core of the reality; knowing "from within"; "metaphysical knowledge" (in the sense of knowing the "substance").

Level 4: Level of potentiality, change, and design
Process/Activity: exploring, discovering, and developing the potentials of a reality; making use of and bringing "deep knowledge" and the mechanisms to the domain of application; "facere" and design; changing reality according to knowledge.
Resulting knowledge: artifacts, technology; social, scientific, and cultural realities; organizations; visions + their realizations.

Level 5: Level of reflection (of the causes, source, patterns, processes of knowledge construction, etc.)
Process/Activity: reflecting; reframing; radical questioning (of one's mental models, premises, etc.); reflecting on the learning and construction process itself; reflecting on the design process itself; double-loop learning.
Resulting knowledge: knowledge about the following questions: What are the assumptions/premises behind these causes/this source? What are the mental models behind the observed behaviors, patterns, and the source? How can these premises and mental models be changed, and which effects would these changes have on the understanding of the whole phenomenon/reality?

Table 2: Levels of knowledge, modes of knowing, and the (cognitive) activities necessary for developing these modes.

From Observations to Causes Level 1 concerns the “superficial” properties of reality: our primary observation, perception, and cognitive processes bring about a rather superficial and singular (in the sense of referring to a single concrete object or phenomenon) kind of knowledge in a first step. This knowledge is realized as a list of observations, description of behaviors or behavioral dynamics, a list of data, facts, etc. It is not about more general and universal properties of the observed phenomenon, but describes this phenomenon on its behavioral level. Taking this descriptive knowledge as a point of departure and progress in the processes of construction, we are reaching the level of (emerging) patterns, trends, and relationships: they are not “directly perceivable” with our senses. In order to arrive at that level more complex and active construction processes are necessary. Normally, this is the domain of the (natural) sciences, where first relationships are constructed between facts and descriptions, and behavioral patterns begin to emerge. I.e., these patterns are the result of more or less complex inductive and constructive processes (in most cases being realized as statistical procedures). Most so-called (scientific) explanations are situated on that level: they offer cognitive, mental, or even physical mechanisms explicating the relationship between hidden (theoretical) structures and observed phenomena. These mechanisms are assumed to be “responsible” for generating the observed phenomena (compare, for instance, Maturana’s concept of scientific methodology; Maturana 1980, 1991)—by offering such a mechanism one can also offer an explanation for the constructed patterns and regularities by providing this pattern-generating mechanism. Hence the resulting knowledge mainly is concerned with the “how” and the dynamics of the observed phenomena. In many cases it has the form of “recipe-knowledge”. 
The cognitive activities leading to this kind of knowledge have strong structural similarities with the processes of theory/hypothesis construction well known from the natural sciences (e.g., Peschl 2001). From a learning perspective, these construction processes can be considered as epistemological optimization aiming at finding the best possible level of functional fitness (in the constructivist sense; e.g., Glasersfeld 1984, 1995); they are realized in a single-loop learning cycle (e.g., Argyris et al. 1996; Senge 1990). On level 3 we go one step further: on that level more qualitative issues are at stake. While level 2 was mainly concerned with rather quantitative and measurable matters, construction processes on level 3 aim at the realm of a phenomenon beyond its material, measurable, and tangible properties, such as its meaning, finality, etc. Philosophically speaking, this level concerns the exploration and the construction of causes (for instance, in an Aristotelian sense [1989]). It can be reached by applying intellectual tools, such as radical questioning, exploring the meaning, or trying to reach “deep understanding” of a phenomenon. The resulting knowledge, in a way, is the source for a deeper understanding of a phenomenon—i.e., the construction of a kind of “deep knowing/knowledge” (e.g., Jaworski et al. 2000; Scharmer 2000, 2001; Senge et al. 2004), knowing reality “from within”. From a constructivist perspective this may sound quite metaphysical, and, in fact, it is very close to metaphysics in the original sense (Aristotle 1989; Philippe 1991). However, it does not contradict a constructivist approach. Rather, it makes a statement about how classical, natural-science-inspired (and limited) construction processes can be overcome and led towards a more qualitative understanding of a phenomenon, e.g., by exploring its finality.

Creating Realities, Innovation, and Reflection

It is this level of deep knowing which also reveals another dimension of a phenomenon or reality: its potential(-ity) with regard to change. I.e., each reality is in a certain state at every point in time, and that state can change over time. Hence, there exists a space of potential change(s) at every moment; a space of possible changes which can happen to that reality. As a simple example, think of a stone which is given a new form by an artist: a process of “transformation” into a sculpture according to the artist’s plan or knowledge. This sculpture is one possible instantiation in the space of potentiality of that stone. Only if one has profound knowledge (level 3) about an object, a phenomenon, etc. is it possible to explore, construct, and develop the full potential of that reality. I.e., on level 4 we change the perspective from the mode of “contemplation” to the mode of “facere”/doing. The interesting point is that this level of knowledge not only explores the space of potentiality, but also realizes (some of) these possibilities. I.e., by applying knowledge from levels 1–3, new realities are constructed and physically instantiated, existing realities are changed, etc. In a way it is a “materialized constructivism” in which artifacts, design, technology, etc. are as much a product of this level-4 knowledge process as cultural, scientific, social, etc. realities. This mode of knowledge is the key to most processes of knowledge creation, of innovation, and of finding and instantiating a vision. In many cases it is not well established in educational processes. Finally, level 5 knowledge brings a completely new quality into the process of knowledge construction: the dimension of reflection.
This step has the potential to fundamentally question the knowledge constructed so far by reflecting on that knowledge and its premises, as well as on the construction and learning processes that have led to it. The cognitive activities, methods, and “epistemological technologies” applied in this process include deep reflection and questioning, systematic reframing, questioning of premises, ideologies, and construction processes, uncovering mental models and hidden assumptions, etc. This level of knowing introduces a completely new dynamic into the whole process of knowledge construction and knowledge creation, because it is situated on a meta-level and can bring up completely unexpected results and new perspectives which have not been considered so far. This mode of knowing and knowledge acquisition is realized in the double-loop learning strategy (e.g., Argyris et al. 1996; Senge 1990)—it is especially powerful when performed in a collective setting.

Modes of Knowing and Educational Processes

It is clear that these levels of knowledge (from Table 2 above) do not exclude each other—rather, they depend on each other and interact strongly. Knowledge-oriented educational processes do not mean that one only abstractly knows these modes of knowledge, but that these modes explicitly find their way into the design of the particular course. Normally, educational processes at university level do not go beyond level 2 (especially in the natural and technical sciences) and level 3 of Table 2. From what has been discussed above, it is essential to focus more on the processes of understanding and reflection; especially in our so-called “knowledge society”, which is rather a society whose intellectual pride is based on the ability to surf in a sea of unreflected and unrelated chunks of information, it is crucial to be trained in making the effort to understand things in their deeper dimensions, their relations, their meaning, etc. When a student has not only become familiar with these basic intellectual operations of deep understanding and reflection, but has also achieved some sovereignty in this domain, it will be very easy for him/her to quickly learn particular practical skills or competencies. How can we achieve a high level of these intellectual capacities of understanding and reflection?

1. Taking a radically knowledge-oriented perspective: i.e., the teaching/learning process has to take as its point of departure the whole spectrum of different forms of knowledge (see 0; Peschl 2003). Only if one is aware of this spectrum will it be possible to go beyond single-minded learning scenarios and relatively naïve learning outcomes and teaching goals. The focus is on reaching an integration of the theoretical understanding of a phenomenon, of knowledge concerning its functioning, as well as of knowledge and skills for dealing with it practically. Recent approaches in knowledge didactics take these aspects into consideration (e.g., Swertz 2004). Above that, the whole didactical process has to be based on an alternative understanding of knowledge: namely, knowledge is understood as a process rather than a static thing.
2. Taking a closer look at the structure of reality and at the construction processes responsible for our knowledge: reality or, more precisely, a specific phenomenon in reality is not just a homogeneous unity; ontologically speaking, there are various levels and domains which can be differentiated (see 0). These domains give rise to the different types of knowledge mentioned above. If one is aware of these levels of reality, it is possible to penetrate much deeper into the phenomenon of interest and, by that, achieve a more profound understanding which is not limited to a specific aspect. This implies that it is necessary to bring the operations, capabilities, and techniques of knowledge construction, observation, and reflection more into the focus of teaching processes (e.g., Peschl 2005; Senge et al. 2004; Scharmer 2001; Argyris et al. 1996).
3. Reframing and redefining the role of teachers as well as students—both are responsible for co-constructing and co-creating knowledge.
4. Educational processes are no longer “knowledge transfer processes”, but socio-epistemological processes of negotiating meaning and creating knowledge in a social as well as technological environment.

If students are supposed to reach a profound understanding and a high level of sovereignty and autonomy in a certain domain (of reality/knowledge), it is necessary to consider all of these levels of knowing and to implement them concretely in a particular course or curriculum. Reducing knowledge to only one or two of these levels may lead to highly specialized and efficient “optimizers” and well-adapted “recipe applicators”, but it will surely not bring forth persons with a highly open attitude, with exceptional potential for innovation and for developing radically new perspectives, and with a high level of reflection.

References

Argyris, C. and D.A. Schön (1996). Organizational learning II. Theory, method, and practice. Redwood City, CA: Addison-Wesley.
Aristoteles (1995). Über die Seele (De anima). Hamburg: Felix Meiner Verlag.
Aristoteles (1989). Metaphysik (third ed.). Hamburg: Felix Meiner Verlag.
Berger, P. and T. Luckmann (1966). The Social Construction of Reality. London: Penguin.
Bohm, D. (1996). On dialogue. London; New York: Routledge.
Clark, A. (1999). An embodied cognitive science? Trends in Cognitive Sciences 3(9), 345–351.
Clark, A. (2001). Mindware. An introduction to the philosophy of cognitive science. New York: Oxford University Press.
Foerster, H.v. (1972). Perception of the future and the future of perception. Instructional Science 1, 31–43.
Glasersfeld, E.v. (1984). An introduction to radical constructivism. In P. Watzlawick (Ed.), The invented reality, pp. 17–40. New York: Norton.

Glasersfeld, E.v. (1989). Cognition, construction of knowledge, and teaching. Synthese 80(1), 121–141.
Glasersfeld, E.v. (1995). Radical constructivism: a way of knowing and learning. London: Falmer Press.
Haegel, P. (1999). Le corps, quel défi pour la personne? Essai de philosophie de la matière. Paris: Fayard.
Holsapple, C.W. (Ed.) (2003). Handbook of knowledge management 1: Knowledge matters. Berlin, New York: Springer.
Holsapple, C.W. (Ed.) (2003a). Handbook of knowledge management 2: Knowledge directions. Berlin, New York: Springer.
Huysman, M. and D. de Wit (2003). A critical evaluation of knowledge management practices. In M.S. Ackerman, V. Pipek, and V. Wulf (Eds.), Sharing expertise. Beyond knowledge management, pp. 27–55. Cambridge, MA: MIT Press.
Jaworski, J. and C.O. Scharmer (2000). Leadership in the new economy: sensing and actualizing emerging futures. Cambridge, MA: Generon Consulting; Society for Organizational Learning (SoL).
Maturana, H.R. and F.J. Varela (Eds.) (1980). Autopoiesis and cognition: the realization of the living. Dordrecht, Boston: Reidel Pub.
Maturana, H.R. (1991). Science and daily life: the ontology of scientific explanations. In F. Steier (Ed.), Research and reflexivity, pp. 30–52. London; Newbury Park, CA: SAGE Publishers.
Nonaka, I. and H. Takeuchi (1995). The knowledge creating company. How Japanese companies manage the dynamics of innovation. Oxford: Oxford University Press.
Nonaka, I. and R. Toyama (2003). The knowledge-creating theory revisited: knowledge creation as a synthesizing process. Knowledge Management Research and Practice 1, 2–10.
Peschl, M.F. (2003). Structures and diversity in everyday knowledge. From reality to cognition and back. In J. Gadner, R. Buber, and L. Richards (Eds.), Organising Knowledge. Methods and case studies, pp. 3–27. Hampshire: Palgrave Macmillan.
Peschl, M.F. (2005). Acquiring basic cognitive and intellectual skills for informatics. Facilitating understanding and abstraction in a virtual cooperative learning environment. In P. Micheuz, P. Antonitsch, and R. Mittermeir (Eds.), Innovative concepts for teaching informatics, pp. 86–101. Wien: Ueberreuter.
Philippe, M.D. (1991). Introduction à la philosophie d’Aristote. Paris: Editions Universitaires.
Scharmer, C.O. (2000). Presencing: Learning from the future as it emerges. On the tacit dimension of leading revolutionary change. Helsinki School of Economics, Finland and the MIT Sloan School of Management: Conference on Knowledge and Innovation, May 25-26, 2000. http://www.dialogonleadership.org/PresencingTOC.html [02.02.2005].
Scharmer, C.O. (2001). Self-transcending knowledge. Sensing and organizing around emerging opportunities. Journal of Knowledge Management 5(2), 137–150.
Schein, E.H. (1993). On dialogue, culture and organizational learning. Organization Dynamics 22(2), 44–51.
Senge, P.M. (1990). The fifth discipline. The art and practice of the learning organization. New York: Doubleday.
Senge, P., C.O. Scharmer, J. Jaworski, and B.S. Flowers (2004). Presence. Human purpose and the field of the future. Cambridge, MA: Society for Organizational Learning.

Sharples, M. (2005). Learning as conversation: transforming education in the mobile age. In K. Nyiri (Ed.), Seeing, understanding, learning in the mobile age, pp. 147–152. Budapest: Hungarian Academy of Sciences.
Suchman, L.A. (1987). Plans and situated actions: The problem of human-machine communication. New York: Cambridge University Press.
Swertz, C. (2004). Didaktisches Design. Ein Leitfaden für den Aufbau hypermedialer Lernsysteme mit der Web-Didaktik. Bielefeld: Wilhelm Bertelsmann Verlag.
Wilson, R.A. and F.C. Keil (Eds.) (1999). The MIT Encyclopedia of the cognitive sciences. Cambridge, MA: MIT Press.

Ricardo Eíto Brun
Universidad Carlos III de Madrid (Spain)

Retrieval effectiveness in software repositories: from faceted classifications to software visualization techniques

Abstract: The internal organization of large software projects requires an extraordinary effort in the development and maintenance of repositories made up of software artifacts (business components, data models, functional and technical documentation, etc.). During the software development process, different artifacts are created to help users in the transfer of knowledge and to enable communication between workers and teams. The storage, maintenance, and publication of these artifacts in knowledge bases – usually referred to as “software repositories” – are a useful tool for future software development projects, as they contain the collective, learned experience of the teams and provide the basis to estimate and reuse the work completed in the past. Different techniques, similar to those used by the library community, have been used to organize these software repositories and help users in the difficult task of identifying and retrieving artifacts (software and documentation). These techniques include software classification – with a special emphasis on faceted classifications – keyword-based retrieval, and formal method techniques. The paper discusses the different knowledge organization techniques applied in these repositories to identify and retrieve software artifacts and to ensure the reusability of software components and documentation at the different phases of the development process across different projects. An enumeration of the main approaches documented in the specialized bibliography is provided.

1. Software repositories and knowledge organization techniques

Software reuse is one of the key practices in the design and implementation of software systems and applications. This practice gives us the key to reusing the knowledge and the artifacts built in previous projects when designing new solutions. To move from the textual representation of the software requirements to the final code, different artifacts must be created. These artifacts represent the knowledge embedded in the functional specifications, and are intermediate steps towards the final representation in programming code. Standard artifacts include not only the programming code, but also the functional and technical documentation, detailed specifications, test cases, etc. Software reuse techniques can be applied to the different artifacts created during the development process: from functional specifications to test cases, and not only to the source code itself. Regardless of the scope of the software reuse initiative, it requires the setup and maintenance of a software repository or library where the different artifacts are stored and managed; this repository must provide an efficient mechanism to enable the identification and retrieval of the stored artifacts. In this way, a software repository can be seen as an information system where users can access the software components and their related information using a retrieval subsystem. This makes these repositories similar to the information retrieval components traditionally used in document and bibliographic management systems. In the following sections we provide an overview of the main retrieval mechanisms proposed in recent years for the design and implementation of the retrieval subsystem in software repositories.

2. Keywords and controlled vocabularies

One of the first approaches to the problem of information retrieval in software repositories was based on keywords manually assigned by trained users to the different items stored in the repository. The use of controlled vocabularies to restrict the set of terms accepted as descriptors was part of these early initiatives. Once the items were indexed, the users of the software repository had to use the same keywords or indexing terms to formulate their queries. The GURU system is one of the best-known examples of this keyword-based approach. Other proposals – like LaSSIE (Large Software System Information Environment) and NLH/E – implemented more complex solutions with the adoption of lists of synonyms and thesauri containing term relationships.
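As a toy illustration of this early keyword-based approach (this is not the actual GURU or LaSSIE implementation; the descriptors, synonym list, and component names below are invented), a controlled vocabulary can be sketched as a synonym map that normalizes both indexing terms and query terms onto accepted descriptors:

```python
# Controlled vocabulary: free terms are normalized onto accepted descriptors.
SYNONYMS = {
    "sort": "sorting",
    "order": "sorting",
    "io": "file-handling",
    "read": "file-handling",
}

def normalize(term):
    """Map a term onto its accepted descriptor, if one exists."""
    term = term.lower()
    return SYNONYMS.get(term, term)

# Inverted index: descriptor -> set of component identifiers.
index = {}

def add_component(cid, descriptors):
    """Index a component under its manually assigned descriptors."""
    for d in descriptors:
        index.setdefault(normalize(d), set()).add(cid)

def search(query_terms):
    """Return components matching ALL query terms (Boolean AND)."""
    sets = [index.get(normalize(t), set()) for t in query_terms]
    return set.intersection(*sets) if sets else set()

add_component("quicksort.c", ["sorting", "array"])
add_component("fileutil.c", ["file-handling", "buffer"])

# "order" normalizes to the descriptor "sorting", so this finds quicksort.c.
print(search(["order", "array"]))
```

The Boolean matching over shared descriptors sketched here is exactly what required indexers and searchers to agree on the same controlled terms, the limitation that motivated the full-text approaches of the next section.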

3. Use of natural language

The use of full-text indexing techniques was the second step in the development of retrieval subsystems. Proponents of this approach remarked that it was difficult for end-users to use terms extracted from a controlled vocabulary; they also noted the subjectivity of an indexing process based on manually assigned descriptors. Today, automatically indexing the full text of documents and information related to software components is one of the standard approaches to the challenge of information retrieval in repositories. Systems based on this approach usually offer the capability of processing the source code file and extracting the comments entered by the programmers; these comments are then indexed to obtain relevant keywords. To identify these comments and metadata, they must be delimited or marked with special characters. In the case of documents, the indexing process can be executed against the full text. The main problem of this approach is that the comments and documentation related to a software component are usually too short to obtain all the benefits of full-text indexing techniques, which makes it difficult to identify relevant terms based on their frequency. To address this issue, Singleton (Singleton, 1993) proposed a solution based on the retrieval of keywords obtained from the names of the software components. In his proposal, the keywords gathered from the components’ names built the main inverted index. To solve the problems of synonymous and homonymous terms, the system also incorporated a dictionary with equivalences, term specializations, abbreviations, etc. These dictionaries provided users with a tool to extend the search by adding to the query terms related to those initially entered by the user.
Those who consider full-text indexing a better choice than controlled vocabularies have indicated the need to keep repository maintenance costs as low as possible, as well as the possibility of obtaining relationships between terms automatically by means of co-occurrence measures (Henninger, 1994). The indexing system would be in charge of building this knowledge structure to support a flexible searching process (adding value to those retrieval systems based on the use of Boolean operators and ranking algorithms). Today, full-text searching and indexing is a technology used in most companies and organizations, and it has become a commodity adopted by most software reuse projects. One of the most interesting projects where these techniques were used is the Agora project from the SEI (Software Engineering Institute, Carnegie Mellon University); in this project, full-text indexing based on AltaVista technology was combined with dynamic access to the interfaces of Java-based components (Seacord, 1998).
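A minimal sketch of the comment-extraction and full-text indexing strategy described above (the regular expression, the C-style comment delimiters, and the sample source snippet are illustrative assumptions, not a reconstruction of any particular system):

```python
import re
from collections import defaultdict

# Matches /* ... */ block comments and // line comments (C-style delimiters).
COMMENT_RE = re.compile(r"/\*(.*?)\*/|//([^\n]*)", re.S)

def extract_comments(source):
    """Return the text of all delimited comments found in a source file."""
    return [(block or line).strip() for block, line in COMMENT_RE.findall(source)]

inverted = defaultdict(set)  # term -> component identifiers

def index_component(cid, source):
    """Index a component under the words appearing in its comments."""
    for comment in extract_comments(source):
        for term in re.findall(r"[a-z]+", comment.lower()):
            inverted[term].add(cid)

index_component("stack.c", """
/* A bounded stack implementation. */
// push and pop operations
int push(int x);
""")

# Components whose comments mention "stack"; note that only comment text
# is indexed, not the code itself.
print(inverted["stack"])
```

As the section notes, such comment texts are short, so term frequencies are unreliable; real systems therefore combine this with name-based keywords or equivalence dictionaries.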

4. Faceted classifications

The best example of the impact of classical information retrieval techniques on the creation and maintenance of software repositories is the use of faceted classifications to organize the items in the repository. The author who led this approach was Prieto-Díaz (Prieto-Díaz, 1987) at the end of the eighties. In this approach, each software component is classified by means of a “notation” that contains information about the class to which the artifact belongs. This notation is built by assigning terms or keywords to the different facets or “aspects” used to describe the software component. The model proposed by Prieto-Díaz was initially based on the use of six facets. Each facet was linked to a set of terms or predefined values (a controlled vocabulary). In addition, for each accepted term there was a list of synonyms that helped the users in charge of classifying the artifact choose the appropriate terms for each facet. These synonyms were also available to the end-users of the retrieval subsystem when exploring the contents of the repository. This model also included information about the similarity (based on the co-occurrence of words) between the terms accepted for each facet. The purpose of managing this similarity measure was to automatically expand queries in those cases in which the items in the repository did not include the terms used in the query. The query could be expanded automatically by adding the terms closest to those entered by the user. The system proposed by Prieto-Díaz also offered a method to sort (rank) the results; retrieved items were sorted based on how easy it was to reuse them. This ranking was calculated from a set of variables assigned to the source code, among them: size in LOCs (lines of code), number of conditional sentences, or the experience of the user running the query.
Authors who consider faceted classifications a good approach to information retrieval in software repositories indicate that full-text indexing cannot be considered a definitive solution, as the textual descriptions provided with the source code are usually too short to be significant for retrieval. The benefit of faceted classifications is that they provide the precision that is needed in software repositories – a precision requirement greater than in standard document and bibliographic management systems. Past bibliography offers detailed information about projects where the faceted classification approach proposed by Prieto-Díaz was applied with success: GTE Data Services, Contel, and IBM RSL. More recent contributions also make use of retrieval subsystems based on faceted analysis. For example, Zhang (Zhang, 2000) describes a retrieval subsystem integrated with the MetaEdit+ CASE tool. In this proposal, the author proposes a faceted classification to describe the different objects in the repository. In addition, these items are organized in a hierarchy with three levels (component-unit level, diagram level, and project level). Each item is described by means of a record made up of different facets – the available facets depend on its level in the hierarchy.
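The core of the faceted scheme described above can be sketched as follows (the facet names, values, and component identifiers are invented for illustration; Prieto-Díaz's actual model used six facets and added synonym handling, term similarity, and reuse-effort ranking, which this sketch omits):

```python
# Each component is described by a record of facet/value pairs.
repository = {
    "qsort.c":  {"function": "sort",   "object": "array",  "medium": "memory"},
    "fwrite.c": {"function": "output", "object": "record", "medium": "file"},
}

def matches(record, query):
    """A record matches if every facet constrained by the query agrees."""
    return all(record.get(facet) == value for facet, value in query.items())

def search(query):
    """Return components whose faceted description satisfies the query."""
    return [cid for cid, record in repository.items() if matches(record, query)]

print(search({"function": "sort"}))                     # constrain one facet
print(search({"object": "record", "medium": "file"}))   # constrain two facets
```

Because every component must supply a value for each applicable facet, queries can be made as precise as the facet scheme allows, which is the precision advantage claimed over free-text descriptions.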

5. The need for classification and domain analysis techniques

The literature on software repositories and reuse usually puts faceted classification and domain analysis techniques together. Domain analysis can be defined as a process applied to identify, capture, and organize the information used in the development of a software system. The purpose of domain analysis is to make all this information reusable in the design of new systems.

Domains can be defined as the areas for which a software system or application is designed; the domain must have clearly established limits. Domain analysis techniques were introduced by Neighbors in 1981 (Prieto-Díaz 1990); this author defined the technique as “an activity that consists in the identification of the objects and operations of one class of similar systems, in a similar problem domain”. Domain analysis had as an objective the reuse of the artifacts created during the analysis and design phases, and not only the reuse of the programming code. Prieto-Díaz gave a wider scope to this definition by adding to its objectives the “development of an information infrastructure that allows the reuse”; the result of domain analysis would include “domain models, development standards and repositories of reusable components”. In another document, Prieto-Díaz described this technique as “the selection, abstraction and classification of functions, objects and relationships in a similar way to that used by librarians to design specialized classification systems” (Prieto-Díaz 1994). Today domain analysis refers to an activity completed during the analysis phase – at the beginning of the software development process – to identify the main classes or entities that the software system must handle. This concept is not necessarily related to the reuse of software artifacts or to the need to organize the result of these analysis tasks to enable reuse. Domain analysis must solve the problems usually found in setting up a software repository. One of these issues is the representation of the knowledge embedded in the different artifacts created during the project. When domain analysis was proposed, authors recommended the use of the methodologies applied at that time: SADT, structured analysis, etc. Regarding the classification and retrieval capabilities of the system, the main approaches were based on faceted classifications.
One of the best examples of repository models based on domain analysis was the DARE system (Frakes, 1998); DARE was not only a tool but also a methodology for completing domain analysis. It included a subsystem to extract terms from the programming code and related documentation, and it classified and aggregated terms automatically by means of clustering techniques.

6. Formal Methods

In the middle of the nineties, the use of faceted classifications became the “standard” method to organize software repositories and enable the efficient retrieval of artifacts. The main alternative to this method was based on the use of formal methods. Formal methods are based on mathematical representations of systems (Wing, 1990). This technique gives us the choice of detailing the initial specifications, the design, and the test cases (for later verification and validation) of any system. Formal method representations are used both when indexing components and when running queries against the repositories. The use of these techniques in the organization of software repositories has numerous supporters and documented examples. The advantage of formal methods is greater precision in the representation of artifacts (compared to specifications based on natural language). Their main drawback is that building artifact descriptions and queries with formal methods is difficult for both cataloguers and end-users of the retrieval subsystem. Retrieval subsystems based on formal languages are divided into two groups (Hemer, 2001):

a) systems where descriptions of components are based on their signatures – that is to say, on the input and output parameters or arguments used by the component – and

b) systems where descriptions of components are based on the behavior of the components. This type of subsystem offers more detailed descriptions, as they include pre- and post-conditions that specify the initial and final states of the components.

In both cases, the retrieval process starts with the formulation of a query also expressed in a formal language; the query would include a description of the components that the user wants to retrieve. The system will retrieve from the repository those components whose descriptions match the query.
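A signature-based subsystem of type (a) can be sketched as matching a query signature against stored component signatures (the repository contents and the exact-match policy are illustrative assumptions; real systems typically also allow relaxed matches, e.g. reordered or generalized argument types):

```python
# Component signatures: name -> (input parameter types, output type).
repository = {
    "max":    (("int", "int"), "int"),
    "concat": (("str", "str"), "str"),
    "length": (("str",), "int"),
}

def search(inputs, output):
    """Return components whose signature matches the query exactly."""
    return [name for name, (ins, out) in repository.items()
            if ins == tuple(inputs) and out == output]

# Query: a component taking two ints and returning an int.
print(search(("int", "int"), "int"))  # -> ['max']
```

A type-(b) subsystem would additionally attach pre- and post-conditions to each entry and prove (or check) that a stored component's conditions imply those of the query, which is what makes behavioral retrieval both more precise and harder to use.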

7. Evaluation and comparison of approaches

The different techniques applied in the organization and retrieval of information from software repositories were tested by William B. Frakes and Thomas P. Pole in 1994. The evaluation process had some similarities with the Cranfield project – and other similar initiatives – developed in the document retrieval area. The authors compared the behavior of retrieval subsystems based on faceted classifications, hierarchical classifications, metadata (property-value pairs), and keywords. The traditional criteria of recall and precision were used to measure the effectiveness of the different approaches. The conclusions of this study did not identify significant differences in retrieval effectiveness between the different techniques/subsystems; this conclusion was similar to those obtained in tests completed against document and textual databases. One interesting conclusion of this classical study was the recommendation that systems should provide different, complementary retrieval methods and techniques. This conclusion is similar to the principles expressed by Henninger (Henninger 1994, 1996). He proposed a system in which the retrieval process is based on successive interactions between the user and the retrieval system, using the traditional relevance feedback concept. In Henninger’s proposal the system shows a set of candidate terms to the user; the user can then select these terms to refine the query. The prototype designed by Henninger – CodeFinder – calculated the candidate terms through an activation process based on the co-occurrence of terms in the free-text descriptions of the components.
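The co-occurrence-based suggestion of candidate terms can be sketched as follows (the component descriptions and the simple pair-counting scheme are invented for illustration; CodeFinder's actual mechanism was a spreading-activation network, which this sketch only approximates):

```python
from collections import Counter
from itertools import combinations

# Free-text descriptions of the stored components.
descriptions = {
    "parser.c":  "parse tokens from an input stream",
    "lexer.c":   "split an input stream into tokens",
    "printer.c": "format and print a report",
}

# Count how often each pair of terms appears in the same description.
cooc = Counter()
for text in descriptions.values():
    terms = sorted(set(text.lower().split()))
    for a, b in combinations(terms, 2):
        cooc[(a, b)] += 1

def candidates(term, k=3):
    """Suggest the terms most often co-occurring with `term`; the user can
    select some of them to refine the query (relevance feedback)."""
    scores = Counter()
    for (a, b), n in cooc.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return [t for t, _ in scores.most_common(k)]

# Terms that co-occur with "tokens" across the component descriptions.
print(candidates("tokens"))
```

Each round of feedback narrows or redirects the query without requiring the user to know the repository's vocabulary in advance, which is the flexibility the study above recommends.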

8. Areas of improvement: Visualization and comprehension

One of the objectives that must be reached to obtain an effective retrieval subsystem is keeping the links between the different artifacts managed in the development process. The quality of the documentation and the maintenance of the relationships between textual descriptions and programming code and components are activities that can help improve the effectiveness of the retrieval subsystem. The lack of links between documents and components results in the need to invest a lot of time reading overwhelming documents that in most cases do not provide a clear understanding of how a specific functionality has been implemented, as these documents are not related to each other or to the real implementation. The use of diagrams based on a standard modelling language offers the capability to understand the code and artifacts retrieved from the repository. These models contain different aggregation levels that can also be used to expand searches based on free-text comments (an item in the repository is described not only by the free-text comment directly attached to it, but also by the free-text comments and documents attached to artifacts at upper levels).

9. Conclusions

Building software repositories is one of the areas in which different retrieval, classification, and knowledge organization techniques have been applied. The conclusions can be summarized as the need for flexible information retrieval systems that give users the possibility of reformulating queries interactively, and in which different retrieval techniques are used together and combined to obtain the best results. Another interesting point is the need for integration between CASE tools and the retrieval subsystems used in the repositories. This integration is interesting not only from the usability point of view, but also because it allows the retrieval process to exploit the network of related artifacts that CASE tools manage.

10. Bibliography
FRAKES, W.B.; POLE, T.P. An Empirical Study of Representation Methods for Reusable Software Components. IEEE Transactions on Software Engineering, vol. 20, no. 8 (Aug. 1994), p. 617-630
——; PRIETO-DÍAZ, R. Domain Analysis and reuse environment. Annals of Software Engineering, no. 5 (1998), p. 125-141
GIRARDI, M.R.; IBRAHIM, B. A Software reuse system based on natural language specifications. In: Proceedings ICCI'93: Fifth International Conference on Computing and Information, 1993, p. 507-511
HENNINGER, Scott. Using Iterative Refinement to Find Reusable Software. IEEE Software, vol. 11, no. 5 (Sep. 1994), p. 48-59
HEMER, David; LINDSAY, P. Specification-based Retrieval Strategies for Module Reuse. In: Proceedings of the Australian Software Engineering Conference, 2001, p. 235-243
PRIETO-DÍAZ, R. Domain Analysis: an introduction. ACM SIGSoft Software Engineering Notes, vol. 15, no. 2 (1990), p. 47-54
——. A Domain Analysis Methodology. In: Proceedings of the Workshop on Domain Modeling, vol. 34, no. 5 (1991), p. 89-97
——; FREEMAN, P. Classifying Software for Reusability. IEEE Software, vol. 4, no. 1 (1987), p. 6-16
SEACORD, Robert C.; HISSAM, Scott A.; WALLNAU, Kurt C. Agora: a Search Engine for Software Components. CMU SEI Technical Report CMU/SEI-98-TR-011, 1998
SINGLETON, Paul; BRERETON, P. Software Reuse: Some Positive Experiences (and Sweeping Conclusions). In: Proceedings of the Software Engineering Environments Conference, 7-9 Jul 1993. IEEE, p. 166-173
WING, Jeannette M. A Specifier's Introduction to Formal Methods. Computer, vol. 23, no. 9 (Sep. 1990), p. 8-24
ZHANG, Zheying. Enhancing Component Reuse Using Search Techniques. In: Proceedings of ISIS 2000, vol. 23

Olha Buchel
Faculty of Information and Media Studies, University of Western Ontario, London, ON, Canada

Uncovering Hidden Clues about Geographic Visualization in LCC

Abstract: Geospatial information technologies are revolutionizing the way we have traditionally approached navigation and browsing in information systems. Colorful graphics, statistical summaries, and geospatial relationships of underlying collections make them attractive for text retrieval systems. This paper examines the nature of georeferenced information in academic library catalogs organized according to the Library of Congress Classification (LCC), with the goal of understanding its implications for geovisualization of library collections.

Introduction: Recent advancements in geovisualization and geographic information retrieval have the potential to transform information systems into highly interactive tools for learners and information seekers. Google Local1 and MSN Virtual Earth2 are good examples of such transformations. The visual learning power of cartographic displays makes them highly desirable not only for information systems designed as geographic information systems, but also for systems containing geographic references in the form of text. Academic library catalogs are among the information systems rich in georeferences that can be represented cartographically. What a visualization of academic library collections should look like largely depends on the categories of georeferences contained in library metadata records and classifications, and on their connections with other subjects. In this study we propose to analyze georeferences in LCC with the goal of improving our understanding of cartographic visualization of library materials. We assume that the majority of records that have geographic subject headings also have georeferences in their LCC-based call numbers: commonly, geographic references recorded in subject headings are recorded in call numbers as well, unless the geographic aspect is unimportant. According to a report on the University of California MELVYL catalog (Petras, 2004), a large share of library records (53.87%) contain geographic subject headings. 70.56% of 832,108,482 OCLC records in 2002 had an LCC call number in the 050 MARC field, and 16.43% of records had this number in the 090 MARC field (OCLC, 2002). These numbers suggest that the share of georeferences in call numbers is large enough to facilitate geographic retrieval and visualization. Furthermore, the number of geographic references in call numbers is even higher if we count georeferences such as languages, literatures, religions, and ethnic groups.

Literature review: This study builds upon findings in information visualization, geovisualization, cognitive psychology, and geographic information retrieval. We begin our discussion with an introduction to the notion of representation, crucial for understanding visualization. A graphical representation is a building block of any visualization. Graphical representations are used both in cartography and in information visualization. They are defined as “an interpretable graphic summary of spatial information” (MacEachren, 1995) and “the way of representing abstract things” (Spence, 2001). A well-established form of representation in geography is the map. There is a great variety of map types: thematic, analogous, choroplethic, scattered-dot, proportional-circle maps, and so on. Other representations used in presentations of geospatial visualizations are timelines, map legends, and various representations of underlying collections: 3D spatial histograms of dataset counts, footprints of maps and images (Ancona, 2002), iconic stacks, differently shaped colored blocks (Ahonen-Rainio, 2005), and multidimensional icons (Spence, 2001). Together these representations facilitate data exploration and knowledge discovery. The difference between graphical representations and the representations used in library and information science is important to clarify as well. Library representations are associated with surrogate metadata records, which have to describe the original documents faithfully so that users can unequivocally recognize the collection item described in a record. This is not always true in the case of graphical and cartographic representations. Some representations with a high degree of abstraction (like Beck’s London Underground map) can be even better than representations that have a very high degree of fidelity (i.e., those that are accurate replicas of originals).
The reason is simple: “abstraction (that is schematization) and omission of information … reduces the otherwise unmanageable glut of information to an amount that can be processed by mental computing equipment” (Card, 1999, 11). Abstractions highlight the salient features of information and make them easy to comprehend. Cartographic representations are data dependent (Fairbairn, 2001), and geographic data has certain properties. First, it can be very precise or may lack precision. For some map-related tasks accuracy can be safety-critical: for instance, the task of aeronautical navigation (Peterson, 1996). For other tasks, categories may lack well-defined boundaries: for instance, when we talk about folklore in the Carpathian Mountains, it does not always matter which specific location in the Carpathians we refer to. Second, geospatial data are inherently structured in two (longitude and latitude), three (position above or below the Earth’s surface) or four (time) dimensions, while they are often unstructured in others. Third, geospatial data are typically collected at multiple scales, with fundamental differences in entities and their semantic structure across scales. For example, things defined as objects at one scale may be conceptualized as fields at another, or not represented at all (MacEachren, 1995, 4). And lastly, we should keep geographic classifications in mind. Geospatial classifications are not library classifications: they are sometimes designed specifically for visualization purposes, to designate areas with specific attributes. Examples of classifications suitable for visualization are classifications of languages, of religions, or of countries. Each category in such a classification fits into a nested hierarchy based on a container relationship and does not overlap with other categories in terms of space. Library classifications, however, often have spatially overlapping categories.
Until the 1970s geographic classifications included only mutually exclusive, non-overlapping categories (MacEachren, 1995; Peuquet, 2002). Nowadays cartographic animations, cartographic movies and interactive maps allow the display not only of non-overlapping classifications but also of overlapping and competing classifications, by overlaying multiple representations containing different classifications. For example, an interactive map (as well as an animation or a movie) may include a series of cartographic representations showing changes in administrative political boundaries, changes in linguistic territories, or climatic zones. Another important property of cartographic representations in geographic information systems is that they may have a composite structure and may include representations of space as well as representations of underlying collections. Cartographic representations of space show how people divide space, for example into countries, counties, provinces, biomes, soil zones, linguistic zones, physiographic features and so on. They are typically present in the base maps. A base map is “the framework layer upon which other layers of data are displayed” (Hill, in press). Graphical representations of collections provide summaries of collections; these can be any of the pictorial representations mentioned above (a timeline, a legend, a multidimensional icon and so forth). Despite their virtues, representations are not without limitations: the application of each form of representation is limited. “Just as there is no perfect screwdriver which optimally satisfies all purposes, users and circumstances, so there is no perfect form of representation” (Peterson, 1996, 13). One representation may be computationally efficient for one part of a problem-solving, reasoning, or concept-acquisition task, while another representation may be more advantageous for another part.
Thus, users in anthropology may find a map showing the location of ethnic groups more useful than a political map. A historian searching for materials on a specific battle in World War II may find a historical map more intuitive than a contemporary political map. A student studying a foreign language may prefer a linguistic map to an administrative map, because it conveys more information. A topographic map may assist better in a road-finding task than a city map (Ahonen-Rainio, 2005). Therefore, representations should receive significant attention in the design of visualizations for academic library collections.

Methodology: The choice of LCC for this analysis was not accidental. Unlike LCSH where georeferences are grouped in arrays in one facet, LCC presents georeferences in context. LCC contains various snapshots of geospatial knowledge in various domains. Each snapshot contains a linguistic representation of geographic space. These representations convey information about divisions of space. It appears that various disciplines do not have the same worldview. Moreover, views change with tasks and types of data. Geographers and psychologists (Peuquet, 2002) assert that linguistic and graphic representations are interlinked and can be translated from one form of representation into another. In this study we intend to explore the possibility of translating linguistic representations into pictorial representations. This approach is different from statistical content analysis that we usually carry out in information retrieval. We argue that for cartographic visualization, many other aspects are important besides statistical analysis. Cartographers, for example, look at the areal coverage, map scale, density of observations, why the data was compiled, and user tasks (Dodge, 2001). Linguistic representations of space can be found in georeferences in LCC. A georeference is a reference to a geographic location. Two major types of georeferences are differentiated in geographic information retrieval: explicit and implicit. In library catalogs explicit references are recorded in metadata and gazetteers as coordinates. Implicit or indirect georeferences can be found in placenames in subject headings, titles and bibliographic notes, geocodes, ISBN numbers, place of publication, language codes, call numbers, classifications. Indirect references require additional computational steps for them to become explicit (e.g., the system should be able to extract geographic names and assign coordinates to placenames) (Hill in press). 
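As a rough sketch of the extra computational step just mentioned, an implicit georeference can be made explicit by matching placenames in a subject heading against a gazetteer. The gazetteer here is a hypothetical three-entry dictionary with illustrative coordinates; a real system would query a full gazetteer service such as the Alexandria Digital Library gazetteer.

```python
# Toy gazetteer: placename -> (latitude, longitude).
# Entries and coordinates are illustrative only.
GAZETTEER = {
    "kyiv": (50.45, 30.52),
    "lviv": (49.84, 24.03),
    "carpathian mountains": (48.0, 24.0),
}

def georeference(subject_heading):
    """Turn implicit placename references into explicit coordinates."""
    text = subject_heading.lower()
    return {name: coords for name, coords in GAZETTEER.items()
            if name in text}

print(georeference("Folklore -- Carpathian Mountains -- History"))
```

Real geoparsing must additionally disambiguate homonymous placenames and handle multi-word names more robustly than simple substring matching; the sketch only shows the implicit-to-explicit step itself.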
Whenever it comes to geographic retrieval in library catalogs, we usually think of geographic subject headings as a place to start. Geographic subject headings include names of countries, cities, continents, physiographic features and so forth. However, both LCSH and LCC are imbued with a number of other concepts that bear a geographic connotation and can be represented in terms of coordinates. These are languages (Greek, Ukrainian, Russian, Slavic and others), religions (Russian Orthodox, Ukrainian Orthodox, and so forth), and ethnic groups (Russians, Basques, and so on). In this study we looked at various geographic indicators. To analyze linguistic representations we consider time, scale, types of georeferenced data, associations with certain geographic classifications, and overlapping categories in classifications. Three of these categories (time, scale and types of georeferenced data) require more detailed explanation.

Time. Time can be defined as “an interval, especially a span of years, marked by similar events, conditions, or phenomena; an era” (Lexico Publishing Group, 2006). Time in LCC is measured in periods. Periods are not merely convenient collections of years: they are thematic categories of time that require substantiation by the historian, literary critic, or some other specialist (Frommeyer, 2004, 200). In this study we look at temporal aspects of the geographic arrangements of LCC and their implications for cartographic visualization.

Scale. Scale is an important aspect of geographic categories. Library classifications allow indexing with concepts of varying specificity. For example, items can be indexed at the level of a hemisphere, a continent, a country or a region, a county or a province, a city, or a more specific place (like a museum, or any other place of interest). The reason we think it is important to consider scale is that all these categories may be associated with different representations and therefore should be linked to different representations at different scales.

Types of georeferenced data. Some georeferences may be associated with the aboutness of specific entities (e.g., museums, laboratories, scientific institutions, periodicals and so forth). We are trying to gain insight into what types of data are referenced geographically in library catalogs.

Discussion of observations: In this paper we present only a few examples of geospatial snapshots and offer possible visualization solutions. The most interesting sections for visualization can be found in schedules D3, E and F4. They are interesting because they combine temporal, geospatial, and topical aspects. These schedules include general history (D1-2009) and the history of individual parts of the world: Great Britain, Germany, Eastern Europe, the United States, and so on. In General History we can distinctly identify snapshots of historical periods and events. An example of an event is the section on World War II (D741-809). In this part of the schedule one finds a number of subjects with geographical arrangements by country. For example, military, naval, submarine, aerial, and engineering operations are organized by period and country or region. Furthermore, georeferences can be found not only in the names of geographic locations, but also in the names of the battles that took place in specific locations (e.g., Leningrad, Siege of, 1941-1944; Stalingrad, Battle of, 1942-1943). This snapshot presents a generalized linguistic representation of the world during World War II and provides references to all places that participated in operations. Progressions of the war through time and changes in borders are not well represented in this example, and the temporal aspect is often missing too: while some battles have temporal dates and ranges, the time when countries entered the war is not clearly stated. LCC records geographic knowledge in a static format. Whenever a new category appears, the new concept is added to the existing structure but does not reflect the changes in the shape of the regions (which is the main defining criterion for geographic classifications). For this reason, some geographic categories overlap in the section on World War II (for instance, Yugoslavia and Slovenia, Ruthenia and the Soviet Union), making it difficult to represent them cartographically.
Such overlaps are not numerous in this part of the schedule, however, and could probably be resolved with adequate representations. Classification schedules allow the linking of resources at different geographic scales: countries and regions, continents (e.g. D766.5 Africa: general works), as well as individual cities and places. This is different from a collection on Google Local, where all items are georeferenced with the highest specificity and accuracy and therefore can be linked to the most detailed representation. To represent the world of World War II graphically, we should carefully examine the territories of the countries in World War II and their changes, and decide which political map will be able to represent this part of the LCC classification, and whether one representation can offer a viable solution at all. In history schedules that focus on the history of individual countries, the linguistic representation of space and time looks different. These snapshots suggest that visualization will be more interactive, because the classifications are arranged not only geographically but also chronologically, and therefore the cartographic representation should also include a timeline. Each country will have its own representation of time, and moving a slider along the timeline should invoke changes in the cartographic representations. Consider, for example, “History of Russia. Soviet Union. Former Soviet Republics” (DK1-949.5). The changes in the names of Russia (Kievan Rus’, Muscovy) and the time periods linked to the reigns of individual tsars remind us of the geospatial transformations of Russian territories. This part of the LCC classification (D70.A2 - D293) can be linked to a series of historical cartographic representations showing these transformations. A presentation of a library collection with historical maps may provide users with a better understanding of library collections.
Besides individual histories of various countries, history schedules also include georeferences denoting local history, where one can find information about a specific place (country, county, province, city, historical place). This information is not time sensitive. DK500 and DK508 include such references to places in Russia and Ukraine. Georeferences within these classes typically point to one location. If the name of the place changes, the new names are added to the class, but the call number stays the same. It is possible that a contemporary political map with various scales may well represent these parts of LCC. The language and literature schedule P5 has interesting sections suitable for geovisualization as well. These are the sections on individual literatures and languages: e.g., Russian language and literature (PG2001-2826, PG2900-3698) and Ukrainian language and literature (PG3801-3987). They are organized by topics and literary time periods. Time periods are different for each literature; periodization in literature serves to demarcate recognizable contours of style, trends, shifts in literary creation, or criticism. Within each time period, LCC subjects are organized by individual authors and their works. Authors are listed alphabetically, but can also be rearranged chronologically, because each author’s name is followed by the dates of birth and death. Since literatures and languages are listed by language or literature, perhaps the most suitable cartographic representation for all languages will be a linguistic map. The LCC classification of languages resembles the classification of languages described in (Ruhlen, 1987). Such classifications serve as foundations for the design of linguistic maps, as shown in Figure 1.

Figure 1. Linguistic Map of Northern European Russia. Map from Gordon, Raymond G., Jr. (ed.), 2005. Ethnologue: Languages of the World, Fifteenth edition. Dallas, Tex.: SIL International. Used with permission.

It is also possible that we should think of multiple linguistic maps. Contemporary linguistic maps include only living languages, while LCC includes topics related to extinct languages, like Old Church Slavonic; extinct languages could be represented on historical linguistic maps. Caution should be observed when selecting a proper linguistic map, because not all languages are present on every linguistic map. Each cartographic representation is derived statistically, and some categories may be omitted because of low counts even though they are present in library classifications. For example, Csángó (a Hungarian or Romanian dialect) is spoken in villages in Romania and Moldova. Library collections have resources and subjects about this language and its folklore, but it is often hard to find the language on linguistic maps, because its usage is restricted to a small territory. Another snapshot from LCC may provide clues to the visualization of archeological collections (GN778.52-885). These are organized by periods: Stone Age, Paleolithic, Mesolithic, Neolithic, Copper and Bronze Ages, Iron Age. Within these periods the schedule has divisions by special cultures and peoples, as well as geographical divisions by region or country. There exist many historical maps that show locations of ancient cultures and peoples and that could possibly provide solutions to the visualization of these schedules. These representations can be used alongside historical maps or political maps showing administrative units; each presentation should include a time scale. Many other snapshots of linguistic representations of worldviews can be found in other LCC schedules.
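Linking call numbers to such representations presupposes a way to decide whether a given LCC call number falls within a schedule range. A minimal sketch follows; the ranges are taken from the PG and GN examples discussed above, while real LCC parsing (Cutter numbers, decimal extensions, date elements) is considerably more involved.

```python
import re

# Illustrative schedule ranges from the discussion above:
# (class letters, lower bound, upper bound, label).
RANGES = [
    ("PG", 2001, 2826, "Russian language and literature"),
    ("PG", 2900, 3698, "Russian language and literature"),
    ("PG", 3801, 3987, "Ukrainian language and literature"),
    ("GN", 778.52, 885, "Prehistoric archaeology"),
]

def parse_lcc(call_number):
    """Split an LCC call number into its class letters and number."""
    m = re.match(r"([A-Z]+)\s*(\d+(?:\.\d+)?)", call_number)
    return (m.group(1), float(m.group(2))) if m else None

def classify(call_number):
    """Return the label of the range containing the call number, if any."""
    parsed = parse_lcc(call_number)
    if parsed is None:
        return None
    letters, number = parsed
    for cls, lo, hi, label in RANGES:
        if letters == cls and lo <= number <= hi:
            return label
    return None
```

For instance, classify("PG3850") would resolve to the Ukrainian range, while a call number outside all listed ranges returns None; a visualization could use such labels to route each record to the appropriate linguistic or historical map layer.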
Besides such snapshots, one may notice that geographic arrangements are used for the geographic distributions of invertebrates, insects, vertebrates, and local plants in schedule Q; for the arrangement of books about individual diseases, hospitals, clinics, hospices and nursing homes in schedule R6; for corporations, land use, vital events, dictionaries, societies, and congresses in schedule H7; and for handbooks, sports, weapons, arms, military life, customs and ceremonies in schedule U8. Throughout all the schedules, georeferences refer to directories, services, boards, history, laboratories, institutes, research and experimentation, study and teaching, schools, statistics and surveys, practices (e.g., medical practice), catalogs and collections, publications (encyclopedias, periodicals, bibliographies, and so forth) and especially often to museums and exhibitions. Knowing the types of georeferenced entities may allow us to devise pictorial representations for each individual type and to filter the resources in the visualizations.

Conclusion: We explored the nature of georeferenced information in LCC and may conclude that each domain of knowledge in LCC might call for different geographic representations, since divisions of space and references to space vary from one discipline to another. The advantages of having multiple representations in the geovisualization of library collections are numerous: multiple representations can be used for filtering collections according to topic, time, scale and document type, for designing effective and meaningful summaries of collections, and for knowledge discovery.

Notes:
1 http://maps.google.com/
2 http://virtualearth.msn.com/
3 Library of Congress. (1995). Super LCCS. Class D. Subclasses D-DJ, History (general), history of Europe, part 1: Gale's Library of Congress classification schedules combined with additions and changes through ... Detroit: Gale Research Inc.
4 Library of Congress. (2000). Library of Congress classification. E-F, History, America. Washington, D.C.: Library of Congress, Cataloging Distribution Service.
5 Library of Congress. (1999). Library of Congress classification. PB-PH, Modern European languages. Washington, D.C.: Library of Congress, Cataloging Distribution Service.
6 Library of Congress. (1996). Super LCCS. Class R. Medicine: Gale's Library of Congress classification schedules combined with additions and changes through ... Detroit: Gale Research Inc.
7 Library of Congress. (1997). Class H: Social sciences. Washington, D.C.: Library of Congress, Cataloging Distribution Service.
8 Library of Congress. (1996). Super LCCS. Class U. Military science: Gale's Library of Congress classification schedules combined with additions and changes through ... Detroit: Gale Research Inc.

References:
Lexico Publishing Group, LLC. (2006). Dictionary.com. Retrieved February 10, 2006, from http://dictionary.com.
Ahonen-Rainio, P., and Kraak, M.-J. (2005). “Deciding on fitness for use: evaluating the utility of sample maps as an element of geospatial metadata.” Cartography and Geographic Information Science 32(2): 101-12.
Ancona, D., Freeston, M., Smith, T., and Fabrikant, S. (2002). Visual explorations for the Alexandria Digital Earth Prototype. In K. Börner and C. Chen (Eds.), Visual interfaces to digital libraries. Lecture Notes in Computer Science 2539: 199-213.
Card, S. K., Mackinlay, J. D., and Shneiderman, B. (1999). Readings in Information Visualization: Using vision to think. San Francisco, CA: Morgan Kaufmann Publishers.
Dodge, M., and Kitchin, R. (2001). Mapping cyberspace. New York, NY: Routledge.

Fairbairn, D., Andrienko, G., Andrienko, N., Buziek, G., and Dykes, J. (2001). “Representation and its relationship with cartographic visualization.” Cartography and Geographic Information Science 28(1): 1-29.
Frommeyer, J. (2004). “Chronological terms and period subdivisions in LCSH, RAMEAU, and RSWK: Development of an integrative model for time retrieval across various online catalogs.” Library Resources and Technical Services 48(3).
Hill, L. L. (in press). Georeferencing: the geographic association of information. To be published by MIT Press.
MacEachren, A. M. (1995). How maps work: representation, visualization, and design. New York, London: The Guilford Press.
OCLC (2002). Field and subfield statistics (weighted by OCLC holdings). 2005.
Peterson, D. (Ed.) (1996). Forms of representation: an interdisciplinary theme for cognitive science. Exeter: Intellect, Ltd.
Petras, V. (2004). Statistical analysis of geographic and language clues in the MARC record. (Technical report for the “Going Places in the Catalog: Improved Geographical Access” project, supported by the IMLS National Leadership Grant for Libraries, Award LG-02-02-0035-02), University of California, Berkeley.
Peuquet, D. J. (2002). Representations of space and time. New York: The Guilford Press.
Ruhlen, M. (1987). A Guide to the World's Languages. Stanford: Stanford University Press.
Spence, R. (2001). Information visualization. New York: Addison-Wesley.

Kerstin Zimmermann, physik.org Vienna
Julika Mimkes, ISN Oldenburg
Hans-Ulrich Kamke, Humboldt University of Berlin

An Ontology Framework for e-Learning in the Knowledge Society

Abstract: Efficient knowledge management is essential within the information society. Lifelong learning as well as the use of new media have led to e-Learning of different kinds. In order to combine existing resources, a general description of this topic is needed. The semantic web aims at making such meta data machine understandable. In this paper we present our Ontology Framework for e-Learning. After the introduction we review existing approaches and describe our general view of the concepts. In chapters 4 and 5 we present different views of our framework aimed at the intended application areas, such as material- or user-centred approaches, and end with the conclusions.

Introduction
E-Learning has become an important topic during the last few years. E-Learning can also be seen as knowledge generation, dissemination and (re)use, and the knowledge(-based) society is a common term now. More and more information is being published, in particular in the scientific world. Bricks-and-mortar university libraries have been providing literature for studying and learning for a long time; digital libraries and other content providers now offer additional material for e-Learning. Software companies want to sell their platforms, and publicly funded institutions have the mission to educate. The private individual has to ensure lifelong learning. At first, legacy learning material was digitized: e.g., scripts were scanned and an electronic bookshelf was offered to a course. But the new media also require new didactics dealing with blended learning and different channels of moderation and presentation [2,7]. This is a time-consuming and cost-intensive task. Experts have to decide what is relevant for a subject and how to classify it. Libraries use their classification schemes for book indexing, but scientists may use other keywords of their field when they search for information. A student not familiar with the terminology easily gets lost in the wealth of information. In order to overcome the limitation of a single perspective and to relate the content to outside resources, the semantic web offers the functionality of machine-understandable meaning. Ontologies are explicit specifications of concepts in a given context and can be expanded. But at the moment only a few ontologies [8] exist that are based on the formal logic criteria of computer science. User aspects are also neglected within this development.

Meta Data and the Semantic Web
As the amount of information on the Internet is rapidly increasing, meta data are more and more necessary to classify and describe it. In this paper we lay out different ordering schemes which are widely used. We start chronologically with a meta data set used by libraries, go on with another used by scientific authors, followed by two integrated approaches, and try to name the components suitable for e-Learning material. At the end we discuss ontologies.

Meta data schemes
Digital libraries provide their users mainly with online customer services, like a web interface to the OPAC for retrieval and account settings such as the extension of a loan period, reservation of books, etc. Getting the material itself mostly requires one's physical presence. MARC21 (MAchine-Readable Cataloguing) was introduced in the 1960s for libraries and later became a standard. Its main purpose was locating a book on the shelves of the library and converting the catalogue cards into electronic form. The main bibliographic data are given in a specific order and describe 8 types of material: Book, Continuing resource, Computer files, Maps, Music, Sound recording (non-music), Visual materials, Mixed materials. For the classification of e-Learning, 5 types of material (continuing resource, computer files, visual/mixed material, maps) and 3 types of record (manuscript, computer file, cartographic material) can be used out of these sets. BibTeX was designed by Patashnik and Lamport in 1985 as the LaTeX bibliographic format. LaTeX is an open-source document preparation system widely used in the academic community. Authors provide their references to other publications entirely character based, so that they can be shared by the community on the Internet. The type of publication can be classified according to 12 different categories: String, Book, InBook, InCollection, InProceedings, Proceedings, Article, MastersThesis, PhDThesis, TechReport, Manual, Misc. As a reference to e-Learning material, Misc and InCollection would be used here. But in the era of the WWW everybody can become a provider. The Dublin Core (DC) Initiative was started 10 years ago by librarians in order to provide a meta data standard that supports a broad range of purposes and business models. Nowadays research communities, corporate knowledge management, e-government and the public sector use its 15 core elements for their objectives.
The core elements are: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage and Rights. Resources should be classified by ordinary users and made accessible over the Internet for online retrieval. For 'Type' the following 12 subclasses are provided: Collection, Dataset, Event, Image, MovingImage, StillImage, Sound, Text, InteractiveResource, PhysicalObject, Service and Software, of which Collection, InteractiveResource and Software are arguably the most useful for e-Learning as an extension of traditional material like books. As the examples above show, identifying e-Learning material on the web in particular is not an easy task. For a more detailed description of such resources, DINI, the German Initiative for Network Information, issued a recommendation compiled by different experts. Table 1 lists its 21 elements, separated for course and content: the mandatory fields assigned to a course appear in the second column, and the equivalent or identical field for content appears in the last column of the same row. Note that the duration of content (line 14) is only optional, whereas (technical) requirements are mandatory for content in contrast to course. Line 15 pairs document type with format, which are closely related; ECTS points and Copyright (line 17), and Classification and Memory size (line 19), do not correspond to each other and share a row only for schematic reasons.

No.  e-Learning course                      e-Learning content (material)
 1   Title (m)                              Title (m)
 2   Author (m)                             Author (m)
 3   Keyword (m)                            Keyword (m)
 4   ELAN classification (m)                ELAN classification (m)
 5   Description (m)                        Description (m)
 6   Publisher (m)                          Publisher (m)
 7   Date of publishing (m)                 Date of publishing (m)
 8   Date of change (m)                     Date of change (m)
 9   Identification (m)                     Identification (m)
10   Language (m)                           Language (m)
11   Terms of use (m)                       Terms of use (m)
12   Audience (m)                           Audience (m)
13   Version (m)                            Version (m)
14   Duration of course (m)                 Duration (o)
15   ELAN document type (m)                 Format (m)
16   Relation (e-Learning content) (m)      Relation IsPartOf (e-Learning course) (m)
17   ECTS points (m)                        Copyright (m)
18   Title alternative (o)                  Title alternative (o)
19   Classification (o)                     Memory size (o)
20   Persons involved (o)                   Persons involved (o)
21   Requirements (o)                       Technical requirements (m)

Table 1: The DINI e-Learning meta data elements [9]; m: mandatory, o: optional
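As a sketch of how the Table 1 profile might be applied in practice, the following checks a metadata record against the mandatory fields; the field names follow Table 1, while the sample record and all its values are hypothetical:

```python
# Sketch: validating an e-Learning metadata record against the
# mandatory DINI fields of Table 1 (17 mandatory fields each).
MANDATORY_COURSE = {
    "Title", "Author", "Keyword", "ELAN classification", "Description",
    "Publisher", "Date of publishing", "Date of change", "Identification",
    "Language", "Terms of use", "Audience", "Version", "Duration of course",
    "ELAN document type", "Relation", "ECTS points",
}

MANDATORY_CONTENT = {
    "Title", "Author", "Keyword", "ELAN classification", "Description",
    "Publisher", "Date of publishing", "Date of change", "Identification",
    "Language", "Terms of use", "Audience", "Version",
    "Format", "Relation IsPartOf", "Copyright", "Technical requirements",
}

def missing_fields(record, kind="course"):
    """Return the mandatory Table 1 fields absent from a metadata record."""
    required = MANDATORY_COURSE if kind == "course" else MANDATORY_CONTENT
    return sorted(required - set(record))

course = {"Title": "Intro to Optics", "Author": "N.N.", "Language": "de"}
print(missing_fields(course))  # 14 mandatory fields are still missing
```

A record is only complete for cataloguing once `missing_fields` returns an empty list; the optional fields (lines 18 to 20, plus Duration for content) are deliberately not checked.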

The Learning Object Meta data (LOM) scheme by the IEEE working group 12 provides a more detailed scheme especially for e-Learning material, consisting of the following 19 elements: Source, Structure, Aggregation Level, Status, Role, (Meta) Role, Type, Name, Interactivity Type, Learning Resource Type, Interactivity Level, Semantic Density, Intended End User Role, Context, Difficulty, Cost, Copyright and Other Restrictions, Kind, Purpose.

Ontologies and the semantic web
Ontologies are the 'new' form of semantic description and are being formalized in computer science. Coding and reasoning are the main focus at the moment; real applications are still missing, though. This requires input from the communities and should include different tasks and aspects in various areas. Ontology tools such as editors are still missing to aid coding the intellectual input of the fields in the correct way. One definition in computer science, a 'shared conceptualisation', can be given more formally: an ontology O is a 4-tuple O = (C, R, I, A), where C is a set of concepts, R is a set of relations, I is a set of instances and A is a set of axioms. Ontologies have to be implemented as networks of meaning, in which heterogeneity is an essential requirement from the very beginning. Expert ontologies form a specific part and should be consistent within their community. So far, some initiatives and projects in the academic field have come up with first stable versions in OWL, which are documented as well. These are:
- FOAF¹ (Friend Of A Friend), for communities: describes the homepages of people, the links between them and the things they create and do. Coverage: Person as 1 concept with 10 properties.
- SWP² (Semantic Web Portal), an ontology for scientific portals. Coverage: Person (Agent and Organisation), Publication and Conference as the 3 main concepts; 68 classes, 21 data properties, 57 object properties.
- MarcOnt³, for digital libraries: ongoing and still under construction; maps between the three formats MARC21, DC and BibTeX.
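The 4-tuple definition O = (C, R, I, A) can be made concrete with a toy example; all concept, relation and instance names below are invented and are not part of any of the ontologies discussed here:

```python
from collections import namedtuple

# O = (C, R, I, A): concepts, relations, instances, axioms.
Ontology = namedtuple("Ontology", ["C", "R", "I", "A"])

O = Ontology(
    C={"Person", "Publication"},
    R={("Person", "authorOf", "Publication")},
    I={("alice", "Person"), ("paper42", "Publication")},
    A={"every Publication has at least one authorOf Person"},  # informal axiom
)

# A minimal consistency check: every instance is typed by a known concept.
assert all(concept in O.C for _, concept in O.I)
```

A reasoner would additionally have to enforce the axioms in A; here they are only stored as strings, which is enough to show the shape of the 4-tuple.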

Semantic digital libraries try to integrate user scenarios like shared bookmarks, comments on reading, and new forms such as web logs. This stresses the community aspects and also provides feedback loops. Big commercial e-Learning platforms like WebCT Vista and Blackboard offer the technical perspective with specific features like forums, assessments and workflow. They are course-centred and not interoperable: the closed area allows no collaboration between universities using different software. Documentation of their interfaces is not officially available, and they also lack an import facility for existing meta data and ontologies; data description is done only on an internal basis. To close the gap we invite all players to contribute to our interdisciplinary project. We start with the integration of existing schemes, elements and ontologies; service description and definition will follow. In this way users and providers can bridge their demands and needs.

The Ontology Framework
In this chapter we present our ontology framework for e-Learning. We start with general considerations of the network and its components. We then present the course-centred view, currently the common perspective in the literature. Platforms today are grouped around a course or topic, not around the user [1]. They use different software and provide a variety of functionalities. Material is, at best, described by some meta data or classified by a whole scheme like DC or LOM [3, 9]. The didactics depend on the material and platform chosen. After that we switch to the user-centred view as the key feature of the semantic web and take it from there. In the last part we mention the application areas we have in mind and how they fit into our considerations so far.

General View
Within our framework we focus on nine main concepts. The top-level ones are: Topic (= Subject), Person, Didactics, Platform and Material. Fig. 1 shows the semantic network of our first version. The main concepts are drawn as ellipses, subclasses are marked as boxes, and HasAttribute relations give the more detailed links; the lines do not represent all attributes completely. Person is also a subclass of Agent, in reference to the SWP ontology of chapter 2, and can take several roles: user or provider, student or lecturer, pupil or teacher, but also administrator or technician. Author is another key player and an important term in reference to DC:Creator. Student will be the main figure in the user-centred view. Material is part of a Course, in correlation with the DINI set and the course-centred view. Learning objects include scripts, tests and book chapters, but also multimedia content, instructional content, learning objectives, instructional software and software tools.
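A minimal sketch of how the Fig. 1 relations (HasTopic, HasPlatform, HasMaterial, HasDidactics, and roles for Person) could be held as subject-relation-object triples; all identifiers below are hypothetical:

```python
# Sketch of the Figure 1 semantic network as subject-relation-object triples.
# Concept and instance names are illustrative, not from a real dataset.
triples = {
    ("Course:optics101", "HasTopic", "Topic:Physics"),
    ("Course:optics101", "HasPlatform", "Platform:physik-multimedial"),
    ("Course:optics101", "HasMaterial", "Material:script17"),
    ("Course:optics101", "HasDidactics", "Didactics:self-study"),
    ("Person:alice", "HasRole", "Role:Student"),
}

def objects(subject, relation):
    """All objects linked to `subject` via `relation`."""
    return {o for s, r, o in triples if s == subject and r == relation}

print(objects("Course:optics101", "HasTopic"))  # {'Topic:Physics'}
```

Keeping the network as a flat set of triples makes it straightforward to export later into a formal representation such as RDF or OWL.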

Figure 1: Ontology concepts, subclasses and relations for e-Learning; main concepts (eLearning, Topic, Person, Platform, Material, Didactics, Course) are in ellipses, subclasses (Tutor, Student, Lecturer, Author, Reviewer, Admin, Provider; script, test, book chapter) in boxes, and HasAttribute relations (HasTopic, HasPerson, HasPlatform, HasMaterial, HasDidactics; Audience, Level) are shown as lines.

Course-centred view
The course-centred view shows the perspective of the providers and creators of e-Learning platforms. Providers take care of the smooth technical operation of the platforms. Since providers of (open-source) platforms often enhance the software and infrastructure, e-Learning is represented by the courses and materials that form its centre. Often, no dedicated didactical concept is associated with the platforms; human interaction and intellectual input are needed to keep the technology alive. The users here are authors and teachers. They need not be the same person: a teacher can easily (re)use other authors' materials. Authors create material, which may be used in different courses with different didactical concepts. Teachers arrange their own and external material into a course, using a didactical concept. The didactics depend on the level and the subject of the individual course. Courses are then taken by students, chosen by level and subject. Depending on the concept and the didactics, students are able to interact with the course and the material. This may be very passive, like merely downloading files from the platform or simply from a homepage without any other action; but they might also be active in a forum, giving feedback to the teachers and/or authors, who then might change or adjust their material. Students may also contribute their own material and documents to the platform. It would be a real enhancement for courses and materials, and for teaching overall, if independent referees evaluated them; unfortunately, this is commonly accepted only for scientific publications today. In recent years, though, students have started to evaluate their courses, supported by their universities. To be able to use e-Learning offerings, students and teachers need support from their institution, e.g. their university: support for the platform, but also technical infrastructure like computer pools, projectors, (W)LAN and more.
Both groups also depend on further education, such as media training or the use of authoring tools. A modern university should be able to provide all of this. Often, the platform is offered by the university as well; in some cases, though, private companies are responsible for provisioning and supporting e-Learning platforms. Furthermore, libraries have started to collect, provide and archive not only print media but also digital media. As a consequence, libraries catalogue and archive e-Learning courses and materials from e-Learning platforms. For this, special sets of meta data are needed; mostly these are based on LOM, Dublin Core or PICA.

User-centred view
Starting with the user, the following view emerges. A person in the main role of student already has knowledge and experience, so a certain level and subject can be defined. Attending a new course will focus on improving specific skills for further education. To use material, a technical infrastructure is needed; detailed descriptions will help to fulfil the requirements and to choose the best fit. For the user, accessibility is the important part: s/he will not care about didactics as long as her/his needs are satisfied and the necessary information is delivered out of an information pool. Using a personal profile can enhance semantics-based retrieval and yield the best individual results. In the future, software agents could also make specific offers by regularly checking all available background information.
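As a sketch of profile-based selection from an information pool, the following filters materials by the level and subject recorded in a personal profile; all materials and profile values are invented for illustration:

```python
# Sketch: filtering an information pool by a personal profile (level, subject).
# All materials and profile values are hypothetical.
pool = [
    {"title": "Optics script", "subject": "physics", "level": "beginner"},
    {"title": "QM exercises", "subject": "physics", "level": "advanced"},
    {"title": "Cataloguing intro", "subject": "library science", "level": "beginner"},
]

profile = {"subject": "physics", "level": "beginner"}

def matches(material, profile):
    """A material fits when every profile attribute agrees."""
    return all(material.get(k) == v for k, v in profile.items())

hits = [m["title"] for m in pool if matches(m, profile)]
print(hits)  # ['Optics script']
```

A software agent, as envisioned above, would essentially run such a query regularly against newly available background information and push the hits to the user.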

Application areas: LiLi and HU
We plan to test our framework on two examples: first LiLi [10], a free collection of online material for physics [4], and second the library science department of the Humboldt University of Berlin [11], which offers distance postgraduate studies. At the moment both platforms are material/course-centred. LiLi is a collection of links to e-Learning material for physics, each described by a set of meta data, comments and ratings. It can be used by students for self-study: trying to understand a certain topic in physics, students may search for scripts and visualisations in LiLi. Lecturers may search LiLi in order to complete their lectures with visualisations or examples. Both groups use the same meta data and comments to search for and choose appropriate material. While using LiLi, both groups are also asked to give comments and ratings on the individual entries. In this way community aspects are taken into account, which provides useful input for further applications and the reuse of the material. Not only comments but also whole entries, consisting of links and meta data for online material, can be inserted into LiLi by teachers and students, and also by authors themselves; only a registration with the system is needed. LiLi is strongly connected to "physik multimedial", a platform for physics courses. A function is provided to add links and comments from LiLi directly into the courses offered and organized on this platform; conversely, lecturers are automatically requested to enter the material they add to their courses into LiLi as well, completed by meta data. Thus, from both sides a connection exists between the courses on the platform and the material described in LiLi. We have shown that connections between persons, material, topics, meta data, courses and a platform are already present in LiLi and "physik multimedial". These connections can be coded in formal RDF descriptions.
Furthermore, LiLi is an OAI data provider. The information provided by LiLi can thus easily be retrieved by libraries, archives or portals.
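A sketch of what such RDF descriptions could look like, serialized as simple N-Triples-style statements; the base URI and all resource names are invented for illustration and are not LiLi's actual identifiers:

```python
# Sketch: encoding LiLi-style connections as RDF-like triples.
# The base URI and resource names are hypothetical.
BASE = "http://example.org/lili/"

def triple(s, p, o):
    """Serialize one statement in N-Triples style."""
    return f"<{BASE}{s}> <{BASE}{p}> <{BASE}{o}> ."

statements = [
    triple("material/42", "hasTopic", "topic/optics"),
    triple("material/42", "isPartOf", "course/physics101"),
    triple("person/lecturer7", "authorOf", "material/42"),
]
for st in statements:
    print(st)
```

In a real deployment one would use an RDF library and established vocabularies (e.g. DC terms for authorship) rather than hand-built strings, but the triple structure is the same.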

Secondly, the library science department of the Humboldt University of Berlin [11] will be the test case for distance postgraduate study. Here a subscription platform is offered to the students. The course combines attendance periods in Berlin with online self-study units, concluding with a final exam and a university degree. In a next step, the evaluation of the framework will help to improve its further development and will be followed by the formal description and coding. The implementation will hopefully lead to a prototype application in this area.

Figure 2: Screenshot of the e-Learning platform LiLi

Conclusion
In this paper we presented the theoretical concepts of an ontology framework for e-Learning and gave background information concerning the knowledge society. We have outlined the path toward the coding and implementation of this framework.

Notes
1 http://www.foaf-project.org/
2 http://sw-portal.deri.org/ontologies/swportal.html
3 http://www.marcont.org/

References
[1] Witold Abramowicz, Marek Kowalkiewicz, Piotr Zawadzki: Toward User Centric e-Learning Systems. In: The Semantic Web: Research and Applications, Proc. First European Semantic Web Symposium (ESWS 2004), Heraklion, Crete, Greece, May 10-12, 2004. Lecture Notes in Computer Science, Vol. 3053, Springer 2004, pp. 109-121.
[2] Regina Kainz, Hans-Ulrich Kamke, Kerstin Zimmermann: Perspectives on Online Learning and Teaching Material. Proc. VIEWDET 2002, Wien, Feb. 2003, pp. 47-56.
[3] Hans-Ulrich Kamke, Kerstin Zimmermann: Metadaten und Online-Learning. Information Wissenschaft & Praxis, 54. Jahrgang, Nr. 6, September 2003, pp. 345-348.
[4] Julika Mimkes, Kerstin Zimmermann: physik multimedial. Proc. VIEWDET 2003, Vienna, November 2003.
[5] Sebastian Ryszard Kruk, Marcin Synak, Kerstin Zimmermann: MarcOnt - Integration Ontology for Bibliographic Description Formats. DC-2005, Madrid, September 2005. <http://www.marcont.org/marcont/pdf/DC2005skmskz.pdf>
[6] Sebastian Ryszard Kruk, Marcin Synak, Kerstin Zimmermann: MarcOnt Initiative - Mediation Services for Digital Libraries. Poster at ECDL 2005, Vienna, September 2005.
[7] Kerstin Zimmermann: Klassifikationsbeispiele von Lernmaterialien in spezifischen Portalen. Proc. GfKl Bibliothekarisches Programm, Dortmund, März 2004. <http://archiv.tu-chemnitz.de/pub/2004/0164/index.html>
[8] Kerstin Zimmermann: Ontology Comparison. Deliverable D19, Semantic Web Portal Project, DERI Innsbruck, September 2004.
[9] DINI: ELAN Application Profile: Metadaten für elektronische Lehr- und Lernmaterialien. Berlin, Oktober 2005. <http://www.dini.de/documents/DINI_Schriften_6-de.pdf>
[10] LiLi <http://www.physik-multimedial.de/lili/golili/lili.php>
[11] Department of Library and Information Science, Distance Study Programme, Humboldt University of Berlin <http://www.fbiw.hu-berlin.de/startseite/willkommen_e>

Meta data schemes
DCMI Dublin Core Metadata Initiative <http://dublincore.org/>
LOM, IEEE WG12: Learning Object Metadata <http://ltsc.ieee.org/wg12/>
MARC Machine-Readable Cataloging record <http://www.loc.gov/marc/>

Ontologies
Semantic Web Portal Ontology <http://sw-portal.deri.org/ontologies/swportal>
MarcOnt, September 2005 <http://www.marcont.org/>

Mikel Breitenstein
University of Wisconsin – Milwaukee, Milwaukee, Wisconsin, USA

Global Unity: Otto Neurath and the International Encyclopedia of Unified Science

Abstract: Otto Neurath (1882-1945) was a pioneer in modern global information organization, an educator, sociologist and scientist of the early 20th century, and a cofounder of the Vienna Circle. The international cooperative goals of shared information among scientists that he envisioned led to the Unity of Science Institute and the International Encyclopedia of Unified Science. The progress towards modern scientific inquiry and theory that he encouraged, with its recognition of historical context, laid the groundwork for others who followed him as critics and refiners of empirical methods. Neurath was in every way a man of his times, but not an ordinary one. He was a social scientist and an intellectual activist, and his radical and original visions of a unified scientific world made him a prescient voice in the modern philosophy of science, information, and society.

1. Introduction
The International Encyclopedia of Unified Science was a grand project to bring together scientific philosophies from a variety of disciplines, embodied in essays, in a series of printed volumes. It was not intended to contain all knowledge, of course, but would embody the premises and perspectives of leading thinkers from many realms of modern science in the middle of the 20th century. The goal of creating such an encyclopedia had at least three aspects: 1) to define, by classification and explication, the fields of modern science, 2) to establish a documented body of unified topics, and 3) to affirm an interdisciplinary network that would allow scientists from many specialties to come together to solve problems. It was the product of the convergence of several interlocking forces: encyclopedism in general, the Vienna Circle in particular, and the genius of Otto Neurath. During the 18th century, encyclopedists created visions of unified knowledge, and by the end of the 19th century encyclopedism had great momentum and was promoted in intellectual circles. The sciences, with their international languages of symbolic notation, were honored as a new organizing force for the world. The world-view philosophical perspective, incorporating metaphysics, was well established in Germany and in other European countries. Unified systems thinking followed logically. Systems theory helped thinkers in many disciplines to give form to the rapid changes being wrought by technological advance. Utopian social reform became a motivating force. The causes and effects of World War I promoted internationalism in cultural and economic matters. Logical positivism, also called scientism, was a reaction to earlier philosophies that admitted metaphysical considerations.
A way of thinking born originally with the Industrial Revolution, it had evolved into a program of thinking that merged scientific inquiry and observation with the philosophical search for meaning and knowledge. The process of “scientization” (Verwissenschaftlichung) took a number of forms (Bambach, 1995, 24). In general, scientism was a term that covered the movements in philosophy, and in the sciences, that identified reliable and provable knowledge with the idea of science itself (Bambach, 1995, 27). In economics, medicine, and architecture, an emphasis on the material, the concrete, and the functional was what counted. For positivists, matter was what existed, and only matter was capable of affecting the mind. All philosophical studies, e.g. ontology (what exists?), epistemology (what do we know?) and ethics (what should we do?), should, from the positivist perspective, be predicated on clear knowledge of a mostly scientific nature (Everdell, 1997, 15). Philosophy, under the name “epistemology,” could function as the methodological foundation of all scientific fact (Bambach, 1995, 29). Pre-World War I Vienna was an environment of opposing forces. There were rising trends of social conservatism, nationalism, and biological ideals. The opposing forces included advocates of "Jewish" neo-positivism, of the liberal theory of marginal utility (Grenznutzen), of opportunity cost, of psychoanalysis, and of Marxism (Stadler, 1991, 56). Often, their grounding in the materialism of science was joined to their grounding in political concerns for the fair distribution of opportunity and benefits to all of society. As early as 1907, Otto Neurath, a sociologist, Hans Hahn, a mathematician, and Philipp Frank, a physicist, began to meet for intellectual discussions that led eventually to a first “Vienna circle” of thinkers and activists.
This circle and the later, more famous one comprised a dynamic group of well-educated philosophers with strong mathematical and scientific training. Most members embraced positivism to some degree. The final Vienna Circle (Wiener Kreis) was a gathering not of quiet, reclusive thinkers, but of vocal scientists, philosophers, and social activists. Not all members who attended the regular Thursday meetings were equal in recognized status, nor did all agree. They were a loosely organized political and intellectual group. Their new ideas prevailed, and laid a foundation for scientific thinking in the early to mid 20th century. Between 1928 and 1937, the Vienna Circle published ten books in a series named “Papers on Scientific World-View” (Schriften zur wissenschaftlichen Weltauffassung), edited by Moritz Schlick and Philipp Frank. In 1930, Rudolf Carnap and Hans Reichenbach became editors of the Journal of Unified Science (Erkenntnis), which was published between 1930 and 1940 and edited toward the end by Neurath, Carnap, and Charles Morris (Murzi, 1998, 2). By the late 1920s, many Circle members experienced increasing difficulty in employment because of their liberal and socialist politics and, often, because of their Jewish identity. Emigration and exile dissipated the ranks steadily through the 1930s, as dangers grew. Circle leader Moritz Schlick, who stayed in Vienna, was murdered in June 1936 by a pro-Nazi student. This event ended the public existence of the Vienna Circle, and in 1938 it was abolished altogether by the National Socialist government. Otto Neurath was an archetypal social activist and idealist. Neurath was born in Vienna on December 10, 1882, into a middle-class scholarly Jewish family. He studied at the University of Vienna and the University of Berlin, and received a Ph.D. from the latter school in 1905.
A teaching career at the New Vienna Trade Academy (Neue Wiener Handelsakademie) (1907-1914) was interspersed with travels in Eastern Europe and the Balkans under contract with the Carnegie Endowment for International Peace (1911-1913). Both ended with his service in World War I (1914-1918). After the war, Neurath participated actively in several community planning and service organizations. He worked as a civil servant for the Central Planning Office of the Bavarian Social Republic (later known as the Social Democratic Republic of Bavaria) in Munich during 1919; when it fell he was briefly jailed, although he was not officially a member of the Communist party. Upon release, he went to Vienna and was active in the housing reform movement during the period known as Red Vienna. Infusing all of Neurath's activities during these years was a commitment to utopian social reform, inspired by his strong belief in Marxist socialism. He was among the first in Vienna to call for a centrally-planned economy based on Marxist concepts, with policy determined by social welfare considerations and by empirical statistical analyses of goods, services, and standards of living (Wilson, 1987, 569). He was also the founder of the Museum for Housing and Town Planning, over which he presided from 1919 to 1924, and founder and director of the Social and Economic Museum from 1924 to 1933, when he left Austria.

2. Neurath and Scientific Unity
Neurath brought to his view of science a historical and dynamic perspective, a recognition of the uncertainty of physical description and the probabilistic nature of scientific prediction (Wilson, 1987, 569). He felt that an essential precondition of all reflection and all theory building was to begin with a vocabulary and set of concepts that were pre-given (Haller, 1991, 124). From that point, new knowledge could be created by the shifting of our concepts as we shift our thoughts. Then a continuous transformation process from old knowledge to new would take off. Neurath had been conspicuous for his demand for a unified language. According to Rudolf Carnap, it was Neurath who suggested the designations ‘Physicalism’ and ‘Unity of Science’ (Haller, 1991, 118). Neurath was fully convinced of and inspired by his commitment to ensure that his evolving scientific world-conception should infuse all forms of personal and public life, of teaching, of architecture. Social life was to be purely rational. All his problems and concerns led him toward the goal of the unity of scientific effort (Haller, 1991, 119), for the good of society.

3. Neurath and Internationalism
Neurath left Vienna in 1933, fearing arrest, and went to The Hague, Netherlands. There he founded and led the International Foundation for Visual Education from 1933 to 1940 (Wilson, 1987, 568). During that time he promoted visual education and used a system he had designed, the Vienna Method of Picture Statistics, to develop further an international language: the International System of Typographic Picture Education (ISOTYPE), a set of over 2000 simplified pictures for representing statistical data that he had invented in 1923.

Figure 1. Neurath ISOTYPE
Figure 2. Neurath ISOTYPE

Through his pictorial system, complex facts were to be transformed into pictures that should yield a coherent and relevant story. According to Neurath:

Reading a picture language is like making observations with the eye in everyday experience: what we may say about a language picture is very like what we may say about other things seen by the eye….Our experience is that the effect of pictures is frequently greater than the effect of words, specially at the first stage of getting new knowledge….pictures whose details are clear to everybody are free from the limits of language; they are international. Words make division, pictures make connection (Müller, 1991, 228)

There were direct and systematic connections between Neurath's symbolic designs and his concept of encyclopedism. Pictures projected his elements of encyclopedism into non-discursive representation (Müller, 1991, 229). The Vienna Method of Picture Statistics could therefore be considered a continuation of the elimination of metaphysics, said Rudolf Carnap, by replacing metaphysical elements with more precise aspects of symbols (Müller, 1991, 230). In The Hague, Neurath's vision of unified science was given even further implementation. He set up the Unity of Science Institute in 1936 as a department of the Mundaneum Institute. The next year it was renamed the International Institute for the Unity of Science, with Neurath, Charles Morris, and Philipp Frank as the executive committee (Morris, 1969, ix). Two other committees were formed there. One was the Organization Committee of the International Encyclopedia of Unified Science, composed of Neurath, Rudolf Carnap, Philipp Frank, Joergen Joergensen, Charles Morris, and Louis Rougier. Another was the Organization Committee of the International Congresses for the Unity of Science, with the same membership as the Encyclopedia committee, with the inclusion of L. Susan Stebbing. The First International Congress for the Unity of Science was held in Paris in 1935. Five more were held before the war interrupted them. At the first congress the idea for an encyclopedia, Neurath's long-envisioned project, was discussed and approved by a vote. The object of both the congresses and the encyclopedia was to keep scientists informed of each other's work and thereby to promote integration, understanding, and unity of action applied to a problem (Bogner, 1995, 616). Contributors to the encyclopedia (many of whom were also members of the Vienna Circle) had been scattering since the early 1930s, mostly to England and the U.S.
Neurath himself had to leave the Netherlands, and he went to England in 1940 (crossing the Channel in a small boat) (Sigmund, 1995, 29). He became a professor of sociology at Oxford University. This second disruptive move, and the complications of the war, impeded Neurath's work on the encyclopedia. Only two volumes of a planned twenty-six were produced. Hope for more ended when, on a Saturday night, December 22, 1945, while working at his desk at Oxford, Neurath died of a stroke. He was 63 (New York Times, Dec. 27, 1945, F25). Neurath's death, and the continued post-war turmoil, caused changes and delays in the original plans to publish the first two volumes in the early 1940s. Some authors had to be replaced, and some were seriously delayed in finishing their monographs. Not until 1969 were the nineteen monographs (which had been published separately earlier) and the bibliography and index brought together. When this part of the encyclopedia was completed, without Neurath, the vision for more was gone. According to Rudolf Carnap and Charles Morris at the time of publication, no plans were made to proceed further with the International Encyclopedia of Unified Science (Carnap & Morris, 1969, vii). Although Neurath declared himself an enemy of philosophy, his radical and original visions of a unified scientific world, an enlightened and empowered populace, and the weaknesses of absolutist belief in empiricism (Rutte, 1991, 81), made him a prescient modern philosopher of science and society.

Figure 3. Otto Neurath

4. The Encyclopedia Neurath began working on the idea of the encyclopedia as early as 1920. His first discussions were with Albert Einstein and Hans Hahn, and then with Rudolf Carnap and Philipp Frank. According to Rudolf Carnap and Charles Morris, the encyclopedia was meant as a manifestation of the unity of science movement, along with the six International Congresses for the Unity of Science, the Journal of Unified Science (formerly called Erkenntnis), and the Library of Unified Science (Carnap & Morris, 1969, vii). Much of what is known now about Neurath’s establishment of the encyclopedia is available from the documentation of that time recorded by Charles W. Morris (Morris, 1969, ix). The two introductory volumes, of ten monographs each, would comprise Section 1, called Foundations of the Unity of Science. Neurath had ideas for two or three other, larger sections. Section 2 (perhaps six volumes, 60 monographs) was to deal with methodological problems involved in the special sciences and in the systematization of science. Emphasis was to be on the confrontation and discussion of divergent points. Section 3 (eight volumes, 80 monographs) would address the actual state of systematization of the special sciences and the connections between them. Neurath also contemplated a Visual Thesaurus that might be an adjunct to the Encyclopedia. Further, he conceived of a Section 4 that would “exemplify and apply methods and results from preceding three sections to such fields as education, engineering, law, and medicine.” (Morris, 1969, xi) The original plan was for a total of twenty-six volumes containing 260 monographs. (Carnap & Morris, 1969, vii). Neurath imagined editions in many languages, and contributors from Western and Asian countries. Only the first two volumes of the Encyclopedia were published. They comprised Section 1 of the larger plan. Neurath was editor-in-chief of the encyclopedia, with Rudolf Carnap and Charles Morris as associate editors. 
Its Advisory Committee was composed of leading scientists of the day. Some names we still recognize, but others are not so familiar: Niels Bohr, Egon Brunswik, J. Clay, John Dewey, Federigo Enriques, Herbert Feigl, Clark L. Hull, Waldemar Kaempffert, Victor F. Lenzen, Jan Lukasiewicz, William M. Malisoff, Richard von Mises, G. Mannoury, Ernest Nagel, Arne Naess, Hans Reichenbach, Abel Rey, Bertrand Russell, L. Susan Stebbing, Alfred Tarski, Edward C. Tolman, Joseph H. Woodger.

References

Bambach, C. R. (1995). Heidegger, Dilthey, and the crisis of historicism. Ithaca: Cornell University Press.
Bogner, J. (1995). The Vienna Circle. In T. Honderich (Ed.), The Oxford companion to philosophy. Oxford: Oxford University Press.
Carnap, R. & Morris, C. (1969). Preface. Foundations of the Unity of Science, vol. 1, Nos. 1-10. Chicago: University of Chicago Press.
Cartwright, N. & Uebel, T. E. (1995). Otto Neurath. In T. Honderich (Ed.), The Oxford companion to philosophy. Oxford: Oxford University Press.
Everdell, W. R. (1997). The first moderns: Profiles in the origins of twentieth-century thought. Chicago: University of Chicago Press.
Friedman, M. (n.d.). Philosophy of logical positivism. http://www.indiana.edu/~koertge/SurFried.html. Taken 11/13/98.
Haller, R. (1991). The Neurath principle: Its grounds and consequences. In T. E. Uebel (Ed.), Rediscovering the forgotten Vienna Circle. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Heylighen, F. (1998). Towards a global brain: Integrating individuals into the world-wide electronic network. http://pespmc1.vub.ac.be/papers/Gbrain-Bonn.htm. Taken 10/9/98.
Institute Vienna Circle. (1998). The Vienna Circle—Historic outline. http://hhobel.phl.univie.ac.at/wk/. Taken 11/13/98.
Janik, A. & Toulmin, S. (1973). Wittgenstein's Vienna. New York: Simon & Schuster.
Kern, S. (1983). The culture of time and space, 1880-1918. Cambridge: Harvard University Press.
Köhnke, K. C. (1991). The rise of neo-Kantianism: German academic philosophy between idealism and positivism. Cambridge: Cambridge University Press.
Morris, C. (1969). On the history of the International Encyclopedia of Unified Science. In Foundations of the Unity of Science, vol. 1, Nos. 1-10. Chicago: University of Chicago Press. (Excerpts from an article originally published as "On the History of the International Encyclopedia of Unified Science" in Synthese, 12, 1960, 517-21.)
Müller, K. H. (1991). Neurath's theory of pictorial-statistical representation. In T. E. Uebel (Ed.), Rediscovering the forgotten Vienna Circle. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Murzi, M. (1998). Vienna Circle. The Internet Encyclopedia of Philosophy. http://www.utm.edu/research/iep/v/viennaci.htm. Taken 11/13/98.
Neurath, O., Carnap, R., & Morris, C. (Eds.). (1955). International encyclopedia of unified science. Vol. 1, Nos. 1-5. Chicago: University of Chicago Press.
Neurath, O., Carnap, R., & Morris, C. (Eds.). (1955). International encyclopedia of unified science. Vol. 1, Nos. 6-10. Chicago: University of Chicago Press.
Neurath, O., Carnap, R., & Morris, C. (Eds.). (1970). Foundations of the unity of science: Toward an international encyclopedia of unified science. Vol. 1, Nos. 1-10. Chicago: University of Chicago Press.
Neurath, O., Carnap, R., & Morris, C. (Eds.). (1970). Foundations of the unity of science: Toward an international encyclopedia of unified science. Vol. 2, Nos. 1-9. Chicago: University of Chicago Press.
Neurath ISOTYPE Figure 1. www.math.yorku.ca/SCS/Gallery/icons/neurath.jpg. Taken 2/28/06.
Neurath ISOTYPE Figure 2. www.philart.de/articles/images/isotype5.jpg. Taken 2/28/06.
Neurath Photograph. Figure 3. Wikipedia. Taken 2/28/06.

New York Times. F25, December 27, 1945.
Rothe, A. (Ed.). (1947). Current biography: Who's news and why, 1946. New York: The H. W. Wilson Co.
Rutte, H. (1991). The philosopher Otto Neurath. In T. E. Uebel (Ed.), Rediscovering the forgotten Vienna Circle. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Sigmund, K. (1995). A philosopher's mathematician: Hans Hahn and the Vienna Circle. The Mathematical Intelligencer, 17(4), 16-29.
Stadler, F. (1991). Aspects of the social background and position of the Vienna Circle at the University of Vienna. In T. E. Uebel (Ed.), Rediscovering the forgotten Vienna Circle. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Uebel, T. E. (1992). Overcoming logical positivism from within: The emergence of Neurath's naturalism in the Vienna Circle's protocol sentence debate. Amsterdam: Rodopi.
Wells, H. G. (1938). World encyclopaedia. In M. Kochen (Ed.), (1967) The growth of knowledge. New York: John Wiley & Sons, Inc.
Wilson, F. (1987). In R. Turner (Ed.), Thinkers of the twentieth century (2nd ed.). Chicago: St. James Press.

Appendix -- The Publication History and Contents of the Encyclopedia

All editions of the monographs and combined volumes of the International Encyclopedia of Unified Science were published by the University of Chicago Press. From 1938 to 1962, the essays that make up the first two volumes of the Encyclopedia were published as separate monographs, paperbound. In 1955, a two-volume cloth-bound edition of the contents of the International Encyclopedia of Unified Science, Volume 1 was published. In this edition, Volume 1, Part 1 contained Numbers 1-5, and Volume 1, Part 2 contained Numbers 6-10. In 1970, Volume 2, Numbers 1-9 and Bibliography were published in one cloth-bound volume. At this time, the title became Foundations of the Unity of Science: Toward an International Encyclopedia of Unified Science. In 1971, Volume 1, Numbers 1-10 of this set was published. Volume 2 of this set is still in print (as of August 1998). Volume 1 is out of print.

Volume I: (Dates indicate the publication date of the monograph edition, where known.)

Number 1. Otto Neurath, ed. 1938. Encyclopedia and unified science.
Part 1: Otto Neurath. Unified science as encyclopedic integration.
Part 2: Niels Bohr. Analysis and synthesis in science.
Part 3: John Dewey. Unity of science as a social problem.
Part 4: Bertrand Russell. On the importance of logical form.
Part 5: Rudolf Carnap. Logical foundations of the unity of science.
Part 6: Charles W. Morris. Scientific empiricism.
Number 2. Charles W. Morris. 1938. Foundations of the theory of signs.
Number 3. Rudolf Carnap. 1939. Foundations of logic and mathematics.
Number 4. Leonard Bloomfield. 1939. Linguistic aspects of science.
Number 5. Victor Fritz Lenzen. 1938. Procedures of empirical science.
Number 6. Ernest Nagel. 1939. Principles of the theory of probability.
Number 7. Philipp Frank. 1946. Foundations of physics.
Number 8. Erwin Finlay-Freundlich. Cosmology.
Number 9. Felix Mainx. 1955. Foundations of biology.
Number 10. Egon Brunswik. 1952. The conceptual framework of psychology.

Volume II:

Number 1. Otto Neurath. 1944. Foundations of the social sciences.
Number 2. Thomas S. Kuhn. 1962. The structure of scientific revolutions.
Number 3. Abraham Edel. 1961. Science and the structure of ethics.
Number 4. John Dewey. 1939. Theory of valuation.
Number 5. Joseph Woodger. 1939. The technique of theory construction.
Number 6. Gerhard Tintner. 1968. Methodology of mathematical economics and econometrics.
Number 7. Carl G. Hempel. 1952. Fundamentals of concept formation in empirical science.
Number 8. Giorgio De Santillana and Edgar Zilsel. 1941. The development of rationalism and empiricism. Edgar Zilsel. Problems of empiricism.
Number 9. Jørgen Jørgensen. 1951. The development of logical empiricism.
Bibliography and Index. Herbert Feigl and Charles W. Morris.

Athena Salaba & Marcia L. Zeng
Kent State University, USA

Maja Zumer University of Ljubljana, Slovenia

Functional Requirements for Subject Authority Records

Abstract: Continuing the tradition set by the FRBR model, a new IFLA working group was formed to examine the functional requirements for subject authority records (FRSAR). The focus of the FRSAR Working Group is on the user tasks and functional requirements of authority records for the Group 3 entities as defined by FRBR. This paper presents the Working Group’s terms of reference and reports on initial activities and subject authority issues discussed.

1. Introduction

Subject access to information has been the predominant approach that users take to satisfy their information needs. Research has demonstrated that integrating controlled vocabulary information with an information retrieval system helps users perform more effective subject searches. This integration becomes possible when subject authority data (information about each subject term) from authority files are linked to bibliographic files and made available to users.

The purpose of authority control is to ensure consistency in representing a value (a person's name, a geographic location's name, or a subject term) in the elements that will be used as access points. For example, "World War, 1939-1945" has been established as an authorized subject heading in the Library of Congress Subject Headings (LCSH). In the cataloging and indexing process, any publication about World War II will be assigned this established heading, regardless of whether the publication refers to the war as the "European War, 1939-1945", "Second World War", "World War 2", "World War II", "WWII", "World War Two", or "2nd World War." This ensures that all publications about World War II will be referred to, displayed, and retrieved under the same subject heading, whether in an institution's library catalog or in a union catalog that contains bibliographic data contributed by a large number of individual databases.

In almost all large bibliographic databases, authority control is exercised by creating an authority file manually or semi-automatically. The file contains records of all headings or access points (names, titles, or subjects) that have been used previously in bibliographic records or that should be used.
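The normalization that authority control performs can be pictured with a small sketch. The heading and its variants follow the LCSH example above; the data structures and function names are our own illustration, not any actual library system's format or API:

```python
# Sketch of authority control over subject headings. The authorized
# heading and its variant forms follow the LCSH example in the text;
# the structures are illustrative, not a real authority-file format.
AUTHORITY_FILE = {
    "World War, 1939-1945": [
        "European War, 1939-1945", "Second World War", "World War 2",
        "World War II", "WWII", "World War Two", "2nd World War",
    ],
}

# Index every variant (and the heading itself) under the one
# established heading, so all forms retrieve the same records.
VARIANT_INDEX = {}
for heading, variants in AUTHORITY_FILE.items():
    for form in [heading, *variants]:
        VARIANT_INDEX[form.casefold()] = heading

def authorized_heading(term):
    """Return the established heading for a term, or None if uncontrolled."""
    return VARIANT_INDEX.get(term.strip().casefold())

print(authorized_heading("WWII"))              # World War, 1939-1945
print(authorized_heading("Second World War"))  # World War, 1939-1945
```

Whichever variant a searcher or cataloger supplies, the lookup resolves to the single established heading, which is exactly the consistency that authority control is meant to guarantee.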
IFLA’s working groups have addressed the functional requirements of bibliographic records in the work of Functional Requirements for Bibliographic Records (FRBR) and the functional requirements for authority records in Functional Requirements for Authority Records (FRAR).

2. FRBR and Authority Records in FRAR

IFLA's Working Group on Functional Requirements for Bibliographic Records (FRBR)1 focused on the development of an entity-relationship conceptual model of the bibliographic universe. FRBR's final report presented the model, identified entities and their attributes, and defined relationships among and across entities. The basis of the model was to identify the functional requirements of the information in bibliographic records in order to facilitate the defined user tasks (IFLA, 1998).

In most structured retrieval systems, information regarding our bibliographic universe is not recorded exclusively in bibliographic records. Authority records are used to record information about all controlled access points that are currently included in bibliographic records or have the potential to be assigned as access points in bibliographic records. Controlled access points include names of entities identified by FRBR, such as members of Group 2 (persons, corporate bodies), titles of Group 1 entities (works, expressions, manifestations, and items), and terms/labels for Group 3 entities (concepts, objects, places, events).

IFLA's Working Group on the Functional Requirements and Numbering of Authority Records (FRANAR)2 is charged with continuing the work of FRBR by developing a conceptual model for authority records. In the 2005 draft of the Functional Requirements for Authority Records (FRAR) model, FRANAR defines authority records as aggregates of information regarding entities that are assigned as controlled access points in bibliographic records (IFLA, 2005). Figure 1 shows the current practice of creating and using bibliographic records. It is based on the figure "Cataloguing Process as Performed Today" in FRANAR's final report3 and has been modified to include the uses of bibliographic and authority data.

Figure 1. Creation and Uses of Bibliographic and Authority Files

Looking at the workflow above (Figure 1), when a bibliographic record is created, its access points are checked against the authority file. If no authority record exists for any of the access points, then an authority entry is registered and/or an authority record is created. The functions of the authority file, according to the final report of FRANAR, are:

1) Serving as a vehicle for documenting decisions made by the cataloguer in formulating the access points.
2) Serving as a reference tool for the cataloguer when choosing the appropriate access point(s) to be used with a new bibliographic description.
3) Being used to control the form of access points used in a bibliographic file.
4) Supporting access to a bibliographic file by providing the information notes and references that the user requires when searching under variant access points or under access points for related entities. This is ultimately most important to the end-users.
5) Supporting user-specific customization of the link between the bibliographic file and the authority file in an automated environment (IFLA, 2005).

Even though all three entity groups from FRBR are covered in FRAR's conceptual model, the major focus has been on Group 1 and 2 entities. In other words, its defined scope includes mainly NAME and TITLE entities for consideration in the study. It does not include the third group of entities, defined as concept, object, event, and place by FRBR. In addition, data on how users use authority data may help support the 'functional requirements' of subject authority records from a real users' perspective. This is of special interest to the FRSAR group because of its charge to look into the direct and indirect use of subject authority data by a wide range of users, as defined in the terms of reference in the section below. As a result, the Functional Requirements for Subject Authority Records (FRSAR) working group was formed to address subject data issues.
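The cataloguing workflow just described (check each access point of a new bibliographic record against the authority file; register an entry when none exists) can be sketched as follows. The record structures and function names are hypothetical illustrations, not an actual FRANAR or MARC format:

```python
# Hypothetical sketch of the cataloguing workflow: each access point of
# a new bibliographic record is checked against the authority file, and
# a provisional authority record is created for any access point not
# yet under control. Data shapes are illustrative only.
def register_access_points(bib_record, authority_file):
    """Create provisional authority records for uncontrolled access points.

    Returns the list of newly registered headings."""
    created = []
    for access_point in bib_record["access_points"]:
        if access_point not in authority_file:
            authority_file[access_point] = {"status": "provisional",
                                            "variants": []}
            created.append(access_point)
    return created

authority_file = {"Dewey, John, 1859-1952": {"status": "established",
                                             "variants": []}}
record = {"title": "Theory of valuation",
          "access_points": ["Dewey, John, 1859-1952", "Values"]}

print(register_access_points(record, authority_file))  # ['Values']
```

The established name heading passes the check untouched, while the uncontrolled subject access point triggers the creation of a provisional authority entry, mirroring the "registered and/or created" step in Figure 1.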

3. Functional Requirements for Subject Authority Records

A third IFLA Working Group was formed in April 2005, charged with developing a conceptual model for the Functional Requirements for Subject Authority Records (FRSAR)4. All controlled access points related to all three entity groups as defined by the FRBR conceptual model have the potential to be the topic of a work. In other words, Group 1, 2, and 3 entities can have an "is-the-subject-of" relationship with Group 1 entities in a bibliographic file. Because the functional requirements of authority records for entities from Group 1 and Group 2 (with the addition of the "Family" entity) are the major focus of the FRAR study, which defines the entities, their attributes, and different levels of relationships, FRSAR will follow the FRAR approach for the purpose of using Group 1 and 2 entities as controlled subject access points in bibliographic records and will include FRAR's findings. FRSAR's terms of reference focus on Group 3 FRBR entities, which currently include concepts, objects, events, and places. The Group will continue the work initiated by FRBR and complemented by FRAR.

The draft FRAR report, distributed for world-wide review, defines Find, Identify, Contextualize, and Justify as the user tasks performed on authority record data by all users (IFLA, 2005). In developing an entity-relationship conceptual model of subject authority records, the FRSAR Working Group plans first to define who the users of subject authority data are, identify contexts of use, and identify some of the use scenarios.
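The "is-the-subject-of" relationship can be pictured with a toy model in which entities from any of the three FRBR groups may be linked as subjects of a Group 1 work. The class and field names are our own illustration, not part of the FRBR or FRAR specifications:

```python
from dataclasses import dataclass, field

# Toy model of the subject relationship described above: a Group 1 work
# may have entities from Groups 1, 2, or 3 as its subjects.
@dataclass
class Entity:
    name: str
    group: int          # FRBR group: 1, 2, or 3
    entity_type: str    # e.g. "work", "person", "concept"

@dataclass
class Work(Entity):
    # Entities standing in an "is-the-subject-of" relationship to this work
    subjects: list = field(default_factory=list)

biography = Work("A study of Otto Neurath", 1, "work")
person = Entity("Neurath, Otto", 2, "person")        # Group 2 entity
concept = Entity("Unity of science", 3, "concept")   # Group 3 entity
biography.subjects.extend([person, concept])

print([e.group for e in biography.subjects])  # [2, 3]
```

The point of the sketch is simply that the subject slot of a work is not restricted to Group 3: a person (Group 2) or even another work (Group 1) can stand in the same relationship.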
Possible subject authority record data user groups include a) information professionals, such as catalogers and metadata providers, reference and public services librarians, and other information professionals searching for information as intermediaries, b) controlled vocabulary creators, such as catalogers and thesaurus creators, and c) end-users searching information retrieval systems to fulfill information needs. To define the functional requirements of subject authority records, we strongly believe that a user study must be conducted. If we describe the use of subject authority data from a three-point perspective (Figure 2), we can see that bibliographic files and authority files are created by authority-creation agents (VP2) and are used by catalogers, indexers, web architects, and other information professionals (VP1) who are providing information to users, as well as by end-users (VP3) who are using this information in searching and browsing library catalogs and other bibliographic databases.

Figure 2. Three-point User Perspective

Once the user groups are defined, use scenarios of authority record information by each group will be identified. This will lead to the development and definition of user tasks supported by authority record information either directly or indirectly.

Another major initial effort of FRSAR is to revisit the Group 3 entities currently defined by FRBR. An examination of other models covering subject data and a comparison with the current Group 3 entities serve as a starting point. FRSAR plans to examine different types of knowledge organization systems (KOS): not only authority records in existing authority files, but also other lists of subject headings, thesauri, and classification systems, since most share similar processes and functions. In addition, presentations and papers responding to these models have already raised several issues regarding Group 3 entities. Delsey (2005) questions whether the entities defined in FRBR and FRAR are broad enough to cover all subject entities. In a comparison of the FRBR/FRAR entities to the entities defined for the project, he identified several entities that need to be further examined. He proposed definitions for them and their possible addition to the current FRBR/FRAR models. Entities that might need to be redefined include object and event. Entities that might need to be considered for inclusion are situation, percept, beings, things, and time.

Once the entities have been finalized, the FRSAR Working Group will consider the attributes of each entity that are essential for the functions of authority records and the facilitation of user tasks. Delsey (2005) also suggested that additional attributes for the existing FRBR/FRAR entities need to be considered from a subject perspective, and that the attributes of any new entities identified for further study need to be considered as well, with special attention given to the attributes of Group 3 entities.
Relationships, another major component of the conceptual model, exist between all three groups of FRBR entities and the work described in a bibliographic record. This type of relationship is defined as a "subject relationship" in FRBR. In addition, relationships exist between subject entities and among certain attributes of each term. These fall into two major categories, syntactic and semantic. Traditionally, it is the semantic relationships that are explicitly indicated in a subject authority record. Semantic relationships can be grouped into three basic categories: equivalence, hierarchical, and associative. Each of these broad categories includes a large number of specific semantic relationships. Delsey (2005) suggests that both semantic and syntactic relationships need to be considered. He identified several challenges with semantic relationships, especially the associative relationships, which several experts in the field of controlled vocabularies and semantic relationships consider the most challenging. In addition, syntactic relationships in both pre-coordinated and post-coordinated subject strings might be even more challenging, due to the difficulty of reflecting the "context-dependent nature of the relationships" (Delsey, 2005, 58). One of the considerations of the FRSAR Group is to survey the types of relationships that exist, to examine which are essential for the functional requirements for subject authority data, and to decide to what level of specificity these relationships should be explicitly identified.

The final term of reference for the FRSAR Working Group is to assist in an assessment of the potential for international sharing and use of subject authority data, both within the library sector and beyond. A preliminary study of the state of authority data has indicated great challenges for true global sharing and use of subject authority data.
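The three broad categories of semantic relationships (equivalence, hierarchical, associative) are conventionally recorded in thesauri with the tags UF/USE, BT/NT, and RT. A minimal sketch, with invented example terms:

```python
# Minimal thesaurus record showing the three broad semantic relationship
# categories named above, using conventional thesaurus tags. The terms
# themselves are invented examples.
thesaurus = {
    "Ships": {
        "UF": ["Vessels (Ships)"],        # equivalence: non-preferred forms
        "BT": ["Transport equipment"],    # hierarchical: broader term
        "NT": ["Sailboats", "Tankers"],   # hierarchical: narrower terms
        "RT": ["Navigation"],             # associative: related term
    },
    "Transport equipment": {"UF": [], "BT": [], "NT": ["Ships"], "RT": []},
}

def broader_chain(term, t):
    """Follow BT links upward from a term (assumes at most one BT each)."""
    chain = []
    while term in t and t[term]["BT"]:
        term = t[term]["BT"][0]
        chain.append(term)
    return chain

print(broader_chain("Ships", thesaurus))  # ['Transport equipment']
```

Equivalence and hierarchical links are easy to traverse mechanically, as above; it is the associative (RT) links, whose meaning varies from pair to pair, that resist this kind of uniform treatment, which is why the text singles them out as the most challenging.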
The challenges come from many technological aspects, such as heterogeneous structures, various languages and scripts, diverse construction rules and best-practice guides, and dynamically developed and advanced encoding schemas. The Working Group plans to analyze existing practices and explore the integration of existing subject authority data and practices into the FRSAR model. Future plans include the exploration of different approaches that can be applied to better use and sharing of subject authority data in multi-structured systems and multilingual environments. FRSAR's work, although independent of any existing structures, codes, rules, and schemas, is expected to eventually help in the sharing and use of subject authority data that carries the diverse and unique characteristics discussed above.

(Information about the FRSAR Working Group is available at http://www.ifla.org/VII/s29/wgfrsar.htm. Please check under Working Groups for updates on FRSAR activities.)

Notes

1 IFLA Functional Requirements for Bibliographic Records (FRBR) Review Group, http://www.ifla.org/VII/s13/wgfrbr/wgfrbr.htm
2 IFLA Working Group on Functional Requirements and Numbering of Authority Records (FRANAR), http://www.ifla.org/VII/d4/wg-franar.htm
3 Functional Requirements for Authority Records: A Conceptual Model. Prepared by the IFLA UBCIM Working Group on Functional Requirements and Numbering of Authority Records (FRANAR), 2005. Available: http://www.ifla.org/VII/d4/FRANAR-Conceptual-M-Draft-e.pdf
4 IFLA Working Group on Functional Requirements for Subject Authority Records (FRSAR), http://www.ifla.org/VII/s29/wgfrsar.htm

Reference List

Delsey, T. (2005). Modeling subject access: Extending the FRBR and FRANAR conceptual models. Cataloging & Classification Quarterly, 39(3/4), 49-61.

IFLA Study Group on the Functional Requirements and Numbering of Authority Records. (2005). Functional Requirements for Authority Records: A Conceptual Model. Draft, 2005-06-15. Retrieved February 17, 2006, from http://www.ifla.org/VII/d4/FRANAR-Conceptual-M-Draft-e.pdf
IFLA Study Group on the Functional Requirements for Bibliographic Records. (1998). Functional Requirements for Bibliographic Records. Munich: K.G. Saur. Retrieved February 17, 2006, from http://www.ifla.org/VII/s13/frbr/frbr.pdf

Jack Andersen, Assistant Professor, PhD
Department of Information Studies
Royal School of Library and Information Science
Copenhagen, Denmark

Social change, modernity and bibliography: Bibliography as a document and a genre in the global learning society

Abstract – In this paper, the role of bibliography in the global learning society is examined. Through an analysis of issues characterizing modernity and globalization, an understanding of bibliography is located in light of these issues. I argue that by considering bibliography as a document performing a particular communicative activity with a particular purpose, and as a genre that creates particular expectations as to how to use a bibliography and how to recognize a given bibliographical activity, bibliography as a form of knowledge organization may be able to deal with the effects of modernity on social and cultural communication. I conclude that these ways of understanding bibliographical activity may provide means as to how to understand and situate the role of bibliography in the global learning society.

Introduction

Bibliographical activities, and knowledge organization activities in general, abound in today's modern global learning society. The Internet and its search engines and indexes, digital libraries and archives, and corporate information systems are clear evidence of this. However, the notion of a 'global learning society' is not an impartial one. There are people and organizations that are against globalization or feel insecure about its consequences. Accordingly, what does it mean to analyze the role of knowledge organization in this global learning society? Is knowledge organization supporting and contributing to legitimating the notion of a global learning society? If not, what is the activity of knowledge organization then? The outcome of such analyses must point to the sociopolitical activities knowledge organization is involved in. But how are we to understand and conceptualize the involvement of knowledge organization in these activities? What does it imply for the theory and practice of knowledge organization?

In this paper I focus on the old activity of bibliography with reference to its role, if any, in the global learning society. I seek to analyze what theoretical underpinning(s) can be established in order to arrive at an appropriate understanding of the relationship between bibliography and human activities based on the production and use of documents (or information) in a variety of spheres in late modern society. The thesis that guides and informs this study is that the shift from print to electronic media has contributed to detaching bibliography from a larger history of documents and their role in human activity and in society. This has caused a lack of awareness of the role of systems for knowledge organization in the global learning society.
For instance, electronic databases seem to rest on an ideology that has obscured the fact that bibliographies are documents with a particular history embedded in sociopolitical activities. If these activities are not recognized and understood, it becomes difficult to argue for and conceptualize the role of knowledge organization in the global learning society. Thus, the connection to a history of documents is still important, as this underscores the situatedness of bibliography and not its detachment from the forms of social organization that foster the objects of bibliographical activity; that is, documents.

I argue that considering bibliography both as a document and as a genre produced to support, mediate, and maintain forms of social organization implies that knowledge organization theory needs to integrate an understanding of what documents do in society, in order to better understand and legitimize the practice of knowledge organization in today's global learning society. Such an understanding may contribute to further scrutiny of the sociopolitical activities knowledge organization is involved in and, through this, to a better account of the many bibliographical activities still going on nowadays and what we can learn from them as global citizens participating in a globalized world characterized by various means and modes of electronic communication. Understanding the role of bibliography as suggested above requires that its theory and practice be informed by broader views of the role of documents in human activity. In so far as this is achieved, knowledge organization theory can make an important contribution to understanding the many bibliographical activities going on in the global learning society and how these in the long run serve social, cultural, and democratic purposes. This is, however, conditioned by how these activities are made visible to citizens whose social actions depend on access to knowledge materialized in documents.

The discussion and argument develop through four stages. To begin with, I discuss the relationship between social change, modernity, and bibliography in order to position and rethink the role of bibliography in the global learning society. Having outlined some selected historical bibliographical issues, I move on to analyze what it means to understand the bibliography as a document and a genre in human communication. The final section examines the theoretical and practical perspectives brought about by the view of bibliography put forward here in the contemporary global learning society.

Social change, modernity, and bibliography

As a form of knowledge organization activity, the role of bibliography in the global society must be addressed in relation to how such a society is understood and described. How we characterize and conceptualize the global society sets the agenda for bibliographical activities in society and culture. The Danish historian of bibliography Svend Bruhns puts it this way: "When society changes, so too does the need for literature and bibliographical information. This entails that the condition of society can be read off in bibliographical documents" (Bruhns, 2004, p. 15; my translation). Thus, one may argue, every understanding and theoretical account of bibliography must be located with reference to social changes.

The term used to label the period in the history of Western Europe from the Renaissance until now is modernity. It refers to the establishment of particular forms of social organization and institutions such as capitalism, science, media, industry, and the nation-state. As a result, the organization, surveillance, and control of knowledge or information represent one of the institutional dimensions of modernity (Black, 1998; Black & Brunt, 1999; Giddens, 1990). That is to say, modernity has contributed profoundly to underscoring the role of knowledge organization activities in forms of social organization.

A critical aspect in discussions of modernity and the transformations of modern societies is globalization. Globalization is one of the consequences, or an inherent feature, of modernity (Giddens, 1990, p. 63). Giddens characterizes globalization as 'action at a distance' (Giddens, 1994, p. 96). That is, what is intended to be of relevance only to (or directed at) a local audience, or is locally embedded, has, or may have (cf. the cartoon case in Denmark), global consequences on a political, social, cultural, or economic scale.
This is also why the relationship of time and space, and the way it is altered, is a critical component in and consequence of modernity.

Another consequence of modernity is de-traditionalization. In traditional societies (e.g. oral societies), knowledge could be acquired with a relative degree of certainty, and no need existed to question that knowledge because it was usually tied to particular forms of social action (e.g. fishing or hunting); when these actions were performed successfully, there was no need to question them. Late modern societies are, however, epistemically characterized by the notions of trust and risk. Our 'market of knowledge' in late modernity is, like a market of competing commodities, filled with a variety of knowledge claims, not all of them true or relevant. Due to a growing specialization in knowledge production, modernity has taught us that a particular case can be analyzed from a variety of perspectives, each of them emphasizing a particular point. But since we are not always capable of examining claims to knowledge critically, we are left in a position where we have to trust, choose, and prioritize some cognitive authorities, or expert systems in Giddens's term. We have to make a choice and rely on some expert systems at the expense of others. However, this also implies a risk, a risk we cannot escape but one we as humans have to cope with, as it forms part of our life conditions in late modernity.

The mere presence of search engines and other similar knowledge organization activities made possible by the Internet makes up one of the conditions of cultural transmission in late modernity. Search engines underscore the notions of trust and risk. The way they perform is crucial to how we conceive of and use them in everyday life and professional life. In using them, we must to a certain extent also trust the search engines and their way of performing. But by trusting them we are at the same time also running a risk, because search engines and their labour are shaped by their politics; that is, by how and why they index as they do (Introna & Nissenbaum, 2000).
Concerning the change in cultural and social communication brought about by the Internet, and in particular the World Wide Web, one of the consequences of 'action at a distance' is that humans can be connected to, and act in, a variety of spheres of interest differently located in time and space. In terms of communicating and organizing knowledge and culture, it means that humans have a diversity of means and modes of knowledge organization available to them (e.g. search engines, virtual museum websites, online library catalogs, online book stores). This, moreover, entails that humans encounter these with a variety of expectations as to their way of performing, and must possess a kind of cultural capital in order to recognize these diverse knowledge organization activities. 'Action at a distance' further implies that nation-states have a hard time controlling knowledge. Knowledge and its organization are de-centered in the globalized world. Knowledge organization can bring about or contribute to forms of action in distant locales. For instance, search engines may provide search results which give critical information about certain political actions (e.g. the treatment of prisoners in the Abu Ghraib prison in Iraq), but these results may be beyond the control of the involved (or affected) political agents, who have no interest in such information becoming publicly known. In relation to this, Slevin (2000, p. 198) argues for the importance of understanding globalization not only in terms of economic issues but also in terms of culture, because globalization affects '...the way in which we produce, store and circulate information and other symbolic materials.' Looked upon this way, we may say there is a close connection between social change, modernity, and the means and modes by which knowledge is communicated, stored and organized.
Our understanding, I argue, of the role of bibliography in the global society, and hence in late modernity, must be situated precisely in light of the effects globalization has on the production, storage and circulation of information (or knowledge) and other symbolic materials, in order to see how and in what ways the need for knowledge and bibliographic information changes. For instance, a search in Google may provide results from library catalogs or other similar knowledge organization genres (e.g. subject portals). Registering a critical document in one local/national bibliography/catalog/search engine may affect a situation in another locality, in the sense that it can give rise to public debate that may affect a policy decision in some distant locality. These two examples suggest how changes in the transmission of culture affect the reception of information and other symbolic material, because cultural boundaries are not identical to national boundaries. That is, the examples point to how bibliographical activities can be understood as 'action at a distance' and, for this reason, as socio-political activities. At the same time, they also point to how we need to understand bibliography in terms of what kind of activities it performs as a document, what to expect of bibliographical activities, and how to recognize them as particular knowledge organization genres.

Bibliography: 'Just' a list of documents or a tool in social and cultural communication?
When using the concept 'bibliography' in the following, I take my point of departure in enumerative bibliography, that is, bibliography understood as the activity of compiling and listing books, documents, literature, or information (see Stokes, 1969, pp. 25-69). This is not because I regard material, historical or textual bibliography as irrelevant for my purpose, but because enumerative bibliography has more or less been the dominant way of thinking about bibliography in knowledge organization and in library and information studies in general. The field of bibliography has historically been considered part of a larger history of books, documents or literature, i.e. historia litteraria (see e.g. Schneider, 1934; Blum, 1980; Woledge, 1983), pointing to its involvement in the communication of knowledge materialized and organized in various sorts of texts. With regard to this aspect, Balsamo (1990, p. 1) argues that bibliography historically needs to be looked at in terms of its institutional function carried out within the context of cultural transmission; i.e. not only, if at all, with reference to its technical compilation. The documentation movement, personified in particular by Paul Otlet, was at the beginning of the 20th century also interested in bibliography and its role in social communication. Frohmann (2000, p. 15) writes about Paul Otlet that '...his belief that world peace and a just, global society depend upon the exercise of rational thought in both the natural and social sciences, led him to the inescapable conclusion that an ideal social order can be realized only by building international institutions dedicated to the organization and communication of knowledge.' Otlet's thinking about bibliography and its function in social organization reveals the modernistic view of knowledge as a means to social progress.
As a consequence, Otlet was concerned with how to establish a bibliography and an organization of knowledge that could serve mankind in its struggle for an ideal social order. The link to and emphasis on social order as being dependent on communication and organization of knowledge suggests that bibliography to Otlet was not merely a list of documents but something that could be used to maintain and strengthen forms of social organization. In 1952 Margaret Egan and Jesse Shera proposed a theory of bibliography. They emphasized the relationship of bibliography to social organization, social action, and communication, advocating what they labeled a 'macrocosmic' approach to bibliography. With this they saw bibliography '...as one of the instrumentalities of communication and communication itself as an instrumentality of social organization and action' (Egan & Shera, 1952, p. 125), and further: 'Bibliography must be looked upon as being, in effect, the roadbed over which the units of graphic communication move among the various parts of society as they make their contribution to the shaping of societal structure, policy, and action.' (Egan & Shera, 1952, p. 125). In this manner Egan & Shera demonstrated their fundamental belief in a socio-communicative conception of knowledge organization and bibliographical activities. What can be traced above is a conception of bibliography as part of forms of social organization. Such a conception does not seem to be apparent today, when bibliographies are, for instance, published in the form of electronic databases. Conceiving of databases as historically and socially detached is not very productive, as this tends to remove bibliographical practice from human activity. Our understanding of bibliography must obviously be extended to more than a mere list of documents or information resources.
In its broadest sense, one may consider the webpage of a company or organization a bibliographical activity, as it compiles, lists, and organizes links and thus also mediates particular forms of knowledge or information. The digital scholarly edition, too, may be considered a form of bibliography (Dahlström, 2004). Therefore, in order to better understand and conceptualize the role of bibliography in human activity and in the global learning society, the following will try to point out how this may be done by regarding bibliography both as a document and as a genre in social and cultural communication.

Bibliography as document
The view of documents to be presented comes from rhetoric. In rhetoric, communication is viewed as purposeful action. The means of achieving a communicative purpose (e.g. documents) are regarded from the point of view that documents, on behalf of both producers and users, want to do something in human communication; they want to act, to perform. Looking at documents from a tools perspective means that we can do something with them. We can achieve some kind of goal in some specified communicative situation. We can, for instance, make them talk on our behalf (Levy, 2001, p. 23). Bibliography as a document implies, then, that it seeks to accomplish something in the world on the part of both producers and users. This activity of bibliography needs to be understood in order to conceptualize its role in society. Otherwise it will tend to operate in disguise and, hence, leave the impression that it is a value-free instrument, a 'mere' list of documents, in social communication. Historically, the purpose of bibliographies was made clear by their producers, whether that purpose was cultural, social, political or religious (Balsamo, 1990). Such an explication of purpose seems to have become, if not disappeared altogether, rather ignored, in particular when it comes to electronic databases. Users may have a hard time examining their purpose. Knowing the particular purpose of a particular document is crucial because it determines what to expect of, and consequently also how to use, that document. If bibliography is documentation of society and culture, we need to know how it performs this documentation activity in the global learning society.

Social Action: Bibliography as a genre
Considering bibliography a genre may help underline the activity of bibliography in society. Rhetorician Carolyn Miller has put forward a concept of genre that positions it in social action (Miller, 1984). Miller suggested understanding genre as 'typified rhetorical actions based in recurrent situations' (Miller, 1984, p. 159). This way of alluding to genre implies that it is understood not only with reference to a particular type of text, but also in connection with a particular kind of situation (or activity) that gives rise to the text. Miller (1984, p. 151) set out to show 'how a [social] understanding of genre can help account for the ways we encounter, interpret, react to, and create particular texts', and, we may add, how we organize and search for particular texts. In this way, genre theory is concerned with how to recognize and understand particular text types in particular human activities. Emphasis is not only (if at all) on mere text types and their formal textual features/structures. Bazerman (2000, p. 16) puts it this way:

"Genres help us navigate the complex worlds of written communication and symbolic activity, because in recognizing a text type we recognize many things about the institutional and social setting, the activities being proposed, the roles available to writer and reader, the motives, ideas, ideology, and expected content of the document, and where this all might fit in our life"

As for bibliography considered a genre, this entails that in order to understand its role in the modern global learning society, we need to take into account the forms of social organization that give rise to bibliographical activities. Bibliography is a genre that documents recorded human activity. As a genre, bibliography must be recognized as performing this kind of labor, and it thereby creates an expectation as to what work it accomplishes and how. The global learning society produces various sorts of social action. The problem facing bibliographical activities in the global society is, then, that there are many agents, institutions, and individuals out there, all performing some sort of bibliographical work. Genre theory provides a way of dealing communicatively with the consequences of modernity, as it recognizes that there are many forms of knowledge, many means of articulating and structuring knowledge in a variety of communicative forms, and that different forms of texts organize human activity (e.g. Bakhtin, 1986; Bazerman, 1988, 1994, 1997, 2003; Berkenkotter & Huckin, 1995; Miller, 1984; Winsor, 1999; 2000). If we take such a view of genre and put bibliography in its light, it follows that we must be aware of the different kinds of work various bibliographical activities are performing. We must learn how to recognize, and what to expect of, such different bibliographical activities as, for instance, Amazon and a digital public library. What kind of social action are they performing on behalf of producers and users? What institutional and social settings develop bibliographical activities? These are the questions we must deal with when considering bibliography a genre in late modernity. It forces us to look at bibliography as more than a mere list of documents. It forces us to understand bibliography as yet another consequence of modernity.

Concluding remarks
Understanding the role of bibliography as suggested above requires that its theory and practice be informed by broader views of the role of documents in human activity. By considering bibliography as a document performing a particular communicative activity with a particular purpose, and as a genre that creates particular expectations both as to how to use a bibliography and how to recognize a given bibliographical activity, bibliography as a form of knowledge organization may be able to deal with the effects of modernity on social and cultural communication. In so far as this is achieved, knowledge organization theory can make an important contribution to understanding the many bibliographical activities going on in the global learning society and how these, in the long run, serve democratic purposes. This is, however, conditioned by how these activities are made visible to citizens whose social actions depend on access to knowledge materialized in documents.

Bibliography
Bakhtin, M. (1986). The Problem of Speech Genres. In: Speech Genres and Other Late Essays, pp. 60-102. Tr. Vern W. McGee, ed. Caryl Emerson and Michael Holquist
Balsamo, L. (1990). Bibliography: History of a Tradition. Translated from the Italian by William A. Pettas. Berkeley: Bernard M. Rosenthal
Bazerman, C. (1988). Shaping Written Knowledge: The Genre and Activity of the Experimental Article in Science. Wisconsin: The University of Wisconsin Press
Bazerman, C. (1994). Systems of Genres and the Enactment of Social Intentions. In: A. Freedman & P. Medway (Eds.), Genre and the New Rhetoric, pp. 79-101. London: Taylor & Francis
Bazerman, C. (1997). Discursively Structured Activities. Mind, Culture, and Activity, 4(4), pp. 296-308
Bazerman, C. (2000). Letters and the Social Grounding of Differentiated Genres. In: D. Barton & N. Hall (Eds.), Letter Writing as a Social Practice, pp. 15-29. John Benjamins Publishing Company (Studies in Written Language and Literacy)
Bazerman, C. (2003). Speech acts, genres and activity systems: How texts organize activity and people. In: C. Bazerman & P. Prior (Eds.), What Writing Does and How It Does It: An Introduction to Analyzing Texts and Textual Practices, pp. 309-339. Lawrence Erlbaum Associates
Berkenkotter, C. & Huckin, T. (1995). Genre Knowledge in Disciplinary Communication: Cognition/Culture/Power. Hillsdale, NJ: L. Erlbaum Associates
Black, A. (1998). Information and modernity: the history of information and the eclipse of library history. Library History, 14, pp. 37-43
Black, A. & Brunt, R. (1999). Information management in business, libraries and British military intelligence: towards a history of information management. Journal of Documentation, 55(4), pp. 361-374
Blum, R. (1980). Bibliographia: An Inquiry into its Definition and Designations. Trans. Mathilde V. Rovelstad. Chicago: American Library Association
Dahlström, M. (2004). How Reproductive is a Scholarly Edition? Literary and Linguistic Computing, 19(1), pp. 17-33
Egan, M. & Shera, J. H. (1952). Foundations of a Theory of Bibliography. Library Quarterly, 22(2), pp. 125-137
Frohmann, B. (2000). Discourse and Documentation: Some Implications for Pedagogy and Research. Journal of Education for Library and Information Science, 42, pp. 13-28
Giddens, A. (1990). The Consequences of Modernity. Polity Press
Giddens, A. (1994). Living in a Post-traditional Society. In: U. Beck, A. Giddens & S. Lash (Eds.), Reflexive Modernization: Politics, Tradition, and Aesthetics in the Modern Social Order, pp. 56-109. Cambridge: Polity Press
Introna, L. & Nissenbaum, H. (2000). Shaping the Web: Why the Politics of Search Engines Matters. The Information Society, 16(3), pp. 1-17
Levy, D. M. (2001). Scrolling Forward: Making Sense of Documents in the Digital Age. Arcade Publishing
Miller, C. R. (1984). Genre as Social Action. Quarterly Journal of Speech, 70, pp. 151-167
Schneider, G. (1934). Theory and History of Bibliography. Trans. Ralph Robert Shaw. The Scarecrow Press
Slevin, J. (2000). The Internet and Society. Polity Press
Stokes, R. (1969). The Function of Bibliography. The Trinity Press
Winsor, D. (1999). Genre and Activity Systems: The Role of Documentation in Maintaining and Changing Engineering Activity Systems. Written Communication, 16(2), pp. 200-224
Winsor, D. (2000). Ordering Work: Blue-Collar Literacy and the Political Nature of Genre. Written Communication, 17(2), pp. 155-184
Woledge, G. (1983). 'Bibliography' and 'documentation': words and ideas. Journal of Documentation, 39, pp. 266-279

Ágnes Hajdu Barát
Head of Library and Information Science Department
University of Szeged, HUNGARY

Usability and the user interfaces of classical information retrieval languages

Abstract: This paper examines some traditional information searching methods and their role in Hungarian OPACs. What challenges are there in the digital and online environment? How do users work with these methods, and do they give users satisfactory results? What kinds of techniques are users employing? In this paper I examine the user interfaces of UDC, thesauri, subject headings etc. in Hungarian libraries. The key question of the paper is whether a universal system or local solutions offer the best approach for searching in the digital environment.

1. Introduction The possibilities of integrated systems mean not only automated processes within libraries, but also shared catalogues linking different library systems and extending resources considerably. For users, shared catalogues are the realization of the distributed library.

1.1 When integrated library systems appeared, these questions arose:

- How do earlier methods like classification and retrieval systems apply in new information environments?
- How can users avoid the confusion arising from multiple user interfaces in the OPAC environment?
- How can maximum satisfaction be obtained from the management of knowledge in organizations?
- Do we keep any or all parts of earlier information retrieval systems, or abandon them?
- Different search techniques (conceptual, object-based, browser-based) apply to various levels in the same record. Is it worth separating these different techniques into different fields of the record? Or would it be better to compare them and establish "new" information retrieval language dictionaries from the separate segments?
- On the list above is one apparently unreal question: Do we keep any or all parts of earlier information retrieval systems, or abandon them?

It would seem we must keep them because, as things stand, we cannot manage information effectively without them. There are three possibilities:

- to compare the outcomes of different search techniques and establish "new" information retrieval language dictionaries from the separate segments of concepts;
- to transform and reconstruct existing systems according to current needs;
- to change information-seeking technology and devise new types of search engines.

For instance, one could combine UDC codes (or any hierarchically structured, universal system) with some new technological solution. Totalzoom technology is only one possibility: it would be able to map and spatially display the hierarchy, tables, codes, and common and special auxiliaries. This method would not only manage UDC codes in OPACs, but could also integrate other structured databases, especially hierarchically structured ones. Naturally we can use other methods that are able to visualize information systems. All possibilities should be given a trial.

2. Why the UDC?
I have studied only the UDC in different OPACs and Internet databases, although any similarly structured, hierarchical and universal system would be suitable. The use of classical classification methods is a strong tradition in Hungary. One of the most widely used systems has been (and still is) the UDC. What advantages does the UDC have?

- Universal system
- Meaningful notation
- Clarity and transparency
- Rich network of relationships
- Well-defined categories
- Ability to describe special and general concepts, with free movement between the different levels
- Efficient retrieval, relevant hits
- A long tradition, found widely in many libraries: as a consequence of Ervin Szabó's activity, Hungarian librarians knew and used the UDC very early, and most libraries use it today
- A concept system that is part of our common cultural heritage and values
- Significant potential
- Standardization
- Well-developed hierarchies, able to visualize information and conceptualize it independently within its structure [Hajdu Barát, 2004 A]

= 51        Urál-altaji (turáni) nyelvek [Ural-Altaic (Turanian) languages]
= 511       Uráli nyelvek [Uralic languages]
= 511.1     Finn-ugor nyelvek [Finno-Ugric languages]
= 511.11    Finn nyelvek [Finnic languages] (last revision: October 1997)
= 511.111   Finn [Finnish]
= 511.112   Karél [Karelian]
= 511.113   Észt [Estonian]
= 511.114   Livón [Livonian] (new notation, introduced: October 1997)
= 511.115   Vepsze [Veps]
= 511.116   Vót [Votic] (new notation, introduced: October 1997)
= 511.117   Ingri (nyelv) [Ingrian] (new notation, introduced: October 1997)
= 511.12    Lapp [Sami]
= 511.13    Permi nyelvek [Permic languages]
= 511.131   Votják. Udmurt [Votyak. Udmurt]
= 511.132   Zürjén. Komi [Zyrian. Komi]
= 511.14    Ugor nyelvek [Ugric languages]
= 511.141   Magyar [Hungarian]
= 511.142   Osztják. Hanti [Ostyak. Khanty]
= 511.143   Vogul. Mansi [Vogul. Mansi]
= 511.15    Volgai nyelvek [Volga languages]
= 511.151   Cseremisz. Mari [Cheremis. Mari]
= 511.152   Mordvin [Mordvin]
= 511.152.1 Erzä [Erzya]
= 511.152.2 Moksa [Moksha]

Table 1: structure of the Ural-Altaic language group in the UDC [Egyetemes 2002, p.20]

There is marked interest in the UDC's potential to assist growing numbers of Internet users. The UDC can play an integrating role in knowledge organization. Thus the answer to the earlier "unreal question" is "Definitely, yes!" We should keep the UDC. However, this answer brings with it some other questions:

- Are there any methods that can search hierarchies while changing levels easily?
- Will the UDC codes become more user-friendly?
- Can we utilize the powerful structure of UDC in OPACs or other electronic environments?
- Should UDC codes become non-terminal scores, and can that structure show the way to retrieval? [Hajdu Barát, 2004 B, 175]

Minimum expectations are:

- Users should navigate easily and unequivocally in permanently variable circumstances.
- Not only librarians but also users should be enabled to work and search with UDC codes.
- A user-friendly and user-oriented system is needed.
- The expertise, craft, knowledge, and practice of librarians, professionals and scientists should remain important in the UDC system and UDC MRF.
- User satisfaction is a general expectation and the top-most priority. [Hajdu Barát, 2004 B, 175]

3. Visual Imagery + visualization = usability
User interfaces are "communicational surfaces", or channels between human information seekers and information retrieval systems. Relevant and irrelevant, clear and unclear, sufficient and adequate... these are the worlds to which information-seekers are accustomed, particularly on the Internet. When users approach a known or unknown information system, they often have only a muddy and fuzzy understanding of the system's basic operation. They may be satisfied without knowing advanced functionality such as archiving their results. They cannot see the whole cake, but rather only one small piece. Visual imagery plays an important mental or intellectual role, much like information (or data) processing, memory, learning, abstract thinking, and linguistic comprehension. Visual perception is a complex process. Visualization begins with the sensation of physical stimuli, but after that it becomes quite individualized. Perception depends, for example, on experience, knowledge, cognition, or one's system of symbols. The process is an explicit, multilevel and symbolic work of the mind. People are easily attracted to images and visual information. Pictures, graphics, menus, icons, buttons, graphs etc. can help users understand, navigate, and query information systems. Information-retrieval systems equipped with computer graphics provide more accessible interfaces. In what follows I focus on Hungarian user interfaces and the role of UDC in the various integrated library systems (ILS) in use in Hungary. The role for visualization with UDC takes many forms.

3.1 Simple Levels

3.1.1 Searching UDC codes is impossible
Users can query title, author, place of publication, publisher, subjects etc. The UDC code is not searchable, but hits show UDC codes in the record. This is the case with the TINLIB system in use at Pázmány Péter Catholic University.

Table 2: OPAC of the Pázmány Péter Catholic University

3.1.2 Searching UDC codes is possible
The National Széchényi Library uses AMICUS software, and users can search for concrete UDC codes, usually to exact, complex and high standards. The elements and fields searched appear highlighted in red (Table 2). Software vendors have not incorporated all levels of the classification hierarchy into the integrated library systems. For instance, if we are looking for 51 Mathematics, the hits show only the records carrying exactly the code 51. Hit lists exclude 510 Fundamental and general considerations of mathematics, 511 Number theory, 512 Algebra, 514 Geometry ... 519.1 Graph theory, 519.878 Search theory. This is unsatisfactory and runs counter to the philosophy of UDC, because the hierarchy disappears from this solution.

Table 3: OPAC of National Széchényi Library
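The exact-match behaviour described above can be illustrated with a minimal sketch (hypothetical: the records and function names are invented for demonstration and do not reflect the actual AMICUS implementation). Because UDC notation is hierarchical by prefix, a simple prefix match keeps the subdivisions that an exact match loses:

```python
# Hypothetical illustration: exact-match vs. hierarchical (prefix) search
# over UDC codes. The sample records below are invented for demonstration.
records = {
    "51": "Mathematics",
    "510": "Fundamental and general considerations of mathematics",
    "511": "Number theory",
    "512": "Algebra",
    "514": "Geometry",
    "519.1": "Graph theory",
    "53": "Physics",
}

def exact_search(code):
    """Return only records whose code matches exactly (the behaviour criticized above)."""
    return {c: t for c, t in records.items() if c == code}

def hierarchical_search(code):
    """Return the class and all its subdivisions, as UDC philosophy expects."""
    return {c: t for c, t in records.items() if c.startswith(code)}

print(sorted(exact_search("51")))         # ['51']
print(sorted(hierarchical_search("51")))  # ['51', '510', '511', '512', '514', '519.1']
```

A real implementation would also have to handle common and special auxiliaries and compound notations, which plain prefix matching does not cover; the sketch only shows why discarding the hierarchy contradicts the structure of the notation.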

There are some OPACs that use UDC codes with the complete hierarchy to a maximum length (3-4 numbers, 1-2 auxiliaries).

3.2 Translation to subject or index forms

3.2.1 Simple translation
In some databases and catalogues, UDC codes are translated into searchable subject terms. Although this is very convenient for users, the terms lose every advantage of the UDC classification.

Table 4: one part of MEK (Hungarian Electronic Library)
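The loss can be made concrete with a tiny sketch (the mapping and data are invented for illustration, not taken from MEK): once a code is replaced by a flat subject term, the broader/narrower relationship that the notation itself encodes is no longer visible to the search.

```python
# Hypothetical sketch of "simple translation" from UDC codes to flat
# subject terms (sample data invented for illustration).
translation = {"51": "Mathematics", "512": "Algebra", "514": "Geometry"}

def translate(code):
    """Replace a UDC code with its natural-language subject term."""
    return translation.get(code, "(no term)")

# In the notation the hierarchy is visible: 512 is a subdivision of 51.
print("512".startswith("51"))                        # True
# After translation only flat strings remain; the relationship is gone.
print(translate("512"))                              # Algebra
print(translate("512").startswith(translate("51")))  # False
```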

3.2.2 Structured Translation
Corvina software is used in the National and University Library of Debrecen (DEENK). In its "Subject Category System" there are many graphically structured sites and sub-sites with UDC concepts in natural-language form. Users can search several databases and several types of documents together. They can see their hits before clicking the links to full descriptions of records. This solution helps users refine their searches by narrowing or broadening them, a clearly useful capability for subjects that are not highly esoteric.

Table 5: OPAC of DEENK
Table 6: Hit in the OPAC of DEENK

3.3 Leading users to continue their searching with links
In this system, users navigate subject relationships via links and interactive forms. Searchers can combine different subject terms, classification codes, descriptors and bibliographic elements. They have the flexibility to refine their strategy or revise their search completely as they go.

3.3.1 National Document Supply System (ODR)
This system uses relatively simple, searchable UDC codes. The search history displayed alongside the actual hits helps guide searchers toward more relevant results. Unfortunately not every field contains searchable links; this includes the UDC codes. Searching UDC codes results in translations to subject terms, which can then be fed back into the search to refine it. One can combine keywords and subjects in the same search, for example Classification - books and Classification, Universal Decimal. Locations of hits are displayed as well.

Table 7: Database of ODR
Table 8: Hit in database of ODR

3.3.2 University Library of the University of Szeged (SZTE)
With Corvina software, UDC codes are searchable here, too. The codes are very relevant, complex and well constructed, as this institute has a great tradition in the field of classification and has specialized in using the UDC system in catalogues. One drawback is that refining the search using UDC codes is not available. There are two parallel screens: the first column with hits and the second with the full record in different formats. UDC codes appear only in the MARC and XML forms and are not in every type of record. There are links for refining the search, but the UDC codes are not searchable at this step or after, although subjects, authors, titles etc. are.

Table 9: Hit in the OPAC of SZTE

3.3.3 Library of the Hungarian Academy of Sciences (MTA)
One of the best applications of UDC codes in an electronic environment in Hungary uses the Aleph system. High-standard, searchable UDC codes are used. One can refine the search with the help of the hits and has the flexibility to combine different elements of descriptions. One can browse and search with UDC codes as well as with classification terms. The searched parts of records are visible and lead the searcher to further steps. Naturally, locations of hits are displayed.

Table 10: Hit in the OPAC of MTA
Table 11: Browse with the help of a hit

3.4 One solution outside Hungary: Catalogus Openbare Bibliotheken Antwerpen (http://bibliotheek.antwerpen.be/MIDA/)
This system helpfully guides users to narrow or broaden their concepts and topics. The subject hierarchy is apparent and attractively formatted. The structure (classes, subclasses and subdivisions) is clear and classical. This solution is expressive and keeps the UDC tradition, but depends upon the users' patience and/or knowledge. One feature is simultaneous searching in separate databases.

Table 12: Browse in the OPAC of Bib

4. Conclusion
The UDC is an artificial information-retrieval language. Users who do not know the semantics of classification codes encounter difficulties; formerly, reference librarians were the chief aides to users and visitors in information seeking. Today searchers also use catalogues and databases unmediated, via the Internet; therefore, the databases themselves should help users in information retrieval. In the extreme, databases might take over the tasks of reference work by exploiting the strengths of visualization (structure, methods, different search engines etc.). Most solutions involve OPACs. There is a wide variety of ways of using UDC codes in combination with subject terms and other elements of descriptions. We should study these varieties and discover which methods, especially visualization-related ones, will make OPACs more user-friendly and effective. This paper raises only the classical possibilities. However, the relationship between usability and visualisation is fundamental. From among all the varieties of visualisation methods, librarians and information specialists should intensively seek the best and most effective.

References:
Egyetemes Tizedes Osztályozás. Rövidített kiadás. UDC Publ. No. P057. Szerk. Barátné Hajdu Ágnes. 1-3. köt. Budapest: OSZK - Könyvtári Intézet, 2005
Hajdu Barát, Ágnes: Knowledge organization of the Universal Decimal Classification - new solutions. In: Extensions and Corrections to the UDC, 26. The Hague: UDC Consortium, 2004, pp. 7-12
Hajdu Barát, Ágnes: Knowledge organization of the Universal Decimal Classification - new solutions, user-friendly methods from Hungary. In: Knowledge Organization and the Global Information Society. Ed. Ia C. McIlwaine. Würzburg: Ergon Verlag, 2004, pp. 173-178
Pálvölgyi, Mihály: Keresőnyelvek és fogalomtárak általános, ismeretreprezentációs és technológiai tendenciái [General, knowledge-representation and technological tendencies of retrieval languages and thesauri]. 2001. http://www.mek.iif.hu/porta/szint/tarsad/konyvtar/katalog/medinfo/html/palvolgyi.htm

Judith Simon
Department for Philosophy of Science
University of Vienna, Austria

Interdisciplinary Knowledge Creation – Using Wikis in Science

Abstract: This article focuses on two aspects of knowledge generation: first, how new knowledge is created in interdisciplinary discourses, and second, how this process might be mediated and promoted by the use of wikis. I suggest that it is the noise coming to life in (ex)changes of perspectives that enables the creation of new knowledge. In sections 1-4, I examine how the concept of noise from the mathematical theory of communication (Shannon 1948) on the one hand, and theories of organizational knowledge creation (cf. Nonaka 1994) on the other, might help us to understand the process of interdisciplinary knowledge creation. In section 5, I explore the role wiki technologies can play in supporting interdisciplinary collaborations; this section draws on my own experiences in a wiki-based interdisciplinary collaboration. It seems that even though certain features of wiki technology make it an excellent tool for externalizing and combining individual knowledge, leaving room for noise while documenting the process, the full benefit of wikis can only be obtained if they are embedded in a broader communication context.

1. Knowledge and Communication The first point to consider is how knowledge is created, that is, how new things and thoughts come into being. One field particularly involved in understanding and promoting the foundations of knowledge creation is knowledge management. Recently, there seems to be strong evidence for a renewed emphasis on the role of communication in knowledge creation processes (Fuchs 2004, Calabrese 2004). Kuhlen (2004) even proposes "… a paradigm shift in the understanding of knowledge management," which "puts knowledge management into the broader context of communication" (Kuhlen 2004, p. 21), and therefore announces the communicative paradigm of knowledge management. In his view, information and knowledge, being increasingly produced, distributed and used collaboratively, are primarily the results of communication processes. Even though Kuhlen (2004) explicitly distances himself from earlier knowledge management approaches such as those of Nonaka (1995) and Probst (2000), I would argue that the constitutive function of communication for knowledge creation processes can already be found in Nonaka's influential book The Knowledge-Creating Company (Nonaka 1995). Nonaka's central thesis states that "organisational knowledge is created through a continuous dialogue between tacit and explicit knowledge" (Nonaka 1994, p. 14). More specifically, he differentiates four modes of knowledge conversion between tacit and explicit knowledge: socialisation, externalisation, internalisation and combination, and considers their helical interplay to be the foundation of organisational knowledge creation. Pursuant to a general shift in interest, or at least in nomenclature, from information to knowledge management, he stresses a fundamental difference between organisational knowledge creation and mere information processing or problem solving.
While Shannon’s mathematical theory of communication explicitly eliminates semantics in order to quantify information, Nonaka states: “[i]n terms of creating knowledge, the semantic aspect of information is more relevant as it focuses on conveyed meaning” (Nonaka 1994, p. 16), and suggests that “any preoccupation with the formal definition will tend to lead to a disproportionate emphasis on the role of information processing, which is insensitive to the creation of organizational knowledge out of the chaotic, equivocal state of information” (Nonaka 1994, p. 16). However, he draws much more inspiration from Shannon than one would expect having read the quotes above. First of all, the four processes of socialisation, externalisation, internalisation and combination, when used to explain collaborative processes of knowledge creation, can hardly be conceptualised without the premise of information transmitted through channels within an organization. How, for instance, can socialisation be understood without considering communication and interaction between expert and novice, colleagues or peers? Moreover, Nonaka explicitly employs concepts taken from Shannon’s information theory to explain the creation of new knowledge when he writes: “Individuals recreate their own systems of knowledge to take account of ambiguity, redundancy, noise, or randomness generated from the organization and its environment” (Nonaka 1994, p. 18). To conclude, both approaches to knowledge management are communication-driven: one explicitly, the other implicitly. If we consider communication to be the basic process of collaboration and accept this decisive role of communication in collaborative knowledge creation, we should take a closer look at signal transfer as the most fundamental process of communication.

2. Communication, Information and Noise Shannon’s mathematical theory of communication describes the process of signal transfer from a source to a destination. A perfect connection is assumed between source and transmitter as well as between receiver and destination, while transmitter and receiver are connected by a channel. Shannon’s significant innovation was the formal integration of a noise source into the communication model (Shannon 1998). Since every communication channel comes into existence by distinguishing the message from noise, noise is as much a prerequisite of communication as a threat to it. Noise corrupts the message by endangering its transport through the channel between transmitter and receiver or by blurring its unambiguousness, thus challenging the success of communication at its most fundamental level, the information transfer.
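The limit that noise places on communication can be stated precisely. For a channel of bandwidth $W$ perturbed by white noise, Shannon's capacity theorem (the central result of "Communication in the Presence of Noise", cited below as Shannon 1998) gives the maximum rate at which information can be transmitted reliably:

```latex
C = W \log_2\!\left(1 + \frac{P}{N}\right)
```

where $C$ is the channel capacity in bits per second, $P$ the average signal power and $N$ the average noise power. Below this rate, errors introduced by noise can in principle be made arbitrarily rare; above it, they cannot.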

Fig. 1: Schematic diagram of a general communication system (adapted from Shannon 1948, p. 380)

It is important to note that Shannon explicitly excludes any semantics from his theory by stating: “Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages” (Shannon 1948, p. 379).

Shannon's theory is based on probability theory and was developed to solve problems in the transmission of signals, enabling the quantification of information as a measure of uncertainty. It may be due to this possibility of quantification that information soon joined matter and energy as a prime concept in science. Scientists from a wide variety of disciplines considered information a fertile concept for their own fields and thus contributed to its rapid spread through the sciences and humanities. In the handbook Key Concepts in Communication and Cultural Studies, noise is defined as “any interference added to the signal between encoder and decoder that makes accurate decoding more difficult", one that “is combated by redundancy and feedback, and is one of the factors that limits the capacity of a channel to convey information” (Sullivan et al. 1994, p. 203f). Furthermore, Sullivan et al. differentiate between mechanical noise, regarded as noise in the channel, and semantic noise, an “interference with the message brought about by dissonance of meaning; this is usually caused by social or cultural differences between encoder and decoder” (Sullivan et al. 1994, p. 203f). Semantic noise does not necessarily arise in the channel, i.e., between encoder and decoder, but can also arise within the encoder and/or decoder, due to multiple semantic interpretations of syntactically identical messages. Shannon surely did not intend this widespread use of the term “noise”, but in Sullivan et al. (1994) we see how the transfer of concepts from one discipline to another can be highly productive. So before going deeper into the various functions of noise in communication processes, I want to discuss a prime example of communication in the presence of noise: interdisciplinary discourse.
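The "measure of uncertainty" mentioned above is Shannon's entropy. For a source that selects each message $i$ with probability $p_i$, the average information per message is

```latex
H = -\sum_i p_i \log_2 p_i \quad \text{(bits per message)}
```

and the information actually conveyed across a noisy channel is what survives the noise: the mutual information $I(X;Y) = H(X) - H(X \mid Y)$, where the equivocation $H(X \mid Y)$ measures the uncertainty about the sent message that remains after the received one is known. In this precise sense, noise is what successful communication must overcome.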
If you think about the concept of information and ask a mathematician, a journalist and a philosopher what “information” is, the extent to which academic education influences and shapes our perspectives will soon become obvious. Are the differences in perspective caused by semantic noise alone, or does mechanical noise also play a crucial role in this diversity? However that may be, I would suggest that the same noise which often makes communication and understanding between proponents of different disciplines so difficult can also be the basis for the creation of new knowledge in interdisciplinary collaboration.

3. Interdisciplinary1 Knowledge Creation An abundance of academic literature has dealt with interdisciplinarity and its role in scientific knowledge from the 1960s onwards (cf. Weingart & Stehr 2000, Kocka 1987, Klein 2000). Similar questions have been posed in industry, leading to a vast literature on the potential value-adding effects of heterogeneous, cross-functional teams for creativity and innovation (Leonard Barton 1997, McDermott 1999, von Pierer & von Oettinger 1999). Scientific studies of interdisciplinarity have focussed primarily on the natural sciences rather than the humanities (Weingart 2000), and especially on fields of applied research and technology development. One line of argument has often been that “nature” does not care about disciplinary boundaries, i.e. many real-life problems demand the crossing of disciplinary boundaries to be solved. Klein (2000) argues that some of the most productive and innovative fields of research have emerged from problem-based, interdisciplinary collaborations in the sciences, such as molecular biology, cognitive science or biomedicine. Interdisciplinarity has therefore become a “programmatic, value-laden term that stood for reform, innovation, progress” (Weingart & Stehr 2000, p. xiii). Maybe the most prominent feature of interdisciplinarity is that everybody praises it, but nobody practises it. Because almost every call for national and international research projects explicitly highlights interdisciplinary collaboration as a prerequisite for application, most research co-operations are labelled interdisciplinary. However popular name-dropping of the term “interdisciplinary” may be in research proposals, scepticism about whether interdisciplinarity really works remains prevalent.
The reason for this paradoxical, though obviously exaggerated, description is that on the one hand there is an increasing demand for interdisciplinary collaboration in science, technology and industry, while on the other, experience has shown that successful interdisciplinary collaboration is hard to achieve. But how can we conceptualize the success of interdisciplinary collaboration? According to Klein, “The need for transdisciplinarity arises from developments in knowledge and culture that are characterized by complexity, hybridity, non-linearity, and heterogeneity” (Klein 1994)2. I would argue that these characteristics, and the related aspects of (disciplinary) specialisation and noise, are the basic causes both of the necessity and of the frequent failure of interdisciplinary projects. According to Klein, “interdisciplinarity and specialisation are parallel, mutually reinforcing strategies” (Klein 2000, p. 7), where interdisciplinarity is meant to reduce some of the drawbacks of disciplinarity in times of increasing specialisation on the one hand and increasing complexity of the problems waiting to be solved on the other. A simple but deliberate answer to the question of the success of interdisciplinary collaboration is that it is measured by its ability to create new knowledge. But what is it about interdisciplinarity that is, on the one hand, considered to promote innovation and the creation of new knowledge, and on the other, so often leads to misunderstandings and the breakdown of cooperation? I would argue that the answer lies in the concept of noise.

4. The Function of Noise in Interdisciplinary Knowledge Creation: Transferring Concepts So what does this tell us about the function of noise for the collective creation of knowledge? Knowledge creation has long been considered a private, individual process that happens in solitude. Consequently, classical approaches in knowledge management have mainly focussed on storing and distributing existing knowledge, i.e., communication has been a means for the transfer of knowledge but not explicitly for its creation (cf. Probst 2000). In contrast, this paper’s concern is how new knowledge comes into existence through the process of communication, and how this process must therefore be understood in the first place. The process of communication can only come to the fore if not everything has already been communicated. There has to be something that has not yet been communicated, and this something must defy communication, otherwise it would already have been communicated. This is one of the consequences of Shannon’s theory: there is no communication in absolute redundancy. Thus we can argue that noise appears in the non-achievement, or rather the not-yet-achievement, of the process of communication, because it is not yet completed. Consequently, the process of communication and the insistence of noise are interdependent, and we can therefore ask: how does new knowledge come into existence through noise in communication? In the process of communication we can distinguish three ways in which noise is related to the creation of new knowledge.

4.1. Emergence instead of signal transfer Noise can disturb the inner structure of a semantic system, which in turn initiates new communicative processes to dissolve the perturbation on a higher organisational level. Two constructive consequences can be considered. If a perturbation shows that one concept has different meanings, the semantics can be amalgamated and an integrated, superordinate semantic order produced through communication. Ideally, the horizons of two different disciplines merge and together reach a higher level of cognizance. These fusions can take various forms, from convergence to the establishment of completely new fields of research. Klein (2001) cites cognitive science and molecular biology as two examples of hybrid disciplines: whereas cognitive science remained an interdisciplinary field of research in which scientists predominantly stayed in their disciplinary institutions, molecular biology opened up a new line of biological research.

4.2. Noise turns into signal New knowledge can be created from the interdisciplinary fuzziness of concepts even when higher-order integration does not occur. The experience of irritation due to diverging uses of words and concepts can lead to an awareness of possibly inept “thought styles” (Fleck 1981) and can shake disciplinary certitude. Processes can be triggered to adapt to the disturbance by the independent restructuring of disciplinary assumptions. Even if the receiver does not realize the difference in meaning between his usage of a concept and the sender’s, the sheer integration of the (strange) concept into his own semantic system can have synergetic effects. In that sense, even unconscious and unresolved misunderstanding can produce new knowledge. Typical of this function of noise are concept transfer and the metaphorisation of colloquial or common terms. “The concept of transfer indicates that the connotations of a word in its usual context are transferred to the new, ‘strange’ context”, states Draaisma (2001, p. 13) in his examination of technological metaphors in the history of the psychological investigation of memory. In that sense, a generative transfer of concepts does not require adopting a theory as it stands, but rather using fertile bits and pieces of theories to shed light on disciplinary blind spots.

4.3. Noise absorbs signal Interdisciplinary collaborations often face immense difficulties, especially when it comes to agreeing on future research strategies. Often, no common ground can be achieved to legitimize particular strategies. In such situations almost no one involved in the collaboration experiences any success in communicating his or her point of view, and two general strategies can be observed. First, the process of communication is hollowed out by excessive redundancy: the parties do not stop repeating their points of view, in order to stabilize their own perspective and to eliminate the irritating potential that I have addressed as “noise”. Second, one or more parties terminate the process of communication, either by not reacting to the messages of the other party or by declaring the collaboration a failure.

5. Wikis - Mediating Interdisciplinary Collaboration Many scientific collaborations are geographically dispersed endeavours; in fact, for most projects funded by the European Union, cooperation with partners in other European countries is a prerequisite for application. Hence, these cooperations depend on ICT support. This section focuses on the use and possible utility of one such technology: the wiki. A wiki is web-based software that allows instant online editing in a web browser by any user. Clicking an edit button leads directly to a frame in which the source code of the page can be easily edited by following some wiki-specific formatting rules. Besides this edit mode, two other essential functions of wikis are a simplified internal link system and the version history (Ebersbach & Glaser 2005). All versions of each wiki page are stored in a database, and through the version history it is possible to display the differences between selected versions, ensuring that information does not get lost. Moreover, the recent changes function enables users to watch changes in order to keep track of developments in the wiki.
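The version-history mechanism described above can be illustrated with a minimal sketch. This is not the data model of MoinMoin or any real wiki engine; the class and method names are invented, and Python's standard difflib stands in for a wiki's diff view:

```python
import difflib

class WikiPage:
    """Illustrative sketch of a wiki page that keeps every saved version."""

    def __init__(self, title):
        self.title = title
        self.versions = []   # full text of every revision, oldest first

    def edit(self, new_text):
        """Saving an edit appends a new version; earlier text is never lost."""
        self.versions.append(new_text)

    def diff(self, old, new):
        """Show the differences between two stored versions, line by line."""
        return list(difflib.unified_diff(
            self.versions[old].splitlines(),
            self.versions[new].splitlines(),
            lineterm=""))

page = WikiPage("InterdisciplinaryCollaboration")
page.edit("Noise hinders communication.")
page.edit("Noise hinders communication.\nNoise also enables new knowledge.")

# prints the unified diff: the unchanged context line plus the added line
for line in page.diff(0, 1):
    print(line)
```

Because every revision is retained, a contested edit (as in a Wikipedia edit war) can always be inspected or reverted; the diff makes the "noise" between two authors' versions explicitly visible.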

5.1. Using Wikis for Collaborative Exploration Although the spread of wikis was leveraged by the web encyclopedia Wikipedia, the most prominent example of a wiki application, the original wiki was developed by Ward Cunningham in 1995 as a collaborative writing tool to facilitate the discovery and documentation of software patterns (Venners 2003). The key elements of the edit mode and the simplified linking system were introduced to support authors by facilitating writing and eliciting the externalisation of individual knowledge, especially of experiences or new ideas that would normally not be considered worth writing down because they are not yet elaborated. According to Cunningham, “(a) wiki works best where you’re trying to answer a question that you can’t easily pose, where there’s not a natural structure that’s known in advance to what you need to know” (Venners 2003, p. 2). Wikis therefore promise to be especially supportive of open-ended collaborations, and generally in the initial stages of projects, when goals and project plans are still unclear: in other words, when there is a lot of noise and little information. However, as has been shown above for interdisciplinary collaborations, this noise is not automatically turned into information.

5.2. Some Experiences and Recommendations for Using Wikis This section is based on my own experiences in a wiki-based interdisciplinary collaboration using the wiki engine MoinMoin. The goal of this self-experiment in wiki usage, conducted with a colleague from Hamburg, was to co-write an article in a wiki on exactly this topic: interdisciplinary collaboration with wikis. Work on the article is still in progress, but some first results on the process of co-authoring in a wiki have already been obtained. If we reconsider the four processes of knowledge creation described by Nonaka (1994), mostly externalisation and, to some extent, combination were supported. This is in line with the original aims of wikis. Socialisation in wikis is as difficult as in purely ICT-mediated settings generally, an insight which has similarly led to a shift from e-learning to blended learning approaches in pedagogy. In our project, we noted that whenever important decisions had to be made, we used telephone and email to assist our communication. Hence, it seems advisable not to rely solely on wikis but to use them among other media. Internalisation also seems more difficult with wikis than with other web-based collaboration software. In contrast to blogs, which are chronologically organized, the texture of wikis is rather open. This flexibility often leads to a perceived lack of structure and readability, which is especially difficult for wiki newcomers. I would conclude that this easy externalisation and combination, along with the difficulty of internalisation, results in the production of noise, which can in turn result in the creation of new knowledge. But under what conditions can noise in wikis be turned into new knowledge? In section 4 I introduced three possible relations between noise and information.
The first type (“emergence instead of signal transfer”) seems quite hard to achieve in wikis; so far we have not reached this stage in our experiment, for at least one reason: due to time pressure we soon decided to split our planned article into two sections and to write the first drafts independently before exchanging them for comments, further elaboration and final combination. This procedure proved time-saving, but it surely prevented us from exploiting the full potential of wikis for collaborative writing, which would probably require continuous collaborative writing practice over a fair amount of time. For the second type (“noise turns into signal”) it would be sufficient to learn from the unfamiliar content provided by the others for one’s own questions, and I am confident that we reached this stage and will continue to learn from one another in the course of our project. Fortunately, the third type (“noise absorbs signal”) did not occur in our wiki. It can, however, be observed, for instance, in edit wars in Wikipedia. These edit wars often lead to the exhaustion of the same arguments by the proponents (redundancy), and finally users can be blocked by an administrator (termination of communication). To conclude, it seems that even though certain features of wiki technology make it an excellent tool for externalizing and combining individual knowledge, leaving room for noise while documenting the process, the full benefit of wikis can only be obtained if they are embedded in a broader communication situation. Even though these are of course only initial considerations and very tentative results, I hope to have aroused some interest in the use of wikis in interdisciplinary collaborations and in the revitalisation of the concept of noise for knowledge creation.

Acknowledgements I sincerely want to thank Claudia Koltzenburg from the TU Hamburg-Harburg for enabling and accompanying our project on interdisciplinary knowledge creation in wikis.

Notes 1 Since a detailed discussion of the concept and characteristics of interdisciplinarity would go beyond the scope of this article, cf. Heckhausen (1987) and Mittelstrass (1987). 2 Please note that in this paper I do not differentiate between transdisciplinarity and interdisciplinarity. On this topic cf. Heckhausen (1987), Klein (1994), Klein et al. (2001).

Literature
Calabrese, A. (2004). The evaluation of quality of organizational communications: a quantitative model. Knowledge and Process Management, 11 (1), 47-67.
Draaisma, D. (2001). Metaphors of Memory: A History of Ideas about the Mind. Cambridge: Cambridge University Press.
Ebersbach, A. & Glaser, M. (2004). Towards Emancipatory Use of a Medium: The Wiki. International Journal of Information Ethics, 11 (10).
Fleck, L. (1980). Entstehung und Entwicklung einer wissenschaftlichen Tatsache: Einführung in die Lehre vom Denkstil und Denkkollektiv. Frankfurt am Main: Suhrkamp.
Fuchs, C. (2004). Knowledge Management in Self-Organizing Social Systems. Journal of Knowledge Management Practice, May 2004. http://www.tlainc.com/articl61.htm
Heckhausen, H. (1987). Interdisziplinäre Forschung zwischen Intra-, Multi- und Chimären-Disziplinarität. In: Kocka, J. (Ed.), Interdisziplinarität: Praxis - Herausforderung - Ideologie. Frankfurt a.M.: Suhrkamp, 129-145.
Klein, J. T. (1994). Notes Toward a Social Epistemology of Transdisciplinarity. (Talk at the 1st World Congress on Transdisciplinarity, Arrábida, Portugal, 2.-6.11.1994.) http://perso.club-internet.fr/nicol/ciret/bulletin/b12/b12c2.htm
Klein, J. T. (2000). A conceptual vocabulary of interdisciplinary science. In: Weingart, P. & Stehr, N. (Eds.), Practising Interdisciplinarity. Toronto: University of Toronto Press, 3-24.
Klein, J. T. et al. (Eds.) (2001). Transdisciplinarity: Joint Problem Solving among Science, Technology, and Society: An Effective Way for Managing Complexity. Basel, Boston, Berlin: Birkhäuser.
Kocka, J. (Ed.) (1987). Interdisziplinarität: Praxis - Herausforderung - Ideologie. Suhrkamp Taschenbuch Wissenschaft 671. Frankfurt a.M.: Suhrkamp.
Kuhlen, R. (2004). Change of Paradigm in Knowledge Management - Framework for the Collaborative Production and Exchange of Knowledge. In: Hobohm, H.-C. (Ed.), Knowledge Management - Libraries and Librarians Taking Up the Challenge. München: Verlag K.G. Saur, 21-38.
Leonard Barton, D. (1997). Wellsprings of Knowledge. Cambridge: Harvard Business School Press.
McDermott, R. (1999). Learning across teams: How to build communities of practice in team organizations. Knowledge Management Journal, 8, May/June 1999, 32-36.
Mittelstrass, J. (1987). Die Stunde der Interdisziplinarität? In: Kocka, J. (Ed.), Interdisziplinarität: Praxis - Herausforderung - Ideologie. Frankfurt a.M.: Suhrkamp, 152-158.
Nonaka, I. (1994). A dynamic theory of organizational knowledge creation. Organization Science, 5 (1), 14-37.
Nonaka, I. & Takeuchi, H. (1995). The Knowledge-Creating Company. Oxford: Oxford University Press.
Pierer, H. v. & Oetinger, B. v. (Eds.) (1999). Wie kommt das Neue in die Welt. Hamburg: Rowohlt Taschenbuch Verlag.
Probst, G., Raub, S. & Romhardt, K. (2000). Managing Knowledge: Building Blocks for Success. Chichester: John Wiley and Sons.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423, 623-656.
Shannon, C. E. (1998). Communication in the presence of noise. Proceedings of the IEEE, 86 (2), 447-457.
Sullivan, T., Hartley, J., Saunders, D., Montgomery, M. & Fiske, J. (1994). Key Concepts in Communication and Cultural Studies. 2nd ed. London: Routledge. "Noise", p. 203f.
Venners, B. (2003). Exploring with Wiki: A Conversation with Ward Cunningham. (October 20, 2003.) http://www.artima.com/intv/wiki.html. Last visited 20.03.2006.
Weingart, P. (2000). Interdisciplinarity: The paradoxical discourse. In: Weingart, P. & Stehr, N. (Eds.), Practising Interdisciplinarity. Toronto: University of Toronto Press, 25-42.
Weingart, P. & Stehr, N. (Eds.) (2000). Practising Interdisciplinarity. Toronto: University of Toronto Press.

Alon Friedman
Palmer School of Library and Information Science, Brookville, New York

Concept Mapping: a measurable sign

Abstract: The objective of this study was to classify, according to Peirce’s definition of the term sign, the various forms of concept mapping presented in the 2000 and 2004 proceedings of the ISKO conferences. Our analysis was unable to discover relationships between Peirce’s Thirdness category and Interpretant classifications with regard to concept mapping: the concept mappings we analyzed provided neither the text descriptions nor the graphic representations that they had provided when we analyzed the Firstness, Secondness, Object and Representamen representations. We see a need for additional studies to assess Peirce’s Thirdness trichotomy and Interpretant classifications as they might be applied to concept mapping.

1. Introduction: We measured the ‘most used’ concepts discussed by presenters in the 2000 and 2004 ISKO conference proceedings to determine whether they would provide measurable elements. If a relationship between 'sign' and concept mapping could be discovered, it could then be examined in terms of Frame theory and Semantic Network theory; Frame theory was proposed by Minsky (1975) and Semantic Network theory was defined by Sowa (1992). The ISKO conferences were chosen because of the diverse backgrounds of their presenters and the diverse topics covered. Some of this diversity was reflected in the theoretical foundations of knowledge organization theories and the social elements of classification discussed at both conferences. The following question was proposed: what relationship is there between the styles and procedures of concept mapping employed in ISKO proceedings and Peirce’s definition of measurable ‘signs’? The purpose of this research is to contribute to an enhanced understanding of the role of concept mapping by researchers studying knowledge organization.

2. Methodology: We conducted a case study to measure whether the concept of ‘sign’ as envisioned by Peirce could be used to measure specific concepts discussed by ISKO presenters. Concept mapping can be considered a technique for representing knowledge in graphic form or with discipline-specific terminology. Content analysis was the methodology chosen to discover possible theoretical or conceptual relationships, or patterns of cognitive processing, in ISKO concept mapping. We first examined the proceedings to discover whether concept mappings could be classified according to Peirce’s triangle representation, which consists of Representamen, Interpretant, and Object. Our next step was to collect key terms derived from the concept mappings and reclassify them according to Peirce’s three categories: Firstness, Secondness, and Thirdness. Peirce made a phenomenological distinction between the sign itself [or the ‘representamen’] as an instance of 'Firstness', its object as an instance of 'Secondness' and the ‘interpretant’ as an instance of 'Thirdness' (Peirce 1931-58, 2.475). By employing Peirce’s classification, we hoped to be able to answer a very important theoretical question: is there a relationship between author-specific rhetorical style and author-specific conceptual processes? We employed a quantitative paradigm (a form of content analysis) to obtain descriptive data.

The study progressed through the following procedures: The first stage measured the most-used terms from both Proceedings (2000 and 2004). The second stage measured the ‘most used’ classifications in terms of Peirce’s triangle. And, the third stage employed content analysis to determine if there were Firstness, Secondness, and Thirdness issues. During the fourth stage, cross-proceedings analysis was conducted to discover possible relationships between the terms employed.
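The first stage described above, counting the most-used terms across proceedings, amounts to simple frequency analysis. A minimal sketch follows; the sample texts are invented stand-ins for the actual ISKO corpus, and the stopword list is an illustrative assumption, not the study's own:

```python
import re
from collections import Counter

def most_used_terms(texts, stopwords=frozenset({"the", "of", "and", "a", "in"})):
    """Count term frequencies across a collection of documents,
    ignoring case, punctuation, and a small stopword list."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts

# Invented sample in place of the 2000 and 2004 proceedings:
proceedings = [
    "The sign and the object of the sign.",
    "Concept mapping represents the sign.",
]
print(most_used_terms(proceedings).most_common(3))  # 'sign' ranks first here
```

A real study would of course work from the full proceedings texts and a domain-appropriate stopword list, but the ranking step itself is no more than this.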

3. Literature Review: Introduction: Concept mapping has been used in academic and business settings since the late 1980s, providing visual representations of knowledge structures and argument forms. In many disciplines, the various forms of concept mapping are already employed as formal knowledge representation systems; for example, semantic networks are used in artificial intelligence. We reviewed the term ‘sign’ and its relationship to concept mapping.

The term sign: The term ‘sign’ has a long history of debate because signs may take many forms, such as words, images or objects, although a sign has no intrinsic meaning until we invest it with meaning (Chandler, 1999). The two dominant models of what constitutes a ‘sign’ are those of Saussure and Peirce. For Peirce (2.483), signs can be divided into three kinds: icons, indices, and symbols. In contrast, Saussure offers a 'dyadic' or two-part model. Peirce and Saussure also used the term 'symbol' differently. For Peirce, a symbol is ‘a sign which refers to the object that it denotes by virtue of a law, usually an association of general ideas that operates to cause the symbol to be interpreted as referring to that object’ (Peirce 1931-58, 2.486). Peirce thus characterizes linguistic signs in terms of their conventionality. Peirce adds the term ‘icon’ as part of representation and declares that an iconic sign represents its object by its similarity (2.276). A sign is an icon 'insofar as it is like that thing and used as a sign of it' (2.296). Peirce adds that 'every picture (however conventional its method)' is an icon (2.279). Icons have qualities that 'resemble' those of the objects they represent, and they 'excite analogous sensations in the mind' (2.211). For Peirce, icons included 'every diagram, even although there be no sensuous resemblance between it and its object, but only an analogy between the relations of the parts of each' (2.311). Even the most 'realistic' image is not a replica or even a copy of what is depicted, according to Peirce (2.375).

The Classification of the Term ‘Sign’: Peirce presents a semiotic triangle to classify the term ‘sign.’ The triangle consists of the Representamen, the Interpretant, and the Object: the Representamen is the form the sign takes; the Interpretant is the sense made of the sign; and the Object is that to which the sign refers (1931-58, 3.399). Each of the categories of signs is loosely defined, according to Mai (2002). Peirce further breaks each category of the semiotic triangle, and awareness itself, into Firstness, Secondness and Thirdness. Firstness is the first awareness of a thing, separated by our perception from the sensuous manifold. A Firstness exists either dependent on its being in the mind of some person, or in the form of sense or thought. The next awareness is the ‘resistance,’ an interruption of the initial awareness. This interruption, which calls us to an awareness of the state of our awareness, is a Secondness. Thirdness is the relationship between Firstness and Secondness; it is the relationship of linguistic signs. Meaning is not inherent, according to Peirce, but something one makes from signs (2.489). As a result, expression, whether verbal, written, or otherwise, is the awareness of awareness, which is the Secondness of Firstness, a Thirdness. Figure 1 represents the ‘semiotic triangle’ classification of signs based on Peirce’s theory; Chandler (1999) provides a representation of this relationship in the form of a concept map. Although Peirce classifies the terms ‘Interpretant’ and ‘Representamen’ in his semiotic triangle, we will examine the term ‘Object.’

Figure 1. The ‘semiotic triangle’ classification of signs, after Peirce.
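Later in this paper, each concept map is checked for whether it explicates the three components of the triangle. Purely as an illustration of such a coding scheme (the class and field names below are our own hypothetical labels, not Peirce's notation), one coding record might look like:

```python
from dataclasses import dataclass

# Hypothetical coding record for one concept map, after Peirce's triad.
@dataclass
class SignCoding:
    representamen: bool  # is the form the sign takes shown?
    interpretant: bool   # is the sense made of the sign described?
    object: bool         # is the referent of the sign identified?

    def fully_explicated(self) -> bool:
        """A fully explicated sign addresses all three components."""
        return self.representamen and self.interpretant and self.object

# A map that shows form and referent but leaves the Interpretant implicit:
coded = SignCoding(representamen=True, interpretant=False, object=True)
print(coded.fully_explicated())  # False
```

Coding each map this way makes the later tallies (e.g., how often the Interpretant is addressed) a matter of simple counting.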

According to Mai (2002), Peirce provides a framework for studying how the meaning of signs is generated, interpreted, and represented within the field of information science.

Concept Mapping: History: Concept mapping was developed by Novak and Gowin (1984) to provide a better tool for lecturers, teachers and their students. Jonassen et al. (1993) described concept maps as “representations of concepts and their interrelationship that are intended to represent the knowledge structures that humans store in their minds.” We use Jonassen et al.’s definition because of its close connection to Peirce’s.

Current Research: The majority of the studies we reviewed come from the field of education. In most of these cases, researchers were concerned with how students use and learn basic terms, and were primarily interested in the terms and ideas users generated through concept mapping.

Most research related to the subject of this study: We discovered two studies that consider the term ‘sign’ together with concept mapping: Miller and Riechert (1994) and Priss (2004). Miller and Riechert (1994) used concept mapping to find and describe themes and categories in large bodies of text. Priss (2004) used formal concept analysis to study the structure of programming languages, examining the correlation between structured programming languages and concept mapping with regard to Peirce’s definition of signs.

Summary: Many researchers in the field of education use concept mapping as a technique to improve student comprehension skills. Even though many information science researchers use concept mapping as a presentation tool, few use it as a meta-theoretical practice to classify and organize information.

4. Study Findings: We examined the sixth and eighth ISKO conference proceedings, which together included 113 papers. Of the 54 papers examined from the 2000 proceedings, 13 used concept mapping, representing 22%. Of the 54 articles from the 2004 proceedings, 12 used concept mapping strategies in their presentation, representing 22.3%. It is interesting to note that the percentages from the two proceedings are nearly identical, indicating that concept mapping was a viable element of cognitive processing at both conferences. Further analysis showed that the majority of the authors who used concept mapping used more than one concept mapping strategy: of the 14 presenters from the 2004 proceedings, 9 used more than two concept mapping strategies in their articles, while of the 13 presenters who used concept mapping in the 2000 proceedings, only 4 used more than two strategies. To determine the type of concept mapping used, we employed Kremer’s (1994) definition, which differentiates between ‘formal’ and ‘informal’ concept mapping; informal concept mapping was used the most, accounting for 90% of total use. The classification of the ‘sign’ was examined using Peirce’s triangle representation, consisting of the Representamen, the Interpretant and the Object. Table 1 presents the concept mappings found in the sixth and eighth ISKO conference proceedings based on Peirce’s triangle representation.

                             Representamen   Interpretant   Object
2000 Conference Proceedings      28/28           8/28        28/28
2004 Conference Proceedings      24/24           5/24        24/24
Total                            59/59          13/59        59/59
Table 1. The concept mapping representation based on Peirce’s triangle.

We discovered that the ‘Interpretant’ category was employed by few researchers: only 13 of the 59 representations. Only those 13 concept mappings provided a detailed description and graphic representation alongside the Representamen and Object. Next, we examined Firstness, Secondness and Thirdness in order to find the number of terms provided in these domains. Table 2 presents Peirce’s classifications based on Firstness, Secondness and Thirdness.

                             Firstness   Secondness   Thirdness
2000 Conference Proceedings     108          475          71
2004 Conference Proceedings     247          481          43
Total                           355          956         114
Table 2. The number of terms found in concept mapping under Firstness, Secondness and Thirdness.

The total number of terms and key concepts classified under Firstness was 355: 108 from the 2000 proceedings and 247 from the 2004 proceedings. The total under Secondness was 956, with 475 terms and concepts deriving from the 2000 proceedings and 481 from the 2004 proceedings. The total for the Thirdness category was 114, with 71 deriving from the 2000 proceedings and 43 from the 2004 proceedings. The Thirdness category received the least attention in word description and graphic representation. In addition, we measured the distribution of the Firstness, Secondness and Thirdness categories against Bradford’s distribution law, and found a correlation between Bradford’s law and the distributions observed in the three categories (see Appendix B). However, according to Potter (1988), the Bradford distribution is not statistically exact, and additional studies need to examine its validity, even though it is used as a general rule of thumb in the field of information science (1988, 238b). We did not find the same ‘most used’ terms across the three categories. The ‘most used’ term in the Secondness classification, the largest category, was ‘science’; the second ‘most used’ terms were ‘system’ and ‘Amazon’. These terms did not appear in the Firstness and Thirdness categories. Table 3 presents the most used terms in Firstness, Secondness and Thirdness.

Most used terms
Firstness                     Secondness        Thirdness
Map 3.74%                     Science 2.64%     Refers 1.26%
Concept 2.80%                 System 1.28%      Access 0.89%
Buy/Sell/Subclass 1.87%       Amazon 1.01%      NR, NT, RT 0.56%

Table 3. Most used terms based on Firstness, Secondness and Thirdness
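The Bradford-style comparison reported above can be sketched computationally: rank the terms by frequency, cut the ranking into zones of roughly equal total yield, and check whether the zone sizes grow roughly geometrically (1 : n : n²), as Bradford's law predicts. A minimal sketch, using invented frequencies rather than the study's data:

```python
def bradford_zones(freqs, zones=3):
    """Split ranked frequencies into zones of roughly equal total yield.

    Returns the number of items in each zone; under a Bradford-like
    distribution the zone sizes grow roughly geometrically (1 : n : n^2).
    """
    freqs = sorted(freqs, reverse=True)
    target = sum(freqs) / zones
    sizes, acc, count = [], 0, 0
    for f in freqs:
        acc += f
        count += 1
        if acc >= target and len(sizes) < zones - 1:
            sizes.append(count)
            acc, count = 0, 0
    sizes.append(count)
    return sizes

# Illustrative frequencies only (not the counts from Tables 1-3):
freqs = [8] * 2 + [5] * 6 + [3] * 18 + [1] * 54
print(bradford_zones(freqs))  # [10, 20, 50]
```

Here the zone sizes 10 : 20 : 50 grow roughly geometrically, the pattern one would look for when comparing a term distribution to Bradford's law.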

We found that the majority of the concept mapping strategies examined lack the specificity of context and the precision of description of Peirce’s Thirdness category. The total number of terms and concepts used to describe the Thirdness category, as Peirce formulated it in terms of the relationship between Firstness and Secondness, was a mere 114, in contrast to 355 and 956 terms in the Firstness and Secondness categories. Thellefsen (2000) notes that most researchers focus mainly on the Firstness and Secondness categories without defining the relationship between them. In addition, we found that most articles did not describe the relationship between the Firstness and Secondness categories, let alone consider the context within which their concepts were formulated, and there was a complete lack of regard for Interpretant formulations in the semiotic triangle. We also considered which kinds of concept mapping strategies were most used by presenters in knowledge organization. Table 4 presents the types of concept mapping most used. We found five types: hierarchy, spider, flowchart, system, and picture. Although Kremer (1997) provides highly specific examples of concept mapping (vis-à-vis programming languages), we believe it is suitable to apply Kremer’s configuration in the present study because concept mapping is a generic cognitive strategy. The majority of the concept mappings we discovered were hierarchical in nature. It must be noted, however, that we did not employ strict guidelines and rules to classify concept mapping strategies, because the articles reviewed did not provide enough evidence of the cognitive framing strategies ostensibly employed.

Most used concept mapping types
                             Hierarchy   System   Picture
2000 Conference Proceedings     15          9        4
2004 Conference Proceedings     11          5        2

Table 4. The most used types of concept mapping in ISKO proceedings.

To sum up, we did not find significant or meaningful relationships between the Thirdness category and the Interpretant classification in the concept mappings we identified. In both conferences, authors using concept mapping did not provide word descriptions or graphic representations that would correlate with the Thirdness or Interpretant categories. Because of this, we were unable to discover noticeable patterns of author-specific metaphorical style and author-specific conceptual processes with regard to concept mapping. It is apparent that most researchers used concept mapping only as a presentation tool, specifying concepts without defining them in an explicit or categorical manner.

Summary: We were unable to discover a relationship between the styles and procedures of concept mapping employed in the ISKO proceedings and Peirce’s definition of the term ‘sign’. The concept mappings examined lacked a recognizable correlation to the Thirdness category and the Interpretant classification. We observed that most researchers in the proceedings did not explicate their concepts (or terms) to the extent that an analysis and evaluation of a Thirdness category, and a comparative analysis of its fit with Firstness and Secondness, could be carried out. The Thirdness category, as Thellefsen (2001) argues, is an important element in Peirce’s classification and helps us to better understand the term ‘sign’: “We discover that there is a great amount of knowledge buried in the third trichotomy” (2001, 10). Thellefsen’s argument makes us wonder why ISKO researchers did not explicate their terms to the extent that a relationship between the ‘sign’ and the object and its nature could be analyzed or discovered. Perhaps presenters did not feel the need to fully explicate the meta-theoretical underpinnings of their research. We found that the distributions of Firstness, Secondness and Thirdness correlate with Bradford’s law of distribution, although we examined only one conference series. Of course, Bradford’s law requires that researchers examine more than one field and journal in order to apply the formula properly (Bradford, 1934). We should also note that we were unable to locate a study that examined the relationship between Thirdness and the Interpretant as it might be applied to a concept mapping paradigm. Because this is the first known attempt to analyze concept mappings in terms of Peirce’s classifications and categorizations, we recommend a more extensive analysis of the various cognitive strategies employed in future conference proceedings.
Our analysis is, as far as we know, the first of its kind, and it should help establish better guidelines for analyzing the concept mapping strategies employed by researchers in the field of knowledge organization. More sophisticated methodological analysis will be necessary to investigate the practical concerns raised here, and should also prove useful in determining how Peirce’s philosophy might be applied to knowledge organization so that concept mapping can be used effectively by conference presenters.

Acknowledgment: I would like to thank Dr. Richard Smiraglia for his guidance and assistance with this research.

5. Works cited:
Bradford, S.C. (1934). Sources of information on specific subjects. Engineering 137: 85-86.
Chandler, D. (1999). Semiotics for beginners. London: Routledge.
ISKO (2000). Dynamism and stability in knowledge organization: Proceedings of the sixth international ISKO conference, Toronto, Canada. Beghtol, C. (ed.). Würzburg: Ergon Verlag.
ISKO (2004). Knowledge organization and the global information society: Proceedings of the eighth international ISKO conference, London, UK. McIlwaine, I.C. (ed.). Würzburg: Ergon Verlag.
Jonassen, D., Beissner, K., & Yacci, M. (1993). Structural knowledge. Hillsdale, NJ: Lawrence Erlbaum Associates.
Kremer, R. (1994). Concept mapping: informal to formal. ICCS: Proceedings of the international conference on conceptual structures, University of Maryland. Retrieved 11/29/2004 from http://pages.cpsc.ucalgary.ca/~kremer/papers/ICCS94.html
Kremer, R. (1997). Constraint graphs: A concept map meta-language. PhD dissertation, University of Calgary, Department of Computer Science.
Mai, J-E. (2002). Semiotics and indexing: an analysis of the subject-indexing process. Journal of Documentation 57(5): 591-622.
Minsky, M. (1975). A framework for representing knowledge. In Winston, P. (ed.), The psychology of computer vision. New York: McGraw-Hill.
Novak, J.D. & Gowin, D.B. (1984). Learning how to learn. New York: Cambridge University Press.
Peirce, C.S. (1998). What is a sign? In Houser, N., et al. (eds.), The essential Peirce: Selected philosophical writings, Vol. 2 (1893-1913). Bloomington: Indiana University Press, pp. 483-491.
Peirce, C.S. (1931-58). Collected papers (8 vols.). Hartshorne, C., Weiss, P., & Burks, A.W. (eds.). Cambridge, MA: Harvard University Press.
Potter, W.G. (1988). Of making many books there is no end: bibliometrics and libraries. The Journal of Academic Librarianship 14: 238a-238c.
Priss, U. (2004). Signs and formal concepts. Concept lattices: Second international conference on formal concept analysis, pp. 28-38.
Sowa, J.F. (2000). Semantic networks. In Shapiro, S.C. (ed.), Encyclopedia of artificial intelligence. New York: Wiley (first edition 1987; revised and extended for the second edition, 1992).
Thellefsen, T. (2000). Firstness and thirdness displacement: epistemology of Peirce's sign trichotomies. AS/SA 10(10): 537-552.

Appendix B. The study findings with regard to the ‘most used’ terms in Firstness, Secondness and Thirdness in the ISKO proceedings (2000 and 2004). The calculations were done with WordMetry software.

1. Firstness ‘most used’ terms and concepts. Total number of terms 355.

Terms             # of times   Percent
Map                    4        3.74%
Concept                3        2.80%
Buy/Sell               2        1.87%
Subclass               2        1.87%
Digital                2        1.87%
Representation         2        1.87%
Centered               2        1.87%
Student                2        1.87%
Expert                 2        1.87%
Movement               2        1.87%
Physical               2        1.87%
Geography              2        1.87%
Course                 2        1.87%
Subwork                2        1.87%
Cluster                2        1.87%
Web                    2        1.87%
Research               2        1.87%
Term                   1        0.93%
Referring              1        0.93%
Internal               1        0.93%
Structures             1        0.93%

2. Secondness ‘most used’ terms and concepts. This table represents only a minority of the terms found under the Secondness category. The total number of terms was 956.

Terms             # of times   Percent
Science                8        2.64%
System                 5        1.28%
Amazon                 5        1.01%
Logic                  5        1.01%
Matter                 5        1.01%
Human                  5        1.01%
Individual             4        0.80%
Classification         4        0.80%
Information            4        0.80%

Engineering            4        0.80%
Photo                  4        0.80%
Vision                 4        0.80%
Ontology               4        0.80%
Arts                   4        0.80%
Society                4        0.80%
Medicine               3        0.60%
Database               3        0.60%
Microbiology           3        0.60%
Computer               3        0.60%
Community              3        0.60%
Knowledge              3        0.60%
Sign                   3        0.60%
Movement               3        0.59%
Force                  3        0.59%
Microbiology           3        0.59%
Community              3        0.59%
Language               2        0.40%
Network                2        0.40%
Agriculture            2        0.40%
XML                    2        0.40%
Base                   2        0.40%

3. Thirdness ‘most used’ terms. The total number of terms was 114.

Terms             # of times   Percent
Refers                 5        1.26%
Access                 4        0.89%
NR, NT, RT             1        0.56%
Att.                   1        0.56%
Facet                  1        0.56%
Architecture           1        0.56%
Top                    1        0.56%
Passages               1        0.56%

Chaim Zins

Knowledge Map of Information Science: Issues, Principles, Implications

Abstract: The study, "Knowledge Map of Information Science: Issues, Principles, Implications", explores the theoretical foundations of information science. It maps the conceptual approaches for defining "data", "information", and "knowledge", maps the major conceptions of Information Science, portrays the profile of contemporary Information Science by documenting 28 classification schemes compiled by leading scholars during the study, and culminates in developing a systematic and scientifically based knowledge map of the field, one grounded on a solid theoretical basis. The study was supported by a research grant from the Israel Science Foundation (2003-2005). The scientific methodology is Critical Delphi. The international panel was composed of 57 leading scholars from 16 countries who represent nearly all the major sub-fields and important aspects of the field.

Introduction The field of Information Science (IS) is constantly changing. Therefore, information scientists are required to regularly review – and if necessary redefine – the fundamental building blocks of the field. The study, "Knowledge Map of Information Science", explores the theoretical foundations of information science. It maps the conceptual approaches for defining "data", "information", and "knowledge" (D-I-K), maps the major conceptions of Information Science, portrays the profile of contemporary Information Science by documenting 28 classification schemes compiled by leading scholars during the study, and culminates in developing a systematic and scientifically based knowledge map of the field, one grounded on a solid theoretical basis. The study produces four papers: (1) Conceptual Approaches for Defining 'Data', Information', and 'Knowledge'; (2) Conceptions of Information Science; (3) Classification Schemes of Information Science: 28 Scholars Map the Field; (4) Knowledge Map of Information Science.

Methodology The scientific methodology is Critical Delphi. Critical Delphi is a qualitative research methodology aimed at facilitating critical and moderated discussions among experts (the panel). The international and intercultural panel is composed of 57 participants from 16 countries. It is unique and exceptional, comprising leading scholars who represent nearly all the major sub-fields and important aspects of the field (see Appendix I). The indirect discussions were anonymous and were conducted in three successive rounds of structured questionnaires. The first questionnaire contained 24 detailed and open-ended questions covering 16 pages, the second 18 questions in 16 pages, and the third 13 questions in 28 pages. The return rates were relatively high: 57 scholars (100%) returned the first round, 39 (68.4%) returned the second round, and 39 (68.4%) returned the third round. Forty-three panelists (75.4%) participated in two rounds (i.e., R1 and (R2 or R3)), and 35 panelists (61.4%) participated in all three rounds. In addition, each participant received his/her own responses that I initially intended to cite in future publications, together with the relevant critical reflections. Forty-seven participants (82.4%) responded and approved their responses; twenty-three of them (48.9% of the 47, and 40.3% of the entire panel) revised their original responses. Therefore, one can say that the critical process (the study) was actually composed of four rounds.

"Data", "Information", and "Knowledge" Anthropological document. Forty-five scholars formulated about 130 definitions. This collection of definitions is an invaluable "anthropological document" that records the conceptions of D-I-K as they are understood by leading scholars in the information science academic community. The definitions provide the basis for mapping the various conceptual approaches for defining "data", "information", and "knowledge" in the context of IS. Metaphysical vs. non-metaphysical approaches. The most basic distinction is between metaphysical and non-metaphysical approaches. Metaphysical approaches refer to data, information, or knowledge as metaphysical phenomena. Obviously, for Information Science, all the panel members unanimously adopt non-metaphysical approaches. Human-exclusive vs. non-exclusive approaches. Non-metaphysical approaches are divided into those exclusively centered on humans and those that ascribe the D-I-K phenomena also to non-human biological phenomena (e.g., animals and plants) and/or to physical phenomena (e.g., planets, robots). Apparently, nearly all the panel members adopt human-exclusive approaches for defining D-I-K in the context of information science. Human-centered approaches. Three classifications emerge as highly relevant. The first is between cognitive-based exclusive and non-exclusive approaches. The second is between 'propositional' exclusive and non-exclusive approaches. The third is between the subjective domain and the objective, or rather universal, domain. The mainstream of the field. Undoubtedly, the most common conceptual approach, representing the mainstream of the field, is the non-metaphysical, human-centered, cognitive-based, propositional approach. Models for defining D-I-K.
The third division, which is the division between the subjective domain (SD), namely, D-I-K as inner phenomena bound in the mind of the individual knower, and the universal domain (UD), namely, D-I-K as external phenomena to the mind of the individual knower, establishes the theoretical ground for formulating five generic models for defining D-I-K (see figure 1). The first model is UD: D-I; SD: K; meaning: D-I are external phenomena; K are internal phenomena. This model is the most common one. It underlies the rationale of the name "Information Science"; that is, Information Science is focused on exploring data and information, which are external phenomena. IS does not explore knowledge, which is an internal phenomenon. The second model is UD: D; SD: I-K; meaning: D are external phenomena; I-K are internal phenomena. The third model is UD: D-I-K; SD: I-K; meaning: D are external phenomena; I-K phenomena can be in both domains, external or internal. The fourth model is UD: D-I; SD: D-I-K; meaning: D-I phenomena can be in both domains, external or internal; K phenomena are internal. The fifth model is UD: D-I-K; SD: D-I-K; meaning: D-I-K phenomena can be in both domains, universal (i.e., external) or subjective (i.e., internal).

Model 1:  UD = D, I       SD = K
Model 2:  UD = D          SD = I, K
Model 3:  UD = D, I, K    SD = I, K
Model 4:  UD = D, I       SD = D, I, K
Model 5:  UD = D, I, K    SD = D, I, K

Figure 1: Five models for defining D-I-K
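The five models can also be made explicit with a small illustrative encoding (ours, not the study's notation), which makes it easy to query where a given element may reside under each model:

```python
# The five generic D-I-K models (UD = universal/external domain,
# SD = subjective/internal domain), encoded for illustration only.
MODELS = {
    1: {"UD": {"D", "I"},      "SD": {"K"}},
    2: {"UD": {"D"},           "SD": {"I", "K"}},
    3: {"UD": {"D", "I", "K"}, "SD": {"I", "K"}},
    4: {"UD": {"D", "I"},      "SD": {"D", "I", "K"}},
    5: {"UD": {"D", "I", "K"}, "SD": {"D", "I", "K"}},
}

def domains_of(model, element):
    """Return the domains in which an element can occur under a model."""
    return sorted(d for d, elems in MODELS[model].items() if element in elems)

# Model 1, the most common: knowledge is an internal phenomenon only.
print(domains_of(1, "K"))  # ['SD']
```

The encoding mirrors the text: under Model 1, data and information are external while knowledge is internal, which is the rationale behind the name "Information Science".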

Conceptions of Information Science Anthropological document. Fifty scholars formulated fifty definitions of "information science". This collection of definitions is an invaluable "anthropological document" that records the conceptions of Information Science as they are understood by leading scholars in the information science academic community. Key issues. Based on the panel discussions, conceptions of Information Science differ mainly on three key issues: What are the explored phenomena? What is the domain of the field? What is the scope of the exploration? Explored phenomena. The definitions provide four different foci: data vs. information vs. knowledge vs. message. Analysis of the panel's definitions of D-I-K-M made it clear, however, that the wording can be deceptive; panel members often misused the terminology. Therefore, I adopt the ad-hoc position that IS explores D-I-K-M phenomena, however defined and in whatever relation to each other, without differentiating among them. Domain. What is the domain of the field? Three different foci emerge: culture vs. technology vs. hi-tech. Hi-tech (i.e., computer-based technology) is a subcategory of technology (i.e., the physical tools developed by humans to meet their needs), and technology is a subcategory of culture (i.e., overall human activity and creativity in the social context). In fact, the panel endorses only the cultural and the hi-tech approaches; the technological approach is rather theoretical. Nearly all the panel members follow the cultural approach, which can therefore be characterized as representing the mainstream of the field. To summarize, theoretically there are three approaches regarding the domain of the field (i.e., culture vs. technology vs. hi-tech).
However, the real dilemma is between the cultural and the hi-tech approaches, and the cultural approach seems to represent the mainstream of the field. Scope. The third issue is determining the scope of the exploration. Two approaches emerged: mediating aspects vs. all the aspects of the explored phenomena. Does IS explore the mediating aspects of D-I-K-M phenomena, namely those aspects involved in facilitating the connection between the D-I-K-M originators and users, or does it explore all the aspects of D-I-K-M? Six conceptions. Resolving the three issues is crucial, for they underlie six generic conceptions, or models, of Information Science (see Figure 2). The six models of Information Science are: The Hi-Tech Model. Information Science is the study of the mediating aspects of D-I-K-M phenomena as they are implemented in the hi-tech domain. The Technology Model. Information Science is the study of the mediating aspects of D-I-K-M phenomena as they are implemented in the technological domain in general (i.e., all types of technologies). The Culture Model. Information Science is the study of the mediating aspects of D-I-K-M phenomena as they are implemented in the cultural domain. The Human World Model. Information Science is the study of all the aspects of D-I-K-M phenomena as they are implemented in the human realm. The Living World Model. Information Science is the study of all the aspects of D-I-K-M phenomena as they are implemented in the living world, human and non-human. The Living & Physical Worlds Model. Information Science is the study of all the aspects of D-I-K-M phenomena as they are implemented in all types of biological organisms, human and non-human, and all types of physical objects.

Explored phenomena (all models): Data, Information, Knowledge, Message (D-I-K-M)

Scope        Model                                Domain
Mediating    Model (1) Hi-Tech                    Focusing on the mediating aspects of D-I-K-M as they are implemented in computer-based technologies
Mediating    Model (2) Technology                 Focusing on the mediating aspects of D-I-K-M as they are implemented in all types of technologies
Mediating    Model (3) Culture/Society            Focusing on the mediating aspects of D-I-K-M as they are implemented in human societies
Inclusive    Model (4) Human World                Focusing on all aspects of D-I-K-M as they are implemented in the human realm
Inclusive    Model (5) Living World               Focusing on all aspects of D-I-K-M as they are implemented in the living world
Inclusive    Model (6) Living & Physical Worlds   Focusing on all aspects of D-I-K-M as they are implemented in all types of biological organisms, human and non-human, and all types of physical objects

Figure 2: A map of Conceptions of Information Science
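Figure 2 amounts to a classification of the six conceptions by scope and domain. A small illustrative encoding (the labels are our own shorthand) makes the mediating/inclusive split directly queryable:

```python
# The six conceptions of Information Science as (scope, domain) pairs,
# paraphrasing Figure 2; an illustrative encoding, not the study's notation.
CONCEPTIONS = {
    "Hi-Tech":                  ("mediating", "computer-based technologies"),
    "Technology":               ("mediating", "all types of technologies"),
    "Culture/Society":          ("mediating", "human societies"),
    "Human World":              ("inclusive", "the human realm"),
    "Living World":             ("inclusive", "the living world"),
    "Living & Physical Worlds": ("inclusive", "living and physical worlds"),
}

mediating = [m for m, (scope, _) in CONCEPTIONS.items() if scope == "mediating"]
inclusive = [m for m, (scope, _) in CONCEPTIONS.items() if scope == "inclusive"]
print(mediating)  # ['Hi-Tech', 'Technology', 'Culture/Society']
```

The split into two groups of three is exactly the mediating-vs-inclusive division discussed next.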

Mediating models vs. inclusive models. The six conceptions are divided into two major groups: the mediating conceptions vs. the inclusive conceptions. To summarize, according to the three mediating conceptions, Information Science is one field next to other fields that explore the various perspectives of the D-I-K-M phenomena, while according to the three inclusive conceptions, "Information Science" is the name of a generic field comprising all the other fields that explore the various perspectives of the D-I-K-M phenomena. Six Information Sciences. The six models imply six different bodies of knowledge. Consequently, they establish six different fields of knowledge, all carrying the same name, "Information Science". No wonder that scholars, practitioners, and students are confused. In the study, the Hi-tech model, the Culture model, the Human World model, and the Living World model emerged as the more significant. The vast majority of the panel responses represent the culture model. Although the study is qualitative, it seems that the culture model represents the mainstream of contemporary Information Science.

Schemes of Information Science Anthropological document. Twenty-eight scholars compiled classification schemes. This unique and invaluable collection portrays and documents the profile of contemporary Information Science at the beginning of the 21st century. Formulating the schemes. The process of formulating the schemes was exhaustive. It consisted of three steps. First, in the second round, each panel member was asked to compile a knowledge map of the field, or rather a classification scheme, representing his/her conception of Information Science. Next, in the third round, the investigator presented the panel's schemes, and each participant was asked to comment on the various schemes, select the one that best represented his/her position, and revise the scheme that he/she had developed in the second round. Finally, the investigator sent personal letters to the authors of the schemes. Each letter included the panel's relevant reflections and, if applicable, critical comments. Once again, each author was asked to revise his/her scheme. Types of classifications. The collection is composed of different types of classification schemes; note that the various types are not exclusive. Most of the schemes are taxonomies; namely, they are grounded on empirical data. A few are typologies; namely, they are based on conceptual analysis of the constitutive concept(s). Most of the schemes are subject classification schemes designed for practical purposes. A few classifications are 'ontologies'; namely, they are meant to divide the relevant phenomenon into its key attributes, characteristics, or facets. The mainstream Information Sciences. Twenty-six schemes reflect the culture model (see above). One scheme represents the Living & Physical Worlds model, and one scheme is too general and can represent any model. Evidently, the culture model represents the mainstream of contemporary Information Science. Knowledge Science.
The study substantiates the suggestion to change the name of the field from "Information Science" to "Knowledge Science". More than twenty schemes include the concept "knowledge" as a main category or a sub-category of the field. Conclusions. To summarize, based on the panel's diversified schemes it is evident that the culture model represents the mainstream of contemporary Information Science; that is, Information Science is the study of the mediating aspects of D-I-K-M phenomena – mutatis mutandis – as they are implemented in the cultural (i.e., social) domain. Apparently, the suggestion to change the focus of the field – as well as its name – from information to knowledge is supported by a growing number of scholars.

Knowledge Map of Information Science

Ten major categories. Finally, the study culminates in developing a systematic and scientifically valid knowledge map of the field; a map which is grounded in a solid theoretical basis. The map has ten basic categories: (1) Foundations, (2) Resources, (3) Knowledge Workers, (4) Contents, (5) Applications, (6) Operations & Processes, (7) Technologies, (8) Environments, (9) Organizations, and (10) Users. The ten categories are divided into two groups. The first group, which has one category, Foundations, is composed of the meta-knowledge of the field. The second group, which has nine categories, 2 through 10, is composed of the essential body of knowledge on the explored phenomena.

Foundations. The first category, Foundations, includes the meta-knowledge of the field of information science, namely the theoretical and the methodological foundations of the field, as well as Information Science education, and the history of the field.

Categories 2–10. Categories 2 through 10 are deduced from the conception of information science as the study of the mediating aspects of human knowledge in the collective domain. Based on a phenomenological analysis of the phenomena of mediating universal knowledge (i.e., as it is embodied in physical objects), one can identify at least nine basics: Resources, Knowledge Workers, Contents, Applications, Operations & Processes, Technologies, Environments, Organizations, and Users. This is based on the following rationale. Information Science explores the various conditions relevant for connecting resources (section 2) with users (section 10). This involves seven constituents (sections 3 through 9).

The Model

Overview. The three-phase research methodology produced a ten-facet hierarchical model, grounded in a solid theoretical basis. The ten facets are (1) Foundations, (2) Resources, (3) Knowledge Workers¹, (4) Contents, (5) Applications, (6) Operations & Processes, (7) Technologies, (8) Environments, (9) Organizations, and (10) Users (see Figure 3). Most facets are composed of a three-level hierarchical structure, as for example, Foundations (1st level) – Theory (2nd level) – Conceptions (3rd level). In many cases the third level is not fully developed, and is left for further development in future studies by the IS academic and professional community; for example, Operations & Processes (1st level) – Types (2nd level) – Production, Documentation, Representation, Dissemination, Storage, Retrieval, Use (3rd level). In some cases, the classification is refined by adding further levels of topical sub-division, as in the following case: Organizations (1st level) – Types (2nd level) – Functional Type (3rd level) – Memory Organizations (4th level) – Libraries, Archives, Museums (5th level). The ten categories are divided into two groups. The first group, which has one category, Foundations, is composed of the meta-knowledge of the field. The second group, which has nine categories, 2 through 10, is composed of the essential body of knowledge on the explored phenomena, which are the mediating perspectives and conditions of human knowledge in the universal domain.

Meta-knowledge. The Foundations section is unique. It includes the meta-knowledge of the field of information science. Its rationale rests on philosophical grounds rather than on the phenomenological analysis of information science, as is the case with sections 2 through 10. The necessity of a specific meta-knowledge section is derived, as a philosophical implication, from Kurt Gödel's Incompleteness Theorem (Gödel, 1931). From Gödel's theorem one can conclude that it is logically impossible to form an axiomatic system without assuming additional postulates. By accepting this implication, we realize that it is theoretically impossible to formulate a self-sufficient explanation based exclusively on the phenomenological analysis of information science. Consequently, an additional meta-knowledge section, which in the model is titled “Foundations,” is a necessary basis in the knowledge construction of the field. Meta-knowledge is knowledge on knowledge. It includes epistemological, methodological, conceptual, theoretical, historical, and practical postulates, principles and guidelines regarding the relevant body of knowledge (Zins and Guttman, 2003).

Nine basics of information science. As noted, sections 2 through 10 are based on the phenomenological analysis of information science.
Information science by its very essence is a social science. It is the study of the mediating conditions and perspectives of human knowledge in the universal domain (i.e., as it is embodied in physical objects). Based on a phenomenological analysis of the phenomena of mediating universal knowledge one can identify nine basics of information science. These are: Resources, Knowledge Workers, Contents, Applications, Operations & Processes, Technologies, Environments, Organizations, and Users. The nine elements are based on the following rationale. Information Science explores the various conditions relevant for connecting resources (section 2) with users (section 10). This involves seven constituents (sections 3 through 9): the knowledge worker (e.g., information professionals, librarians, archivists) – section 3; the content (e.g., bio-medical informatics, educational information, etc.) – section 4; the application (e.g., searching, shopping, socializing) – section 5; the operation and process (e.g., documentation, representation, organization, processing, manipulation, storing, dissemination, and retrieval of knowledge) – section 6; the technology/medium (e.g., paper, HTML, XML, etc.) – section 7; the environment (e.g., American, European, Internet, etc.) – section 8; and the organization (e.g., libraries, archives, information services, etc.) – section 9. Sections 3 through 9 represent seven building blocks of the mediating process. To simplify the explanation of their order let us group them into two parallel sets of characteristics. The first set follows the 'Who-What-Why-How-Where & When' order. The second set follows the equivalent '6 Ms' order, which is 'Mediator-Matter-Motive-Method-Means, and Milieu'. The mediating process is characterized by answering the following questions: Who mediates (i.e., the mediator)? – the knowledge workers (section 3); What is being mediated (i.e., the matter)? – the contents (section 4); Why is it mediated (i.e., the motive)? – the application (section 5); How is it mediated (i.e., the method, and the means)? – The method is the relevant operation or process (section 6), and the means is the relevant technology (section 7). Where and when does the mediating process happen (i.e., the milieu)? – the environment (section 8), and the organization (section 9).

Theory and embodiment. The ten-facet map has a dual pattern of a theory-praxis structure, or rather a theory-embodiment structure. The theory-embodiment structure is implemented in the ten-facet map as a whole, as well as in each of its ten sections. In the ten-facet map the theory constituent is implemented in the Foundations section, while the embodiment constituent is implemented in sections 2 through 10. Now let us zoom into the ten sections. The Foundations section, which is the theory constituent of the map, is itself divided into a theory constituent (i.e., the Theory category) and an embodiment constituent (i.e., the Research, the Education, and the History categories). Each one of the nine sections, 2 through 10, which are the embodiment constituent of the map, is itself divided into a theory constituent (i.e., the Issues category) and an embodiment constituent (i.e., the Types category).

Significance. The model establishes the ground for formulating theories of Information Science. To be specific, theories of IS should strive to explain the complex process of mediating human knowledge, that is, bringing the knowledge from the originator to the user. The model paves the way for developing and evaluating Information Science academic programs, as well as for developing systematic bibliographic resources.
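The faceted structure described above lends itself to a simple nested representation. The following is a rough sketch (a hypothetical rendering in Python, not part of the study itself); facet and category names follow the text, and empty entries mark facets whose subdivisions the text leaves for future development:

```python
# Sketch of the ten-facet Knowledge Map of Information Science as a nested dict.
knowledge_map = {
    "Foundations": {"Theory": {"Conceptions": {}},
                    "Research": {}, "Education": {}, "History": {}},
    "Resources": {},
    "Knowledge Workers": {},
    "Contents": {},
    "Applications": {},
    "Operations & Processes": {
        "Types": {t: {} for t in ["Production", "Documentation", "Representation",
                                  "Dissemination", "Storage", "Retrieval", "Use"]},
    },
    "Technologies": {},
    "Environments": {},
    "Organizations": {
        "Types": {"Functional Type": {"Memory Organizations": {
            "Libraries": {}, "Archives": {}, "Museums": {}}}},
    },
    "Users": {},
}

def depth(tree):
    """Number of hierarchical levels in a subtree (an empty dict has depth 0)."""
    return 0 if not tree else 1 + max(depth(sub) for sub in tree.values())

assert len(knowledge_map) == 10
# Organizations (1st) - Types (2nd) - Functional Type (3rd)
#   - Memory Organizations (4th) - Libraries, Archives, Museums (5th)
assert depth({"Organizations": knowledge_map["Organizations"]}) == 5
```

The depth function mirrors the level numbering used in the text: a top-level facet counts as the 1st level.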

Figure 3: Knowledge Map of Information Science - Rationale

Concluding Remarks

This study maps the major issues on the agenda of scholars engaged in exploring and substantiating the foundations of Information Science. Approaches and models were identified and formulated for defining "data", "information", "knowledge", "message", and "Information Science"; 28 classification schemes compiled by leading scholars were analyzed; and a systematic and comprehensive knowledge map of the field was formulated. This might help the reader come to a better understanding of the issues and the considerations involved in establishing the foundations of Information Science; but by no means does it replace the personal quest to ground one's positions on solid theoretical foundations.

Acknowledgement

I would like to thank the Israel Science Foundation for a research grant that made the study possible. However, what made the difference was my 57 colleagues who participated in this exhausting and time-consuming study as panel members. Their invaluable contributions have made this study really important, and I am truly grateful. The study was conducted at Bar-Ilan University.

Notes

1. I use the term "knowledge" rather than "information", since I define "information" as empirical knowledge (see Zins, in press).

Appendix I: The Panel

Dr. Hanne Albrechtsen, Institute of Knowledge Sharing, Denmark; Prof. Elsa Barber, University of Buenos Aires, Argentina; Prof. Aldo de Albuquerque Barreto, Brazilian Institute for Information in Science and Technology, Brazil; Prof. Shifra Baruchson–Arbib, Bar Ilan University, Israel; Prof. Clare Beghtol, University of Toronto, Canada; Prof. Maria Teresa Biagetti, University of Rome 1, Italy; Prof. Michael Buckland, University of California, Berkeley, USA; Mr. Manfred Bundschuh, University of Applied Sciences, Cologne, Germany; Dr. Quentin L. Burrell, Isle of Man International Business School, Isle of Man; Dr. Paola Capitani, Working Group Semantic Web, Italy; Prof. Rafael Capurro, University of Applied Sciences, Stuttgart, Germany; Prof. Thomas A. Childers, Drexel University, USA; Prof. Charles H. Davis, Indiana University; the University of Illinois, USA; Prof. Anthony Debons, University of Pittsburgh, USA; Prof. Gordana Dodig-Crnkovic, Mälardalen University, Sweden; Prof. Henri Dou, University of Aix-Marseille III, France; Prof. Nicolae Dragulanescu, Polytechnics University of Bucharest, Romania; Prof. Carl Drott, Drexel University, USA; Prof. Luciana Duranti, University of British Columbia, Canada; Prof. Hamid Ekbia, University of Redlands, USA; Prof. Charles Ess, Drury University, USA; Prof. Raya Fidel, University of Washington, USA; Prof. Thomas J. Froehlich, Kent State University, USA; Mr. Alan Gilchrist, Cura Consortium and TFPL, UK; Dr. H.M. Gladney, HMG Consulting, USA; Prof. Glynn Harmon, University of Texas at Austin, USA; Dr. Donald Hawkins, Information Today, USA; Prof. Caroline Haythornthwaite, University of Illinois at Urbana Champaign, USA; Mr. Ken Herold, Hamilton College, USA; Prof. William Hersh, Oregon Health & Science University, USA; Prof. Birger Hjorland, Royal School of Library and Information Science, Denmark; Ms. Sarah Holmes*, the Publishing Project, USA; Prof. Ian Johnson*, the Robert Gordon University, UK; Prof.
Wallace Koehler, Valdosta State University, USA; Prof. Donald Kraft, Louisiana State University, USA; Prof. Yves François Le Coadic, National Technical University, France; Dr. Jo Link-Pezet, Urfist, and University of Social Sciences, France; Mr. Michal Lorenz, Masaryk University in Brno, Czech Republic; Prof.

Ia McIlwaine, University College London, UK; Prof. Michel J. Menou, Knowledge and ICT management consultant, France; Prof. Haidar Moukdad, Dalhousie University, Canada; Mr. Dennis Nicholson, Strathclyde University, UK; Prof. Charles Oppenheim, Loughborough University, UK; Prof. Lena Vania Pinheiro, Brazilian Institute for Information in Science and Technology, Brazil; Prof. Maria Pinto, University of Granada, Spain; Prof. Roberto Poli, University of Trento, Italy; Prof. Ronald Rousseau, KHBO, and University of Antwerp, Belgium; Dr. Silvia Schenkolewski–Kroll, Bar Ilan University, Israel; Mr. Scott Seaman*, University of Colorado, Boulder, USA; Prof. Richard Smiraglia, Long Island University, USA; Prof. Paul Sturges, Loughborough University, UK; Prof. Carol Tenopir, University of Tennessee, USA; Dr. Joanne Twining, Intertwining.org, a virtual information consultancy, USA; Prof. Anna da Soledade Vieira, Federal University of Minas Gerais, Brazil; Dr. Julian Warner, Queen's University of Belfast, UK; Prof. Irene Wormell, Swedish School of Library and Information Science in Borås, Sweden; Prof. Yishan Wu, Institute of Scientific and Technical Information of China (ISTIC), China.

* Observers: panel members who did not strictly meet the criteria for the panel selection and terms of participation.

Rebecca Green College of Information Studies, University of Maryland, USA

Semantic Types, Classes, and Instantiation

Abstract: Semantic types provide a level of abstraction over particulars with shared behavior, such as in the participant structure of semantic frames. The paper presents a preliminary investigation, drawing on data from WordNet and FrameNet, into the relationship between hierarchical level and the semantic types that name frame elements (a.k.a. slots). Patterns discovered include: (1) The level of abstraction of a frame is generally matched by the level of abstraction of its frame elements. (2) The roles played by persons tend to be expressed very specifically. (3) Frame elements that mirror the name of the frame tend to be expressed specifically. (4) Some frame participants tend to be expressed at a constant (general) level of abstraction, regardless of the level of abstraction of the overall frame.

1 Introduction

What we interact with directly in the world are instances (a.k.a. instantiations, particulars, tokens). We know specific persons, drive specific vehicles, use specific computers, and eat specific fruits and vegetables. At the same time, our drive to discover knowledge is aimed largely at the class (a.k.a. type, kind) level. The power of the scientific method, which underlies much of our knowledge, lies in postulating hypotheses about classes, then supporting them by generalizing from empirical knowledge of particulars. Indeed, part of the power of culture and civilization is that a person may know much about certain classes without personal knowledge of any of their particular instances. Not surprisingly, when we store and organize knowledge, we often do so in terms of class and instance. For example, when we build a database, we first model the domain at the class level, then record data at the instance level. When we build a knowledge base, we record knowledge about both relationships between classes (e.g., ∀x (poodle(x) → dog(x))) and relationships between an instance and the class it instantiates (e.g., poodle(Fido)). When we annotate text, we tag specific word instances with labels for syntactic or semantic classes. As is evident in these knowledge organization contexts, one function of classes is to declare the semantic type of particulars, that is, to identify, from among the classes that a particular is an instance of, which of them best captures its essence. The abstraction provided by semantic types often serves to group together instances with shared behavior, use, and/or purpose (IBM, 2005). The purpose of this paper is to explore the kinds of classes that best serve as semantic types. The immediate motivation for undertaking this exploration derives from a larger project to induce frame-based knowledge representation structures automatically (Green & Dorr, 2004; Green & Dorr, 2005).
The frames correspond to states (e.g., physical states, emotional states), relationships (e.g., familial relationships, spatial relationships), and events (e.g., punctual events, durative events). Frame participants are captured by slots within the frame, with the names of frame slots indicating participant types and slot values corresponding to specific participants in an instantiated frame. For example, a Cleaning frame addresses the event that causes a change of state from something’s being dirty to its being clean. Participants in the frame would include a Theme, the thing whose state of cleanliness is at issue, an Agent who causes the change of state to occur, and an Instrument (e.g., a brush) and/or a Substance (e.g., detergent) used by the Agent to effect the cleaning process. An understanding of the character of semantic types is needed to establish frame slot names at appropriate levels of generality: Are Theme, Agent, Instrument, and Substance the best names, or might labels like Cleaner or Cleanser be more appropriate? They are certainly more informative. Analysis of semantic type data from two kinds of data sources is involved in the exploration. One data stream is drawn from lexical resources that provide information on semantic types; this data source type is represented by WordNet, the latest version of which includes thousands of instantiation links to semantic-type-like class concepts. The other data stream is drawn from resources that use semantic types internally for purposes of classification; this data source type is represented by FrameNet, a human-generated lexical database of semantic frames, where names of frame elements correspond to semantic types. (FrameNet uses “frame element” for the more conventional “frame slot.”) The investigation will be restricted to nominal categories throughout.
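The slot-and-filler arrangement just described can be sketched in a few lines of code. This is a hypothetical illustration, not FrameNet's or the larger project's actual data model; the participant descriptions and the sample sentence are invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """A semantic frame: slot names declare the semantic types of participants."""
    name: str
    slots: dict = field(default_factory=dict)   # slot name -> participant type

@dataclass
class FrameInstance:
    """An instantiated frame: slot values name the specific participants."""
    frame: Frame
    values: dict = field(default_factory=dict)  # slot name -> particular

# The Cleaning frame from the text: a change of state from dirty to clean.
cleaning = Frame("Cleaning", {
    "Theme": "thing whose state of cleanliness is at issue",
    "Agent": "causer of the change of state",
    "Instrument": "tool used by the Agent",
    "Substance": "material used by the Agent",
})

# One instantiation: Pat scrubs the pan with a brush and detergent.
event = FrameInstance(cleaning, {
    "Theme": "the pan", "Agent": "Pat",
    "Instrument": "a brush", "Substance": "detergent",
})

assert set(event.values) <= set(cleaning.slots)  # every filled slot is declared
```

The open question posed in the text is precisely which strings should serve as the slot names here: general role labels like Agent, or frame-specific ones like Cleaner.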

2 Semantic Types and Hierarchical Level

The semantic type concept has come into existence because of the need, experienced in various contexts, to recognize a limited number of general classes. For example:

The Semantic Network [of the Unified Medical Language System (UMLS)] consists of (1) a set of broad subject categories, or Semantic Types, that provide a consistent categorization of all concepts represented in the UMLS Metathesaurus, and (2) a set of useful and important relationships, or Semantic Relations, that exist between Semantic Types. (National Library of Medicine, 2006)

As this explanation indicates, the 135 semantic types of the UMLS provide an important mechanism for referring to meaningful subsets of the 1 million-plus concepts in the UMLS Metathesaurus. Further, the semantic types are the foundation over which the set of Semantic Relations is overlaid; that is, they make the establishment of the Semantic Relations possible. The term semantic type has been adopted in the discourse of various communities, for example, (general) linguistics, computational linguistics, knowledge representation, data modeling, and content management. Although the term is popular enough to be found in almost 150,000 documents on the Web (searching under “semantic type” OR “semantic types”), it is not defined even in updated online versions of unabridged dictionaries (e.g., Oxford English Dictionary, Webster’s Third New International Dictionary). This suggests either (1) a lack of consensus regarding the term’s meaning (so that a suitable definition cannot be crafted) and/or (2) the full compositionality of the term’s meaning (so that no definition is needed). Indeed, both possibilities seem to hold: The meanings of semantic and type are broad/vague enough that even a straightforward, compositional interpretation of semantic type leaves room for sets of semantic types of varying levels of generality and abstraction. The issue at hand is the identification and characterization of kinds of classes that can best serve as semantic types, so that appropriate semantic types can be generated automatically in a particular context. For purposes of this discussion, the ability to identify at least one semantic type candidate class is assumed. The question is, When multiple semantic type candidate classes are under consideration, which of these classes is the best semantic type? For the most part, when multiple candidate classes exist, they are hierarchically related. 
Indeed, it is most often the case that multiple candidates can be identified specifically because they are hierarchically related: In a true class hierarchy, every member of a more specific class is by definition a member of every broader class. Thus the issue of identifying and characterizing semantic type classes largely becomes an issue of identifying the appropriate hierarchical level(s) among the candidates.
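The UMLS arrangement quoted above can be miniaturized to show why typing a large concept set pays off: relations declared once at the type level license statements about any pair of suitably typed concepts. The following is a toy sketch; the four concepts are illustrative and not drawn from the actual Metathesaurus, though the type and relation names echo real UMLS ones:

```python
# Toy illustration of the UMLS pattern: each concept is assigned a semantic
# type, and semantic relations are declared between types, not concepts.
concept_type = {
    "aspirin": "Pharmacologic Substance",
    "ibuprofen": "Pharmacologic Substance",
    "headache": "Sign or Symptom",
    "migraine": "Disease or Syndrome",
}

# Relations hold at the type level and so cover every concept of that type.
type_relations = {
    ("Pharmacologic Substance", "treats", "Sign or Symptom"),
    ("Pharmacologic Substance", "treats", "Disease or Syndrome"),
}

def may_relate(c1, rel, c2):
    """A relation is licensed between concepts if it holds between their types."""
    return (concept_type[c1], rel, concept_type[c2]) in type_relations

assert may_relate("aspirin", "treats", "headache")
assert not may_relate("headache", "treats", "aspirin")
```

Two declared relations thus stand in for every drug/condition pair, which is how 135 semantic types can organize a million-plus concepts.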

Given the potential breadth of interpretation noted above, it is unrealistic to suppose that a one-size-fits-all characterization of semantic types with respect to hierarchical level is achievable. The exploration will thus be constrained from two perspectives arising from its motivating context. First, the relevance of particular semantic types will be evaluated against their use as frame slot names. Second, the analysis of hierarchical level will be operationalized in terms of the WordNet noun network (which is being used as the source of frame slot names).

2.1 Possible Hierarchical Level Scenarios

Several possibilities exist for associating instances with classes within a hierarchy:

• Option 1. The activity of identifying the appropriate hierarchical level(s) is at best unnecessary and at worst misleading. Any instance is likely to be a member of multiple hierarchically-related classes. Each such class may be regarded without prejudice or preference as an appropriate semantic type. • Option 2. The most appropriate semantic type for an instance is the lowest-level class to which it belongs. Membership in all broader classes is simply derivative. Considered globally, this option posits that an ontology consists of an interrelated set of class hierarchies, with instances attached to the leaf nodes of the semantic network so formed. • Option 3a. The goal of identifying classes within hierarchies that are appropriately designated as semantic types is highly reminiscent of the basic level category enterprise (Lakoff, 1987, 31-38, 46-48). Serving as a semantic type is simply one more way in which basic level categories are privileged. • Option 3b. Although it is possible to designate universal basic level categories, some of the characteristics associated with those categories (e.g., use of category labels in default situations) appear to change with expertise (Tanaka & Taylor, 1991). Semantic types may vary from basic level category designations in systematic ways. The extent of the potential variation depends on the specific context. • Option 3c. Semantic types are systematically related to hierarchical level, but do not correspond to and/or are independent of basic level categories. • Option 3d. Semantic types are not monolithic. While they are systematically related to hierarchical level, various subsets of semantic types interact with hierarchical level in different ways. • Option 4. Semantic types are not systematically related to hierarchical level.

2.2 Preliminary Review of Hierarchical Level Scenarios

Option 1 turns out to be an unsatisfying possibility. If it were true, semantic types would correspond freely to class labels at all levels, and the semantic type concept probably would not exist. Conversely, the existence of the semantic type concept reflects the probability that some class labels serve the semantic typing function better than others. Option 2 is likewise suspect. Intuitively, it seems that the semantic type concept presupposes a level of generalization inconsistent with treating very specific classes as semantic types. Even though the most specific classes are the product of generalization (as indeed all classes are), if they form the basis for semantic types, only very limited generalization would be effected. Semantic types are often used to achieve a level of abstraction inconsistent with this option. The various flavors of option 3 collectively form the hypothesis that motivates the present investigation. The automatic generation of semantic types would be easier (and perhaps the very feasibility of the enterprise depends on it) if semantic types are systematically related to hierarchical level. Unfortunately, if such a systematic relationship exists, it is not as simple as identifying in advance at most one semantic type within any given semantic hierarchy, since semantic types may themselves be organized hierarchically. For example, the UMLS semantic types are organized into two hierarchies (an entity hierarchy and an event hierarchy) that extend at points more than a half dozen levels deep. The hypothesis that semantic types correspond to basic level categories (option 3a), which are not organized hierarchically, thus appears untenable. The thrust of the investigation, then, is to determine which, if any, of options 3b, 3c, and 3d is/are supported by the data. Meanwhile, option 4 is regarded as a worst-case, last-resort conclusion, but is from the outset a viable possibility.

3 Analysis of Semantic Types

Two data sources support this investigation. The first data source is the set of class-instantiation links from WordNet; this type is used because it offers insights into semantic class relationships within the resource from which frame slot names will be drawn. The second data source is a set of hand-built semantic frames from FrameNet; this type is used because it supplies many good examples of relevant frame slot names for frames of varying levels of generality.

3.1 Semantic Types and Instantiation in WordNet

The most recent version of WordNet (http://wordnet.princeton.edu), a database of English lexical items (words, phrases) and their interrelationships, records a number of instantiation links among its synonym sets (usually referred to as synsets). The vast majority of the instance synsets correspond to proper nouns (e.g., Dardanelles, an instance of {strait, sound}).1 In contrast to the standard named entity recognition task (Sang & De Meulder, 2003), where proper nouns are assigned to only a handful of semantic types (e.g., Person, Place, Organization), in WordNet 2.1, the 7669 instantiation synsets are linked to 935 class synsets. Without the information provided by WordNet, proper nouns can shed only a small amount of light on the semantic types of the participants that they name. But for the thousands of named instances with instantiation links in WordNet, semantic typing is considerably more specific. The 935 class synsets are split almost equally between those with no other more specific subclasses (466) and those that have more specific subclasses in addition to the instance synsets (469). For example, the class synset for star has as an instance a separate synset for Pollux (among other stars); it has as well subclass synsets for giant star, lodestar, neutron star, sun, supernova, etc., which may in turn have other subclass synsets or instance synsets. That half of the class synsets have more specific subclass synsets, in addition to their instance synsets, amply refutes option 2's insistence that semantic types correspond to the lowest level in the hierarchy. At the same time, the class synsets tend to be fairly specific: 95% of them are 5 or more levels down in the WordNet noun network (WordNet hierarchies extend down as many as 15 levels, but many are far more shallow; over 60% of the class synsets are situated at 5, 6, and 7 levels down), and half of them have no more specific subclass.
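The level analysis reported here can be illustrated on a hand-built fragment of the noun network. The hypernym chain below is abridged and hypothetical; WordNet's actual chain for star passes through additional synsets:

```python
# Tiny fragment of a WordNet-like noun network: child -> parent hypernym links.
hypernym = {
    "star": "celestial body",
    "celestial body": "natural object",
    "natural object": "object",
    "object": "physical entity",
    "physical entity": "entity",     # "entity" is the root
    "neutron star": "star",          # subclass synsets of "star"
    "supernova": "star",
}
instances = {"Pollux": "star"}       # instance synset -> class synset

def level(synset):
    """Number of levels a synset sits below the root synset 'entity'."""
    d = 0
    while synset != "entity":
        synset = hypernym[synset]
        d += 1
    return d

# "star" both carries instances (Pollux) and has more specific subclasses
# (neutron star, supernova): the configuration that refutes option 2.
assert level("star") == 5
assert instances["Pollux"] == "star"
```

In this toy fragment, star sits five levels down while still having subclasses below it, matching the paper's report that class synsets are both fairly specific and often non-terminal.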

3.2 Semantic Types and Frame Elements in FrameNet

Within the FrameNet project (http://framenet.icsi.berkeley.edu) one branch of effort goes into describing semantic frames that are evoked by specific words and identifying how various frames are related to each other. Semantic frames are described in terms of obligatory (“core”) and optional (“non-core”) frame elements, with each frame element corresponding to some participant in the state, relationship, or event conveyed by the frame. Relationships between frames identified in FrameNet include, for example, inheritance (i.e., hierarchical), compositional, causative, and inchoative relationships. As of early 2006, some 780 FrameNet frames have been defined.

Hierarchically Related Frames            Frame Elements
State                                    Entity, State
    Locative_relation                    Figure, Ground
        Abounding_with                   Location, Theme
        Within_distance                  Distance, Figure, Ground
        Expected_location_of_person      Location, Person
        Containing                       Container, Contents
    Process_initial_state                Entity, State
        Activity_ready_state             Activity, Protagonist, Salient entity, State
    State_of_entity                      Entity, State
    Predicament                          Experiencer, Situation
Relation                                 Entity1, Entity2
    Accompaniment                        Coparticipant, Participant
    Idiosyncracy                         Entity, Idiosyncracy
    Partitive                            Group, Subset
Event                                    Event, Place, Time
    Becoming                             Entity, Final category, Final state
    Absorb heat                          Container, Entity, Heat source
    Getting                              Recipient, Theme
        Taking                           Agent, Source, Theme
            Theft                        Goods, Perpetrator, Source, Victim
    Intentionally act                    Act, Agent
        Using                            Agent, Instrument, Purpose, Role
            Operating a system           Operator, System
                Operate vehicle          Area, Driver, Goal, Path, Source, Vehicle
        Misdeed                          Misdeed, Wrongdoer
            Committing crime             Crime, Perpetrator
    Process_start                        Event
        Activity_start                   Activity, Agent
    Transitive action                    Agent, Cause, Event, Patient
        Damaging                         Agent, Cause, Patient
        Reshaping                        Cause, Deformer, Undergoer

Table 1. Frame Elements for Hierarchically Related Frames

An effective way of investigating the interaction of the semantic types2 represented by frame element names is by examining hierarchically related frames to see, for example, if more specific frames have more specific frame element names. Table 1 presents the hierarchical structure below three very general frames (with hierarchical relationships being expressed through indentation), along with their corresponding core frame elements.3 The following generalizations are supported (where generality in frames pertains to depth in a hierarchical structure of frame relationships and generality of frame elements pertains to the number of frames in the system that use such an element):

• The most general frames (e.g., State, Relation, Event) have correspondingly general frame elements (e.g., Entity, State, Event, Place, Time).

• General frames, i.e., those with at least as many levels below them as above them (e.g., Locative_relation, Getting, Using, Transitive_action) tend to have general frame elements (e.g., Figure, Ground, Recipient, Theme, Agent, Instrument, Purpose, Role, Cause, Event, Patient). • Specific frame elements (e.g., Protagonist, Idiosyncracy, Heat source, Goods, Perpetrator, Victim, Driver, Vehicle, Crime, Deformer, Undergoer) tend to be associated with very specific frames (e.g., Activity_ready_state, Idiosyncracy, Absorb heat, Theft, Operate vehicle, Committing crime, Reshaping).
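The generality measure used above, the number of frames employing a given element name, can be computed directly from a transcribed subset of Table 1. The counts below are only suggestive, since the sketch covers a handful of the listed frames rather than the whole FrameNet system:

```python
from collections import Counter

# Frame -> core frame elements, transcribed from a few rows of Table 1.
frames = {
    "State": ["Entity", "State"],
    "Process_initial_state": ["Entity", "State"],
    "State_of_entity": ["Entity", "State"],
    "Relation": ["Entity1", "Entity2"],
    "Event": ["Event", "Place", "Time"],
    "Getting": ["Recipient", "Theme"],
    "Taking": ["Agent", "Source", "Theme"],
    "Theft": ["Goods", "Perpetrator", "Source", "Victim"],
    "Transitive_action": ["Agent", "Cause", "Event", "Patient"],
    "Damaging": ["Agent", "Cause", "Patient"],
    "Reshaping": ["Cause", "Deformer", "Undergoer"],
}

# Generality of an element name = how many frames use it.
generality = Counter(e for elems in frames.values() for e in elems)

assert generality["Entity"] == 3     # general: shared by several frames
assert generality["Deformer"] == 1   # specific: peculiar to one frame
```

Even on this small sample, role labels like Entity and Agent recur across frames, while Deformer and Perpetrator each appear only once, matching the specific/general contrast drawn in the bullets above.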

While the data exhibit a general pattern in which the level of abstraction of a frame matches the level of abstraction of its corresponding frame elements, exceptions are evident. In particular, some specific frames (e.g., Operate vehicle, Activity_start, Damaging) have general frame elements (e.g., Goal, Path, Source, Activity, Agent, Cause, Patient); these may accompany more specific frame elements. But characterizing the appropriate level of generality of frame elements in terms of the level of generality of frames presents a bit of a paradox, since automatic identification of hierarchical relationships between frames is based, at least in part, on how their frame elements are related to each other. Another pattern visible in the generalizations presented above is that highly specific frame elements tend to be of two kinds. One subset of the most specific frame elements (e.g., Protagonist, Perpetrator, Victim, Driver, Deformer, Undergoer) identifies roles played by persons. These frame elements include made-up words (Deformer, Undergoer) or real words used creatively (Protagonist). Where the frame elements are real words used in their normal sense, the word is drawn from either the most or the next-most specific WordNet synset in its hierarchy. The slight generalization (e.g., Driver, Victim) appears to be used when it abstracts from a number of more specific subclasses (e.g., road hog, race driver, motorist, designated driver, chauffeur; prey, muggee, martyr, casualty). Another subset of the most specific frame elements (e.g., Idiosyncracy, Heat source, Vehicle, Crime) is closely related to the name given the overall frame. Other kinds of participants (e.g., Path, Distance, Figure, Ground, Location, Goal, Source, Cause) appear not to vary in their level of generality with the generality of the frame.

4 Summary of Findings and Future Work

Of the options presented, option 3d is best supported by the data examined. The interaction of hierarchical level and semantic types is complex, not monolithic. While the level of abstraction of a frame is generally matched by the level of abstraction of its frame elements, there are systematic exceptions. The semantic types of some kinds of frame participants are relatively constant, no matter whether the overall frame is relatively general or rather more specific. The hierarchical level of the semantic types of other kinds of frame participants admits more variation. For example, the roles played by persons tend to be expressed very specifically; frame elements that mirror the name of the frame also tend to be expressed specifically. Given the preliminary nature of the investigation (and the concomitant limits on the amount of data examined), the conclusions just drawn are at best tentative and require further examination. In particular, more extensive examination of the hierarchical level of FrameNet’s frame element names is needed to establish whether patterns observed over hierarchically related frames generalize over all frames.

Notes

1. The very few exceptions, where instance synsets are common nouns (e.g., curb market, over-the-counter market, barrier island, isle, protagonist), are not easily distinguished from situations where common nouns serve as class synsets.
2. FrameNet has begun annotating some of the frame element names with values from a small set of “semantic types,” which is potentially a source of confusion. It is the frame element names in FrameNet, and not the semantic types they have begun using, that correspond to the semantic types under investigation.
3. The FrameNet “Book” (Ruppenhofer et al., 2005) explains with regard to inheritance relationships that “a child frame is a more specific elaboration of the parent frame. In such cases, all of the frame elements . . . of the parent have equally or more specific correspondents in the child frame.” It appears this principle has not always been followed.

References

Green, R. & Dorr, B. J. (2004). Inducing a semantic frame lexicon from WordNet data. Proceedings of the Second Workshop on Text Meaning and Interpretation, Workshop held in cooperation with the 42nd Annual Meeting of the Association for Computational Linguistics, 65-72.
Green, R. & Dorr, B. J. (2005). Frame semantic enhancement of lexical-semantic resources. Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition, Workshop held in cooperation with the 43rd Annual Meeting of the Association for Computational Linguistics, 57-66.
IBM. (2005). Welcome to the Content Management Version 8.3 Information Center. Subsection on Defining a Semantic Type. Retrieved February 27, 2006 from http://publib.boulder.ibm.com/infocenter/cmgmt/v8r3m0.
Lakoff, G. (1987). Women, fire, and dangerous things: What categories reveal about the mind. Chicago: University of Chicago Press.
National Library of Medicine. (2006). Unified Medical Language System. Section 3: Semantic Network. Retrieved February 17, 2006 from http://www.nlm.nih.gov/research/umls/meta3.htm.
Ruppenhofer, J., Ellsworth, M., Petruck, M. R. L., & Johnson, C. R. (2005). FrameNet: Theory and practice. Retrieved February 27, 2006, from http://framenet.icsi.berkeley.edu/index.php?option=com_wrapper&Itemid=126.
Sang, E. F. T. K. & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Proceedings of the Seventh Conference on Natural Language Learning, Workshop held in cooperation with the 41st Annual Meeting of the Association for Computational Linguistics, 142-147.
Tanaka, J. W. & Taylor, M. (1991). Object categories and expertise: Is the basic level in the eye of the beholder? Cognitive Psychology, 23, 457-482.

Clare Beghtol Faculty of Information Studies, University of Toronto, Ontario, Canada

The Global Learning Society and the Iterative Relationship between Theory and Practice in Knowledge Organization Systems

Abstract: In the global learning society, we need to understand how knowledge is transferred within one field and among different fields. In addition, we need to know how to create an atmosphere of tolerance for different points of view. One way of achieving understanding between different cultures and among different vantage points within the same culture is to study the relationship(s) between theory and practice. For this purpose, it is useful to understand the relationship(s) among ideas, how initial ideas migrate into practice and back into theory, and how “best practices” are identified and become widespread. In this paper, knowledge organization systems are used as examples of how such systems are created, how knowledge of the systems may be disseminated, and how that new knowledge is integrated into accepted theory and practice. This examination provides clues about the development of theories and practices that can enhance the contributions knowledge organization systems make to the global learning society.

1. Introduction and Background

The recent growth of electronic information access and retrieval capabilities, particularly using the ubiquitous Internet, has created an environment in which human societies can engage in various new kinds of electronic dialogues and interactions. In turn, this new kind of information environment fosters the capacity for different civilizations to learn from each other on both the micro- and macro-levels. In such an evolving global learning society, we need to understand how knowledge is transferred within one field and among different fields and how to create an atmosphere of tolerance for different points of view. One way of achieving understanding between different cultures and between different vantage points within one culture is to study the relationships of theory and practice, both in basic knowledge organization in our fields and in the many other fields that create their own knowledge organization systems and practices (Kwasnik, 1993). To these ends, we need to understand how knowledge creation and revision take place, how knowledge organization systems are created and revised, how new ideas migrate between practice and theory, how “best practices” are identified and then disseminated, and how these iterative and cyclical procedures can be encouraged to perpetuate themselves in the electronic world. This paper approaches some of these problems by taking knowledge organization systems as examples of how theoretical and practical processes interact and influence each other. In particular, the paper depends upon and expands upon the two-stage cyclical iterative process model of the relationship between theory and practice developed by Keedy (1992). This model provides clues about how ideas are created, disseminated, and integrated into human thought and action, and it provides a basis for understanding how the global learning society can profit from the development and use of knowledge organization systems.

2. Keedy’s Model

Keedy (1992) developed his model in a multi-case interpretive study of how successful public school principals used interactions between theory and practice to improve their schools. It is appropriate to enlarge the model developed for research in one relatively narrow field by examining whether it can be usefully applied to the broader and longer-term field of knowledge organization research for the global learning society. Keedy’s model consists of two interlocking steps. First, during Keedy’s study, “interaction among theory based assumptions, procedure, and discovery of practices” occurred (1992: 161). These interactions helped increase the quality and utility of the original procedures. Second, theory-building arose, in Keedy’s case, after the study was completed. At that time, increased thought about the study made theoretical breakthroughs possible. In developing an analogy between Keedy’s work and the creation and development of knowledge organization systems, we may generalize the model to take in more than one study. That is, we may consider various aspects of the histories of modern classification and knowledge organization systems to be the constructs under study and use these histories to ascertain empirically how theory and practice have been related in knowledge organization over an appropriately long period of time. The conclusions will help indicate how knowledge organization systems can be used as one element in the evolution of the global learning society. Of particular use in this endeavour is Keedy’s model of the interactions between theory and practice. Keedy’s diagram is copied as exactly as possible below.

[Figure: Model describing Interaction Between Theory and Practice (Keedy, 1992: 162, Fig. 2)]

In this diagram, two distinct paths to theory and knowledge discovery are shown. First, theory and “discovered practices” interact reiteratively and fertilize each other through procedural decisions that are made before, during, and after the study. In knowledge organization systems, procedural decisions are made about, for example, basic classificatory elements such as notational devices, class structure, depth of analysis, auxiliary tables, and many others. Second, new hypotheses can arise from discovered practices that suggest and inform new and/or revised theory. In knowledge organization research and development, new hypotheses can be seen in major changes based on developments in literary warrant and consensus, complete revisions of various schedules, new notational practices, and the like. Keedy’s diagram was created on the basis of his multi-case educational study, and he did not suggest that these processes occur in every case or, if they do occur, that they are necessarily fruitful. Interactions between theory and practice have been observed in a number of fields (e.g., Schiffrin, 1997). Nevertheless, Keedy’s diagram is suggestive as a fundamental method for studying the development and revision of knowledge organization systems. In particular, it suggests that theory development is both a top-down and a bottom-up process in which elements of both theory and practice are interwoven to create new theories, new practices, and new relationships between the two.

3. Keedy’s Diagram and the Development and Revision of Knowledge Organization Systems

In the field of aeronautics, according to Fairthorne, “theory and practice are regarded as aspects of the same reality” because in aeronautics “separation of practice from theory was lethal” (1970: 557). In the study of knowledge organization systems, however, a different view has customarily been taken. Theories for knowledge organization systems have sometimes been presented as if they arise suddenly and artlessly from the mind of one seminal thinker and as if new practices follow seamlessly from this spontaneously generated theory. For example, Melvil Dewey wrote that the idea of a decimal notation for his bibliographic classification occurred to him during a church service at Amherst College (Dewey, 1920). Similarly, S.R. Ranganathan claimed that the idea for faceted classification occurred to him when he observed how the separate pieces of a Meccano set could be joined together to make different kinds of toys (Friis-Hansen, 1985: 30). These stories from the creators of two important bibliographic classification systems may be apocryphal and self-serving, but their longevity demonstrates how firmly the field of bibliographic classification has embraced the belief that theoretical ideas spring from nowhere. In contrast, the actual situation is somewhat different. We may take the two classification systems and theories developed by these important classificationists as examples of how theory and discovered practice have interacted in the history of modern classification and knowledge organization systems. In spite of Dewey’s claim to have personally invented decimal notation and the general structure of his classification system, it is unlikely that he did so.
Comaromi (1976) suggested three possible sources for decimal notation and the structure of the system: 1) the notation, which Dewey may or may not have seen, devised by William Phipps Blake for the Centennial Exhibition in Philadelphia in 1876; 2) various ideas in Battezzati’s Nuovo Sistema di Catalogo Bibliografico Generale in Milan, as mentioned by Dewey himself in the preface to the first edition of the DDC; and 3) the work of William Torrey Harris at the St. Louis Public School Library and the work of Jacob Schwartz at the Apprentices’ Library of New York, which were also mentioned in the preface to the first edition. Wiegand (1998) has broadened and deepened the kinds of analyses Comaromi made by considering additional contemporaneous sources and by including discussion of the cultural milieu within which Dewey designed his classification system. Like Comaromi, Wiegand credits Harris’s St. Louis Public School Library classification as a source for many of the ideas in Dewey’s first edition, but he also considers other sources, such as the work of C.A. Cutter and Nathaniel Shurtleff. In addition, Wiegand suggested that the courses Dewey took at Amherst, the faculty members who befriended him, the texts he read there, and the general intellectual environment he encountered greatly influenced both Dewey’s character and his classification system. Investigations into the origins of Dewey’s system will undoubtedly continue. What seems clear, however, is that Dewey’s ideas came from a rewarding mixture of “discovered practices” and his own ideas. This interaction of theories and practices generated what could be argued to be the most successful modern bibliographic classification system. The Colon Classification, and particularly the concept of the “facet,” have received considerable scholarly attention. A number of faceted classificatory ideas have been found to have been developed long before Ranganathan (e.g., Svenonius, 1978; Schulte-Albert, 1979; Whitrow, 1983).
In addition, Cordonnier (1961) claimed that he was the first to use the term “faceted classification” in its modern meaning, and a number of influences and re-influences of facet principles on classification systems such as Dewey, UDC, and the schemes and publications of the CRG have been identified (Cockshutt, 1976: 40, Fig. 1). It can be reasonably argued that faceting is a universal cognitive method of subdividing a whole, based on the large number of examples that have been found in different disciplines and in different historical contexts (Beghtol, 2006). In addition, the concepts of faceting and of facet indicators were clearly understood in the development of the Classification Bibliographique of the Institut International de Bibliographie (now UDC), although with different terminology and somewhat different definitions. Early editions of the UDC contained basically the same auxiliary tables that the UDC now contains, including the marks that are still used as what would now be called facet indicators (e.g., Manuel…, 1907). The reasons that facet indicators were needed in an analytico-synthetic notation were also clearly understood, both in England by readers of the Library Association Record and in the United States by Melvil Dewey (Hopwood, 1907). Thus, the practices of facet analysis and of analytico-synthetic notation predated Ranganathan’s theoretically rich amplifications of the concepts. His ideas were popularized by the Classification Research Group (CRG) in their investigation of the possibility of creating a new classification for science, and particularly in their seminal paper “The Need for a Faceted Classification as the Basis of all Methods of Information Retrieval” (CRG, 1955). Like Dewey, then, Ranganathan may be said to have used “discovered practices” in his development of the Colon Classification and of the facet concept, including its attendant concepts of analytico-synthetic notation equipped with facet indicators.

4. The Global Learning Society and Knowledge Organization Systems

Classificatory activity is a cognitive universal (Beghtol, 2000, 2006; Kwasnik, 1999), but the processes of creating and revising knowledge organization systems on the basis of discovered practices are not well documented. Communication among scholars and classificationists can be facilitated by discussion of existing and emerging classification systems, their theories, and their discovered practices. We need, therefore, to study the effects of the relationships of theory and practice in classification research. Understanding the different purposes, materials, and contexts of classification and knowledge organization systems helps advance our knowledge of how classificatory techniques can span boundaries between cultures, languages, times, and places in a globalized information society that values and promotes continuous learning. It seems likely that a conscious and detailed search for useful discovered practices can help in promoting knowledge organization systems for use in a globalized learning society. In particular, those practices that signify the core value of access to information through knowledge representation and organization systems should be advanced and actively advocated. For example, those practices that enhance “cultural hospitality” and provide individual choice in information retrieval systems of all kinds need to be pursued. These concepts were described in detail in previous papers (Beghtol, 2002, 2005). They provide both top-down and bottom-up elements that can be incorporated into the search for useful discovered practices in knowledge organization in appropriate contexts with appropriately ethically based and globally acceptable points of view.

Acknowledgements This research was partially supported by the Social Sciences and Humanities Research Council of Canada grants 410-2001-0108 and 410-2005-0337.

References

Beghtol, Clare. 2000. “A Whole, Its Kinds, and Its Parts.” In Dynamism and Stability in Knowledge Organization: Proceedings of the 6th International ISKO Conference, 10-13 July, Toronto, Canada. Clare Beghtol, Lynne C. Howarth, Nancy J. Williamson, eds. Würzburg, Germany: Ergon, pp. 313-319.
Beghtol, Clare. 2002. “A Proposed Ethical Warrant for Global Knowledge Representation and Organization Systems.” Journal of Documentation 58(5): 507-532.
Beghtol, Clare. 2003. “Classification for Information Retrieval and Classification for Knowledge Discovery: Relationships between ‘Professional’ and ‘Naïve’ Classifications.” Knowledge Organization 30(2): 664-673.
Beghtol, Clare. 2005. “Ethical Decision-Making for Knowledge Representation and Organization Systems for Global Use.” Journal of the American Society for Information Science and Technology (JASIST) 56(9): 903-912.
Beghtol, Clare. 2006. “The Facet Concept as a Universal Principle of Subdivision.” In Knowledge Organization, Information Systems and Other Essays: Professor A. Neelameghan Festschrift. K.S. Raghavan and K.N. Prasad, eds. New Delhi: Ess Ess Publications, pp. 41-52.
Classification Research Group. 1955. “The Need for a Faceted Classification as the Basis of all Methods of Information Retrieval.” Library Association Record 57 (July): 262-268.
Cockshutt, Margaret E. 1976. “Dewey Today: An Analysis of Recent Editions.” In Major Classification Systems: The Dewey Centennial. Kathryn Luther Henderson, ed. Urbana-Champaign, Ill., pp. 32-46.
Comaromi, John P. 1976. “The Historical Development of the Dewey Decimal Classification System.” In Major Classification Systems: The Dewey Centennial. Kathryn Luther Henderson, ed. Urbana-Champaign, Ill., pp. 17-31.
Cordonnier, G. 1961. “Métalangage pour les Traductions d’Intercommunications entre Hommes et son Adaptation dans le Domaine des Machines pour Recherches Documentaires.” In Information Retrieval and Machine Translation. A. Kent, ed. New York: Interscience Publishers, part 2, pp. 1091-1138.
Dewey, Melvil. 1920. “Decimal Classification Beginnings.” Library Journal 45 (15 February): 151-154.
Fairthorne, Robert A. 1970. “Innovation Resulting from Research and Development in the Information Field. I: A Researcher’s View: The Detection of Innovation.” Aslib Proceedings 22(11): 550-558.
Friis-Hansen, J.B. 1986. “Ranganathan’s Philosophy: Assessment, Impact and Relevance: Report on the International Conference in New Delhi, 11-14 November 1985.” Libri 36(4): 313-319.
Hopwood, Henry V. 1907. “Dewey Expanded.” Library Association Record 9: 307-322.
Keedy, John L. 1992. “The Interaction of Theory with Practice in a Study of Successful Principals: An Interpretive Research in Process.” Theory Into Practice 31(2): 157-164.
Kwasnik, Barbara H. 1993. “The Role of Classification Structures in Reflecting and Building Theory.” In Advances in Classification Research 3: Proceedings of the 3rd ASIS SIG/CR Classification Research Workshop. Raya Fidel, Barbara H. Kwasnik, and Philip J. Smith, eds. Medford, NJ: Learned Information, pp. 63-81.
Kwasnik, Barbara H. 1999. “The Role of Classification in Knowledge Representation and Discovery.” Library Trends 48(1): 22-47.
Manuel du Répertoire bibliographique universel: organisation – état des travaux – règles – classifications [décimale]. 1907. Publication no. 63. Bruxelles: Institut International de Bibliographie.
Schiffrin, Deborah. 1997. “Theory and Method in Discourse Analysis: What Context for What Unit?” Language & Communication 17(2): 75-92.
Schulte-Albert, Hans G. 1979. “Classificatory Thinking from Kinner to Wilkins: Classification and Thesaurus Construction, 1645-1688.” Library Quarterly 49(1): 42-64.
Svenonius, Elaine. 1978. “Facet Definition: A Case Study.” International Classification 5(3): 134-141.
Whitrow, Magda. 1983. “An Eighteenth-Century Faceted Classification System.” Journal of Documentation 39(2): 88-94.
Wiegand, Wayne A. 1998. “The ‘Amherst Method’: The Origins of the Dewey Decimal Classification Scheme.” Libraries & Culture 33(2): 175-194.

Elaine Menard

Image Retrieval in Multilingual Environments: Research Issues

Abstract: This paper presents an overview of the nature and the characteristics of the numerous problems encountered when a user tries to access a collection of images in a multilingual environment. Major research questions to be investigated to improve image retrieval effectiveness in a multilingual environment are identified.

1. Introduction

There are approximately 6,900 living languages throughout the world (Ethnologue, 2006). Although the Internet constitutes a vast universe of knowledge and human culture, allowing the dissemination of ideas and information without borders, the elimination of linguistic barriers through the worldwide adoption of a single language is an old utopian dream (Grefenstette, 1998). The Web has become an important medium for the diffusion of multilingual resources. However, its “worldwide” expansion is a rather recent phenomenon, and the Web is still not really adapted to languages other than English. The performance of different retrieval systems varies considerably depending on the language of the documents (Chaudiron, 2002). Linguistic differences constitute a major obstacle to scientific, cultural, and educational exchange. With the ever-increasing size of the Web and the emergence of more and more documents in various languages, this problem, far from diminishing, becomes more and more pervasive. Besides this linguistic diversity, more and more databases and collections now contain documents in various formats (text, image, video...), which may also adversely affect the retrieval process. For example, the retrieval of digital images differs from the retrieval of textual documents. Consequently, with the advent of the Web as one of the most popular media for delivering and accessing image information and the rapid development of image technology, there is an increasing need for more research in image indexing and retrieval. Although images by their very nature are usually language-independent resources, image captions or other text associated with images may be available for retrieval in a variety of languages. As a result, some form of multilingual retrieval approach can be considered. To date, very little research has been done on the effectiveness of image retrieval in multilingual environments.
Nevertheless, image retrieval in multilingual environments still presents its share of problems.

2. Utility of Image Retrieval

The need to retrieve a particular image from a collection is shared by several user communities, including teachers, artists, journalists, scientists, historians, film makers, and librarians all over the world. Image collections also have many areas of application: commercial, scientific, educational, and cultural. Until recently, image collections were difficult to access because of limitations in dissemination and duplication procedures. The advent of the Web highlighted the pressing need to develop suitable tools for the documentary description of digital images, since images can now be found in the majority of available resources: personal Web pages, digital libraries, commercial services and product catalogues, government information, and all the collections related to fine art, architecture, archival materials, and other cultural material.

Images are often used to show a certain concrete object, but they are also needed to express ideas and feelings. People look for images for information (e.g., a house buyer may want to have a precise idea of houses he will be visiting soon) and inspiration (e.g., a painter may want to retrieve images of objects or landscapes to use in the composition of a painting). It is therefore important to organize all this visual information in order to maximize its accessibility and, eventually, its usability.

3. User Needs and Behaviours

On most image search engines on the Web, users basically express their needs using textual queries which describe the images they want. Analyzing and understanding the query patterns used and user behaviour is the first significant step necessary to meet those needs. Recent research involving non-art images (Turner, 1998; Jörgensen, 2003) suggests that image management can be very much informed by focusing on users and their information-seeking behaviours. Fundamentally, we can classify user needs into four main types: (1) The user searches for a specific image, and only one image will satisfy this need. This is the case, for example, of a user looking for a famous painting like La Gioconda by Leonardo da Vinci or The Swing by Auguste Renoir; (2) The user is looking for a set of similar images which will match a series of criteria he has in mind. For example, this would be the case when the police are trying to match a composite drawing of an individual with the pictures of potential criminals; (3) The user may search for a concept, an abstraction, or an idea. For example, an e-consumer may want to retrieve pictures for future purchases (jewels, motorcycle, coffin, etc.); (4) Finally, the user may have only a very vague idea of what he is looking for. In this case, the user is looking for inspiration or ideas rather than information (Conniss, Ashford & Graham, 2000). One key component of the conceptualization of image retrieval is the particular context of the retrieval (motivation, seeking strategy, familiarity with the retrieval system, resources available, and barriers to retrieval). The user must be studied within the retrieval context in which the information need takes its source. Although a great deal of research has attempted to model the user of traditional text-based retrieval systems, very little work has concentrated on users’ behaviour when they search for images on the Web.

4. Access to Images

Compared with text-based retrieval, image retrieval poses specific challenges. At a certain level of abstraction, textual documents and images can be thought of as being very similar (both expressing specific ideas and concepts). Nevertheless, perhaps the most significant difference between textual and image retrieval is that, with images, users have a higher propensity to browse when they go through their searches (Clough and Sanderson, 2003). As a result, users will also check associated textual metadata to decide whether an image is relevant or not. Consequently, in a multilingual environment, the ability to browse textual metadata in a language the user understands is important. Similarly, another important difference between textual and image retrieval in a multilingual environment is that once a relevant image is retrieved, it does not require any translation mechanism to be understood by the user and can be used immediately. Different approaches can be used for image retrieval: (1) a user can submit an image query (i.e., an example image selected from a database or drawn by the user); (2) a user can submit a textual query and the system searches for images using image captions; or (3) some semantic meaning is assigned to images, and they can then be retrieved by a textual query (Alvarez, Oumohmed, Mignotte & Nie, 2004). With the first approach, the user must provide an example offering some visual feature to be used for image comparison and matching. The two other approaches require a textual query, which implies that retrieval is limited to images associated with captions or semantic annotations. However, these two approaches have limitations, since captions are not always available and it is difficult to recognize all the semantic meanings of images. Collections of digital images are relatively new compared to the world of textual documents.
Paradoxically, the development of specific techniques to handle this particular type of document has been neglected. Regarding the best approach to describing images, two schools of thought present divergent points of view: “those who feel that images are so different from text that there can be no similarity in methods for providing access and those who feel that a single set of rules based on conditions of authorship can be satisfactorily applied to all materials” (Jörgensen, 2003, 75). It is nevertheless not recommended to apply to visual resources such as digital images the same metadata schemas expressly designed for textual documents, because they are usually too general to describe images adequately.

4.1 Content-Based Image Retrieval

Over the years, different approaches have been developed by researchers for conducting studies on image retrieval. As the number of digital image collections is rapidly growing, we need efficient browsing and searching tools to manage these large image databases. Content-Based Image Retrieval (CBIR) was proposed to address this problem. The CBIR approach is based on the primitive features of images, such as colour, texture, and shape. With this approach, most queries in CBIR systems are expressed as a visual exemplar of the type of image or image attribute being sought, and retrieval operates by comparing physical features. CBIR systems allow users to form queries by either: (1) selecting or providing an example image, (2) graphically sketching or drawing a query image, or (3) filling in query-by-example fields provided by the retrieval system. Once the query is submitted, the CBIR system goes through the database containing features representing the images and retrieves stored images that exhibit a high degree of similarity to the requested features by matching the query against the descriptive information in the database. While attributes such as colour, texture, shape, and other visual characteristics are unquestionably important features for image representation, it is still difficult to implement these attributes for image retrieval. Without human supervision, CBIR can only extract low-level visual features, whereas it is generally accepted that effective browsing requires semantic information about the composition of the image. Moreover, since this method does not require the images to be associated with any form of text (i.e., captions), CBIR systems are often complex to use for a non-expert user. This is the main reason why people still prefer to retrieve images using words.
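The matching step described above can be illustrated with a toy sketch: each image is reduced to a primitive feature vector (here a coarse 4-bin colour histogram), and stored images are ranked by similarity to the query's vector. The filenames and histogram values are invented for illustration; a real CBIR system would extract such features from pixel data:

```python
# Sketch of CBIR matching by comparing normalized colour histograms.
# All feature values below are illustrative toy data.

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

database = {
    "sunset.jpg": [0.70, 0.20, 0.05, 0.05],  # mostly warm colours
    "forest.jpg": [0.05, 0.15, 0.70, 0.10],  # mostly green
    "ocean.jpg":  [0.05, 0.10, 0.15, 0.70],  # mostly blue
}
query = [0.60, 0.25, 0.10, 0.05]  # query-by-example: a warm-toned image

# Rank stored images by similarity to the query's feature vector.
ranked = sorted(database,
                key=lambda name: histogram_intersection(query, database[name]),
                reverse=True)
print(ranked)  # the warm-toned image should rank first
```

Histogram intersection is only one of several standard comparison measures; the same ranking loop works with any feature distance.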

4.2 Context-Based Image Retrieval

As a result, the CBIR approach can hardly replace the text-based approach; rather, it complements it. The text-based (or context-based) approach uses the textual features associated with the image for retrieval purposes. With this approach, human beings are directly or indirectly involved in the image description process, and much information about the content of an image comes from sources other than the image itself. Context-based image retrieval can rely on manual annotations (descriptors, keywords or other metadata) or on collateral text (captions, titles, nearby text) provided directly with the image. Similarity is then based on the word similarity between the associated texts (manually added or incidentally provided with the image) and the text of the queries.
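The word-similarity matching just described can be sketched as a simple term-overlap score between a query and each image's collateral text. The caption data, image names and the use of Jaccard similarity are illustrative assumptions; real systems use richer weighting schemes.

```python
def tokenize(text):
    """Lowercase, strip basic punctuation, split on whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return set(cleaned.split())

def word_overlap(query, annotation):
    """Jaccard similarity between query terms and annotation terms."""
    q, a = tokenize(query), tokenize(annotation)
    return len(q & a) / len(q | a) if q | a else 0.0

def rank_by_caption(query, captioned_images):
    """Rank images by how well their caption text matches the query."""
    return sorted(captioned_images.items(),
                  key=lambda item: -word_overlap(query, item[1]))
```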

Inevitably, one important problem with this approach is the difference between the words used to describe the images and the words used to retrieve them. Moreover, manual annotation of images is both time-consuming and expensive. The assignment of keywords for the description of images is also problematic for another reason: since human indexers describe the image according to a specific indexing policy (i.e. ofness, aboutness, or a combination of both), their terms will not necessarily correspond to the words end-users will use to search. The main difficulty of context-based retrieval is that it requires a match between the representation the indexer assigns to the image and the representation used by the searcher. This match will not be obtained easily, due to factors such as misspelled query or indexing terms, inaccurate descriptors, and images that are too old or of too poor quality to be analyzed easily during the indexing process. Despite these apparent weaknesses, context-based image retrieval will likely remain the preferred technique for searching images, since language is our principal mode of communication.

5. Related Problems in Multilingual Environments

5.1 Multilingual Indexing

Significant work has been done to identify problems when metadata are used in a multilingual context, but a number of issues remain outstanding: multilingual metadata description, encoding of multilingual metadata descriptions, multilingual interoperability of metadata, re-use of metadata schemas, maintenance of element definitions in various languages, etc. In connection with digital images, metadata are mainly used to index information concerning the image itself, the technological processes of capture, and the related rights (Gilliland, 2000). In other words, access to visual resources relies primarily on metadata. For example, metadata can describe the content of the image and its origin (date, hour, luminosity, use of a flash, etc.) or identify the image inside a collection (indexing terms, categorization, classification, etc.). Likewise, administrative and technical metadata are essential for the management of digital image collections: they document the technologies used to produce the informational object, the information that will allow its reproduction, and the procedures for data management. In recent years, many metadata schemas and specifications have been designed to meet specific needs or specific document types. Several metadata schemas (NISO MIX, VRA Core, CDWA, MPEG-7, RLG, etc.) can be used to describe images. However, most metadata schemas intended for the description of digital images are available only in English. Retrieval success in such cases will depend largely on the completeness and accuracy of the associated metadata, as well as on the translation infrastructure provided by the retrieval system.

5.2 Quality of Linguistic Resources

A great deal of research is currently in progress in the area of Cross-Language Information Retrieval (CLIR). In CLIR systems, a query written in one language (the source language) retrieves documents written in another language (the target language). Generally, two approaches are used in CLIR. In the first approach, the user's query is translated into the language of the document collection, allowing the query to be matched against each document of the collection. This is the preferred approach because it is easy, efficient and inexpensive; translating the query requires little time or storage space. The main hurdle, however, is the resulting translation ambiguity: queries are often very short and do not provide enough context for the translation to be reliable.
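Dictionary-based query translation, the first approach described above, can be sketched in a few lines. The toy term list below is an assumption for illustration (the one-to-many pairs echo examples used later in this volume); note how every candidate sense of an ambiguous term survives into the translated query.

```python
# Toy Spanish-English term list; a real CLIR system would use
# far larger lexical resources.
ES_EN = {
    "gota": ["drop", "gout"],          # one-to-many: ambiguous source term
    "polvo": ["dust", "powder"],
    "pulmonar": ["pulmonary", "lung"],
}

def translate_query(query, term_list):
    """Translate each query term, keeping every candidate translation.
    Short queries carry little context, so the ambiguity survives and
    all senses are passed on to the target-language search."""
    translated = []
    for term in query.lower().split():
        translated.extend(term_list.get(term, [term]))  # unknown terms pass through
    return translated
```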

The second approach used in CLIR systems requires translating all documents in the collection into all query languages. Compared to the first approach, it offers more specific results, but it is very expensive and problematic in terms of storage space. CLIR systems make use of three techniques for translating either the queries or the documents: (1) bilingual or multilingual dictionaries, (2) machine-translation (MT) systems, or (3) parallel or comparable corpora extracted from vast textual collections. The quality of these linguistic resources plays a major part in digital image retrieval, and most of the resources just mentioned have obvious limitations. The principal problem remains the difficulty of translating certain specific elements of language (i.e. expressions, compound words, proper names, etc.). Moreover, each language has its own particular characteristics, which complicates the translation process considerably. To overcome these limitations, many researchers recommend combining several language resources to achieve optimal translation.

5.3 Semantic and Syntactic Ambiguity

Classification systems are built upon the foundation of language, and language is ambiguous by nature: words can be interpreted and understood in more than one way. When words are used as labels for categories, there is a risk that users will miss or misunderstand the meaning. Classification is particularly difficult when organizing abstract concepts such as subjects, topics or functions, which constitute a large part of image description. There is a substantial call for subject and pre-iconographic indexing in systems intended to serve a broad range of user queries. One of the most important ambiguities of the indexing process is the choice between two types of indexing. Turner (1998) distinguished "hard" indexing, the description of what the indexer sees in the image, from "soft" indexing, which refers to the significance of what is seen. Soft indexing (the "about" level) covers the moods, emotions, abstractions and symbols contained in images (Hollink, Schreiber, Wielinga & Worring, 2004). At this level, an image can be interpreted in several ways, using abstract concepts such as richness, success and sadness, and other connotations attached to the image. The decision to index in a "soft" way must be made according to the nature of the collection and to the uses that will be made of the images once retrieved. It must be kept in mind, however, that this type of indexing remains difficult to carry out. Moreover, since it will not necessarily provide the level of specificity necessary for optimal image retrieval, it may prove pointless to include it in an image indexing policy. Clearly, it is generally easier to determine the contents of an image than to interpret its significance.
The problem of polysemy at indexing time has long been known, but one may wonder whether it also applies to digital images. In fact, images are rich in diverse elements and carry several levels of significance (Shatford, 1986); images are therefore polysemous by nature. The problem of image polysemy is even more complex because it emerges at two levels. First, it occurs when the concepts chosen to describe the image are translated. Semantic and syntactic ambiguity is also a direct consequence of the limited quality of the linguistic resources used for multilingual retrieval: semantic ambiguity often arises when a query does not provide enough information to be translated well, and this ambiguity can have a negative impact on image retrieval. Second, and more importantly, there is the polysemy of the image itself, since an image can have as many meanings as there are indexers. As the saying goes: an image is worth a thousand words!

6. Conclusion

The challenges engendered by the linguistic diversity of the Web are colossal. Much work remains to be done on all aspects of multilingualism in order to give everyone full access to information in a language-independent way; multilingualism is a prerequisite for a true information society. The issues presented in this paper open up a number of areas for future research and can generate many research questions. For example, we should investigate which characteristics of metadata schemas are essential to improving access to digital images, and whether existing metadata schemas can be used adequately for multilingual retrieval of visual material. Are existing schemas appropriate, should they be modified, or should new specialized ones be developed? Another necessary approach is to investigate image retrieval from the standpoint of users' needs and behaviour. For example, which search methods are most effective for users looking for images in a multilingual collection? Are standard measures such as precision and recall really adequate for measuring retrieval performance, or do more human-oriented measures need to be implemented for visual material? Answering these emerging questions would bring to light the essential elements of suitable processing for digital image collections, which offer an unequalled informational richness. Indeed, these collections represent essential elements of collective memory and world heritage. It is therefore fundamental to address the growing need for harmonized methods that ensure optimal dissemination and retrieval of images in multilingual collections.

References

Alvarez, Carmen, Oumohmed, Amhed Id, Mignotte, Max & Nie, Jian-Yun (2004). Toward cross-language and cross-media image retrieval. Retrieved January 31, 2006, from http://clef.isti.cnr.it/2004/working_notes/WorkingNotes2004/61.pdf
Chaudiron, Stéphane (2002). La question du multilinguisme en contexte de veille sur Internet [The question of multilingualism in the context of Internet monitoring]. In Frédérique Segond (Ed.), Multilinguisme et traitement de l'information (pp. 63–85). Paris: Hermès Science.
Clough, Paul & Sanderson, Mark (2003). The CLEF 2003 cross language image retrieval task. Retrieved January 31, 2006, from http://clef.isti.cnr.it/2003/WN_web/45.pdf
Conniss, Lynne R., Ashford, A. Julie & Graham, Margaret E. (2000). Information seeking behaviour in image retrieval: VISOR I final report. Newcastle upon Tyne: Institute for Image Data Research, University of Northumbria at Newcastle.
Ethnologue (2006). Statistical summaries. Retrieved January 31, 2006, from http://www.ethnologue.com/ethno_docs/distribution.asp?by=area
Gilliland, Anne J. (2000). Setting the stage: Defining metadata. In Introduction to metadata: Pathways to digital information. Retrieved January 31, 2006, from http://www.getty.edu/research/conducting_research/standards/intrometadata/2_articles/index.html
Grefenstette, Gregory (1998). The problem of cross-language information retrieval. In Gregory Grefenstette (Ed.), Cross-language information retrieval (pp. 1–9). Boston: Kluwer Academic.
Hollink, L., Schreiber, A. Th., Wielinga, B. J. & Worring, M. (2004). Classification of user image descriptions. International Journal of Human-Computer Studies, 61 (5): 601–626.
Jörgensen, Corinne (2003). Image retrieval: Theory and research. Lanham, MD: Scarecrow Press.
Shatford, Sara (1986). Analyzing the subject of a picture: A theoretical approach. Cataloging & Classification Quarterly, 6 (3): 39–62.
Turner, James M. (1998). Images en mouvement: stockage, repérage, indexation [Moving images: storage, retrieval, indexing]. Sainte-Foy: Presses de l'Université du Québec.

Graciela Rosemblat and Laurel Graham National Library of Medicine, NIH, DHHS, Bethesda, MD, USA

Cross-Language Search in a Monolingual Health Information System: Flexible Designs and Lexical Processes

Abstract: The predominance of English-only online health information poses a serious challenge to non-English speakers. To overcome this barrier, we incorporated cross-language information retrieval (CLIR) techniques into a fully functional prototype. It supports Spanish-language searches over an English data set using a Spanish-English bilingual term list (BTL). The modular design allows for system and BTL growth and takes advantage of English-system enhancements. We present language-based design decisions and the implications of integrating non-English components with the existing monolingual architecture. Algorithmic and BTL improvements bring CLIR retrieval scores in line with the monolingual values. After validating these changes, we conducted a failure analysis and error categorization for the worst-performing queries. We conclude with a comprehensive discussion and directions for future work.

1 Background and Introduction

Online health information systems predominantly offer only English-language support. This language barrier undermines non-English speakers' ability to access information. Cross-Language Information Retrieval (CLIR) techniques are often used to overcome this challenge by supporting searches in users' native languages. Dynamic databases such as ClinicalTrials.gov1 add an extra layer of complexity to cross-language search: time-sensitive information, such as protocol amendments and registration deadlines, requires keeping cross-lingual retrieval synchronized with periodic, unanticipated changes. Previous work (Rosemblat, Tse, & Gemoets, 2004) reported on two query-based approaches to Spanish-English CLIR at ClinicalTrials.gov, a health information system. Retrieval results with machine translation (MT) were compared against those from a then newly developed Bilingual Term List (BTL). The BTL approach provided a transparent and controllable process in which the translation entries, corresponding to both medical and common vocabulary terms in English and Spanish, were obtained from publicly available sources. After a series of evaluations and subsequent improvements, BTL translation results were brought in line with the MT approach scores through rudimentary normalization of Spanish-language query terms. The current paper describes a fully functional prototype that supports Spanish search over the English-language ClinicalTrials.gov data set, and presents a design that is generalizable to other health information systems and languages. We stopped further MT evaluations and concentrated on comparing subsequent CLIR results via the BTL approach against an English monolingual standard. CLIR scores are now close to equivalent monolingual retrievals due to improvements in the translation algorithms and the BTL.
This paper focuses on 1) the prototype architecture and design, 2) the strategies adopted in the BTL to greatly improve retrieval, and 3) a failure analysis of a random query sample, categorizing the worst-performing queries.

2 Current Architecture

The project's goal is to provide cross-language search in a cost-effective, generalizable way for monolingual systems. From the early design stages of the Spanish prototype, we planned to use common software, hardware, and backend data (Figure 1) for the English and Spanish search, rather than maintaining completely separate systems. While this design required a greater initial effort, the savings in maintenance time and the avoidance of synchronization errors were compelling. The resulting system:

• Shares the web application code, the backend code, and data for both language systems;
• Intermingles data sets: English/Spanish mixed tags in one XML document; and
• Displays Spanish or English data based on run-time language selection.

Figure 1: Prototype architecture overview

The efficiency of this mechanism notwithstanding, the recent increase in clinical trial registrations at ClinicalTrials.gov is forcing a re-evaluation of this approach. Large numbers of XML documents with mixed tags can offset the advantages, especially if more languages were introduced, increasing the information within each document and resulting in sluggish performance. Decoupling the language-specific data may be considered in the future as a means of optimizing performance while maintaining the advantages of the common backend system and web application. The current design provides a generic application program interface (API) between the NLM2-developed search engine "Essie" (previously "SE") (McCray, Ide, Loane, & Tse, 2004) and the Spanish prototype, unifying the search-and-retrieval process. We developed a quasi-translation module which implements the generic interface and is callable by search engines. Incorporating this module into the search engine leverages information retrieval enhancements designed for the English monolingual system. These enhancements, now available for Spanish queries, include conceptual mapping, lexical variant generation (Divita, Browne, & Rindflesch, 1998), and a synonymy component via the UMLS® (Lindberg, Humphreys, & McCray, 1993). This allows for better CLIR retrieval scores with an economy of effort, as future improvements need only be made in the English system, without duplicating the changes on the Spanish side. Cross-language search thus takes advantage of existing monolingual system enhancements.
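A generic translation interface of the kind described might look roughly like the sketch below. The class and method names are invented for illustration and do not reflect the actual Essie API; the point is that the engine depends only on the interface, so the BTL module can be swapped out without touching the search architecture.

```python
from abc import ABC, abstractmethod

class QueryTranslator(ABC):
    """Hypothetical generic interface between a search engine and any
    translation module (names illustrative, not the real Essie API)."""

    @abstractmethod
    def translate(self, terms, source_lang):
        """Return candidate English translations for each incoming term."""

class BTLTranslator(QueryTranslator):
    """Implementation backed by a bilingual term list (BTL)."""

    def __init__(self, btl):
        self.btl = btl  # {spanish_term: [english_translations]}

    def translate(self, terms, source_lang):
        if source_lang != "es":
            # English queries pass through unchanged.
            return {t: [t] for t in terms}
        return {t: self.btl.get(t, []) for t in terms}
```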

3 The BTL Look-Up Process

The interchangeability of lexical databases such as glossaries or term lists dovetails well with the modular design of this prototype. While our BTL is focused on clinical trials, a different BTL or other lexical source may be used in the future, or combined with the current one, without redesigning the general architecture. Before the Spanish-language queries are searched against the English-language corpus, they are converted into corresponding English queries in a stepwise process. First, the Spanish search page calls Essie with a flag indicating that the incoming query terms will be in Spanish. This triggers Essie to forward all incoming query terms to the translation module, which looks up these terms in the BTL and returns the English translations. The term look-up process consists of several stages: number and gender normalization, stripping of diacritics and conversion of all words to lowercase, and matching against the corresponding English-language expressions in the BTL. Multi-word expressions are initially considered as a single unit, then successively decomposed into smaller phrases and, ultimately, individual terms (McCray et al., 2004). Spanish result lists will often exceed their English counterparts, for a variety of reasons:

• Multiple senses of a single Spanish expression (whether one-word or multi-word) may correspond to different English terms, a phenomenon known as polysemy or a one-to-many relationship. For example, both English dust and powder translate to Spanish polvo; the English nouns drop and gout translate to Spanish gota;
• Spanish translations in the BTL may include slight word-family or semantic variations from the corresponding English expression, to capture cases where Spanish uses adjectives but English uses premodifying nouns, as in Spanish punción pulmonar (noun + adjective), English lung puncture (noun + noun); Spanish tumor cerebral (noun + adjective), English brain tumor (noun + noun);
• To optimize retrieval, expansion mechanisms (synonymy, lexical variants) apply to all putative translations, including any BTL context-independent alternate English translations for a single Spanish term, as in the polysemy cases outlined above; and
• More fields are searched in Spanish searches, because Spanish tags are searched in addition to English ones in the same XML document.
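The look-up stages described in this section (normalization, diacritic stripping, whole-expression matching with back-off, OR-style retention of all candidate translations) can be sketched as follows. The normalization heuristics here are deliberately crude placeholders for the actual algorithms, and the mini-BTL reuses the paper's own examples.

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining marks: 'punción' -> 'puncion'."""
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if unicodedata.category(c) != "Mn")

def normalize_es(word):
    """Crude Spanish number and gender normalization (illustrative only)."""
    w = strip_diacritics(word.lower())
    if w.endswith("es") and len(w) > 4:      # plural: tumores -> tumor
        w = w[:-2]
    elif w.endswith("s") and len(w) > 3:     # plural: gotas -> gota
        w = w[:-1]
    if w.endswith("a") and len(w) > 3:       # feminine -> masculine citation form
        w = w[:-1] + "o"
    return w

def normalize_phrase(phrase):
    return " ".join(normalize_es(w) for w in phrase.split())

def build_index(btl):
    """Index the BTL under the same normalized forms used for queries."""
    return {normalize_phrase(k): v for k, v in btl.items()}

def btl_lookup(expression, index):
    """Match the whole expression first, then back off to single words.
    All candidate translations are kept, as if joined by an OR operator."""
    norm = normalize_phrase(expression)
    if norm in index:
        return index[norm]
    out = []
    for w in norm.split():
        out.extend(index.get(w, [w]))
    return out
```

A multi-word query such as "tumores cerebrales" thus normalizes to the stored expression "tumor cerebral" and matches as a single unit, while an ambiguous single term returns all of its translations at once.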

4 Improvements Implemented: Description and Validation

Our current prototype shows a 23% average increase3 using the same unedited query sets and corpus (7,170 records) as the earlier project (Rosemblat et al., 2004). These queries came from two sources: English domain-specific queries from ClinicalTrials.gov, and Spanish general health queries from MedlinePlus en español4. External translators converted the original English queries into Spanish and the original Spanish ones into English, rendering two parallel query sets, one for each language. Retrieval results from the English set served as the monolingual standard for the CLIR results from the equivalent Spanish query set. CLIR performance was measured by F-values, which combine precision and recall into a single value (Van Rijsbergen, 1979). Table 1 shows F-values for the test sets using the 10-document cut-off calculation (Rosemblat et al., 2004). The interim step shows improvements to the search and retrieval mechanism within Essie (outside the scope of this project), without improvements in the translation algorithms and the BTL. The final step shows the total performance increase with both Essie and BTL improvements factored in.

                             ClinicalTrials.gov (N = 488), F Factor    MedlinePlus (N = 466), F Factor
Environment                  Test Set  Training Set   Training Set     Test Set  Training Set   Training Set
                                       (Initial BTL)  (Modified BTL)             (Initial BTL)  (Modified BTL)

Baseline5                    0.398     0.460          0.481            0.443     0.489          0.487
BTL5, Current Essie          0.534     0.539          0.516            0.546     0.543          0.526
Current BTL, Current Essie   0.68      0.688          0.672            0.606     0.606          0.607

Table 1: Comparing CLIR performance improvements

Lexical CLIR score improvements resulted from the following changes to the BTL:

• gender normalization (building on earlier algorithms for singular/plural variation5);
• increased vocabulary coverage, both domain-centered and data-focused;
• addition of stop words and punctuation; and
• removal of excessive context-dependent semantic variations, or "over-extended" English translations, for a given Spanish entry.

Alternate translations in the BTL for Spanish trastorno offer an example of over-extended translations: depending on word context (Table 2), disease, disorder, condition, and disturbance can all translate trastorno:

Spanish                                  English
trastornos en la marcha                  gait disturbance
trastorno de salud                       adverse health condition
trastorno de los nervios periféricos     peripheral nerve disorder
trastorno de Tourette                    Tourette's disease

Table 2: One Spanish source entry [trastorno/s] - many possible English translations

The prototype does not contain context-dependent rules to indicate which translation to select in the vicinity of other terms or collocations. Therefore, for Spanish searches, all the translations for a given Spanish expression are used against the English corpus, much as if they were joined by an OR operator. The original Spanish query term is searched along with the BTL translations. As a result, Spanish retrievals often outnumber those of the English monolingual search (the gold standard), since each alternate translation contributes its own set of retrievals, and the extra Spanish retrievals may not correspond to the original query. Limiting the number of these over-extended translations is therefore a critical part of the ongoing BTL clean-up process, especially when including such translations hurts, rather than helps, precision values. To evaluate how the bigger corpus affects retrieval in the prototype, we tested CLIR performance (Table 3) on the complete set of documents as of August 25, 2005 (15,064 records). The 10-document cut-off calculations were dropped because they require a frozen corpus for measurements to be comparable. Instead, calculations of precision at 10 now show the impact on the user, as this measure is independent of the data set(s) used in the search. Values for the prototype without lexical (translation) improvements are also shown for comparison. Essie improvements are held constant in all rows.

                         ClinicalTrials.gov (N = 483)                   MedlinePlus (N = 460)
                         F Factor  Precision  Recall  Prec. at 10       F Factor  Precision  Recall  Prec. at 10

January 2004 BTL,
Plural Normalization     0.7       0.838      0.601   0.856             0.683     0.853      0.57    0.87
Current BTL, Plural +
Gender Normalization     0.83      0.836      0.823   0.868             0.846     0.843      0.848   0.875

Table 3: Comparing performance improvements on the August 2005 clinical trials data set

The F-value increase reflects significant improvements in recall (43%) due to increased BTL coverage. While recall was prioritized over precision, considerable effort was made to ensure that precision did not suffer. This increase was validated with a new random sample of 926 queries. For parallelism and consistency with the earlier study, the new query sets were extracted from the same sources and underwent the same processing: external translators rendered two parallel query sets, one for each language. Retrieval results from the English set served as the monolingual standard for the CLIR results from the equivalent Spanish query set.

                         ClinicalTrials.gov (N = 470)                   MedlinePlus en español (N = 456)
                         F Factor  Precision  Recall  Prec. at 10       F Factor  Precision  Recall  Prec. at 10

Current BTL, Number +
Gender Normalization     0.861     0.888      0.844   0.905             0.879     0.843      0.918   0.861

Table 4: Validation of performance improvements with a new unedited test set

CLIR scores from the new query sets (Table 4) demonstrate that the improved F-values result from the strategies and changes implemented in the BTL, irrespective of the query sets used. Eliminating the 10-document cut-off computation will allow future performance comparisons as the data set grows, without having to recalculate measures at each point6.

5 Failure Analysis

Failure analysis and subsequent categorization of the 200 worst-performing queries7 uncovered problem areas in the BTL and its interaction with lexical resources used by Essie, namely the UMLS®. Two categories accounted for 69% of the worst-performing queries:

• BTL Coverage (46%): missing translations or missing entries, including Spanish-English pairs; and
• Semantic Coverage (23%): over-extended lexical variations and synonymy, including differences between BTL translations (too many or too few) and the UMLS® entries and conceptual mappings used by Essie for each of the English translations.

Spanish mareo is an example of the latter category, with the following valid BTL translations: dizziness, lightheadedness, airsickness, carsickness, and seasickness. The UMLS® has no semantic mappings between these terms. To illustrate how differences between BTL translations and UMLS® entries affect precision values for the Spanish search, consider Spanish síndrome, which has the following BTL translations: syndrome, disease, condition, and disorder. Since the UMLS® does not relate the last three terms, the English search will only include syndrome. For the Spanish search, however, the BTL look-up procedure will collect all four translations for Essie to search against the English corpus, along with the original Spanish query, síndrome. Thus more documents (not necessarily search-targeted) will be retrieved in the Spanish search than in the English. Alternatively, this may yield better Spanish returns. For example, until recently the UMLS® did not have lung as a synonym for pulmonary, while both terms were BTL translations of Spanish pulmonar. Since the English retrieval serves as the gold standard, the better Spanish results will nevertheless be assigned low retrieval scores, as they indicate a mismatch between the English and Spanish results. This represents a weakness in our methodology for performance evaluation and validation. Other category areas that hurt retrieval (31% combined) were:

• Query Translation: lack of context may cause variance in query interpretation, and there are often several ways to translate a given query. Professional renderings may vary slightly from commonly used translations, resulting in zero matches;
• Search Procedure: failure caused by bugs or limitations in the search; and
• Language Differences: polysemy, false cognates or general language differences.

Figure 2 shows the distribution8 of categories as a percentage of total failed queries:

[Bar chart omitted: distribution of failed queries across the five failure analysis categories — BTL Coverage, Semantic Coverage, Query Translation, Search Procedure, and Language Differences.]

Figure 2: Graphical representation of the failure analysis distribution

Some problems from the initial failure analysis have since been resolved. For example, several queries failed because punctuation was treated as a search term in the Spanish search, while the equivalent punctuation was skipped for English as part of the stop word list. This led to the inclusion of punctuation as stop words in the BTL, making the Spanish stop word list more equivalent to the English one and unifying search results for the previously failed queries.

6 Discussion

Table 1 shows a 23% improvement in the F-values of the Spanish retrieval results against their English counterparts. This performance increase (from Rosemblat et al., 2004), especially in recall values, resulted entirely from the normalization of Spanish terms and from additions to and clean-up of the BTL. These changes were validated with a new, unedited query set (Table 4). The high F-values obtained in our tests (0.860–0.900) attest to the viability of dictionary-based CLIR, despite the known pitfalls of term-lookup lexical-based systems. Once some simple problems were resolved, the analysis highlighted a fundamental weakness of dictionary-lookup systems that do not implement context-based translation rules: translation gaps. BTL gaps accounted for 46% of the failure analysis overall, which pointed to a need for automated entry generation. Roughly 72% of the gaps were caused by English-Spanish pairs missing entirely, as opposed to missing specific translations for certain Spanish expressions. One potential solution would be to use the lexical resources of the monolingual system to ensure that entries in these resources map to translations in the BTL. This would take care of those missing BTL entries that are present in the lexical resources, such as Spanish synsets in the UMLS®, and could be used to increase coverage. In addition, Spanish queries from medical websites such as MedlinePlus en español may be mined to locate potential lexical candidates for addition to the BTL. An interrelated issue concerns conflicts between the BTL and existing lexical resources. Resolving this problem would require aligning and/or mapping the existing lexical resources and the term list, which goes against the philosophy of a pluggable, modular design. One mechanism to address this would be using the UMLS® to cover existing gaps: the UMLS® contains multilingual entries that could be used to extend the BTL and align the two resources.
But the UMLS®, like any lexical resource, is ever evolving, with constant updates and extensions. Keeping alignments or mappings in sync between two lexical resources could thus entail replacing one set of problems with another. Occasionally, English synonyms in the UMLS® coincide with semantically unrelated Spanish terms, a phenomenon known as a false cognate. These synonyms, missing in the BTL, could have a negative impact on precision for a Spanish search. For example, Spanish herpes has two BTL translations: herpes and shingles. The UMLS® offers several semantically related expressions for English herpes, among them zona, a bona fide English synonym (Dorland's Medical Dictionary, 2000). However, zona is also an unrelated Spanish term, meaning zone or area. Since English synonymy is included in the Spanish search, the Spanish query herpes will retrieve not only pertinent documents on this condition, but also some non-pertinent ones, for example on marginal zone lymphomas, translated in Spanish as linfomas de la zona marginal. Alternatively, the English search for herpes, which will include a search for English zona, will only retrieve pertinent records, as this term has no other English semantic senses and only English fields will be included in the search. Thus, adding all potential alternate translations in the BTL for a single Spanish expression magnifies retrieval in the Spanish search, hurting precision. Conversely, deleting some of the alternate translations may hurt both precision and recall for Spanish searches, as key documents may be missed altogether. Possible approaches include disambiguating translations based on frequency of usage and commonality of terms, including context-based rules, or a combination of both. These strategies could be applied either during translation or in a post-translation disambiguation module, but would require extending the BTL design to include frequency or context-based information for each translation entry.
Further research is required.

7 Future Developments

We have just completed a consumer-centered usability study to assess whether the Spanish prototype provides accessible, readable content that encourages Spanish-speaking users to read clinical trials information, learn about key health opportunities, and make appropriate decisions based on their own situations. The analysis and subsequent categorization of the different types of errors point the way to problems that arise when using a BTL approach to CLIR, and when working with ontologies in general. The next phase of the project requires extensive tools and manual labor to ensure the consistency of the BTL and reduce the number of gaps and over-extended translations. Automated methods for identifying and adding lexical entries for translation will extend the BTL and cover missing entries. Aligning it with other lexical resources should be seen as part of a larger project to curate and validate it. Once curated and validated, we will be able to provide the BTL as a free resource. This prototype offers immediate extensibility to monolingual health information systems. It can also be applied to creating controlled vocabulary translations of key documents in different languages, and extended to other websites, within and outside the health domain.

Notes

1. Available: http://www.clinicaltrials.gov/
2. U.S. National Library of Medicine
3. Enhancements to Essie, which further increase performance percentages, are not included.
4. Available: http://medlineplus.gov/spanish/
5. Rosemblat, Tse, & Gemoets (2004)
6. The 10-document cut-off required a constant clinical trials corpus for the values to be comparable.
7. In terms of zero results and low retrieval scores, against the English monolingual standard.
8. In the two instances in which the queries displayed problems that fell into multiple analysis categories, we coded the queries once for each applicable category.
9. Acknowledgements: We are largely indebted to Tony Tse for significant feedback and valuable contributions to earlier versions of this manuscript. This research was supported by the Intramural Research Program of the NIH, National Library of Medicine/Lister Hill National Center for Biomedical Communications.

References

Divita, Guy, Browne, Allen C. & Rindflesch, Thomas C. (1998). Evaluating lexical variant generation to improve information retrieval. In Proceedings of the American Medical Informatics Association Annual Symposium, pp. 775-779.
Dorland’s Illustrated Medical Dictionary, 29th Edition (2000). W.B. Saunders Company.

Lindberg, Donald A., Humphreys, Betsy L. & McCray, Alexa T. (1993). The Unified Medical Language System. Methods Inf Med 1993;32:281-91. Unified Medical Language System® (UMLS®). Available: http://www.nlm.nih.gov/research/umls
McCray, Alexa T., Ide, Nicholas C., Loane, Russell F. & Tse, Tony (2004). Strategies for supporting consumer health information seeking. Medinfo 2004:1152-6.
Rosemblat, Graciela, Tse, Tony & Gemoets, Darren (2004). Adapting a monolingual consumer health system for Spanish cross-language information retrieval. Proceedings of the 8th International ISKO Conference, London, England, pp. 315-321.
Van Rijsbergen, C.J. (1979). Information Retrieval, 2nd edition. Department of Computer Science, University of Glasgow.

Susanna Keränen Department of Information Studies, Faculty of Economics and Social Sciences, Åbo Akademi University, Turku, Finland

Equivalence and focus of translation in multicultural thesaurus construction

Abstract: This paper reports part of an on-going PhD study on problems related to multicultural social science thesaurus construction within the general frame of information science. The main analysis methods used are discourse analysis and co-word analysis. In the theoretical framework the emphasis is on communicative equivalence theories, and different aims of thesaurus translation are discussed. Some examples are given of how co-word analysis can be used to study contextual equivalence.

1 Background

This paper reports part of an on-going PhD study. The project deals with translation problems and indexing practices in creating multilingual and multicultural thesauri. The focus of this study is the translatability of British-English social science indexing terms into the Finnish language and culture at the concept, term and indexing-term level. The study is a qualitative case study, and both quantitative and qualitative methods are used. The main data collection method is the focused interview, and the main analysis methods are discourse analysis and bibliometric co-word analysis. The perspectives are linguistic and sociological, a combination through which a broader understanding of the phenomena is sought within the general frame of information science. The aim is twofold: 1) to identify the different discourses and vocabularies existing in a particular information domain and to see how they are considered in information storage and in multilingual and multicultural thesaurus construction, and 2) to operationalise the concept of equivalence in multicultural thesaurus construction work. The study is financed by the Academy of Finland (as part of a larger project, Cultural and linguistic differences in digital storage and retrieval of information) and by the Finnish Cultural Foundation. In this paper the emphasis is on the idea of translation strategies and on the co-word analysis method.

2 The study

2.1 Central concepts

Each information search in a database covers at least five different languages: those of the authors, the indexers, the synthetic structure, the users and the search strategy (Buckland 1999), all of which represent a type of discourse. An indexer's and a specialist's ways of expressing their ideas and thoughts on a certain social environment differ from each other, and in indexing this can cause problems. For example, when speaking about lone mothers a politician may use an eloquent term (re-miss and mother), a journalist an eye-catching one (single moms) and an indexer a term according to thesaurus practice (mothers --- divorced). The social sciences are connected not only to the development of science but also to the development of their surrounding culture and society. In a social science thesaurus this phenomenon is seen more clearly than in, for example, thesauri of technology or medicine. Language is not static (see e.g. Aitchison 1991; TSK 1989; Varantola 1990; Wierzbicka 1997; Lehtonen 2000), and therefore the language and documentation of the social sciences is tied to culture and time.

When analysing different discourses it is essential to be aware of the context where the discourses take place. In this study cultural context is understood, following Nida (1975, 229), as “the part of the context which includes both the total culture within which a communication takes place and the specific nonlinguistic circumstances of the communication”. At a general level, culture can be defined as referring to the customs, beliefs, and ways of life of a group of people. Culture has several subcultures, like those of indexers and social scientists, which are subsets of the main cultural group. When studying equivalence it is essential to understand the limits of equivalence and to accept them as an unavoidable reality.

“Every language is a self-contained system and, in a sense, no words or constructions of one language can have absolute equivalents in another. The idea that there might be some linguistic elements which are universal in the sense of having absolute equivalents in all the languages of the world is of course all the more fanciful.” (Wierzbicka 1991, 10)

This kind of uniqueness does not mean the end of the story, since, as Wierzbicka further states, when “we abandon the notion of absolute equivalents and absolute universals, we are free to investigate the idea of partial equivalents and partial universals; and if the former notion is sterile and useless, the latter idea is fruitful and necessary.” (Ibid.) Zethsen (2004, 126-127) states that equivalence “is thus not an absolute, but relative concept and functionality, not complete equivalence, is considered the sound aim of the translator”. Vehmas-Lehto (1999, 12) has given a careful definition of translating: when translating, one expresses with the means of the target language something that has been expressed earlier with the means of the source language. The definition shows that what is most important is the content of a translation, the meaning, not the language. (Ibid.)

2.2 Theoretical framework

When considering multicultural thesaurus construction, the heart of the theoretical context lies in indexing, in information seeking, in the relation between language and culture, and in equivalence. Not only translations, but also theories about translations are, or should be, context-bound (see e.g. Snell-Hornby 1988, 14; Chesterman & Arrojo 2000, 152). In the next section translation is discussed from the perspective of its function and strategy; since the area of the empirical studies in this project is thesaurus construction and translation, this perspective also provides the lens used in this theoretical section.

The theory of dynamic equivalence was developed by Eugene A. Nida in the 1960s and is the first communicative translation theory. ‘Dynamic’ refers to the idea that a translation should cause the same reaction in the target audience as the source text does in the original context. According to Nida (1964/2000) there are basically two different orientations in translating and thus two fundamentally different types of equivalence: formal and dynamic. If a translator aims at formal equivalence, the aim is to form the target-language equivalent in as much coherence with the elements of the original source-language expression as possible. Nida states that it is usually more advisable to aim at dynamic equivalence, where the relation between the translation and its receiver is the same as that between the original text and its receiver. He remarks that “--- of course, there are varying degrees of such dynamic-equivalence translations”. (Nida 1964/2000, 129-130) Nida has further defined translation as “reproducing in the receptor language the closest natural equivalent of the message of the source language, first in terms of meaning and second in terms of style.” (Vehmas-Lehto 1999, 54)

Although formal equivalence often means word-for-word translation, and the translation units in thesaurus translations are thesaurus terms, we can still assume that the fundamental idea behind dynamic equivalence is closer to the real aims of thesaurus translations as well. Translations usually do not aim to express how a foreign thesaurus is constructed, but instead to produce a functional thesaurus in the target language and culture context. Nida’s theory of dynamic equivalence has been further developed and is also called functional equivalence theory. Functional equivalence theory is based on the idea that the function of the source text is the same as, or similar to, the function of the target text. The function of the translation adapts to the source text function. (Vehmas-Lehto 1999, 70)

Skopos theory was developed by Reiss and Vermeer in the 1980s. Skopos means the intended purpose of the translation. In Skopos theory it is more important to fulfil the function of the translation than to translate in a certain style. However, the function of the target text is not necessarily the same as that of the source text in its original context. (Reiss & Vermeer 1986, 54-59; Koskinen 2001, 380) In Skopos theory the translation action is stressed: translating is doing something, and for a certain purpose. Vermeer describes a translation action as a particular sort of behaviour:

“for an act of behaviour to be called an action, the person performing it must (potentially) be able to explain why he acts as he does although he could have acted otherwise. Furthermore, genuine reasons for actions can always be formulated in terms of aims or statements of goals (as an action “with a good reason”---). (Vermeer 1989, 176)

When translating something, whether a novel or a thesaurus, there are also more general translation strategies than those presented above to consider and choose between. There are mainly two strategies to choose from when translating a text from one culture to another, namely domestication and foreignisation. When a text is brought closer to the reader in the target culture, made more recognisable and familiar, this is called domestication; the opposite, when the reader is taken over to the foreign culture and made to feel the cultural and linguistic difference, is called foreignisation. (Lindfors 2001, 6)

"This choice between domestication and foreignization is linked to questions of ethics, too: should the translator be accountable to the source or target culture, and to what extent? If target-cultural conventions are followed in the translation process, the text will be readily acceptable in the target culture, but it will inevitably lose some of the characteristics that would have given it a foreign or even exotic feeling.” (Lindfors 2001, 6)

Venuti (1995, 306) states that translating should never aim to remove dissimilarities between different cultures entirely. According to him, the routineness of fluent domestication has influenced the British and American cultures, which are “aggressively monolingual, unreceptive to the foreign, accustomed to fluent translations that invisibly inscribe foreign texts with English-language values and provide readers with the narcissistic experience of recognizing their own culture in cultural other”. Translating is not a value-free action, and choices are made at all stages of the process: what to translate, for whom, how, etc. (Venuti 1998, 67).

When considering multilingual and multicultural thesauri, the basic assumption is that the different language versions should work in their own linguistic-cultural surroundings, e.g. the English version of a thesaurus in England and the Finnish version in Finland, and also cross-culturally, so that e.g. a Finnish information seeker could make searches in an international database in Finnish and still retrieve documents written in English and indexed in English by a British indexer. We can also easily imagine a situation where an indexer would need a British concept in Finnish, for instance when indexing a British-English book into a Finnish-Finnish catalogue. Needs and expectations for multilingual and multicultural thesauri vary, as do the strategies for constructing them. How, then, to construct a thesaurus? There are not only two, but in practice three basic strategies: domestication, foreignisation, and internationalisation. When writing a fictional novel the author would hardly try to avoid using culture-bound words, but what about thesaurus constructors of multilingual and multicultural thesauri?

“The ELSST thesaurus will be created from the current UKDA HASSET. This will involve reducing the present hierarchies so that all cultural and institutional specificity are removed.” (Miller & Matthews 2001)

A kind of modern translation strategy is also “existential equivalence”, which Koskinen (2000) found to be a typical strategy within the European Union context.

“--- Especially in the case of lesser used languages like Finnish, the communicative function may often be subordinate to a symbolic function. Sometimes the primary function of the translation of a particular official document is simply to be there, to exist. Rather than just conveying a message or providing possibilities for communication, the role of the translation is then to stand as a proof of linguistic equality. ---” (Koskinen 2000, 83)

All translators make decisions between foreignisation and domestication translation strategies, although they are not necessarily aware of it (Ruokonen 2004, 63). An explicit discussion around these strategies would be valuable and would open new perspectives and help to evaluate predominant practices. (Ibid., 75)

2.3 Research questions, methods and material

The general research questions are:

• How are different discourses considered in information storage?
• How is equivalence understood in thesaurus construction guidelines and standards versus in modern communicative/functional translation studies versus in practice?
• What is the pragmatic indexing-term equivalence?

The general research questions are operationalised into several sub-questions:

• How are the studied concepts understood? To what extent are the differences due to institutional versus cultural differences? What is the semantic invariant?
• What are the studied terms about according to indexing and thesauri? How are the studied concepts used in indexing, and why?
• How is equivalence understood? What do thesaurus constructors, indexers and social scientists aim at in their translations? Do the potential thesaurus users share the same vision as the thesaurus constructors?

The theme “family roles” was selected as the case for the empirical part of the study. The emphasis is on human effort and on the Finnish language and culture. As a background question, it is asked what the sociological context of the studied concepts is in the source and in the target culture.

The empirical material and analysis consist of focused interviews (with Finnish and British social scientists, thesaurus constructors and indexers, 32 informants in all), simulated indexing tasks with Finnish and British indexers (six persons, five documents about care-giving issues and roles), semantic component analysis of dictionary definitions and translations, co-word analysis of datasets retrieved from four databases, and discourse analysis of ten thesauri.

Co-word analysis is very similar to co-citation analysis. Co-word analysis deals with co-occurrences of terms in documents, while co-citation analysis deals with shared citations. It is thus about the relatedness of terms rather than documents. (See Callon 1991; Persson 1991; Ungern-Sternberg 1994; Kärki & Kortelainen 1996; Horton & Coulter & Grant 1998; Schneider 2004; Forsman 2005.) In this study co-word analysis is used to study and evaluate the idea of dynamic equivalence in practice. The results of the co-word analysis can be shown as two-dimensional maps. In the maps the relative circle sizes correlate with the number of occurrences and the width of the lines correlates with the co-occurrence, i.e. the bigger the circle, the more often the descriptor has occurred in the material, and the thicker the line, the greater the interaction between the linked descriptors. The toolbox used for the co-word analysis is Bibexcel (BIBEXCEL 2004).

The word association method has been used earlier in the context of thesaurus construction as well. According to Nielsen (2002) the word association method may result in a usable and workable thesaurus, performing as well as a thesaurus designed by traditional thesaurus construction methods. The advantage of the word association method is that it may enrich thesaurus construction by providing current, relevant, and domain-specific information. (Ibid.)
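The counting behind such co-word maps can be sketched directly. The records below are invented for illustration; in the study, Bibexcel performs the equivalent frequency and co-occurrence counting:

```python
from collections import Counter
from itertools import combinations

# Illustrative descriptor sets, one per indexed record (invented, not the study's data).
records = [
    {"Family Roles", "Sex Roles", "Working Women"},
    {"Family Roles", "Mothers", "Working Women"},
    {"Family Roles", "Sex Roles"},
]

freq = Counter()   # maps to circle size: how often each descriptor occurs
cooc = Counter()   # maps to line width: how often two descriptors occur together
for descriptors in records:
    freq.update(descriptors)
    cooc.update(frozenset(p) for p in combinations(sorted(descriptors), 2))

MIN_OCC = 2  # the study used cut-offs of ten and five occurrences
kept = {d for d, n in freq.items() if n >= MIN_OCC}
print(sorted(kept))  # ['Family Roles', 'Sex Roles', 'Working Women']
```

Only descriptors above the occurrence threshold are drawn; the `freq` and `cooc` counters then determine circle sizes and line widths respectively.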

3 Findings

The interviews revealed clear differences both on the institutional and on the (geographical) cultural level. When considering equivalence problems, it was found that different expectations exist at the term level between social science researchers on the one hand and indexers and thesaurus constructors on the other, but the idea of dynamic equivalence was common to all of the interviewed groups. Social science researchers tend to consider the ideological motivations much more closely than information specialists, for whom the studied concepts represent tools. Still, the main difficulty seems to be a lack of common understanding of the function of indexing terms. The idea of an international thesaurus was often criticised: a thesaurus should reflect different cultures rather than an “international average”.

The co-word analysis has been used to illustrate and summarise the material, to visualise it and make it observable also to the reader of the research. When studying database indexing it gives a compact illustration of the main trends and of the common (indexing) understanding. The weakness of the method is that it may hide the nuances, which may be the most important aspects in a study like this, and it does not provide reasoning. In order to get a user perspective, and also the reasoning, a more qualitative research method and material are also needed. Some examples from the material are shown next.

In the following, the test concept “family roles” is illustrated through co-word analysis in one database and with Finnish and British social scientists, in order to compare the contextual equivalence. The international database studied was Cambridge Sociological Abstracts, and one of the keywords studied was “family roles” (query: DE="family roles" OR DE="family role") AND LA=English AND (CP=United Kingdom) AND (PT=article OR PT=aja, 07.03.2003).
The descriptor sample below includes 132 records (years 1985-2001), and the descriptors included in the co-word analysis had ten or more occurrences. The map below illustrates the co-occurrences and the network of the most used descriptors (the thickness of a line correlates with the relationship frequency and the size of a circle correlates with the frequency of the descriptor's occurrence):

Figure 1: FR-DE-search, 132 records published 1985-2001, co-occurrence of the descriptors with ten or more occurrences

Two major topics were found: 1) working life and the sexual division of labour, and 2) relations between family members. Within the first topic the strongest aspect is sex roles and conflicts. Genderism is also very clear: it co-occurred often with descriptors expressing females (Womens Roles, Working Women, Mothers). The other topic, relations between family members, is more gender-neutral. The descriptor Family Roles seems to be about social roles (e.g. working, care-giving roles) rather than about family member roles (e.g. mother, father, child, grandparent). The picture below illustrates the co-occurrences of a more recent sample (1995-2001):

Figure 2: FR-DE-search, 96 records published 1995-2002, co-occurrences of the descriptors with five or more occurrences

As in the first sample, the records are largely about women and families from the perspective of labour force participation and child care. But the emphasis has changed: the issue is even more a social one, viewed from the perspective of inequality problems. In the first sample the number of gendered terms was 23 out of 241; in the second sample it was 41 out of 862. While the proportion of gendered terms has thus decreased, the share of masculine-related terms among them has grown from about one sixth (4/23) to about half (20/41). The emphasis is also more on fathers and masculinity issues.

Co-word analysis and similar maps were also produced from the interview material (word associations). First the informants were asked to give about five response words for the stimulus term “family roles”. The Finnish social scientists interviewed (representing three different universities and four different social science disciplines) gave altogether 46 response words, 30 of them different:

Figure 3: Word associations for “family roles” by six Finnish social scientists (transl. by author)

The associations of the Finnish social scientists were clearly related to family members, and family was understood as a nuclear family with parents and children. As a research object it was associated more with the terms of statistics. Family roles was clearly a foreign concept, and it also connoted strongly as an old-fashioned concept (even with “Britishisms”). It was not understood as a modern common-language concept in Finnish, although some of the informants mentioned that it is coming back, but differently from the 1950s functionalist theory of Parsons (men with an instrumental role, women with an expressive role). But does the picture itself give this information? It does provide information about the denotative level of the studied concept, but it does not necessarily tell about the connotative aspect and usability of the studied concept. The informants tended to prefer giving criteria or themes according to which different roles could be divided and named, instead of actually naming roles. The themes mentioned were parenthood, parenting, grandparenthood, caregiving, participation in working life, sisterhood, sexuality, financial relationships, relatives and the kinsfolk system, adulthood, childhood, and marital status. The Finnish translation would be “perheroolit”, but the Finnish social scientists interviewed would often rather talk about tasks and responsibilities within a family, and they emphasised that each person in a family has several roles and that, for example, the caregiver and supporter role is often shared. In the best known and most used Finnish thesaurus, YSA (Yleinen suomalainen asiasanasto), the nearest equivalent is “family members” (“perheenjäsenet”). The six British social scientists interviewed (representing four different universities and five different social science disciplines) gave in all 45 word associations, 37 of them different:

Figure 4: Word associations for “family roles” by six British social scientists

For them the term “family roles” connoted mainly a social role, often together with words expressing transition and potential or clear conflict. In the word association responses the difference between the Finnish and the British informants is clear, but when the British social scientists named themes the difference was significantly smaller. When a term was considered to have a strong connotative meaning, this was clearly expressed in the word associations, as in the case of the term “breadwinners”:

Figure 5: Word associations for “breadwinners” by six British social scientists
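Before turning to the conclusions, the gendered-term counts reported in the findings above can be checked arithmetically. This sketch uses only the counts given in the text (23 of 241 and 41 of 862 gendered descriptors; 4 and 20 masculine-related terms among them):

```python
# Counts as reported in the findings above.
gendered_1, total_1 = 23, 241     # first sample (1985-2001)
gendered_2, total_2 = 41, 862     # second sample (1995-2002)
masculine_1, masculine_2 = 4, 20  # masculine-related terms among the gendered ones

# The share of gendered terms falls (about 9.5% to about 4.8%) ...
assert gendered_1 / total_1 > gendered_2 / total_2
# ... while the masculine share among them grows from roughly one sixth to roughly half.
print(round(masculine_1 / gendered_1, 2), round(masculine_2 / gendered_2, 2))  # 0.17 0.49
```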

4 Conclusions

The early demands placed on translations do not conform to the nature of language, and this understanding is now the main trend in modern translation theories, terminology guides and thesaurus standards alike. The division of equivalence types in standards and guides, and the relating of equivalence to the function of the translation in translation theories, are based on this understanding. In the thesaurus construction literature, too, the functions of translations should be considered more carefully instead of being taken for granted or left unspoken.

According to the results obtained in the study, co-word analysis seems to be a valuable method for studying contextual equivalence in indexing. It illustrates a kind of semantic network of the studied concept. In order to establish relevance equivalence, other methods are still needed, such as the focused interview. The study has shown how important it is to consider documentary languages, i.e. also thesauri, from the perspective of natural-language problems, and to consider the pragmatics and norms of social science discourse. It has also shown how important it is in translations to aim at predictability in the information-seeking context and to follow the norms of social science discourse. The true challenge is to construct thesauri that are multicultural and serve information seeking, not only multilingual ones for indexing. The concept of equivalence and the function of translation should be considered more carefully when constructing multilingual and multicultural thesauri, and operationalised also from a more pragmatic viewpoint (not solely from the linguistic point of view). Within a social science thesaurus context it is especially noteworthy to consider the connotative level of the word. Also in thesaurus construction literature and practice it could be useful to discuss the basic translation strategies, domestication, foreignisation and internationalisation, and their basic values and implications.
This has to do with the ethics of indexing, which is not as established a topic within information science as the ethics of translation is in translation studies.

References

Aitchison, J. (1991, 2nd ed.). Language Change: Progress or Decay? Suffolk: Fontana Paperbacks.
Bibexcel - a toolbox for bibliometricians. Version 2004-06-21. Developed by Persson, O., Inforsk, Umeå University, Sweden.
Buckland, M. (1999). Vocabulary as a central concept in library and information science. Preprint of paper published in Arpanac, T., Saracevic, T., Ingwersen, P. & Vakkari, P. (Eds.), Digital Libraries: Interdisciplinary Concepts, Challenges, and Opportunities. Proceedings of the Third International Conference on Conceptions of Library and Information Science (CoLIS3, Dubrovnik, Croatia, 23-26 May 1999) (pp. 3-12). Zagreb: Lokve. Retrieved 2000 from http://www.sims.berkeley.edu/~buckland/colisvoc.htm
Callon, M., Courtial, J.P. & Laville, F. (1991). Co-word analysis as a tool for describing the network of interactions between basic and technological research: The case of polymer chemistry. Scientometrics, Vol. 22, No. 1, pp. 155-205.
Chesterman, A. & Arrojo, R. (2000). Shared ground in translation studies. Target 12:1, 151-160.
Forsman, M. (2005). Development of research networks: The case of social capital. Doctoral dissertation. Åbo: Åbo Akademi University Press.
Horton, T., Coulter, N. & Grant, E. (1998). Applying content analysis to humanities computing research literature. Joint International Conference ALLC/ACH'98, Association for Literary and Linguistic Computing, Association for Computers and the Humanities, 5-10 July 1998, Lajos Kossuth University, Debrecen, Hungary. Retrieved April 23, 2001, from http://lingua.arts.klte.hu/allcach98/abst/abs20.htm
Koskinen, K. (2000). Beyond Ambivalence. Postmodernity and the Ethics of Translation. Doctoral dissertation. Acta Electronica Universitatis Tamperensis 65, Tampereen yliopisto. Retrieved December 7, 2005, from http://acta.uta.fi/pdf/951-44-4941-X.pdf
Kärki, R. & Kortelainen, T. (1996). Johdatus bibliometriikkaan. Tampere: Informaatiotutkimuksen yhdistys.
Lehtonen, M. (2000, 3rd ed.). Merkitysten maailma. Kulttuurisen tekstintutkimuksen lähtökohtia. Tampere: Osuuskunta Vastapaino.
Lindfors, A-M. (2001). Respect or ridicule: Translation strategies and the images of a foreign culture. Helsinki English Studies. The Electronic Journal of the Department of English at the University of Helsinki, Volume 1, 2001. Retrieved December 7, 2005, from http://www.eng.helsinki.fi/hes/Translation/respect_or_ridicule1.htm
Miller, K. & Matthews, B. (2001). Having the right connections: the LIMBER project. Journal of Digital Information, Volume 1, Issue 8. Retrieved 2001, from http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Miller/
Nida, E. A. (1975). Componential Analysis of Meaning. An Introduction to Semantic Structures. Paris: Mouton, The Hague.
Nielsen, M. L. (2002). The Word Association Method: A Gateway to Work-Task Based Retrieval. Doctoral dissertation. Åbo: Åbo Akademi University Press.
Persson, O. (1991). Forskning i bibliometrisk belysning. Umeå: INUM.
Reiss, K. & Vermeer, H. J. (1986). Mitä kääntäminen on. [Original work: Grundlegung einer allgemeinen Translationstheorie (1984). Transl. and ed. Roinila, P.] Helsinki: Gaudeamus.

Ruokonen, M. (2004). Schleiermacher, Berman ja Venuti: kolme käännösteoreettista näkökulmaa vieraannuttamiseen. In Tommola, J. (Ed.), Kieli, teksti ja kääntäminen. Language, text and translation (pp. 63-80). Turku: Turun yliopisto.
Schneider, J. (2004). Verification of Bibliometric Methods' Applicability for Thesaurus Construction. Doctoral dissertation. Aalborg: Department of Information Studies, Royal School of Library and Information Science.
Snell-Hornby, M. (1988). Translation Studies. An Integrated Approach. Amsterdam: John Benjamins.
The Finnish Terminology Centre TSK (1989). Sanastotyön käsikirja. Soveltavan terminologian periaatteet ja työmenetelmät (TSK 14, SFS-käsikirja 50). Jyväskylä: Suomen Standardisoimisliitto SFS.
Ungern-Sternberg, S. von (1994). Verktyg för planering av tvärvetenskaplig informationsförsörjning. En tillämpning på ämnesområdet bioteknik i Finland. Doctoral dissertation. Åbo: Åbo Akademis förlag.
Varantola, K. (1990). Tekniikan suomi yhdentyvässä Euroopassa. Sanastotyön merkitystä koskeva selvitys. Helsinki: Tekniikan Sanastokeskus ry (TSK).
Vehmas-Lehto, I. (1999). Kopiointia vai kommunikointia. Johdatus käännösteoriaan. Helsinki: Oy Finn Lectura Ab.
Venuti, L. (1995). The Translator's Invisibility. A History of Translation. London / New York: Routledge.
Venuti, L. (1998). The Scandals of Translation: Towards an Ethics of Difference. London / New York: Routledge.
Venuti, L. (Ed.) (2000). The Translation Studies Reader. London: Routledge.
Vermeer, H. J. (1989). Skopos and commission in translational action. In Chesterman, A. (Ed.), Readings in Translation Theory (pp. 173-187). Loimaa: Oy Finn Lectura Ab.
Wierzbicka, A. (1991). Cross-Cultural Pragmatics: The Semantics of Human Interaction. Berlin: Mouton de Gruyter.
Wierzbicka, A. (1997). Understanding Cultures Through Their Key Words. English, Russian, Polish, German, and Japanese. New York: Oxford University Press.
Zethsen, K. K. (2004). Latin-based terms: True or false friends? Target 16:1 (2004), 125-142.

Marianne Dabbadie
University of Lille 3 – GERIICO Laboratory (EA 1060), France

Jean-Marc Blancherie
i-KM – Intelligence Knowledge Management – Levens – France

Alexandria, a multilingual dictionary for Knowledge Management purposes

Abstract: Alexandria is an innovation of international impact: the only multilingual dictionary for websites and PCs. A double click on a word opens a small window that gives interactive translations between 22 languages and includes meanings, synonyms and associated expressions. It is an ASP application grounded on a semantic network and portable to any operating system or platform. Behind the application lies the Integral Dictionary, the semantic network created by Memodata. Alexandria can be customized with specific vocabulary, descriptive articles, images, sounds, videos, etc. Its domains of application are considerable: e-tourism, online media, language learning, international websites. Alexandria has also proved to be a basic tool for knowledge management purposes, since the application can be customized according to a user's or an organization's needs. An application dedicated to mobile devices is currently under development. Future developments are planned in the field of e-tourism in relation with the French « pôles de compétitivité ».

1 Introduction
In the context of the globalisation of exchanges, companies feel an increasing need to gather and capitalise on multilingual strategic information drawn from the web as well as from various other information sources. This is why direct access to sense is crucial to the definition of an organisation's business intelligence strategy.

Alexandria is an innovation of international impact. It is the only multilingual dictionary for websites and PCs: a double click on a word opens a small window that gives interactive translations between 22 languages and includes meanings, synonyms and associated expressions. It is unique and has no equivalent, whether in private or academic research. The developments made for French and English include definitions. The dictionary can be queried from the Internet, intranets, collaborative, e-learning and e-commerce platforms, or directly on a PC. Alexandria can be customized with specific vocabulary, descriptive articles, images, sounds, videos, etc. Its domains of application are considerable: e-tourism, online media, language learning, international websites. The PC version allows diving into languages and international information monitoring.

Alexandria is not a software package: a few HTML code lines added to a website or platform give any user the possibility to query instantaneously the « Integral Dictionary », the result of 15 years of private and public research on semantic networks and sensemaking in multilingual environments. The knowledge contained in the application can be expressed through the number of words contained in the semantic network: the application gives access to over 3 million words and definitions in twenty-two languages, through the work performed by Memodata on the semantic network.

2 Knowledge sharing through customization
Alexandria has also proved to be a basic tool for knowledge management purposes. The application can be customized according to a user's or an organization's needs. Specific vocabularies or encyclopedic information can be added to the database and accessed through a double click: newcomers and project teams can thus share an organization's specific knowledge. The teams of different departments or foreign offices of a company that produce a technical language can thus communicate in that language. Specific vocabulary, project explanation slips in an independent window, the specific know-how and methods of technicians or managers who will soon retire: the different company entities can share a common and explicit language. The company and its clients are therefore able to share knowledge, learn the same notions and thus speak the same language. It is also possible for administrations to disseminate specific notions or vocabularies through their websites. In France, two ministries are considering the possibility of simplifying access to the notions of electronic administration by making use of the Integral Dictionary through customization of Alexandria.

Through the semantic network, the dictionary gives access to a broad variety of functionalities. It can display definitions, explanations and news, with unlimited possibilities of association, uses and combinations of the technological supports that display contents. This possibility to display specific encyclopedic contents gives a strong added value to the information thus disseminated. By querying the name of a town, the system can display information about a public exhibition. In the following example, the query sent to the Integral Dictionary is the French town of "Avignon". The system then displays geographical information about the town together with hot information about the local art festival.

Figure 1 – Local contents with hot information display
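The customization mechanism behind such a display can be sketched roughly as follows. This is an illustrative reconstruction, not Memodata's actual code: the data, entry names and function names are all invented for the example.

```javascript
// Hypothetical sketch of Alexandria customization: organization-specific
// entries are checked before the general dictionary, so a query such as
// "Avignon" can return local encyclopedic content and hot information.
// All data here is invented for illustration.
const customEntries = {
  avignon: {
    article: "City in Provence, on the Rhône; seat of the popes 1309-1377.",
    news: "Festival d'Avignon: theatre programme for July.",
  },
};

const generalDictionary = {
  avignon: { definition: "Proper noun: a French town." },
};

// Look up a word, preferring customized content over the base dictionary.
function lookup(word) {
  const key = word.trim().toLowerCase();
  return customEntries[key] || generalDictionary[key] || null;
}
```

In this sketch the customized entry shadows the base dictionary entry, which matches the behaviour described above: the same double click serves both general definitions and organization-specific content.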

3 Technological background
Alexandria is not a software package. It is an ASP application grounded on a semantic network that is portable to any operating system or platform. Technically, a double click on a word opens a JavaScript window that displays a definition or translation in twenty-two different languages. The following table summarizes the functionalities of the dictionary:

1/ In English and French:
– 100,000 definitions
– Synonyms and associated expressions

2/ Translations of words and expressions:
– Between 22 languages: English, German, Arabic, Bulgarian, Chinese, Korean, Spanish, Estonian, French, Greek, Italian, Japanese, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Swedish, Czech, Turkish
– The user can deepen his search by also clicking the words displayed in the window, and thus navigate in a world of information.

3/ Access to specific added information:
Specific information can be displayed by sector. For example, tourism businesses, pharmaceutical companies or holding companies can add specific vocabulary, including information slips about medicines, a project, a monument, etc. In each domain it is thus possible to create a user-oriented mini-encyclopaedia.

The application becomes accessible on a website once the webmaster has added a few HTML code lines to it. Alexandria is also an innovation that simplifies information access: with a double click on a word, a website visitor can access its definition in French and English (100,000 entries) and its translation into 20 other languages, in a small independent window. The information displayed by Alexandria is a partial view of the Integral Dictionary, a multilingual semantic network developed in France by Memodata, a language processing company, and the CRISCO CNRS laboratory. The Integral Dictionary has integrated data from the Wordnet and Balkanet programmes, and the semantic network shifts from words to ideas by linking synonyms, antonyms and domain-specific information based on isotopies. Following a query, Alexandria displays a suite of synonyms and associated expressions in which the queried word can be found.
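The paper does not reproduce the embed code itself, but the mechanism it describes (a double click opening a small translation window) can be sketched roughly as follows. The endpoint URL and parameter names are assumptions invented for illustration, not Memodata's actual API.

```javascript
// Hypothetical sketch of the Alexandria embed: a double click on any
// word in the page opens a small popup window querying the dictionary.
// The endpoint and parameter names below are invented.

// Build the query URL for a selected word (pure function, easy to test).
function buildQueryUrl(word, sourceLang, targetLang) {
  const base = "https://dictionary.example.com/lookup"; // assumed endpoint
  const params = new URLSearchParams({
    q: word.trim().toLowerCase(),
    from: sourceLang,
    to: targetLang,
  });
  return `${base}?${params.toString()}`;
}

// Wire the double-click handler (browser-only part of the sketch).
function installAlexandria(sourceLang, targetLang) {
  document.addEventListener("dblclick", () => {
    const word = String(window.getSelection()).trim();
    if (!word) return;
    // Open the translation in a small independent window,
    // as described in the paper.
    window.open(buildQueryUrl(word, sourceLang, targetLang),
                "alexandria", "width=320,height=400");
  });
}
```

The design point the paper makes is that only the few lines calling `installAlexandria` live on the host site; the dictionary itself stays on the server, which is why the integration requires no software installation.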

3.1 The Integral Dictionary
The Integral Dictionary is the semantic network created by Memodata. It covers 22 languages to date. Its name refers to the multiplicity of its possible uses, to its organisation, and to the many different types of information it contains. The dictionary consists of a databank that is processed through a relational database. It is represented by a semantic graph, as in the following example in French.

Figure 2 – Illustration of the Integral Dictionary semantic graph: lexical realisations are represented inside the circle

The contents of the Integral Dictionary are a connection between words, expressions (or collocations) and concepts. The Integral Dictionary combines a syntactic layer with distinct levels of utterance, together with many different types of expression that can link words and concepts. The relations are oriented according to a "father-son" model.
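As a rough sketch of the data structure this implies (the concepts, words and traversal below are invented for illustration and do not reproduce Memodata's actual model), an oriented "father-son" network linking concepts to their lexical realisations might look like this:

```javascript
// Minimal sketch of an oriented "father-son" semantic network:
// each concept points to its sons (more specific concepts) and to the
// lexical realisations (words) attached to it. All data is illustrative.
const network = {
  vehicle: { sons: ["car", "bicycle"], words: ["véhicule"] },
  car:     { sons: [],                 words: ["voiture", "automobile"] },
  bicycle: { sons: [],                 words: ["vélo", "bicyclette"] },
};

// Collect every word reachable under a concept by following
// father-to-son links (depth-first traversal).
function wordsUnder(concept) {
  const node = network[concept];
  if (!node) return [];
  return node.words.concat(...node.sons.map(wordsUnder));
}
```

Navigating from "vehicle" down the oriented relations gathers the realisations of all descendant concepts, which is the kind of word-to-idea shift the text describes.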

4 Alexandria, the KM springboard
The information monitoring domain, whether it concerns information filtering, information routing, Knowledge Management applications, collaborative platforms or social networks, is at the crossroads of different fields that did not previously share common issues. Technology providers now meet document processing specialists and take on similar priorities. In this context, the translation tasks inherent to multilingual information monitoring systems have become a crucial issue for the building of a knowledge-based economy.

4.1 Managing explicit knowledge By favoring the standardisation and dissemination of technical or patrimonial knowledge, Alexandria has the potential to become a reference tool for the quick gathering and dissemination of specific vocabulary and cultural knowledge.

4.1.1 Reaffirming knowledge explicitation
Alexandria encourages the process of making knowledge explicit. Specialized lexicons are a support that makes explicit the terms used by an organization or a community. Beyond the contents of a definition, uses, explanations and standards are described, illustrated and disseminated more widely than through any other medium. Explicit knowledge can be formalized and reaffirmed by the simple fact of its inclusion in Alexandria. Reaffirming an explicit piece of knowledge addresses collective knowledge acquisition issues, together with organisational and cultural evolutions in the case, for example, of company mergers, partnerships or the development of distant collaborative cooperation.

4.1.2 Agreement between partners
A participative process can lead to consensual definitions, knowledge explanations or the choice of shared multimedia documents. The KM processes that take Alexandria as a support encourage the participative creation of a shared common language. These processes can also aim, more generally, at sense disambiguation.

4.1.3 Procedural knowledge
Alexandria is a procedural knowledge base that is called upon only when necessary: its contents are always connected with the needs of a user in a work situation. A specific knowledge bank remains a complementary tool, but Alexandria may become the support of any knowledge item, whatever the media used to display it.

4.1.4 Speed and ubiquity
Alexandria gives explicit knowledge unprecedented characteristics. It is neither centralized in a database or on a platform, nor disseminated across thousands of documents. Knowledge becomes permanently accessible in the background of any document or process, and appears whenever it is needed. Knowledge spreads through the immaterial information media of any company or virtual community. Alexandria makes it ubiquitous: knowledge is everywhere behind the word, and solicited only in a specific context, when a user wants it to be displayed.

4.1.5 Levels of interaction and navigation in the knowledge network
Alexandria in fact comprises various possible levels of sense production and dissemination:
1/ According to the principles and ergonomics adopted: the integration of new items into the semantic network encourages interpretation and navigation in a network of knowledge.
2/ According to the modes of knowledge explicitation, the circumstances and methods will vary. For example, this induces the constitution of knowledge banks or knowledge gathering. Transmission takes place through definitions, explicitations or demonstrations, storytelling, podcasts and videos, with a natural prolongation through blogs and communities of practice.
3/ According to the induced contacts: towards domain experts, managers or communities of practice, who may thus become a reference for the domains and issues addressed in this context. It then becomes possible to see the semantic network as a new entry into a sub-network that comprises reference experts.

4.2 The hermeneutic function: representation
The semantic network, and the possibility of navigating the network itself, its themes and semantic classes and their associated terms, whether generic or specific, offers an original method of knowledge acquisition based on interpretation. It is a methodology particularly adapted to complexity: it makes it possible to reach the borders of knowledge while remaining centered on the original query. A set of pieces of knowledge often proves necessary to answer a formulated query. These pieces can belong to different paradigms or result from a combination of meanings that challenge our capacities of interpretation. Alexandria and its associated navigation support our capacity to discriminate between senses.

4.3 Skill referentials, experience directories, problem solving
Alexandria can be enriched with an XML database that contains a company's or profession's skill referential. This referential makes it possible to deepen and illustrate, in concrete terms, practices, concepts, and the experience they refer to. For example, a referential created for nutritionists (produced by the French association of nutritionists) indicates:

"Decide of, conceive, conduct, adapt, therapies or projects x Decide of, conceive, conduct in the field of nutrition prevention (education, communication, product conception, etc.) x Decide of reasonable and well documented therapies in relation with the diagnosis established, while taking specific and individualised treatment measures. x Conceive and adapt teaching and nutritional education strategies whether they be individual or collective"

This statement could also lead to project returns of experience, or to the setting of a dietetic diagnosis associated with a network of concepts that could clarify this diagnosis and be used as a decision-making assistance tool. Similarly, technical terms, whether linked to skills or knowledge, can be related to an experience directory made of the compilation of voluntary testimonies. The entry point to this knowledge base can be a search engine that would use Alexandria for result analysis on the one hand, and the semantic network for contextual and intelligent indexing and information processing on the other. In effect, this would provide an interactive data compound constituted of operating representations: a professional does not act according to stimuli but from the interactive representation of these stimuli (Le Boterf, 1999, p. 157). The choice of the adapted representation, guided by the semantic network, also allows a professional to make the right decision and find his way out of confusion.
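The paper does not show what such an XML skill referential looks like inside Alexandria. A minimal sketch of one entry, with entirely invented element names and content, might be:

```xml
<!-- Hypothetical skill-referential entry; every element name here is
     illustrative, not the actual schema used by Alexandria. -->
<skill id="nutrition-therapy">
  <label lang="en">Decide on, conceive, conduct and adapt therapies</label>
  <domain>nutrition prevention</domain>
  <concepts>
    <concept ref="diagnosis"/>
    <concept ref="individualised-treatment"/>
  </concepts>
  <testimonies>
    <testimony author="anonymous">Example of a dietetic diagnosis and
    the treatment decision it supported.</testimony>
  </testimonies>
</skill>
```

Linking each skill to concept references is what would let the semantic network serve as the contextual index described above, while the testimonies realise the experience directory.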

4.4 Multilingualism and a shared language
Sharing a common language is an increasing need in a globalization context. Alexandria brings an immediate response to this need, from which the most commonly used Knowledge Management tools can benefit immediately without any change in their structure. Communities of practice may share knowledge and experiences without stumbling over the obstacles of misunderstanding and cross-language ambiguities that often depreciate the value of the exchange. Social interactions, project teams, local actor networks, clusters and « pôles de compétitivité »: any entity that needs to share a clear language and agree on meanings or key representations can rely on this new tool to produce a common meaning and invent new ways of communicating, understanding each other and explaining concepts. In this configuration, the dominant logic no longer lies in information filtering but in the re-appropriation and sharing of a common meaning.

5 Mobility
5.1 Alexandria micro application
An application dedicated to mobile devices is currently being developed. It is a MIDlet application designed for smart phones. Once the MIDlet is selected, the user presses a command key at the bottom right of the screen. When the MIDlet is launched, the application's main screen appears as illustrated below.

Figure 3 – Smart Phone application display

As can be seen on the main screen, only minimal information is displayed. A user accustomed to searching for word translations into a given target language can thus directly type the word into the query window and press the « submit » key. If the user wishes to modify either the source or the target language, he can make his choice with the up and down keys and press the select key. The language selection screen then appears.

Figure 4 – source language window

Once the settings have been chosen, the user enters the word to be translated. After pressing the « submit » key, the results appear.

Figure 5 - Query and result window

If the result does not fit into the window, the user presses the « down » key to display the next part. The parameters are memorized when the user has finished using the application, so the same source and target languages can be used again in the next session.

5.2 Further developments
Further developments identified in relation with the needs of mobile device users are the following:

– The description of authentication mechanisms that allow the identification of a user of a smartphone or PDA.
– The security mechanisms available on mobile terminals (secured protocols, secured applications, etc.).
– Micro-payment technologies on mobile terminals.
– Multi-threading management on PDA operating systems.
– Screenshot management on mobile terminals, including key combinations on mobile phones.
– Validation of the « Alexandria mobile » application now developed for Java-compatible mobile phones.

6 Projects, realizations and future developments
The creators of the system wish to increase the size of the semantic network and extend its multilingual lexical coverage through the addition of thesauri and large multilingual databases. The MeSH, CISMeF and EUROVOC thesauri have already been added to Alexandria.

Through a partnership with i-KM, a company specialized in language processing research and Knowledge Management issues, Alexandria is reaching another dimension: specific adaptations and vocabularies are being created, particularly in the field of e-tourism. A collaborative platform designed to share specific vocabulary and accessible to domain specialists has been created, and multilingual vocabularies are being built in a collaborative mode by domain experts. In addition to the inclusion of domain-specific vocabulary in a knowledge base, this unprecedented collaboration between domain experts aims at creating a concept-based, reliable, domain-specific referral system. This experience, carried out in the field of e-tourism, will create a reproducible model that can be adapted to other domains and used to generate other domain-specific professional referral systems. Collaboration with the French usage laboratory is presently under study to integrate Alexandria into the French SCS « pôle de compétitivité » and to consider the development of various e-tourism mobile applications and other developments of the network for knowledge management purposes.

7 Bibliography
Ballay, J.-F. (2002). Tous managers du savoir ! Paris: Éditions d'Organisation. 431 p.
Dabbadie, M. (2004). Alexandria, dictionnaire multilingue pour sites web. Veille Magazine, No. 81, May 2004.
Dabbadie, M. & Fraysse, F. (2004). Coaching applications: a new concept for usage testing on information systems. In Proceedings of the Eighth International ISKO Conference, London, 13-16 July 2004.
Dutoit, D. (1992). A set-theoretic approach to lexical semantics. In Proceedings of COLING 1992.
Dutoit, D. (2000). Quelques opérations sens-texte et texte-sens utilisant une sémantique linguistique universaliste a priori. PhD thesis, Université de Caen.
Jouis, C. & Mustafa El Hadi, W. (1996). Approche sémantique par exploration contextuelle pour l'aide à la construction de terminologies. Vers une intégration à une approche statistique. In JSLB '96 (Journées de Sémantique Lexicale de Brest), ENST de Brest, 11-13 September 1996.
Le Boterf, G. (1999). Compétence et navigation professionnelle. Paris: Éditions d'Organisation. 324 p.
Mustafa El Hadi, W. (1998). Automatic term recognition & extraction tools: examining the new interfaces and their effective communication role in LSP discourse. In Mustafa El Hadi, W., Maniez, J. & Pollitt, S. (Eds.), Structures and Relations in Knowledge Organization: Proceedings of the Fifth International ISKO Conference, Lille, 25-29 August 1998, pp. 204-212.

Elin K. Jacob, PhD Indiana University, Bloomington, Indiana, USA

Hanne Albrechtsen, PhD Institute of Knowledge Sharing, Copenhagen, Denmark

Nicolas George Indiana University, Bloomington, Indiana, USA

Empirical analysis and evaluation of a metadata scheme for representing pedagogical resources in a digital library for educators

Abstract: This paper introduces the Just-in-Time Teaching (JiTT) digital library and describes the pedagogical nature of the resources that make up this library for educators. Because resources in this library are stored in the form of metadata records, the utility of the metadata scheme, its elements and its relationships is central to the ability of the library to address the pedagogical needs of instructors in the work domain of the classroom. The analytic framework provided by cognitive work analysis (CWA) is proposed as an innovative approach for evaluating the effectiveness of the JiTT metadata scheme. CWA is also discussed as an approach to assessing the ability of this extensive networked library to create a common digital environment that fosters cooperation and collaboration among instructors.

1. The Just-in-Time Teaching Digital Library In 2000, the National Science Foundation [NSF] initiated the National Science Digital Library Program [NSDL] in an effort to promote the development of innovative educational resources and to pioneer original methods for delivery of instruction in science, technology, engineering and mathematics [STEM]. Digital libraries currently funded by the NSDL Program are intended to offer "organized access to collections and services from resource contributors that represent the best of public and private institutions including universities, museums, commercial publishers, governmental agencies, and professional societies" (NSDL 2004, p. 1). While many of the NSDL libraries are focused on the development of digital collections addressing specific areas of instruction (e.g., geoscience, animal behavior, health sciences and microeconomics), there are several libraries whose mandate is to provide instructional support for STEM teachers. One such library whose primary motivation is to provide support for the classroom activities of STEM instructors is the Just-in-Time Teaching Digital Library (JiTTDL). JiTTDL is a web-based collection of pedagogical resources that have been developed to support the instructional methodology known as Just-in-Time Teaching, or JiTT (Novak et al., 1999). Instructors who have adopted the JiTT approach rely on pre-class "WarmUp" questions to enhance student-teacher interaction in the classroom: prior to class, students are given a carefully constructed web-based (digital) assignment and must submit their responses electronically to the instructor before the start of class. The instructor then reviews the responses from her students “just-in-time” to tailor her in-class instruction to respond to the level(s) of understanding (or misunderstanding) indicated by student responses. 
JiTT methods have been adopted in a number of high school and college classrooms throughout the United States: currently, there are approximately 400 instructors in 25 different disciplines at more than 100 different institutions who have implemented this pedagogical method. JiTT instructors represent a wide range of disciplines from the humanities (history and journalism), the social sciences (accounting, finance, economics and psychology), the hard sciences (astronomy, biology, chemistry, geology and physics), the applied sciences (nursing and engineering), and mathematics. Collectively, these instructors have amassed an impressive storehouse of several thousand JiTT-based resources, including digital simulations, follow-up classroom activities, assessment tools and instructional support materials as well as pre-class warmup questions. In addition to these instructional materials, many JiTT instructors have also accumulated a wealth of examples of student responses to specific JiTT assignments; these collections of student responses have frequently been analyzed and annotated by the contributing instructors to illustrate various levels of student understanding. The JiTT Digital Library is intended both to provide a centralized storehouse or archive for the storage and exchange of these JiTT resources and to serve as the platform for the development of an interactive, online community where instructors can collaborate on the revision, enhancement and extension of existing resources as well as the development of new materials. In the JiTT digital library, each resource exists only as a metadata record; that is, a resource is stored in a relational database as a set of element-value pairs, or statements. This mode of representation was chosen, in part, because it allows the individual instructor to tailor the online presentation of retrieved resources by selecting only those elements of the metadata record that are relevant to her immediate need.
For example, one instructor, having retrieved a set of resources dealing with a particular topic, might want to display nothing more than the description, audience level and required student reading(s) for each resource; another instructor, having already selected a particular warmup question or in-class activity, might want to view previous student responses and comments from other JiTT instructors as to how student responses to this question were used to structure in-class presentation of the instructional content. Storing each resource in the form of a metadata record provides the flexibility of presentation necessary to meet the very different needs of JiTTDL users. Another important consideration that influenced the selection of metadata records as the format for storing JiTT resources was the need to support the potentially dynamic nature of resources in the JiTT library. One of the primary objectives of JiTTDL is support for collaboration in the development and evolution of resources and for the exchange of in-class experiences among JiTT instructors. By storing a JiTT instructional resource as a metadata record, any registered member of the JiTT online community can comment on the success of certain JiTT materials, report student responses to a particular activity, or share modifications in the application of an existing resource. If resources were stored in the library as either html or pdf documents, for example, the process of updating an existing resource in a timely fashion would be a cumbersome task. In contrast, storing each resource as a metadata record allows for dynamic updating of existing resources as well as the attribution of authorship for new or modified content, while simultaneously maintaining, in its original form, the intellectual content and authorship of the original resource.
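The storage model described above can be illustrated with a small sketch. The element names and values below are invented for the example and are not the actual JiTT metadata scheme:

```javascript
// Hypothetical sketch: a JiTT resource stored as a set of
// element-value statements, plus a projection that lets each
// instructor display only the elements she needs.
const resource = [
  { element: "title",       value: "Projectile motion warmup" },
  { element: "discipline",  value: "physics" },
  { element: "audience",    value: "introductory undergraduate" },
  { element: "description", value: "Pre-class question on trajectories" },
  { element: "comment",     value: "Worked well before lecture 7" },
];

// Return only the statements whose element is in `wanted`,
// preserving the order in which the elements were requested.
function project(statements, wanted) {
  return wanted
    .map(e => statements.find(s => s.element === e))
    .filter(s => s !== undefined);
}
```

Because the resource is just a set of statements, adding a new comment is appending one more statement (with its own author) rather than re-editing a document, which is the updating advantage the paragraph above describes.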
The metadata scheme used to represent JiTTDL resources was designed to provide basic information about a resource (e.g., author of the resource, discipline, type of activity or assignment, topic, audience level, etc.) and to address a set of questions that might typically be posed by an instructor (e.g., What classroom strategies have been used with this assignment? Are there assignments that lead up to or follow from this activity?). These questions were gathered from novice and expert JiTT implementers during a series of JiTT workshops and were analyzed to identify additional data elements that would be of value to instructors; but the basic element set was developed based on the intuitions of a small group of researchers, only one of whom was an experienced JiTT instructor. And, while metadata elements were extended and refined during an iterative process of application and revision, the utility and appropriateness of the metadata scheme for the community of JiTT instructors was not evaluated during the process of scheme development. This paper reports on the use of cognitive work analysis (CWA) as a theoretical framework for evaluation of the JiTT metadata scheme in order to determine how well the scheme supports the instructional needs of JiTT instructors, both in the work domain of the classroom and in the collaborative environment provided by the digital library.

2. Cognitive Work Analysis
Cognitive work analysis (Rasmussen, Pejtersen & Schmidt, 1990; Rasmussen, Pejtersen & Goodstein, 1994; Vicente, 1999; Sanderson, 2003) provides a theoretical and methodological framework for a work-centered approach to the design of large-scale information systems and the empirical investigation of how information technology changes human conditions of work. Because the CWA framework originated in the cross-disciplinary and problem-oriented international research environment of Risoe National Laboratory, Denmark, it has drawn on a diversity of theoretical backgrounds ranging from general systems theory to cognitive psychology and the sociology of work (see Rasmussen, Pejtersen and Goodstein, 1994; Albrechtsen et al., 2001; Sanderson, 2004; Hollnagel and Woods, 2005). In addition, field studies and field experiments addressing users' work activities with information technology in a variety of work domains have contributed to the development of CWA as a generic approach to the analysis of work environments and the evaluation of complex information systems. While CWA does not prescribe particular theoretical perspectives for the researcher, it does offer an integrative approach to the analysis of work domains and the evaluation of complex information systems. The first key feature in the development of this approach was the taxonomy of work described by Rasmussen (Rasmussen, Pejtersen & Schmidt, 1990). Based on empirical findings from field studies and field experiments, the taxonomy of work represents the various aspects of work domains, work tasks and users. To include the perspective of human-computer interaction, the original taxonomy was extended by Rasmussen's skills-rules-knowledge taxonomy (Rasmussen, 1983), thereby developing a broader framework for systems analysis and design.
Taken together, the taxonomy of work and the taxonomy of human-computer interaction constitute the conceptual framework of cognitive systems engineering (CSE), which comprises the analytical framework of CWA (Rasmussen, Pejtersen & Goodstein, 1994). A major challenge in the development of CWA has been coupling the understanding of information systems as sociotechnical systems (which focuses on the collective features of domains and systems) with an emphasis on the cognitive aspects of information system use (which focuses on individual users' preferences, knowledge and strategies). Continuing development of the framework has addressed the collective aspects of information system use, for example, through empirical analysis of collaborative information searching (Fidel et al., 2004) and collaborative knowledge organization (Albrechtsen, 2003; Albrechtsen et al., 2004). The research described here represents a new application of CWA as a framework for evaluating the utility of a metadata scheme within a specific work domain. More importantly, this innovative application of CWA involves the analysis and evaluation of knowledge sharing in large-scale information systems such as the JiTT digital library, with the ultimate goal of developing a framework for describing explicit as well as implicit knowledge architectures in complex digital collaborative environments – knowledge architectures that can guide the design of work-based and user-centered knowledge organization schemes.

3. The Means-Ends Abstraction Hierarchy

The core of CWA is the means-ends model which is used to analyze the stable properties of a work domain and thus functions as the glue joining analysis and evaluation of the work domain. The means-ends model addresses the overall territory of work in terms of domain structures and user work strategies on the one hand and user resources, backgrounds and preferences on the other. The means-ends model described by Rasmussen (1986) was originally introduced as a paradigm for representing engineering control systems such as those used in power plants. Since its introduction, the means-ends model has been the object of much discussion and ongoing development by the CSE research community concerned with human-machine systems design (e.g., Lind, 1999). Use of the means-ends model has recently been extended beyond CSE and has been applied to research on the design and evaluation of information systems serving a broad range of application domains. Adaptations of the means-ends model in various domains have been discussed in Vicente (1999), in Rasmussen, Pejtersen and Goodstein (1994), and in Rasmussen, Pejtersen and Schmidt (1990). Specific applications have ranged from patient care in hospitals and case handling in public institutions to knowledge exploration and organization in digital libraries and web-based collaboratories (Pejtersen & Albrechtsen, 2002; Albrechtsen et al., 2004).

MEANS-ENDS RELATIONS / PROPERTIES REPRESENTED

Goals and Constraints: Properties necessary and sufficient to establish relations between the performance of the system and the reasons for its design (i.e., the purposes and constraints of its coupling to the environment). Categories in terms referring to properties of the environment.

Priority Measures: Properties necessary and sufficient to establish priorities according to the intention behind design and operation: topology of flow and accumulation of mass, energy, information, people, monetary value. Categories in abstract terms referring neither to system nor environment.

General Functions: Properties necessary and sufficient to identify the "functions" that are to be coordinated irrespective of underlying physical processes. Categories in terms of recurrent, familiar input-output relationships.

Processes and Activities: Properties necessary and sufficient for control of physical work activities and use of equipment: to adjust operation to match specifications or limits; to predict response to control actions; to maintain and repair equipment. Categories in terms of underlying physical processes and equipment.

Physical Resources: Properties necessary and sufficient for classification, identification and recognition of particular material objects and their configuration and for navigation within the system. Categories in terms of objects, their appearance and location.

Figure 1: The means-ends abstraction hierarchy. (Reprinted from Rasmussen, Pejtersen & Goodstein (1994) with permission of the authors.)

Means-ends analysis is based on two fundamental strategies: (a) empirical analysis of work domains and (b) mapping of identified domain features in a means-ends abstraction hierarchy (Rasmussen, 1986; Rasmussen, Pejtersen & Goodstein, 1994). This abstraction hierarchy is designed to capture all relevant features of the work domain, from high-level, abstract attributes such as goals and constraints to very concrete elements in the form of material resources. As represented in Figure 1, the means-ends abstraction hierarchy consists of five levels:

• Goals and Constraints: The highest level of abstraction in the means-ends hierarchy addresses the purpose of the work domain in relation to its functions in the environment and captures the work domain's anchoring in cultural, political and economic systems. Examples of goals are found in statements of policy formulated for the work domain, while examples of constraints are to be found in the outside regulations imposed either by legislation or by codes of practice.
• Priority Measures: The second level of abstraction addresses the organizational structure of the work domain and the division of labor and distribution of resources within the domain – how resources like staff, material and finances are allocated and managed within the domain.
• General Functions: The third level of abstraction addresses the recurrent tasks carried out in a work domain, irrespective of those physical resources, such as staff or tools, which may be involved in carrying out such tasks.
• Physical Processes and Activities: The fourth level of abstraction addresses the actual activities involved in carrying out workplace tasks – the processes necessary to establish and maintain the general functions of the work domain.
• Physical Resources: The fifth and most concrete level of abstraction consists of an inventory of the material resources that are created, used and maintained within the work domain. As discussed by Rasmussen, Pejtersen and Goodstein (1994, pp. 35-55), the category of physical resources includes the actors who participate in the work domain – the staff and users who take part in the activities of the work domain.
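The five levels form an ordered taxonomy that a work-domain analysis populates with domain-specific findings. The following sketch is purely illustrative: the level names follow Figure 1, but the structure, function names and classroom examples are hypothetical illustrations of ours, not part of the CWA framework itself.

```python
# Illustrative sketch: recording work-domain findings at the five
# means-ends abstraction levels (level names from Figure 1; the
# classroom-domain entries are hypothetical examples).
LEVELS = [
    "Goals and Constraints",
    "Priority Measures",
    "General Functions",
    "Physical Processes and Activities",
    "Physical Resources",
]

def new_analysis():
    """Create an empty means-ends analysis: one list of findings per level."""
    return {level: [] for level in LEVELS}

analysis = new_analysis()
analysis["Goals and Constraints"].append("state curriculum standards")
analysis["General Functions"].append("development of lesson plans")
analysis["Physical Resources"].append("JiTT warmup question bank")

# Traverse from the most abstract level to the most concrete.
for level in LEVELS:
    print(level, "->", analysis[level])
```

The point of the ordering is that each level supplies the means for the ends stated at the level above it, so an analysis is read top-down (why) or bottom-up (how).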

Within the theoretical framework of the original means-ends model, the relationships between the five levels of abstraction were understood to be governed by laws of nature, by the structure of the control system, and by the human operator's interpretive and decision-making activities within the limits set by the environmental constraints of the work domain. Current theory informing means-ends analysis views these environmental constraints as sources of regularity that inform the actors’ decisions and freedom of choice rather than as conditions that causally determine the actors’ activities and understandings, as is the case with work systems such as those associated with power plants. Recent adaptations of means-ends analysis emphasize work systems as territories where the user navigates more or less freely. The conditions for user decision-making are not determined by natural forces, nor are they prescribed by the information system itself; rather, they develop dynamically through the interdependencies and relationships established between users. When applied in user-centered work domains, means-ends analysis addresses the ongoing construction of the territory of work within which users will navigate. Accordingly, recent applications of the means-ends model, ranging from concurrent engineering (Pejtersen et al., 1997) to collaborative knowledge organization in film research (Albrechtsen, 2003), have focused on mapping the work domain in terms of common or shared workspaces within which collaborating users navigate.

4. Application of CWA to Analysis of the JiTT Digital Library

CWA integrates traditional areas of investigation through analysis of the relevant knowledge domain(s); analysis of the organizational domain, including how work is divided, delegated, managed and financed; analysis of the work domain, including specific tasks, decision-making strategies and heuristics, and domain vocabulary; and analysis of user skills, performance criteria, preferences and expectations. The objective of CWA is to analyze these various elements of the work domain to develop an understanding of the structural, social and individual components that constitute a "system of work". While the various areas of investigation are treated as analytically distinct, the overall emphasis of CWA is on developing a comprehensive understanding of the relationships that exist among the structural, social and individual components of the work domain and how these relationships interact to produce a "system of work" that is greater than the sum of its parts. CWA offers a theoretical framework and a set of heuristic models that are being applied both in analysis of the classroom as work domain and in evaluation of the JiTT digital library as a large-scale networked information system. The means-ends abstraction hierarchy and the task situation model are two key features of the CWA framework that are being used in the empirical analysis of the JiTT metadata scheme and its role in supporting the functional components of the JiTT digital library. The means-ends hierarchy is being used to map the work domain of the classroom from goals and priorities to functions, processes and physical resources. As such, it is providing a multi-faceted representation of the territory in which JiTT resources are being applied by instructors.
The task situation model complements the means-ends hierarchy by providing guidelines for empirical analysis of the prototypical activities in which JiTT instructors are involved – task situations such as construction of assignments, evaluation of student responses, classroom instruction and interaction with students. Initial data collection is relying on artifact-based interviews with JiTT instructors. Because the goal of this research is to evaluate the metadata scheme and the contribution of JiTTDL as a collaborative workspace, these open-ended interviews have attempted to elicit information regarding the process of selecting, adapting and implementing a JiTT activity, the analysis of student responses to warmup questions, and the contribution of this analysis to in-class instruction. Preliminary analysis of the means-ends hierarchy developed from these interviews appears to indicate a high degree of diversity in the expertise instructors bring to application of the JiTT methodology as well as in the knowledge content of resources used by individual JiTT instructors. At the same time, the work domain supported by JiTTDL is characterized by common work functions such as development of lesson plans, classroom instruction and student tutoring. This juxtaposition of diversities and commonalities among JiTT instructors raises interesting questions regarding instructor comfort with the JiTT pedagogy and the relationship of evolving comfort levels to the adoption, adaptation and modification of JiTT activities in the classroom. The JiTT digital library is intended to provide instructors with the resources and functionalities that will create a dynamic environment in support of the work domain of education and to construct a shared virtual workspace where instructors can collaborate on the design of innovative instructional resources and techniques.
Preliminary findings of this analysis indicate that, while the common workspace offered by the JiTT digital library has encouraged changes in the activities and processes used by instructors to carry out the general functions characteristic of classroom instruction, these activities remain closely tied to the goals established by the policy pronouncements of individual educational systems and, in some cases, by the constraints imposed by state legislation. Initial findings also indicate that the JiTT digital library supports mutual learning and innovation through the sharing of expert knowledge across instructors, that it promotes awareness of the wide range of interactive resources available to instructors and that it encourages coordination of work among instructors. More importantly, however, these preliminary findings suggest that ongoing alterations in the common workspace – alterations following from the instructors' evolving understanding and articulation of the content of educational work – will require modifications in the scope of the current JiTT metadata scheme. However, it is becoming increasingly clear that, in order to explore how modifications in the representation of these resources will accommodate instructors' use of JiTT resources and encourage an evolving semantics within the common workspace, it will be necessary to evaluate the efficacy, efficiency and utility of the metadata scheme through empirical field experiments that involve both instructors and students in interactive task scenarios designed to explore a predefined set of pedagogical techniques and resources.

5. Conclusion

This paper has presented the theoretical foundation for an innovative approach to the evaluation of both a domain-specific metadata scheme and a specialized digital library. Prior to the current effort, CWA had been applied in the evaluation of information retrieval systems and relatively small and cohesive digital libraries and collaboratories such as Collate (Albrechtsen et al., 2004). The information environments in which CWA has been applied have not been part of a larger digital library context, as is the case with JiTTDL. Although the collection of empirical data is still in process, the use of CWA as a tool for evaluating the utility of metadata schemes appears promising. CWA provides a comprehensive approach to understanding the scope and complexity of a system of work by integrating the analysis of the human component in the work environment (including, for example, decision-making strategies and heuristics, domain vocabulary and user preferences and expectations) with the more traditional analysis of knowledge domains and organizational structures. It is precisely this comprehensive approach to analysis of the work domain that makes CWA an attractive framework for the evaluation of a work-specific metadata scheme as well as for assessment of the effectiveness of the shared and networked environment provided by the JiTT digital library.

References

Albrechtsen, H. (2003). Classification schemes for collection mediation: Cognitive work analysis and work centered design. Aalborg, Denmark: Aalborg University, Faculty of Engineering and Science.
Albrechtsen, H., Andersen, H. H. K., Cleal, B. R. & Pejtersen, A. M. (2004). Categorical complexity in knowledge integration: Empirical evaluation of a cross-cultural film research collaboratory. In I. C. McIlwaine (Ed.), Knowledge organization and the global information society: Proceedings of the 8th International ISKO Conference, London (GB), 13-16 Jul 2004 (pp. 13-18). Würzburg: Ergon Verlag.
Albrechtsen, H., Andersen, H. K. K., Bødker, S., & Pejtersen, A. M. (2001). Affordances in activity theory and cognitive systems engineering. Retrieved March 14, 2006 from the World Wide Web: http://www.risoe.dk/rispubl/SYS/ris-r-1287.htm
Fidel, R., Pejtersen, A. M., Cleal, B. & Bruce, H. (2004). A multi-dimensional approach to the study of human-information interaction: a case study of collaborative information retrieval. Journal of the American Society for Information Science and Technology, 55(11), 939-953.

Hollnagel, E. & Woods, D. D. (2005). Joint cognitive systems: Foundations of cognitive systems engineering. Boca Raton, FL: Taylor & Francis/CRC Press.
Lind, M. (1999). Making sense of the abstraction hierarchy. In Proceedings of CSAPC '99, Villeneuve d'Ascq, France, 21-24 September 1999. Retrieved March 14, 2006 from the World Wide Web: http://www.iau.dtu.dk/~ml/csapc99.pdf
National Science Digital Library. (2004). NSDL: 2004 Annual Report. Boulder, CO: National Science Digital Library.
Novak, G. M., Patterson, E. T., Gavrin, A. D., & Christian, W. (1999). Just-in-Time-Teaching: Blending active learning with web technology. Upper Saddle River, NJ: Prentice Hall.
Pejtersen, A. M. & Albrechtsen, H. (2002). Models for collaborative integration of knowledge. In M. Lopez-Huertas (Ed.), Challenges in knowledge representation and organization for the 21st century: Integration of knowledge across boundaries (pp. 412-421). Würzburg: Ergon Verlag.
Pejtersen, A. M., Sonnenwald, D. H., Burr, J., Govindaraj, T., & Vicente, K. (1997). The design explorer project: Using a cognitive framework to support knowledge exploration. Journal of Engineering Design, 8, 289-301.
Rasmussen, J. (1983). Skills, rules, and knowledge: signals, signs, and symbols, and other distinctions in human performance models. IEEE Transactions on Systems, Man and Cybernetics, 13, 257-266.
Rasmussen, J. (1986). Information processing and human-machine interaction: An approach to cognitive engineering. New York: North-Holland.
Rasmussen, J., Pejtersen, A. M., & Schmidt, K. (1990). Taxonomy for cognitive work analysis. Roskilde, Denmark: Risoe National Laboratory.
Rasmussen, J., Pejtersen, A. M., & Goodstein, L. P. (1994). Cognitive systems engineering. New York: Wiley.
Sanderson, P. M. (2003). Cognitive work analysis. In J. Carroll (Ed.), HCI models, theories, and frameworks: Toward an interdisciplinary science (pp. 225-264). New York: Morgan Kaufmann.
Schmidt, K. (1990). Analysis of cooperative work.
A conceptual framework. Roskilde, Denmark: Risø National Laboratory.
Vicente, K. J. (1999). Cognitive work analysis: Towards safe, productive and healthy computer-based work. Mahwah, NJ: Lawrence Erlbaum.

Nancy J. Williamson
University of Toronto, Canada

Knowledge Structures and the Internet: Progress and Prospects

Abstract: This paper analyses the development of the knowledge structures provided as aids to users in searching the Internet. Specific focus is given to web directories, thesauri, and gateways and portals. The paper assumes that users need to be able to access information in two ways – to locate information on a subject directly in response to a search term and to be able to browse so as to familiarize themselves with a domain or to refine a request. Emphasis is on the browsing aspect. Background and development are addressed. Structures are analyzed, problems are identified, and future directions discussed.

1. Introduction

Since its early and somewhat primitive beginnings, the Internet has gone through many changes. It has grown ever larger and search engines have become increasingly sophisticated. In efforts to improve access, and to ease the search for information, old tools have been adapted and reconfigured and new tools are constantly being developed. The question of how to organize websites so users can actually find the information they are seeking is a continuing problem. Nevertheless, while information systems may change, to be successful they must continue to meet two fundamental requirements of information seekers: to permit users to locate information on a subject directly and to allow them to browse so as to familiarize themselves with a domain or to refine a request. In the infancy of the Internet, information providers were working on a trial and error basis. Research findings indicate that the expertise of the providers varied in its sophistication. The medium and its potential were often misunderstood. Information seeking methods were sometimes naïve and the access tools inadequate to the task. These findings were confirmed in the research for two previously prepared papers – “Knowledge structures and the internet” (Williamson, 1997) and “Thesauri in the digital age” (Williamson, 2000). In 1997, search engines were relatively primitive and control over the development of the Internet was minimal. Each information provider had his/her own objectives as to how information should be organized to accomplish easy and productive browsing. The literature seldom addressed the intricacies of structuring the data and no standards, or generally accepted guidelines, existed for dealing with document content. Moreover, early research and development focused on the societal and technological problems, rather than on a concern for effective access to data.
Indeed, as with so many new technological toys, developers and many users assumed the entertainment value of ‘surfing the net’ to be its most important function. Research into the possibilities for organization and access had only just begun. In the 9 years since that time much has changed. The Internet is now seen as a serious source of information in the academic, business and industrial communities, as well as in the eyes of the ordinary citizen. In 2006, web designers and researchers are well into exploiting a full range of knowledge structures and search strategies. Software cannot solve all the problems. There is an urgent need for more user-friendly interfaces and greater emphasis on human-computer interaction to aid the user in achieving successful searches. From its early beginnings much effort has been devoted to ‘indexing the internet’. First approaches were rather simplistic in nature and not always appropriate to the new medium. For example, early use of traditional classification was at a minimal level, usually shallow and sometimes incorrect. Now, controlled vocabularies of various kinds (e.g. thesauri and taxonomies) as well as other kinds of information structures are deemed to have an important role to play. Most significantly, endeavours to create seamless information systems have led to the integration of the tools with the databases themselves. At this point in time, major questions to be addressed are “How well are these tools performing their task?” and “Is there room for improvement?” Using the aforementioned papers as a starting place, this current research examines the access tools provided for the aid of users. Focus is on those devices that can actually be viewed on the screen by users and can be manipulated to facilitate subject searching. Emphasis is on structure and complexity as embodied in the use of tools that foster browsing of subject domains and of individual websites.
Structure has been defined as the bringing together and organization of information in a way that facilitates browsing. In conjunction with the Internet, the term “browsing” is often defined very broadly, encompassing activities as simple as scanning an alphabetical list as well as ‘browsing by category’. Emphasis here is on classificatory structure as embodied in both traditional and newly developed aids to searching.

2. Methodology

Using the aforementioned papers as a starting place, the first step in this investigation was an analysis of the research and publications on this topic since 1997. Since most of the material was published prior to 2004, an accurate picture of the current situation was possible only through a critical examination of the design and structure of relevant websites in each of the categories. Some of the sites viewed in 2000 (Williamson) were revisited and new sites reflecting the newer structures were added. There is some confusion in the use of terminology and some of the terms needed to be defined. Tools that control the Internet are of two kinds: those that are controlled by the system and are, for the most part, invisible to the user (e.g. search engines) and those that can be used by information seekers to navigate a site or a group of sites in the process of subject searching. This study focuses exclusively on the navigational structures visible to the user. There are numerous structures at their disposal. Some of these are very simple; others are more complex. Since the list is long, three major types of structures were chosen for analysis. These were web directories, thesauri, and gateways and portals. Hyperlinking is an essential feature of all of these structures, and features such as site maps come up in the discussion where appropriate. The ultimate goal in this research has been to attempt to find answers to such questions as: What is the nature of these structures? Do they share common characteristics? Do they truly support the browsing capabilities of the system? Are they being used effectively? Are there improvements that could be made? Where should research and development go from here?

3. Web subject directories

One of the most obvious tools in a directed search of Internet sites is the web subject directory. Web directories “are, in fact, a form of classification” (Gilchrist, 2003, 11) and are described by Gilchrist, along with some other applications, as belonging under the more generic term “taxonomies”. These directories fulfill the two basic requirements for searching an information system. They permit searching on specific terms and also allow browsing through the lists of resources. This investigation focused on subject directories of six sites that are reputed to be “good sites” (Notess, 2003) and ten randomly selected sites from a list described as “invisible web” sites (Ru and Horowitz, 2005).

In each case, searches were carried out using the terms ‘health’ and ‘education’. Web directories lead to such things as information, documents, and other web sites and are designed by humans for each particular website. While there are no rules or standards for these directories, an expert web designer will have some idea of the characteristics of a workable directory. Web sites vary in nature. Some sites are large; some are small. Large sites require somewhat more complex structures than small ones. Some sites deal with single domains while others have very broad subject coverage and are necessarily more complex. Size and content have considerable influence on the kind of directory needed. Absolutely essential are the following: a) terminology suitable to the subject and the target audience; b) two modes of access – direct access via the user’s own choice of terms, and a list of categories or terms, or a site map, from which the user may select; c) a logical path to the documents, data, or information provided at the end of the search; d) hyperlinking to permit navigation of the site to the end of the search. Within these essentials there are some variations. Suitability of terminology can only be assessed by doing a user study. In some cases, traditional classifications such as DDC, UDC or LCC may work but they are not suitable for many sites. All of the sites viewed provided the two methods of access, permitting the user to enter his/her own keyword or term, and all provided hyperlinking as a basis for navigation. However, ‘browsing’ the directories varied with the site. Some provided an alphabetical list of terms; others preferred to present the searcher with a list of categories. Some, but not all, sites using the alphabetical approach permitted users to select a letter of the alphabet, which would allow them to scan terms beginning with that letter.
In some cases, searching would be made more precise through the selection of a segment of a particular part of the alphabet (e.g. Ag – Ar). The alphabetical approach is really only suitable for small sites or sites devoted to a single domain. Otherwise scanning for acceptable terms demands excessive scrolling of the list. Moreover, an alphabetical approach fails to organize related topics in a way that might be useful to the searcher. For access to the larger sites, browsing by category makes the most sense, particularly if the subject area is large. In this scenario, a directory offers the user a menu of top terms. By selecting one of the terms, the user can move to second and succeeding levels until information or references are reached. In some cases the user is presented with the opportunity of using another search engine. Those top terms name the categories and it is crucial that the choice of those terms by the web designer be suitable to the subject matter and the site. The number of top terms, the total number of sites involved and the nature of the domain influence the degree of division, the number of hierarchical levels in a directory and the length of the pathway to a particular topic. For example, the Yahoo directory leading to 3,000,000+ locations is much more detailed than RDN (Resource Discovery Network) leading to 30,000 locations (Notess, 2003). As found in the sites searched, top terms tended to be at the domain level, but popular topics might also be top terms. For example Librarians’ Internet Index, a very broadly based site, includes the very broad topics – ‘Arts and Humanities’, ‘Science’ and ‘Society and the Social Sciences’ – as top terms as well as terms that could be subsumed under these. At the upper levels, terms sometimes included a brief contents note to aid the user in making choices. Web directories are described in the literature as being hierarchical but many of them are not hierarchical in the classical sense.
In some cases a term may be repeated at different levels. For example, in the Yahoo directory, ‘cancer’ and ‘breast cancer’ both appear at level 3 but ‘breast cancer’ appears again at level 4. Frequently also, the categories are not ‘pure’. For example, ‘diseases and conditions’ and ‘news and media’ are in the same category in the Yahoo directory. Perhaps this is unavoidable if the ‘news and media’ apply to that particular segment of the directory.
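The segmentation of an alphabetical term list into ranges such as ‘Ag – Ar’, described above, can be sketched as follows. This is a minimal illustration only: the terms and the fixed segment size are invented, and real directories choose segment boundaries editorially rather than mechanically.

```python
# Illustrative sketch: splitting a sorted alphabetical term list into
# browseable segments labelled by their letter range (e.g. "Ag - Ar").
def segment(terms, size):
    """Group sorted terms into fixed-size segments labelled by their range."""
    terms = sorted(terms, key=str.lower)
    segments = {}
    for i in range(0, len(terms), size):
        chunk = terms[i:i + size]
        label = f"{chunk[0][:2]} - {chunk[-1][:2]}"  # first two letters of ends
        segments[label] = chunk
    return segments

terms = ["Aging", "Agriculture", "Anatomy", "Art", "Astronomy", "Biology"]
for label, chunk in segment(terms, 2).items():
    print(label, chunk)
```

Presenting the user with the segment labels instead of the full list reduces the scrolling that the text identifies as the main cost of the alphabetical approach.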

As indicated above, not every term relative to the various websites will appear in a directory. In all of the subject directories searched there was provision for searching on keywords as well as for browsing. As might be expected, there is wide variation in the structure and usefulness of web directories. The user controls the search by following a path in the directory. A useful device is a display of the search path at each step of the user’s search (e.g. Yahoo and Librarians’ Internet Index). InfoMine differs from the other directories examined. The first page of the site contained the top terms in the directory. A click on the top term leads to a search form designed for selecting search options and inputting a term that permits a choice of regular display or relevance ranking. At this location also, users are offered the option of selecting search terms by browsing LCSH, LCC, keywords and other indexes. Similarly, searching ‘biological, agricultural and medical sciences’ in RDN permits the use of LCSH, MESH and LCC. The large web directories (e.g. Yahoo and Librarians’ Internet Index) tended to be the most logically organized and most minutely divided. In the Yahoo example, some paths went down six levels whereas RDN provided division that was only three levels deep in the paths searched. BUBL was the only directory searched that used a traditional classification scheme. It categorizes by DDC class numbers, starting with broad classes and moving down the path incrementally, but tends to go only one digit past the decimal point. It is clean and logical. Of all the web subject directories searched, Yahoo (organized by terms) and BUBL (organized by DDC class number) were the best developed and the most logical. With respect to large databases, effective sorting of the material is extremely important. The larger the website, the greater the need for additional levels of division.
In the Yahoo directory, for example, a search for ‘mammography’ goes through 5 levels: Health – Diseases and conditions – Cancer – Breast cancer – Mammography. ‘Cancer’ is too large a subject to have everything lumped together in one unorganized mess. It needs to be subdivided. Sorting is a simple principle of categorization. As a database grows, its continued usefulness over time requires resorting of the database and revision of the directory. At this point in time there is no way of knowing whether this will actually happen in the future. Given the magnitude of the problem, it seems unlikely. There are many web subject directories that are performing a useful function. However, there are also numerous sites that have no directories and should have. One-page websites are not a problem, but sites that run to 15 or more pages are.
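The browse-by-category behaviour discussed in this section amounts to following a path through a term tree. A minimal sketch, using the Yahoo path cited above as content; the nested-dictionary representation is our illustration of the principle, not how any real directory is implemented.

```python
# Illustrative sketch of browse-by-category navigation in a web subject
# directory. The fragment mirrors the Yahoo path cited in the text:
# Health - Diseases and conditions - Cancer - Breast cancer - Mammography.
directory = {
    "Health": {
        "Diseases and conditions": {
            "Cancer": {
                "Breast cancer": {
                    "Mammography": {},  # leaf: would list site links
                }
            }
        }
    }
}

def browse(tree, path):
    """Follow a path of category selections; return the subtree reached."""
    node = tree
    for term in path:
        node = node[term]  # KeyError if the term is not offered at this level
    return node

path = ["Health", "Diseases and conditions", "Cancer", "Breast cancer", "Mammography"]
print("levels traversed:", len(path))
```

At each step the user sees only the keys of the current subtree, which is why the choice and purity of category terms at every level matters so much for findability.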

4. Thesauri

While web subject directories (taxonomies) provide useful tools to aid in navigating websites, thesauri have the potential to provide much more powerful support by allowing subjects to be arranged hierarchically and by permitting other kinds of term relationships and linkages. Controlled vocabularies, in the form of subject headings, go back to the nineteenth century. Following in the footsteps of Roget, the modern thesaurus emerged in the early 1960s. By the late 1990s it was generally assumed that there was an important role for thesauri to play as tools in online access to information (Milstead, 2000). At that time, their role was still being defined. Among other things, it was assumed that thesauri could complement full-text access by aiding users in focusing their searches, by supplementing the linguistic analysis of the text search engines, and even by serving as tools used by the search engine for its analysis. Machine-aided indexing could make use of thesauri as a basis for easier term selection by indexers. Also, by analogy, the principles of term relationships as applied in thesauri should be applicable in the creation of logical hyperlinks in large databases and the Internet.

Over time, a growing number of thesauri became available through the Internet, with two basic types of display: static and dynamic. Static thesauri are displayed much as they appear in printed form, differing only in that they are available electronically. The contents can be browsed by scrolling; otherwise there is little or no facility for moving about the list in a dynamic way. For practical purposes they are clumsy, and it would often be easier and more efficient to use a printed volume. This early approach is typical of initial attempts to move printed products into digital format and was precipitated by the existing use of electronics to create the printed versions. Dynamic thesauri, on the other hand, are presented in such a way that they can be ‘searched’ by inputting thesaurus terms or by browsing a section of the alphabet. Some provide a list of categories as a starting place for beginning a search. Some allow Boolean searching and use hyperlinks to enable users to move from one part of the thesaurus to another, following the BT, NT, and RT relationships, and to move from one type of display to another (e.g. from a rotated display to an alphabetical display). Some are derived from a printed product. Others were newly created for the Internet and exist only in electronic format. Some of the early online examples actually led to documents (e.g. the Cook’s Thesaurus, which led to actual recipes). By 2000, there was considerable improvement in the way thesauri were presented online. There was evidence of a solid start in the development of online thesauri, but there was still much research to be done on possible ways to enhance the display and use of these search tools. There were plenty of examples of what to do and what not to do. Many of the designs were predicated on what the designers knew about databases and not always on what was known about the behaviour of users.
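The dynamic behaviour described above, in which a term record's BT, NT, and RT entries act as links to other records, can be sketched as a simple data structure. The sketch below is illustrative only: the class names are our own, and the example hierarchy reuses the mammography example from the directory discussion rather than any actual thesaurus.

```python
# Minimal sketch of a dynamic thesaurus: each descriptor record holds
# BT/NT/RT relationships, and navigation follows them like hyperlinks.
class ThesaurusTerm:
    def __init__(self, label):
        self.label = label
        self.broader = []    # BT: broader terms
        self.narrower = []   # NT: narrower terms
        self.related = []    # RT: related terms

class Thesaurus:
    def __init__(self):
        self.terms = {}

    def add(self, label):
        self.terms.setdefault(label, ThesaurusTerm(label))
        return self.terms[label]

    def link_bt(self, narrower, broader):
        # Recording a BT implies the reciprocal NT, as the standards require.
        n, b = self.add(narrower), self.add(broader)
        n.broader.append(b)
        b.narrower.append(n)

    def entry(self, label):
        # The "full thesaurus entry" a dynamic display would render.
        t = self.terms[label]
        return {
            "term": t.label,
            "BT": [x.label for x in t.broader],
            "NT": [x.label for x in t.narrower],
            "RT": [x.label for x in t.related],
        }

# Invented example hierarchy:
th = Thesaurus()
th.link_bt("breast cancer", "cancer")
th.link_bt("mammography", "breast cancer")
print(th.entry("breast cancer"))
# {'term': 'breast cancer', 'BT': ['cancer'], 'NT': ['mammography'], 'RT': []}
```

Following a BT or NT label to its own `entry()` call is the programmatic analogue of the hyperlinked navigation the dynamic displays provide.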
Some were incomplete, displaying a portion of the thesaurus as an encouragement to buy the product. However, there were healthy signs of innovation. In particular, many lists now began to lead to actual document citations, confirming the prediction that direct linkage between a thesaurus and documents was to be the way of the future. All thesauri that presently exist were created, more or less, on the basis of the current guidelines for thesaurus construction. These guidelines, while sound in linguistic principles, are technologically somewhat out of date. The ISO 2788 guidelines were published in 1986. The ANSI/NISO Z39.19 guidelines, published somewhat later in 1993, employed the same guiding principles as ISO 2788 but included a section on screen display which recognized that “sophisticated thesaurus display and terminology” (National Information Standards Organization, 1993, 25) would be appropriate for expert searchers and indexers. Further, the guidelines acknowledged the fact that screen viewing is different from print viewing and presents difficulties. They also alluded to the need for standards for human-computer interaction, but did not refer directly to the possibility of a dynamic thesaurus. A 1999 workshop sponsored by NISO (Milstead, 1999, 8) recommended that a new standard for thesauri was needed, that it should provide for a variety of flexible displays, and that it should not be a standard specifically for electronic thesauri, because today all thesauri are essentially digital, which would make such a distinction superfluous. The work on the new standard has now been completed: ANSI/NISO Z39.19-2005 was published in November 2005 and is available on the web at http://www.niso.org/standards/resources/Z39-19-2005.pdf. Parts 1 and 2 of BS 8723, a British version of the standard, were also published in 2005. This guide supersedes BS 5723:1987, the Guide to the establishment and development of monolingual thesauri.
Leonard Will describes it as being substantially different from BS 5723. “The text has been rewritten in today’s idiom and some additional aspects are now covered, including facet analysis, presentation via electronic (as well as printed) media, thesaurus functions in electronic systems, and requirements for thesaurus management software” (Will, 2006). Three further parts, yet to come, will cover vocabularies other than thesauri, interoperation between multiple vocabularies (with multilingual vocabularies as a special case), and interoperation between vocabularies and other components of information storage and retrieval systems. The hope is that “BS 8723 will pave the way towards a corresponding revision of the international standard ISO 2788” (Will, 2006). While the full text of the BSI version was not available in time for perusal for this paper, assurance has been given that both standards support the principles laid out in previous standards and that they are cognizant of the need to provide for electronic manipulation and display of thesauri, and for their integration with databases. Problems of interoperability will also be addressed. In NISO Z39.19-2005, section 11.47 sets out requirements for “browsing within hierarchical and alphabetical displays” (National Information Standards Organization, 2005, 103-104) and the “viewing of a term in the context of its relationships and its complete term record from any display (through hyperlinking).” Hopefully, these guidelines will lead us into the future, but the work is still in transition. Many of the predictions made by Milstead (2000) have come to pass and, while there are still some unfortunate links with the past, things are moving on. Clearly, the thesaurus has now assumed its role as a search tool. One of the most significant developments is the increase in the linking of thesauri to relevant databases, permitting users to plan a search in a thesaurus and move seamlessly into a database to select documents.
Another sign of change is the convergence of databases with access through a single access point; the CSA Illumina gateway leading to a long list of databases is one example of this. Online thesauri can be found in one or more of the following formats (Shiri and Revie, 2000): a) simple static text format (e.g. ASFA and the NASA Thesaurus); b) static HTML format; c) dynamic HTML format with fully navigable hyperlinks (e.g. MeSH and the UNESCO Thesaurus); d) advanced visual and graphic interfaces (Thinkmap, Plumb Design Visual Thesaurus); e) XML format (Virtual Hyperglossary). Because of the lack of suitable standards, there are some variations in a) the way the thesauri are accessed and b) the provision for their electronic manipulation. Some are stand-alone; some may be related to a database but not directly linked to it. Others are linked to a database in a seamless manner. Some of the stand-alone versions may not be updated on a regular basis and may be displayed primarily to encourage subscriptions to the full and up-to-date product. With minor variations, most of the thesauri in dynamic format can be accessed in response to the searcher’s input of a term, a partial term, or a known descriptor. Most are forgiving of spelling mistakes and respond to truncation. The response to the initial input may be either a list of descriptors or a list of categories. A further choice from the list produces a thesaurus entry including its BT, NT, and RT relationships. In the best examples, hyperlinks permit movement from one thesaurus entry to another. Except in static format, a searcher cannot access the whole thesaurus at once. In the dynamic format a “black box” effect exists: access is to a single thesaurus record or, in rare cases, to several thesaurus records germane to a particular topic. In the dynamic thesaurus, the browsing feature of the printed thesaurus appears to have gone the way of the card catalogue.
There are some things the searcher can no longer see. The following examples demonstrate some of the features. The NASA Thesaurus is of the static type: the whole of the alphabetical display can be viewed, but it cannot be manipulated online and is not connected to the database. The CATIE HIV/AIDS Treatment Thesaurus is a dynamic thesaurus accessed through letters and letter combinations (e.g. A - Ana, Anb - Az, B, etc.). Selection of a segment of the alphabet brings up an alphabetical list of terms from which a choice can be made. The chosen term leads to a thesaurus entry with all the necessary relationships, but that list is static; it cannot be manipulated further. Having chosen the term, the user must then go to the website page or to the library catalogue to get the final information. That is, the thesaurus and the database are not connected. The UNESCO Thesaurus is a multilingual thesaurus containing 7,000 English terms, 8,600 terms in French and 8,600 terms in Spanish. It is easy to use and has most of the requirements of a good online thesaurus. A search on ‘UNESCO’ brings up the official website for UNESCO documents and publications, which provides a link to the thesaurus. The thesaurus can be searched both alphabetically and hierarchically. An alphabetical search is initiated by inputting a few letters (e.g. “cultur” for culture). A click leads the user to the relevant terms from the permuted list. Each term includes a full thesaurus entry with the number of documents in square brackets. BTs, NTs and RTs are hyperlinked to their own records so that a user can expand or narrow the search if desired. A click on the number of documents attached to a thesaurus term retrieves the records for the documents on that topic from unesdoc/bib. From that point the full catalogue record can be accessed if there is one; full text can be accessed online if available, and “no full text” is indicated where appropriate.
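The truncated lookup described for the UNESCO Thesaurus, typing “cultur” to reach ‘culture’ and its neighbours in the permuted list, amounts to a prefix match over the descriptor list. A minimal sketch, with a small invented term list rather than the actual UNESCO vocabulary:

```python
import bisect

def prefix_search(terms, fragment):
    """Return all descriptors beginning with the typed fragment,
    as a dynamic thesaurus interface does for truncated input."""
    terms = sorted(t.lower() for t in terms)
    frag = fragment.lower()
    # bisect finds where the fragment would sort; any matches follow it.
    start = bisect.bisect_left(terms, frag)
    hits = []
    for t in terms[start:]:
        if not t.startswith(frag):
            break
        hits.append(t)
    return hits

descriptors = ["Culture", "Cultural policy", "Curriculum", "Crime"]
print(prefix_search(descriptors, "cultur"))
# ['cultural policy', 'culture']
```

The forgiving behaviour the paper notes (tolerance of spelling mistakes) would require fuzzier matching than this, but truncation alone reduces to exactly this kind of sorted-prefix scan.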
If the hierarchical approach is chosen, the process is slightly different. The user is asked to choose a) a domain from a list of categories (e.g. Social and human sciences) and b) a specific subject area (e.g. Social problems). Input again takes the user to full thesaurus entries supported by hyperlinks. An unusual feature of this thesaurus is that in some cases more than one thesaurus record may be retrieved. For example, a request for ‘social problems’ brought up thesaurus records for ‘crime’, ‘disadvantaged groups’ and ‘social problems’ as being germane to the topic. Clicking the number of documents then leads to records for the documents. While the user must understand the system, it is simple and effective to use and the instructions are clear. Details of how to browse the thesaurus are provided, and the user is alerted to the existence of the thesaurus on the first page of the site. Another format now being introduced is the advanced visual and graphic interface. This approach has great potential, but so far the examples are very simple and applied to a small group of terms. The Thinkmap Visual Thesaurus is one example of this and at best might be thought to be “cute”. On the Internet a graphic design is presented; clicking on an individual term changes the configuration, adding new terms and subtracting others. Very little information is given and the user is only allowed to play with it for a few minutes. Access to it is problematic without further information. There may be an alphabetical list linked to the graphics, but only a subscription or questioning the vendor would provide answers, and one would not want to subscribe without more information. Nevertheless, this format has great significance. This approach to graphic design is not new: it was demonstrated by Lauren Doyle (1961) and by Eric Johnson and Pauline Cochrane (1998) in an experiment with the INSPEC Thesaurus.
This kind of design deserves some further thought. In the last several years the design and use of thesauri online has come a long way. Thesauri and databases are being linked, and thesauri are becoming search tools for the information seeker. Great strides have also been made in the design of multilingual thesauri (Hudon, 1997) as well as in interoperability among thesauri, as demonstrated by the Unified Medical Language System (UMLS). Nevertheless the role of thesauri on the Internet is still in transition. While many systems are set up so that an information seeker can move directly from selection of search terms in a thesaurus into databases (e.g. the UNESCO Thesaurus), there are still cases where the user can consult the thesaurus but the thesaurus is not linked directly to the database (e.g. the HIV/AIDS Thesaurus). Moreover, none of the dynamic thesauri viewed allowed a user to browse a whole thesaurus serendipitously in the way one might browse the index of a book. Most significantly, the existence and the advantages of an available thesaurus are often not made known to the searcher. As described by Julia Marshall (Marshall, 2005, 120):

Hidden controlled vocabularies are tucked away under the hoods of search functions by various linking methods. A user types in a term, the search function finds the term in the controlled vocabulary and then displays documents linked to the term the user typed in. The user never actually sees the controlled vocabulary, just the results.

Surely the user needs to be alerted to the presence of a thesaurus at the access point of the database. A pattern, with some minor variations, appears to have become an established norm, and some well-known thesauri are being restructured to follow that pattern. The latest thesaurus to be restructured is the ERIC Thesaurus of descriptors. The ERIC clearinghouses and most of their services have been eliminated, and the database is now sponsored by the Institute of Education Sciences (IES) of the U.S. Department of Education. In conjunction with this change, the ERIC Thesaurus has been under reconstruction and is now accessible to users. The ERIC website at http://www.eric.ed.gov/ is the principal access to the system, and the presence of the thesaurus is indicated on the first page of the site. The first page of the thesaurus gives the user a choice of three modes of access: keyword search, browsing alphabetically, and browsing by category. Browsing alphabetically leads to an alphabetical list of terms from which the user may choose a term; to browse by category, the user selects a category, which also leads to an alphabetical list of terms. Selection of a term from either of these lists leads to a full thesaurus entry. Hyperlinking is used throughout. From that point the user can ‘start an ERIC search’ on the chosen term, leading to document surrogates containing descriptions and information on availability. Alternatively the user may select one of the related terms from the thesaurus record and move to a different choice of search terms. The ERIC approach mirrors to some extent the approach taken in the UNESCO database. It is quite elegant and may set a pattern for the searching thesaurus of the future. Nevertheless, the displays clearly indicate that this pattern of thesaurus display is probably more effective with small, narrowly focused databases than with large ones. ERIC is a very large database requiring a thesaurus with numerous categories and many terms.
Hence the alphabetical listings are quite long and the effectiveness of the categories is somewhat lost. When it comes to the kind of thesaurus format needed, there are some unanswered questions. Does the searcher need to be able to scan the whole thesaurus? Some would say ‘no’; others would disagree. The indexer surely should be able to view the whole thesaurus. One author suggested that “information professionals must rid themselves of the mythical perception that browsing is casual searching. Browsing is actually an important work engagement among knowledge workers” (Su, 2005, 66). Does this mean that different online formats are needed for searching and for indexing? How much human indexing is being done now? What is the future of human indexing and databases? More needs to be known about searchers and how they would use a thesaurus.
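The three access modes attributed to the ERIC Thesaurus above (keyword search, alphabetical browsing, and browsing by category) can all be served from one underlying term store. The sketch below is our own illustration of that pattern; the class, terms, and category names are invented, not ERIC's actual data.

```python
# One term store, three access modes, as in the ERIC-style pattern
# described above. All terms and categories are invented examples.
class SearchableThesaurus:
    def __init__(self):
        self.terms = {}  # descriptor label -> category

    def add(self, label, category):
        self.terms[label] = category

    def keyword_search(self, word):
        # Mode 1: keyword search over descriptor labels.
        return sorted(t for t in self.terms if word.lower() in t.lower())

    def browse_alphabetical(self, letter):
        # Mode 2: alphabetical browsing by initial letter.
        return sorted(t for t in self.terms
                      if t.lower().startswith(letter.lower()))

    def browse_category(self, category):
        # Mode 3: category browsing, leading (as in ERIC) to an
        # alphabetical list of the category's terms.
        return sorted(t for t, c in self.terms.items() if c == category)

th = SearchableThesaurus()
th.add("Lesson plans", "Instruction")
th.add("Distance education", "Instruction")
th.add("Literacy", "Reading")
print(th.browse_category("Instruction"))
# ['Distance education', 'Lesson plans']
```

All three entry points converge on the same term list, which is why the paper can describe them as leading to the same full thesaurus entry.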

6. Gateways and portals

When a giant information system becomes extremely large, it is difficult to locate the best materials available or to find all the important databases pertaining to a domain. Access to information has reached a stage where there is a need to provide one’s clients with improved access through “narrowing the focus to a super discovery tool” (ARL 2005, 3). Among the newest tools coming to the aid of information seekers are gateways and portals that provide access to web resources and/or various databases through one facility. They tend to be either client- or subject-oriented. Some cover vast territories, while others are developed by individual libraries and special information centers. The particular interest here is subject gateways. More precisely, they have been defined as follows:

Subject gateways are Internet services, which support systematic resource discovery, provide links to resources (documents, objects, sites or services) predominantly accessible via the Internet. The service is based on resource description. Browsing access to the resources via a subject structure is an important feature. (Koch, 2000, 24-25)

Quality-controlled subject gateways are created by editors and subject specialists to ensure a high level of quality. Completeness and balance are sought in collection development, and a policy is developed to ensure the contents are kept up to date. Quality metadata is used, and it should comply with an acceptable standard. Formalized content description is also recommended. Of particular interest here is the kind of subject access provided. Koch (2000, 25) indicates that there is a need for deeper levels of classification and that the subject/browsing structure is important. He also calls for keyword or, better, controlled vocabularies (e.g. subject headings, thesauri, etc.) for subject indexing, as well as advanced search and browse access. As with thesauri, taxonomies and ontologies, there is some confusion over the terminology. The terms ‘gateways’ and ‘portals’ appear to be used synonymously (Lancaster, 2003), and some have referred to them as virtual libraries. One example of this confusion is that Lancaster refers to InfoMine and the Librarians’ Internet Index as portals, while they are described elsewhere in the context of web directories. There are several important examples of these structures and the number is growing rapidly. Founded in 1988, the Ovid gateway provides access to electronic, scientific, and academic research information and provides multiple ways to carry out searches. The NLM Gateway is a web-based system that permits users to search simultaneously in multiple retrieval systems at the National Library of Medicine. Users can initiate a search from one web interface and carry out one-stop searching in all of the NLM’s databases. One huge gateway is CSA Illumina, described as a worldwide information company. It leads to bibliographies and journals in four primary areas: natural sciences, social sciences, arts and humanities, and technology. It provides a single point of access to a very large number of electronic resources.
One might argue that it is not necessarily a “quality” subject gateway, but that depends on the resources it leads to. It includes more than 100 databases, including ERIC, Scholars’ Portal, Information Science and Technology Abstracts (ISTA) and Library and Information Science Abstracts (LISA). These systems are primarily important tools in the academic and research communities. Five portals/gateways were searched. In general they were well organized and guided the user very quickly, with the options clear up front. At the top level, all provided access by keyword and by category. Some categories were very broad (e.g. Biology, Agriculture and Medical Sciences in InfoMine). One (ILO WorkGate) provided, as an option, a site map that was a hierarchical classification scheme. InfoMine provided keyword access and categories at the top level. Further searching was supported by three options: search options (e.g. author, title, etc.), display options, and subject search options. The last of these led to LCSH, LCC, and keywords. The response to a term led to an alphabetical list of subject headings or to a place in the classification schedules. The alphabetical list can be accessed by individual letter, but given the size of LCSH a large amount of scrolling is required.

A popular gateway for academics is Scholars’ Portal. The ARL Libraries developed Scholars’ Portal in 2002 and users of some of those libraries will be familiar with it. From the official web page four approaches are possible:

a) ‘Search Illumina’ leads to CSA Illumina and a broad list of databases from which the user can select (e.g. AGRICOLA, ERIC and LISA), taking the user to a description of the resource;
b) ‘Electronic journals’ leads to a list of journals that can be browsed and accessed alphabetically; the journal name is hyperlinked to the actual journal and ultimately to tables of contents, which allow the searching of a particular journal;
c) ‘RefWorks’ permits the building of a personal bibliography of sources;
d) ‘RACER’ leads to an interlibrary loan function.

One kind of access to these gateways is through the portal itself. A second approach is through a search of a known database such as LISA or ISTA; a user’s first encounter with a gateway may be through the database. For example, a library may create its own “system” by making licensing arrangements with certain databases. In such cases, a link is provided between the database and the library system. One such institution has arrangements such that searches of certain databases will retrieve items flagged as being available from the library’s electronic resources, through links to such sites as Scholars’ Portal, Haworth Press, Proquest, H.W. Wilson, etc. Where online full text is unavailable, the searcher is referred to the library catalogue, where it can be determined whether the library has the document in printed form. In addition, RefWorks may come into action, allowing the searcher to add the citation to a personal bibliography. All steps can be carried out from one’s personal computer. These gateways appear to be well organized and effective. That they deal with a defined portion of the Internet and impose a degree of quality on the output is significant. For academic and research users they provide an effective means of retrieval with a minimum of effort.
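The resolution workflow just described (licensed full text where available, otherwise a referral to the library catalogue for a printed copy) is essentially a decision rule. The following sketch uses invented record fields and URLs purely for illustration; it is not any vendor's actual link-resolver API.

```python
def resolve(record, licensed_sources):
    """Decide where a retrieved citation should send the user.
    'record' is an invented dict, not a real vendor format."""
    if record.get("full_text_source") in licensed_sources:
        # The library licenses this source: link straight to full text.
        return ("full text", record["full_text_url"])
    # No licensed full text: refer the searcher to the catalogue
    # to check whether the library holds a printed copy.
    return ("catalogue", "/catalogue?title=" + record["title"])

licensed = {"Scholars' Portal", "Proquest"}
hit = {"title": "Thesauri on the web",
       "full_text_source": "Proquest",
       "full_text_url": "https://example.org/doc/123"}
print(resolve(hit, licensed))
# ('full text', 'https://example.org/doc/123')
```

A RefWorks-style step would simply append the same citation record to a personal list after resolution, whichever branch was taken.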

7. Conclusion

Has control of subject access to the Internet improved since 1997? Definitely yes. There are more and better provisions for access to materials, and serious efforts are being made to single out the best materials. This is happening at four levels: the gateway/portal level, the individual database level, the website level and the resource level. Aside from the structures discussed above, lengthy pieces of text are appearing with tables of contents and outlines hyperlinked to the appropriate locations in the text. There are indexes that look very like book indexes, bringing together related parts internal to a text, and other devices. Most importantly, some of this might not have happened without the development of the web-based OPAC and the emergence of metadata. These two developments have done much to aid the changes that have taken place. However, there is much yet to be done. There could be further improvements in online thesauri to make them more usable for searching and browsing. The gateways and portals are still evolving and will improve as time goes on. Will the Internet ever be a perfect world? No. Things will become more complex, and in a world where everybody is an information provider, many providers will go their own way. However, it is to be hoped that “quality” information will rise to the top.

References

Aitchison, J. and Clarke, S. D. (2004). The thesaurus: A historical viewpoint with a look to the future. Cataloging & Classification Quarterly, 37, 5-21.

ARL Scholars Portal Working Group. (2002). Final report. Retrieved 12/12/2005 from http://www.arl.org/access/scholarsportal/final.html
Blake, M. (2003). NISO initiative for the next generation of standards for controlled vocabularies and thesauri. The Electronic Library, 21, 397-398.
British Standards Institution. (1987). British standards guide to the establishment and development of monolingual thesauri. London: BSI. (BS 5723:1987)
CSA guide to discovery. Retrieved 02/06/2006 from http://illumina.scholarsportal.info/
Doyle, L.B. (1961). Semantic road maps for literature searchers. Journal of the Association for Computing Machinery, 8, 573-578.
Gilchrist, A. (2003). Thesauri, taxonomies and ontologies: An etymological note. Journal of Documentation, 39, 7-17.
Garshol, L. M. (2004). Metadata? Thesauri? Taxonomies? Topic maps! Making sense of it all. Journal of Information Science, 30, 378-391.
Gullikson, S., Blades, R., Bragdon, M., McKibbon, S., Sparling, M. & Toms, E.G. (1999). The impact of information architecture on academic web site usability. The Electronic Library, 10, 293-304.
Hudon, M. (1997). Multilingual thesaurus construction: Integrating the views of different cultures in one gateway to knowledge and concepts. Information Services & Use, 17, 5167-5265.
International Organization for Standardization. (1986). Guidelines for the establishment and development of monolingual thesauri. 2nd ed. Geneva: ISO. (ISO 2788)
Johnson, E. H. & Cochrane, P.A. (1998). A hypertextual interface for a searcher’s thesaurus. Retrieved 05/12/2005 from http://csdl.tamu.edu/DL95/papers/johncoch.html
Koch, T. (2000). Quality-controlled subject gateways: Definitions, typologies, empirical review. Online Information Review, 24, 24-34.
Lancaster, F.W. (2003). Portals. In Indexing and abstracting: Theory and practice. 3rd ed. (352-354). Champaign, IL: University of Illinois, School of Library and Information Science.
Marshall, J. (2005). Controlled vocabularies: A primer.
Key Words, 13, 120-124.
Milstead, J. (1999). Report on the workshop on electronic thesauri, November 4-5, 1999. Retrieved 12/15/2005 from http://www.niso.org/news/events_workshops/thes99rprt.html
Milstead, J. (2000). Invisible thesauri: The year 2000. Online & CDROM Review, 19, 93-94.
National Information Standards Organization. (1993). Guidelines for the construction, format and management of monolingual thesauri. Bethesda, MD: NISO Press. (ANSI/NISO Z39.19-1993)
Nicholson, D. (2002). Subject-based interoperability: Issues from the high level thesaurus (HILT) project. The Hague: IFLA.
Notess, G. R. (2003). Internet subject directories. Retrieved 12/13/2005 from http://searchengineshowdown.com/dir/
Renardus. Gateways defined. Retrieved 10/20/2005 from http://www.renardus.org/about_us/subject_gateways.html
Ru, Y. & Horowitz, E. (2005). Indexing the invisible web: A survey. Online Information Review, 29, 249-265.
Shiri, A.A. & Revie, C. (2000). Thesauri on the web: Current developments and trends. Online Information Review, 24, 273-279.
Su, S-F. (2005). Desirable search features of web-based scholarly e-book systems. The Electronic Library, 23, 64-67.
Thinkmap visual thesaurus. Retrieved 2/23/2006 from http://www.visualthesaurus.com/
UMLS metathesaurus: Fact sheet. Retrieved 2/23/2006 from http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html

Welcome to the ERIC database. Retrieved 02/28/06 from http://www.eric.ed.gov/
Will, L. (2006). [E-mail received 01/31/06]
Williamson, N. J. (1997). Knowledge structures and the Internet. In I.C. McIlwaine (Ed.), Knowledge organization for information retrieval: Proceedings of the Sixth International Study Conference on Classification Research (23-27). The Hague, Netherlands: International Federation for Information and Documentation.
Williamson, N. J. (2000). Thesauri in the digital age: Stability and dynamism in their development and use. In C. Beghtol, L. C. Howarth & N.J. Williamson (Eds.), Dynamism and stability in knowledge organization (268-274). Würzburg: ERGON Verlag.
Zeng, M. L. and Chan, L. M. (2004). Trends and issues in establishing interoperability among knowledge organization systems. Journal of the American Society for Information Science and Technology, 55, 377-395.

B. H. Kwaśnik, Y.-L. Chun, K. Crowston, J. D’Ignazio, and J. Rubleske
School of Information Studies, Syracuse University, Syracuse, NY, USA

Challenges in Creating a Taxonomy of Genres of Digital Documents*

Abstract: We report on the process and difficulties of building a taxonomy of genres of digital documents. The taxonomy is being created for use in the experimental phase of an ongoing study of the usefulness of providing genre information to support information-seeking tasks. To build the taxonomy, we conducted field studies to collect webpage-genre information from 55 respondents: K-12 teachers, journalists, and engineers who routinely use the web for information seeking. Challenges described in this paper include the difficulties respondents experienced in identifying and naming genres, and those the researchers faced in unambiguously linking the genre identifications with clues to genre attributes and purposes.

1.0 Introduction

We report on one phase of a project whose aim is to discover whether and how identifying the genres of digital documents helps in a variety of information-seeking tasks (Crowston & Kwaśnik, 2005-07). Our project was motivated by the recognition that many bottlenecks in successful access to information occur because users are not able to adequately specify their needs, and systems are not able to adequately disambiguate the results of searches, resulting in undifferentiated and overwhelming amounts of information. The process of information seeking is often detached from the purpose for which the information is sought and the context in which the information will ultimately be used. Incorporating that context is of great importance in tailoring an information-seeking experience, but it is not easy to do. We propose that the inclusion of genre metadata might be useful in remedying this situation. Documents are communicative acts produced by members of a discourse community for a mutually understood purpose as reflected in the document’s genre. Every genre embodies not only a particular form and content, but also a recognized function: context in a capsule, as it were. We hypothesize that recognizing and using genre metadata for searching will allow the results to be contextualized for the searcher and therefore more precise. One of the challenges of studying genre in general is that there does not seem to be a consensus on what a genre is or how best to identify, construe, or study genres. In general, though, we note that most definitions include some consideration of the form of the document and often the expected content, as well as the notion of intended communicative purpose. The definition of genre we have adopted for our study, “a distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form” (Orlikowski & Yates, 1994, p.
543), appeals to us because of its recognition of genre as a fusion of form, function and content that is situated in a context of human endeavor.

1.1 A Summary of Our Study

Our overall project has three phases:

- Phase I. Harvesting and identifying a test set of web pages that have been coded for the motivating task, genre terms, and the clues people use to identify each genre’s form, content and function.
- Phase II. In the second phase, presently underway, we attempt to build a faceted taxonomy of the genres identified in Phase I. This is the phase on which we focus in this paper.
- Phase III. In the final phase we will test the utility of including genre information in an information-access environment that will be simulated using the results of the analysis in Phase II.

1.1.1 Information about Genres from the Searchers’ Perspective

In this section we briefly summarize the data-collection process in Phase I because it affects the conceptualization and building of the taxonomy described in subsequent sections of this paper. Because genres are situated in a community’s language and work processes, we felt it was important to learn about genres from people engaged in real tasks, and in their own words. Knowing that we could not study the universe of web genres or searchers, our first task was to identify respondents who would, in the course of their daily work, need to search the web, and who would most likely want to distinguish between one type of web page and another. That is, we tried to identify people for whom genre information might be useful, indeed necessary, for determining whether a given web page might be relevant to their needs.

Respondents | No. | Typical Tasks | Typical Genres | Comments
Teachers | 15 | Preparing and revising lesson plans | Lesson plan; Story page; Resource page | Teachers from four public and private schools; most grades from K-12 are represented
Journalists | 20 | Developing a story or article: generating ideas; searching for other stories on the same topic; collecting new information; fact-checking | News story; Directory; Press release | 18 print journalists, 2 television journalists
Engineers | 20 | Searches for tutorials, detailed information about products and tools, new or updated “knowledge” about a topic | Manual page; Commercial page; Product page | Includes aeronautical and software engineers from one multinational firm

Table 1. Our Source of Genre Information: Three Groups of Respondents

Our study solicited information about genre from three groups of respondents: K-12 teachers, journalists and engineers, as summarized in Table 1. We chose these three groups because the members of each share a discourse community in which a set of identifiable tasks and genres may play a role, and in which the identification of the genre of a document is likely to be important for their tasks, yielding a wide range of tasks, genres and genre attributes.

1.1.2 The Process of Eliciting Genre Information from Respondents

In general, our data-collection goal was to identify, for a collection of web pages, the genre of the page, the clues each respondent used to recognize the genre, and the usefulness of the page for a task, all in the words of the respondents. We used a think-aloud technique to understand the search goals and general strategy, but then followed it with a debriefing. During the interview, for every page visited we asked four questions:

1. What is your search goal?
2. What type of web page would you call this?
3. What is it about the page that makes you call it that? (If they did not understand the question, we would ask, “Which features/clues on the page make you call it that?”)
4. Was this page useful to you? How so (or why not)?

At the conclusion of the debriefing, and with permission from the respondent, we copied the URLs of the web pages and the sequence in which they were visited into a database. These data were later used to re-create the search. From this re-creation, screenshots were taken of each web page visited by the respondent, and a web-based slide show (with accompanying URLs) of the entire sequence was created for each session. We are able to use this for coding and analysis, and intend to draw from these slide shows to develop a corpus of web pages that a subsequent set of respondents can view and evaluate. We have nearly 1,000 screenshots of web pages visited by respondents, each accompanied by its original URL and audio recordings of the sessions with transcripts, or detailed field notes for those interviews where recording was not permitted.

2.0 Creating a Working Taxonomy of Genres

2.1 Why We Need a Taxonomy of Genres

Phase III of our study will consist of a series of experimental simulations, based on our collection of web pages, in which the subjects will be able to formulate queries or navigate results using genre metadata to enhance the process. We intend to simulate such enhancements by providing aids for query construction, clustering of identical or related genres, ranking of documents by a match on genre, and enabling genre-specific search strategies. Before we proceed to testing these ideas, however, we are faced with the formidable task of describing and organizing the genres and their attributes into a working taxonomy, which we consider an important formative step. Such a taxonomy will enable us to manipulate the experimental conditions and the interface to find out:

- Which specific facets of genre improve performance most?
- To what extent does using genre metadata to cluster and/or rank documents improve performance and utility?
- Are there differences in utility between interfaces that explicitly name genres and those that use other methods of labeling clusters of documents (e.g., by providing example documents)?
- What are the effects of granularity, that is, the specificity of genre identification? Is there, perhaps, a basic level of genre that is neither too general and abstract, nor too specific?
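One of the planned enhancements, ranking retrieved documents by a match on genre metadata, can be sketched in a few lines. This is our own minimal illustration, not the study's actual system; the documents, genre labels, and function names below are invented.

```python
# Illustrative sketch only (not the study's system): rank retrieved documents
# so that those whose genre metadata matches the searcher's requested genre
# come first. All names and data here are invented.
from dataclasses import dataclass

@dataclass
class Doc:
    url: str
    genre: str  # genre metadata label, e.g. "lesson plan", "press release"

def rank_by_genre(docs, wanted_genre):
    # Stable sort: matching documents (key False) sort before non-matching ones,
    # and the original retrieval order is preserved within each group.
    return sorted(docs, key=lambda d: d.genre != wanted_genre)

docs = [Doc("a.example", "homepage"),
        Doc("b.example", "lesson plan"),
        Doc("c.example", "press release")]

print([d.url for d in rank_by_genre(docs, "lesson plan")])
# -> ['b.example', 'a.example', 'c.example']
```

The same scheme extends naturally to clustering (group by the `genre` field) or to softer scoring over several genre facets rather than a single label.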

2.2 Why a Facetted Taxonomy?

We recognize that because genres embody attributes of form, function, and content, they are complex and thus do not lend themselves to classification using a simple set of criteria. Thus, we are attempting to create a facetted classification in which all the important aspects of the genres will be taken into consideration. Genre is a subtle and difficult-to-define notion, and our field studies, described in this paper, confirm this. Moreover, the possible set of genres is very large, and we have no way of knowing when and if we have a complete set. Without a strong theory of genre to guide us, it is problematic to set up a classification structure that will accommodate all known and future genres. For these reasons, we believe that a facetted scheme will serve us well for the purposes of this study. A facetted classification is a useful tool because:

- It does not require complete knowledge of all genres;
- It is relatively hospitable to new genres, so long as they can be described by the fundamental facets of form, purpose, and content and any subfacets yet to be identified;
- Facetted schemes have flexibility in that the elements can be invoked in an almost endlessly flexible way;
- It allows for requisite expressiveness because each facet can be developed independently to the degree of specificity needed. For example, we can have a more general taxonomy of purposes, but a much more specific taxonomy of genre types;
- It can accommodate a variety of theoretical structures and models – again, a different structure for forms, for tasks, for genres – thereby allowing a number of perspectives that can be invoked in the future phase of our study.

There are, of course, some obstacles: among these are the difficulty of establishing a robust set of basic facets, a lack of a “natural” relationship among the facets (e.g., the relationship of a given genre to a given task), and the difficulty of visualizing or representing a multidimensional scheme in any implementation of it. We feel that, on balance, a facetted scheme is our best option. We need a taxonomy that will do justice to the richness and complexity of the concept of genres, and which can be developed and expanded as needed for future use (Kwaśnik & Crowston, 2004).
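To make the idea of a facetted record concrete, here is a minimal sketch (ours, not the authors'; all facet values are invented examples) in which each genre is described by independent form, content, and purpose facets. A new genre is simply a new combination of values, and each facet can be queried on its own:

```python
# A minimal sketch (not from the paper) of a facetted genre record: the
# form, content, and purpose facets are independent sets of values, so the
# scheme stays hospitable to new genres without reorganizing a hierarchy.
from dataclasses import dataclass, field

@dataclass
class GenreRecord:
    name: str
    form: set = field(default_factory=set)     # clues about layout/appearance
    content: set = field(default_factory=set)  # expected content
    purpose: set = field(default_factory=set)  # communicative function

records = [
    GenreRecord("lesson plan", {"numbered steps"}, {"objectives"}, {"teach"}),
    GenreRecord("press release", {"dateline"}, {"announcement"}, {"inform"}),
]

def with_purpose(recs, p):
    # Query a single facet independently of the others.
    return [r.name for r in recs if p in r.purpose]

print(with_purpose(records, "inform"))  # -> ['press release']
```

Because each facet can be elaborated separately, the purpose taxonomy can stay coarse while the genre-type facet grows more specific, as the bullet list above suggests.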

3.0 Challenges to Creating the Taxonomy

Genres are a way of referring to communicative acts that people understand, more or less, but often find difficult to describe in their particulars. Thus, genres are recognized and used, but not so readily described and defined. The challenges fall into several categories: difficulties in identifying the genres themselves, difficulties in identifying genre attributes such as form and content, and, finally, difficulties in unambiguously linking the genres with their purposes.

3.1 Challenges to Identifying Genres

The first step in building a taxonomy is to identify the entities. In this case we were looking for genres and asked our respondents to label the various web-page “types.” Several difficulties emerged, which we describe below.

3.1.1 Difficulties with Identifying the Genre Unit

For practical purposes we had decided to arbitrarily limit the identification of genre to the web page as a whole, operationalized as the URL of that page. In practice, this decision has not worked out quite as well as we had envisioned because it is sometimes difficult to ascertain from the interviews which part of the page is the genre that is being described. For example, homepages are often described as both a homepage and an index page, because homepages usually have a list or an index of links embedded in the web page. A web page consisting of a search box, a search directory and other related links was described as both a search engine and a search directory, these labels depending on which element of the page was emphasized.

3.1.2 Difficulty of Eliciting Unambiguous Genre Labels

We learned that the genres of some types of web pages are more difficult than others for respondents to articulate. We describe a few examples of such instances:

- Multiple genre terms are applied to one document. Several genre terms (both conceptually similar and different) might be suggested for one web page. For example, one page was described as a first search step page, navigation page, and menu, with the comment “I don’t know if I have the vocabulary to describe it.” Under such circumstances, respondents might come up with genre terms based on the purpose or the content of the page, on personal feelings, or just on words on the page that catch their attention.
- Different types of pages are labeled with the same genre term. In the flow of the iterative process of asking for genre terms, respondents have a tendency to use some words repeatedly. One respondent described a page as a highlights page since she saw the word “highlights” on it. Later, she used the same term to describe a memo, a news release, a calendar page, and so on.
- The respondent lacks a term for a given genre. When respondents can’t easily name a genre, it is either because they can’t think of the term or because they don’t know if a term exists. In the first case, a respondent may just describe the page based on a personal feeling, such as calling it a frustrating page, or admit to not having a word for the page. If the page is identifiable, but the respondent doesn’t know if there is an appropriate word for it, he or she may struggle to find a term but, in the end, not be sure of it.
- Nested genres. A web page can be composed of one or more elements, each of which can be construed as a stand-alone genre by itself. For example, a web page was described as both an article and a newspaper. We coded article as a Genre Component of newspaper.
- Terms are too general or unspecific. When a genre term does not come readily to mind, respondents often provide a general or vague term such as a page with information.

3.2 Difficulties with Identifying Genre Attributes

Another step in building a taxonomy is to identify the criteria by which an entity (in our case a web-page genre) is aggregated with like entities or differentiated from unlike ones. The lack of clear and precise labels pointing to a given web page, as described above, was not our only problem. We have attempted to distinguish genre attributes along a number of criteria: form, content, and purpose. Participants were often vague about clues to these attributes. For instance, they might refer to a page as having a “look and feel” without specifying in what way. Since journalists are very familiar with the format of a news story page, for instance, they are good at identifying that genre; however, they may have difficulty specifying the clues that helped them identify it because such clues have become implicit and they barely pay attention to them.

3.2.1 Challenges in Distinguishing Form and Content

In coding we first flagged the genre term applied to a web page, and then tried to mark the clues the respondents identified in establishing their concept of that genre. Marking clues in a consistent manner according to the tripartite definition of form, expected content, and purpose has not been easy, however. The first two aspects are often convolved in the participants’ utterances, where it is difficult to ferret out what they mean or what is in their minds when they invoke a genre term. Should a passage be a clue for form, or content, or both? This convolution of form and content has three manifestations:

- Identifying aspects of key page elements that signify a page belongs to a genre. For example, one participant invoked a municipality genre, using the municipality’s seal as a clue. How much of a simplified seal “form” would have been enough to qualify it as a municipality page? Or was she looking at the particular “content” of the seal that made it specific to a municipality of interest?
- The mixture of form and content in total that establishes a page as part of a genre. For example, a participant readily assigned a genre term based on the presence of tabs that allowed for presentation of categories and subcategories. Was it the form of the page, with spatial separation of categories and less visual emphasis given to the subcategories, that mattered to him? Or was it the contextual relationships among the written material on the web page to which he was referring?
- Our own preconceived notions of what these “form” and “content” concepts mean. Achieving consistent coding for clues has been difficult when coders bring different conceptions to the task. For example, in deciding whether an image represented form or content, one coder interprets the meaning of the image and calls it “content,” while the other coder interprets an image as pure “form.”

3.2.2 Challenges in Identifying Purpose

One of the key ways in which genre provides context is by incorporating an understanding of the genre’s purpose or function. While most of the respondents can identify the purpose of the web page for their own work, it is not always clear whether the task requires a particular genre or whether the genre identified merely happens to be useful (but another one could have been just as useful).

- Borrowed Purpose. Another situation that causes some confusion is the difficulty in assessing whether the purpose of a genre is generated by the respondent’s situation, or whether they recognize the purpose others have for that genre. A homepage of a university that is described as an institutional page has several purposes depending on the discourse community. The purpose of the page from the institution’s perspective is to “get its message out,” while from the perspective of students and their parents, its purpose is to find out information about the university.
- Granularity of Tasks. We are finding that people’s tasks, as well as the genres that are useful for them, are at various levels of specificity. Some are expressed broadly, such as “double-checking facts,” while some are narrowly defined, such as “finding the phone number of Joe Smith.” This range and variety presents a problem in creating a taxonomy that has as one of its facets the tasks associated with the genre.

We are, therefore, working on a more general formulation of tasks that would capture the experience of our respondents but be applicable to a wide range of genres.

4.0 The Building Blocks of a Taxonomy: A Discussion

We describe a work in progress. Our intent is to build a taxonomy in which the entities are genres of webpages as they are used and recognized by people in the course of their usual professional endeavors. Because genres are a fusion of form, content, and purpose, it is important for us to identify these aspects of the genres as well and incorporate them into the taxonomy as facets. Thus far, we have been able to elicit a range of genre terms, and at least some of the clues to these genres’ attributes and function. Along the way, we have encountered some dilemmas: it seems obvious that people do in fact recognize and use genre information, but they are not always explicitly aware of this process, and thus are not able to offer us clear and unambiguous terms for genres or their attributes. We have also noticed that our three respondent groups have identified different sets of genres (in addition to many that overlap), and a different set of tasks and goals associated with those genres. This is encouraging, in that we are confident that we have identified a range of genre terms and uses, but it has also presented us with the added burden of analysis.

5.0 Conclusion

For many of the reasons outlined above, we are building the taxonomy relying on some compromises and some interpretation on our part. Nevertheless, we believe that having the taxonomy emerge for the most part from the respondents’ own reports is worth the effort in that it will add validity to the experiments in our next phase where we will attempt to describe and assess the value of the genre information in information seeking. Besides being critical to our own study, the digital-document genre taxonomy can be augmented in the future, and we hope it will be useful in other related studies, as well as in other efforts to build multidimensional representations of complex phenomena.

6.0 Notes

* This research was partially supported by NSF Grant 04-14482. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of the National Science Foundation.

7.0 References

Crowston, K., & Kwaśnik, B. H. (2005-07). How can document-genre metadata improve information-access for large digital collections? NSF IIS Grant 04-14482.
Kwaśnik, B. H., & Crowston, K. (2004). A framework for creating a facetted classification for genres: Addressing issues of multidimensionality. In Proceedings of the Thirty-Seventh Hawai'i International Conference on System Sciences (HICSS-37). Big Island, Hawai'i.
Orlikowski, W. J., & Yates, J. (1994). Genre repertoire: The structuring of communicative practices in organizations. Administrative Science Quarterly, 39, 541–574.

Hur-Li Lee School of Information Studies, University of Wisconsin-Milwaukee, USA

Navigating Hierarchies vs. Searching by Keyword: Two Cultural Perspectives

Abstract: The study explored how people in two different cultures conduct two types of subject searches. Twenty-four American students and 40 Taiwanese students were given the same task of looking for answers on the Internet by navigating through directories and by using keywords in the American-based Yahoo services. The findings pointed to differences between the groups due to their familiarity with the English language. On the other hand, presumed cultural differences did not seem to result in different search behaviors or preferences. These findings were preliminary, and suggestions for future research were made.

1. Introduction

It has been understood that culture influences the way people perceive, organize, seek, and use information. In knowledge organization, the main focus of research in this area is cultural influences on subject representation in organizational theory and tools, presuming that subject representation facilitates or hinders information seeking. Some of the literature criticizes cultural bias in existing subject languages such as the Library of Congress Subject Headings (Gerhard, Su, & Rubens, 1998) and Dewey Decimal Classification (Pacey, 1989). Other literature proposes remedial models to reduce bias or problems related to cultures (e.g., Kublik et al., 2003; Kwaśnik & Rubin, 2003). Little in knowledge organization research has investigated how culture shapes the way people use knowledge organization tools. The studies of searching behaviors often examine keyword access exclusively without cultural contexts (e.g., Park, Lee, & Bae, 2005; Spink & Jansen, 2004). Taking a different approach, this paper explores how people in two different cultures conduct subject searches by using the same search tool, hoping to shed some light on the relationship between cultural products of knowledge organization and users in cultural contexts. The key questions being explored are:

- What are the similarities and differences in subject search approaches taken by two groups of participants from two distinct cultures?
- Is it feasible or plausible to associate some of the cultural differences between the two groups with the differences exhibited in search behaviors?

The preliminary study reported here is an extension of a recent study that examined how 24 library and information science (LIS) students in the United States used U.S.-based Yahoo internet search services (http://www.yahoo.com/) in two different ways: navigating its directories, an alphabetico-classed system with hyperlinked categories for navigation, and searching by keyword (Lee & Olson, 2005). In the new study, the same task was given to 40 LIS students in Taiwan and their search results and discussions of the search results were then compared to those in the American study. Two cultural factors were considered in selecting the Taiwanese group for comparison because of their potential impact on search behaviors. First, the written Chinese language (the official language used in Taiwan) is not based on an alphabet and requires very different input techniques on a computer, making keyword searching in Chinese more complex than in

English. Second, hierarchies are a common feature in the Taiwanese culture, especially pervasive in family and social relationships, government, and many daily activities. This familiarity with hierarchies may translate into a preference for, or fluency in, hierarchical navigation in Taiwanese participants’ information seeking.

3. The Study

In preparation for the study, a preliminary search on Yahoo’s site in Taiwan (hereafter, YahooT; http://tw.yahoo.com/) was conducted, and it was determined to be inadequate for the study because of its shallow hierarchies and the limited number of sites listed under individual categories. The American Yahoo site (hereafter, YahooA), the same tool used in the American study, was then chosen for the Taiwanese study, introducing familiarity with English as an additional factor. Hence, sufficient language skill was necessary for completing the task of the study. After contacting two U.S.-educated LIS professors in two national universities in Taiwan, three of their graduate-level beginning research-methods classes were selected. Submitted reports on search results from a total of 40 students were included in the analysis. The students in each class were given a task of finding 5 national or international scholarly or professional associations in microbiology by conducting two types of search (i.e., navigating directories and entering keywords on YahooA). They had a week or two for the assignment. The instructions asked them to write down or print out their searches and a comparison of the searches before they attended a lecture given by the researcher, which would require the use of their searches as an example in discussing comparative study as a research method. After the lecture, the researcher asked the students to give their consent for having their work included in a study and to complete a consent form with a short demographic survey. The students then submitted the form together with their written notes or printouts. LIS education in Taiwan begins at the college level (i.e., bachelor of arts). The three classes chosen were all the basic research-methods classes at the graduate level, with the majority of the participants being master’s students and one a first-year doctoral student.
The doctoral student was not excluded, because the person’s lack of an LIS background made him or her more similar to the American students in their limited prior exposure to knowledge organization. Two of the classes (32 participants) were “traditional” in the sense that students in those classes were full-time and mostly entered the master’s program directly after graduation from college or with limited work experience. Of the 32 traditional students, 5 had working experience in organization of information (OI), ranging from 1 to 4 years. The third class consisted of part-time students, mostly holding a full-time job and attending classes in the evenings. Of the 8 participants in this group, half had OI work experience, ranging from 2 to 15 years. Three others in this class failed to complete the task and were excluded from the study. One of them gave English deficiency as the reason for not completing the task. All but 4 participants had taken OI courses, some at the undergraduate level, a couple at the graduate level, and others at both levels. Among the Taiwanese participants, 36 had used Yahoo services: 15 of them had used YahooT only, 21 had used both YahooT and YahooA, and 4 participants did not provide such information. Seven of the participants had only searched Yahoo by keyword, 29 had done both keyword and directory searches, and 4 did not provide this information. The American study included 24 LIS master’s students taking their first OI course via the World Wide Web. Because the data source for this study was a class assignment, no information about demographics or prior experience was purposely collected. Even though not all students in the classes participated in the study, the mandatory self-introductions provided at the beginning of the semester indicated that, as a group, none of the students had had any prior formal training or knowledge in OI in general or in classification theory specifically.
Only one of them clearly stated in the submitted assignment that he or she had not used Yahoo before. Many admitted casually that their experience with the Internet was overwhelmingly limited to keyword searching, and few said that they had used Yahoo directories in any substantial way. For a more detailed description of these students and the study design, as well as the full instructions for the task required of the participants, please see Lee & Olson (2005).

4. Results

4.1 Paths Taken and Keywords Used

Table 1 shows the paths taken by American and Taiwanese participants. Some participants in both groups took more than one path. Path E was the number one choice of both groups; it is also the common path in traditional classification schemes, with a general form element (i.e., Organizations) placed after the topical ones progressing from general (i.e., Science) to specific (i.e., Microbiology). The American students were told to start with “Science”; the Taiwanese students were not. Interestingly, almost all Taiwanese participants started with “Science” in navigating Yahoo directories. Only one of them first experimented with “Education” and “Society & Culture” before starting over from “Science”. In both groups, some participants took two other paths that included the same 4 elements as those in Path E, with the form interposed between topical elements (Paths H and M).

Paths Taken | American (n=24) Tally | Percentage | Taiwanese (n=40) Tally | Percentage
A. Education->Organizations->Professional | | | 1 | 2.7%
B. Science->Biology->Microbiology | | | 2 | 5.4%
C. Science->Biology->Microbiology->Conferences | | | 1 | 2.7%
D. Science->Biology->Microbiology->Institutes | | | 9 | 24.3%
E. Science->Biology->Microbiology->Organizations | 21 | 87.5% | 32 | 86.5%
F. Science->Biology->Microbiology->Web Directories | 1 | 4.2% | 1 | 2.7%
G. Science->Biology->Organizations | 1 | 4.2% | 1 | 2.7%
H. Science->Biology->Organizations->Microbiology | 3 | 12.5% | |
I. Science->Biology->Organizations->Professional | 1 | 4.2% | |
J. Science->Biology->Parasitology->Organizations | 1 | 4.2% | |
K. Science->Medicine | 1 | 4.2% | |
L. Science->Medicine->Microbiology and virology->Organizations | 1 | 4.2% | |
M. Science->Organizations->Biology->Microbiology | 3 | 12.5% | 6 | 16.2%
N. Science->Research | 1 | 4.2% | |
O. Science->Science and Society | | | 1 | 2.7%
P. Society & Culture->Social Organizations | | | 1 | 2.7%
Not indicated | 1 | 4.2% | 2 | 5.4%

Table 1. Paths taken by participants

At the time, keyword search results in Yahoo had a hierarchical directory path displayed at the beginning of each item retrieved, making a suggestion to the searcher. To avoid such external influence, the task instructions asked the participants to navigate directories before searching by keyword. The keyword choices of the American participants appeared to have been significantly influenced by the results of the directory search, in that “microbiology” (100% of the American participants) and “organization(s)” (96% of them) were overwhelmingly their favorite two keywords (see Table 2). The Taiwanese participants, on the other hand, were less focused on the word “organizations” used in the directories. Thirty-two of them (80%) tried “association”, the word used in the task instructions. In addition, “organization”, “society”, “academy” and “institute” (n = 3, 6, 2, and 1, respectively) were used as alternatives by the Taiwanese. A couple even experimented with “microbiological” as opposed to “microbiology” and “scholar” as opposed to “scholarly”.

Keyword | American (n=24) Tally | Percentage | Taiwanese (n=40) Tally | Percentage
Microbiology | 24 | 100% | 39 | 98%
Organization(s) | 23 | 96% | 3 | 8%
Professional | 8 | 33% | 2 | 5%
Association(s) | 4 | 17% | 32 | 80%
National | 1 | 4% | 1 | 3%
Scholarly | 1 | 4% | 4 | 10%
Society | 1 | 4% | 6 | 15%
International | | | 2 | 5%
Academy | | | 2 | 5%
Institute | | | 1 | 3%
Microbiological | | | 1 | 3%
Scholar | | | 1 | 3%
Not indicated | | | 1 | 3%

Table 2. Keywords used by participants.
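As a spot-check of Table 2's arithmetic (our own addition, not part of the paper), each percentage is simply the keyword's tally divided by the group size (24 Americans, 40 Taiwanese), rounded to a whole percent:

```python
# Our own arithmetic check (not from the paper): Table 2's percentages are
# each keyword's tally over the group size, rounded to a whole percent.
def pct(tally, n):
    return round(100 * tally / n)

# A few tallies copied from Table 2.
assert pct(23, 24) == 96   # American "Organization(s)"
assert pct(4, 24) == 17    # American "Association(s)"
assert pct(39, 40) == 98   # Taiwanese "Microbiology"
assert pct(32, 40) == 80   # Taiwanese "Association(s)"
assert pct(6, 40) == 15    # Taiwanese "Society"
print("Table 2 percentages check out")
```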

4.2 Language Matters

The selection of a general science topic, “microbiology”, was an intentional one to avoid search failure (in both groups) and potential difficulties caused by language (in the Taiwanese group). It turned out that familiarity (and unfamiliarity) with language could be both an advantage and a disadvantage. Being non-native English speakers, the Taiwanese attempted many more categories in navigation to match the concept “association”, including “conferences”, “institutes”, “Web directories” and “society”. Even after seeing the results from the first search, some used more alternative words in the keyword search in addition to “association”, the word given in the instructions. A discussion with them in the lecture revealed that they used a dictionary to identify additional keywords (i.e., academy, institute, and society). This could also explain why few of them used “organization” as an alternative to “association”: the dictionary did not include the former for the same concept. In all, 13 keywords were used by the Taiwanese, as compared to only 7 by the Americans. One Taiwanese participant’s selections of “Science—Science and Society” and “Society & Culture” were also likely a result of language barriers. The English word “society” in these contexts would not be misunderstood by American graduate students.

The Americans were less attentive to language. In keyword search, most of them did not explore valid keywords other than the two shown in the directories, “microbiology” and “organization”. The keyword given in the instructions, “association”, was used by only 4 of them, and the keyword appearing in many organizations found in the directory search, “society”, was searched by only 1 American participant. By contrast, they were more exploratory and quicker to go down a path in navigation without carefully evaluating the appropriateness of a concept when one word in a category appeared to be somewhat relevant to the topic at hand (e.g., “parasitology” and “medicine”).

4.3 Other Cultural Issues

As mentioned in the Introduction, the Chinese language requires a more complex input method on a computer, and hierarchies are more pervasive in the Taiwanese culture. Those two characteristics may lead to a reasonable speculation that the Taiwanese feel more comfortable with hierarchical navigation or possibly use it more often. In talking with the Taiwanese students during the lectures, this speculation was found to have no merit. Many of them might have used Yahoo directories previously; but all of them, with no exceptions, currently used keyword search exclusively in their inquiries on the Internet. Furthermore, the complex Chinese input steps were not even a factor for the Taiwanese participants. Their notes also indicated somewhat more negative comments on navigation and more positive comments on keyword search, as compared to their American counterparts. For example, 45% of the Taiwanese participants (33% of Americans) mentioned that it would require subject knowledge to navigate a hierarchy and thus it was more difficult for those who lack such knowledge. Conversely, 22.5% of the Taiwanese (16.7% of Americans) praised the keyword search, saying that it was easy and less time-consuming; but 35% of the Taiwanese (66.7% of Americans) complained that keyword search results required careful evaluation due to the large number of irrelevant sites being retrieved. Interestingly, the Taiwanese participants did not dislike keyword search in English despite the fact that they had to consult a dictionary to come up with additional keywords. More interestingly, 1 out of 4 Americans believed that a disadvantage of keyword search was the need for knowledge of useful keywords, but only 1 in 5 Taiwanese felt the same. One possible explanation is that Yahoo directories are basically an alphabetico-classed system, obviously possessing many problems inherent in verbal subject access.
One such problem is the inconsistent contexts in the structures of Yahoo directories, a consequence of the alphabetical arrangement. Without a proper context, many words in the directories carry the same kind of uncertainty that non-native speakers face in dealing with keywords.

5. Discussion and Conclusions

The findings pointed to different searching behaviors caused more by language familiarity than by cultural differences. In other words, the study was able to answer the first research question but unable to address the second one. Two other interrelated questions are raised as a result of these findings:

- Is it because the Taiwanese participants have never or rarely been exposed to online tools that offer hierarchical navigation as a search mechanism?
- Is it possible that LIS education and information retrieval systems in Taiwan have had heavy influence from the United States and, as a result, Taiwanese LIS students exhibit behaviors more similar to Americans?

These questions were discussed with the Taiwanese students in the last of the lectures. Tables 3 and 4 are simplified versions of two of the search results screens shown to them. Both the students and their U.S.-educated professor agreed that their online catalog functioned in the same fashion. The list in Table 3 represents a typical list of results for a subject term search (the term used was “biology”) and the list in Table 4 a typical list of results for a classification number search (the number searched was “QH307”, representing “biology—general works” in the Library of Congress Classification scheme). In each list, the retrieved entries are arranged alpha-numerically, with no references to the various types of semantic relationships, broader, narrower or related, between “biology” and other concepts. Even though a typical online catalog in Chinese is not arranged by an alphabet, the Taiwanese participants were used to another linear arrangement (sometimes based on the number of strokes in each character) instead of a classified one. Neither were cross-references present for navigation. The participants further confirmed that common online tools besides the online catalog in their libraries provided no hierarchical or other classification-like search choices. In fact, they were unable to recall having conducted any classification-like searches in libraries or having learned this kind of search strategy in their LIS coursework.

#     Titles   Subject Heading                     Type
[1]   331      Biology                             LC subject headings
[2]   1        Biology                             LC subject headings for children
[3]   76       Biology                             Medical subject headings
[4]   1        Biology Abbreviations               LC subject headings
[5]   1        Biology Abstracting and indexing    LC subject headings
[6]   1        Biology Abstracts                   LC subject headings

Table 3. A simplified alphabetical list of subject strings retrieved in an online catalog.

#     Call #              Full Title                                                                             Author                           Date
[1]   QH307 .B31          Workbook for field biology and ecology, by Allen H. Benton and William E. Werner, Jr.  Benton, Allen H.                 1957
[2]   QH307 .B31 1972     Manual of field biology and ecology                                                    Benton, Allen H.                 1972
[3]   QH307 .B58          Ideas of biology. Drawings by Anne L. Cox.                                             Bonner, John Tyler               1962
[4]   QH307.2 .A4 1982    Nature and origin of the biological world / E.J. Ambrose                               Ambrose, Edmund Jack             1982
[5]   QH308 .A37          General biology                                                                        Alexander, Gordon, 1901-         1956
[6]   QH308.5 .B75 1962   Principles of biology                                                                  Buffaloe, Neal Dollison, 1924-   1962

Table 4. A simplified alphabetical list of call numbers retrieved as the result of a classification number search.
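The contrast described above can be made concrete with a small sketch. This is illustrative only (not from the study): an alphabetico-classed catalog simply sorts subject strings, while a classification-like display additionally needs explicit semantic links (cross-references), which the catalogs examined lacked. The entry "Marine biology" and its broader-term reference are invented for illustration.

```python
# Invented entries modeled loosely on Table 3; counts are title counts.
entries = [
    ("Biology -- Abstracts", 1),
    ("Biology", 331),
    ("Biology -- Abbreviations", 1),
    ("Marine biology", 12),  # a narrower concept, not in the tables above
]

# Linear, alphanumeric arrangement, as in the catalogs the participants used:
linear = sorted(term for term, _count in entries)

# A classified arrangement would additionally require semantic links,
# e.g. a broader-term (BT) cross-reference (hypothetical here):
broader = {"Marine biology": "Biology"}
```

Sorting places "Marine biology" far from "Biology" even though it is semantically narrower; only the explicit `broader` link could support hierarchical navigation between them.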

The findings of this study seem to suggest that a new research design should consider Taiwanese participants who have had less exposure to Americanized information retrieval theory and systems. Further research should also examine both American and Taiwanese users with information retrieval systems that apply a well-designed classification scheme or taxonomy in their own languages, appropriate to their cultures and lifestyles. Comparisons must be made at several levels to understand similarities and differences between the two groups. Using a more sophisticated system in their native tongue and contexts, users could then complete more complex tasks that better resemble their real-life situations. Future studies may also compare users who represent other cultures similar to the ones in this study. For example, we may study users in China and Chinese users in Singapore, for Singapore has a predominantly Chinese population but uses English as its official language. Comparisons may also be made between Chinese users born and raised in Singapore and Chinese users born and raised in the United States, for both use English as the predominant language, but American Chinese may be exposed to a less hierarchically-oriented social structure. These studies will provide evidence to help us better understand whether cultural differences have more or less influence on search preferences than the topic and context of a particular inquiry.


Maria Teresa Biagetti Scuola Speciale per Archivisti e Bibliotecari, Università 'La Sapienza', Roma, Italy

Indexing and scientific research needs

Abstract: The paper examines the main problems of semantic indexing in connection with the needs of scientific research, particularly in the field of the Social Sciences. The Multi-modal indexing approach, which allows researchers to find documents along different dimensions of research, is described. Request-oriented indexing and the Pragmatic approach are also discussed; finally, the possibility of adopting C. S. Peirce's theory of Abduction as a fundamental principle of indexing is outlined.

1. Introduction

One of the main demands in Library and Information Science is to focus attention on the choice, organization and use of the most appropriate semantic access points, both for traditional publications and for digital resources. Theoretical considerations concerning the creation, functions and performance of semantic access points can improve the substance of the entire field of Library and Information Science. Presently, for instance, full-text indexing systems in databases offer new and important possibilities for subject research. However, it is necessary to add to the research possibilities offered by full-text retrieval systems those offered by semantic access points, with the aim of enhancing precision in document retrieval (B. Hjørland, L. Kyllesbech Nielsen, 2001). My purpose is to draw attention to the most fundamental problems of the subject indexing process, to contribute to a deeper understanding of the essence of indexing and, at the same time, to suggest ways to increase the effectiveness of information retrieval systems, taking the needs of researchers into account. I am going to present some of the most significant theoretical approaches elaborated in past decades in the field of semantic indexing, and I will consider their validity and the contribution they can offer to solving problems of knowledge organization in the current era.

2. Aboutness model vs. Multi-modal approach

The conceptual contents of books and documents are more precisely identifiable and unequivocally definable in the Natural and Physical Sciences, where it is easy enough to recognize the specific subject of a book or article. In the Social Sciences, instead, the same contents can be considered from different points of view, can receive many interpretations and, of course, can be used within different directions of research by different scientists, each belonging to a specific domain and having his particular scientific experience. The subject indexing process that starts by analysing a document in order to determine its "aboutness", i.e. what the document is said to be about, implies analysing the precise content of the book and measuring it against an already defined set of arguments that the document could be about. This model implies producing a single subject index for each document. Following the Aboutness model, in most cases, and particularly in the Social Sciences, one can produce an inadequate statement into which the richness and copiousness of a document's contents are condensed. The Aboutness approach does not allow the book's information potential to stand out or be emphasized (D.F. Swift, V.A. Winn, D.A. Bramer, 1978).

Considering, now, the problems of subject indexing from a general point of view, we can refer to an important distinction. In document indexing we can recognize two distinct levels: the level of the object (or concept) and the level of the subject, particularly in indexing books and documents belonging to the Social Sciences, in which ideological aspects are more evident and cultural orientations more incisive. The object-approach to indexing emphasizes the objects, or concepts, the document is about; the subject-approach, instead, is concerned with the way the document's author developed his discourse and the point of view from which he examined the object (A. Serrai, 1980). What would be the most appropriate index for a book dealing with "war horrors"? "War" (the object), or, perhaps, in agreement with the author's point of view, "Pacifism" (the subject), i.e. the issue really dealt with by the author? Or, perhaps, I would suggest, both? At the end of the Seventies, the Multi-Modal approach was elaborated in order to organize indexes aimed at representing the contents of books and documents in the Social Sciences, and in particular in Education Science (D.F. Swift, V.A. Winn, D.A. Bramer, 1977). In the Social Sciences researchers interpret the content of books according to their own scientific orientation. They also use different criteria in judging the relevance of the same documents, depending on the point of view from which they start. Through the Multi-Modal approach, documents can be represented considering different research approaches and taking into account the ways that are most interesting for users, using a plurality of viewpoints that can accommodate all the possible dimensions of research. The indexes used to represent the contents of books and documents are grouped into categories, which offer a multiplicity of starting points for researchers.
These categories can be considered as different dimensions, which represent a framework within which an indexing system might produce and organize data about documents. These dimensions could be, for instance: 1) theoretical orientation, 2) method of research, 3) empirical situation under study, 4) data collected. Each of these dimensions represents a particular characteristic of the book indexed; together they represent the complete framework of its content. The process of indexing can produce a number of indexes, each centred on a particular category (D.F. Swift, V.A. Winn, D.A. Bramer, 1977 and 1979). Each book or document will be represented by a number of descriptions, so that several frameworks organize the same books or documents, putting into practice a faces-of-knowledge model. The system operates on the assumption that in a scientific field knowledge presents various faces, and it is able to realize a multiple representation of documents' contents.
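The multi-modal idea of one document carrying several descriptions, each grouped under a dimension, can be sketched as follows. This is a minimal illustration, not the system described by Swift, Winn and Bramer; the example document and all terms are invented.

```python
# One document, four dimensions of description (per the list above).
document = {
    "title": "A study of classroom interaction",  # invented example
    "descriptions": {
        "theoretical orientation": ["symbolic interactionism"],
        "method of research": ["participant observation"],
        "empirical situation": ["secondary-school classrooms"],
        "data collected": ["field notes", "interviews"],
    },
}

def find(documents, dimension, term):
    """Retrieve documents whose description under the given dimension
    contains the search term."""
    return [d for d in documents
            if term in d["descriptions"].get(dimension, [])]

hits = find([document], "method of research", "participant observation")
```

Because each dimension is a separate starting point, a researcher interested in method retrieves the document through "participant observation", while another retrieves the same document through its theoretical orientation.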

3. Request-oriented indexing and Pragmatic approach

In the field more particularly devoted to Information Retrieval (IR), too, a new approach was devised: Problem-oriented indexing (or Request-oriented indexing). In this approach the view of the researcher and his possible queries are primary: the indexing process starts by anticipating users' problems and queries, and then establishes relationships between the analysed entities and these anticipated queries. The indexers work as informational agents, recognizing the potential relevance of documents even in situations where that relevance is not obvious and evident. In this approach, a document's subjects are determined by considering the answers the document could give to the researcher's possible queries: request-oriented subject indexing has to predict in advance the research questions to which the document could provide an answer (D. Soergel, 1985). The User-oriented and Request-oriented approaches, in fact, allow the document's subject to be represented starting from different interpretations of the document's "relevance", in consideration of future queries. A book or document, generally speaking, could answer many queries; still, semantic analysis can emphasize a number of queries that the indexer thinks the document could answer in the future. Since the Nineties a Pragmatic approach has also been devised in Library and Information Science, elaborated in reply to the requirement of representing the knowledge in documents while taking into account the informational needs of researchers (B. Hjørland, 1992). This approach, too, contributes to the viewpoint that takes the different needs of researchers particularly into account.

According to that conception, determining the subjects of books and documents involves identifying a number of properties of the book, but it also involves an evaluation and a decision about the "priorities" of these properties, i.e. which properties manifested by a document have greater relevance in a given historical age. If the indexer cannot evaluate the properties of documents and establish priorities, taking users' needs into account, the description of documents' contents is restricted to superficial aspects and properties: "[…] a 'pure' description of documents without connection to other modes of cognition such as hypothesis, prognosis etc., can only extract the more trivial and superficial properties of the document." (B. Hjørland, 1992, p. 188). We could identify the existing properties of any document and analyse the epistemological potentials of documents: "Subjects in themselves must thus be defined as the epistemological potentials of documents" (B. Hjørland, 1992, p. 185, italics in the text). The properties of documents can assume a different meaning according to different scientific domains and different scientific aims. Actually, it is the level of society's development which determines the realization of documents' potentialities. Epistemological potentialities may be realised according to the different levels of development of human society: from time to time, the potentialities may or may not be developed. "A document has not just one true subject. It has several epistemological potentialities that are given priority based on disciplinary viewpoints" (B. Hjørland, 1997, p. 42). A broader theoretical framework for a fuller comprehension of this indexing conception can be recognized in the philosophical development of Semiotics following Charles Morris's elaboration (C. Morris, 1938). Morris takes into account three elements: "sign vehicle", "denotatum" and "interpretant", making small changes to the definitions of Charles S. Peirce, who defined the nature of the sign as a triadic relation of a "representamen", an "object" and an "interpretant". Morris adds to these three elements a fourth, the "interpreter", in whose mind the "sign" produces the "interpretant" (in Peirce's view). Morris divided Semiotics into three branches, or, we could say, he studied Semiotics from three different points of view: Syntactics, Semantics and Pragmatics. Pragmatics' concern is the study of the relationship that "signs" have with their "interpreters", which are above all living organisms. The pragmatic point of view of Semiotics analyses the psychological and sociological phenomena involved in the signification process and in the way signs operate. Pragmatics includes Semantics as well as Syntactics, and it is related to the behavioural systems of human beings. In order to discuss the relationship of signs with their interpreters, we must first know the relationships of signs among themselves. The subject indexing process is closely linked to the pragmatic point of view of Semiotics. Indexing can proceed on the consideration that the properties of documents can be described as dependent on the social and cultural context within which knowledge has developed.
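The request-oriented indexing discussed at the start of this section can be sketched mechanically. This is a hypothetical illustration, not Soergel's system: the indexer records anticipated queries and links documents to them, so retrieval starts from the user's problem rather than from the document's wording. All query strings and document identifiers are invented.

```python
# Anticipated queries mapped to the documents predicted to answer them.
anticipated = {
    "how to reduce classroom conflict": ["doc-17"],  # invented data
    "effects of war on civilians": ["doc-42"],
}

def index_document(doc_id, predicted_queries, index):
    """Link a document to each query the indexer predicts it can answer,
    even where the document's own vocabulary never uses the query terms."""
    for query in predicted_queries:
        index.setdefault(query, []).append(doc_id)

# The indexer judges that a new document is also relevant to an
# already-anticipated research question:
index_document("doc-58", ["effects of war on civilians"], anticipated)
```

Retrieval then consists simply of looking up the researcher's question, since relevance was decided at indexing time rather than at search time.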

Besides, several authors in Library and Information Science have suggested that it is necessary to study in depth the theoretical explanations that Semiotics provides, in particular the Semiotics elaborated through the philosophical, gnoseological and epistemological ideas of Charles Sanders Peirce (J.E. Mai, 2001; T.L. Thellefsen, M. Thellefsen, 2004). In fact, the entire subject indexing process could be considered as a single act of interpretation, i.e. "semiosis", and, more precisely, it could be studied by means of the concept of "unlimited semiosis" elaborated by Peirce. If we agree that the subject indexing process is a kind of interpretation process, we can also understand why we see such great differences when we analyse the outcomes of the indexing process, and also why it is so difficult (or, perhaps, impossible) to propose universally valid rules for semantic indexing that would ensure a kind of uniformity. The "Science of signs" that Peirce devised was founded on Gnoseology and Epistemology: "semiosis" coincides with the process of knowledge. Thinking, like Science, goes forward by means of Hypothesis and Interpretation. The Logic of scientific discovery, which specifically belongs to Epistemology (the study of scientific knowledge), and in particular the Logic of the "ampliative or synthetic inferences" that Peirce elaborated, can be considered a fertile field that could help us better understand the subject indexing process. In "abductive inferences", the conclusion does not necessarily derive from the introductory statements; the inference is ampliative rather than merely explanatory. "Abductive inferences" suggest a hypothesis of explanation, starting from a set of data. It seems, therefore, that the process of Abduction could be of interest for indexing.

Semantic indexing provides tools through which users can enlarge their knowledge: the subject indexing process belongs to the "Logic of the ampliative inferences", and in any case to the processes of Induction and Abduction. The process by which the indexer decides the subject of a document is an Abduction process, as is the process that helps researchers decide on research strategies (A. Serrai, 1974). Indexing theory should assume the logical process of Abduction as its fundamental principle. The process of semantic indexing presents the same characteristics as the process of Hypothesis. This certainly implies the possibility of creating a number of Hypotheses, each considering different scientific orientations and different research approaches. We should investigate these processes much further, also considering the performance of Artificial Intelligence. In Library and Information Science we should try to elaborate more correct representations of the knowledge expressed in books and documents, also taking into consideration the broad opportunities offered by different scientific fields. That means that the knowledge expressed in books and documents can be used within different views of research, by different scientists, each belonging to a specific domain and driven by his particular scientific exigencies. Actually, the Pragmatic approach is also based on "Domain analysis", a field of research concerned with the processes of document creation in every scientific field, the informative structure of each science, the bibliometric distribution of scientific literature in each scientific field and, finally, interdisciplinary exchanges. According to this approach, Information Science should elaborate contents belonging to different domains of knowledge, within different knowledge structures, using the particular vocabulary of each scientific domain (B. Hjørland, 1997).
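The abductive step just described, reasoning back from a surprising fact to the hypotheses that would explain it, can be illustrated with a toy sketch. This is only a schematic reading of abduction, with invented rules; it is not an implementation of Peirce's logic.

```python
# Each hypothesis is paired with the facts that would follow from it.
rules = {
    "it rained": ["the street is wet"],
    "a pipe burst": ["the street is wet", "the water bill is high"],
}

def abduce(observation, rules):
    """Return every hypothesis whose consequences include the observation.
    Abduction is ampliative: several hypotheses may explain the same fact,
    and none is guaranteed to be true."""
    return [hypothesis for hypothesis, consequences in rules.items()
            if observation in consequences]

hypotheses = abduce("the street is wet", rules)
```

The parallel with indexing is that the indexer, confronted with a document, likewise entertains several candidate subjects ("hypotheses") that would account for its content, and must choose among them.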

4. Conclusion

Considering research purposes, in order to allow researchers a multiplicity of starting points, one solution could be to establish and prepare a number of subject headings, each describing a different epistemological potentiality of a book. Different libraries and different information centres could define their subject headings taking into account the different needs of their specific users. The Multi-modal approach, as well as Request-oriented indexing and the Pragmatic approach, can offer a substantial contribution to the development of the theoretical foundation of Library and Information Science. Semantic indexing of the same books by different libraries, each considering different views and scientific approaches, would provide a rich epistemological resource. For instance, the different headings could be organized in a union catalogue: "If many libraries' different subject descriptions of this book are merged in one database (a union catalog) this book would be visible from many different epistemic interests. This would be an ideal situation." (B. Hjørland, 1997, p. 95).
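The union-catalogue idea in the quotation above can be sketched as a simple merge of per-library subject descriptions. The library names are invented; the headings reuse the "War"/"Pacifism" example from section 2.

```python
from collections import defaultdict

# Each library assigns its own subject headings to the same book
# (identified here by a placeholder key "ISBN-X"; all data invented).
local_catalogues = {
    "Library A": {"ISBN-X": ["War"]},
    "Library B": {"ISBN-X": ["Pacifism"]},
    "Library C": {"ISBN-X": ["War", "Literature and war"]},
}

def merge(catalogues):
    """Merge the libraries' descriptions so each book is retrievable
    under the union of all headings assigned to it."""
    union = defaultdict(set)
    for headings_by_book in catalogues.values():
        for book, headings in headings_by_book.items():
            union[book].update(headings)
    return union

union_catalogue = merge(local_catalogues)
```

In the merged database the book is visible from several epistemic interests at once: a reader searching "War" and a reader searching "Pacifism" both reach it.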

References
Collected Papers of Charles Sanders Peirce (1931-35). Vols. I-VI, ed. by Charles Hartshorne and Paul Weiss. Cambridge: Harvard University Press.
Collected Papers of Charles Sanders Peirce (1958). Vols. VII-VIII, ed. by Arthur W. Burks. Cambridge: Harvard University Press.
Hjørland, B. (1992). The concept of "subject" in information science. Journal of Documentation, 48(2), 172-200.
Hjørland, B., & Albrechtsen, H. (1995). Toward a new horizon in Information Science: Domain-Analysis. Journal of the American Society for Information Science, 46(6), 400-425.
Hjørland, B. (1997). Information seeking and subject representation: An activity-theoretical approach to information science. Westport (Conn.), London: Greenwood Press.
Hjørland, B. (2001). Towards a theory of aboutness, subject, topicality, theme, domain, field, content … and relevance. Journal of the American Society for Information Science and Technology, 52(9), 774-778.
Hjørland, B., & Kyllesbech Nielsen, L. (2001). Subject access points in electronic retrieval. Annual Review of Information Science and Technology, 35, 249-298.
Hjørland, B. (2002). Domain analysis in information science: Eleven approaches, traditional as well as innovative. Journal of Documentation, 58(4), 422-462.
Hjørland, B. (2005). Empiricism, rationalism and positivism in library and information science. Journal of Documentation, 61(1), 130-155.
Mai, J.E. (2001). Semiotics and indexing: An analysis of the subject indexing process. Journal of Documentation, 57(5), 591-622.
Morris, C. (1938). Foundations of the theory of signs. Chicago: The University of Chicago Press.
Serrai, A. (1974). Indici, logica e linguaggio: Problemi di catalogazione semantica. Roma: CNR, Laboratorio di studi sulla ricerca e sulla documentazione.
Serrai, A. (1979). Del catalogo alfabetico per soggetti: Semantica del rapporto indicale. Roma: Bulzoni.
Soergel, D. (1985). Organizing information: Principles of data base and retrieval systems. Orlando (Florida): Academic Press.
Swift, D.F., Winn, V.A., & Bramer, D.A. (1977). A multi-modal approach to indexing and classification. International Classification, 4(2), 90-94.

Swift, D.F., Winn, V.A., & Bramer, D.A. (1978). 'Aboutness' as a strategy for retrieval in the social sciences. A paper presented at a Colloquium held by the Coordinate Indexing Group (now Informatics Group), April 1977. Aslib Proceedings, 30(5), 182-187.
Swift, D.F., Winn, V.A., & Bramer, D.A. (1979). A sociological approach to the design of information systems. Journal of the American Society for Information Science, 30(4), 215-223.
Thellefsen, T.L., & Thellefsen, M. (2004). Pragmatic semiotics and knowledge organization. Knowledge Organization, 31(3), 177-187.
Writings of Charles S. Peirce: A chronological edition (1982-2000). Bloomington: Indiana University Press.

Babajide Afolabi and Odile Thiery
Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA)
Campus Scientifique, BP 239, 54506 Vandoeuvre-lès-Nancy, France

Using Users’ Expectations to Adapt Business Intelligence Systems

Abstract: This paper takes a look at the general characteristics of business or economic intelligence systems. The role of the user within this type of system is emphasized. We propose two models that we consider important for adapting such a system to the user. The first model is based on the definition of a decisional problem and the second on the four cognitive phases of human learning. We also describe the application domain we are using to test these models in this type of system.

1. Introduction

1.1. Definitions: Business Intelligence and SIS

According to Revelli (1998), Business Intelligence (BI) “is the process of collection, processing and diffusion of information that has as an objective, the reduction of uncertainty in the making of all strategic decisions”. It is referred to as Economic Intelligence (EI) in this paper, as this is the term used in France, our country of research. Using EI also avoids limiting the process to the business world (as “Business Intelligence” suggests), since we think that every organisation or institution, be it socio-economic, political, cultural or otherwise, can adopt the process. We have also adopted this term to avoid the erroneous idea that one has to be competitive (as in “Competitive Intelligence”) in order to innovate. The real interest of our research team is the aid that the EI process can provide in the resolution of decisional problems. The processes of collection, processing and diffusion can be fully or semi-automated, and we imagine this automation being powered by an Information System (IS). In actual fact, the EI process relies on the effective use of an IS. This type of information system belongs to the class of IS referred to as Strategic Information Systems (SIS), strategic in the sense that they contain information considered strategic because it is used in the decisional processes of the organisation (and not strategic because it is used to run the day-to-day activities of the organisation). In its simplest form, a strategic information system (SIS) can be considered as an information system (IS) consisting of “strategic information and permits the automation of the organisation to better satisfy the objectives of the management” — for instance, an IS that aids in the management of stocks; we denote this as SI-S. A SIS can also be seen as “an IS that is dedicated to strategic decision making and contains only strategic type of information”.
For example, an IS that permits the decision maker to observe sales by country over a number of years, or that permits an information watcher to highlight the choices made during the analysis of the results obtained from an information search on the web. This is denoted as S-IS (Tardieu and Guthmann, 1991; David and Thiery, 2003).

1.2. Economic Intelligence Systems (EIS)

The decisions taken using an IS are based on the information found in the IS and on the user, whose objective is the appropriation of such a system for a decision-making process. To us, an Economic Intelligence System (EIS) is a system that combines the domains of strategic information systems and user modelling. The final goal of an EIS is to help the user or the decision maker in his decision-making process. Figure 1 shows the architecture of an EIS as successive processes, as proposed by the research team “SITE” (Modelling and Developing Economic Intelligence Systems) of the Lorraine Laboratory of IT Research and its Applications (LORIA), Nancy, France. One can easily identify the following four stages:

- Selection: permits the constitution of the IS of the organisation, which can be (i) the production database (that allows current usage of the organisation), (ii) all the information supports for an information retrieval system (in documentation, for example) or (iii) a SIS based on a data warehouse. This information system is constituted from heterogeneous data and heterogeneous sources with the aid of a filter.
- Mapping: permits all users access to the data in the IS. Two methods of access are offered to the user: access by exploration and access by request. Exploration is based on a system of hypertexts; requests are expressed with the aid of Boolean operators. The result of the mapping is a set of information.
- Analysis: in order to add value to the information found, techniques of analysis are applied to the results. For instance, the assistant of a head of department, whom we consider the information watcher, can present to his head of department a summary of the results obtained for the information requested.
- Interpretation: this means, in general, that the user of the system is able to make the right decisions. It does not mean that the sole user of the system is the decision maker; it can include the information watcher. One can see, then, the interest in capturing the profile of the decision maker in metadata stored in the data warehouse, which can be used to build a specific data mart for a group of decision makers or, even better, a particular user.
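The four stages above can be sketched as a simple pipeline. This is our schematic reading of the architecture, not the SITE team's actual implementation; the function names mirror the stage names, and the sample data, the keyword-matching mapping, and the user profile are all invented for illustration.

```python
def selection(sources):
    """Filter heterogeneous sources into the organisation's IS."""
    return [item for item in sources if item.get("relevant")]

def mapping(information_system, query):
    """Give the user access by request (here: naive keyword matching)."""
    return [item for item in information_system if query in item["text"]]

def analysis(results):
    """Add value to the information found, e.g. a summary count."""
    return {"hits": len(results), "items": results}

def interpretation(analysed, user_profile):
    """Relate the analysed results to the decision maker's profile."""
    return {"for": user_profile["role"], **analysed}

sources = [
    {"text": "market prices fell", "relevant": True},   # invented data
    {"text": "office party announcement", "relevant": False},
]
report = interpretation(
    analysis(mapping(selection(sources), "market")),
    {"role": "decision maker"},
)
```

Chaining the calls reflects the "successive processes" of Figure 1: each stage consumes the previous stage's output, and only the final stage consults the user profile.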

Figure 1: Architecture of an Economic Intelligence System.

Also in this process one can identify three main actors:

- Decision maker: the individual in the organization who is capable of identifying and posing a problem to be solved in terms of the stakes, risks or threats that weigh on the organization. In other words, he knows the needs of the organization, the stakes, the eventual risks and the threats to which the organization can be subjected.
- Information watcher: the person within the organization who specializes in the methods of collection and analysis of information. His objective is to obtain indicators (using information), or value-added information, on which the decision maker depends for his decision process. After receiving the problem to be solved as expressed by the decision maker, the information watcher must translate it into information attributes to be collected, which are used to calculate the indicators.
- End user: the final user of the system; it can be either of the previously outlined users or neither of the two. This user is defined depending on which layer of the Economic Intelligence system he interacts with.

Other works have shown that there may be other actors involved in this system (Knauf and David, 2004). Earlier work by Thiery and David (2002) on the personalization of responses in Information Retrieval Systems (IRS) adapted the four cognitive phases of the human learning process, i.e.:

- Observation phase: the learner gathers information about his environment by observation.
- Elementary abstraction phase: the learner describes the observed objects using words; this corresponds to a phase of acquiring the vocabulary of the system being observed.
- Reasoning and symbolization phase: the learner starts to use the acquired vocabulary, which implies a higher level of abstraction.
- Creativity phase: the learner discovers and uses knowledge that was not explicitly presented in the system.

2. Modelling the User

2.1. The role of the user
The user, who in this context can be the decision maker or, in a larger context, any of the persons enumerated above, has a central role to play in an economic information system. His ability to use the system efficiently is directly proportional to his knowledge of the system. The first thing to do, then, is to evaluate his knowledge of the system and use it to establish the importance of his role, his working habits, the most frequently used data, etc. Next, using this information, a personalised structure can be generated to improve his use of the system. A complete and robust work environment can enormously increase his efficiency. He can also bring out the critical elements of the system: its errors, faults and missing points. For a user who is a decision maker, the decision-making process begins by acknowledging a decisional problem, which can be translated into a decisional need. Resolving a decisional need consists in identifying the needs necessary for its resolution, be they informational, strategic, human, etc. We are concerned, at this time, only with informational needs. An informational need can be defined as a function of the user model, his environment and his objectives.

User modelling and adaptivity are needed to support:

- Query adaptation: the user's query may be adapted by the system to meet the user's specific needs as identified by the user model.
- Response adaptation: the response of the system is based not just on any information but on information that relates to the user's goal or purpose.
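As a rough illustrative sketch only (the function and field names, such as `frequent_terms` and `objective`, are our assumptions and not part of the system described in the paper), these two adaptations might look like:

```python
# Hypothetical sketch of query and response adaptation driven by a user model.
# The user model is represented as a plain dictionary for illustration.

def adapt_query(query_terms, user_model):
    """Expand the raw query with terms the model says the user favours."""
    expanded = list(query_terms)
    for term in user_model.get("frequent_terms", []):
        if term not in expanded:
            expanded.append(term)
    return expanded

def adapt_response(results, user_model):
    """Order results so those matching the user's objective come first."""
    objective = user_model.get("objective", "")
    return sorted(results, key=lambda r: objective not in r)

model = {"frequent_terms": ["publications"], "objective": "research team"}
print(adapt_query(["evolution"], model))  # ['evolution', 'publications']
```

The point of the sketch is only the division of labour: the model enriches the query before retrieval and re-ranks results after it.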

2.2. Information need
The information need of a user is a concept whose definition varies according to different researchers and different users (Campbell and van Rijsbergen, 1996; Devadason and Pratap Lingam, 1996; Xie, 2000). We believe that the information need of a user is an informational representation of his decisional problem (Goria and Geffroy, 2004; Mizzaro, 1998). Defining a decisional problem implies a certain level of knowledge about the user and his environment. Therefore, a decisional problem is a function of the user model, his environment and his objective. We base our definition on that of Bouaka and David (2003), where a decisional problem was defined as

Pdecisional = f(Stake, Individual Characteristics, Environmental Parameters)

Stake (goal) is what the organization stands to lose or gain; it is made up of Object, Signal and Hypothesis. Individual Characteristics refer to the user, his behaviours and his preferences, including his Cognitive Style, his Personality Traits and his Identity. Environmental Parameters mean the input of society on the organisation, which can be Immediate or Global.

2.3. The user's expectations
In the Economic Intelligence System, the process starts from the identification of the decisional problem. In order to resolve this problem, it is translated into an informational need. The definition of the informational need depends most of the time on the person or user involved: his experiences, his functions, his environment, etc., as defined above. The user then forms his requests based on this informational need. The requests formed by the user are usually based on his expectations of the system and his belief that the system will respond to some of these expectations. We define the user's expectations based on his model as described above and on the definition of his information need, which is also dependent on his model. These expectations are contained in the potential knowledge field ("potential" because this knowledge may not actually be in use, but it exists and is usable) of the organisation (figure 2).

Figure 2: The use of the user's expectations in an IS

The information system is fed from an information world that is the global sum of all the credible sources available (within the organisation or external to it). The decisional problem can then be solved based on this available information. The resolution of one decisional problem can lead to the identification of another, which means the system is in a continuous cycle. These expectations define some of the user's actions, and these actions form the basis of his interactions with the system. Therefore, we added these expectations, in the form of a variable called actions (activities), to the definition of the decisional problem cited earlier. In order to complete this model for resolving decisional problems, we added the means implicated in achieving a resolution. Means in this case refer to the ways, methods and materials used.

Therefore the model for resolving decisional problems is based on:

MPdecisional = f(Stake, Individual Characteristics, Environmental Parameters, Actions, Means)
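Purely as an illustration (the paper gives only the functional form; the class and field layout below is our own assumption), the extended model might be encoded as:

```python
from dataclasses import dataclass, field

# Hypothetical encoding of
#   MPdecisional = f(Stake, Individual Characteristics,
#                    Environmental Parameters, Actions, Means)
# Field names follow the paper's vocabulary; the structure is assumed.

@dataclass
class Stake:
    object: str       # what is at stake
    signal: str       # the signal that revealed it
    hypothesis: str   # the working hypothesis

@dataclass
class DecisionalProblem:
    stake: Stake
    individual_characteristics: dict  # cognitive style, personality traits, identity
    environmental_parameters: dict    # immediate or global inputs of society
    actions: list = field(default_factory=list)  # expectation-driven activities
    means: list = field(default_factory=list)    # ways, methods and materials used

p = DecisionalProblem(
    Stake("market share", "declining sales", "new competitor"),
    {"cognitive_style": "analytic"},
    {"scope": "immediate"},
)
```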

2.4. The user model
Earlier models used in the information system base of BIS were not complete: since each user reacts differently according to his needs and his working habits, the possibilities of his evolution while using the system were not included. Also, a user/decision maker may have a need that is specific to him (in terms of his personality traits, cognitive style, preferences, etc., as noted by Bouaka and David, 2003) that may not have been treated in the base. Our primary concern is to complete a user model that will help the user in his evolutionary use of the system. This includes the system's ability to respond to his (informational) needs and to allow him to progress from a user learning the system to an expert user of it, without making him feel that he is going through the same processes over and over again.

The objective of the user model is to be able to personalize the responses of the system. User modelling is the way a user and his behaviour are represented. We transformed the four cognitive phases of human learning mentioned earlier into a user model in an IRS context, in order to give the system a learning aspect. The first two phases were compressed into exploration, which gives the model:

M = {Identity, Objective, {Activity}, {Sub-sessions}}

Where:

Activity = {Activity-type, Classification, Evaluation}
Activity-type = {Exploration, Request, Synthesis}
Classification = {Attributes, Constraints}
Evaluation = {System's solution, Degree of Pertinence}

- Identity: the identity of the user. This allows individualisation of the history of the user's sessions.
- Objective: the principal objective, or the real need, of the user for the session.
- Activity (or actions, in our model for resolving decisional problems): a user activity that leads to the resolution of his information need. A session is composed of many activities, and each activity is defined by three parameters: activity-type, classification and evaluation.
- Activity-type: the types of activity correspond to the different phases of evocative user habits, in this case exploration, request and synthesis.
- Classification: the approach we use to access stored information. The classification technique permits the user to express his information requirement in terms of the evocative phases that we are implementing. The user can specify the attributes of the documents to classify and the constraints to be met by these documents.
- Evaluation: the user can evaluate the pertinence of all the solutions proposed by the system. This evaluation relies on the degree of pertinence and the reasons for this judgement.
- Sub-sessions: a sub-session is represented exactly like a main session. The only difference is that the objective of the sub-session is associated with the objective of the main session, and a sub-session does not constitute a session on its own.
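A minimal sketch of this session model (the concrete types are our assumption; the paper gives only the set notation):

```python
from dataclasses import dataclass, field
from typing import List

# Sketch of M = {Identity, Objective, {Activity}, {Sub-sessions}}.

@dataclass
class Activity:
    activity_type: str    # "exploration" | "request" | "synthesis"
    classification: dict  # attributes and constraints
    evaluation: dict      # system's solution and degree of pertinence

@dataclass
class Session:
    identity: str
    objective: str
    activities: List[Activity] = field(default_factory=list)
    sub_sessions: List["Session"] = field(default_factory=list)

s = Session("user-42", "follow the team's publications")
s.activities.append(Activity(
    "request",
    {"attributes": ["research team"], "constraints": ["type = article"]},
    {"solution": [], "pertinence": None},
))
```

Note that a sub-session reuses the `Session` type itself, mirroring the paper's remark that a sub-session is represented exactly like a main session.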

This user model permits the proposition of an information system architecture that relies on a cognitive user evolution. The user can: explore the information base to discover its contents; formulate his requests; add annotations; and link his information retrieval activities to a definite predetermined objective. The information on the user is updated with each use. In simpler words the user evolves with the system.

2.5. Collecting the user's information
The user's information used in this IS is collected using the models described. It begins with the user explicitly entering the basic information concerning him (his profile). First, we associate his requests or information need(s) with the first model, i.e. MPdecisional. This is used to improve his previously given profile. Our hypothesis is that his use of the system improves with experience. His activities, called interactions, are stored as an experience base using the second model. This base is used to improve the model of the user, and it helps follow the user through his use of the system. We use the experience of older users at a given level of learning to guide new users at that same level, or to kick-start a new user exhibiting the same similarities. For instance, a user who could not give attributes and the values associated with them is considered new to the system and will need to go through a phase of observation to discover the attributes and the corresponding values.
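As a toy illustration of this guidance step (the level names and the experience-base layout are invented for the example, not taken from the paper):

```python
# A new user who cannot yet supply attributes/values is placed at the
# "observation" level; experiences of older users at the same level are reused.

def learning_level(user):
    return "observation" if not user.get("known_attributes") else "request"

def similar_experiences(new_user, experience_base):
    level = learning_level(new_user)
    return [e for e in experience_base if e["level"] == level]

base = [
    {"user": "a", "level": "observation", "hint": "browse the attribute list"},
    {"user": "b", "level": "request", "hint": "combine attribute constraints"},
]
newcomer = {"known_attributes": []}
print(similar_experiences(newcomer, base)[0]["hint"])  # browse the attribute list
```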

3. Application domain
We are testing these methods, in the first instance, by applying the framework to information retrieval, using a base of documents published by researchers in a research centre. This base contains the publications of members of the research centre, historicized and grouped according to the habitual bibliographic nomenclature. We had worked on the classification, normalization and improvement of this electronic document resource, and our objective is to constitute a real data warehouse of documents from which we could create all types of information analysis. In particular, we propose producing different data marts for the different groups of users of the system, before going on to personalize the system for individuals. This presupposes that each of these groups of users has a different view of the data in the data warehouse, and that we would want to propose to each user the data that essentially responds to his needs. Thus, while testing the earlier version, one user wanted to follow the evolution of the publications of each research team. However, we found that the attribute that could have helped in calculating this evolution was missing from the base. Another user wanted to know how well represented African researchers are in terms of publications within this centre. The system could not provide the answer, as the nationalities of authors were not considered during the construction of the base. During the period of adaptation, the information resources of the system are re-engineered to contain the important attributes and/or values whose absence limited the responses obtained by the users.

4. Conclusion
If the user's expectations had been considered from the beginning, some of these questions would have been answered in the base. On verifying the identity of the user who wanted to follow the evolution of the publications of each research team, we discovered that he is the director of the centre. If this user's information need had been properly defined, this information would have been taken care of. The model for resolving decisional problems can be used along with the user model in an IRS context, as defined above, to further extract a great deal of information about the user, his behaviours, and why he behaves the way he does when in direct contact with the system. The next phase of our research is to construct metadata from these two models, which will serve as a tool in the re-conceptualisation of the actual base.

References
Bouaka, N. and David, A. (2005). Modèle pour l'Explicitation d'un Problème Décisionnel : Un outil d'aide à la décision dans un contexte d'intelligence économique. In David, A. (Ed.), Organisation des connaissances dans les systèmes d'informations orientés utilisation. Nancy: Presses Universitaires de Nancy.
Campbell, I. and van Rijsbergen, K. (1996). The Ostensive Model of developing information needs. In Proceedings of the Second International Conference on Conceptions of Library and Information Science. Copenhagen.

David, A. and Thiery, O. (2003). L'Architecture EQuA2te et son Application à l'intelligence économique. In David, A. (Ed.), Proceedings of the Conference on Intelligence Economique: Recherches et Applications. Nancy: INRIA Lorraine.
Devadason, F.J. and Pratap Lingam, P. (1996). A methodology for the Identification of Information Needs of Users. In Proceedings of the 62nd IFLA General Conference. Seoul.
Goria, S. and Geffroy, P. (2004). Le modèle MIRABEL : un guide pour aider à questionner les Problématiques de Recherche d'Informations. Veille Stratégique Scientifique et Technologique. Toulouse: UPS-IRIT.
Haynes, S. R. (2001). Explanation in Information Systems: A Design Rationale Approach. PhD thesis, The London School of Economics.
Inmon, W. H. (1995). What is a Data Warehouse? Prism Tech Topic, Vol. 1, No. 1.
Knauf, A. and David, A. (2004). The role of the infomediary in an economic intelligence process. In The 8th World Multi-Conference on Systemics, Cybernetics and Informatics. Orlando, USA.
Mizzaro, S. (1998). How many relevances in information retrieval? Interacting With Computers, n° 103, 303-320.
Revelli, C. (1998). Intelligence stratégique sur Internet. Paris: Dunod.
Saracevic, T. (1996). Modeling interaction in information retrieval (IR): A review and proposal. In Proceedings of the American Society for Information Science, volume 33, 3-9.
Tardieu, H. and Guthmann, B. (1991). Le triangle stratégique. Paris: Les Editions d'Organisation.
Thiery, O. and David, A. (2002). Modélisation de l'Utilisateur, Système d'Informations stratégiques et Intelligence Economique. Revue Association pour le Développement du Logiciel (ADELI), n° 47.
Thiery, O., Ducreau, A., Bouaka, N. and David, A. (2004). Piloter une organisation : de l'information stratégique à la modélisation de l'utilisateur ; application au domaine de la GRH. In Congrès Métamorphoses des organisations. Nancy.
Xie, H. (2002). Patterns between interactive intentions and information-seeking strategies. Information Processing and Management, Volume 38, 55-77.

Aaron Loehrlein, Elin K. Jacob, Seungmin Lee, and Kiduk Yang
Classification-based Search and Knowledge Discovery, Indiana University Bloomington, USA

Development of Heuristics in a Hybrid Approach to Faceted Classification

Abstract: This paper describes work in progress to identify automated methods to complement and streamline the intellectual process in the generation of faceted schemes. It reports on the development of the word pair heuristic, the suffix heuristic, and the WordNet heuristic, and how the three heuristics integrate to produce an initial organization of terms from which a classificationist can more efficiently construct a faceted vocabulary.

1. Introduction
The creation of a faceted scheme involves identifying concepts that are relevant to the domain and systematically analyzing them into aspects that can be combined to form a wide range of classes, many of which cannot be specifically anticipated by the classificationist (Foskett, 2000). The faceted approach is frequently much more labor intensive than the traditional approach to classification, which typically identifies a few broad concepts and nests within them a series of increasingly specific sub-concepts. Although all classification requires an in-depth examination of the relevant literature, facet analysis requires particular care in order to create a structure that can support the addition of new concepts while only rarely requiring the addition of new facets (ibid.). For these reasons, automation of the development of faceted schemes has not seemed feasible. However, we theorized that certain automated approaches to facet generation could be used to integrate the processing capabilities of the machine with the analytical and evaluative capabilities of the human. This hybrid approach to facet generation would begin with identification of the heuristics, or basic sorting strategies, used by humans in the grouping process. Analysis of these heuristics would then indicate which strategies could be handled automatically by the machine to generate a set of candidate facets and values.

2. Analyzing the faceted vocabulary construction process
To assess the viability of an integrated, hybrid approach, we decided to begin the process of constructing the faceted vocabulary by identifying a lexicon of concepts from an existing representational system currently used to index a collection of Web documents. The representational system selected for this project was EPA Topics, available at , an indexing scheme used by the United States Environmental Protection Agency (EPA) to provide access to a collection of high-quality resources dealing with a range of environmental issues. EPA Topics is not a true classification scheme, in that not all categories are mutually exclusive and concepts are occasionally nested within multiple branches of the hierarchical structure. Also, many of the hierarchical relationships are not strictly generic, but instead place the superordinate concept into a more specific context. However, this representational system does provide a set of nested categories, with each category represented by a chain of descriptors indicating its relationship within the overall hierarchical structure. Extracting terms from EPA Topics resulted in a lexicon base of 723 terms.

A hybrid classificatory process would combine the strengths of both manual and automatic approaches to the construction of classification systems. It analyzes the steps undertaken by a human classificationist and determines which of those steps, if any, could be automated or could benefit from an automated process. We found that the most effective automated processes provided an initial, approximate grouping of the terms in EPA Topics. Some automatic processes were also able to provide expressive labels for the groups and, in some cases, relatively meaningful relationships between the groups. Because this approach seeks to provide a "first draft" of a classification system, it is not necessary for the automatic classificatory processes to correctly place every term.

We found that we could not manually develop a faceted scheme of concepts relevant to the domain of the EPA without first grouping terms into general categories. Once we created those categories, we then proceeded to refine them into a precise faceted vocabulary. Therefore, automatic processes could usefully group together terms like Compliance, Mediation, and Indemnity; human classificationists could then use their knowledge of the domain to determine the precise relationships between those concepts. An automatic process that organized terms into clusters would not have to be completely accurate. It could also violate certain rules of classification; for example, it could put a term into more than one category.

This paper reports on the results of the application of three automatic processes to the classification of terms in EPA Topics, referred to as the word pair heuristic, the suffix heuristic, and the WordNet heuristic. It also discusses aspects of the classification process for which we found no useful heuristics for augmenting the manual process.

3. The word pair heuristic
This heuristic identified all instances of two words occurring in the same subject heading within EPA Topics. The heuristic particularly emphasized terms that appeared only in conjunction with another term, or two terms that only appeared together. Of the 723 terms in the lexicon base, 168 occurred in a subject heading only in conjunction with another specified term. These terms were automatically analyzed into a total of seventy-three phrases. This approach identified useful phrases such as Coral Reefs or Global Warming, which in the context of EPA Topics are made up of terms that are generally only meaningful as part of the phrase. It was also useful in identifying potential hierarchical relationships between concepts.

We found that the usefulness of a co-occurrence between two terms depended partially on the number of times the two terms co-occur, compared to the total number of occurrences of each term. If the two terms almost always appeared together, then they might be part of the same phrase. For example, the terms Acid and Rain each occurred only once in EPA Topics, and they occurred together. Therefore, Acid and Rain might be used in the domain of the EPA primarily as part of the phrase Acid Rain. There were also cases of word pairs where the first term appeared only in conjunction with the second term, but the second term frequently appeared without the first. In that case, the first term might be a useful subordinate class of the second term. For example, Acid and Effects occurred together only once, but Effects occurred a total of twenty times in EPA Topics. Therefore, Acid Rain might be considered a subordinate of Effects (i.e., effects of pollution) along with, e.g., Brownfields, Global Warming, and Health Problems. However, if the second term occurred very frequently and usually independently of the first term, then it might not be useful to associate the two terms. For example, Acid and Air occurred together once, but Air occurred a total of 225 times. Therefore, Acid (or Acid Rain) might not be a very useful concept by which to organize concepts related to Air.

In many cases, the phrases in which the terms occur are not the only valid use of the terms. For example, the phrase Prior Informed Consent might be the only context in the EPA for which each constituent term needs to be considered. However, the same is not necessarily true for Confidential Business Information, in which each of the constituent terms could easily be combined with other terms to form other valid classes. Ultimately, many phrases were broken up and their constituent terms were analyzed separately. Like the other heuristics, the word pair heuristic is useful for creating initial groupings of terms, which are then manually revised and validated by a human classificationist. We did not use this heuristic to automatically classify terms appearing in the text of documents. When analyzing free text, it seems simplest to count as "pairs" only terms that appear next to each other, or that are separated by a preposition (of, in, etc.). This approach would miss long phrases such as National Environmental Performance Track, but it would also filter out words that frequently appear near each other but are not part of a phrase, such as "conditions that affect".
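The counting logic behind the heuristic can be sketched as follows (the headings are toy data for illustration, not the actual EPA Topics lexicon):

```python
from collections import Counter
from itertools import combinations

def word_pairs(headings):
    """Count term occurrences and co-occurrences within subject headings."""
    term_counts, pair_counts = Counter(), Counter()
    for heading in headings:
        terms = set(heading.split())
        term_counts.update(terms)
        pair_counts.update(combinations(sorted(terms), 2))
    return term_counts, pair_counts

headings = ["Acid Rain", "Air Quality", "Air Pollution", "Indoor Air"]
terms, pairs = word_pairs(headings)

# Acid and Rain always co-occur -> strong phrase candidate "Acid Rain".
# Air occurs three times, each time with a different partner -> weak candidates.
print(pairs[("Acid", "Rain")], terms["Air"])  # 1 3
```

Comparing the pair count against each term's total count reproduces the paper's distinction between phrase candidates, subordinate-class candidates, and uninformative pairs.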

4. The suffix heuristic
This heuristic automatically groups terms that share a common suffix or other ending string, and then organizes those groups according to the meaning of that suffix. This approach differs from previous research into suffixes, such as methods for stemming suffixes in order to conflate terms (Harman, 1991; Savoy, 1993), or using suffixes to identify a term's position within a phrase (Okada, Ando, Lee, Hayashi, and Aoe, 2001). Most of the ending strings we used are morphemes, the smallest meaningful units of a word (Bybee, 1988). Morphemes include official English-language suffixes such as -ure, which refers to an office or function (e.g., Judicature), and ending strings that are words in their own right, such as -field (e.g., Brownfield) and -flow (e.g., Overflow). There is substantial research into the role of morphemes in cognitive organization (ibid.). However, to our knowledge, only Loehrlein, Jacob, Yang, Lee, and Yu (2005) have used morphemes to create a classification system from existing lexicon bases. We also made use of certain non-morphemic ending strings. For example, -ena is not (to our knowledge) meaningful, and therefore non-morphemic; however, it was useful in identifying proper nouns such as Wadena. We created a list of suffixes and other ending strings by which terms could be usefully grouped. We then assigned meanings to these ending strings and organized them by meaning into a shallow (three-level) hierarchy. For example, -ian was subsumed under the class "People" (e.g., Dialectician). The ending string -graph was subsumed under the class "Records and instruments related to records" (e.g., Phonograph), which was in turn subsumed under the class "Things". Both "Things" and "People" were subsumed under the general class "Entities". Therefore, terms ending in -ian and terms ending in -graph shared Entities as an indirect superordinate. For more details, see ibid.
We found that the suffix heuristic provided an initial classification of approximately half of the terms in each lexicon base. It used sixty-eight suffixes and other ending strings to organize 49.79% of the 723 terms into forty-five groups. However, many groups were very large and not particularly meaningful. For example, the heuristic placed ninety terms into a group that was simply labeled "actions, processes", which was too large and ill-defined a group to be of much use to classificationists. Only 187 (25.86%) of the terms in the lexicon base were placed into groups that were deemed small enough and specific enough to be useful. We found that the suffix heuristic was particularly useful for identifying entities and characteristics. However, it did a poor job identifying actions and events. Overall, it appears that suffixes alone are not sufficient when seeking to organize a large number of terms by their meanings.
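The mechanism can be sketched with a toy suffix table (a stand-in for the sixty-eight ending strings; the real table and class hierarchy are in Loehrlein et al., 2005):

```python
# Toy suffix table; labels follow the paper's examples. Each ending string is
# mapped to a class, and a term joins the class of the first suffix it matches.

SUFFIX_CLASSES = {
    "ian": "People",                                        # e.g., Dialectician
    "graph": "Records and instruments related to records",  # e.g., Phonograph
    "tion": "Actions, processes",                           # large, ill-defined group
}

def group_by_suffix(terms):
    groups = {}
    for term in terms:
        for suffix, label in SUFFIX_CLASSES.items():
            if term.lower().endswith(suffix):
                groups.setdefault(label, []).append(term)
                break
    return groups

groups = group_by_suffix(["Dialectician", "Phonograph", "Mediation", "Air"])
print(groups["People"])  # ['Dialectician']
```

Terms matching no suffix (here, Air) stay unclassified, which matches the roughly fifty-percent coverage the heuristic achieved.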

5. The WordNet heuristic
The final heuristic that we identified makes use of WordNet 2.1, available at . WordNet is a database of English words that are grouped and organized by the concepts that they represent. We used WordNet's semantic categories and relationships to cluster terms in the lexicon base and to assign labels to those clusters. The heuristic identified groups of terms that were assigned to a fairly specific semantic class: it formed groups in which each term shared at least four hierarchical levels in WordNet with three or more other terms. For example, the terms Audit, Review, Inspection, and Screening share five hierarchical levels in WordNet (see Table 1). It should be noted that different interpretations of each of these terms also appear elsewhere in the WordNet hierarchy. For example, WordNet classifies Review as, among other things, an "examination, scrutiny", a "written communication", a "variety show", and a "periodical". The heuristic grouped Review according to the first interpretation, which would associate Review with other terms in the lexicon base. We hypothesize that this approach is likely to provide the most appropriate interpretation of a term, since it is responsive to trends in how terms are used in the domain.

act, human action, human activity
  activity
    work
      investigation, investigating
        examination, scrutiny
          audit
          follow-up, followup, reexamination, review
          inspection, review
          testing
            screening

Table 1. Classification in WordNet, where "audit", "review", "inspection", and "screening" each share at least four hierarchical levels.

It should be noted that the WordNet heuristic was fully developed only after the manual classification was complete. We validated the groups formed by the heuristic according to their similarity to the groups that the classificationists had manually created. Because the WordNet heuristic is intended to provide initial semantic clusters of terms, which the classificationist then revises, the groups were compared to those that the classificationists constructed early in the process, as opposed to the facets in the final version. The output of the heuristic was 270 terms (37.34% of the lexicon base) in fifty-nine groups. Because many terms appeared in more than one group, there was an average of 8.17 terms per group. Twenty-two of these groups were deemed to be useful. They consisted of 202 terms (27.94% of the lexicon base) with an average size of 9.32 terms per group. Of these twenty-two groups, eleven (50%) were classified in WordNet as "act, human action, human activity" or "event". The remaining groups were distributed between "abstraction", "entity", "group, grouping", "possession", and "psychological feature". WordNet was also useful in the identification and grouping of proper nouns, including locations such as Russia and London.
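The grouping rule can be sketched over a hand-coded toy fragment of hypernym paths (in practice these would be retrieved from WordNet itself; the paths below are illustrative):

```python
# Toy WordNet-style hypernym paths, root first; hand-coded for illustration.

PATHS = {
    "audit":      ["act", "activity", "work", "investigation", "examination", "audit"],
    "review":     ["act", "activity", "work", "investigation", "examination", "review"],
    "inspection": ["act", "activity", "work", "investigation", "examination", "inspection"],
    "screening":  ["act", "activity", "work", "investigation", "examination", "testing", "screening"],
    "smog":       ["entity", "substance", "fluid", "gas", "smog"],
}

def group_by_shared_levels(paths, depth=4, min_size=4):
    """Group terms whose hypernym paths share their first `depth` levels; keep
    groups of at least `min_size` terms, i.e. each term shares the levels with
    three or more other terms, as in the paper's rule."""
    groups = {}
    for term, path in paths.items():
        groups.setdefault(tuple(path[:depth]), []).append(term)
    return {k: v for k, v in groups.items() if len(v) >= min_size}

groups = group_by_shared_levels(PATHS)
print(groups[("act", "activity", "work", "investigation")])
```

The shared path prefix doubles as the group's label, which is how the heuristic produced headings such as "act, human action, human activity".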

6. Integration of the heuristics
Although each heuristic on its own was not adequate for creating an initial, approximate classification system, using all three heuristics together produced more satisfactory results. Of the 723 terms in the lexicon base, 59.92% were grouped by at least one heuristic. Of the terms that were grouped, 73.47% were grouped by only one heuristic. We found that the three heuristics each grouped a roughly equal number of terms (see Figure 1). We also found that the overlap between heuristics was not extreme. Overall, each heuristic seemed to make a notable contribution to the classification process.

[Venn diagram showing the number of terms organized by each heuristic and their overlaps: WordNet (202 terms), Suffix (187), Word Pair (168), Not Organized (297).]

Figure 1. The number of terms in the lexicon base that each heuristic organized.
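The bookkeeping behind these coverage figures can be sketched as follows (the term sets are toy data, not the 723-term EPA lexicon):

```python
def coverage(lexicon, *heuristic_outputs):
    """Fraction of the lexicon grouped by any heuristic, and the fraction of
    grouped terms organized by exactly one heuristic."""
    grouped = set().union(*heuristic_outputs)
    by_one = [t for t in grouped if sum(t in h for h in heuristic_outputs) == 1]
    return len(grouped) / len(lexicon), len(by_one) / len(grouped)

lexicon = {"Audit", "Acid", "Rain", "Brownfield", "Smog", "Compliance"}
wordnet_terms = {"Audit", "Compliance"}
suffix_terms = {"Brownfield", "Compliance"}
word_pair_terms = {"Acid", "Rain"}

frac_grouped, frac_single = coverage(lexicon, wordnet_terms, suffix_terms, word_pair_terms)
# 5 of 6 terms grouped; 4 of the 5 grouped terms by exactly one heuristic.
```

Applied to the real outputs, the same two ratios give the paper's 59.92% and 73.47% figures.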

In addition to grouping terms, both the WordNet heuristic and the suffix heuristic organized the terms into categories. Table 2 lists the primary categories used. The WordNet heuristic organized most of the terms related to Actions, Events or Abstractions. The suffix heuristic organized most of the terms related to States, Qualities, Conditions, Possessions, and Characteristics. Of the terms classified as Entities, both heuristics contributed roughly equally, and with little overlap. The only category for which there was a great deal of redundant effort was Chemicals. Even then, each heuristic found many terms that the other did not. In addition to identifying words that the other has missed, the two heuristics appear to have strengths that naturally complement each other. 260

Class                                        Total terms   Both heuristics   Suffix only   WordNet only
Abstractions                                      62              0               8             54
Actions, events                                  128             19              44             65
Characteristics                                   26              0              26              0
Chemicals                                         46             13              12             21
Entities (except chemicals)                       77              3              40             34
Psychological features                             9              0               0              9
States, qualities, conditions, possessions        34              4              26              4

Table 2. Top-level classes of terms, broken down by the heuristic that was responsible for organizing them.

7. Manual analysis (what the heuristics did not do)
These heuristics provided only initial categories of terms. In order to determine the most appropriate classes and relationships, the classificationist must research the relevant documents and interpret the usage of the key concepts. If the classificationist has access to a lexicon base of key terms, they may use the terms to retrieve relevant documents from an online database. However, interpretation of the documents is an intellectual process that cannot be automated. In many cases, terms are used within the domain in ways that would be considered inappropriate outside of it. For example, in the domain of environmental protection, the term Pretreatment refers to treatment performed on wastewater prior to its discharge to a Publicly-Owned Treatment Works (NPDES, 2003). In contrast, Merriam-Webster Unabridged provides one definition of Pretreatment: "occurring in or typical of the period prior to treatment". There are also cases where terms are used differently even within the same domain. For example, the use of the term Non-Point differs even within the context of environmental protection. Some resources in the EPA refer to non-point sources of pollution as having no identifiable sources, e.g., the runoff of snow or rainwater that picks up and spreads pollutants on or under the ground (NPS Q&A, 2006). Other resources refer to non-point sources of pollution as those that are identifiable, but too numerous and mobile to easily monitor, such as automobiles that produce emissions (Kral, 2006). A great deal of the manual process was also spent in determining the extent to which a set of classes was mutually exclusive. For example, Household and Urban are generally considered not mutually exclusive, since many households are located in urban settings. However, the waste generated by households (e.g., compost) might be mutually exclusive with the substances that make up urban pollution (e.g., smog).
Therefore, Household and Urban were treated as mutually exclusive classes in the context of environmental protection. It seems much more appropriate to assign this kind of analysis to humans than to automated processes.
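The automatable first step that this manual analysis builds on can be illustrated with a toy suffix-grouping heuristic. This is only a sketch: the suffix-to-category mapping and the sample terms below are invented for illustration, not the rules or data actually used in the project.

```python
# Illustrative sketch of a suffix-based grouping heuristic of the kind
# described in Loehrlein et al. (2005). The mapping and terms are hypothetical.
SUFFIX_CATEGORIES = {
    "tion": "processes",   # e.g. filtration, remediation
    "ment": "processes",   # e.g. treatment, containment
    "er":   "agents",      # e.g. polluter, sampler
    "ant":  "substances",  # e.g. pollutant, contaminant
}

def group_by_suffix(terms):
    """Assign each term to a provisional category based on its suffix;
    unmatched terms are left for manual analysis."""
    groups = {"unassigned": []}
    for term in terms:
        for suffix, category in SUFFIX_CATEGORIES.items():
            if term.lower().endswith(suffix):
                groups.setdefault(category, []).append(term)
                break
        else:
            groups["unassigned"].append(term)
    return groups

groups = group_by_suffix(["pretreatment", "pollutant", "runoff", "filtration"])
```

The point of the sketch is the division of labour the paper describes: the machine proposes consistent initial groupings at speed, while terms like "runoff" that match no rule, and domain-specific senses like Pretreatment, fall to the human classificationist.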

8. Conclusion
In this paper, we have described three heuristics that can be integrated with a manual, intellectual analysis to develop a faceted vocabulary. We found that a hybrid, semi-automated approach to faceted scheme creation can effectively combine the intelligence, context awareness and evaluative judgment of the human with the speed of processing, unlimited memory and consistency in repetition of the machine.

References
Bybee, J.L. (1988). Morphology as lexical organization. In Theoretical morphology: Approaches in modern linguistics (pp. 119-142). San Diego: Academic Press.
Foskett, A.C. (2000). The future of faceted classification. In R. Marcella and A. Maltby (Eds.), The future of classification (pp. 69-80). Aldershot: Gower.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42(1), 7-15.
Kral, L. (2006). General Air Quality Information. Retrieved March 18, 2006 from http://yosemite.epa.gov/r10/airpage.nsf/webpage/general+air+quality+information
Loehrlein, A., Jacob, E.K., Yang, K., Lee, S., and Yu, N. (2005). A Hybrid Approach to Faceted Classification Based on Analysis of Descriptor Suffixes. In Proceedings of the 68th ASIS&T Annual Meeting, Vol. 42 (Charlotte, North Carolina, USA, November 1, 2005). American Society for Information Science and Technology.
National Pollutant Discharge Elimination System (NPDES) - FAQ. (2003). Retrieved March 18, 2006 from http://cfpub1.epa.gov/npdes/faqs.cfm?program_id=3#65
Okada, M., Ando, K., Lee, S.S., Hayashi, Y., and Aoe, J. (2001). An efficient substring search method by using delayed keyword extraction. Information Processing & Management, 37, 741-761.
Savoy, J. (1993). Stemming of French words based on grammatical categories. Journal of the American Society for Information Science, 44(1), 1-9.
What is Nonpoint Source (NPS) Pollution? Questions and Answers. (2006). Retrieved March 18, 2006 from http://www.epa.gov/owow/nps/qa.html

Michèle Hudon and Sabine Mas
Université de Montréal, Montréal, Canada

Structure, logic, and semantics for Web-based collections in education

Abstract: We present results of a project focusing on six Web-based collections in education. Our analysis of home-grown classification structures considers three dimensions. "Structure" is described through quantitative data (e.g. number of main categories, number of hierarchical levels). "Logic" concentrates on two aspects of the subdividing process: the principle of division and the type of hierarchical relation. "Semantics" relates to concepts and their representation in the form of terms. In our sample, the classification structures are hierarchical, not overly complex, and not very specific. The choice, arrangement and sequence of classes are logical. Conceptual and terminological inconsistencies are due to significant gaps in conceptual coverage and a lack of terminological control.

1. Background
Little is known about the specific information needs and behaviour of academics and researchers in education, but we assume that this population's needs and behaviour are similar to those of their peers in other fields. Electronic information use is substantial and growing in the academic and research community. There is evidence, however, that the Web is not used as much as it could be by researchers in education (Auricombe, 2001). On the Web, it remains difficult to uncover sources that have education as their main subject, and that are concerned with theory, philosophy and research rather than with the daily activities and tools of teaching and training. Home-grown classification structures provide access to dozens of virtual collections in education. These structures exhibit weaknesses associated with this category of classification schemes (such as a lack of concern for standardization and a disregard for known theoretical principles of knowledge organization), but they also exhibit several characteristics deemed essential in good and efficient classification structures. Common to lists of desirable qualities are the following: simplicity, logic, flexibility, hospitality, authority, and specificity (Maniez, 1987; Iyer, 1995; Molholt, 1995; Mai, 2004; Van der Walt, 2004). In a previous paper (Hudon, 2003), we proposed general observations on home-grown structures used to organize education-related sources in general Web directories and in special Web-based libraries, with an emphasis on macro-levels of subdivision. In the framework of a project funded by the Fonds québécois de la recherche sur la société et la culture1, we have refined our analysis with the objectives of establishing patterns and estimating appropriateness and efficiency.
The results of this analysis will serve as a basis for the development of an authoritative, transparent and efficient classification structure for Web resources potentially useful to specialists and researchers in the field of Education.

2. Methodology
The objectives of the project on which we report here were to generate data that would help us characterize more precisely home-grown classification structures applied to Web-based libraries in the field of education, while testing a descriptive and evaluative model conceived for application in a slightly different context. A straightforward methodology, a sample of existing virtual collections, and various types of analyses allowed us to reach these objectives.

We used a model developed by Sabine Mas as part of her doctoral research on the classification of electronic records residing on personal workstations in large organizations. The model combines findings of theoretical research in classification, of theoretical and applied research in the field of personal information management (Boardman and Sasse, 2004), and of previous analyses of Web directory structures (Van der Walt, 1998; Zins, 2002; Hudon, 2003). The analysis is multidimensional: structure is described through quantitative data, logic concentrates on various aspects of the subdividing process, and semantics relates to concepts and their representation in the form of terms.

2.1 Sample
A sample of five Web-based education libraries was established, to which was added the education class of a virtual library covering all Social sciences. Only those libraries that were still being maintained and whose collections were classified could become part of the sample. The following virtual collections were retained: Educator's Reference Desk (ERD) (www.eduref.org); EdNA (www.edna.edu.au/edna/); SOSIG Education (www.sosig.ac.uk/education/); GEM - The Gateway to Educational Materials (thegateway.org); Education Index (www.educationindex.com/); Education Virtual Library (www.csu.edu.au/education/library.html). ERD and EdNA offer a strong and deep hierarchical structure, SOSIG has a shallow hierarchy, GEM starts as a faceted structure but presents a shallow hierarchy at its lower level, and Education Index and Education Virtual Library are faceted structures. The libraries were all visited between January and March 2005. Each library was visited more than once, but its classification structure was imported for analysis at a single moment in time.

2.2 Analysis
Each classification structure was collapsed and transposed into a standard Windows file directory format. This transfer was necessary for the structural analysis, part of which would be achieved automatically by the PDS (Personal Document Space) software (Gonçalves and Jorge, 2003). Excel-based tables were produced during phases 2 and 3 of the analysis, which involved comparison and interpretation of qualitative data. Coding was done by two doctoral students working independently from one another. All analyses were conducted on the basis of classification structures only, but occasional forays into the actual collections proved essential for disambiguation purposes, to determine the exact meaning and extension of a class, for example.
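To illustrate the kind of structural measures involved (in the project they were computed by the PDS software over the transposed directory trees), a classification rendered as a nested tree can be summarized as follows. The miniature sample tree and the function names are our own illustrative inventions, not part of the PDS tool:

```python
# Sketch: structural measures over a classification transposed into a
# directory-like tree (nested dicts; empty dict = leaf class).
def count_classes(tree):
    """Total number of classes (nodes), excluding the root."""
    return sum(1 + count_classes(sub) for sub in tree.values())

def depth(tree):
    """Number of hierarchical levels below the root."""
    return 1 + max(map(depth, tree.values())) if tree else 0

def branching(tree):
    """Average number of subclasses per non-leaf class (root included)."""
    child_counts = []
    def walk(node):
        if node:                       # only non-leaf classes branch
            child_counts.append(len(node))
            for sub in node.values():
                walk(sub)
    walk(tree)
    return sum(child_counts) / len(child_counts)

# Invented miniature classification, for illustration only.
sample = {"Levels": {"Primary": {}, "Secondary": {}},
          "Theory": {"Philosophy of education": {}},
          "Reference": {}}
```

For this toy tree the measures are 6 classes, 2 levels, and a branching factor of 2.0; the same counts, applied to the imported structures, produce the figures reported in Table 1.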

3. Results
The classification structure in each virtual library was examined from three perspectives: structure, logic, and semantics. The presentation and discussion of results follow the same sequence. In all tables, sample libraries are represented as: ERD (Educator's Reference Desk), EdNA (Education Network Australia), SOSIG, GEM (Gateway to Educational Materials), EI (Education Index), EVL (Education Virtual Library). Faceted structures are separated from the hierarchical ones by a thick vertical line.

3.1 Structure
The structural analysis generated quantitative data relating to maximum, minimum, and average numbers of classes and hierarchical levels, as well as data on the branching factor, or average number of classes at each level. Table 1 presents the results of this analysis; they will later support comments on the simplicity and specificity of the structures. The asterisk (*) identifies results provided by the PDS software.

                 ERD      EdNA     SOSIG    GEM      EI       EVL
Main classes     12       10       12       7        2        4
Total classes    351*     1916*    162*     398*     68*      44*
Levels max       5        6        2        3        2        2
Levels min       2        2        2        2        2        2
Levels average   3,41     4,1      2        2,28     2        2
Branching        2,978*   3,32*    12,238*  6,994*   7,761*   6,152*
Standard dev.    4,385*   4,298*   3,282*   9,936*   23,166*  7,403*

TABLE 1 Structure

The number of top level or main classes varies from 2, an absolute minimum for a library to be part of our sample, to a maximum of 12, with an average of 7,83 main classes. Larger variations are observed in total number of classes, from a low of 44 in EVL to a high of 1916 in EdNA. Five libraries fall below the average of 489 classes, an average inflated by the very complex structure of EdNA; when EDNA is excluded, the average number of classes goes down to 205. EdNA has the deepest structure, with six hierarchical levels; the average number of levels is 3,33. The branching factor varies from a low of 2,9 to a high of 12, with an average of 6,57. Standard deviation in average number of categories at each level shows that ERD, EdNA and SOSIG offer a balanced structure, with a deviation lower than 5; Education Index is the least balanced of all, with a deviation of 23.
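The averages quoted above can be reproduced directly from the figures in Table 1; the short sketch below simply redoes that arithmetic:

```python
# Reproducing the structural averages reported above from Table 1.
# Column order: ERD, EdNA, SOSIG, GEM, EI, EVL.
main_classes  = [12, 10, 12, 7, 2, 4]
total_classes = [351, 1916, 162, 398, 68, 44]
levels_max    = [5, 6, 2, 3, 2, 2]
branching     = [2.978, 3.32, 12.238, 6.994, 7.761, 6.152]

avg_main = sum(main_classes) / 6                        # reported as 7,83
avg_total = sum(total_classes) / 6                      # "average of 489"
avg_total_no_edna = (sum(total_classes) - 1916) / 5     # "goes down to 205"
avg_levels = sum(levels_max) / 6                        # reported as 3,33
avg_branching = sum(branching) / 6                      # reported as 6,57
```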

3.2 Logic
Qualitative data relating to the logic dimension of each classification structure was obtained through manual examination and interpretation; it will support comments on the logic, flexibility and hospitality of each structure. Three sets of data are available. The first set of data relates to criteria applied at the first three levels for logical division. Potential values were selected from eight principles of division described by Zins (2002). They were: Subject, Object, Audience, Format (or Outer form), Reference (or Inner form), and Location. Table 2 shows that Subject is used as main principle of division (i.e. from Education to Main classes) in all structures, and reappears at the second level (i.e. from Main classes to Sub-classes) in four of them. At the second (and lowest) level of its shallow hierarchy, SOSIG chooses a standard arrangement based on inner and outer form for all of its specialized collections, while Audience makes a strong appearance in GEM.

            ERD                          EdNA           SOSIG                          GEM              EI              EVL
1st level   Subject (92%)                Subject (80%)  Subject (100%)                 Subject (50+%)   Subject (100%)  Subject (50%), Location (50%)
2nd level   Subject (85%), Format (15%)  Subject (72%)  Reference (58%), Format (47%)  Audience (100%)  Subject (19%)   Subject (50%)

TABLE 2 Logic – Principle of division

A second set of data describes the nature of the relation linking classes at the top three hierarchical levels. Potential values are: Generic relationship, where the lower level class is a type of the object, event, etc. named at the higher level; Partitive relationship, where the lower level class is a component of the object, event, etc. named at the higher level; Instance relationship, where the lower level class is a particular object, event, etc. serving as an example of the object, event, etc. named at the higher level; and Contextual relationship, where higher and lower class are found in the same environment but not in the same natural or logical hierarchy. Table 3 summarizes our results. At the top levels of the structure, it is most likely through a relation of context that the user is led to lower levels of the arrangement, with the generic relationship appearing as a distant second. We note in EDL a predominance of the instance relationship linked to the choice of Location as principle of division in 58% of cases.

            ERD             EdNA           SOSIG           GEM               EI              EVL
2nd level   Context (87%)   Context (90%)  Instance (58%)  Context (100%)    Context (70+%)  Context (100%)
3rd level   Generic (50+%)  Generic (81%)  ---             Context, Generic  ---             ---

TABLE 3 Logic – Type of hierarchical relationship

The third set of data describes the internal arrangement of classes at the top three levels of the structure. Below the top level, alphabetical order of class denominations is prevalent in all libraries but one, but at the main class level, the rationale behind class arrangement is not apparent in as many as three libraries out of six.

3.3 Semantics
The semantic analysis provided data on conceptual and terminological concordance with authoritative sources in the field. Results were obtained through a standard methodology for establishing compatibility, involving manual examination of data and the coders' judgment as to degree of concordance. Possible values were: Full terminological concordance, Partial terminological concordance, Full conceptual concordance, Partial conceptual concordance, No concordance. The results, presented in Tables 4 and 5, allow us to propose preliminary observations on the authoritative quality and the simplicity of the sample classification structures. We compared the class denominations at the top three levels to those appearing in the table of contents of a popular reference tool, the Encyclopaedia of Educational Research, 6th ed. Table 4 reveals low levels of either terminological or conceptual concordance, with the latter slightly superior to the former, as was expected; denominators (e.g. 302 in ERD) correspond to the total number of classes at the top three levels of the structure.

Encyclopaedia            ERD              EdNA              EI
Terminological Full      18/302 (5,96%)   2/604 (0,33%)     3/68 (4,41%)
Terminological Partial   24/302 (7,94%)   60/604 (9,93%)    8/68 (11,76%)
Conceptual Full          30/302 (9,93%)   5/604 (0,83%)     5/68 (5,19%)
Conceptual Partial       57/302 (18,87%)  152/604 (25,16%)  12/68 (17,64%)

Encyclopaedia            SOSIG            GEM              EVL
Terminological Full      4/162 (2,47%)    11/398 (2,76%)   0/44 (0%)
Terminological Partial   25/162 (15,43%)  15/398 (3,77%)   2/44 (4,55%)
Conceptual Full          4/162 (2,47%)    4/398 (1%)       1/44 (2,28%)
Conceptual Partial       40/162 (24,70%)  88/398 (22,11%)  5/44 (11,36%)

TABLE 4 Concordance with the Encyclopaedia of Ed. Research (Top three levels only)
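The percentages in Tables 4 and 5 are simple ratios of matching class denominations over the total number of classes examined at the top three levels; for instance, the full terminological concordance of ERD with the encyclopaedia is 18 matches out of 302 classes:

```python
# The concordance figures are ratios: matches / total classes at the
# top three levels, expressed as a percentage.
def concordance(matches, total):
    return round(100 * matches / total, 2)

erd_term_full = concordance(18, 302)      # ERD vs. Encyclopaedia, 5,96%
erd_conc_partial = concordance(57, 302)   # ERD vs. Encyclopaedia, 18,87%
```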

Top-level class denominations in ERD were also compared to captions in the Web version of the Dewey Decimal Classification (DDC). The DDC was chosen on the basis of convenience; any other traditional classification system could have been chosen for this exercise.

                         DDC               Encyclopaedia
Terminological Full      64/302 (21,19%)   18/302 (5,96%)
Terminological Partial   107/302 (35,43%)  24/302 (7,94%)
Conceptual Full          50/302 (17%)      30/302 (9,93%)
Conceptual Partial       145/302 (48%)     57/302 (18,87%)

TABLE 5 Semantics – Concordance between ERD / DDC / Encyclopaedia

Table 5 shows that concordance is higher between ERD and the DDC than between ERD and the specialized encyclopaedia.

4. Discussion
Home-grown classification structures are characterized by their developers as user-friendly and flexible tools, capable of structuring knowledge and information and providing useful assistance in navigating virtual collections. Recognized weaknesses include a lack of specificity, inconsistency in class arrangement, and lack of conceptual and terminological standardization. Our results are consistent with these observations.

4.1 Structure
The quantitative data relating to structural aspects reveal little more than what we knew or at least suspected. The average of 7,83 main classes in all structures is well below the maximum of 10 classes judged efficient for organizing resources in a specialized field, and not surprisingly, the higher numbers of distinct classes are found in the deeper and/or most complex structures (ERD, EdNA). Total numbers of classes show a clear distinction between the four hierarchical structures (ERD, EdNA, SOSIG, GEM) and the two faceted ones (EI, EVL). The average number of hierarchical levels, at 3,33, also corresponds to the "standard" number of levels recommended and common for general Web classification structures. ERD and EdNA, both set up and maintained by field specialists, offer a slightly deeper structure; it is doubtful, however, that even six hierarchical levels will be sufficient to allow for the specificity in classification that as complex a field as education would warrant. Faceted structures (GEM, EI, and EVL) are less balanced than their hierarchical counterparts, starting with a narrow choice of main categories, then expanding quickly into long lists of sub-classes. None of the six structures is overly complex. An obvious lack of specificity in at least four sample libraries will allow for no more than broad classification to be effected. Since total numbers of actual resources are not available (except in ERD), we do not know whether broad classification is sufficient to ensure efficient retrieval, given the size of the virtual collections offered to education specialists and researchers.

4.2 Logic
The practice of mixing various principles of division in a developing hierarchy is contrary to principles of classification because it creates classes of resources that are not mutually exclusive, thus "causing uncertainty for the browser when he has to select a category" (Van Der Walt, 1998, 382). Such a mix is found at the top level in four out of six collections, and unfortunately in both ERD and EdNA, the most complex structures. In both ERD and EdNA, a Reference class refers to Inner form, while the Specific populations class in ERD clearly refers to the Audience criterion. Whether classes are mutually exclusive or not may not be a problem, provided resources can be assigned to more than one class at the same hierarchical level; this possibility needs to be verified, and we do not know if structures and processes allow for greater flexibility in classification than was possible in traditional, pre-Internet contexts. Objects, events, etc. are linked through hierarchical relations of a contextual nature. This is also the case in bibliographic classifications such as the Dewey Decimal or the Library of Congress Classification. The choice of contextual (e.g. Educational management → Educational facilities) rather than truly generic relationships (e.g. Educational institutions → Secondary schools) contributes to making the whole structure hospitable and capable of integrating new classes and specific topics easily. The simple and familiar alphabetical display of classes is also beneficial to the hospitality of the structure and undoubtedly preferable to an obscure logical arrangement reflecting someone's personal view of the world. In this project, we did not look at the criteria of placement of a class in the structure, and at the context thus provided. Such data would allow us to provide more definitive observations on the logic of class determination and arrangement.
From the available data, we suggest, however, that all structures present more than sufficient logic to be of immediate and reasonably easy use. And while the objectives of this project did not include a comparison with traditional class structures, enough is known about the DDC and LCC, among others, to suggest that our sample structures would be no more, but most likely no less, difficult to navigate than the traditional ones.

4.3 Semantics
Terminological and even conceptual concordance with external authoritative sources is surprisingly low. It may be that our choice of reference sources was not the most appropriate for virtual collections set up elsewhere than in North America, or that our criteria for establishing concordance were too stringent. The higher numbers obtained when comparing ERD and DDC may be explained by the fact that the Dewey system is already used for classifying millions of documents and subjects: this could contribute to making it close to being conceptually complete at the first five or six levels of hierarchy, even in specialized areas. The comparison with Dewey also benefits from the encyclopaedic character of its coverage; when a concept is only peripherally related to education, it may not be found in a specialized reference tool but it is likely to be present in a general knowledge organization structure. Results also reveal that partial concordance is always higher than full concordance, at both conceptual and terminological levels. This is not surprising, given that there does not exist a single best way to segment and organize the world of concepts, even within the same cultural, political, etc. context. This particularity slightly increases the complexity of the structure, without affecting its authority. We did not examine actual class names to determine how simple, easily memorized, and unambiguous they are, nor did we compare each sample structure with the others; such an analysis would have allowed us to comment on the issue of interoperability. A look at top classes reveals that Educational levels is the only topic/facet represented in all six sample structures, either as an inclusive class (in ERD, EdNA, GEM, EI, and EVL) or as a listing of individual classes (as in SOSIG).

5. Conclusion
The data gathered through this project allow us to add to the body of literature on the organization of Web-based knowledge, and to increase our understanding of how Web-based resources in education are organized and could be accessed. Web-based libraries in education use home-grown classification structures to facilitate access to their collections. These structures are likely to be hierarchical, not overly complex and not very specific. The choice, arrangement and sequence of classes are logical enough that the whole structure becomes easy to apprehend and navigate. Problems arise at the semantic level, where conceptual and terminological inconsistencies emerge, making the structure less efficient for retrieval because of significant gaps in coverage and lack of terminological control. The semantic dimension of our evaluation model is the most interesting, but also the most difficult to deal with methodologically. Although we did obtain usable results with a straightforward methodology, further research is needed to provide reliable and consistent data with regard to conceptual coverage, terminological consistency, and structural interoperability.

Notes
1. Conception d'un schéma de classification pour l'organisation et le repérage des ressources du Web dans le domaine de l'éducation (2003-2006)

References
Auricombe, S. (2001). Recherche et usage de l'information documentaire : analyse des pratiques des chercheurs et enseignants-chercheurs sur la formation du CNAM. Perspectives documentaires en éducation, 52, 61-69.
Boardman, R., and A.M. Sasse. (2004). Stuff goes into the computer and doesn't come out: A cross-tool study of personal information management. CHI 2004: Conference on Human Factors in Computing Systems, April 24-29 2004, Vienna, Austria, pp. 583-590. New York, NY: Association for Computing Machinery.
Gonçalves, D.J., and J.A. Jorge. (2003). Analyzing personal document spaces. Proceedings of the 10th International Conference on Human-Computer Interaction, 22-27 June 2003, Crete, Greece. Retrieved February 17, 2006, from www.inesc-id.pt/pt/indicadores/Ficheiros/942.pdf

Henderson, S. (2003). Information workspaces: Investigating the information behaviour of knowledge workers and its implications for the design of usable information workspaces. Ph.D. dissertation proposal, University of Auckland, New Zealand. Retrieved February 17, 2006, from staff.business.auckland.ac.nz/staffpages/shen045/docs/ResearchProposalAugust2003.pdf
Hudon, M. (2003). Subject access to Web resources in the field of education. In Subject retrieval in a networked environment: Proceedings of the IFLA satellite meeting, held in Dublin, OH, 14-16 August 2001 and sponsored by the IFLA Classification and Indexing Section, the IFLA Information Technology Section, and OCLC, pp. 83-89. München, Germany: K.G. Saur.
Mai, J.E. (2004). Classification of the Web: Challenges and inquiries. Knowledge Organization, 31, 2, 92-96.
Maniez, J. (1987). Les langages documentaires et classificatoires : conception, construction et utilisation dans les systèmes documentaires. Paris : Éd. d'Organisation.
Molholt, P. (1995). Qualities of classification schemes for the information superhighway. Cataloging & Classification Quarterly, 21, 2, 19-22.
Van Der Walt, M. (1998). The structure of classification schemes used in Internet search engines. In Structures and relations in knowledge organization: Proceedings of the Fifth International ISKO Conference, 25-29 August 1998, Lille, France, pp. 379-387. Würzburg, Germany: Ergon.
Van der Walt, M. (2004). A classification scheme for the organization of electronic documents in small, medium, and micro enterprises (SMMEs). Knowledge Organization, 31, 1, 26-38.
Zins, C. (2002). Models for classifying Internet resources. Knowledge Organization, 29, 1, 20-28.

Catalina Naumis Peña
Researcher at the Centro Universitario de Investigaciones Bibliotecológicas / University Library Research Center
Universidad Nacional Autónoma de México / National Autonomous University of Mexico

Evaluation of Educational Thesauri

Abstract: For years, Mexico has had a distance learning system backed by videos transmitted by television signal. The change to digital and computer transmission demands organizing the information system and its subject contents through a thesaurus. To prepare the thesaurus, an evaluation of existing thesauri and standards for data exchange was carried out, aimed at retrieving subject contents and scheduling broadcasting. A methodology for evaluating thesauri was proposed and compared against a virtual educational platform, and a basic structure for setting up the information system was recommended.

1. Introduction
The purpose of this paper is to evaluate educational thesauri in order to decide whether to construct a new one or adapt one that already exists. The area of expertise that will be analyzed is that of educational contents, given the need for documentary organization demanded by the current distance learning proposal. To represent educational contents for distance learning programs using multimedia digital support in information systems, it is necessary to have a documentary vocabulary that establishes the indexing terms of the educational system to be supported. Distance learning systems require secure, precise and efficient shared information. While the problem at hand is the evaluation of thesauri, we should start with some background on information retrieval mechanisms in the educational contents of the Web.

2. Use of Automatic Computing for Educational Content Information Retrieval
Distance learning relies on computer intervention to manipulate the information that each particular educational program needs. E-learning systems combine educational resources, or learning objects, with people's pedagogical activities, in their respective roles (Griffiths, Blat, García, Sayago, 2005, 6). Our concern in this paper is the learning objects related to educational subject contents, their indexing and retrieval. In the automatic computing arena, contents plus metadata are the essence (Jong, 2003, 9), so it is essential to clarify that the contents we observe here are educational, not automatic computing, subject contents. Metadata are used to locate, identify, select and have access to the learning objects. Metadata also document how an object behaves, its function and use, its relationships with other objects, and its visibility characteristics. Data exchange standards help define the interfaces so that different metadata schemes may be transferred in educational information systems.

Whatever technology is used, it must enable the information exchange to retain the metadata's meaning and original structure. These standards offer a neutral representation of the metadata and their structural order. They have nothing to do with the underlying semantics but rather with providing a common, machine-legible way to transfer the elements that have been defined through networks, systems and platforms. Data exchange standards thus provide the mapping interfaces between the level of definition and the technological level of information systems (Jong, 2003, 16).

In general, educational information systems are using the Extensible Mark-up Language (XML), which distinguishes between form and content (unlike HTML), to translate any metadata scheme to a common representation format, with the aim of transferring them through the Web. The Resource Description Framework (RDF) works like a global metadata exchange structure and provides the groundwork for other standards. RDF has defined a top-level metadata model and syntax that is expressed in XML. The RDF model presents three object types:

- a resource: anything that can have a URL address
- a property: a resource with a name that can be used as a property (e.g. author, title)
- a statement: the relationship between a resource and a property
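These three object types can be sketched, purely for illustration, as tuples in a toy triple store. The resource URL, property names and values below are invented, and the simple lookup illustrates the kind of flat property query that, as the paper goes on to note, falls short of what full learning-object retrieval requires:

```python
# Illustrative sketch of the RDF model's three object types as Python tuples.
# All names, URLs and values are hypothetical.
resource = "http://example.org/course/101"        # a resource: anything with a URL
prop = "author"                                   # a property: a named resource
statement = (resource, prop, "C. Naumis")         # a statement relating the two

triples = [statement,
           (resource, "title", "Introduction to distance learning")]

def query(triples, prop_wanted):
    """Return the values of a given property for any resource: a flat lookup,
    with none of the richer relationships needed to reach learning objects."""
    return [value for (_, p, value) in triples if p == prop_wanted]
```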

While objects are handled in the RDF model, the relationships among them are not the ones needed to develop the full potential of the query systems and arrive at the learning objects. Other specifications, also based on object programming, have been proposed, in which the information search explores within content. Outstanding among the proposals are the IEEE (Institute of Electrical and Electronics Engineers) Learning Technology Standards Committee Learning Objects Metadata Working Group, the Dublin Core Metadata Initiative, the Global Learning Consortium, which proposed the IMS Resource Metadata, and the Advanced Distributed Learning/Sharable Content Object Reference Model. Some of the people involved in the development of these standards comment: "to develop mutual interoperable metadata for technology-supported learning, education and training tools" (Hodgins, 2001, 1); "having a common approach to educational metadata is crucial to further speed up adoption of metadata technologies. That, in turn, is the first, crucial step on the long road to open learning infrastructures" (Duval, 2001, 2). The virtual platforms mentioned evolve; for example Dublin Core, which started out providing bibliographic metadata for information resources and dealing with text resources, has introduced certain extensions to describe multimedia and audiovisual resources that cover specific audio and video aspects. The metadata model should logically connect metadata entities, relationships and attributes to the (digital) essence itself (Jong, 2003, 9). The model will have to link the information elements to the description of a whole that encompasses them and define the level of access within the contents. Another required feature of the standard is that it be able to link textual and sound information, as well as fixed and moving images.

3. Representation of Educational Subject Contents
As an offshoot of the previous point, educational subject contents in the realm of automatic computing are learning objects organized into virtual educational platforms. In addition to the interface model for perusing and transmitting a general idea about the location and characteristics of the subject contents, there is the problem of linguistically representing those contents. Linguistic representation of educational contents is necessary both for textual digital documents and for multimedia. The wealth of possible expressions makes textual documents confusing when they are consulted through a concept search rather than a word search: so as not to repeat an expression within a single educational text, authors may use words that are not habitually used by the educational platform's users. The process becomes long and complex when the user does not find the theme and must search all the possible forms in which the same concept may have been expressed: "When searching for information about concepts that can be expressed in multiple ways it is more effective to use classified Web directories. A directory controls for synonyms and homonyms and provides context for the index terms by placing them in a hierarchical structure." (Mai, 2004, 92) Thesaurus descriptors are another kind of specific content metadata and are used mainly as an indexing tool. In digital multimedia documents, linguistic representation is basic for locating images that will illustrate an educational content consisting of zeros and ones. Educational images have to transmit exactly the subject meant to be taught so as not to confuse the student, and although it has been said that "a picture is worth a thousand words," the collective experience of information professionals assures that a word can take us to an image, yet the opposite does not always happen.
In today's technological world, many access points may be used: "...the computer is capable of storing and organizing texts in ways that enhance their retrievability, and the ways that knowledge can be organized in digital archives on the Internet by using a variety of potential access points" (Andersen, 2002, 37). This multiplicity, which Andersen explains as a great benefit, becomes a problem when too much information is retrieved. Discretion is not an exclusively human virtue; it is also desirable in information systems. The virtual teaching platform does not solve conceptual information retrieval if words in the text are used as the sole guide. The support of a thesaurus is basic for specifying themes consistently. However, the creation of a thesaurus is costly and difficult, so the proposals available on the market should always be analyzed first as possible knowledge organization solutions.

4. Methodology for Evaluating Thesauri In Mexico, the need arose for a thesaurus of educational contents with a classification structure, in order to retrieve videos used in distance-learning television programming at the elementary and intermediate levels. The first task was to analyze existing educational thesauri, and a methodology was set up to be applied to each one in the same manner. The evaluation was carried out within a particular field and type of information, so as to zero in on a practical problem; the consistency and balance of existing thesauri were evaluated alongside the proposal for a new one. Given the length constraints of this paper, it is not possible to delve into details, but the methodology applied and the results of the evaluation are summarized here. The proposed methodology covers five points of view: presentation analysis, consultation analysis, consistency, content analysis, and terminological and semantic structure. All five perspectives are important. Presentation matters because, while the main part of a thesaurus contains an alphabetical body of descriptors and indices with entries for indexing and retrieving information, there must also be an introductory explanation of a series of aspects: first, to help users decide whether the thesaurus is useful for the information system into which it is to be inserted, and later, to let them take full advantage of the tool being offered. This has to do not only with greater or lesser ease of handling but with the authors' explanation of the level of conceptual structuring of the thematic field of application. Consultation defines the way of accessing the thesaurus, whether online, on compact disc or in printed form, as well as the date on which the included terms were last updated.
Consistency involves observation of the following elements: reciprocal relationships, terms entered in the same form and under the same circumstances, use of clarifications on the application of terms, standardization of gender forms, and translation into another language. Content deals with the ratio of descriptors to non-descriptors, the levels of hierarchy and the types of contents included.
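The reciprocal-relationships criterion named above can be checked mechanically: every broader-term (BT) reference must be mirrored by a narrower-term (NT) reference, and every related-term (RT) link must be symmetric. The following is a hedged sketch under that standard thesaurus convention; the function name and the toy term data are invented for the example.

```python
# Sketch of the reciprocity check used in the consistency evaluation.
# bt: term -> its broader terms; nt: term -> its narrower terms;
# rt: term -> its related terms. All values are sets of terms.
def check_reciprocity(bt, nt, rt):
    problems = []
    for term, broaders in bt.items():
        for b in broaders:
            # BT must be mirrored: b should list term among its narrower terms
            if term not in nt.get(b, set()):
                problems.append(("BT/NT", term, b))
    for term, related in rt.items():
        for r in related:
            # RT must be symmetric: r should also list term as related
            if term not in rt.get(r, set()):
                problems.append(("RT", term, r))
    return problems

bt = {"mammal": {"animal"}, "dog": {"mammal"}}
nt = {"animal": {"mammal"}}   # "mammal" -> {"dog"} is missing
rt = {"dog": {"pet"}}         # "pet" does not list "dog" back
print(check_reciprocity(bt, nt, rt))
# -> [('BT/NT', 'dog', 'mammal'), ('RT', 'dog', 'pet')]
```

Each tuple returned names a relationship that is declared in one direction only, which is exactly the kind of inconsistency the evaluation methodology looks for.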

The terminological and semantic structure of the thesaurus entails analysis of the thesaurus' subject division, in other words, the dissection and organization of its terminology. A methodology for reviewing existing thesauri is applied not only to decide whether a preexisting thesaurus might be used instead of creating a new one, but also with the further intention of resolving in advance potentially problematic aspects of the thesaurus to be generated, comparing the criteria used in the design of the most general categories and, of course, taking advantage of elements useful for the thesaurus to be developed. Since "a thesaurus for organizing educational contents must be the faithful reflection of the educational system which it is going to serve" (Naumis, 2002, 9), the following educational thesauri were analyzed, for the methodological and practical reasons already explained:

– Education thesaurus of the UNESCO: OIE. – 5th ed. – Paris: UNESCO, 1992. – 144 p.
– Thesaurus of the UNESCO: structured list of descriptors for bibliographic indexing and retrieval in the spheres of education, science, the social and human sciences, culture, communication and information / United Nations Educational, Scientific and Cultural Organization. – http://databases.unesco.org/thessp/
– UNBIS thesaurus: trilingual list (Spanish, English, French) of terms used as subject headings in the analysis of documents and publications related to United Nations programs and activities / Dag Hammarskjöld Library. – http://unhq-appspub-01.un.org/LIB/DHLUNBISThesaurus.nsf/$$searchs?OpenForm
– SPINES thesaurus: controlled and structured vocabulary for the treatment of science and technology information for development / United Nations Educational, Scientific and Cultural Organization. Institute of Science and Technology Information and Documentation. – http://unesdoc.unesco.org/ulis/cgi-bin/ulis.pl?database=ged&req=2&by=3&sc1=1&look=new&sc2=1&text_p=inc&text=SPINES&submit=GO
– Macrothesaurus for processing information regarding economic and social development / updated by Anne Di Lauro and Alice Watson. – 5th ed. – Paris: United Nations; OECD, 1998. – 427 p.
– ERIC thesaurus: descriptors. – http://www.eric.ed.gov:80/ERICWebPortal/Home.portal?_nfpb=true&_pageLabel=Thesaurus&_nfls=false

Selection of these thesauri was based mainly on their general focus (they deal with education), the concrete aspects they undertake and their current use (they are recent thesauri that include audiovisual material and are centered on the use of new didactic resources). The framework used for the analysis was prepared in accordance with indicators taken, adapted and combined from works by Alvaro Bermejo (1989) and Michel Dauzats (1994), with our own indicators added. The review of candidate thesauri made it evident that most have been produced by international institutions, with fewer thesauri having a national scope, and that most are limited to the scientific realm, not dedicated to supporting a teaching system. The fields of the thesauri analyzed may be summarized as follows: economic and social development; education in an international context; science and technology for development; United Nations programs; one covering various areas (education, science, culture, social and human sciences, information and communication, politics, law and economics, countries and human groups); and, finally, educational research.

The UNESCO thesaurus, put together by the International Office of Education, is divided into eight semantic fields, and in principle this faceted organization should be centered on educational themes. However, upon examining the contents of the semantic fields, it was found that they actually encompass UNESCO projects and activities. Even the third section, devoted specifically to teaching-related terms, that is, knowledge transmitted through a teaching process, includes terms too general for an educational information system. Unfortunately, the later UNESCO Thesaurus (divided into microthesauri) has the same particularities and the same limitation as the previous one. In the UNBIS Thesaurus, which has a mainly thematic structure, not all the descriptors are part of the hierarchical order, so it does not comply with the interlinked structure required to give a thesaurus greater consistency. The SPINES Thesaurus, on the other hand, is organized into 34 subject fields, with every descriptor belonging to one of them. Consequently, it is a thesaurus of educational contents in limited realms. The educational level it represents is professional and above, and the limitations observed for its functioning at the elementary and intermediate educational levels are its scientific language and subject coverage. It may nevertheless be adopted as a model for specific educational subject realms, using language suited to the user level. The OECD Macrothesaurus is an excellent model for aspects related to economic and social development, since it is structured into nineteen classes and seven hierarchies. ERIC is unquestionably the best-known thesaurus on education internationally, having become the backbone of the international information system on education. Nevertheless, a quick look at its content reveals that it is aimed at educational research, not educational contents.
Furthermore, this linguistic tool is written in English, so it would have to be translated into Spanish before it could be used. As Grijelmo says, "Language constitutes the most faithful core of every community, and therefore no other language may define us" (Grijelmo, 2002, 283). In the same vein, Yule states that "...your language will give you a ready-made system of categorizing what you perceive, and as a consequence, you will be led to perceive the world around you only in those categories" (Yule, 1998, 280). The conclusion reached through the analysis of the aforementioned educational thesauri is that none really responds to the indexing needs of a system with multimedia support documents for an elementary- and intermediate-level educational program. Other solutions were subsequently analyzed in order to propose the classification scheme of a thesaurus of educational contents, and a technological proposal was found, "ROSA: Repository of Objects with Semantic Access for e-Learning", which is based on the organization of the educational system itself. Its first classification level is grounded in the programs and the different teaching levels; the next level comprises the names of the different courses, and the subject classes are the contents of each course. While this information organization follows the same institutional premise as ours, it was not possible to review and evaluate the contents, since the system is not available online.

5. Conclusions The terminological and semantic evaluation of the theme of the thesaurus is the most significant aspect; the others contribute to its appreciation. Beyond the result of the evaluation carried out, it must be recognized that the methodology enabled approaching and understanding thesaurus organization and provided helpful elements for creating a thesaurus. It can practically be guaranteed, however, that it is difficult to find a thesaurus whose knowledge structure is useful for any other case: every educational space has its own objectives, depth of treatment, vocabulary, language, and the organizational traits of the institution to which it pertains.

In the analysis of the library science literature on the subject, and starting from the premise that the thesaurus is a classification scheme, elements were found that support the conclusion that it is difficult to reuse in one context a thesaurus developed in another:

A classification scheme is just one potential way to describe a particular domain or the universe of knowledge. To create a classification system for a particular company, organization, library, or any other information center, one needs to begin with a study of the discourse and the activities that take place in the organization or domain. One needs to learn the language used in the community, since the classification must reflect and respond to this particular discourse community. A classification is not something that can be created for an organization by an epistemic authority; a classification must grow out of the organization. The classification is a typification of the language in the organization. (Mai, 2004, 46)

The difficulty lies in the very social, cultural and economic dynamics of the institutions where the need arises for organizing the contents they generate: the "concept of knowledge organization is in interaction with and derived from the social organization of knowledge" (Andersen, 2002, 37). In view of the previous conclusions, and considering elementary and intermediate education in Mexico, we propose organizing the classification scheme of the thesaurus for multimedia educational contents with three types of metadata: the first to reflect the different educational levels taught, the second for the different courses encompassed by each educational level, and the last for the educational contents as such, which will be developed through the usual thematic interrelationships of a thesaurus. The technological component must take care of the relationships among the three types of metadata, to enable the retrieval of any educational content and open the option of using it in cultural promotion or extension programs.
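The three-tier proposal above — level, course, and thematic content descriptors, with the technology maintaining the links among them — can be sketched as a small inverted-index structure. This is an illustrative sketch only; the class and method names, and the sample data, are invented for the example and are not part of the Mexican macrothesaurus itself.

```python
# Sketch of the proposed three-metadata organization: educational levels,
# courses within each level, and thematic content descriptors, with an
# inverse index so any content can be reached from any entry point.
from collections import defaultdict

class ContentIndex:
    def __init__(self):
        self.courses_by_level = defaultdict(set)    # level -> courses
        self.contents_by_course = defaultdict(set)  # course -> descriptors
        self.entries_by_content = defaultdict(set)  # descriptor -> (level, course)

    def add(self, level, course, descriptor):
        self.courses_by_level[level].add(course)
        self.contents_by_course[course].add(descriptor)
        self.entries_by_content[descriptor].add((level, course))

    def find_by_content(self, descriptor):
        # thematic entry point, independent of level or course -- this is what
        # makes reuse in cultural promotion or extension programs possible
        return sorted(self.entries_by_content[descriptor])

idx = ContentIndex()
idx.add("elementary", "Natural Sciences", "photosynthesis")
idx.add("intermediate", "Biology", "photosynthesis")
print(idx.find_by_content("photosynthesis"))
# -> [('elementary', 'Natural Sciences'), ('intermediate', 'Biology')]
```

The inverse index is the part the paper assigns to "the technological part": the same descriptor resolves to every level and course in which the content appears.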

Reference List
Alvaro, C., Villagrá, A. & Sorli Rojo, A. (1989) Desarrollo de lenguajes documentales formalizados en lengua española: II. Evaluación de los tesauros disponibles en lengua española. Revista Española de Documentación Científica, Vol. 12, no. 3, p. 283-305.
Andersen, J. (2002) Communication Technologies and the Concept of Knowledge Organization. Knowledge Organization, Vol. 29, no. 1, p. 29-39.
Duval, E. (2001) Metadata Standards Leaders IEEE and DCMI Collaborate to Design Future Metadata Architecture for Web-based Learning, Education and Training. IEEE Standards. Retrieved February 15, 2006, from http://standards.ieee.org/announcements/metaarch.html
Griffiths, D., Blat, J., García, R. & Sayago, S. (2005) La aportación de IMS Learning Design a la creación de recursos pedagógicos reutilizables. Universitat Pompeu Fabra. Retrieved February 17, 2006, from http://www.upf.edu
Grijelmo, A. (2002) Defensa apasionada del idioma español. México: Taurus.
Hodgins, W. (2001) Metadata Standards Leaders IEEE and DCMI Collaborate to Design Future Metadata Architecture for Web-based Learning, Education and Training. IEEE Standards. Retrieved February 15, 2006, from http://standards.ieee.org/announcements/metaarch.html
Jong, A. (2003) Los metadatos en el entorno de la producción audiovisual / traducción de J. Andérez. 2ª ed. México: Radio Educación.
Mai, J. E. (2004) Classification in Context: Relativity, Reality and Representation. Knowledge Organization, Vol. 31, no. 1, p. 39-48.

Naumis, C. (2001) El macrotesauro mexicano para contenidos educativos: manual para uso y retroalimentación. México: CUIB (documento interno).
ROSA: Repository of Objects with Semantic Access for e-Learning.
Dauzats, M. (dir.) (1994) Le thesaurus de l'image: étude des langages documentaires pour l'audiovisuel. Paris: ADBS Éditions.
Yule, G. (1998) El lenguaje / Traduc. N. Bel Rafecas. Madrid: Cambridge University Press.

Martin Thellefsen Assoc. Prof., Royal School of Library and Information Science, Department of Information Studies Sohngaarsholmsvej 2, DK-9000 Aalborg, Denmark Email: [email protected]

The dynamics of information representation and knowledge mediation

Abstract: This paper presents an alternative approach to knowledge organization based on semiotic reasoning. The semantic distance between domain-specific terminology and KOS is analyzed by means of their different sign systems. It is argued that a faceted approach may provide the means needed to minimize the gap between knowledge domains and KOS.

1. Introduction The purpose of any knowledge organization system (KOS) is to facilitate access to information resources. The field of knowledge organization (KO) may be discussed from a meta-theoretical perspective, which would prioritize social and epistemological aspects of KO, or the focus may be on methodologies for knowledge representation (KR)1, including KOS; nevertheless, from a LIS perspective the focus on libraries and library functions is essential. A central issue for libraries is to produce information services that supply potential users with relevant information. Consequently, matching information sources (documents) with information warrants (requests) by means of access points is a fundamental feature of any KOS, and the function of library services as mediators of information sources is evident. The meta-theoretical perspective is thus concerned with what kinds of theories may be fruitful, while KOS work is concerned with methods and models feasible for KR and document retrieval. In other words, the general perspective is preoccupied with what may be referred to as the epistemological basis of KO, and KOS may be thought of as the applied side of KO. Looking at libraries as mediators is by no means unusual; on the contrary, the quality of library services is (or at least should be) measured by their success as mediators of information sources. However, mediation may be viewed from different standpoints. Originally the public library movement, exemplified by DDC, promoted a focus on functional economy and document management. With Ranganathan and the Colon Classification, the facet-analytical approach ensured a standardization of concept analysis based on rational decomposition of concepts and concept relations. Recent theories of KO are based on a pragmatic outlook, prioritizing the social activities of a discourse community, the latter especially promoted by the domain-analytical view in LIS cf.
(Hjørland & Albrechtsen, 1995). This perspective also marks a shift in how we understand and conceptualize KO and KOS. The technical side of KOS development is regarded as less important than the context and communities the KOS is meant to serve. One may also argue that the domain-analytical approach marks a paradigm shift, from objectivity to relativity. The argument promoted in this paper is that KO is fundamentally about KR, and that a KOS is the formal process of signification of knowledge. The analytical outlook promoted here is based on semiotics, seeing KO as an interpretive and mediative framework, and KOS as the manifestation of a sign structure. The semiotic approach is sympathetic to the domain-analytical view and may be regarded as a branch of domain analysis. The consequence to be drawn from the domain-analytical view is that knowledge domains rest on knowledge claims based on epistemological assumptions and socially developed norms and ideals that affect the knowledge base of the community. KO should thus be concerned with the fundamental units that constitute a domain. Furthermore, the semiotic approach is related to the logic of facet analysis. Semiotics is basically the science dedicated to the logic of signs, and it is argued that facet analysis may provide an analytical tool for domain analysis that appreciates the particularities of a discourse community while providing the categories necessary for the classification of documents. Finally, the semiotic approach offers a unifying perspective on KO/KOS.

2. Problem situation: the logic of mediation In terms of semiotics, the logic of mediation consists of at least three components: a sign, an object, and an interpretant. Formulated in terms of communication and information seeking, the sign may correspond to an event that prompts a perceiver to further inquiry. The object corresponds to that which satisfies the perceiver's inquiry. The interpretant corresponds to the linguistic means used by the perceiver to formulate the inquiry, or the linguistic/symbolic means used by a KOS to signify a particular subject. However, as may be anticipated, these components are themselves complex and need further explication. Firstly, the event that causes a perceiver to speculate and to need information should be seen in context. Individuals always take part in communities, and a scientist participates in a research community; an event is unanticipated precisely because the community knows nothing or little about it. And if the community knows nothing or little about it, there are no concepts describing the event, which makes the search for information in bibliographic databases a difficult task. This scenario is characteristic of sciences favouring nomothetic methodologies, where the search for universal principles demands consistent and universal concepts. The nomothetic sciences give priority to so-called 'hard data' and are based on observation and description of observable facts. The sciences of ideographic nature may not issue or demand the same kind of conceptual consistency: the objects of ideographic science are unique events and products of culture, and may be analysed and interpreted differently by different scientists2. The first barrier for the perceiver is to formulate the event in linguistic form and select appropriate terms that express the problem under investigation. The request thus forms an interpretant of the problem situation. Secondly, the request is put to a library, e.g.
a special library that serves the particular research community. The library takes the place of the object in the semiotic framework and is understood as a collection of documents or information sources. However, the collection of a special library is organized according to a classification scheme that may fulfil the demands of the library, i.e. ease library administration, and the KOS used may not reflect the demands of the user. The second barrier for the perceiver is to express the problem situation in the terms of the KOS, i.e. the interpretant of the library. Thirdly, what the perceiver really needs is the documents containing relevant or correct answers, no more, no less. This means that the real object of interest is the actual set of documents or sources containing the answers sought. However, the documents are only attainable by means of representation, i.e. the KOS of the library containing document surrogates. The KOS thus forms a second interpretant in the semiotic model of the problem situation. Whether the query matches the documents found by searching the library system may be judged on the basis of the representation itself or on the actual documents3, e.g. full text. This judgement of relevance is conducted on the basis of the feedback from the KOS and may cause the user to revise the search strategy. The third barrier is to judge the usefulness of document representations and select documents for further investigation. The real challenge, seen from a KO point of view, is to match the two interpretants mentioned, i.e. the terminology of the discourse community used by the scientist and the concepts of the KOS used by the library. The following model sums up the different aspects of the problem situation discussed:

[Figure: diagram showing the user formulating a Request in domain Terminology, the library's Intermediary function (language and KOS), the Match against Documents & sources, and Feedback to the user]

Figure 1: The process of information mediation. The problem situation is thought of as an iterative process, where the feedback in the form of references constitutes the basis on which success is evaluated.

The success of the intermediary function may be expressed as the KOS's ability to provide useful or correct answers to questions or information warrants. This statement is of course obvious: nobody wants a system that provides bad or useless information. Yet despite its obviousness, such success may be difficult to achieve.
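The intermediary function described here — translating the user's terminology (first interpretant) into the vocabulary of the KOS (second interpretant) — can be sketched as a simple entry-vocabulary lookup. The mapping table and function name below are invented for the illustration; real KOS intermediaries (e.g. thesaurus entry terms, USE references) are far richer, but the principle is the same.

```python
# Hypothetical entry-vocabulary sketch: user terms are mapped to the
# preferred terms of the KOS; unmapped terms pass through unchanged and
# may fail to match the system's vocabulary (the second barrier).
entry_vocabulary = {
    "heart attack": "myocardial infarction",  # user term -> KOS preferred term
    "mi": "myocardial infarction",
}

def translate_request(user_terms):
    return [entry_vocabulary.get(t.lower(), t) for t in user_terms]

print(translate_request(["Heart attack", "treatment"]))
# -> ['myocardial infarction', 'treatment']
```

The unmapped term "treatment" survives untranslated, which models the gap between the two interpretants: success depends on how much of the user's terminology the KOS can absorb.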

3. Two interpretants and three barriers As discussed above, the problem situation depicted in figure 1 involves at least two interpretants and three barriers that need to be overcome. The first interpretant relates to the terminology of the domain.

From the perspective of science, a terminology constitutes the tools used by the scientist to express theories, methods and findings. Different discourse communities may demonstrate different levels of consistency and exactitude in their respective terminologies4. The science of terminology is dedicated to the study of concepts, concept systems, concept definitions and term-concept designation. Its purpose is to promote standardization and communicative economy by reference to certain aspects or facets of domain-specific concepts; however, the more complex the concepts, the more difficult it is to specify a uniform meaning5. The science of terminology is similar to KO in several respects. Firstly, it has a theoretical base, which is discussed vigorously within the field cf. (Cabré, 1999; L'Homme, Heid, & Sager, 2003; Temmerman, 1997). Secondly, it has an applied focus, namely to specify terminological standards and organize terminologies for e.g. ontologies and term databases. From the perspective of the scientist, new discoveries have to be documented and published in appropriate peer-reviewed journals, and the first challenge is to find an appropriate expression for the phenomenon discovered. This may be a simple task when the event can be appropriately explained by existing scientific terminology or fits into an existing nomenclature/taxonomy; sometimes, however, events do not fit nicely into the existing scientific classification and may even threaten the existing consensus. Furthermore, within scholarly communities there may be different paradigms that affect the perspective and definition of concepts. The second interpretant is also concerned with representation, but from another perspective. Where terminology serves the purpose of addressing scientific concepts from within, the KOS is concerned with the classification of documents into appropriate categories.
The purpose is not efficient scientific communication but rather efficient information retrieval. These two aspects may, however, be two sides of the same coin, and the concern of the information specialist should be to analyze or use the terminology of a subject field in order to establish a representation that accords with the concepts used. Existing Terminologies, i.e. ontologies, taxonomies etc., may therefore be of value for the information scientist. The obvious argument would thus be that the KOS should 'mirror' the Terminology used in a scientific domain. Knowledge about the concepts and terminology of a domain may be obtained through different methodological approaches: by asking the scientists of a domain to identify its most basic or most important concepts, by examining a selection of documents and analysing the frequency and contextual use of concepts, or by examining the existing secondary literature of the domain and evaluating existing concepts and taxonomies. This gives us three fundamentally different kinds of warrant: one that gives priority to the individuals of a domain, one that gives priority to documents, and one that gives priority to existing knowledge structures. The objective of this paper is not to argue pro et contra for one of these viewpoints (for a detailed discussion of these matters see e.g. Hjørland, 2002b, 2003, 2005); the objective is rather to argue for selecting the most appropriate methods for acquiring domain knowledge, while recognizing that different kinds of knowledge domains produce different kinds of knowledge.

Units of terminology                                  | Units of KOS
Serves domain-specific communication                  | Serves (domain-specific) information retrieval
The tools of the scientist                            | The tools of the librarian/documentalist
Affected by the paradigms of the scientific community | Affected by paradigms within LIS
The property of the knowledge-producing communities   | The property of libraries and information services

Table 1: Knowledge representation in two perspectives. In some cases there may be a close relation between the two interpretants. The use of domain experts may reduce the semantic distance between the domain-specific terminology and the KOS, especially with regard to specialized indexing. Chemical Abstracts is an example of a KOS that reflects the specialized terminology of its field.

Table 1 summarizes the main differences between the two interpretants discussed; as such, the terminology of a domain constitutes a barrier for KO, and vice versa. It should be noted that 'KOS' covers a wide range of different systems of KO: some KOS are domain-specific and close to the domain, others are of lesser specificity. Even when a KOS covers a scientific discipline, different schools may disagree with the structure displayed. Occupational therapy, for instance, is a scientific area dedicated to the rehabilitation of patients, but different schools of occupational therapy emphasize different values and paradigms. The Scandinavian school of occupational science stresses a holistic approach to rehabilitation, which influences the concepts and terminology of Scandinavian occupational therapy, whereas the American school tends to be much more medical and 'hard science' in its approach; the 'Thesaurus of occupational therapy subject headings' (TOTSH, 2000) thus reflects the Scandinavian school poorly.

4. Semiotics and signs for communication Fundamentally, semiotics is the study of signs, and signs are units of representation. A sign has a reference, i.e. it signifies something else, which may itself be another sign.

A sign, or a representamen, is something, which stands to somebody for something in some respect or capacity. It addresses somebody, that is, creates in the mind of that person an equivalent sign, or perhaps a more developed sign. That sign which it creates I call the interpretant of the first sign. The sign stands for something, its object. It stands for that object, not in all respects, but in reference to a sort of idea, which I have sometimes called the ground of the representamen. (CP 2.228)6

This famous quotation from C.S. Peirce defines a sign as a triad: the sign, which may correspond to an idea; the object, which may be regarded as the originator of the idea; and the interpretant, the connection between sign and object, idea and phenomenon.

[Figure: triangle with vertices labelled interpretant, sign/representamen, and object]

Figure 2: The semiotic triangle.

In the context of the discussion above, terminology and KOS take the place of the interpretant in the sign system.

[Figure: two overlapping semiotic triangles — Terminology as interpretant linking the domain/discourse community with concepts and ideas, and KOS as interpretant linking documents/information sources with information warrants]

Figure 3: Semiotic model of Terminology and KOS.

There are two essential points to be made from figure 3. Firstly, the sign systems are skewed, meaning that they only partly overlap. The information warrants emerge from the terminology triad and originate in the research process. Secondly, the sign triad of the KOS is secondary in relation to the sign triad of terminology. The two sign structures may never entirely overlap, because the KOS will always be in a reactive position; we may therefore consider the KOS an interpretant of terminology.

5. Facets of a domain The conclusion to be drawn from the semiotic analysis is that there will always be a gap between the knowledge domain and the KOS. From the perspective of LIS, the challenge lies in reducing that gap to a minimum. Domain analysis addresses this by stressing the importance of subject knowledge: knowledge about the social organisation of the knowledge domain, its documents and text genres, its terminology, etc. In (Hjørland, 2002a) eleven approaches to domain analysis are discussed, and what may be concluded from that study is that a knowledge domain may be approached differently depending on the nature of the domain. Facet analysis is generally associated with Ranganathan and Bliss, and consequently with special subject classification and the shelf arrangement of documents in libraries. However, facet analysis may also be conceived as a technique for analyzing and creating a general model of a complex structure at a meta level, e.g. the demarcation of a knowledge domain. A domain may, for example, be analyzed from the perspective of epistemology, i.e. identifying the paradigm or paradigms that exist within a community, the theories and methods it values, and the terminology used. Facet analysis may also be conducted from a contextual perspective, i.e. the use and development of technology within a domain. The strength of facet analysis is its flexibility of perspective, and it corresponds well with the classic definition of a sign quoted above: a facet is also a sign that interprets a certain aspect of another sign. In this sense facet analysis is basically about semiotics. This argument is, however, a subject for further investigation.

6. Summing up
The function of libraries has been discussed from the perspective of mediation, and the success of libraries should be measured by their ability to mediate documents and information sources warranted by users. The mediating function of libraries has been discussed from a semiotic perspective, arguing that a KOS constitutes an interpretive sign structure that, however, also provides a barrier for the user who requests information. There may be a gap between the terminology used by the professional user and the units of the KOS, and vice versa. The objective of any domain-specific KOS should be to minimize the distance between the terminology of a knowledge domain and the KOS's representation of it, and facet analysis of some kind may provide the intellectual means to bridge the gap between the interpretant of terminology and the interpretant of the KOS.

Notes
1 KR is frequently associated with computer science (Davis, Shrobe, & Szolovits, 1993); however, several LIS scholars have proposed ontologies, semantic nets and taxonomies as knowledge organization technologies closely related to thesaurus construction, topic maps, metadata and document surrogates (Morrissey, 2002; Poli, 1996; Soergel et al., 2004).
2 Commonly the humanities and social sciences are thought of as idiographic sciences, as opposed to the natural sciences, which are considered nomothetic. This may, however, be a simplified view. A more precise conceptualization is to mark the distinction between the methodologies and objectives of the scientific discipline. Nomothetic methods are concerned with generality, the finding of universal laws, whereas idiographic methods are concerned with the understanding of non-recurring events.
3 The different methods and concepts of relevance will not be discussed in this paper.

4 It may be assumed that the natural sciences exhibit greater consistency and exactitude in terminology than the humanities and social sciences. This may be explained by different research traditions and different research objects.
5 Rita Temmerman (2000) conducted an investigation of the terminology of the life sciences, and the findings showed that certain concepts were clear-cut, while other concepts showed a structure of family resemblance. Furthermore, she found that polysemy and figurative language were functional within the discourse community, especially with regard to concepts that had not yet been fully developed.
6 The notation 'CP' stands for Collected Papers of C.S. Peirce (Peirce, 1958-1966), followed by volume and paragraph numbers.

References
Cabré, T. M. (1999). Terminology. Theory, methods and applications (Vol. 1). Amsterdam: John Benjamins Publishing Company.
Davis, R., Shrobe, H., & Szolovits, P. (1993). What is a knowledge representation? AI Magazine, 14(1), 17-33.
Hjørland, B. (2002a). Domain analysis in information science: eleven approaches - traditional as well as innovative. Journal of Documentation, 58(4), 422-462.
Hjørland, B. (2002b). Epistemology and the Socio-Cognitive Perspective in Information Science. Journal of the American Society for Information Science and Technology, 53(4), 257-270.
Hjørland, B. (2003). Fundamentals of knowledge organization. Knowledge Organization, 30(2), 87-111.
Hjørland, B. (2005). Empiricism, rationalism and positivism in library and information science. Journal of Documentation, 61(1), 130-155.
Hjørland, B., & Albrechtsen, H. (1995). Toward a New Horizon in Information Science: Domain-Analysis. Journal of the American Society for Information Science, 46(6), 400-425.
L'Homme, M.-C., Heid, U., & Sager, J. C. (2003). Terminology during the past decade (1994-2004). Terminology, 9(2), 151-161.
Morrissey, F. (2002). Introduction to a semiotic of scientific meaning, and its implications for access to scientific works on the web. Cataloging & Classification Quarterly, 33(3/4), 67-97.
Peirce, C. S. (1958-1966). Collected papers (Vol. I-VIII). Cambridge, MA: Belknap Press of Harvard University Press.
Poli, R. (1996). Ontology for knowledge organization. In R. Green (Ed.), Knowledge organization and change (pp. 313-319). Frankfurt: Indeks.
Soergel, D., Lauser, B., Liang, A., Fisseha, F., Keizer, J., & Katz, S. (2004). Reengineering thesauri for new applications: the AGROVOC example. Journal of Digital Information, 4(4), Article No. 257.
Temmerman, R. (1997). Questioning the univocity ideal. The difference between socio-cognitive terminology and traditional terminology. Hermes, 18, 51-90.
Temmerman, R. (2000). Towards new ways of terminology description.
The sociocognitive approach (Vol. 3). Amsterdam: John Benjamins Publishing Company.
TOTSH. (2000). Thesaurus of occupational therapy subject headings: a subject guide to OT search. Retrieved 10.12.2002 from http://www.aotf.org/html/thesaurus.html

Jian Qin 1, Peter Creticos 2, and Wen-Yuan Hsiao 3
1 School of Information Studies, Syracuse University, Syracuse, NY, USA
2 Institute for Work and the Economy, Oak Brook, IL, USA
3 ISONTO, LLC, Jamesville, NY, USA

Adaptive Modeling of Workforce Domain Knowledge

Abstract: Workforce development is a multidisciplinary domain in which policy, laws and regulations, social services, training and education, and information technology and systems are heavily involved. It is essential to have a semantic base accepted by the workforce development community for knowledge sharing and exchange. This paper describes how such a semantic base—the Workforce Open Knowledge Exchange (WOKE) Ontology—was built by using the adaptive modeling approach. The focus of this paper is to address questions such as how ontology designers should extract and model concepts obtained from different sources and what methodologies are useful along the steps of ontology development. The paper proposes a methodology framework “adaptive modeling” and explains the methodology through examples and some lessons learned from the process of developing the WOKE ontology.

1 Introduction
Workforce development is a multidisciplinary domain in which policy, laws and regulations, social services, training and education, and information technology and systems are heavily involved. It is common practice in the United States that the federal government establishes the primary workforce development agenda: Congress enacts legislation setting policies and ensuring the funding for workforce development programs. The Department of Labor, the Department of Education and other federal agencies write the administrative rules, establish programs and provide funds for national, state and local initiatives. Workforce organizations and state and local governments implement the programs by running various projects independently or in partnership. Stakeholders communicate with one another in this complicated process from their own standpoints, in their own professional jargon. Technology advances enable the Internet to serve as an open platform for workforce partners and government to collaborate on programs and projects and to deliver resources and services to the general public. The diversity of these partners and government agencies has mushroomed in the last decade. Some, such as general purpose one-stop centers that serve all job seekers and businesses looking to hire workers, employ generalists who must understand, at least at a surface level, the full range of laws, programs and projects that define the workforce system. Others specialize in meeting the needs of particular groups, including older workers, veterans, the disabled, workers who have been displaced because of global competition, migrants, minorities that have had uneven access to the workforce system, people receiving welfare assistance, and youth. The barriers to effective and efficient knowledge exchange over this open platform stem from the lack of a systematic modeling of the workforce knowledge domain.
We started investigating these problems and developing strategies to address the barriers three years ago. During this period, we developed a conceptual model that has been revised many times through consultations with workforce professionals, researchers in the workforce field, educators, and officials at the Department of Labor (DOL). In addition, we held focus group meetings to solicit input on the ontology. An earlier version of the Workforce Open Knowledge Exchange (WOKE) ontology was described in Creticos & Qin (2004).

The Workforce Open Knowledge Exchange (WOKE) system is currently under development and a prototype has been shared with DOL and several workforce organizations. Their feedback on the system has been very positive. This paper summarizes the methodologies we used in developing the WOKE ontology and lessons learned from ontology modeling.

2 Development of Domain Specific Ontologies
Ontologies are considered to be the underpinning of the Semantic Web. Research and development on ontologies started more than a decade ago. Broadly, there are two approaches to developing ontologies. One approach is to re-engineer part or all of the concepts and relationships in an existing thesaurus by following ontology construction principles. For example, the FAST project restructured the form subject headings in the Library of Congress Subject Headings (LCSH) (O'Neill and Chan, 2004). Wielinga et al. (2001) took the concepts in a thesaurus of Western furniture and converted these terms into an ontology for managing the knowledge of antique furniture. The other approach is to start from scratch. Some such ontologies are large-scale projects, including Cyc (http://www.cyc.com/cyc/cycrandd/overview), WordNet (http://wordnet.princeton.edu/), and the Unified Medical Language System (http://www.nlm.nih.gov/research/umls/). Their development methodologies are well documented in Fernández-López and Gómez-Pérez (2002). As the Web is increasingly used as an information communication and exchange platform, domain specific ontologies are in great demand for organizations of all kinds. Since most thesauri and classification schemes are too general to be deployed directly in Web-based systems, many such domain specific ontologies have to be built from scratch. The large number of publications in the past decade with "ontology-based" or similar terms in their titles demonstrates a strong research stream and active development in this area. Strategies at various stages of building an ontology from scratch have been discussed in Uschold and Gruninger (1996), Noy and McGuinness (2001), and subsequently in Fernández-López and Gómez-Pérez (2002). Leo Obrst (2003) reworded the seven steps proposed in Noy and McGuinness (2001), adding an eighth, as:

1. Determine the domain and scope of the ontology 2. Consider reusing existing ontologies 3. Enumerate important terms in the ontology 4. Define the classes and the class hierarchy 5. Define the properties of classes 6. Define the additional properties related to or necessary for properties (i.e., cardinality, bidirectionality/inverse, etc.) 7. Create instances 8. Create axioms/rules
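As an illustration only (none of this code appears in the paper), steps 3 through 7 of the recipe above can be sketched with plain data structures for a toy workforce-domain fragment; all class, property, and instance names here are hypothetical:

```python
# Hypothetical sketch of ontology-building steps 3-7 for a toy domain.

# Step 3: enumerate important terms
terms = ["Program", "Project", "Organization", "Youth program"]

# Step 4: define classes and the class hierarchy ("a-kind-of" links)
hierarchy = {
    "Youth program": "Program",   # a Youth program is a kind of Program
    "Program": None,
    "Project": None,
    "Organization": None,
}

# Step 5: define the properties of classes
properties = {
    "Program": ["title", "funding_source", "target_group"],
}

# Step 7: create instances
instances = [
    {"class": "Youth program", "title": "Summer Jobs Initiative"},
]

def superclasses(cls):
    """Walk the is-a chain from a class up to its root."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = hierarchy.get(cls)
    return chain

print(superclasses("Youth program"))  # ['Youth program', 'Program']
```

Steps 1, 2, 6, and 8 (scoping, reuse, property constraints, axioms) are design activities that do not reduce to a data structure this simple.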

Most publications on ontology methodologies are written by computer scientists and software engineers, and show a clear orientation toward the engineering aspects of ontology development. Interactive conceptual modeling and the close engagement of subject experts and constituents for input were largely absent from these studies. While each of the steps relies heavily on various methods and tools, the validity and usability of ontologies are largely dependent on how well the ontologies represent the users' conceptualization and contextual understanding of the knowledge domain. But achieving validity and usability requires a large amount of human effort, which can be costly to anyone who wants to develop a domain specific ontology. Although Natural Language Processing (NLP) (Aussenac-Gilles et al, 2000), text mining (Maedche and Staab, 2000), query log analysis (Qin and Hernandez, 2006), and machine learning (Bournaud et al, 2000; Wiratunga and Craw, 2000) have been used to draw concepts and terms from texts, the initial modeling and scoping has to be done by humans. Questions remain in developing domain specific ontologies from scratch: How should ontology designers extract and model concepts obtained from different sources? What methodologies are useful along the steps of ontology development? The rest of this paper addresses these questions as we explain the "adaptive modeling" methodology used in the WOKE project and offer examples and some lessons learned from the process of developing the WOKE ontology.

3 Adaptive Modeling of Domain Concepts
"Adaptive modeling" is a term borrowed from computer science. In object-oriented programming, objects "have states and respond to events by changing state. The Adaptive Object-Model defines the objects, their states, the events, and the conditions under which an object changes state. If you change the object model, the system changes its behavior" (Yoder and Razavi, 2000). The "objects" in the workforce development domain include concrete concepts such as laws that provide instructions, policies, and appropriate funds to state and local governments, program operators, and other relevant groups; programs that implement workforce development policies; projects that execute the programs; organizations and persons of all types involved in programs and projects; and resources generated from or created for programs and projects. Abstract concepts are another type in the ontology. The abstract concepts represent the subject content of the concrete concepts because they attach a meaning and context to the other objects. For example, projects targeted to youth obtain funds from associated programs that are established by associated laws. The subject of these projects, programs, and laws may be represented by terms such as "Competencies," "Employable skills," "Partnership in training," and so forth. These abstract concepts cannot be quantified but are important semantic labels for helping understand what the concrete concepts are about. In a search and browse scenario, this type of knowledge structure serves as the semantic infrastructure for developing powerful search and browse functions. We used a wide variety of methods and sources to gather information, to develop the ontology, and to refine our initial model of the workforce domain. One of the sources was the relevant terms in the Library of Congress Subject Headings (LCSH).
An examination of LCSH quickly showed that the vocabulary and concept relationships defined in LCSH were far from the needs and practices of the workforce community. The closest match for the core concept "workforce" is "labor supply". However, the results from our focus group meetings with workforce professionals suggest that "workforce" is a general term that produces in the user's mind an image of one or many different groups of workers and jobseekers, such as dislocated workers, veterans, youth, adult workers, farmers and migrant farm workers, etc. Various federal and state programs, as well as projects run by organizations, serve the workforce, and the meaning of what constitutes the workforce is established by the context of the program or project itself. For example, any reference to "workforce" in program documents for an initiative targeting youth implies that "workforce" means job seekers and workers between the ages of 18 and 22. A document examining the conditions of the labor market for a given area may use "workforce" to describe all who are able to work or who are working. Therefore, "workforce" as defined in LCSH tends to be too general and macro-oriented and does not adequately reflect the more contextually driven meanings of the word.

The nature of the workforce development domain requires the WOKE ontology to be sensitive to the needs of a multidisciplinary, multi-sector user population. We determined that the WOKE ontology must be adaptive to 1) the real world knowledge structure, 2) users' working terminologies and habits, and 3) evolving workforce development policies and practices.

3.1 Identifying top level concepts
Each top level concept plays one of three roles: 1) an entity class that has instances conforming to the "is-a" relationship (laws, resources, programs, projects, organizations, and persons belong to this group); 2) a subject class that represents the body of knowledge of the workforce development domain; and 3) an auxiliary or utility class that is used as a value space for entity class properties. The adaptive modeling produced three groups of top classes, each of which plays one of the three roles described here. It is also possible for a subject class to play the role of an auxiliary or utility class. For example, "Industrial sector" is a top class representing the industry to which a workforce population belongs or that a policy addresses, but at the same time it is also used as the value space for representing the subject content of workforce information resources, projects, and programs.

3.2 Defining subclasses of and relationships between concepts
Concepts are associated with one another through parent-child and sibling relationships in a hierarchical structure. Most top classes in the WOKE ontology have two or three levels of subclasses. These came primarily from two sources: brainstorming with workforce professionals and information scientists, and pools of keywords collected from constituents' databases. The brainstorming sessions, conducted through conference calls and face-to-face discussions, resemble a top-down approach. By using this approach we clarified and defined boundaries between concept classes.
It helped build the first and second levels of the concept hierarchy, which were then supplemented and enhanced by a bottom-up approach: categorizing keywords contributed by workforce organizations against the hierarchy. If similar keywords for the same concept recurred but there was no place in the hierarchy to fit them, a new class was created to cover the emerging concept. All classes in a parent-child relationship followed the "a-kind-of" principle, i.e. a subclass is a kind of its parent class. The sibling classes followed the "mutually exclusive" principle, but there were exceptions. For instance, "Unemployed" and "Special classes of workers" are two sibling classes. While it is true that an unemployed worker may be a member of special classes of workers, the two overlapping groups are necessary because federal programs and workforce projects are often targeted to workers in one group or the other. In this case, these two sibling classes were created according to workforce practices rather than the mutually exclusive rule.

3.3 Specifying properties of concept classes
Classes in an ontology fall into two categories: concrete and abstract. The concrete classes have properties, and such properties can be used as a metadata model for an entity class. The properties of the resource class, for example, may be modeled after the Dublin Core Metadata Element Set (DC). Based on feedback from workforce staff, we adapted DC to fit the needs of describing workforce resources by dropping unnecessary elements and adding more customized elements (properties). As a result, the resource class has properties using a simple text string as the property type, including title, version, status, description, URL, format, keywords, and type. It also has properties that use a class or an instance of a class as the property type; e.g., Activity area is a property of resource that references the class "Activity areas", since the property type is class.
Property definition is also a process of creating connections between related classes.
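The mixed property model described above can be pictured with a small sketch. This is our illustration, not the authors' implementation: the `Resource` and `ActivityArea` names and the sample values are assumptions, but the string-valued properties listed (title, version, status, description, URL, format, keywords, type) come from the paper, and `activity_areas` shows a class-typed property connecting two classes:

```python
from dataclasses import dataclass, field

@dataclass
class ActivityArea:
    """An instance of the (hypothetical) class 'Activity areas',
    used as the value of a class-typed property."""
    name: str

@dataclass
class Resource:
    # Simple text-string properties, adapted from Dublin Core
    title: str
    description: str = ""
    version: str = ""
    status: str = ""
    url: str = ""
    format: str = ""
    type: str = ""
    keywords: list = field(default_factory=list)
    # Class-typed property: references instances of "Activity areas",
    # which also creates a connection between the two classes
    activity_areas: list = field(default_factory=list)

r = Resource(title="WIA Youth Program Guide",
             keywords=["youth", "training"],
             activity_areas=[ActivityArea("Job training")])
print(r.activity_areas[0].name)  # Job training
```

Defining `activity_areas` with a class as its range, rather than a free-text string, is what turns property definition into a process of connecting related classes.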

3.4 Populating the classes with instances
As Noy and McGuinness (2001) and Obrst (2003) point out, populating instances is an important step in developing ontologies. The instances of the WOKE ontology comprise those of entity classes and those of subject classes. These are, however, two different types of instances. The entity class instances function as metadata records for resources, projects, programs, and so forth. These instances are well defined in terms of their relationships with both concrete and abstract concepts, as well as their data types and value spaces. Although subject classes are abstract in nature and the semantic meanings they represent are not quantifiable, they may have synonyms, related terms, broader terms, or narrower terms, which can be treated as instances of the subject classes. The ontology currently contains over 260 classes in three levels and more than 1,600 terms that have been mapped to the 260+ classes. The WOKE ontology was developed through an iterative process in which all classes were carefully weighed. We identified the top concepts in the first stage and then specified the relationships between concepts. In each of the first three stages we consulted with subject matter experts and potential users on the conceptual model and subclasses. Existing taxonomies with broad acceptance in the workforce system were used to populate two classes. Concepts and terms from workforce documents, references, and existing subject categories and vocabularies were carefully examined and refined based on discussions with subject experts.
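The mapping of contributed terms to subject classes, with synonyms treated as instances of the same class, can be pictured as a simple lookup; the entries below are hypothetical examples of ours, not actual WOKE data:

```python
# Hypothetical fragment of a term-to-class mapping; synonyms map
# to the same subject class, as with "dislocated"/"displaced" here.
term_to_class = {
    "dislocated worker": "Special classes of workers",
    "displaced worker": "Special classes of workers",  # synonym
    "jobless": "Unemployed",
}

def classify(keyword, mapping):
    """Return the subject class for a contributed keyword, or None
    to flag that a new class may be needed for an emerging concept."""
    return mapping.get(keyword.lower().strip())

print(classify("Displaced worker", term_to_class))  # Special classes of workers
print(classify("green jobs", term_to_class))        # None -> candidate new class
```

A `None` result corresponds to the situation described in section 3.2, where a recurring keyword with no home in the hierarchy triggers the creation of a new class.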

4 Lessons Learned
The leadership for the development of WOKE comprises subject matter experts and information scientists. This has resulted in a qualitatively different semantic framework from other organizational frameworks now employed in Web-based information systems serving the workforce community. WOKE has relied heavily on the initial involvement of users in identifying top-level concepts and subclasses and in populating the classes with instances. This approach has presented several special challenges, however. First, the intended users of WOKE are rarely required to articulate a conceptualization of the workforce domain. Their efforts are focused on the immediate moment of delivering a service or on identifying a problem and addressing it through the design, development, implementation and evaluation of a policy, law, program or project. Consequently, it often was difficult to engage users in a broad open-ended discussion of the WOKE ontology. We found that it was necessary to establish a framework for that discussion by presenting an ontology for their reaction and assessment. Second, we found that it was important to have a mix of users in any given discussion. A homogeneous group often was too limited in its view of the workforce domain. For example, a group comprised of people delivering services exclusively to veterans would employ narrower definitions of terms shared by others in the workforce system and might identify only a small number of the relationships between instances. The discussion and interaction between members of a heterogeneous group not only revealed each member's understanding of a term and the relationships between instances, it also often produced broader conceptualizations of the workforce domain. Third, the WOKE ontology offered users their first comprehensive view of the workforce domain.
This often prompted new "discoveries." Relationships between classes or instances were not explicitly known until users were asked whether they existed. Once revealed, they prompted users to add new classes or instances. It often came down to a question of what was missing or lacking in the ontology. Overall, this process added both depth and complexity to the WOKE ontology.

Fourth, the complexity and depth of the WOKE ontology are constrained by the point at which the granularity of detail ceases to be important to the user (i.e., when the detail becomes too fine) and by the point at which the perceived time it takes to apply the WOKE ontology in meta-tagging data becomes too costly in relation to the value of subsequent searches. Finally, the development of the WOKE ontology is enhanced by the development of applications that demonstrate the utility and value of the ontology in retrieving and re-using knowledge within the workforce domain. The process becomes self-reinforcing as users comprehend the value of the ontology in organizing their understanding of the workforce system and in helping them gain new insights into policies and practices.

5 Conclusions
Creating an ontology for a multidisciplinary domain such as workforce development involves extensive discussions with constituents in various traditional fields. To extract and integrate concepts from different sources, we developed an adaptive modeling methodology framework. Each iterative refinement of classes and relationships was based on input from subject experts and frontline staff in order to adapt to the knowledge representation needs of a versatile user population. A prototype system has been developed that implements the conceptual model of the WOKE ontology. The initial feedback from a number of major players has been positive. As we continue to refine the ontology and populate it with more instances, the accumulated knowledge base will allow for developing more advanced applications.

References
Aussenac-Gilles, N., Biébow, B., & Szulman, S. (2000). Revisiting ontology design: a method based on corpus analysis. In: R. Dieng and O. Corby (Eds.), Knowledge Engineering and Knowledge Management: Methods, Models, and Tools, 12th International Conference, EKAW 2000, Juan-les-Pins, France, October 2-6, 2000, Proceedings, pp. 172-188. Berlin: Springer.
Bournaud, I., Courtine, M., & Zucker, J.-D. (2000). KIDS: an iterative algorithm to organize relational knowledge. In: R. Dieng and O. Corby (Eds.), Knowledge Engineering and Knowledge Management: Methods, Models, and Tools, 12th International Conference, EKAW 2000, Juan-les-Pins, France, October 2-6, 2000, Proceedings, pp. 217-232. Berlin: Springer.
Creticos, P. & Qin, J. (2004). Open Knowledge Exchange (OKE) for workforce development. In: C. Bussler et al. (Eds.), WISE 2004 Workshops, LNCS 3307, pp. 73-81 (The Fifth International Conference on Web Information Systems Engineering, November 22-24, 2004, Brisbane, Australia).
Fernández-López, M. & Gómez-Pérez, A. (2002). Overview and analysis of methodologies for building ontologies. The Knowledge Engineering Review, 17(2): 129-156.
Maedche, A. & Staab, S. (2000). Mining ontologies from text. In: R. Dieng and O. Corby (Eds.), Knowledge Engineering and Knowledge Management: Methods, Models, and Tools, 12th International Conference, EKAW 2000, Juan-les-Pins, France, October 2-6, 2000, Proceedings, pp. 189-202. Berlin: Springer.
Noy, N. F. & McGuinness, D. L. (2001). Ontology Development 101: A Guide to Creating Your First Ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, March 2001. Retrieved 2/22/2006 from: http://www.ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinness.pdf

Obrst, L. (2003). 2 issues: ontology methodology and upper ontology. Ontolog-Forum. Retrieved 2/22/06 from: http://ontolog.cim3.net/forum/ontolog-forum/2003-04/msg00008.html
O'Neill, E. & Chan, L. M. (2004). FAST: a faceted LCSH-based subject vocabulary. Presentation at ALA Annual Conference. Retrieved 2/9/2006 from http://www.oclc.org/research/presentations/oneill/ALA2004FAST.ppt
Uschold, M. & Grüninger, M. (1996). Ontologies: principles, methods and applications. Knowledge Engineering Review, 11(2): 93-137.
Wielinga, B. J., Schreiber, A., Wielemaker, J., & Sandberg, J. A. (2001). From thesaurus to ontology. In Y. Gil, M. Musen, & J. Shavlik (Eds.), Proceedings of the international conference on knowledge capture (K-Cap'01), pp. 194-201. New York: ACM.
Wiratunga, N. & Craw, S. (2000). Informed selection of training examples for knowledge refinement. In: R. Dieng and O. Corby (Eds.), Knowledge Engineering and Knowledge Management: Methods, Models, and Tools, 12th International Conference, EKAW 2000, Juan-les-Pins, France, October 2-6, 2000, Proceedings, pp. 233-248. Berlin: Springer.
Yoder, J. W. & Razavi, R. (2000). Adaptive object-model. In: Addendum to the 2000 proceedings of the conference on Object-oriented programming, systems, languages, and applications, pp. 81-82. New York: ACM Press.

Julianne Beall
Library of Congress

Diane Vizine-Goetz
OCLC Online Computer Library Center, Inc.

Finding Fiction: Facilitating Access to Works of the Imagination Scattered by Form and Format

Abstract: This study explores ways to assist users who are primarily interested in finding a good story, regardless of format or literary form. The emphasis is on materials classed in the Arts (700s) and Literature (800s) in the Dewey Decimal Classification (DDC) system. Features from two prototypes, FictionFinder and DeweyBrowser, are being combined to attempt to assist users in finding terms to input while providing a holistic approach to finding works with imaginary content.

1. Introduction
Wikipedia gives both broad and narrow definitions of "fiction":

Fiction is storytelling of imagined events and stands in contrast to non-fiction, which makes factual claims about reality. . . . Fictional works—novels, stories, fairy tales, fables, films, comics, interactive fiction—may be partly based on factual occurrences but always contain some imaginary content. The term is also often used synonymously with literature and more specifically fictional prose. In this sense, fiction refers only to novels or short stories . . . . (Fiction, 2006).

In DDC 22 (Dewey, 2003), “fiction” has the narrow definition, limited to prose. The table of preference in the schedule at 800 and the discussion of literary form in the Manual at 800 make clear that stories presented in the form of drama or poetry are classed with drama or poetry, not with fiction. The DDC is structured so that purely textual literary forms are classed with literature in the 800s, but formats that combine literary text with other arts are classed with the arts in the 700s, e.g., comic books, opera, theater, and films (Beall, 2004). Format is given precedence over content for fictional works in the arts and literature. Factual works are classed with the topic, e.g., films that teach history in the 900s. Comic books and graphic novels are examples of a mixed format classed with the arts, specifically under drawing and drawings in 741.5 Comic books, graphic novels, fotonovelas, cartoons, caricatures, comic strips (Dewey Updates, 2006). Goldsmith defines “graphic novel” as “a part of a spectrum of sequential art formats that includes a range of related media such as one-panel gag cartoons and serially published comic books.” She focuses on “the graphic novel element of the spectrum, …on creative works that include narrative with a beginning, a middle, and an end and are published in book format. . . .” (Goldsmith, 2005, 16). Discussions about how to improve provision for graphic novels in the DDC have revealed strong differences of opinion about whether graphic novels belong with the arts in the 700s or with literature in the 800s; some librarians argue that readers of graphic novels are primarily interested in the art; others that they are primarily interested in narrative fiction (Beall, 2004). 
Goldsmith lists four common ways that fictional graphic novels are handled in libraries; two involve putting them outside the general classification scheme; one is placement in 741.5; and the other is to place them “in the fiction, or science fiction, or short story collection—either in the literature classification or, as numerous libraries maintain their fiction collections, according to the author’s surname” (Goldsmith, 2005, 53-54).

2. Finding Fiction
Most general libraries will have some users especially interested in the artistic aspect of mixed formats, and others in the literary aspect. Consequently, local solutions by which some libraries place mixed formats in the 700s and some place them in the 800s will never satisfy all users; and users who visit multiple libraries are likely to be confused. This study explores ways to assist users who are primarily interested in finding a good story, regardless of format or literary form. The emphasis is on the 700s and 800s. In the 700s, the focus is on comic books and graphic novels (741.5), films (791.43), and television programs (791.45). Other fictional formats in the 700s (e.g., operas in 782.1) are excluded from this project for now, though ultimately they will need to be included. Similarly, folk tales, classed with folklore in 398.2, are excluded from this project for now unless they are identified as fiction by an accepted value for literary form (see section 4). Our approach is to consider ways to combine key features of projects like DeweyBrowser and FictionFinder.

3. Fiction Records in WorldCat

This project seeks to build on much previous work, first and foremost the cataloging records contributed to WorldCat by catalogers from around the world. These records include author and title; publisher and publication date; format (e.g., audiobook, large print); language; translator; summaries or other annotations; links to outside information, e.g., authors' web sites; cover art; library holdings; pseudonyms and uniform titles from authority records; DDC or Library of Congress Classification (LCC) class number or both; subject headings for themes, settings, characters; and genre headings. The WorldCat database reflects the extra efforts that catalogers have been making to improve subject and genre access to fiction since publication of Guidelines on Subject Access to Individual Works of Fiction, Drama, Etc. (GSAFD; American Library Association, first edition 1990; second edition 2000). For example:

“Since 1997 the British Library has adopted a policy for providing enhanced subject access to individual works of fiction catalogued for the British National Bibliography. Genre headings are applied where appropriate in accordance with the Guidelines on Subject Access to Individual Works of Fiction, Drama, Etc.” (British Library, 2006). In the OCLC WorldCat database, more than 500,000 genre terms based on GSAFD or Library of Congress Subject Headings have been assigned by catalogers. The amount of data beyond the basic minimum varies from record to record.

For the FictionFinder prototype, the OCLC FRBR Work-Set algorithm has been applied to cluster the records at the work level. Records with the same author/title key are members of the same work cluster (Hickey, O’Neill & Toves, 2002). In the FictionFinder interface2 the term “edition” is used instead of the FRBR “manifestation” because “edition” will be more readily understood by library users. We follow that usage in this paper. Because of FRBRization, many edition records with minimal data are linked to work records rich with data gathered from all the edition records; as a result, the edition records can be found via the rich data in the work records. Some efforts have been made to add data algorithmically, e.g., to supply DDC numbers for work records lacking them. Because of the large number of records involved (2.8 million edition records and 1.4 million works), this project cannot contemplate adding data manually.

This project utilizes all the subject data available in WorldCat except the LCC class numbers. The LCC numbers could be utilized in a later version, for example, to provide more information about the nationality of authors, e.g., distinguishing Colombian, Mexican, and Spanish authors.
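As an illustration of the clustering step, the following sketch builds a simplified author/title key of the kind used to group edition records into work sets. The function name and normalization rules are our own simplifications for illustration; the actual OCLC FRBR Work-Set algorithm also consults authority records and uniform titles.

```python
import re

def work_key(author: str, dates: str, title: str) -> str:
    """Build a simplified author/title key for clustering edition
    records into works. (Illustrative simplification only; the real
    algorithm is more elaborate.)"""
    def norm(s: str) -> str:
        s = s.lower()
        s = re.sub(r"[^a-z0-9 ]", " ", s)  # strip punctuation
        return " ".join(s.split())         # collapse runs of whitespace
    return f"{norm(author)}\\{norm(dates)}/{norm(title)}"

work_key("Austen, Jane", "1775-1817", "Pride and Prejudice")
```

Under this scheme, edition records catalogued with the same author, dates, and title string receive the same key and therefore fall into the same work cluster.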

Clare Beghtol, Annelise Mark Pejtersen, and others have discussed the value of providing much more information about fictional works, in the form of a classification scheme or other controlled vocabulary. For example, Beghtol proposed giving the following information for all major characters: name; occupation; religion; socioeconomic group; racial, ethnic, national group; physical or mental health; sex; age; living or not; type (e.g., human beings) (Beghtol, 1994, 253-266). For another example, Pejtersen proposed describing an author's intention to convey information or to create a particular kind of emotional experience; this information would be presented in standard phrases available for searching (Pejtersen, 1994, 258-261). One sample book had in the “information” category “A description of allegiance and patriotism” and in the “experience” category “Exciting” (Rasmussen, 1994, 300).

Unfortunately, the detailed representation of fictional works envisioned by Beghtol, Pejtersen, and others is not present in the WorldCat database, at least not in any systematic, controlled fashion. The current project cannot draw any conclusions about the potential value of data not present in the WorldCat database.

One advantage to using the WorldCat database, in addition to the large number of bibliographic records, is the large amount of library holdings information (over 1 billion library holdings symbols) and the ability to build on the OCLC “Find-in-a-Library” service. Users who identify an edition of a work that interests them can easily find out which local libraries have the edition and search for it in the local library catalog (or identify libraries holding the edition in the same state, province, region, or worldwide and search the relevant catalogs). The holding library need not use Dewey or any other classification scheme for this feature to work.

4. DeweyBrowser and FictionFinder

4.1 DeweyBrowser

DeweyBrowser is a visual interface that displays search results in successive rows of ten categories based on the three main summaries of the DDC. A user enters a search and navigates up and down the Dewey hierarchy by clicking on a category. Categories are color-coded to indicate where matching records occur. Three collections are currently available for browsing using the DeweyBrowser, including wcat, a database of 2.2 million records for the most widely held works in WorldCat (Vizine-Goetz & Hickey, 2006). DeweyBrowser has the advantage of presenting results of searches in a graphic display that helps users learn where in the DDC the good stories are found. DeweyBrowser can do that now for users who know what terms to input. For example, a search on the title words “pride prejudice” yields 9 records under 823 English fiction for Jane Austen's novel Pride and Prejudice itself, plus other records in the 800s for criticism, sequels, and dramatic adaptations. Under 791 Public performances the search yields 3 records for film adaptations (3 different films, including one “modern makeover” set in Utah) and 7 records for 2 BBC television adaptations. The 9 records for the novel and some of the records for the film and television adaptations can also be retrieved by the following topical subject headings:

Courtship
Sisters
Young women

The 9 records for the novel and some of the records for the adaptations would also be retrievable by the following headings if genre and geographic subject headings were indexed in DeweyBrowser:

Domestic fiction
Love stories
England

Some of the records for the novel, but not all, can be retrieved by a search for the topical heading:

Social classes

The DeweyBrowser does not aggregate records at the work level; if it did, 8 of the 9 editions of the novel itself could be retrieved by the same subject headings.
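The DeweyBrowser's successive rows of ten categories amount to bucketing class numbers by digit position at each level of the three DDC summaries. The sketch below is an illustrative reconstruction of that bucketing, not OCLC's implementation; the function name `dewey_buckets` is hypothetical.

```python
from collections import Counter

def dewey_buckets(ddc_numbers, prefix=""):
    """Group DDC numbers into the ten child categories beneath `prefix`,
    as the DeweyBrowser's rows of ten do: "" -> 000..900, "8" -> 800..890,
    "82" -> 820..829. Illustrative reconstruction only."""
    level = len(prefix)  # how many leading digits are already fixed
    counts = Counter()
    for number in ddc_numbers:
        digits = number.replace(".", "")
        if digits.startswith(prefix) and len(digits) > level:
            counts[digits[level]] += 1
    # label each bucket with its three-digit class, e.g. "820"
    return {prefix + d + "0" * (2 - level): n for d, n in counts.items()}

dewey_buckets(["823", "791.43", "823.7"], "8")  # counts by division under 800
```

Clicking a category corresponds to appending one digit to `prefix` and re-bucketing the matching records.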

4.2 FictionFinder

FictionFinder employs a different approach. The database contains records for materials identified as fiction, novels, short stories, dramas, or comic strips, and records for sound recordings of literary texts. Coding in the bibliographic records, MARC 21 field 008, is used to identify the specific literary forms included in the database (Library of Congress, 2004). The system presents records at the work level and aggregates information about fiction, such as names of characters, settings, genres, and subjects, to assist users in finding works of fiction. For example, FictionFinder brings together more than 700 editions of Austen's novel Pride and Prejudice, including illustrated and abridged editions and translations into more than 30 languages, under one work record with the key:

austen, jane\1775 1817/pride and prejudice

This work record and all the linked edition records can be found under all the headings mentioned above, plus other headings, for example:

Regency fiction
Bennet, Elizabeth (Fictitious character)
Darcy, Fitzwilliam (Fictitious character)
Marriage — 19th century

Because of an inconsistency in the title (“&” instead of “and”), 1 of the 9 records for the novel retrieved by DeweyBrowser is attached to a different work record, which brings together fewer than 15 editions under the key:

austen, jane\1775 1817/pride & prejudice
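A small normalization step of the following kind, applied to titles before keys are built, would merge the two clusters. This is a hypothetical fix offered for illustration; it is English-specific as written and is not part of the existing FictionFinder prototype.

```python
def normalize_title(title: str) -> str:
    """Expand ampersands so that "Pride & Prejudice" and
    "Pride and Prejudice" yield the same work key.
    (Hypothetical, English-specific normalization.)"""
    return " ".join(title.lower().replace("&", "and").split())

normalize_title("Pride & Prejudice") == normalize_title("Pride and Prejudice")  # True
```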

Films and television programs are not included in the existing FictionFinder prototype. Adding dramatic films on the basis of DDC numbers 791.43, 791.4334, 791.4372, 791.4375 (more than 9,000 work records) and dramatic television programs on the basis of 791.45, 791.4572, 791.4575 (more than 3,000 work records) will significantly increase the range of material to which FictionFinder provides access. Comic books and graphic novels are included in FictionFinder only if they are specifically coded as fiction or comic strips. This coding is inconsistent for comic books and graphic novels, especially for older editions.3 A search in DeweyBrowser for “spider man” yields more than 60 records under 741 Drawing & drawings for works with DDC numbers 741.5+. Among those records are many for comic books republished as graphic novels that are not currently included in FictionFinder because they are not coded as fiction.
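The literary-form selection described above can be sketched as a test on MARC 21 field 008, character position 33 (for books). The code values are taken from the MARC 21 format; the helper function itself is our own illustration, and a real selection would also have to cope with the inconsistent coding just noted.

```python
# Literary-form values from MARC 21 field 008, character position 33 (books),
# for the forms FictionFinder selects on.
FICTION_CODES = {
    "1",  # fiction, not further specified
    "f",  # novels
    "j",  # short stories
    "d",  # dramas
    "c",  # comic strips
}

def is_fiction(field_008: str) -> bool:
    """True when a record's 008 field carries one of the literary-form
    codes corresponding to the forms included in the database."""
    return len(field_008) > 33 and field_008[33] in FICTION_CODES
```

A graphic novel whose 008/33 is left at “0” (not fiction) fails this test, which is exactly why records like the Spider-Man reprints are missed.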

If there were no records for comic books and graphic novels already in FictionFinder, adding records on the basis of the DDC numbers 741.5, 741.58, 741.59+ would lead to an increase of more than 4,000 work records; however, since many of the work records are already in FictionFinder, the increase may be more significant in the number of editions and holdings to which FictionFinder provides access.
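The proposed DDC-based expansion, for dramatic films, dramatic television programs, and comics alike, amounts to a membership or prefix test on class numbers. The sketch below is illustrative only; the function name is hypothetical, and the real selection would run against WorldCat work records.

```python
# Class numbers named in the text for expanding FictionFinder's coverage.
DRAMATIC_FILMS = {"791.43", "791.4334", "791.4372", "791.4375"}
DRAMATIC_TV = {"791.45", "791.4572", "791.4575"}
COMICS_PREFIXES = ("741.5", "741.58", "741.59")  # "741.59+" -> prefix test

def candidate_for_fictionfinder(ddc: str) -> bool:
    """True when a work record's DDC number falls in one of the classes
    proposed for adding films, television programs, and comics."""
    return (ddc in DRAMATIC_FILMS
            or ddc in DRAMATIC_TV
            or ddc.startswith(COMICS_PREFIXES))
```

Selecting on class number rather than on the 008 literary-form code is what allows inconsistently coded comic books and graphic novels to be swept in.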

4.3 DeweyBrowser plus FictionFinder

By combining features from FictionFinder and DeweyBrowser we will attempt to assist users in finding terms to input while providing a holistic approach to finding works with imaginary content. Expanded criteria for selecting works of fiction, outlined in section 2, will be used to define the collection. Records for all types of materials will be derived from the OCLC WorldCat database and deployed through a modified DeweyBrowser interface.4 An early prototype that uses selected Dewey categories to narrow search results has been developed. An example of the results retrieved for a search for the subject heading “Courtship” is shown in Figure 1.

Dewey Categories:
American fiction in English
English drama
English fiction
Movies and films
Spanish fiction
Television

Works from all Categories:
The American / Henry James
Emma / Jane Austen
High society / Sol C. Siegel; John Patrick; Charles Walters; Bing Crosby; Grace, Princess of Monaco; Frank Sinatra; Celeste Holm; John Lund; Louis Calhern; Sidney Blackmer; Louis Armstrong; Paul C. Vogel; Ralph E. Winters; Cole Porter; Johnny Green; Saul Chaplin; Philip Barry; Metro-Goldwyn-Mayer; Warner Home Video (Firm)
Jane Austen's Pride and prejudice / Colin Firth; Jennifer Ehle; Alison Steadman; Benjamin Whitrow; Simon Langton; Andrew Davies; Sue Birtwistle; Michael Wearing; Jane Austen; Arts and Entertainment Network; BBC Lionheart Television
Jude the obscure / Thomas Hardy
Mr. & Mrs. Smith / Alfred Hitchcock; Norman Krasna; Carole Lombard; Robert Montgomery; Gene Raymond; Jack Carson
Much ado about nothing / William Shakespeare
An old-fashioned girl / Louisa May Alcott
Pride and prejudice / Jane Austen
Sense and sensibility / Jane Austen
La tía Julia y el escribidor / Mario Vargas Llosa

Figure 1. Modified DeweyBrowser interface

The results include novels, films, and television programs. When a Dewey Category is selected, for example, Movies and films, the results are limited to items in that category:

High society
Mr. & Mrs. Smith

When a title is selected, the work record for that title is displayed. The prototype will be publicly accessible and users will be invited to provide feedback on the interface and functionality.

5. Conclusion

Experience with FictionFinder and DeweyBrowser has shown that no single approach is adequate for finding and representing works with imaginary content. Changes in cataloging rules, formats, and practices have resulted in inconsistencies in the ways fictional prose and mixed formats are described and coded. The FictionFinder prototype demonstrates that, although a narrow definition of fiction which relies on a limited set of codes in MARC records can produce useful results, works of the imagination can be more fully represented when supplemented by content classed in the 700s and 800s, as in the DeweyBrowser.

Notes

1 DDC, Dewey, and Dewey Decimal Classification are registered trademarks of OCLC Online Computer Library Center, Inc.
2 A new interface under development uses the term “edition” instead of the term “version” which was used in the original FictionFinder interface. For sample screens showing the new interface, see http://www.oclc.org/research/presentations/vizine-goetz/webwise2006.ppt.
3 Prior to the definition of 11 new codes and the change of name from “Fiction” to “Literary form” in 1997, only the codes 0 (Not fiction) and 1 (Fiction) were used. Retrieved February 28, 2006, from http://www.loc.gov/marc/marbi/1996/96-8rrp2.html
4 For example, see Hickey, Thom. Work in Progress. [Weblog entry.] Outgoing: Library metadata techniques and trends. December 9, 2005. Retrieved February 28, 2006, from http://outgoing.typepad.com/outgoing/2005/12/work_in_progres.html

References

American Library Association. Subcommittee on Subject Access to Individual Works of Fiction, Drama, Etc. 1990. Guidelines on Subject Access to Individual Works of Fiction, Drama, Etc. Chicago: American Library Association.
American Library Association. Subcommittee on the Revision of the Guidelines on Subject Access to Individual Works of Fiction. 2000. Guidelines on Subject Access to Individual Works of Fiction, Drama, Etc. 2nd edition. Chicago: American Library Association.
Beall, Julianne. 2004. 700 The arts Fine and decorative arts vs. 800 Literature (Belles-lettres) and rhetoric. Draft schedule 741.5 Cartoons, caricatures, comics, graphic novels, fotonovelas available for testing. Retrieved February 28, 2006, from http://www.oclc.org/dewey/discussion/papers/GraphicTestNov2004.htm
Beghtol, Clare. 1994. The Classification of Fiction: The Development of a System Based on Theoretical Principles. Metuchen, N.J.: Scarecrow Press.
The British Library. 2006. Subject Access in British Library Bibliographic Records. Fiction Indexing. Retrieved February 28, 2006, from http://www.bl.uk/services/bibliographic/subject.html
Dewey, Melvil. 2003. Dewey Decimal Classification and Relative Index. Edition 22. Edited by Joan S. Mitchell, Julianne Beall, Giles Martin, Winton E. Matthews, Jr., and Gregory R. New. 4 vols. Dublin, OH: OCLC.
Dewey Updates. 2006. New and Changed Entries, February 2006. Retrieved February 28, 2006, from http://www.oclc.org/dewey/updates/new/default.htm
Fiction. 2006. Wikipedia. Retrieved February 28, 2006, from http://en.wikipedia.org/w/index.php?title=Fiction&direction=prev&oldid=41847943
FictionFinder: A FRBR-based prototype for fiction in WorldCat. 2006. Retrieved February 28, 2006, from http://www.oclc.org/research/projects/frbr/fictionfinder.htm

Goldsmith, Francisca. 2005. Graphic Novels Now: Building, Managing, and Marketing a Dynamic Collection. Chicago: American Library Association.
Hickey, Thomas B., O’Neill, Edward T., and Jenny Toves. 2002. Experiments with the IFLA Functional Requirements for Bibliographic Records (FRBR). D-Lib Magazine 8(9). Retrieved February 28, 2006, from http://www.dlib.org/dlib/september02/hickey/09hickey.html
Library of Congress. 2004. MARC 21 Concise Format for Bibliographic Data. Retrieved February 28, 2006, from http://www.loc.gov/marc/bibliographic/ecbdhome.html
Pejtersen, Annelise Mark. 1994. A Framework for Indexing and Representation of Information Based on Work Domain Analysis: A Fiction Classification Analysis. In: Knowledge Organization and Quality Management: Proceedings of the Third International ISKO Conference, Copenhagen, Denmark, 20-24 June 1994. Frankfurt am Main: Indeks Verlag. Pp. 251-263.
Rasmussen, Jens, Annelise Mark Pejtersen, and L. P. Goodstein. 1994. Cognitive Systems Engineering. New York: Wiley.
Vizine-Goetz, Diane and Thomas B. Hickey. 2006. Getting Visual with the DeweyBrowser. NextSpace 1. Retrieved February 28, 2006, from http://www.oclc.org/nextspace/001/research.htm

Joseph T. Tennis The University of British Columbia, Vancouver, Canada

Function, Purpose, Predication, and Context of Information Organization Frameworks1

Abstract: This paper outlines the purposes, predications, functions, and contexts of information organization frameworks, including bibliographic control, information retrieval, resource discovery, resource description, open access scholarly indexing, personal information management protocols, and social tagging, in order to compare and contrast those purposes, predications, functions, and contexts. Information organization frameworks, for the purpose of this paper, consist of information organization systems (classification schemes, taxonomies, ontologies, bibliographic descriptions, etc.), methods of conceiving of and creating the systems, and the work processes involved in maintaining these systems. The paper first outlines the theoretical literature of these information organization frameworks. In conclusion, it establishes the first part of an evaluation rubric for a function, predication, purpose, and context analysis.

1 Introduction

A diversity of technologies and practices has resulted in a diversity of information organization frameworks. That is, in order to fulfill particular needs, information workers have constructed information organization frameworks for those needs and in many cases have used different technologies and components to fulfill different purposes, built on different predications, to perform different functions, while in a particular context. For example, the theories of bibliographic control state that the library catalogue allows users to find, collocate, identify, select, and obtain materials in a library (Svenonius, 2001).2 That is, the functions of bibliographic control in a library catalogue are fivefold – find, collocate, identify, select, and obtain materials; the functions are built into the catalogue system; and the context is a library.

For the purpose of this paper, an information organization framework consists of information organization systems (classification schemes, taxonomies, ontologies, bibliographic descriptions, etc.), methods of conceiving of and creating the systems, and the work processes involved in maintaining these systems. Information organization frameworks comprise bibliographic control, information retrieval, resource discovery, resource description, open access scholarly indexing, personal information management protocols, and social tagging. Each of these has grown out of a need to manage the interaction between information and users. However, each of these information organization frameworks addresses these needs in different ways. They differ in purpose, predication, function, and context.

Bibliographic control, as outlined by Patrick Wilson (1968), takes as its purpose the delivery of the best textual means to an end, which requires the development of tools (what Wilson calls bibliographical instruments) that offer control over a body of writings.
Bibliographic control, made manifest in catalogues, exists in the context of libraries and bibliographic utilities, like the WorldCat database. In contrast, social tagging has grown out of a need to share items among a social group. As flickr.com, a system that uses social tagging, says of itself, “We want to help people make their photos available to the people who matter to them” and “We want to enable new ways of organizing photos” (flickr, 2005).

The differences between social tagging and bibliographic control are at least four-fold. First, the explicit purpose (as seen through the writings on these frameworks) is different. Bibliographic control has as its purpose facilitating the best textual means to an end (Wilson, 1968). Social tagging is a framework whose purpose is to facilitate sharing pictures as well as creating a space for novelty in sharing and describing pictures. Second, the predications of authoritative descriptive control are present in bibliographic control, but absent in social tagging. Third, the functions of the components of the systems are not the same – nor are they built to do the same things. For example, the terms in flickr are not controlled vocabulary terms, like many of those in bibliographic control, and there is no explicit functionality desired that is comparable to the catalogue’s find, collocate, identify, select, and obtain (Svenonius, 2001).3 Sharing is a function that could be dissected in this way (by function), and doing so would point out differences between these two frameworks. Finally, the context in which these information organization frameworks operate is not the same. The context of sharing photos online, of replicating a sharable photo album in a web environment, is not the same as the context in which the catalogue is built and maintained.

The perceived similarity between these information organization frameworks is that they are all built for retrieval. However, retrieval happens in many different contexts, and for many different reasons, via a diverse set of systems and components of systems. For example, Svenonius claims that the inventory purpose of the catalogue is primary, and so disagrees with Wilson’s (1989) focus on the collocation function of the catalogue (Svenonius, 2001, 204 n24).
This is a complicated argument, and an example of disagreement on purpose, predication, function, and context that requires further analysis in order to evaluate the best way to proceed with development and implementation of information organization frameworks in information systems. How much should the inventory function be present in evaluations of contemporary catalogues that do not point only to those items they own? An explication of purpose in line with predication, function, and context would help us answer this question.

The importance of identifying a diversity of purposes, predications, functions, and contexts of information organization frameworks lies in creating better evaluation rubrics for the design specifications, work processes, and resultant representations in information systems across the global learning society. We can refine the evaluation rubrics (checklists, comparative models, fieldwork analysis codebooks, etc.) if we have a more refined understanding of the diversity in information organization frameworks, their purpose, predication, function, and context. The next sections outline the purposes, predications, functions, and contexts as seen through information organization frameworks, and discuss components of an evaluation rubric built on this comparison.

2 Information Organization Frameworks

Information organization frameworks consist of purposes, predications, functions, and contexts. In this section we introduce these concepts.

2.1 Purpose and Predication

Purpose is defined in this paper as the reason why something is created. The purposes of information organization frameworks are retrieval, attestation, and inference. We will first discuss retrieval. Retrieval can be achieved in various ways, and furthermore, it can be assumed that retrieval is not defined the same way for each information organization framework. That is, we cannot assume that the purpose of retrieval is operationalized in the same way, and that by extension each information organization framework functions in the same way and in the same context. To understand these differences we must first outline the explicit predications (operationalizations) of information organization frameworks, and then their functions and contexts.

Retrieval, as a purpose for information organization frameworks, is built on a combination of predications. Predications are the assertions of purpose on which functions are built. They are operationalizations of purpose. They are a bridge between purpose and function. Retrieval, if we understand it to be the ability to find something,4 relies on a spectrum of predications – specifically, control measures, matching measures, and display measures. One wants control over a set of documents in order to retrieve a set of them. Wilson argues there are two types of control: descriptive control and exploitative control. Following Wilson we can say that descriptive control is the power to line up writings in some arbitrary order (Wilson, 1968, 25), and exploitative control is the ability to procure the best text for the intended use of said text (Wilson, 1968, 25). Control guides the implementation of catalogues. It also guides decisions to employ standards for the various functions of the catalogue (find, collocate, etc.).

The second predication on which retrieval is built is a spectrum of matching measures.
In current systems matching can be seen as necessary for control, while control is not necessary for matching. For example, we do not need to achieve control over documents in order to match, especially in full-text corpora. In many cases, matching is required to illustrate that one has control over a set of documents. In most online catalogues we are dealing with a mix of both purposes. However, that is not the case for most web search engines. Thus, control is predicated on matching, but matching is not predicated on control.

The third predication on which retrieval is built is display. This predication also borrows from another purpose (attestation), but it is important to retrieval because it aids control and matching. We must have conceptual and actual mechanisms in place that display the results of matching and control. Work on display has been a concern of many who work with information systems, and is an ongoing field of research in information organization frameworks (Carlyle, 2002; Yee, Swearingen, Li, and Hearst, 2003). Display is often assumed, and not accounted for as a separate operationalization of the purpose of a system. Carlyle’s work opens up this discussion in bibliographic control, and in so doing asks us to reflect in a purposive way on what we are doing with displays in information organization frameworks.

Another purpose of information organization frameworks is attestation. Information organization frameworks make attestations about resources (descriptions of them, e.g., subject matter, title, relevance ranking, etc.), which are reinforcements of matching and control and enable subordinate functions. The predications of these attestations can be explicit and static in the form of representation of title or subject matter, or they can be dynamic and derived from relationships between documents as decided by query expansion algorithms or ranking algorithms.
Attestations require a link to authority – either based on the terminology employed (that of an authorized scheme or not) or on the identity of the tagger (indexer), as seen in flickr or Amazon.com’s use of tagging (Amazon, 2006).

Inference is the third type of purpose employed by information organization frameworks. Inference can be simple or complex. For example, inference allows users of catalogues to identify particular documents – a function of catalogues (Svenonius, 2001). Inference is also what the structure of ontologies allows machines and users to do. Inference, like control, requires some representation and attestation of authority in identity and terminology.

These purposes and predications can be seen at work in information organization frameworks, and they vary by matters of degree between these frameworks. For example, we might see little to no inference done by machine in bibliographic control, but we can see how structures employed in bibliographic control could be modified for inference – this would then add a layer of purpose to those structures that we want to account for in evaluation. Making purposes explicit lays the groundwork for evaluation because we can see the relationships between purpose and function.
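To make the simple end of the inference spectrum concrete, consider a toy example (ours, not the paper's): explicit broader-term assertions in a controlled vocabulary let a machine infer additional subjects for a document. The vocabulary and function below are hypothetical illustrations.

```python
# Toy broader-term assertions of the kind a thesaurus or ontology makes explicit.
BROADER = {"sonnets": "lyric poetry", "lyric poetry": "poetry"}

def inferred_subjects(term: str) -> list:
    """Walk explicit broader-term links: from an assertion that a
    document is about "sonnets", a machine infers it is also about
    "lyric poetry" and "poetry"."""
    chain = [term]
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain

inferred_subjects("sonnets")  # -> ["sonnets", "lyric poetry", "poetry"]
```

More complex inference, e.g., over merged structures or if-then rules, builds on the same kind of explicit assertion.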

2.2 Function

Functions in this paper are the actions intended by an information organization framework. The functions facilitated by the library catalogue, as outlined by Svenonius, are to allow users to find, collocate, identify, select, and obtain.5 Functions of social tagging, as seen in the flickr example, might be called sharing and annotation, where the system facilitates these functions above all other accidental functions related to collocation, for example. The question then surfaces: can sharing be seen as finding, collocation, or identifying? Here we have the intersection of the purposes of retrieval and attestation, and their consequent functions. It is also a case where identity in attestation affects the function of flickr as an information organization framework. Because much of what is done in bibliographic control is delegated (Fairthorne, 1961, 124-134), and sharing photos is not, we can see bibliographic control as an anonymous aid in finding, collocation, and identification based on some third authority (published controlled vocabularies and standards). The sharing function of flickr, by contrast, is built on social groups and identity – my social group decides what terms to use, and builds these uncontrolled vocabularies for itself. So in the case of flickr, sharing can be seen as a social function linked to a type of identity (my tags versus your tags) distinct from identity as it is understood in bibliographic control (the identity of an authoritative list of subjects and standards, and training in applying these subject headings).

It is of course possible to see accidental functions of information organization frameworks. We might search flickr thinking it should function like a catalogue, and in some cases we may be pleasantly surprised, but we cannot evaluate flickr based on this accidental function and read into it a different purpose than that for which it was built.

2.3 Context

Functions, predications, and purposes are conceived and realized in a context. Context in this instance comprises the information system, the user, and the larger social system in which the information system and the user operate. Context for bibliographic control, then, is the catalogue, catalogue users (including professionals), and the environment in which the users and the catalogue operate. The context for flickr is different. Here we are not dealing with a catalogue. We are dealing with personal collections of photos that can be shared with a small group or with anyone. There is no attempt at controlling these photos. And in many cases, the tags used to identify these photos are shared via email or face-to-face interaction. Context here, then, is not one of anonymous mediation through controlled representations. The context here is a social group deciding how to share photos in a novel tool. Contexts offer secondary functions and purposes as well.

The definition of, and the unit of analysis for, context are not clearly established in LIS. There are a number of discussions of context at various levels and with different foci (Cool, 2001; Davenport and Hall, 2002; Hjørland and Albrechtsen, 1995; Hjørland and Kyllesbech Nielsen, 2001; Rasmussen, Mark Pejtersen, and Goodstein, 1994; Solomon, 2002; Tennis, 2003; Wilson, 1968). We cannot develop this idea here. This is an area that requires further work.

3 Evaluation Rubric for Information Organization Frameworks

This section presents a brief rubric. The evaluation rubric presented here is not comprehensive. It is a start, but more frameworks can be analyzed in order to improve it. The purposes of the rubric are to attest to (1) the purposes of the information organization framework, (2) the predications of the information organization framework, (3) the functions that enable that purpose, and (4) how well it achieves its purpose. This rubric makes explicit these four categories in order to (a) speciate the information organization framework, making explicit the tenets on which the framework was built and distinguishing intended use from accidental use, and (b) lay bare the relationship between intension and action in information organization frameworks.

The fourth point above, the degree to which an information organization framework achieves its purpose, is a complicated matter to interpret. It is important to consider the evaluation in a number of ways; fulfilling purpose is just one of those ways. And even with this partial look at evaluation we are left with only the rubric. We do not have the values that might be associated with the categories in the rubric. That is substance for future research.

The rubric that follows uses the elements of information organization frameworks as the grid through which we can identify purposes, predications, functions, and the degrees of success. It is important to note that evaluation here does not account for interface interactions or other kinds of usability concerns. The evaluation rubric presented here only addresses the structures for information organization. The first table presents purposes and predications.

PURPOSES and PREDICATIONS

Purposes      Predications
Retrieval     Control: Descriptive (arbitrary criteria) / Exploitative (best texts)
              Matching: Without Query Expansion / With Query Expansion
Display       Descriptive (arbitrary criteria) / Exploitative (best criteria)
Attestation   Terminology: Opaque Language / Transparent Language
              Representation: Static (e.g., alphabetical) / Dynamic (ranking) / Explicit (from record or document) / Implicit through Relationships between other Documents
              Identity: Anonymous (no identity) / Link to some Authority (e.g., LCSH) / Link to Assertions (link to other indexing work, e.g., other tags in flickr) / Profile Available (as in Amazon.com)
Inference     Relatedness: Explicit (in vocabularies, etc.) / Implicit (interpreted by user)
              Joint Assertions: Through combining structures (merging) / Through if-then statements (logical inference)

Table 1. Purposes and Predications

This table schematizes the discussion in section 2 above. The intended use of this rubric is to lay bare the intended (and accidental) purposes and predications of information organization frameworks. This makes explicit the components and intension of design. The functions make explicit the actions of an information organization framework. They are perhaps too numerous to list in their entirety here, but a short list can be provided in Table 2.

FUNCTIONS (an incomplete list)
Find (locate)
Collocate
Identify
Select
Obtain
Share
Recall
Pinpoint [precision]
Store
Input
Inventory

Table 2. An Incomplete List of Functions
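To make the intended attribute-by-attribute use of the rubric concrete, its categories can be modeled as a simple data structure. The following Python sketch is a hypothetical illustration only: the `FrameworkProfile` class, the `compare_functions` helper, and the sample catalogue and flickr profiles are assumptions of this example, not part of the rubric itself.

```python
from dataclasses import dataclass


@dataclass
class FrameworkProfile:
    """Hypothetical encoding of one information organization framework
    against the rubric: its purposes (with chosen predications) and the
    functions that enable them."""
    name: str
    purposes: dict
    functions: set


def compare_functions(a: FrameworkProfile, b: FrameworkProfile) -> dict:
    """Attribute-by-attribute comparison of two frameworks' functions."""
    return {
        "shared": sorted(a.functions & b.functions),
        f"only {a.name}": sorted(a.functions - b.functions),
        f"only {b.name}": sorted(b.functions - a.functions),
    }


# Illustrative profiles; the values are examples drawn loosely from the text.
catalogue = FrameworkProfile(
    name="catalogue",
    purposes={"Retrieval": ["Control"], "Identity": ["Link to some Authority"]},
    functions={"Find", "Collocate", "Identify", "Select", "Obtain"},
)
flickr = FrameworkProfile(
    name="flickr",
    purposes={"Identity": ["Link to Assertions"]},
    functions={"Find", "Share", "Store", "Input"},
)

print(compare_functions(catalogue, flickr))
```

A comparison of this kind does not by itself say which framework is better; it only lays the profiles side by side, which is all the rubric claims to do at this stage.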

4 Future Work

Future work will apply the rubric presented here to different frameworks, identify the boundaries of the construct information organization framework, and outline a vocabulary for discussing how well a framework achieves its purpose.

5 Conclusion

This paper is a first step in identifying an analytical tool for evaluating information organization frameworks, and a first step in comparing these frameworks in an attribute-by-attribute manner. Researchers have expressed concern about the reinvention of information organization frameworks by fields unfamiliar with the literature of LIS (Soergel, 1999; Vickery, 1997; Veltman, 2004). However, these accounts have not dissected the purposes, predications, functions, and contexts of these frameworks. As the work unfolds, it is hoped that this rubric will aid researchers in making claims about intension and design in information organization frameworks, and that it will provide a richer vocabulary for the evaluation and comparison of these important tools for the global learning society.

Notes

1 The author wishes to acknowledge Kari Hølland for reading and commenting on a draft of this paper.

2 Svenonius (2001) has a sixth function – navigate – but it seems different in kind, and deserves a longer discussion than this paper allows.
3 See note 1 above.
4 Or Wilson's best textual means to an end (Wilson, 1968).
5 Svenonius (2001) also discusses an inventory function, not listed as an explicit part of the full-featured bibliographic system. However, in the wider context of information organization frameworks it is important to consider. It is also a clue to the more implicit or hidden purposes, predications, and functions that have yet to be discussed.

References

Amazon.com (2006). A record for The World is Flat. Retrieved February 28, 2006 from: http://tinyurl.com/gl2af
Carlyle, A. (2002). Transforming Catalog Displays: Record Clustering for Works of Fiction. Cataloging & Classification Quarterly, 33, 13-25.
Cool, C. (2001). The concept of situation in information science. Annual Review of Information Science and Technology, 35 (pp. 5-42). Medford, NJ: Information Today.
Davenport, E. and Hall, H. (2002). Organizational knowledge and communities of practice. Annual Review of Information Science and Technology, 36 (pp. 171-227). Medford, NJ: Information Today.
Fairthorne, R. A. (1961). Delegation of Classification. In Towards Information Retrieval (pp. 124-134). London: Butterworths.
flickr. (2005). About Flickr. Retrieved November 2, 2005 from: http://www.flickr.com/about.gne
Hjørland, B. and Albrechtsen, H. (1995). Toward A New Horizon in Information Science: Domain Analysis. Journal of the American Society for Information Science, 46, 400-425.
Hjørland, B. and Kyllesbech Nielsen, L. (2001). Subject access points in electronic retrieval. Annual Review of Information Science and Technology, 35 (pp. 249-298). Medford, NJ: Information Today.
Rasmussen, J., Mark Pejtersen, A. and Goodstein, L. P. (1994). Cognitive systems engineering. New York: Wiley.
Soergel, D. (1999). Rise of ontologies and the reinvention of classification. Journal of the American Society for Information Science, 50, 1119-1120.
Solomon, P. (2002). Discovering information in context. Annual Review of Information Science and Technology, 36 (pp. 229-264). Medford, NJ: Information Today.
Svenonius, E. (2001). The intellectual foundation of information organization. Cambridge: MIT Press.
Tennis, J. T. (2003). Two axes of domain analysis. Knowledge Organization, 30, 191-194.
Veltman, K. H. (2004). Towards a semantic web for culture. Journal of Digital Information, 4. Available: http://jodi.tamu.edu/Articles/v04/i04/Veltman/
Vickery, B. C. (1997). Ontologies. Journal of Information Science, 23, 277-286.
Wilson, P. (1968). Two kinds of power: An essay on bibliographical control. Berkeley: University of California Press.
Wilson, P. (1989). Interpreting the second objective of the catalog. Library Quarterly, 59, 339-353.
Yee, K-P., Swearingen, K., Li, K., and Hearst, M. A. (2003). Faceted metadata for image search and browsing. CHI 2003, 401-408.

Edmund J. Y. Pajarillo, Ph.D.
Molloy College, Rockville Centre, NY, USA

A qualitative research on the use of knowledge organization in nursing information behavior

Abstract: The use of knowledge organization is ubiquitous in our global society. The present research focuses on its use in nursing, specifically on how knowledge organization processes are integral to nursing information behavior (NIB). Nurses use the nursing process as a professional practice tool to systematically plan and evaluate patient care. It entails various phases, beginning with assessment and proceeding through identifying nursing diagnoses and needs of the patient, planning, implementation and evaluation. Knowledge organization steps and processes – compiling, sorting, filtering, organizing, sense-making and prioritizing – are evident in each of these steps of the nursing process. The purpose of this study is to identify and describe these knowledge organization concepts as used by nurses in home care. These are examined vis-à-vis the nursing process, using a qualitative paradigm.

1.0 Introduction

Information behavior is a broad and distinct term used to refer collectively to human behavior in relation to sources and channels of information, including both active and passive processes of information seeking, information searching, and information use (Wilson, 2000). Information behavior is also shaped by the contextual nature of the information and its users. Thus, nursing information behavior (NIB) encompasses all behavior of nurses while gathering, processing and managing information. This includes the information sources, resources, leads, and conduits they prefer and use, the problems and difficulties encountered, and the information and knowledge processes involved in any particular professional environmental space (PES) or clinical practice setting (Pajarillo, 2005). The nursing process is just one such professional tool used in NIB. It is instrumental to nurses in the day-to-day practice of their profession, assisting them in systematically planning and evaluating the care provided to patients. It serves as "an organizational framework for the practice of nursing, encompassing all the steps taken by the nurse in caring for the patient: assessment, identifying the nursing diagnoses and needs of the patient, planning, implementation and evaluation" (Mosby, 1998). Knowledge organization steps and processes – compiling, sorting, filtering, organizing, sense-making and prioritizing – are evident in each of these steps. Nursing is not unique among professions when it comes to information processes and concepts: its practitioners constantly handle and manage tremendous amounts of information. Here the importance of knowledge organization cannot be overemphasized. Rubin (1998, 171) once said that "information tends to have an entropic character: it does not organize itself, rather, it has a tendency towards randomness. Unless there are ways to organize it, it quickly becomes chaos."

2.0 Nursing Process and Knowledge Organization

The nursing process embodies the information processing, management and use components of Wilson's (2000) definition of information behavior. It begins at the time an information need is identified, continues with seeking and searching through information leads, and proceeds to managing, processing, and eventually using the information.

Using the nursing process involves the same process of "telling and being told" that Machlup and Mansfield (1983) described. There are two parties: the nurse as the recipient, receiver or observer, and the patient as the source, sender or originator. The patient presents an array of information that includes signs and symptoms; laboratory findings; radiological, sonic and cardiac strips; physical assessment data; past and current medical history; family history; environmental descriptions; work and social habits; and others. The nurse gathers raw, unprocessed and as yet meaningless data during the assessment phase of the nursing process. The media or channels that the nurse uses to collect these observations (the message or data) may range from visual, tactile and auditory to other sensorial means.

The next phase of information processing and management in the nursing process occurs when the nurse analyzes these data. To accomplish this stage successfully, the nurse uses some or all of the following knowledge organization concepts and processes: compiling, sorting, filtering, organizing, sense-making and prioritizing. Compiling is the process of collecting into a list, or putting together, gathering and amassing large amounts of information (The Oxford Dictionary and Thesaurus, 1996). It can also be described as putting all accumulated data next to each other for gross examination. This initial step is essential, particularly when faced with copious amounts of information gathered during the assessment phase. The next process is sorting, a way to arrange data in some preferred manner – by relevance, time sequence, author or source. Sorting as used in the nursing process can mean grouping similar data. This involves comparing patient-related data with the nurse's stock knowledge, whether from formal education, training or experiential learning; the nurse determines whether the data are normal or abnormal.
This is helpful when formulating the patient's list of needs and problems, also known as nursing diagnoses. Filtering is another concept used in the nursing process. Shneiderman (1998, 538) defines it as a process of discarding non-meaningful or uninteresting data, with the aim of helping the user focus on relevant items. As the nurse proceeds further along the nursing process, information that is obviously insignificant is sorted once more prior to discarding, leaving only items of value. This step is followed by the organizing phase. Soergel (1985) identifies two approaches to organizing: putting like entities into groupings, and developing a list of descriptive characteristics for each entity. It involves grouping data that are alike, similar, or related. The final step in the analysis phase is sense-making, when the nurse attempts to find meaning in the organized data. Other concepts in information processing and management apply in subsequent steps of the nursing process, such as planning and intervention. If a patient has more than one problem or nursing diagnosis, prioritizing ensues (Alfaro-LeFevre, 1999; Rubenfeld & Scheffer, 1999; Leddy & Pepper, 1998). The list of nursing diagnoses is ranked in order of relevance: the nurse examines and judges each diagnosis and determines its priority. Other tasks during the planning phase include identifying goals or objectives of care, expected outcomes of nursing actions, short- and long-term goals, and time frames for goal attainment. The same information processing steps are applied in the intervention phase: listing all the possible nursing actions geared to address the identified nursing diagnoses and achieve the targeted outcomes of care, and prioritizing these in order of implementation.
In the evaluation phase of the nursing process, the nurse looks back and examines the patient's current state in relation to the nursing diagnoses. This is achieved by reassessing and gathering data from the patient and comparing these to the initial appraisal. The same information processing steps are followed, with the end result of identifying goal attainment or re-strategizing to refocus an ineffectual plan of care. In all the steps of the nursing process, critical thinking is a helpful tool for the nurse when working through the problem-solving process. Critical thinking is the analytical tool used to move data from meaningless, disjointed and vague states towards relevant and useful information. In the same manner, the knowledge organization processes and concepts discussed above (compiling, sorting, filtering, organizing, sense-making and prioritizing) serve as the analytical and technical tools used in managing patient data for planning care, and their relevance to the steps of the nursing process cannot be overemphasized.
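Viewed as a data-processing pipeline, the analysis phase described above can be sketched in a few lines of Python. The sketch is purely illustrative: the sample observations, the `abnormal` and `priority` fields, and the `analyze` function are invented for this illustration and are not a clinical tool or part of the study.

```python
def analyze(observations):
    """Sketch of the analysis phase: compiling, sorting, filtering, and
    prioritizing patient data. Organizing and sense-making require clinical
    judgment and are only hinted at by the grouping key."""
    compiled = list(observations)                             # compiling: amass all data side by side
    by_group = sorted(compiled, key=lambda o: o["category"])  # sorting: group similar data
    relevant = [o for o in by_group if o["abnormal"]]         # filtering: discard insignificant data
    return sorted(relevant, key=lambda o: o["priority"])      # prioritizing: rank by urgency


# Invented example data (lower priority number = more urgent).
observations = [
    {"category": "vital signs", "item": "blood pressure 160/95", "abnormal": True, "priority": 1},
    {"category": "vital signs", "item": "temperature 98.6 F", "abnormal": False, "priority": 9},
    {"category": "wound care", "item": "drainage at incision site", "abnormal": True, "priority": 2},
    {"category": "history", "item": "appendectomy 1985", "abnormal": False, "priority": 9},
]

for finding in analyze(observations):
    print(finding["item"])
```

The point of the sketch is only that each knowledge organization concept maps onto a distinct, nameable transformation of the data, which is precisely the claim the paper makes about the analysis phase.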

3.0 The Flow of Information Behavior

Aside from Wilson's (2000) research on information behavior, other authors describe the various steps and processes that characterize it. Ellis (1989; 1993) identifies the following features of information behavior: starting, chaining, browsing, differentiating, monitoring, extracting, verifying and ending. The highlight of Ellis' model is its non-sequential nature, or the non-linearity of these features: what matters is the detailed interrelation and interaction of the information activities that a user is performing at any one time during a given search episode. Certain steps appear to be sequential, such as starting, ending, chaining, differentiating, extracting and verifying. Other features, such as browsing and monitoring, can be independent of the other actions, with the user being at any point in the process at any one time. Meho and Tibbo (2003, 583) confirm Ellis' model in their research on the information behavior of social scientists, particularly in the area of information retrieval and enhancing research activities. Four additional steps, however, were cited as equally important in the information-seeking activities of this group of users: accessing, networking, verifying and information managing. With these other steps, Meho and Tibbo (2003) devised a revised Ellis model dividing information seeking into four interrelated stages: searching, accessing, processing and ending. The original premise by Ellis that the steps are non-sequential still holds true. With this redesigned model, users still perform myriad activities in non-linear fashion – starting, chaining, browsing, differentiating, monitoring, and extracting – particularly in the first three stages of searching, accessing and processing. Meho and Tibbo's research reiterates the multiple task-based activities entailed in information seeking.
When one observes the behavior of home care nurses pursuing work-related information needs, the same activities are noted. Whether using electronic or non-electronic sources of information, the nurse proceeds through the general steps of searching, accessing, organizing volumes of information, processing and eventually ending the process. The similarity between nurses' use of the nursing process when managing information and Meho and Tibbo's model is illustrated by this researcher in Table 1:

Nursing Process            Knowledge Organization Concepts
Assessment                 Step 1: SEARCHING PHASE – starting, searching, accessing (browsing, monitoring), chaining, differentiating, compiling
Nursing diagnosis          Step 2: ORGANIZING PHASE – sorting, filtering, extracting, organizing, verifying, sense-making, information identification (may need to go back and do any or all of Step 1)
Planning, Implementation   Step 3: PROCESSING PHASE – processing identified information, listing available relevant information, prioritizing, identifying other information needs, using available information (may need to go back and do any or all of Steps 1 and 2 as necessary)
Evaluation                 Step 4: ENDING PHASE – verifying, validating, ending (may require repeat of any or all of the preceding steps)

Table 1: Comparison of Steps in the Nursing Process with Knowledge Organization Concepts

Meho and Tibbo’s (2003) first and second phases are combined into searching, which coincides with the assessment phase of the nursing process. It includes other knowledge organization steps such as starting, accessing (browsing, monitoring), chaining, differentiating, compiling. Accessing is integral to searching, and makes for this combination. This is followed by the organizing phase, corresponding to the step in the nursing process of identifying nursing diagnoses. Once data have been accumulated, an essential step is to put these into meaningful order. Doing so requires knowledge organization functions such as sorting, filtering, extracting, verifying, sense-making, and information identification. The third phase is processing, comparable to both planning and implementation steps in the nursing process. This processing phase covers other knowledge organization steps such as listing available relevant information, prioritizing, identifying other information needs, and using available information. The last is the ending phase, coinciding with evaluation in the nursing process. This ending phase includes other functions such as verifying and validating. Note that each step is not exclusive or static from the other phases; a nurse may need to go back and include other steps and functions from previous phases. Another research (Foster, 2004) validates information seeking as non-linear and redefines the features into three core processes, namely: opening, orientation and consolidation. The opening is comparable to the starting, initialization or beginning points previously described by Ellis (1991) and Meho and Tibbo (2003). 
Other tasks and activities are included in the opening, such as breadth exploration (the willingness to explore beyond limits), eclecticism (the capacity to merge active, passive and serendipitous approaches to achieve the required information), and networking (a tool used to explore interdisciplinary contacts by various means such as e-mail, the Internet, discussion groups, or face-to-face (F2F) contacts in conferences, meetings or social gatherings). Still other activities fall under the opening stage – those related to the use of databases, online catalogs, Internet sites, search engines and online journals. These tasks include keyword searching, browsing, monitoring and chaining. Serendipity is also included, identified as an essential aspect for achieving breadth and depth in acquiring information from uncharted sources or means. Orientation pertains to defining the focus and boundaries of the information need, covering a range of activities such as identifying previous and existing research, current models and thinking, and prevailing key and discussion points. Other specific tasks in orientation are picture building (or concept mapping), reviewing (considering existing material), identifying keywords (to draw out the most relevant sources), and defining the shape of the existing research (identifying key articles, researchers and proponents, and current opinion and concepts). The third core process is consolidation, although Foster (2004, 234) asserts that it plays a role in every step of the information seeking process. It is the key task of judging and determining whether the required information has been achieved and the information seeking episode should cease. Otherwise, the process continues until the need for the specific information is satisfied. Tasks included in consolidation are refining, knowing enough, sifting, incorporating, verifying, and finishing. Foster's model reiterates that information behavior is non-linear and non-sequential.
Particularly in our current networked and digitized environment, this model is all the more relevant and appropriate. Foster (2004, 235) aptly compares it "to an information seeker holding a palette of information behavior opportunities, with the whole palette available at any given moment. The interactivity and shifts described in the model show information seeking to be nonlinear, dynamic, holistic and flowing."
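The correspondence between the nursing process and Meho and Tibbo's phases laid out in Table 1 can also be encoded as a small lookup structure. The sketch below is hypothetical: the dictionary name and the `concepts_for` helper are invented for illustration, with the phase labels and concept lists taken from the table.

```python
# Hypothetical encoding of Table 1: each nursing-process step mapped to its
# information-seeking phase and the knowledge organization concepts observed there.
NIB_PHASES = {
    "assessment": ("Step 1: searching",
                   ["starting", "searching", "accessing", "chaining",
                    "differentiating", "compiling"]),
    "nursing diagnosis": ("Step 2: organizing",
                          ["sorting", "filtering", "extracting", "organizing",
                           "verifying", "sense-making"]),
    "planning": ("Step 3: processing",
                 ["processing", "listing", "prioritizing", "using"]),
    "implementation": ("Step 3: processing",
                       ["processing", "listing", "prioritizing", "using"]),
    "evaluation": ("Step 4: ending",
                   ["verifying", "validating", "ending"]),
}


def concepts_for(step: str) -> list:
    """Return the knowledge organization concepts for a nursing-process step."""
    phase, concepts = NIB_PHASES[step.lower()]
    return concepts


print(concepts_for("Evaluation"))
```

Note that a flat lookup like this captures only the nominal mapping; the non-linearity stressed by Ellis, Meho and Tibbo, and Foster means that in practice any phase's concepts may recur at any point.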

4.0 Research Questions

Drawing from the assertion that knowledge organization is pervasive in the global society, the present research examines the knowledge organization processes used in nursing, specifically among home care nurses. The study focuses on two research questions.

4.1 What knowledge organization concepts are used by home care nurses to describe nursing information behavior, particularly the actions or steps used when processing, managing and using information?

This research component delves into the specific actions and measures describing how home care nurses identify information needs, name possible information leads, translate or encode the question in a manner congruent with the specifications of the information lead, arrive at an array of possible responses, and choose the most appropriate solution to the information quandary. Does the home care nurse follow a particular order – from starting, chaining, differentiating, extracting and verifying, to interposing these with browsing and monitoring – as described by Ellis (1989, 1993)? Does the revised Ellis model posited by Meho and Tibbo (2003) parallel the steps taken by home care nurses when accessing, searching and managing information? What other steps, such as sorting, compiling, prioritizing and filtering, are characteristic of home care nurses when seeking and processing information?

4.2 What knowledge organization concepts used by home care nurses are analogous to the steps used in the nursing process?

Different models influenced the development and evolution of the nursing process. Potter and Perry (1995) affirm that the steps employed in the nursing process parallel those of the scientific method and of problem-solving or decision-making theory. There is also evidence that the nursing process is influenced in various ways and degrees by other models, such as communication, interpersonal relations, systems, perception, and critical thinking. Which knowledge organization concepts and processes align with the steps of the nursing process? A description and analysis of the effects of knowledge organization on the nursing process is pertinent to describing nursing information behavior, where nursing and knowledge organization converge.

5.0 Research Methodology

The study is qualitative in nature, employing the case study method. This paradigm is particularly helpful for providing a snapshot of home care nurses' daily practice, affording a more detailed description of nursing information behavior. The aim is to develop analytical generalizations from case study scenarios for use in theory development to address the specific research questions. The unit of analysis is the home care nurse. Five case study participants and three alternates were recruited from home care nurses of the Visiting Nurse Service of New York Home Care, a large metropolitan certified home care agency in New York City, U.S.A. Together they represent a cross-section of home care nurses in terms of length of professional experience: less than five years, 5-10 years, and more than 10 years of home care experience. Data gathering combined methods to elicit comprehensive and in-depth information, including daily information search logs and individual and group interviews. Nurses kept journals of information needs and encounters over a period of two weeks, detailing the steps and processes taken to resolve them. The submitted transcripts were examined for completeness, clarity and appropriateness, and follow-up individual interviews were conducted to explore ambiguous or unclear entries. Finally, case study participants took part in focus group interviews. The data gathering tools made for a triangulated approach: results from one were used to verify and validate outcomes from the other techniques. Daily write-ups submitted by the nurses were used as start-up discussion points in the focus groups. An initial group meeting was conducted, with a follow-up used to iterate and validate assumptions. Individual follow-up sessions were also held to confirm conclusions about recurring themes.

6.0 Case Study Findings and Discussion

A total of 43 information search episodes were documented and submitted by the case study nurses. These ranged from clinical and administrative issues to mundane search scenarios involving computer problems and locating patients' homes.

6.1 Knowledge Organization Processes and Steps Used in NIB

The information search logs and the interview and focus group transcripts were examined for key terms describing the information searching and knowledge organization steps and processes used by home care nurses. The first phase was an analysis of each submitted information search log, identifying tasks and activities in the information-seeking process. Some examples of these codified scenarios include:

• Starting, sorting, clarifying, verifying
• Starting, identifying, sorting, differentiating, picking out information choices, ending
• Assessing, identifying, seeking sources, evaluating
• Identifying needs, problems, solutions, calling resources, ending
• Sorting, identifying, prioritizing, determining
• Recognizing, sorting, calling sources, using the library, accessing Medlineplus™, sorting, reading, differentiating, determining, closing
• Determining needs, sorting, processing, choosing, achieving answer
• Asking sources, sorting sources, searching the Internet, reading results, filtering, finding, ending

Putting these information searching and knowledge organization processes together in a common list was the next phase of the inquiry. Following the procedure instituted by Foster (2004, 231), these keywords were categorized for organization and analysis into three linear groupings: initial, middle and final. Table 2 (below) represents all information searching and knowledge organization steps noted in the case study logs and transcripts, grouped according to the linear categories of the information-seeking process. Certain tasks are distinct, defined and expected, such as starting, problem identification, searching and ending. As in any process, there is a beginning and an end – a point at which an information need leads to inquiry and its ultimate information discovery. However, there are tasks that come at various points in the continuum, which are recurring or circuitous. Extracting from the data, activities such as sorting, clarifying, verifying, validating, determining, chaining, browsing or refining (those marked in Table 2) occur at all the different stages of the information-seeking process. The sorting step is an example of a recurring task. In one information search scenario, the nurse began the process with the information need unclear and unfocused because of many incidental questions and distractions. It was necessary to sort these questions and distractions to zero in on the central information need. Once the searching process yielded a string of possibly relevant and useful leads, sorting became pertinent again. Subsequent steps, such as differentiating, sifting and filtering, were straightforward once results were sorted into similar groupings or categories. Another example is the browsing task, which occurs in the early phase of information seeking when the specifics of an ambiguous and evolving information need have yet to be defined.
Once the user begins to actively search for the identified information need, browsing becomes essential again and is interposed with searching activities. Gleaning from the knowledge organization steps described in the various scenarios, there appears to be wide variation in the terminology and chronology of activities in the nursing information behavior process. As Foster (2004), Meho and Tibbo (2003), and Ellis (1993) describe, there are basic, common and essential knowledge organization steps, intertwined with various specific activities that follow from or are contingent on preceding tasks. Some tasks and activities are central and vital to knowledge organization and creation, falling right in the nub of information seeking. From the case study findings, three processes – uncovering, discovery and recovery – were evident, which aptly describe the knowledge organization steps and processes at the nub of nursing information behavior (Pajarillo, 2005).

Table 2: Categorization of Information Searching and Knowledge Organization Steps and Processes Used in Nursing Information Behavior

Uncovering was evident as the beginning point in most of the information search episodes. Some scenarios referred to this phase as starting, beginning, or information need identification. Some circumstances were clear-cut and focused, but in other situations the starting point was fuzzy, undetermined or vague. The nurse came to terms with the exact information question only after systematically following a series of other knowledge organization steps such as sorting, browsing, clarifying or differentiating. The uncovering phase was also described in instances when the process of identifying the particular information need was hit-or-miss or serendipitous. Finding the precise, prioritized and well-formulated information question is a process of uncovering – of demystifying or removing the ambiguity, vagueness and fuzziness of an initially identified and evolving information search query. After uncovering tasks bring the information driver on target, the second of the nub processes that constitute NIB is discovery. In the case study scenarios, this was described as the continuation of uncovering, with the user picking up where uncovering left off. Subsequent information searching and knowledge organization steps include more browsing, active searching, chaining, differentiating, discriminating, sorting and filtering. Thus, a combination of searching, browsing and calling (phoning) sources, or chaining and following up, ensues. From these steps, the user might discover other information leads. These sources require further fine-tuning to achieve the needed information. Should discovery necessitate a return to the initial process of uncovering, it is probably to reformulate information questions or search queries because results are not yielding the targeted outputs. Finally, the next step for the nurse is recovery – the third of the core processes in NIB.
This can be the ultimate step once the user succeeds and fulfills the information need. The stage of recovery is not always the final step, but is sometimes a revalidation period. It is possible for the information-seeking process to be in the final stage of recovery when the user confirms the relevance and usability of the found information. However, the user can sometimes be in a feedback mode, indicating that recovery is also a moment to revalidate one’s current standing in the information search process. This feedback mechanism presents an opportunity to re-examine information already obtained. The feedback process results either in an affirmation of the usefulness of the information, or in a redirection to a previous stage when further uncovering or discovery is deemed necessary. The term recovery particularly refers to stabilizing the disruption in the nurse’s working routine. Once the question is clarified and the user achieves the required information, the disequilibrium and instability also resolve. In this regard, recovery can also be referred to as the recoil process: a return to the original, stable state. These core processes, the nub of NIB, are illustrated in Figure 1.

Figure 1: Schematic Representation of the Nub of Nursing Information Behavior (NIB)

6.2 Comparison of the Steps used in NIB and those of the Nursing Process
There is strong and recurring evidence from the scenarios and interview transcripts that nurses follow the steps outlined in the nursing process in most of their information seeking activities. The use of the process is described as unconscious, systematic and automatic. In almost all the scenarios analyzed, the search episodes depict the methodical steps of the nursing process used by nurses when faced with an information need. Thus, nurses completed a full range of assessments, interpreting the resulting data and identifying nursing needs and problems, planning courses of action, implementing defined interventions, and finally, evaluating the outcomes.

It is also evident from closer scrutiny of the search scenarios that home care nurses adapt knowledge organization steps and processes as part of their nursing information behavior. For every phase in the nursing process, corresponding and appropriate knowledge organization steps are applicable. It is relevant to extend the discussion by incorporating the conclusion that home care nurses employ both nursing process and knowledge organization, as shown in the preceding description of the nub of NIB. The relationship and similarities between the nursing process steps and the NIB core processes appear in Figure 2.

UNCOVERING ASSESSMENT

PROBLEM IDENTIFICATION

DISCOVERY PLANNING

INTERVENTIONS

RECOVERY EVALUATION

Figure 2: Comparing the Nub of Nursing Information Behavior with the Steps of the Nursing Process

In the uncovering phase, signifying the beginning or starting point in the information search process, the home care nurse is in assessment, the first step of the nursing process. This is an initial move, an attempt to establish one’s baseline. When the presenting information driver is clear and focused, there is no need for sense-making, figuring out, or other recognizing steps; it is a straightforward determination of markers, benchmarks, or specific information needs. Once completed, the next step is problem identification. Here, steps such as sorting, clarifying, verifying and determining are instituted to define specific information needs. These actions fall within uncovering in the nub of NIB. Once problems are determined and identified along the nursing process continuum, the nurse progresses to planning how to address these issues. In the nub of NIB, the nurse is identifying leads and conduits, thinking of questions to ask sources, determining search terms or keywords, and drawing up courses of action. Planning in the nursing process might still be considered a part of uncovering in NIB. On reaching the discovery level, the nurse may realize that earlier defined problems or information needs remain indeterminate. The nurse then reverts to further sense-making, re-conceptualization, or sorting. This is equivalent to backsliding to the problem identification step in the nursing process, when problems identified earlier have to be rethought. Only then will the process progress to planning, still as part of the discovery phase. When the nurse is able to identify specific moves aimed at achieving the needed information, the intervention step proceeds.
This equates to discovery in the nub of NIB, with such specific tasks and activities as actual browsing or searching of the Internet, electronic databases, books, manuals and journals, calling up leads and conduits, sorting, reading through, filtering, determining, and actually picking out the most relevant and useful items from a list of search outputs. The last stage, recovery, parallels activities related to the evaluation step of the nursing process. This entails an examination of the effectiveness of the process and the achieved outcome. It also engages the nurse in a feedback mode when necessary, requiring a return to assessment, or any other previous step, in order to refine and improve on results. In the nub of NIB, recovery might mean redoing either the uncovering or discovery phases. It can also be that the nurse has reached goal achievement or satisfaction, and the nursing process ceases. Similarly, this signifies the task of ending, another activity subsumed in recovery. This comparative segment of the analysis is particularly relevant in emphasizing the integral role that information searching and knowledge organization steps play in NIB. It also underscores the contextual nature of nursing information behavior, with the blending of inherent principles, concepts, and models used in the nursing profession, such as the nursing process, with the global and universal tools and processes used in knowledge organization.

7.0 Conclusions and Recommendations
In summary, many knowledge organization processes used in nursing information behavior are similar to those described by Foster (2004), Meho and Tibbo (2003) and Ellis (1993). Certain steps are constant, such as starting, problem or needs identification, determining baseline, problem definition or exploration, conceptualizing, searching, deciding, finding, reviewing, feedback, ending, satisfaction, information use, and relief. But in between these are myriad processes that come into play as needs dictate. Thus, other steps such as sorting, clarifying, browsing, validating, discriminating, verifying, determining, chaining, prioritizing and refining are described as occurring at various points in the information behavior of nurses, as supported by the data analyzed from the case study. Closer examination of the nurses’ search episodes reveals three salient components that define the nub of NIB, namely: uncovering, discovery and recovery. Every search scenario is characterized by a point of uncovering, when the nurse comes to the realization that an information need exists. Uncovering is almost always the beginning or starting point in the nurse’s pursuit of information. Depending on the nature of the information driver and the accessibility and effectiveness of the available leads and conduits, the nurse progresses to discovery or recovery. Discovery is the active unearthing of and searching for the required information. It reverts to the first step of uncovering when refinement or refocusing of the information driver is necessary; or discovery can advance to the final phase of recovery, when the requisite information is eventually found and the need is fulfilled. Additionally, the research also revealed that knowledge organization steps correspond to every step in the nursing process and are interwoven as vital processes in nursing information behavior.
Results of this qualitative study reiterate the inherent and integrative features of knowledge organization in our global society. Nursing is no exception: many knowledge organization steps and processes are reflected and evident in the daily professional practice and information behavior of nurses. This researcher recognizes that the study focused only on the information behavior of home care nurses, and that replicating the investigation with case study participants from other nursing specializations is in order.

8.0 References
Abate, F. (Ed.). (1996). The Oxford dictionary and thesaurus: The ultimate language reference for American readers. New York: Oxford University Press.
Alfaro-LeFevre, R. (1999). Critical thinking in nursing: A practical approach. Philadelphia: W. B. Saunders.
Ellis, D. (1993). Modeling the information-seeking patterns of academic researchers: A grounded theory approach. Library quarterly 63: 469-86.

Ellis, D. (1989). A behavioral approach to information retrieval system design. The journal of documentation 45: 171-212.
Foster, A. (2004). A nonlinear model of information-seeking behavior. Journal of the American society for information science and technology 55: 228-237.
Leddy, S. and Pepper, J. M. (1998). Conceptual bases of professional nursing. Philadelphia: J. B. Lippincott.
Machlup, F. and Mansfield, U. (Eds.) (1983). The study of information: Interdisciplinary messages. New York: John Wiley & Sons.
Meho, L. and Tibbo, H. (2003). Modeling the information-seeking behavior of social scientists: Ellis’ study revisited. Journal of the American society for information science and technology 54: 570-87.
Mosby’s medical, nursing and allied health dictionary, 5th ed. (1998). New York: Mosby-Year Book.
Pajarillo, E. J. Y. (2005). Contextual perspectives of information for home care nurses: Towards a framework of nursing information behavior (NIB). Dissertation Abstracts International. (UMI No. AAT 3167384)
Potter, P. and Perry, A. (1995). Basic nursing: Theory and practice. 3rd ed. St. Louis: Mosby Year Book.
Rubenfeld, M. G. and Scheffer, B. (1999). Critical thinking in nursing: An interactive approach. Philadelphia: Lippincott Williams & Wilkins.
Rubin, R. (1998). Foundations of library and information science. New York: Neal-Schuman Publishers.
Shneiderman, B. (1998). Designing the user interface: Strategies for effective human-computer interaction. Reading, MA: Addison-Wesley.
Soergel, D. (1985). Organizing information: Principles of data base and retrieval systems. London: Academic Press.
Wilson, T. D. (2000). Human information behaviour. Informing science 3: 49-55.

Ia C. McIlwaine
University College London

Joan S. Mitchell OCLC Online Computer Library Center, Inc., Dublin, Ohio, USA

The new ecumenism: Exploration of a DDC/UDC view of religion

Abstract: This paper explores the feasibility of using the Universal Decimal Classification’s revised religion scheme as the framework for an alternative view of 200 Religion in the Dewey Decimal Classification, and as a potential model for future revision. The study investigates the development of a top-level crosswalk between the two systems, and a detailed mapping using Buddhism as a case study.

1. Introduction
For some years the editors of the Dewey Decimal Classification (DDC) and the Universal Decimal Classification (UDC) have been seeking ways of furthering collaboration and fostering interoperability between the two systems. An opportunity has now presented itself in the need to provide a more universally acceptable approach to religion. DDC and UDC have a worldwide user base, and both classifications have to provide solutions to meet the needs of today’s multi-faith environment. The two systems are historically rooted in a firm Christian tradition, and each has attempted to accommodate itself to the modern world in the recent past. UDC implemented a totally new scheme six years ago (UDC Consortium, 2000). The editors of the DDC have also improved the structure of 200 Religion over the last ten years, but Dewey’s strategy has been largely incremental, supplemented by the ongoing provision of local solutions in the form of optional arrangements. This paper explores the feasibility of using the UDC religion scheme as the framework for an alternative view of 200 Religion in the DDC, and as a potential model for future revision. The study investigates the development of a top-level crosswalk between the two systems, and a detailed mapping using Buddhism as a case study.

2. The present situation

2.1 Religion in the DDC
In the past two editions, the Dewey editors have reduced the Christian bias in the 200 Religion schedule and provided deeper representations of the world’s religions. In DDC 21 (Dewey, 1996), the editors moved comprehensive works on Christianity from 200 to 230, relocated the standard subdivisions for Christianity from 201–209 to specific numbers in 230–270, and integrated the standard subdivisions of comparative religion with those for religion in general in 200.1–.9. They also revised and expanded the schedules for 296 Judaism and 297 Islam. DDC 22 (Dewey, 2003), the current print edition of the DDC, contains the rest of the relocations and expansions outlined in the two-edition plan. A key change at the top level in DDC 22 is the relocation of specific aspects of religion from 291 to the 201–209 span vacated in DDC 21. The numbers in the 201–209 span are used for general topics in religion, and as the source for notation to address specific aspects of religions in 292–299. Other improvements in DDC 22 include expansion of the sources of the Bahai Faith at 297.938, and revision and expansion of the developments in 299.6 for religions originating among Black Africans and people of Black African descent, and in 299.7–.8 for religions of American native origin. Even with these changes, 200 Religion continues to feature Christianity prominently at the three-digit level. At the present time, radical transformation of 200 Religion to give preferred treatment to another religion is only possible as a local solution using one of the five optional arrangements described under 290 Other religions:

Option A: Class the religion in 230–280, its sources in 220, comprehensive works on the religion in 230; in that case class the Bible and Christianity in 298

Option B: Class in 210, and add to base number 21 the numbers following the base number for the religion in 292–299, e.g., Hinduism 210, Mahabharata 219.23; in that case class philosophy and theory of religion in 200, its subdivisions 211–218 in 201–208, specific aspects of comparative religion in 200.1–200.9, standard subdivisions of religion in 200.01–200.09

Option C: Class in 291, and add to base number 291 the numbers following the base number for that religion in 292–299, e.g., Hinduism 291, Mahabharata 291.923

Option D: Class in 298, which is permanently unassigned

Option E: Place first by use of a letter or other symbol, e.g., Hinduism 2H0 (preceding 220), or 29H (preceding 291 or 292); add to the base number thus derived, e.g., to 2H or to 29H, the numbers following the base number for the religion in 292–299, e.g., Shivaism 2H5.13 or 29H.513

Option A vacates the numbers devoted to Christianity for use by another religion. Options B and C provide preferred treatment (and shorter or equivalent numbers) for a specific religion. Both explicitly derive notation directly from the schedules for the preferred arrangement. Option D provides preferred treatment and shorter numbers for a specific religion by relocating it to 298, a permanently unassigned number. Option E provides preferred treatment (and shorter or equivalent numbers) for a specific religion. Option E also uses notation derived from the schedules, but introduces the use of mixed notation. Each of these options presents some problems, and none gives the opportunity to provide an even-handed approach to the great religions of the world. There is little information on how Dewey users are using the five options. The Dewey editors recently surveyed Dewey users about the use of options, and received a total of fifty-six responses from thirteen countries (Mitchell, 2005). Only nine respondents reported use of one of the five options, and no one reported use of options D or E. Instead of adding yet another optional arrangement, the Dewey editors are studying the wholesale replacement of the current set of options with one alternative arrangement that might also serve as the future framework for 200 Religion. The new UDC religion scheme is a promising model.

2.2 Religion in the UDC
In 2000, UDC published a totally new classification for Class 2 Religion and Theology, which has subsequently been incorporated into the latest version of the system (UDC, 2005). The new version of Class 2 aims to rise above the biased approach of the earlier version and treat all religions equally. Broughton notes, “There is no concept of value or priority attached to the order of faiths; each is regarded as having equivalent status, even where this is not reflected notationally” (Broughton, 2000, 60). The classification is fully faceted and consists of a main table enumerating the major religions of the world in order of their date of foundation:

2 Religion. Theology
21 Prehistoric and primitive religions
22 Religions originating in the Far East
23 Religions originating in the Indian sub-continent
24 Buddhism
25 Religions of antiquity. Minor cults and religions
26 Judaism
27 Christianity. Christian churches and denominations
28 Islam
29 Modern spiritual movements

The historically based listing of religions is amplified through an auxiliary table which lists the principal categories and phenomena of religion, to provide for the expression of the needed concepts. In outline, it is as follows:

2-1 Theory and philosophy of religion. Nature of religion. Phenomenon of religion
2-2 Evidences of religion
2-3 Persons in religion
2-4 Religious activities. Religious practice
2-5 Worship broadly. Cult. Rites and ceremonies
2-6 Processes in religion
2-7 Religious organization and administration
2-8 Religions characterised by various properties
2-9 History of the faith, religion, denomination or church

This auxiliary table contains great detail under each of the above heads, and is used to amplify and create the necessary classmarks. Every main number can have as many concepts added on to the base notation as necessary to provide the detail needed to express the elements of a specific religion. Notation for multiple facets may be added to the number for any religion; the recommended citation order is retroactive in nature. For example, teaching in the Torah on divorce is 26-454-242:

26 Judaism
-454 Divorce
-242 Torah. The Law. The Pentateuch
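The retroactive citation order described above can be sketched as simple string synthesis: facets drawn from the auxiliary table are appended to the main number in descending notational order, so that the facet filing latest in the table comes first in the classmark. A minimal illustration in Python (the helper name is ours, and real UDC citation-order rules have more nuance than this sketch):

```python
def udc_classmark(main_number, facets):
    """Synthesize a UDC classmark by appending auxiliary-table facets
    to a main number in retroactive (descending notational) order."""
    ordered = sorted(facets, key=lambda f: f.lstrip("-"), reverse=True)
    return main_number + "".join(ordered)

# Teaching in the Torah on divorce, the example from the text:
# 26 Judaism + -454 Divorce + -242 Torah
print(udc_classmark("26", ["-242", "-454"]))  # → 26-454-242
```

The input order of the facets does not matter; the retroactive rule determines the filing order in the synthesized notation.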

In certain cases it is also necessary to provide specific detail for concepts associated with one faith specifically. Wherever this need arises, differential facets expand the base auxiliary table so as to provide for the specific needs of the subject, though in fact the number of places where the general auxiliary has been found insufficiently detailed is surprisingly small.

3. DDC-UDC view of religion

3.1 Top-level crosswalk
The authors are engaged in a study to explore the use of UDC’s Class 2 as a model for the development or replacement of 200 Religion in the DDC. As the first step in the project, we developed a general mapping between the two classification systems at the level of representation for each major religion, preserving the notational development under each religion. Table 1 contains an excerpt of the top-level crosswalk.

UDC | DDC
23 Religions of the Indian Subcontinent | 294 Indic religions
231 Vedism | 294.509013 Vedic religion
232 Brahmanism | 294.5 (in class-here note)
233 Hinduism narrowly | 294.5 Hinduism
234 Jainism | 294.4 Jainism
235 Sikhism | 294.6 Sikhism
24 Buddhism | 294.3 Buddhism
... | ...
26 Judaism | 296 Judaism
27 Christianity | 230 Christianity
28 Islam | 297 Islam

Table 1. Crosswalk between religions in UDC and DDC
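A top-level crosswalk such as Table 1 lends itself to a plain lookup table. The sketch below is our own illustration, not project code: it encodes a few of the mapped pairs and resolves a UDC number to its DDC counterpart by longest-prefix match, which is how a browsing view could route subordinate UDC classes to the mapped top-level DDC class.

```python
# Excerpt of the Table 1 crosswalk: UDC number -> (caption, DDC number).
UDC_TO_DDC = {
    "23": ("Religions of the Indian Subcontinent", "294"),
    "24": ("Buddhism", "294.3"),
    "26": ("Judaism", "296"),
    "27": ("Christianity", "230"),
    "28": ("Islam", "297"),
    "234": ("Jainism", "294.4"),
    "235": ("Sikhism", "294.6"),
}

def ddc_for_udc(udc_number):
    """Return the DDC class mapped to a UDC religion number,
    falling back to the longest matching prefix (e.g. 242 -> 24)."""
    for length in range(len(udc_number), 0, -1):
        entry = UDC_TO_DDC.get(udc_number[:length])
        if entry:
            return entry[1]
    return None

print(ddc_for_udc("24"))   # → 294.3
print(ddc_for_udc("242"))  # → 294.3 (prefix match: under 24 Buddhism)
```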

UDC and DDC both place general topics of religion at the beginning of the religion schedule; we did not attempt a mapping between those topics in the initial stage of the study. One problem that surfaced immediately in the top-level mapping of major faiths was the difference in the treatment of the Bible in the two systems. The UDC includes sources of religion within the development for each religion, with religion-specific enumeration provided where needed. For example, the books of the Jewish Bible and the Christian Bible are enumerated under Judaism and Christianity, respectively. Dewey provides a separate class for the Bible outside the development of Christianity and Judaism in recognition of its role as a source of both religions. For the purposes of the initial top-level mapping, we have retained the separate treatment of the Bible found in Dewey. We plan to explore other solutions at a later stage in the project. As part of the preliminary study, we plan to test the top-level mapping as a browsing view of the DDC, stripped of notation at the top layer. Such a browsing view will provide a chronological/regional structure for religion in Dewey while retaining the underlying DDC notation. The top-level mapping can also serve as the basis for a detailed development of an optional arrangement of the DDC based on the UDC structure. We are investigating two approaches to the detailed development of an optional arrangement using Buddhism as our case study: 1) use the UDC base number for the religion and apply the DDC schedule notation and number building instructions to that base number; 2) adopt the UDC structure in full.

3.2 Use of UDC 24 + DDC notation
In the first approach to a detailed development based on the UDC, we simply moved 294.3 Buddhism to an earlier position in the Dewey hierarchy by replacing 294.3 with 24, the UDC base number for Buddhism, and applying the Dewey notation beyond 294.3 directly to 24 Buddhism (Buddhism itself is represented by 240). At the three-digit level, this results in the following:

240 Buddhism
243 Religious mythology, interreligious relations and attitudes, and social theology
244 Doctrines and practices
245 Religious ethics
246 Leaders and organization
247 Missions and religious education
248 Sources
249 Branches, sects, reform movements

Below the three-digit level, topics may be added through application of standard subdivisions, existing schedule notation, or notation synthesized from add instructions. As an example of the last, 243 can be extended by the instructions found under standard notation 294.33:

Add to base number 294.33 the numbers following 201 in 201.3-201.7, e.g., social theology 294.337

In our UDC base 24 + DDC notation, social theology of Buddhism is represented by 243.7. The corresponding full UDC notation for the same topic is 24-43. The use of the UDC base number plus Dewey notation moves Buddhism from a placement in “other religions” in 290 to a neutral chronological/regional position based on the sequence found in UDC Class 2. DDC number-building instructions and internal and auxiliary tables are maintained. The result is similar in approach to the current Option B in the DDC, except that it provides a redistribution of all religions instead of giving prominent treatment to a single religion. There are limited benefits to this approach. The resulting notation does not correspond to the UDC beyond the first two digits. Also, the development itself carries over the limited development of Buddhism found in the current development of 294.3 in the DDC.
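The mechanics of this first approach can be sketched as a notation rewrite, assuming the conventional DDC practice of placing a decimal point after the third digit (the function name is ours):

```python
def relocate_buddhism(ddc_number):
    """Re-express a DDC 294.3 Buddhism number on the UDC base 24:
    keep the DDC development beyond 294.3, swap the base notation,
    and reinsert the decimal point after the third digit.
    294.3 itself becomes 240, per the outline in the text."""
    digits = ddc_number.replace(".", "")
    if not digits.startswith("2943"):
        raise ValueError("not a 294.3 Buddhism number")
    tail = digits[4:]
    new = "24" + (tail or "0")
    return new if len(new) <= 3 else new[:3] + "." + new[3:]

print(relocate_buddhism("294.3"))    # → 240
print(relocate_buddhism("294.337"))  # → 243.7 (social theology)
```

As the text notes, the rewrite preserves the DDC development, which is why the resulting notation agrees with the UDC only in its first two digits.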

3.3 Detailed mapping of DDC Class 294.3 to UDC Class 24
The second approach under study is to adopt the UDC structure in full as the basis for an optional arrangement of religion in the DDC. We decided as a first step to explore the issues arising from a detailed mapping of 294.3 to UDC Class 24. For the initial study, we limited our source mapping data to the 294.3 notation explicitly enumerated in the current editorial database for the full edition of the DDC. The Dewey editorial database currently contains eighty-six entries in the development for 294.3 Buddhism: twenty-nine schedule entries, plus fifty-seven synthesized index-only entries. Additional numbers may be built using standard subdivisions or add instructions in 294.3. The UDC database contains over 260 explicit entries for Class 24: the table outline of thirteen entries for Buddhism expanded through application of the general religion auxiliary table, plus some entries specific to Buddhism. We extracted the developments for Buddhism from the two systems, and developed a table of correspondence from the Dewey notation to the UDC notation. An excerpt based on Dewey classes 294.33 and 294.39 is included in Table 2.

DDC | UDC
294.333 Mythology | 24-264 Myths and legends
294.337 Social theology | 24-43 Social customs and practice. Social theology
294.391 Theravada Buddhism | 241 Hinayana (Theravada) Buddhism
294.392 Mahayana Buddhism | 242 Mahayana Buddhism
294.3923 Tibetan Buddhism (Lamaism) | 243 Lamaism
294.3923 Tibetan Buddhism (Lamaism) | 243.4 Tibetan Buddhism
294.3925 Tantric Buddhism | 243.2 Tantrayana. Tantric Buddhism
294.3925 Tantric Buddhism | 243.6 Vajrayana (= Tantric aspect of Mahayana)
294.3926 Pure Land sects | 242.5-795.2 Pure Land
294.3927 Zen (Ch’an) | 242.5-795.4 Ch’an
294.3927 Zen (Ch’an) | 244.82 Zen Buddhism
294.3928 Nichiren Shoshu and Sōka Gakkai | 244 Japanese Buddhism

Table 2. DDC-UDC mapping

The excerpt in Table 2 highlights some of the differences in structure and class definitions between the two systems. UDC makes a basic distinction between Chinese Buddhism and Japanese Buddhism; Dewey does not make such a distinction. For example, UDC places Ch’an under Chinese Buddhism and Zen Buddhism under Japanese Buddhism; Dewey groups both in a single category. The DDC category 294.3926 Pure Land sects includes Pure Land sects of Chinese and Japanese origin. The UDC database only specifies explicit notation for the Chinese version (242.5-795.2), but the same notation can be added to Japanese Buddhism to represent the Japanese version of the sect. Nichiren Shoshu is a sect in Japanese Buddhism, and Sōka Gakkai is its corresponding lay organization. UDC includes general provisions for sects and lay organizations in the main religion auxiliary table, but the Buddhism expansion in the UDC database does not include explicit notation for either concept. Lamaism and Tibetan Buddhism are equivalent concepts and treated as such in the DDC; in the UDC, Tibetan Buddhism is represented as a subdivision of Lamaism, and this should probably be revised. Tantric Buddhism is a subdivision of Mahayana in the DDC; the boundaries of UDC classes 243.2 and 243.6 are not immediately obvious. Table 3 contains the Dewey notation from 294.33 for which we did not find a match beyond Buddhism itself in the UDC database. In Dewey, there is a skeletal development for secular disciplines under 201.7, with direct addition from other schedules limited to specific social problems. The UDC does not specify combinations of religions and other disciplines explicitly in Class 2, but such classes are available through synthesis.

294.33 Religious mythology, interreligious relations and attitudes, and social theology
294.335 Buddhism and Islam
294.336 Religion and secular disciplines—Buddhism
294.3365 Science and religion—Buddhism
294.3367 Arts and religion—Buddhism
294.3372 Civil war—social theology—Buddhism
294.33723 Civil rights—social theology—Buddhism
294.33727 International relations—social theology—Buddhism
294.337273 Conscientious objection—social theology—Buddhism
294.3376 Social problems—social theology—Buddhism
294.337625 Poor people—social theology—Buddhism

Table 3. DDC numbers without UDC equivalents in Class 24

Other issues surfaced as we reviewed mappings from the rest of the Dewey entries in 294.3 to notation in UDC Class 24. In addition to level of development and enumeration/synthesis of topics, there are differences in the common auxiliary tables outside of religion. In Dewey, there are six auxiliary tables, Tables 1–6. The two tables which occur in UDC but are not reflected in DDC are UDC’s Table 1k-05 Common auxiliaries of persons, which is far more detailed than the provisions for persons in DDC’s Table 1 —08, and UDC’s Table 1k-02 Common auxiliaries of common properties, which has no comparable table in DDC. UDC also has Table 1k-04 Common auxiliaries of relations, processes and operations, but this post-dates the creation of Class 2 and should not affect the exercise. DDC Tables 1, 2, 5 and 6 would be principally used, and a few expansions may be needed, especially for languages and common forms, to accommodate the concepts spelled out in the UDC.

4. Next steps
Based on our preliminary study, we have identified a number of areas for additional study at the content and representational levels. At the content level, we have decided to focus on a careful study of branches, denominations, and sects below the main faith level. We suspect that the differences highlighted in the study of Buddhism are replicated throughout the two systems, and may point to areas in both systems in which further editorial work is needed. We believe a closer alignment of branches, denominations, and sects in the two schemes will contribute toward interoperability. At the representational level, we considered and rejected a development using the UDC base number plus existing DDC notation. We are currently studying an approach that uses the UDC main number coupled with revised DDC notation. In order to provide a basis for a revised development and to promote future interoperability, we need to undertake more detailed mappings between the religion schemes in the two systems. Because of the fully faceted nature of the UDC schedule and the differences in the two systems’ main religion auxiliary tables and common auxiliaries, we do not think it is possible to map precoordinated notation. We plan to focus our efforts on developing mappings at the facet level for topics within the religion schemes in both systems. We will then investigate using the mappings as a guide to developing 200 Religion based on the UDC scheme, but with standard Dewey notation. For example, UDC has more detailed provisions for rites and ceremonies in the main religion auxiliary table than those found in the corresponding DDC development in 201–209 Specific aspects of religion. A topic such as “Buddhist sprinkling rites” is classed in a number corresponding to the general topic “Buddhist rites” in the DDC; in the UDC, the topic is fully represented:

DDC: 294.3438
294.3 Buddhism
294.343 Public worship and other practices
8 Rites (notation derived from 203.8 Rites and ceremonies)

UDC: 24-536.1
24 Buddhism
536 Physical rites and ceremonies
536.1 Washing. Ablution. Immersion in water. Sprinkling of water
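The DDC side of such built numbers is an ordinary add-instruction synthesis: append to a base number the digits of a source number that follow a stated source base. A generic sketch of this procedure (the helper is our own illustration; it covers both the rites example and the earlier 294.33 social theology build):

```python
def ddc_add(base, source_number, source_base):
    """DDC-style number building: append to `base` the digits of
    `source_number` that follow `source_base`, then restore the
    decimal point after the third digit."""
    src_digits = source_number.replace(".", "")
    src_base_digits = source_base.replace(".", "")
    if not src_digits.startswith(src_base_digits):
        raise ValueError("source number does not extend source base")
    new = base.replace(".", "") + src_digits[len(src_base_digits):]
    return new if len(new) <= 3 else new[:3] + "." + new[3:]

# Buddhist rites: add to 294.343 the "8" following 203 in 203.8
print(ddc_add("294.343", "203.8", "203"))  # → 294.3438
# Social theology of Buddhism: add to 294.33 the "7" following 201 in 201.7
print(ddc_add("294.33", "201.7", "201"))   # → 294.337
```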

We can use this example to develop the general table in Dewey under 203.8. At the same time, we will also want to provide for full addition of the 201–209 development (with appropriate extensions specific to Buddhism) directly to the base numbers for Buddhism itself and for branches, sects, and movements.

5. Summary
The general chronological/regional approach taken by UDC in its revision of Class 2 Religion and Theology offers a promising model for an alternative view of religion in the DDC, and a possible framework for a future revision of 200 Religion. Our preliminary study supports the development of a browsing view of religion based on the UDC development at the major faith level. We have used Buddhism as a case study to outline some of the problems in developing detailed mappings between the two religion schemes. In the course of studying mappings at a detailed level, we uncovered structural differences and inconsistencies in both systems that must be addressed. The process of mapping concepts will lead to improvements in both systems, as well as providing links for interoperability. Our immediate efforts in this ongoing study will be focused on a review of branches, denominations, and sects below the major faith level in both systems, and on the development of mappings at the facet level.

Notes

DDC, Dewey, and Dewey Decimal Classification are registered trademarks of OCLC Online Computer Library Center, Inc.

References

Broughton, V. (2000). A new classification for the literature of religion. International Cataloguing and Bibliographic Control, 29(4): 59-61.
Dewey, M. (1996). Dewey Decimal Classification and Relative Index. Ed. 21. Edited by J. S. Mitchell, J. Beall, W. E. Matthews, Jr., and G. R. New. 4 vols. Albany, NY: OCLC Forest Press.
–––. (2003). Dewey Decimal Classification and Relative Index. Ed. 22. Edited by J. S. Mitchell, J. Beall, G. Martin, W. E. Matthews, Jr., and G. R. New. 4 vols. Dublin, OH: OCLC.
McIlwaine, I. C. (2000). The Universal Decimal Classification: A guide to its use. The Hague: UDC Consortium, pp. 256-260.
Mitchell, J. S. (2005). Options in religion: Survey results. Retrieved October 31, 2005, from http://www.oclc.org/dewey/discussion/papers/optionsinreligion.htm.
UDC Consortium. (2000). Theology and religion: New schedule. Extensions and Corrections to the UDC, no. 22: 81-142.
UDC: Universal Decimal Classification. (2005). Standard edition. Vol. 1. London: BSI.

María J. López-Huertas
Universidad de Granada, Spain

Thematic map of interdisciplinary domains based on their terminological representation: The Gender Studies

Abstract: The analysis of the terminological representation of interdisciplinary domains is a method that helps us identify terminological dynamics and the conceptual model needed to organize these spaces. This study applies the approach to Gender Studies in order to characterize the behaviour of the domain, the problems it poses, and a possible model for its organization.

1. Objectives

There are different kinds of interdisciplinarity, which can be identified by their modes of construction and the inherent structure shown by each interdiscipline (Caidi, 2002). The terminological representation of the domain plays a very important role in this process. The present contribution aims to describe the terminological behaviour of Gender Studies as an interdisciplinary domain. The study is based on the terminology retrieved from specialized documents, and complements previous research based on Internet search engine structures, as shown in their directories (López-Huertas & Barité, 2002; Marcella, 2002), and on structures found in thesauri (López-Huertas, Torres & Barité, 2004). The documents published in any discipline or specialized field reflect the state of knowledge of that discipline (Hjørland, 2002), and shed light on its epistemology, especially when a complete or substantial body of documents is considered instead of a sample used to represent the domain. This study considers the total output of publications on Gender Studies published in Uruguay. An analysis of the terminology extracted from the indexed documents may afford a perspective on terminological dynamics at the domain level. Accordingly, it will express how the different discourses present in the studied interdiscipline are represented, it will help in studying the possible problems caused by terminological drifting, and it will reveal whether or not the interdisciplinary domain of Gender Studies is capable of generating a univocal and exclusive language of its own. All in all, terminology is key to approaching the thematic map of an interdisciplinary domain, in this case Gender Studies.

2. Materials and Methods

Six hundred primary documents specialized in Gender Studies in the Spanish language, printed in Uruguay from 1990 to 2005, were indexed.¹ The documents were identified through the following sources: 1) Internet searches (Google) using the words feminism, women, woman, Gender and Uruguay; these searches led to web pages of women's NGOs and women's associations that maintain a library and reference section holding monographs, articles, etc., and an electronic journal that proved very helpful for obtaining terminology. 2) The catalogue of the National Library of Uruguay at Montevideo, which is the repository of the Legal Deposit. 3) Online catalogues of the Universidad de la República libraries, where a subject search was conducted using the words feminism, women, woman, Gender. The indexed documents were of the following types: monographs, periodical publications, conference proceedings, and research and socio-political reports. This variety has the advantage of including different discourses, which enriches the final vocabulary.

Abundant grey literature was produced in the Gender domain; reports of different natures and content made up a considerable part of the total documents. Free keywords were assigned after analyzing the content of the documents. Terms were selected from titles, abstracts and main headings in monographs, articles and reports. In the case of specialized dissemination journals, permanent sections or columns and headings were taken into account to extract terminology. The resulting terminology was stored in a relational database, built on Microsoft Access and designed for the occasion. It has seven fields: name of term, identification number, source of term provenance, source code, onomasiological variants, semasiological variants and context. The database can produce results on request that are very helpful for this type of study, such as lists of terms in Excel, frequencies of terms, sources of the terms, etc. The terminology in the database was later examined according to a quantitative methodology based on the frequency of the terms' appearance in the documents. This indicator is important because the interdiscipline in question is still in the process of consolidation, meaning that its terminology is unstable to a considerable degree; the higher or lower impact of the selected terms within the domain therefore provides important information. A qualitative analysis would be an interesting procedure to follow in order to complete the information that the quantitative method yielded; nevertheless, it has not been attempted in this study. This procedure helped in the identification of different kinds of terms, many of them closely related to their original disciplinary provenance, and showed how these "outsider" terms were incorporated into the interdisciplinary domain, according to the Gender epistemology.
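The seven-field term database just described could be modelled, for example, as follows. This is a hypothetical sketch using SQLite rather than Access; the table name, column names and sample rows are assumptions for illustration only.

```python
import sqlite3

# Hypothetical sketch of the seven-field term database described in the
# text, rendered in SQLite rather than Microsoft Access.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE term (
        id          INTEGER PRIMARY KEY,  -- identification number
        name        TEXT NOT NULL,        -- name of term
        source      TEXT,                 -- source of term provenance
        source_code TEXT,
        onomasiological_variants TEXT,
        semasiological_variants  TEXT,
        context     TEXT
    )
""")
rows = [
    (1, "gender",        "doc-017", "M", None, None, None),
    (2, "gender",        "doc-102", "R", None, None, None),
    (3, "glass ceiling", "doc-045", "M", None, None, None),
]
con.executemany("INSERT INTO term VALUES (?,?,?,?,?,?,?)", rows)

# Frequency of each term's appearance in documents, the indicator used
# in the quantitative analysis.
freq = dict(con.execute(
    "SELECT name, COUNT(*) FROM term GROUP BY name ORDER BY name"))
print(freq)  # -> {'gender': 2, 'glass ceiling': 1}
```

Term lists, frequency rankings and per-source counts of the kind mentioned in the text are then one `SELECT` statement each.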

3. Results

After studying the terminology, the following results were observed.

1. The terminology pulled from the Uruguayan documents analyzed totalled 537 terms: 460 were descriptors, 48 identifiers and 29 non-descriptors. The vocabulary showed dynamics similar to those of terms taken from Spanish documents. A small proportion of it is well represented in documents, but most of the terms are not, considering that their frequency is 1. Figure 1 shows the inverse relation between the number of descriptors and the frequency of appearance in documents. We see that 378 terms out of 489 have only 1 citation. This evidences a terminological dispersion, and therefore a thematic dispersion, confirming the results of similar studies of gender terminology with other terminological resources, such as the Internet and thesauri (López-Huertas & Barité, 2002; López-Huertas, Torres & Barité, 2004).

Figure 1. Descriptor citation frequency.

2. A small proportion of the terminology is found to be generated from the interdisciplinary activity itself. That is, it does not originate in any of the source disciplines that interact within the domain; rather, it emerges to denominate objects and phenomena created by the interdisciplinary domain itself (feminisms, gender, etc.). Such terms represent 32% of the total 460 descriptors. The terms belonging to this group can be considered nuclear ones, showing fairly univocal behaviour; their belonging to the interdisciplinary domain is beyond question. Nevertheless, they also participate in other general terminological dynamics found in the domain, and can show unstable behaviour, in that many terms are still in the process of being consolidated, or terminological drift may occur when the terms were originally generated in other disciplines (López-Huertas, Barité & Torres, 2003). Figure 2 shows the percentage of terms coming from the interdiscipline itself and terms that belonged to other fields.

Figure 2. Terms in Gender Studies and terms in outside fields

3. Many of the selected terms were originally generated in, or belong to, disciplines other than Gender Studies. This group is formed by terminology resulting from the aforementioned interaction of disciplines and specialties with the Gender perspective, and it represents 68% of the total descriptors. Terms in this set show a twofold behaviour: A) Terms adopted by the interdiscipline from other fields interacting with the Gender domain, with the same form and apparently the same sense as in their original realm. They are incorporated into the domain as a consequence of the interaction of disciplines that has taken place within the domain itself. Documents dealing with any of the interacting disciplines in Gender Studies (labour, health, education, politics, economy, etc., analyzed from the gender perspective) show a lack of terms of their own to name the subjects to be incorporated into the interdisciplinary domain. These dynamics cause a problematic phenomenon that may be referred to as terminological duplicity. The ambiguity resulting from such duplicity has negative effects on information systems. Specialists approach this problem at the level of scientific communication by adding to the titles of documents words such as women or Gender, or another unambiguous expression, together with words taken from the outsider discipline (Women and Health, women salaries, women in politics, etc.). In this way, the authors signal that the contents are Gender-oriented. This is noticeable in titles, where phrases such as "women labour market", "women and politics", "women and social security", etc., are common. Authors use such expressions to mark the limits of gender discourse; and while this might be effective at that level, the words they use to mark the gender domain cannot be used to index a document, because gender or women have no meaning in this context when isolated from topics such as politics, economy, etc.
The result is that the indexing expressions do not establish any gender distinction; apparently they maintain the form and meaning they had in the disciplines of origin. This complicates the indexing process for those documents: the gender marks have to be recovered in order to use them later in indexing systems. The problem is accentuated when we consider that these systems are moreover expected to work without reference to the gender perspective, as would be the case for the laws or norms that such systems require. This terminological group, while not belonging to the domain itself, has a significant presence in the interdisciplinary domain, and it generates terminology that poses ambiguity problems for the rest of the vocabulary. Terms in this group represent 74% of the terms from disciplines other than Gender (see Figure 3).
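The recovery of gender marks from titles could, for instance, be approached mechanically. The following is a hypothetical sketch; the marker list and pattern are assumptions for illustration, not part of the study's methodology.

```python
import re

# Hypothetical sketch: recovering the "gender marks" that authors add to
# titles ("women and politics", "women labour market") so that the mark
# can be re-attached during indexing.
GENDER_MARKERS = r"(women|woman|gender|feminis\w+)"

def split_gender_mark(title: str):
    """Return (has_gender_mark, residual topic words) for a title."""
    marked = re.search(GENDER_MARKERS, title, re.IGNORECASE) is not None
    topic = re.sub(GENDER_MARKERS + r"|\band\b", " ", title,
                   flags=re.IGNORECASE)
    return marked, " ".join(topic.split())

print(split_gender_mark("Women and politics"))  # -> (True, 'politics')
print(split_gender_mark("Labour market"))       # -> (False, 'Labour market')
```

A tool of this kind would only flag candidate gender marks; deciding whether the residual topic term suffers from the duplicity problem still requires the intellectual analysis described in the text.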

Figure 3. Terminological behaviour in group B).

B) A group of terms created by the interaction of the Gender perspective with outsider disciplines, representing either new concepts (wage-earning job, under-representation in politics, vertical discrimination, glass ceiling, housewife salary, etc.) or concepts that already existed yet gain importance because of the impact of the Gender perspective (domestic violence, sexual harassment, violence against women, etc.). This set represents 26% of the terms from the outside disciplines. Figure 3 illustrates this relation.

4. The subject composition of the terms from outsider disciplines is reflected in Figure 4. Disciplines such as Rights/Law, Labour, Politics, Customs, Family/Society, Health, Economy and Sexuality are clearly significant in forming the interdiscipline. Other subjects, such as Psychology, Culture, Administration, Body/Image, and Others (Demography, Religion and Groups), are of little importance to the Gender field in the sample.

Figure 4. Thematic composition of Gender Studies

If we group the subject areas under more general categories, it is easier to see how the Gender perspective interacts with other subject areas; Figure 5 shows this organization for the sample taken in this research. The main areas of interaction of Gender Studies are clearly those dealing with the Social Sciences, Health/Hygiene, and Economy/Business. The internal composition of each main area interacting with Gender Studies can be seen in Figure 6.

Figure 5. Main subject areas in Gender Studies

Figure 6. Subject composition of main disciplines interacting with the Gender Studies

5. The vocabulary created by the interdiscipline is that coming from group 2 (gender terms) plus that coming from the creation of new terms and concepts in outside disciplines, as explained in 3 B). Altogether, they total 258 terms out of 468 descriptors, representing 55.1% of the total.
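As a quick arithmetic check of the share just quoted, using the counts as given in the text:

```python
# Check of the reported share: 258 interdiscipline-created terms out of
# 468 descriptors (counts as reported in the text).
own_terms, descriptors = 258, 468
share = round(100 * own_terms / descriptors, 1)
print(share)  # -> 55.1
```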

4. Conclusions

The results obtained in this research lead us to the following conclusions:

1. There is a need for further research on the dynamics of interdisciplinary domains from the Information Science perspective.

2. It is important to identify and analyse the terminological representation of interdisciplinary domains, because their dynamics differ greatly from those of the disciplines per se. This has an important effect on the design and construction of indexing systems.

3. Terminology processed in the way described here brings into focus the thematic constitution of, and interaction within, interdisciplinary domains. This knowledge is interesting and necessary because it is a key issue in the representation and organization of interdisciplinary domains.

4. Terminological dynamics shed some light on the epistemological situation of the domain, revealing its weaknesses and strengths.

5. The study of terminological behaviour can tell us whether we are in fact talking about inter- or transdisciplinarity. It might be said that Gender Studies show transdisciplinary behaviour, although at present the bulk of knowledge is not yet involved. Nevertheless, the Gender perspective has penetrated many disciplines, altering their knowledge structures by creating new concepts and terms, modifying the traditional scientific structure. This modification, plus the creation of a new language to describe reality, are likewise transdisciplinary characteristics.

6. It is very enlightening to identify the impact of Gender Studies on discourse, especially for resolving the problem of duplicity, as a complement to the terminological methodology followed in this research. Discourse analysis can prove very helpful in delimiting this interdisciplinary domain, once any unclear terminological and conceptual behaviour has been detected.

Notes

1. We were also interested in the identification of secondary sources, such as specialized dictionaries, vocabularies, etc.; but since we were looking at the production of knowledge in Gender Studies in Uruguay, these publications could not be used because they were not printed in Uruguay.

This research has been carried out with the support of a sabbatical fellowship from the University of Granada, Spain.

References

Caidi, N. (2001). Interdisciplinarity: What is it and what are its implications for information seeking? Humanities Collections, 1(4): 35-46.
Hjørland, B. (2002). Domain analysis in information science: Eleven approaches, traditional as well as innovative. Journal of Documentation, 58(4): 422-462.
López-Huertas, María J. & Barité, Mario (2002). Knowledge representation and organization of Gender Studies on the Internet: Towards integration. In: Proceedings of the 7th International ISKO Conference, 10-13 July 2002, Granada. Edited by M. J. López-Huertas. Würzburg: Ergon Verlag, pp. 386-392.
López-Huertas, María J., Barité, Mario & Torres, Isabel de (2003). Terminología de Género para la recuperación de la información en documentos relacionados con los Estudios de las Mujeres [Gender terminology for information retrieval in documents related to Women's Studies]. In: Jornadas de Investigación Interdisciplinaria: Democracia, Feminismo y Universidad en el siglo XXI. Madrid: Universidad Autónoma.
López-Huertas, María J., Barité, Mario & Torres, Isabel de (2004). Terminological representation of specialized areas in conceptual structures: The case of Gender Studies. In: Proceedings of the 8th International ISKO Conference, 13-16 July 2004, London. Edited by Ia C. McIlwaine. Würzburg: Ergon Verlag, pp. 263-268.
Marcella, R. (2002). Women on the web: A critical appraisal of a sample reflecting the range and content of women's sites on the Internet, with particular reference to the support of women's interaction and participation. Journal of Documentation, 58(1): 79-103.

Edmund JY Pajarillo, Ph.D.
Molloy College, Rockville Center, NY, USA

A classification scheme to determine medical necessity: A knowledge organization global learning application

Abstract: The use and application of knowledge organization concepts and designs is pervasive in the current global learning society. Individuals, groups and organizations benefit from knowledge organization tools to improve and enhance learning and system workflows. Practitioners in many disciplines deal with conflicts and problems encountered in managing and enhancing information using solutions and strategies that are inherent and unique to their respective domains, yet germane to the Information Science field. Classification theory is one such Information Science concept, playing a crucial role in organizing the body of knowledge of every discipline, including Nursing. The medical necessity of clients who require home care nursing is often onerous to prove. Nurses can use a classification scheme to establish that nursing care is reasonable and necessary for potential home care clients. Such a classification mechanism can be useful in determining the criteria for skilled need required for reimbursement by Medicare and other health insurance payors. This research is an exploratory attempt at developing this practical tool. Nurses with home care experience were surveyed and interviewed using home care scenarios, with a focus on how nurses qualify home care clients' appropriateness for services and their medical necessity. The result was a tri-level classificatory scheme outlining this process, together with its ontological representation.

1.0 Introduction

Classification is an essential component in managing and organizing the body of knowledge of any given discipline. It provides order and structure to the concepts, theories, principles, issues, concerns and phenomena that make up the discipline's realm and scope of practice. The profession of Nursing is no exception. Insurance reimbursement for home care services is based on medical necessity. A certified home care provider needs to document and show evidence of reasonable, medically necessary care in order to be reimbursed by Medicare and most other health insurance. For instance, in October 2000, Medicare began requiring home care nurses to document their assessment of home care patients' needs using the Outcome and Assessment Information Set (OASIS). The OASIS is a comprehensive, specific and detailed assessment tool used to define the scope and level of care needed by home care patients. Nurses perform various activities and interventions simultaneously, or at different points in time, during the care episode of a home care patient. Often medical necessity is easy to prove, but in other cases it needs to be demonstrated, proven and well documented in the patient's record. A quick and easy classification tool will be helpful to nurses in qualifying potential services as essential and reasonable even before initiating patient care services.

2.0 Problem Identification & Statement of Purpose

Classification entails the cognitive processes of perception, critical thinking, conceptualization and problem-solving, as data are processed and placed in meaningful and rational groupings or order. Marco and Navarro (1993, 126) consider classification "central to human response in all the aspects of its relationship with its environment." It involves gathering data from internal and external resources, organizing and sorting these based on set criteria or likeness, analyzing them for rationality or meaning, either singly or in groups, and sorting concepts into categories as relationships between and among them are established.

Visiting a home care client for the first time to assess the potential for providing home care services is a daunting experience. It involves the visiting nurse gathering innumerable physical, psychological, social and environmental cues from the potential client, family or significant others. The nurse utilizes Federal, State, professional or personal tools, forms and guidelines to collect data, and makes written or mental notes helpful in analyzing the specific nursing needs of the client. All the physical signs and symptoms, pertinent history, and diagnostic and laboratory results are placed in relevant categories, from which the patient's problem list is generated. Additionally, all this information assists the nurse in determining the client's medical necessity for home care services. The Library of Congress Subject Headings (2003) lists only nursing assessments, observations, diagnoses and outcomes; it does not include a category for "nursing interventions". Furthermore, the subject heading "home care services" covers only certain home remedies or alternatives, and does not include "home care nursing interventions". The same findings hold for the Dewey Decimal Classification (2003): there is no subdivision for "home nursing interventions" either under the division "home care services" or under the class for "nursing." There are classification systems presently used in nursing, such as the North American Nursing Diagnosis Association (NANDA) taxonomy, which provides standardized nursing diagnoses, the Nursing Interventions (NIC) and Nursing Outcomes (NOC) classifications, and the Home Health Classification System (Dochterman and Jones, 2003). These scholarly and elaborate classifications have been in place and continue to evolve, providing various theoretical, clinical, administrative and system contributions and advantages.
However, what this research aims to achieve is a quick and easy mental-mapping work tool that nurses can use as a cognitive framework when making the initial determination of what home care services are required by a particular client and whether these conform to medical necessity parameters. Medicare pays for home care services for clients who are homebound and who require skilled nursing. Homebound is defined as the condition whereby a client is limited to the confines of the home because of severe weakness or unsteady gait, or when ambulation is so taxing and severely impaired that the client requires the assistance of another person or an assistive device. Skilled nursing is demonstrated in any one of the following four categories: (1) observation and assessment activities, (2) skilled procedures, (3) teaching and training of patients and significant others, and (4) management and evaluation of the plan of care (Medicare Home Health Agency Manual, 1998). Observing a client's response to chemotherapy, assessing for side effects of treatment, and following the progression of the patient's adjustment to performing activities of daily living (ADLs) are examples of the first category (observation and assessment activities). Skilled procedures (category 2) are more technical, precise and specific; examples are wound care, injections and physical therapy. Category 3, teaching and training activities, is exemplified when the nurse teaches the client or significant others about diet, medications and exercise; it is also illustrated by the nurse who trains the client to perform ADLs after a stroke episode. Examples of management and evaluation of the plan of care (category 4) are teaching and supervising the client and the family in diabetic management, as well as evaluating the effectiveness of the treatment, self-monitoring of blood glucose and self-administration of insulin.
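The four skilled-nursing categories and their example interventions can be sketched as a simple lookup. The category names come from the Medicare Home Health Agency Manual as quoted above; the mapping dictionary and function are illustrative assumptions, not an official coding scheme.

```python
# Illustrative sketch: the four Medicare skilled-nursing categories, with
# some of the text's example interventions assigned to them. The mapping
# is an assumption for demonstration, not an official coding.
SKILLED_CATEGORIES = {
    1: "Observation and assessment",
    2: "Skilled procedures",
    3: "Teaching and training",
    4: "Management and evaluation of the plan of care",
}

INTERVENTION_CATEGORY = {
    "observe response to chemotherapy": 1,
    "wound care": 2,
    "injections": 2,
    "teach diet, medications and exercise": 3,
    "supervise diabetic management": 4,
}

def skilled_category(intervention: str) -> str:
    """Name the Medicare skilled-need category for an intervention."""
    return SKILLED_CATEGORIES[INTERVENTION_CATEGORY[intervention]]

print(skilled_category("wound care"))  # -> Skilled procedures
```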
Classifying nursing needs into categories makes medical necessity more concrete and easily discernible. This classification work tool will also help the nurse focus the home visits, and the resulting documentation, on this categorization of medical necessity.

This exploratory research looks into a global learning application of knowledge organization, the use of classification in the discipline of Nursing, particularly in home care, and attempts to answer the following research question:

Can home care nurses use a classification scheme to determine medical necessity of home care services for their clients?

3.0 Conceptual Framework

Determining whether or not a client meets any of the criteria for medical necessity involves a series of classification steps. When a home care nurse first encounters a client, a string of problem-solving processes is carried out. Each step of the nursing process (assessment, intervention, and evaluation) involves information processing and critical thinking in order to advance to the next step. Problem solving begins during the initial home care visit with the first step in the nursing process, the assessment stage. The main goal is to assess the client and the social network for any home care nursing needs or diagnoses. A pertinent component of this assessment is qualifying the client for medical necessity, particularly when the client's services are to be covered by health insurance. From this assessment, the nurse establishes the needs of the client, plans appropriate nursing interventions and sets out indicators for evaluation. The nurse employs various data-gathering tools to complete the assessment phase, such as physical assessment skills, interviewing techniques, environmental survey or scanning, laboratory and diagnostic examinations, history taking, and observation. The nurse processes the data obtained during the assessment phase utilizing critical thinking skills. A salient aspect of critical thinking is being able to discriminate and categorize items from a whole set of alternatives. The nurse perceives the client, with all the presenting medical, physical, psychological and social needs, analyzes these data on the basis of her professional knowledge and judgment, and determines whether or not a problem or need exists. Once problems are identified, a plan of care is formulated; the same critical thinking comes into play when devising this plan of care.
A range of possible nursing interventions is sorted and classified based on the identified needs, and the most appropriate and relevant actions, interventions or tasks are considered for implementation. The next step is formulating parameters for determining effectiveness, and critically selecting and classifying the most suitable evaluation measures from a variety of choices. A by-product of this nursing process is a concise summary of the client's total home care needs and required nursing care, which the nurse uses to obtain a general perspective of the client's overall home care requirements. Some nursing interventions obviously spell out medical necessity, as in the case of physical therapy for a post-stroke client, or teaching a new diabetic how to monitor blood glucose and self-administer insulin. However, there are cases in which interventions, taken in isolation, may not be enough justification for medical necessity, and grouping several interventions together might be necessary. For instance, take the case of a demented client who does not know when to take his medications, wanders at night, forgets to eat and has no one to help in the home. This patient requires several different simple interventions, such as assigning a home health aide to assist with personal care, having a nutritionist set up a diet plan, pre-filling injectable medications, meeting with the family to set up a teaching schedule for medication pre-filling and administration, and other related nursing interventions. Taken collectively, these actions fall under the fourth classification of reimbursement for skilled nursing need: management and evaluation of the plan of care.

Providing a classificatory application to assist nurses in ascertaining the medical necessity of the client's needs for home care services is important at this time of burgeoning health care costs and managed care. Clients will receive appropriate and medically reasonable care that is efficient, focused and less costly. This knowledge organization work tool will not take the place of current classification schemes; it will be a practical and systematic instrument that nurses can use to mentally map their clients' needs and determine medical necessity.

4.0 Methodology

This is an exploratory, qualitative study, designed to pilot a classification scheme of nursing interventions based on Medicare's categorization of skilled needs. Two data-gathering tools were employed: scenario analysis and interview. Respondents were asked to examine and analyze home care scenarios lifted from real-life situations, and to list all the potential assessment data and patient care needs. From this information, the case study participants were asked to identify and prioritize all identifiable nursing diagnoses and the appropriate nursing interventions for each. A follow-up interview of the participants was set up to gather qualitative responses to help identify the cognitive processes used in devising their listing and prioritization. Specifically, the goals of the interview were to describe the assessment and critical thinking processes used by the nurses and how medical necessity was eventually determined from the list of overall nursing needs of the clients. Volunteer respondents were sought among nursing students of Long Island University in New York, USA, particularly those with past or current home care work experience. A total of five students participated in the case study.

5.0 Results and Analysis

The results of the case study show that nurses use classification in their home care practice, from the initial meeting with the client, through formulating the plan of care, until medical necessity is determined and a home care episode is established.

5.1 Concept Map of Tri-Level Classification

A striking observation from the case scenario analysis and interviews of the nurse-participants is their clear use of mental classificatory mapping. Hert (1997) describes concept mapping as a graphical representation of creative problem solving, or a pictorial display of the logical relationships between concepts. This is demonstrated by the nurse who embarks on a problem-solving task during the initial home visit to a client. This representation of mental problem-solving is illustrated in the tri-level classification process shown in Figure 1. Level I Classification is the initial step of determining the client's appropriateness for home care. It takes into consideration two components: adequacy of home care support and the client's safety in the home. This process is carried out by physically scanning and obtaining a general assessment picture of the client's home environment, including the amount of help and support available to assist. If the client is deemed inappropriate for home care, all other phases cease, as the patient is not accepted into the home care program. Adequacy of home care support refers to the availability of sufficient family, friends, neighbors or private help who are willing and able to assist in caring for the client. Obvious inadequacies in the support system present additional risks to the already impaired health status of the client. However, a limited support network might be acceptable once adjustments are made, such as rearranging the schedule of the support person, contracting for paid assistance, or supplementing care with help from a home health aide.

Figure 1: Tri-Level Classification Process for Determining Medical Necessity (using Medicare categories for skilled home care needs)

In terms of client safety, two aspects are examined. The home environment is appraised for any safety concerns such as adequacy of space, absence of accident hazards, and presence of heat and water supply. The client is also assessed in terms of physical condition – the level of severity of the illness that can be sufficiently and safely cared for in a home setting. This Level I stage addresses the initial categorization of whether or not a client is considered safe for home care. Both factors, safety and care support, must be reasonably adequate and addressed. Any deficiencies in either one should be adjusted, e.g., by providing the services of an aide to augment the care support, or by asking family or significant others to correct home environment issues prior to initiation of services. Otherwise, if there are serious doubts about the client's appropriateness in a home setting, no home care should be considered at all: progression to other levels in the classificatory work tool stops and is no longer feasible. The next phase ensues once a determination is made of the patient's safety. Level II Classification commences with the comprehensive assessment of the patient in terms of relevant physical signs and symptoms, diagnostic and laboratory results, and pertinent medical and family history. The nurse-respondents describe this process as going through all data, sorting these based on relevance, filtering out those that are unnecessary, organizing them into similar and logical groupings (those that render evidence to support a nursing diagnosis), and listing all applicable nursing diagnoses. Hand in hand with this stage is the ranking of priority problems. What follows is an itemization of all appropriate nursing interventions that address the pertinent nursing diagnoses. Level II is analogous to the three-stage problem solving process mentioned by Hert (1997, 54), namely: fact finding, idea finding and solution finding.
When nurses sort numerous assessment data, they are essentially fact finding, or identifying relevant signs and symptoms that spell out abnormalities or deviations from norms. Anything that does not appear normal is a reiteration of the fact that the client is indeed in a state of medical care need (assessment of nursing needs through cues, signs and symptoms). These data are then grouped, and nurses venture into idea finding (nursing diagnosis identification) from analyzing similarities, relationships and logical explanations within the groupings (domains of nursing diagnoses). The third and final step of solution finding is the counterpart of determining nursing interventions for the respective nursing diagnosis. Once the second phase of the process is accomplished, nurses proceed to Level III Classification – ascertaining medical necessity. From listing all possible nursing tasks and actions required by the patient, nurses then group these into similar or related criteria. Finally, nurses determine which of the categories for skill needs fulfill Medicare guidelines that will permit reimbursement (observation and assessment, teaching and training, skill procedure, and management of the plan of care). Interventions may fall under one or more of these skill categories. The entire tri-level classificatory scheme of determining medical necessity is both taxonomic and comparative in nature. It is taxonomic in the sense that nurses gather data from various sources, which are then sorted into similar categories to distinguish likeness and delineate differences between and among them. It is comparative because the classificatory mechanism enables nurses to examine variables as they exist and affect each other in varying dimensions and perspectives.
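The tri-level process described above can be read as a sequential decision procedure. The sketch below is a hypothetical illustration of that reading, not an implementation from the study; all function names, data shapes, and the category labels are paraphrased from the paper for the sake of the example.

```python
# Hypothetical sketch of the tri-level classification process described above.
# Level I: screen for home-care appropriateness (support adequacy and safety).
# Level II: sort assessment data into prioritized nursing diagnoses.
# Level III: map interventions onto Medicare's four skilled-need categories.

MEDICARE_SKILL_CATEGORIES = {
    "observation and assessment",
    "teaching and training",
    "skilled procedure",
    "management of the plan of care",
}

def level_one(support_adequate: bool, home_safe: bool) -> bool:
    # Both conditions must hold; otherwise the whole process halts.
    return support_adequate and home_safe

def level_two(assessment_data: list[dict]) -> list[dict]:
    # Keep only relevant findings and rank them by priority
    # (lower number = higher priority), as the nurses described.
    relevant = [d for d in assessment_data if d.get("relevant")]
    return sorted(relevant, key=lambda d: d["priority"])

def level_three(interventions: dict[str, set[str]]) -> dict[str, set[str]]:
    # An intervention is "medically necessary" if it falls under at least
    # one Medicare skilled-need category; interventions may match several.
    return {name: cats & MEDICARE_SKILL_CATEGORIES
            for name, cats in interventions.items()
            if cats & MEDICARE_SKILL_CATEGORIES}

def determine_medical_necessity(support_adequate, home_safe,
                                assessment_data, interventions):
    if not level_one(support_adequate, home_safe):
        return None  # client not accepted into home care; later levels cease
    diagnoses = level_two(assessment_data)
    return level_three(interventions) if diagnoses else {}
```

The early return after Level I mirrors the paper's point that an inappropriate home setting stops all further classification.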

5.2 Medical Necessity Determination Ontology

Based on this classificatory work tool, an Ontology for Determining Medical Necessity has been devised as a computational representation of the tool (Figure 2). This helps codify the procedure for programming purposes when used in the education, practice

Figure 2: Ontology of Medical Necessity Determination for Home Care Services

and research fields of nursing. It is also particularly instrumental when training new home care nurses to gain the knowledge and expertise needed in qualifying clients for home care, defining their plan of care, and ascertaining the medical necessity of their home care requirements. It is also of value to experienced nurses for navigating the essential components of this three-tier process and improving their efficiency in accomplishing their nursing responsibilities.

This dynamic ontology describes transitions and processes. It can represent one state or process at a time, two or more concurrent activities (symbolized by two parallel lines, "||"), one event resulting from another (sequencing, symbolized by ";"), conditionals (if-then relationships), iteration (while-do), and other programming constructs and symbols (Jurisica, Mylopoulos & Yu, 1999, 486). Going through the steps of the ontological process illustrated in Figure 2, the nurse gathers specific data relating to the patient's general safety. This includes the adequacy of care support and the safety of the client. These two activities are concurrent and are represented by "||". Both conditions must be satisfied: there is adequate support and the client is safe. If one or the other is not, the process halts. Once this level is accomplished, the nurse moves to the next phase: gathering, sorting, organizing, and analyzing data. Data that are relevant and useful are grouped and processed into pertinent nursing diagnoses; those that are not are discarded or set aside. All nursing diagnoses are then prioritized, and relevant nursing interventions identified, listed and ranked. Once the identification of problems and solutions is completed, the nurse proceeds to the final step: grouping and establishing medical necessity for skill needs.
All nursing actions and tasks that are listed to address specific problems are analyzed and classified into one of the four Medicare criteria for medical necessity. Viewed through this dynamic ontology, the sequential relationship of each step in the entire three-level classificatory process becomes clearer, and a better understanding of the overall process of determining medical necessity is achieved.
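The process notation the ontology borrows from Jurisica, Mylopoulos and Yu (";" for sequencing, "||" for concurrency, if-then conditionals) can be illustrated with a small sketch. Everything below is hypothetical: the combinators and the step names are paraphrases of Figure 2's Level I, not the paper's actual ontology encoding.

```python
# Minimal illustration of the dynamic-ontology notation described above:
# seq(...) plays the role of ";", par(...) plays the role of "||",
# and cond(...) models an if-then gate between levels.

def seq(*steps):
    # Sequencing (";"): run each step in order, passing state along.
    def run(state):
        for step in steps:
            state = step(state)
        return state
    return run

def par(*steps):
    # Concurrency ("||"): steps are independent of one another;
    # each sees the same input state, and their results are merged.
    def run(state):
        merged = dict(state)
        for step in steps:
            merged.update(step(state))
        return merged
    return run

def cond(test, then_step):
    # Conditional (if-then): proceed only when the test holds;
    # otherwise mark the process as halted, as in Level I screening.
    def run(state):
        return then_step(state) if test(state) else {**state, "halted": True}
    return run

# Hypothetical Level I of Figure 2: assess care support and client safety
# concurrently, then continue to data gathering only if both are satisfied.
assess_support = lambda s: {"support_ok": s.get("support", 0) >= 1}
assess_safety  = lambda s: {"safety_ok": s.get("hazards", 0) == 0}
gather_data    = lambda s: {**s, "data_gathered": True}

level_one = seq(
    par(assess_support, assess_safety),
    cond(lambda s: s["support_ok"] and s["safety_ok"], gather_data),
)
```

The same three combinators could, under these assumptions, chain Levels II and III into one expression, mirroring the sequential structure the ontology makes explicit.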

6.0 Conclusions

Results of this exploratory study reiterate that classification is indeed a focal point of human informational activity. The classificatory work tool for ascertaining medical necessity is an example of how knowledge organization concepts serve as global learning applications, here in the discipline of nursing. Nurses may appear simply as men and women who provide physical and psychological ministration to the ill, but aside from the nursing process used as a problem-solving framework, nurses also apply classification concepts in their work. The nurse-respondents in this case study demonstrate the use of a tri-level classification system which can serve as a quick and easy guide when qualifying would-be home care clients for medically reasonable and necessary services.

Further research is recommended, such as replicating this study with more case study participants and with other home care nurses. Other variables can be taken into consideration, including the nurses' education and years of experience and the type of insurance coverage that clients present with. Differences in education and experience among nurses might produce notable variations in results, and insurance payers differ in their service requirements, which might also lead to divergent outcomes. These and other factors might affect this classificatory process and should be explored.

7.0 References

Dewey decimal classification (2003). 22nd ed. In OCLC Connexion [online]. Dublin, OH: Forest Press, OCLC, 2003-2004 [cited 22 January 2004]. Available from World Wide Web: http://connexion.oclc.org.
Dochterman, J. M. & Jones, D. (2003). Unifying nursing languages: The harmonization of NANDA, NIC and NOC. Washington, D.C.: NursesBooks.org.
Hert, C. (1997). Understanding information retrieval interactions: Theoretical and practical implications. Greenwich, CT: Ablex Publishing.
Jurisica, I., Mylopoulos, J. & Yu, E. (1999). Using ontologies for knowledge management: An information systems perspective. In L. Woods (Ed.), Knowledge: Creation, organization and use. Proceedings of the 62nd Annual Meeting of the American Society for Information Science, Volume 36. Medford, NJ: Information Today.
Library of Congress subject headings (2003). 26th ed. Subject Cataloging Division, Processing Department. Washington, D.C.: Library of Congress.
Marco, F. & Navarro, M. (1993). On some contributions of the cognitive sciences and epistemology to a theory of classification. Knowledge Organization, 20, 126-132.
Medicare Home Health Agency Manual (2003). U.S. Department of Health and Human Services, Centers for Medicare and Medicaid Services. Baltimore, MD.

Steven J. Miller, Melodie J. Fox, Hur-Li Lee, and Hope A. Olson School of Information Studies, University of Wisconsin-Milwaukee, USA

Great Expectations: Professionals’ Perceptions and Knowledge Organization Curricula

Abstract: A disparity seems to exist between the expectations that librarians have of education for knowledge organization (KO) and what is taught in accredited master's programs across the United States and Canada. Analysis of official competencies, AUTOCAT discussion list postings, previous studies, and LIS curricula suggests that although many practitioners express this mismatch, the data reveal that KO competencies are hardly marginalized in curricula, and there is a large degree of consensus on what educators should and do offer. The analysis also indicates that there may be a "mismatch" in perceptions of the state of KO education and curricular offerings today within both the practitioner and educator communities.

1. Introduction

It is a scandal that a person can obtain an MLS from some library schools, having learned little or no cataloging. It cannot be said often enough that cataloging--a broad understanding of the patterns and structures of bibliographic control--is essential for every librarian. – Michael Gorman (2002)

This statement from the editor of the second edition of the Anglo-American Cataloguing Rules and 2005/06 president of the American Library Association (ALA) is echoed by others like it scattered throughout the library literature. They indicate a seeming disparity between the expectations that librarians and other information professionals have of education for the organization of information (OI) and what is taught in accredited master's programs (the standard professional qualification) in library schools across the United States and Canada. Two questions arise: Do educators offer what professionals say is needed? If not, what are the discrepancies between them? Focusing on education for knowledge organization (KO), this paper examines four sets of data: official competencies, discussion list messages, a meta-analysis of previous studies, and current curricula.

2. Literature Review

The perceived mismatch between the expectations of practitioners and what library and information science (LIS) educators are providing has been a recurring topic in the LIS literature on education for more than a century (Sellberg, 1988). The discussions and research can be grouped into three overarching, perennial themes: theory versus practice in LIS education, the number and requirement level of course offerings in the field, and the content of those courses. Since 1999, the literature has ranged from personal opinion pieces to empirical surveys. A few writers have explicitly addressed KO topics in LIS education (e.g., Williamson, 1997; Oyler, 2002). Concern over LIS education in cataloging, classification, OI and KO remains high, but it is difficult to make a simple split between practitioners and educators. Both groups express dismay at the perceived de-emphasis on, and devaluing of, cataloging and KO in LIS schools; the perceived reduction in course offerings; the challenges of integrating computer technologies and digital and other nonbook resources into the cataloging curriculum; and the requirement of general OI courses instead of traditional cataloging and classification (e.g., Taylor, 1996; Hill & Intner, 1999; Hill, 2002; Saye, 2002; Hill, 2003).

In contrast, some practitioners expect that new graduates will require extensive on-the-job training (Riemer, 1993; Hill, 1997), and there seems to be a growing recognition among both practitioners and educators that KO education today must encompass not only substantial coursework, but also on-the-job training, continuing professional education, and life-long learning (Intner, 2002). Several authors find a growing degree of convergence among practitioners and educators on the importance of teaching theory over practice and providing students with a broader array of KO principles than traditional cataloging and classification alone (Letarte et al., 2002). Some note an increase in the number of introductory-level courses in LIS schools when cataloging and classification and OI courses are considered together, evidencing a strong commitment to the teaching of OI concepts, albeit with a shift toward a broader, more theoretical approach (Vellucci, 1997). This contrasts with those educators and practitioners who believe that instruction in traditional cataloging, rather than a general OI course, provides the best, most concrete way to give students that knowledge and those skills.

3. Methodology

The methodology for this study is a discourse analysis enacted by close, critical reading of texts to reveal practitioners' and educators' understandings and beliefs about education for OI. Four sets of texts are examined:

1) Official competencies developed by professional bodies that represent traditional librarians, information scientists, and specialists in OI. Generally, the competencies are developed through soul-searching discussion and represent the profession's ideals of what a member of the profession should know.
2) Messages concerned with OI and KO education posted on AUTOCAT, "an electronic forum for the discussion of all questions relating to cataloging and authority control in libraries," from March 1999. Together these form a composite of individual, rather than organizational, voices and so bring a different perspective than formalized competencies.
3) Empirical studies of practitioners' and educators' views and of the curricula of library schools, considered in a meta-analysis to identify documented constants, trends, and changes in ALA-accredited programs.
4) Library school web sites, as they reflect curricula, examined to reveal what courses are required; what electives are offered devoted to fundamental topics in OI; and what other courses support, include aspects of, or are otherwise related to OI. Included are accredited master's programs in the United States and Canada.

With analysis of these texts, themes are allowed to emerge, as described in the findings, to illustrate the views of practitioners and the state of curricula.

4. Findings

4.1 Competencies Suggested by Professional Associations

A selective sample of Web sites of American and Canadian professional associations provides policies on professional competencies, in the form either of a list of competencies, knowledge, or skills, or of a list of core curricular areas. These professional associations form a general consensus that organizing and providing access to information resources is a core competence in LIS. With its focus on library technical services, ALA's Association for Library Collections & Technical Services articulates more specifics about competence in subject analysis in its educational policy statement. Those specifics can be summarized as:

1. Knowledge of the theory for subject analysis;
2. Knowledge of methods and standards (national and international) for subject analysis, including thesaurus creation, subject indexing and classification;
3. Knowledge of activities, including those in subject analysis, that must be performed to provide the products and services users need, and of tools that are used in those activities; and
4. Ability to develop and apply tools for subject analysis, including syndetic structure and controlled vocabulary.

An additional relevant statement taken from the Core Competencies of the Congress on Professional Education indicates:

5. Understanding of how to apply the principles and standards in practical, cost-effective operations.

4.2. AUTOCAT

The theme of general cataloging education reappears yearly on AUTOCAT. Many postings reflect a sentiment that the field will experience a shortage of quality catalogers, which leads to debate over the responsibility for educating new catalogers, particularly in advanced skills. Most discussions speak broadly of cataloging rather than specifically of KO; however, subject analysis and classification are commonly mentioned as weak skills in rookie catalogers. The prevailing feeling is that library schools do not sufficiently train graduates to be immediate practitioners, and advanced KO skills need to be learned elsewhere. While most posters agree that a practicum trains potential catalogers best, practical experience is not required in most schools, nor always available. The onus therefore falls on library school curricula or entry-level cataloging jobs. Though many libraries expect to train entry-level catalogers, others do not have the resources to do so, which returns the obligation to library schools to provide the training that small libraries and organizations cannot. Faculty lead the defense of cataloging education, justifying the quality of what is taught but lamenting the limited time they have to teach it. They argue that library schools exist to teach about all aspects of librarianship and that instructors have little time to teach the growing number of specialized skills. Arlene Taylor (1998) points out, "[t]he amount of material [in cataloging courses] to be covered today is 15-20 times what it was when I started teaching in 1971." As expectations rise, time limits become more restrictive. Courses beyond OI also have ambitious objectives, yet the overstuffing of courses results in an imbalanced understanding of the concepts.
Some instructors try to balance theory and practice, while others focus on application to combat criticism that graduates cannot "hit the ground running." While reports of what is taught versus what is remembered by students can be unreliable, the conflict clearly lies in how instructors are forced to prioritize concepts within a 15-week semester schedule. As a result of the time constraints, most instructors spend around five to six weeks on subject analysis and classification:

[T]he core problem lies in the presumption that descriptive cataloging, subject analysis, LC call numbers, Dewey, LCSH, and authority control can be taught in one semester…Each of these subjects could easily be given an entire semester … and still come up short on giving new catalogers what they need to go into a cataloging job with a resemblance of competency for the job. (Edrington, 2004)

In a little under half of library school catalogs, subject analysis exists as an elective course, but demand and faculty availability determine how often it is offered, if ever.

As the debate continues, few have proposed solutions beyond establishing competencies or requiring a practicum. One instructor, however, suggested that integrated library systems could help by automating more of descriptive cataloging so that instructors "could spend more time on subject analysis and authority control and less time on prescribed punctuation" (Frances, 2004). Additionally, many posters betray an underlying assumption that library school graduates "know" during school that they want to become catalogers, yet the reality is that students sample many areas of librarianship because budgets have forced librarians to be cross-disciplinary.

4.3. Meta-analysis

Several studies have examined the place of KO in US and Canadian library education from two angles: what it should be and what it is. This meta-analysis compares the existing empirical studies to the extent that their variables are roughly comparable. Dates in the tables are the dates of data collection; dates in the text are dates of publication.

Likert scale, 1 = most important; numbers in parentheses give the topic's rank among all OI topics.

Columns: Practitioners – MacLeod & Callahan (1994), heads of cataloging (n=84); Turvey & Letarte (2001), heads of cataloging, reference, & other (n=23+29+70), with sub-columns Cat. / Ref. / All. Educators – MacLeod & Callahan (1994), reference and cataloging educators (n=55+65); Letarte et al. (2001), cataloging educators (n=42).

Dewey Decimal Classification: 2.08 (33); 2.75 (21); 2.8 (39); 1.75 (10); 1.97; 1.86
Library of Congress Classification: 1.84 (21); 2.37 (8); 1.76 (10); 1.76 (11); 1.72; 1.5
Other [classification] systems: 3.72 (31); n/a; 3.65 (30); n/a; n/a; n/a
Classification: knowledge of theory: 1.67 (9); n/a; 1.86 (15); n/a; 1.52; 1.61
Understanding relationship between classification schemes and shelf order: 1.87 (27); n/a; 1.79 (11); n/a; 1.46; 1.78
Classification: knowledge of methods: 1.83 (19); n/a; 2.15 (27); n/a; 1.72; 1.61
Subject analysis: knowledge of theory: 1.76 (15); 2.05 (4); 1.92 (19); 1.38 (1); 1.66; 1.65
Subject analysis: knowledge of methods: 1.77 (17); 2.13 (22); 1.66; 1.57
Library of Congress Subject Headings: 1.76 (16); n/a; 1.74 (7); n/a; 1.59; 1.36
Authority files: subjects: 2.53 (11); n/a; 1.54 (7); n/a; n/a; n/a
LC Subject Cataloging Manual: 2.53 (14); n/a; 2.05 (13); n/a; n/a; n/a
Ability to develop and apply syndetic structure and controlled vocabulary: 1.84 (20); n/a; 2.1 (21); n/a; 1.86; 1.82
Thesaurus creation: knowledge of theory: 1.9 (29); n/a; 2.21 (29); n/a; 1.9; 1.7
Thesaurus creation: knowledge of methods: 2.09 (35); n/a; 2.52 (34); n/a; 2.21; 1.87

Table 1: Comparison of practitioners' and educators' views

The two major studies, MacLeod and Callahan (1995) and Turvey and Letarte (2002) together with Letarte et al. (2002), used questionnaires to compare what practitioners and educators believe should be included in KO education. Two factors are of particular note. First, educators give either very similar or greater importance to KO competencies than do practitioners. Second, in Turvey, Letarte, et al., theoretical knowledge was consistently ranked higher than knowledge of methods by both practitioners and educators. The importance of KO topics relative to other OI topics can be seen by comparing where participants in the two studies ranked those topics in the full lists of competencies (Table 1, numbers in parentheses). These results are far more mixed than the average Likert scores.

The second type of study looks at what is actually included in the curriculum. Two factors arise: what is required and what KO electives are offered. Unfortunately, data collection methods vary and variables are not consistently defined. Response rates also vary between studies, affecting sample size. This analysis uses percentages only, because of the sampling variations; percentages are rounded because more specific comparisons are unlikely to be meaningful. Joudrey compares data from Spillane (1999), Vellucci (1997), and CCQ (1987) with his own; Irwin (2002) compares his data with Marco's (1994); and Hsieh-Yee (2004) compares her data to that of Vellucci, Spillane, and Joudrey. This analysis borrows from these previous efforts at meta-analysis. Table 2 records the percentage of schools that require some kind of course in OI, broken down between schools requiring courses that cover OI in general and those requiring a traditional cataloging and classification course. The table shows no definitive diminution in the number of schools requiring at least one course of some kind in OI, with courses covering OI broadly and conceptually gradually replacing traditional cataloging and classification courses.

                               Total     OI        C&C
CCQ (1986), n=55               –         18.2%     78.2%
Marco (1994)                   81.6%     –         –
Vellucci (1997), n=52          100.0%    38.0%     63.0%
Spillane (1998), n=56          –         32.1%     55.4%
Markey (2000), n=47            81%       –         –
Irwin (2001)                   85.4%     –         –
Markey (2002), n=54            85%       –         –
Joudrey (2002), n=48           91.7%     47.9%     43.8%

Table 2: Required courses

Whether required OI courses are of the conceptual or traditional type, KO competencies are likely to be included in all of them. However, no school is recorded as requiring a course completely devoted to a KO topic. Table 3 draws on the empirical data of five earlier studies to show elective courses that include some aspects of classification and may include thesauri and subject headings. Joudrey also points out that thesaurus construction is often a major portion of indexing courses, which he found offered at 81% of the schools in his sample (2002, 85).

                                    Classification  Subject     Thesaurus      Subject analysis
                                                    cataloging  construction   and search
CCQ (1986)                          15%             27%
Vellucci (1997), n=52               25%             25%
Spillane (1998), n=56               18%             18%
McAllister-Harper (1998/99), n=16   6%              25%
Joudrey (2002), n=48                15%             21%         17%

Table 3: KO electives

Due to limitations of the data and their comparability, this analysis cannot indicate any trends in electives devoted to KO competencies. The discussion in these studies suggests difficulty in defining what subject cataloging or subject analysis courses are, which may indicate inconsistency in what these courses include from one school to another.

4.4. Curricula

Course catalog descriptions listed on the websites of 54 English-speaking ALA-accredited LIS schools illustrate offerings in key KO competencies. Courses related to KO fall into three groups: core OI courses, which include courses dedicated specifically to KO topics as well as general OI courses; supplemental KO courses, whose content requires application of KO concepts; and courses with fractional coverage of KO, which include KO as one unit or part of a topic.

The core category covers courses devoted solely or significantly to KO. Introductory OI courses are the most common; they typically consist of some combination of bibliographic control, authority control, standards, classification, subject analysis, controlled vocabularies and an introduction to bibliographic tools. Advanced core courses, usually electives, cover cataloging and classification, indexing, abstracting, metadata, information architecture, enumerative bibliography and other advanced OI topics. Core courses dedicated exclusively to KO cover thesaurus construction, controlled vocabularies, classification, natural language processing, ontologies, taxonomies or the semantic web; in some cases these are offered as special topics rather than established courses. As shown in Table 4, most schools require a general OI course.

Required Courses                       Schools   Percentage
OI only                                43.5      80.6
Cataloging & Classification only       4         7.4
Technical Services                     2.5       4.6
None                                   2         3.7
Other, but includes OI/KO concepts     2         3.7
Total                                  54        100

Table 4: Required Courses as of Fall 2005

In supplemental KO courses, KO principles are in use but do not make up the central matter of the course. In information technology courses such as library automation, MARC formatting and XML, students apply KO principles to the design and creation of databases, OPACs and other systems, but the courses' explicit content concentrates on technical aspects. Knowledge management (KM), an emerging field, focuses on corporate environments but includes the traditional KO concepts of ontologies, taxonomies, thesauri and indexing. Nineteen schools offer at least one KM elective and two offer KM concentrations.

Fractional KO courses dedicate a unit or section of the course to KO. Information science courses are included because they contain concepts relevant to the construction, application and evaluation of subject access (precision and recall, specificity and exhaustivity) and because of the innovations in controlled vocabularies, thesauri in particular, originating in the documentation movement. Subject- or material-specific courses, such as government documents, or special types of librarianship like law, art, or music librarianship, fractionally cover subject analysis and classification in the context of the particular discipline. Digital libraries courses also require application of KO principles, explicitly addressing them in the course.

How well does this array of courses cover the competencies laid out by the professional organizations? Required OI courses cover Competencies 1 and 2, knowledge of the theory and standards for subject analysis. The rest of the competencies are covered in elective classes. A slight majority of schools (29) cover Competency 3, the practice of subject analysis, in cataloging and classification or its equivalent, but not in great depth. Twenty-four schools, however, offer a practical class beyond cataloging and classification and indexing/abstracting.
Competency 4's skills of developing and applying tools are also covered in cataloging and classification, and in more depth in indexing and abstracting, controlled vocabulary, natural language processing and subject analysis courses; all but seven schools offer at least one of these classes, and more may be offered as special topics. It is difficult to tell how much classes focus on the cost-effectiveness described in Competency 5. Thirty-one schools either offered a technical services class or mentioned copy cataloging or cataloging department administration in other course descriptions.

KO concepts and competencies are more prevalent in library curricula than it appears: they turn up not only in courses specifically dedicated to them, but also implicitly, either by application or by coverage of particular formats. While many KO-specific courses exist in curricula listings, how often these classes are actually offered is unknown. While traditional cataloging and classification and indexing and abstracting courses remain steady, KO seems to be growing in popularity in areas outside of cataloging, such as digital libraries, special librarianship, and knowledge management.

5. Conclusions

Is there a mismatch between library practitioners' expectations of what should be taught and what is actually offered in the area of subject analysis? The above discourse analysis suggests a partial "yes" and a partial "no." It indicates that although many practitioners express this mismatch, professional competency documents from the practitioner community and some survey data reveal a large degree of consensus on what educators should offer. The analysis also suggests that there may be a greater "mismatch" among individuals in both the practitioner and educator communities who hold different perceptions of the state of KO education and curricular offerings today. What many view as basic and essential knowledge related to OI, and more specifically KO, is covered by at least one required course in the majority of ALA-accredited LIS programs (52 out of 54). In addition, LIS programs offer a wide range of electives, including some courses focused on KO competencies, both theoretical and practical, and many others that require application of KO principles to the design and creation of databases and services. Hardly anyone could conclude from these data that KO competencies are marginalized in LIS curricula. One possible explanation for the complaint that such competencies "are no longer central to, or even required by, today's LIS curricula" (Gorman, 2003) is that a typical required course covering these topics rarely has "bibliographic control," "cataloging" or "classification" in its title, making its relevance less obvious to some. On the other hand, the depth of curricular coverage, particularly in practical cataloging skills, appears to be a real point of disagreement among many within both the practitioner and educator camps. The main issue centers on whether a master's student should learn enough in school to become a qualified cataloger at graduation.
Some practitioners and educators agree that on-the-job training over a period of years, not weeks or months, is an indispensable part of well-rounded cataloging education.

6. Future research A significant gap in the research to date is the lack of surveys of the expectations of practitioners in small and medium-sized libraries, especially school, public, and special libraries. These are settings that may be less likely to have opportunities for on-the-job training or mentoring and more likely to need entry-level catalogers to “hit the ground running.” Virtually all of the survey data are from large academic libraries. Professional competency documents, practitioner-authored literature, and even AUTOCAT may represent primarily large academic libraries, since they are most likely to encourage or require staff participation in associations and research. Also related are issues that cannot be resolved by formal education alone: research into on-the-job training and continuing education would contribute to a fuller picture of education for KO.

References
American Library Association. Congress on Professional Education. Task Force on Core Competencies draft statement. Retrieved February 25, 2006, from http://www.ala.org/ala/hrdrbucket/1stcongressonpro/1stcongresstf.htm
Association for Library Collections & Technical Services. ALCTS educational policy statement: Approved by the ALCTS Board of Directors, June 27, 1995. Retrieved February 25, 2006, from http://www.ala.org/ala/alcts/alctsmanual/conted/cepolicy.htm
Gorman, M. (2002). The corruption of cataloging. Library Journal, 120(15), 32-34.

Hill, D. W. (1997). Requisite skills of the entry-level cataloger: A supervisor’s perspective. Cataloging & Classification Quarterly, 23(3/4), 75-83.
Hill, J. S. (Ed.). (2002). Education for cataloging and the organization of information: Pitfalls and the pendulum. Binghamton, NY: Haworth Information Press.
Hill, J. S. (2004). Education and training of catalogers: Obsolete? Disappeared? Transformed? Part I. Technicalities, 24(1), 1, 10-15.
Hill, J. S., & Intner, S. S. (1999). Preparing for a cataloging career: From cataloging to knowledge management. http://www.ala.org/ala/hrdrbucket/1stcongressonpro/1stcongresspreparing.htm
History of AUTOCAT. (2002, February 14). Retrieved January 21, 2006, from http://ublib.buffalo.edu/libraries/units/cts/autocat/autocath.html
Hsieh-Yee, I. (2004). Cataloging and metadata education in North American LIS programs. Library Resources & Technical Services, 48(1), 59-68.
Intner, S. S. (2002). Persistent issues in cataloging education: Considering the past and looking toward the future. Cataloging & Classification Quarterly, 34(1/2), 15-29.
Intner, S. S., & Hill, J. S. (Eds.). (1989). Recruiting, educating, and training cataloging librarians: Solving the problems. New York: Greenwood Press.
Irwin, R. (2002). Characterizing the core: What catalog descriptions of mandatory courses reveal about LIS schools and librarianship. Journal of Education for Library and Information Science, 43(2), 175-184.
Joudrey, D. N. (2002). A new look at US graduate courses in bibliographic control. Cataloging & Classification Quarterly, 34(1/2), 59-101.
Letarte, K. M., Turvey, M. R., Borneman, D., & Adams, D. L. (2002). Practitioner perspectives on cataloging education for entry-level academic librarians. Library Resources & Technical Services, 46(1), 11-22.
MacLeod, J., & Callahan, D. (1995). Educators and practitioners reply: An assessment of cataloging education. Library Resources & Technical Services, 39(2), 153-165.
Markey, K. (2004). Current educational trends in the information and library science curriculum. Journal of Education for Library and Information Science, 45(4), 317-339.
McAllister-Harper, D. V. (1993). An analysis of courses in cataloging and classification and related areas offered in sixteen graduate library schools and their relationship to present and future trends in cataloging and classification and to cognitive needs of professional academic catalogers. Cataloging & Classification Quarterly, 16(3), 99-123.
Oyler, P. G. (2002). Teaching classification in the 21st century. International Cataloguing and Bibliographic Control, 31(1), 16-17.
Riemer, J. J. (1993). A practitioner’s view of the education of catalogers. Cataloging & Classification Quarterly, 16(3), 39-48.
Saye, J. D. (2002). Where are we and how did we get here? or, The changing place of cataloging in the library and information science curriculum: Causes and consequences. Cataloging & Classification Quarterly, 34(1/2), 119-141.
Sellberg, R. (1988). The teaching of cataloging in U.S. library schools. Library Resources & Technical Services, 32(1), 30-42.
Spillane, J. L. (1999). Comparison of required introductory cataloging courses, 1986 to 1998. Library Resources & Technical Services, 43(4), 223-230.
Taylor, A. (1996). A quarter century of cataloging education. In L. C. Smith & R. C. Carter (Eds.), Technical services management, 1965-1990: A quarter century of change and a look to the future: Festschrift for Kathryn Luther Henderson. New York: The Haworth Press.

Taylor, A. (1998, February 22). Official: Autocat archives. AUTOCAT [Online]. Available e-mail: [email protected]/Getpost autocat 048231 [2006, January 20].
Turvey, M. R., & Letarte, K. M. (2002). Cataloging or knowledge management: Perspectives of library educators on cataloging education for entry-level academic librarians. Cataloging & Classification Quarterly, 34(1/2), 165-187.
Vellucci, S. L. (1997). Cataloging across the curriculum: A syndetic structure for teaching cataloging. Cataloging & Classification Quarterly, 24(1/2), 35-59.
Williamson, N. J. (1997). The importance of subject analysis in library and information science education. Technical Services Quarterly, 15(1/2), 67-87.

Kathrin La Barre

A multi-faceted view: Use of facet analysis in the practice of website organization and access.

Abstract: In 2001, information architects and knowledge management specialists charged with designing websites and access to corporate knowledge bases seemingly re-discovered a legacy form of information organization and access: faceted analytico-synthetic theory (FAST). Instrumental in creating new and different ways for people to engage with the digital content of the Web, the members of this group have clearly recognized that faceted approaches have the potential to improve access to information on the web. Some of these practitioners explicitly use the forms and language of FAST, while others seem to mimic the forms implicitly (Adkisson, 2003). The focus of this ongoing research study is two-fold. First, access and organizational structures in a stratified random sample of 200 DMOZ websites were examined for evidence of the use of FAST. Second, in the context of unstructured interviews, the understanding and use of FAST among a group of eighteen practitioners was uncovered. This is a preliminary report on the website component capture and interview phases of this research study, and preliminary observations are drawn from the first phase. Future work will involve formalizing a set of feature guidelines drawn from the initial phases of this research study.

In any sphere of life, practice precedes theory. Lifeforce [sic] stimulates man to improvise, to design and to develop various aids both at the physical and at the mental levels…After a long experience is gained with an improvised aid, a theory is developed in order to understand the aid deeply and to systematize, improve, refine, and develop it. So has it been with classification too. … (Ranganathan, 1971).

I. Overview The global world of information is characterized by fluctuating borders that present challenges which impel continual, creative and proactive adaptations in strategies and solutions. Working from the postulate that information organization and access is best supported by mutable systems, the current research study examines reflections of the fluid associative and formal structured interrelationships among information packages that are best accessed through dynamic, responsive systems, first described by Bliss (1929, 1933), Bush (1945), Otlet (1934), and Ranganathan (1957, 1967). While grounded in the sociological and cultural milieu of present information realities and current international practices of web design, this study is also anchored in the intellectual and theoretical foundations of knowledge organization. In 2001, a group of information architects and knowledge management specialists charged with designing websites and access to corporate knowledge bases seemingly re-discovered a legacy form of information organization and access: faceted analytico-synthetic theory (FAST). This group of users has been instrumental in creating new and different ways for people to engage with the digital content of the Web, and its members have clearly recognized that faceted approaches have the potential to improve access to information on the web. Some of these practitioners explicitly use the forms and language of FAST, while others seem to mimic the forms implicitly (Adkisson, 2003). It must not be forgotten that the historical growth and development of FAST would not have been possible without the work of a number of seminal thinkers throughout the world, among them Paul Otlet (Belgium, Universal Decimal Classification), Henry Evelyn Bliss (United States, Bliss Classification) and S. R. Ranganathan (India, Colon Classification).
The fact that FAST retained its currency and continued to develop throughout the years is primarily due to the efforts of three groups who invested heavily in this new paradigm of information organization and access after WWII. These three international groups were the Library Research Circle (LRC) in India, the Classification Research Group (CRG) in England, and the Classification Research Study Group (CRSG) in North America.

II. Research study Current applications utilizing faceted approaches are varied and include website search and browsing systems. Using content analysis of websites and semi-structured interviews with Information Architects and Knowledge Management specialists, this study attempts to uncover whether evidence of faceted approaches exists in website information access and organization structures. In so doing, it is hoped that connections between the intellectual and theoretical foundations of FAST in information organization and current practices in information architecture will be made explicit. The focus of this research study was two-fold: capturing the access and organizational structures used in websites, and examining whether there is evidence of the use of FAST in website construction and the design of website search tools. Using the Alexander and Tate (A & T) (1999) typology of websites, a stratified random sample of 200 websites was selected from four DMOZ categories of interest: Shopping, Business, Reference, and Society. These categories were chosen for their marked differences in content and in rationale for creation and existence, in order to provide the possibility of marked differences in practice. In addition, eighteen practitioners who explicitly invoke both the language and the forms of FAST were interviewed using an open but guided (semi-structured) schedule of questions and prompts, in order to elicit current knowledge as well as to determine the ways in which their practices conform to or depart from traditional applications of facet analysis.

II. a. The sample websites Alexander and Tate (A & T) (1999) created a five-part typology of websites in order to assist in the assessment of website content and quality. This typology was used to guide the choice of sample websites; it was chosen because it presents a conceptual understanding of the web by breaking a large universe of websites into five rough types. This study compares Alexander and Tate Informational-type websites with Business/Marketing-type websites in an attempt to repeat and assess Adkisson’s (2003) preliminary study of the use of faceted classification on e-commerce websites. These two types of sites were selected as being conceptually related, one providing access to information, the other to goods, but driven by different motives, one primarily service-oriented and the other profit-oriented. Sites were then selected from those in the Open Directory Project (DMOZ: http://dmoz.org), which gives access to a cross-section of the web chosen by volunteer editors under strict editorial guidelines. It is hoped that, due to these policies, the randomly drawn sample of 200 DMOZ websites has resulted in a collection of websites that may well represent the best, most reliable, most frequently consulted and most carefully constructed sites on the Web. The DMOZ categories further provide an operationalized way of considering the Alexander and Tate types, as many of the DMOZ categories map to them, thus providing a way to draw the Alexander and Tate types down to the level of the individual page.
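A stratified draw of this kind can be sketched in a few lines of Python; the per-category pools and the proportional quota rule below are illustrative assumptions, not the study's documented procedure.

```python
import random

# Hypothetical pools of candidate sites per DMOZ category (invented data).
pool = {
    "Reference": [f"ref-site-{i}" for i in range(300)],
    "Society":   [f"soc-site-{i}" for i in range(500)],
    "Shopping":  [f"shop-site-{i}" for i in range(600)],
    "Business":  [f"bus-site-{i}" for i in range(700)],
}

def stratified_sample(pool, total):
    """Draw roughly `total` sites, allocating each category a quota
    proportional to its share of the pool. Rounding the quotas can make
    the overall total drift by a site or two."""
    pool_size = sum(len(sites) for sites in pool.values())
    sample = {}
    for category, sites in pool.items():
        quota = round(total * len(sites) / pool_size)
        sample[category] = random.sample(sites, quota)  # without replacement
    return sample

sample = stratified_sample(pool, 200)
print({category: len(sites) for category, sites in sample.items()})
# {'Reference': 29, 'Society': 48, 'Shopping': 57, 'Business': 67}
```

A pure random draw over the whole pool would under-represent the smallest category; stratification guarantees each category a quota, which is presumably why the study sampled within categories.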

Site sample (n=200) by A&T type [IN or BM] and DMOZ category
[Bar chart: number of sites per DMOZ category, REF(28), SOC(50), SHOP(57), BUS(65), each split by A&T type IN or BM]

Figure 1: Composition of the 200 sample websites by DMOZ category, Reference (REF), Society (SOC), Shopping (SHOP), and Business (BUS), and by Alexander and Tate (A&T) type, Informational (IN) or Business/Marketing (BM).

Rosenfeld and Morville (2002, p. 4) provided guidance in the selection of the components of interest for this study in their description of the four major components of a website: the organization system, the search system, the navigation system and the labeling system. This study examines two of these major components, the website search and navigation systems. Search systems on websites are easy to identify and are well known to web users, who typically look for “the box” when beginning a search. The current research uses the following definition of navigation elements: “any elements that help a user locate information on a website and allow the user to easily move from page to page within the site. Navigation aids may be text, graphics or a combination of these” (Alexander and Tate, 1999, p. 139). Though Rosenfeld and Morville (2002, p. 170) distinguish between search and navigation systems on websites, they urge designers to integrate these components in ways that allow users to jump easily between the two. In the language of the practitioners who explicitly invoke FAST, faceted navigation and faceted searching are central to the applications they create. “Instead of forcing one way to view the items, Faceted Navigation allows users to view the items in any way they want. At the same time they learn how the items are structured so that they may consider other search strategies in the future” (van Welie, 2004). Thus browsing, or ‘integrated’ search and browse systems as described by Rosenfeld and Morville (p. 131), were also examined. Integrated systems allow users to search and browse in one central location, often combining both actions seamlessly into a single interface.

                            REF (28)    SOC (50)   SHOP (57)   BUS (65)
                            Reference   Society    Shopping    Business
Search Features:
  basic search                  1           1          1           1
  advanced search               3           3          3           5
Browse Features:
  browse                        5           6          6           7
  integrated search/browse      4           2          2           4
Navigation Features:
  navigation only               2           7          7           6
  embedded navigation           7           5          5           3
  sitemap                       6           4          4           2

Table 1: Website components ranked by frequency of occurrence (1 = most common) within each category, across the 200 sample websites.

It is clear from this table that the most common feature across all types of sites is the basic search mechanism. It is somewhat surprising that the sitemap, a navigational feature, is far less common than might be expected. Another highly ranked feature is the advanced search mechanism, except among Business sites, which seem to prefer the use of a navigational sitemap. Also surprising is the heavy secondary reliance among Reference sites on navigational features rather than an advanced search feature to assist visitors in information seeking. One feature, integrated search and browse, is remarkable, as much for the complexity of its execution as for the fact that it seems to serve as an advance indicator of a site in which there is a high likelihood of the use of FAST. Sites with this feature tend to have a set of co-occurring features, such as embedded navigation (typically a drop-down menu attached to a global or site-level navigation bar) that provides access to sub-site, or local, areas. It would seem that this integration of search and navigational features provides a marker of technological sophistication. Both features require the use of advanced information architecture in a way that sitemaps, or the simple use of buttons to help a site visitor navigate through a site, do not. Perhaps this accounts for the fact that a site which uses these kinds of advanced features is also one which is more likely to use FAST. Might an IA who designs such a site be more skilled and more aware of the discussions about FAST within the community?

II. b. The practitioners Participants were recruited by posting an invitation to participate on several listservs with IA, KM or information organization topics, among them SIGIA and the Faceted Classification discussion list. Others who have explicitly invoked both the language and the forms of FAST, whether on blogs, lists, or at conferences in the past few years, were also invited to participate. Respondents were interviewed using an open but guided (semi-structured) schedule of questions and prompts. It was hoped that this method would provide access to a broad reach of practitioners. The table below describes the sample members.

Title                                         Number in sample
IA/web design – consultant                    2
IA/web design – project management            2
UX (user experience)                          3
KM consultant                                 3
KM specialist (Content Management Systems)    1
Product manager                               2
Software designer/engineer                    5 (2 academic)
                                              N = 18

Table 2: Interviewee occupation.

Seven of the interviewees are members of the Information Architect / User Experience community (IA/UX); UX is a much broader conception of IA. Four are members of the Knowledge Management (KM) community. Seven are involved in software design, engineering or product management. Two are also members of the academic community, and the rest are either self-employed consultants or work at for-profit corporations. Within the sample is a good mix of consultants and managers. The two academic members are also software engineers who spoke mainly about their creations and the practice of working within a programming and applications framework, and not primarily about their research orientations. One thrust of the interview process was to discover whether there was a common knowledge base, in terms of both the people and the resources commonly consulted; this is covered in the following tables.

Person consulted   | Affiliation / creation                                                                         | Times named
Hearst, Marti      | Associate Professor, SIMS, U. C. Berkeley / Flamenco search interface, various publications    | 8
no one             |                                                                                                | 6
Ranganathan, S.R.  | Professor, University of Madras and Hindu University / Prolegomena (1967)                      | 4
Svenonius, Elaine  | Professor Emerita, Department of Information Studies, UCLA / Intellectual Foundation of Information Organization (2000) | 3
Morville, Peter    | Semantic Studios / Information Architecture for the World Wide Web, 2nd ed. (2002)             | 3
Schwartz, Candy    | Professor, Simmons College                                                                     | 2
Pollitt, Steve     | View-Based Systems, Ltd.                                                                       | 2
Rosenfeld, Lou     | Lou Rosenfeld, LLC / Information Architecture for the World Wide Web, 2nd ed. (2002)           | 2

Table 3: Most likely to be consulted for information about facet analysis or faceted classification. A total of 24 names were listed; 9 were academics, and 2 are no longer living.

The list of people cited was relatively co-extensive with the list of sources. Other resources of note were the Faceted Classification discussion list; Peterme, a blog maintained by Peter Merholz; and Boxes and Arrows, an online publication with topics of interest to interaction designers, information architects and graphic designers.

III. Practice and understanding I’m fond of saying half the people in the world think they invented faceted classification. I know I did for a while.

Some form of this statement was uttered by a third of the interviewees. Though there are members of the group who have studied the canonical FAST literature, most found out about it when someone else identified their practice, software or design as an instantiation of faceted classification. Many members of the group were directed to, or found, Marti Hearst’s work with the Flamenco browser as their first exposure to faceted classification. Others read Information Architecture for the World Wide Web, which contained a discussion of faceted classification in relation to information architecture. The often-cited examples of faceted classification heavily favor applications, with a total of 15 websites cited. All websites and all but one of the six applications cited were created by for-profit organizations. Marti Hearst’s Flamenco application is the notable exception, and the most highly cited exemplar.

8   Flamenco demo search application [Academic]   http://bailando.sims.berkeley.edu/flamenco.html
4   Endeca (application)                          http://endeca.com
3   Epicurious (website)                          http://www.epicurious.com/
2   Barnes and Noble (Endeca site)                http://www.barnesandNoble.com
2   Siderean (application)                        http://www.siderean.com
2   Facetmap (application)                        http://facetmap.com/

Table 4: Most often cited examples of faceted classification in action.

But what do the interviewees mean when they talk about faceted classification? Overall, the interviewees demonstrated a sophisticated understanding of fundamental concepts such as the facet. Most indicated that a facet is a dimension, attribute, characteristic, category or property. Many indicated an appreciation for the associative or hierarchical constructions that FAST makes possible and explicit. Within this community, the most common understanding of a faceted classification is a system with the following components: facets are displayed as part of a search or navigation system and are visible at all times, the results are displayed using facets as an organizational device, and a null or empty set is never returned as a result display. Many indicated that it was difficult to think of faceted classification outside of a navigational or search system. As one interviewee noted, “One of the things we are still trying to understand here is how to move rapidly to take faceted interfaces and begin to use them not just for retrieval purposes, but for managing an entire class of information.” The observation of one interviewee speaks for many: “I think we use faceted classification, and everybody understands it more or less, but nobody has really formulated it for us in a way we can understand. The practice we have needs to be theorized a bit and formalized.” As for usability, some expressed an understanding of the importance of domain knowledge when approaching facet analysis: “Do a lot of user research to determine what kinds of content people are looking for and the ways in which they search.” This is important because “Facets and their display are completely dependent on the target population, their needs and interests.” While there is an understanding that available technology should make faceted interfaces easy to implement, “to many people it is still just an interface”, and not yet “a complete data model”, though this is something clearly desirable for many.
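The components this community names (always-visible facets, facet-organized results, never an empty result set) can be paraphrased in a short sketch. The wine items, field names, and `browse` helper below are illustrative inventions, not any interviewee's system.

```python
# Illustrative item records; each facet is a field with a single value.
items = [
    {"name": "Merlot 2001",   "region": "France",    "type": "red",   "price": "under $20"},
    {"name": "Chablis 2003",  "region": "France",    "type": "white", "price": "$20-$40"},
    {"name": "Shiraz 2000",   "region": "Australia", "type": "red",   "price": "under $20"},
    {"name": "Riesling 2004", "region": "Germany",   "type": "white", "price": "under $20"},
]

FACETS = ("region", "type", "price")

def browse(items, selections):
    """Return the items matching every selected facet value, plus the match
    count for each still-available value of the unselected facets. Values
    with zero matches are simply absent from the counts, so the interface
    never offers a choice that would return an empty result set."""
    matches = [it for it in items
               if all(it[f] == v for f, v in selections.items())]
    counts = {f: {} for f in FACETS if f not in selections}
    for it in matches:
        for f in counts:
            counts[f][it[f]] = counts[f].get(it[f], 0) + 1
    return matches, counts

matches, counts = browse(items, {"type": "red"})
print([m["name"] for m in matches])  # ['Merlot 2001', 'Shiraz 2000']
print(counts["region"])              # {'France': 1, 'Australia': 1}
```

Displaying the remaining counts next to each facet value is what lets users "view the items in any way they want" while learning how the collection is structured.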
Interface design issues seem to be among the most exciting and difficult challenges for many in the sample, and are often described as “a wicked interface issue”. Questions range from how complex a faceted interface can be before it overwhelms users, to how to relate attributes to a subset of entities, to whether or not to allow multiple picks within a facet. Almost all of the interviewees expressed a strong interest in usability testing of faceted search and navigation interfaces to provide support for their own anecdotal evidence that these interfaces are powerful, useful and intuitive. As to the relation of practice to FAST theory, there is no agreement on the necessity of one of the fundamental canons, mutual exclusivity, or on what it means in practice. While some interviewees feel that orthogonal facets are unnecessary in today’s digital environments, others adhere to the canon rigorously. It is interesting to note that Travis Wilson will be presenting his Facetmap software at the March 2006 Information Architecture Summit. Designed according to a “strict faceted classification model”, Facetmap forbids a combination of attributes at any given facet level (in his example, an ice cream with a flavor of raspberry-chocolate is not permitted under the strict model). He goes on to state, “This is counterintuitive, controversial, and if you subscribe to S.R. Ranganathan's original facet theory, heretical. Yet a faceted classification is most effective when built upon that restriction.” Another disputed boundary area is the desirability or efficacy of universality in facet creation. “I can’t imagine the classical facets being useful in identifying your own facets and then applying them to your own system. 
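Wilson's restriction, as reported here, amounts to requiring exactly one atomic value per facet per item. A minimal check, with invented data and function names (not Facetmap's actual API), might look like this:

```python
def is_strict(item, facets):
    """Return True if the item carries exactly one atomic value per facet.
    A combined value, e.g. a flavor of raspberry-chocolate represented as
    a list of two flavors, violates the strict model and is rejected."""
    return all(item.get(f) is not None
               and not isinstance(item.get(f), (list, tuple, set))
               for f in facets)

FACETS = ("flavor", "container")
ok  = {"flavor": "raspberry", "container": "cone"}
bad = {"flavor": ["raspberry", "chocolate"], "container": "cone"}  # multi-valued facet

print(is_strict(ok, FACETS))   # True
print(is_strict(bad, FACETS))  # False
```

The looser alternative, allowing several values within one facet, is exactly the "multiple picks within a facet" question the interviewees dispute.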
That’s just not the way things are.” Another interviewee said, “I think it is just silly to have a universal way of organizing the world.” Yet other interviewees noted a common observation: fundamental categories that closely mirror Ranganathan’s five frequently recur in other applications and implementations, whether these facets were automatically extracted from data in some way or created by human indexers. Other perceived limitations of faceted structures relate to the difficulty of choosing facets when the data being organized is unstructured, lacks metadata, or falls outside a well-defined domain. Clearly, this is a terrain in which much experimentation with and discussion of FAST (whether fully understood or not) is occurring. It is interesting to note that many of these same discussions occurred in earlier work completed by the CRG, CRSG, and LRC. The solutions this group eventually reaches may be a fruitful source of theory-building.

IV. Future research directions Faceted classification is one way of giving the IA community the tools they need to effectively shape, craft, and channel information repositories in ways that enable designers to get the benefit of human insight and which can be immediately useful to people in the broader community.

A preliminary assessment of the sample website components indicates that approximately ten percent show evidence of some degree of facet analysis. More rigorous analysis of these sites is being conducted. Facets that are evident in these website designs are always used in interfaces which attempt to combine or integrate browsing and searching in order to enable site visitors to find what they are seeking. This remains true regardless of site intent or category. The continuing interest among KM and IA practitioners in FAST endures, along with a sense that though the ideas of faceted classification may well be embedded in practice, more guidance and practical theoretical help is needed in order to create the kinds of compact, scalable systems, notable for their conceptual clarity, that these communities know are possible with FAST. It appears that robust application of FAST is still elusive. Guidelines are currently being drawn from content analysis of the sample websites and the observations elicited in the interviews of designers. The principles and postulates of FAST will be combined with the observations drawn from practice in the final phase of this study in order to create a checklist that may enable web designers interested in applying FAST to do so in a consistent and robust manner as part of their own website design and construction practices. One potential contribution of the proposed study is the possibility of applying what is uncovered in the course of this study, especially as it relates to novel web design practices, to enrich the current state of FAST theory, and in turn for FAST theory to enhance the design and construction of websites.

References
Adkisson, H. (2003). Use of faceted classification. Web design practices. Retrieved 2 February 2004, from http://www.Webdesignpractices.com/navigation/facets.html
Alexander, J., & Tate, M. (1999). Web wisdom: How to evaluate and create information quality on the Web. New Jersey: Lawrence Erlbaum Associates. Also available: http://www2.widener.edu/Wolfgram-Memorial-Library/webevaluation/webeval.htm
Bliss, H. E. (1929). The organization of knowledge and the system of the sciences. New York: Henry Holt.
Bliss, H. E. (1931). Scientific is not philosophic classification. Library Association Record, 9(5), 174-175.
Bliss, H. E. (1933). The organization of knowledge in libraries. New York: H. W. Wilson.
Boxes and Arrows. (2006). Retrieved February 22, 2006, from http://www.boxesandarrows.com/
Bush, V. (1945). As we may think. Atlantic Monthly, 176(1), 101-108.
Otlet, P. (1934). Traite de documentation. Brussels: Editiones Mundaneum.
Peterme: Links, thoughts and essays from Peter Merholz. (n.d.). Retrieved February 22, 2006, from http://www.peterme.com/
Ranganathan, S. R. (1957/1967). Prolegomena to library science. New York: Asia Publishing.
Ranganathan, S. R. (1971). Colon classification, ed. 7: A preview. Bangalore: Sarada Ranganathan Endowment for Library Science. p. 48.

Xia Lin, Serge Aluker, Weizhong Zhu
College of Information Science and Technology
Drexel University
Philadelphia, PA 19104

Foster Zhang
Digital Library Systems and Services
Stanford University
Stanford, CA 9435

Dynamic Concept Representation through a Visual Concept Explorer

Abstract: In the digital environment, knowledge structures need to be constructed automatically or through self-organization. These structures need to emerge from, or be discovered in, the underlying information. The displays need to be interactive to allow users to determine the meanings of the structures. In this article, we investigate these essential features of dynamic concept representation through a research prototype we developed. The prototype generates an instant concept map upon a user’s request. The concept map visualizes both concept relationships and hidden structures in the underlying information. It serves as a good example of knowledge organization as an interface between users and literature.

1. Introduction A main advantage of knowledge organizing tools, such as classifications, controlled vocabularies, ontologies, and topic maps, is that they create a knowledge structure suitable for a body of literature in a selected domain. These tools provide hierarchical structures, associative structures, and various cross-references or semantic links among concepts, terms, documents and other entities. These structures enable users to navigate or browse the body of literature and enhance their understanding. However, this very advantage is also a significant disadvantage in the digital environment, as many have observed. For example, some typical problems related to knowledge structures include: (1) users often choose to do free text searches rather than browse through the knowledge structures, because free text search is easy and requires the least effort to reach the information they need, while going through knowledge structures puts an additional cognitive burden on users; (2) structures created in these knowledge organizing tools are “man-made” (created by domain experts) and do not necessarily reflect the semantic structures of the document space the tools apply to; in fact, there is a significant gap between the knowledge structures in the knowledge organizing tools and those in the body of literature; (3) there is a lack of easy-to-use interfaces that integrate knowledge organizing tools and the body of literature in the browsing and searching environment. In this paper, we explore these issues through a prototype system called Visual Concept Explorer (VCE). VCE is a representation and mapping tool designed to communicate with users about both the knowledge structures established in the knowledge tools and the knowledge structures hidden in the body of literature. We seek to apply information visualization techniques to link and visualize the two knowledge structures for users to explore.
The system currently is implemented with the National Library of Medicine's controlled vocabulary -- the Medical Subject Headings (MeSH), the Unified Medical Language System (UMLS) and the PUBMED search engine for the largest medical literature database,

MEDLINE. It has an open architecture that can be easily adapted to other controlled vocabularies and literature databases.

2. Knowledge Organization as an Interface between Users and Literature

The goal of knowledge organization (KO), in essence, is to create an interface between the literature and users -- an interface that can serve as a pathway or navigation guide for users to recognize the coverage of the body of literature, to understand the knowledge structures of a subject domain, and to access the specific information they need. A subject thesaurus, for example, is such an interface. The central piece of a thesaurus is its highly developed knowledge structure of terms and term relationships. Typically, a subject thesaurus is created by experts in a domain. The domain experts select the terms to be included and build term relationships, mostly of three types: equivalence, hierarchical, and associative (Clarke, 2001). The terms included in the thesaurus are later used by professional indexers to index documents, providing a consistent "concept space" for the domain. Figure 1 is an example of the hierarchical structure for the concept "Influenza A virus" in MeSH, the Medical Subject Headings developed and maintained by the United States National Library of Medicine. In this example, the hierarchy shows that "Influenza A virus" is a type of VIRUSES (or more specifically, a type of RNA VIRUSES, a genus in the family ORTHOMYXOVIRIDAE, etc.). Its three sub-concepts indicate that the virus can cause INFLUENZA and other diseases in humans and animals. It does not, however, show other relevant terms such as "Disease Outbreaks" and "Influenza Vaccine," which are currently perhaps the terms most relevant to "Influenza A virus." Later, in Figure 4, we will show that the VCE dynamic display does show these terms.

Figure 1. The hierarchical tree for the concept “Influenza A Virus” in MeSH (2005 version).

The above example shows one of the problems of the hierarchical structures in thesauri. While the structures are carefully designed and generally consistent, they do not necessarily match users' needs. Putting it more forcefully, Thellefsen (2004) states that "the thesaurus structure is a non-realistic structure that is forced upon the domain, often by librarians or information specialists." He emphasizes that the test of a good representation is whether it truly represents the structure of a knowledge domain or whether it truly represents the distinctive features of the knowledge domain.

Since the former is very difficult to achieve or test, he proposed a "knowledge profiling" method to construct the "distinctive features" of a domain based on its epistemological basis and the consequences of that basis. Indeed, many research projects on knowledge organization focus on different ways of constructing knowledge structures. For example, Lockenhoff (1994) proposed a systems modeling approach to classification. He views classification as a representation model that is dynamic and dialogic. The model, he emphasizes, should be constructed as a representational interface and an evolving system that reveals the context and contextual structures, the hidden orders, and the dynamic changes of the objects to be represented. Schneider and Borlund (2004) discussed using bibliometric methods to enhance thesaurus construction. They applied various semi-automatic bibliometric methods to augment traditional manual thesaurus construction in the identification of candidate thesaurus terms and concept relationships. San Segundo (2004) investigates the difference between knowledge representation applied to electronic information and knowledge representation applied to the structure of natural language and human memory. While "previous classification systems were imbued with logical positivism where the subjectivity of the user did not intervene, these new representations are inundated by the subject which prescribes them, the unit of the structure of the representation is correlative to the sequence of interpretants, which will determine different meanings for the same structure." Other constructive approaches to knowledge organization include co-citation analysis (Small, 1999), co-word analysis (Ding et al., 2001), text mining (Haravu & Neelameghan, 2003), and ontology learning (Shamsfard & Barforoush, 2004). What we observe is an increasing trend toward dynamic representation in knowledge organization.
In the digital environment, knowledge structures need to be constructed automatically or through self-organization. The structures need to emerge from, or be discovered in, the underlying information. While traditional organization tools present logically sound structures for concepts and concept relationships, they need to be combined with self-organizing or learning processes to dynamically reveal the structures in the literature as well as among concepts. Most importantly, knowledge organization as the interface between users and literature needs to provide the means for users themselves to explore and "determine meanings for the same structure."

3. Design for Interaction

The VCE prototype we are developing provides multiple channels for users to interact with terms, documents, and knowledge structures through a visual interface. Figure 2 shows the interaction paths between the user and the literature through VCE. When a user initializes a search, he can start with any search term (free-text search). If the term matches an indexing term in the controlled vocabulary (MeSH terms in the current setting), VCE will generate a semantic map of the MeSH terms most closely related to the term entered by the user. The map shows how the terms are related to each other based on term co-occurrence analysis of the literature database (MEDLINE). It shows how the terms are used in the literature, thus visualizing content structures in the literature. Through the various interactive means provided by the system, the user can browse and select terms on the map to construct a precise search query or generate new maps based on different terms selected on the map (Channel D). If the user's query term does not match any indexing term in MeSH -- which happens most often, as users do not necessarily come up with terms that exactly match the terms in the controlled vocabulary -- the system will attempt various mappings. It first expands the search from the single controlled vocabulary source (MeSH) to a multiple-vocabulary system (the Unified Medical Language System, or UMLS) and uses the unified concept ID (UID) of UMLS to find the equivalent term in MeSH (Channel B in the figure). When no equivalent MeSH term is found, the system sends the user's initial query to the search engine and retrieves the top 200 documents that match the query. All the indexing terms of these 200 documents are extracted and listed according to their occurrence frequencies. The user can then select a term from the list for mapping or exploration (Channel C).
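The fallback logic of these channels can be sketched as follows. This is only an illustrative sketch: the vocabularies, the document sample, and the `map_query_term` function are toy stand-ins for MeSH, UMLS and the MEDLINE search engine, not part of the actual VCE implementation.

```python
from collections import Counter

def map_query_term(query, mesh_terms, umls_synonyms, search_engine, top_n=200):
    """Resolve a free-text query to controlled-vocabulary terms.

    Channel A: the query is itself a MeSH term -> use it directly.
    Channel B: the query matches a UMLS synonym whose concept has an
               equivalent MeSH term -> map via the shared concept.
    Channel C: retrieve the top documents for the query and rank the
               indexing terms of those documents by frequency.
    """
    q = query.lower()
    if q in mesh_terms:                          # Channel A
        return [q]
    mesh_equivalent = umls_synonyms.get(q)       # Channel B
    if mesh_equivalent in mesh_terms:
        return [mesh_equivalent]
    docs = search_engine(query)[:top_n]          # Channel C
    freq = Counter(t for doc in docs for t in doc["index_terms"])
    return [term for term, _ in freq.most_common()]

# Toy stand-ins for the real vocabularies and literature database.
MESH = {"influenza a virus, avian", "influenza vaccines", "disease outbreaks"}
UMLS = {"avian flu virus": "influenza a virus, avian"}
DOCS = [
    {"index_terms": ["influenza a virus, avian", "disease outbreaks"]},
    {"index_terms": ["influenza vaccines", "influenza a virus, avian"]},
]
engine = lambda query: DOCS

channel_b_result = map_query_term("Avian Flu Virus", MESH, UMLS, engine)
channel_c_result = map_query_term("bird flu vaccine", MESH, UMLS, engine)
```

In the last call the query matches nothing in either vocabulary, so the term list is derived from the indexing terms of the retrieved documents, ranked by frequency, exactly as in the exploration-search channel.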

[Figure 2 is a flowchart: the user's term is matched against the indexing vocabulary (MeSH); unmatched terms are routed through multiple vocabulary resources (UMLS) or through an exploration search that lists the indexing terms of retrieved documents; the user's selections of indexing terms drive the visual display of knowledge structures and precision searches of the literature.]

Figure 2. The multiple-channel approach for vocabulary mapping

4. An Example of Dynamic Representations

VCE is a mapping tool that creates dynamic knowledge structures upon a user's query. It is implemented as a practical visual interface over real-world vocabularies (MeSH and UMLS) and a very large, publicly available literature database (MEDLINE). The maps on the interface are generated by two mapping algorithms, Pathfinder networks (Schvaneveldt, 1990) and the Kohonen self-organizing feature map (Kohonen, 1997), based on term co-occurrences in the MEDLINE database. In this paper, we will not discuss technical details but focus on examples of dynamic representations of concepts and how users interact with the representations. Say a user starts with a search for "bird flu vaccine," a timely topic he would like to explore. The system first determines that the term is not a MeSH term, and that it does not match any UMLS term either. The system then searches PUBMED for documents that match the query "bird flu vaccine." By processing the first 200 retrieved documents, the system yields a list of terms most relevant to the query (Figure 3). The user can now select any of the terms on the list to explore.

Figure 3. The user’s initial query and a list of relevant terms returned by VCE

VCE allows the user to see and recognize the concept terms that best match his information needs. Figure 4 shows the map generated when the user selects the first term on the list, "Influenza A Virus, Avian." The map provides very rich information about the set of terms most closely associated with the selected term. Following the clusters or links on the map, the user can either select a different term to explore, or select terms to construct a precise query for further searches. Comparing Figure 4 to Figure 1, we see two very different pictures of the same concept. In the MeSH tree structure of Figure 1, the emphasis is on the logic, or the scientific category, that the concept "Influenza A Virus, Avian" belongs to. The automatically generated concept map in Figure 4 shows the terms used most often with the concept "Influenza A Virus, Avian" and how they are related to each other. It should be emphasized that the term relationships shown in Figure 4 are not simple binary term co-occurrence relationships but group co-occurrence relationships. It is the optimization process (in this case, the application of the Pathfinder algorithm) that highlights the most significant term relationships. We call this kind of representation "dynamic concept representation." Recall that the user's original query was "bird flu vaccine," which implies two concepts: the virus and the vaccine. In the MeSH hierarchy, the two concepts are in two different hierarchical branches. In the dynamically generated display (Figure 4), many related concepts are displayed on the same screen, with their relationships shown. On the one hand, this helps users understand the context in which these concepts are used. On the other hand, it provides users more choices for further interaction and exploration. Coupled with the instant availability of the search engine on the same interface, users can easily right-click on selected concepts to add them to the search boxes.
Terms in the search boxes are automatically converted to a Boolean query for PUBMED searching. In this example, the user adds "disease outbreaks," "influenza vaccine," and "influenza A virus, human" to the search boxes. The searcher always knows how many documents will be retrieved as he adds or removes a term during the interaction process. Searching thus becomes a byproduct of the user's interaction with the knowledge representation interface.

Figure 4. An automatically generated concept map for the term “Influenza A Virus, Avian.”
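The conversion of selected terms into a Boolean query, described above, can be as simple as the following sketch. The `build_boolean_query` helper is illustrative, not VCE's code; the `[MeSH Terms]` field tag is PUBMED's syntax for restricting a search to indexing terms.

```python
def build_boolean_query(terms, operator="AND"):
    """Join user-selected MeSH terms into a PUBMED-style Boolean query,
    quoting each term and tagging it as a MeSH field."""
    joiner = f" {operator} "
    return joiner.join(f'"{t}"[MeSH Terms]' for t in terms)

# The three terms the user added to the search boxes in the example.
selected = ["disease outbreaks", "influenza vaccine", "influenza a virus, human"]
query = build_boolean_query(selected)
```

Re-running such a query (or a count-only variant of it) after every add or remove is what lets the searcher always see how many documents the current selection would retrieve.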

5. Conclusions

As online map services such as mapquest.com or maps.google.com have become widely available, people's map-using behaviors have started to change. They do not necessarily need to work through the hierarchy from country map to state map to city map in order to find a street location; they do not need to see all the streets near their target locations. They can jump right to their target locations and dynamically map only the streets and highways they need to travel. Similarly, in the digital environment, more and more people can use search engines to jump to a neighborhood of the information they are searching for. What they need is a dynamic representation of the information most relevant to their information needs. Thus, dynamic representation and interaction are likely to be two important features of knowledge representation in the digital environment.

In this article, through a prototype design, we have explored some of the potential of dynamic concept representation. We consider knowledge organization tools to be, essentially, an interface between the user and the literature. Knowledge organization should take into account not just the semantic relationships of concepts: both users' information needs and the information in the literature are essential sources of knowledge for building knowledge structures. Thus, knowledge structures need to be constructed dynamically, and users need to interact with the dynamic representation to match their information needs with the underlying information. While the VCE prototype represents only one approach to dynamic concept representation, we hope that this article will bring attention to some general research questions for the knowledge organization community:

1) Can knowledge structures be built instantly upon a user's request? This tests whether the structures can be truly linked to the user's information needs. It also addresses the integration of knowledge representations for documents with knowledge representations for users' information needs.

2) Can knowledge structures represent dynamic changes in their underlying documents? In other words, when the underlying documents change, the knowledge structures should change automatically. When users select different document collections, or different time periods of a collection, the knowledge structures should change accordingly, and users should be able to observe or follow the changes.

3) Can knowledge organization tools seamlessly integrate knowledge structures with the underlying documents they represent? While we continue to create knowledge structures manually, semi-automatically and automatically, we need to make sure that the tools communicate with users about both the knowledge structures established in the knowledge tools and the knowledge structures hidden in the body of literature.

6. References

Clarke, S. G. (2001). Thesaural relationships. In: C. A. Bean and R. Green (eds.), Relationships in the organization of knowledge (pp. 37-52). Dordrecht, The Netherlands: Kluwer.

Ding, Y.; Chowdhury, G. G.; & Foo, S. (2001). Bibliometric cartography of information retrieval research by using co-word analysis. Information Processing & Management, 37(6), 817-842.

Haravu, L. J.; Neelameghan, A. (2003). Text mining and data mining in knowledge organization and discovery: The making of knowledge-based products. Cataloging & Classification Quarterly, 37(1/2), 97-113.

Kohonen, T. (1997). Self-organizing maps. Berlin: Springer-Verlag.

Lockenhoff, H. (1994). Systems modeling for classification: The quest for self-organization. Knowledge Organization, 21(1), 12-23.

Schneider, J. W.; Borlund, P. (2004). Introduction to bibliometrics for construction and maintenance of thesauri. Journal of Documentation, 60(5), 524-549.

Schvaneveldt, R. W. (Ed.). (1990). Pathfinder associative networks: Studies in knowledge organization. Norwood, NJ: Ablex.

San Segundo, R. (2004). A new conception of representation of knowledge. Knowledge Organization, 31(2), 106-111.

Thellefsen, T. (2004). Knowledge profiling: The basis for knowledge organization. Library Trends, 52(3), 507-514.

Small, H. (1999). A passage through science: Crossing disciplinary boundaries. Library Trends, 48(1), 72-108.

Shamsfard, M.; & Barforoush, A. A. (2004). Learning ontologies from natural language texts. International Journal of Human-Computer Studies, 60(1), 17-63.

Victoria Frâncu
Central University Library of Bucharest - Romania

Subjects in FRBR and Poly-Hierarchical Thesauri as Possible Knowledge Organizing Tools

Abstract: The paper presents the possibilities offered by poly-hierarchical conceptual structures as knowledge organizers, starting from the FRBR entity-relationship model. Of the ten entities defined in the FRBR model, the first six -- the bibliographic entities plus those representing the intellectual responsibilities -- are clearly described by their attributes. Unlike these, the other four, representing subjects in their own right (concepts, objects, events and places), have only the term for the entity as an attribute. Subjects need to be treated more extensively in a revised version of the FRBR model, with particular attention to the semantic and syntactic relations between the concepts representing subjects themselves and between these concepts and the terms used in indexing. The conceptual model of poly-hierarchical thesauri is regarded as an entity-relationship model, one capable of accommodating subjects in the bibliographic universe both conceptually and relationally. Poly-hierarchical thesauri are considered frameworks or templates meant to enhance knowledge representation and to support information searching.

1. Introduction

The conceptual model proposed by IFLA's Functional Requirements for Bibliographic Records (FRBR) aims at defining the functions performed by bibliographic records in their widest meaning, including descriptive elements and access points, taking into account the whole range of media and formats existing in bibliographic databases and attempting to meet the expectations of a large variety of user information needs in the most efficient way. The Functional Requirements for Authority Records (FRAR) extend the model, on the same conceptual structure, to cover the entities and relationships reflected in authority records. The FRBR entity-relationship model makes provision for subjects as the third group of entities. Whereas the first two groups of entities -- the bibliographic ones (works, expressions, manifestations and items) and those representing intellectual responsibility (persons and corporate bodies) -- are clearly defined by many attributes, the subject entities (concepts, objects, events and places) have only one attribute: the term for the entity. Along with these four, each of the six entity types in the first two groups of the model can also be treated as a subject. One of the general characteristics of a work is that it has a subject; consequently, that subject has to be logically connected with the work (Riesthuis and Žumer, 2004). This means that one of the functions of the cataloguing process, in the view of the FRBR bibliographic model, is to identify the subject of a work and collocate all the works relevant to that particular subject. The logical connection between the works, embodied by means of expressions in manifestations, and the subjects dealt with in those works is made through the relationships on which the FRBR model is based.
Bibliographic entities are characterized by a set of attributes that should be linked together each time an information need is formulated as a query, according to the criteria the user is interested in. The FRBR and FRAR conceptual models do not treat subjects extensively enough. Particular attention still needs to be given to the attributes and the semantic and syntactic relationships existing in thesauri, subject heading lists and classification systems, and to the way they are reflected in the models. Delsey (2005, 56) insists upon a revision of the FRBR and FRAR models with respect to a more extensive analysis of subject access.

2. Attributes in FRBR

Neither of the two conceptual models is restricted to bibliographic information: they apply, in an expanded outlook, to archives, through the IFLA FRBR Working Group's interaction with the International Council on Archives Committee on Descriptive Standards (Patton, 2005), and to museum objects (Delsey, 2005), through harmonization with the CIDOC-CRM. Delsey points out the necessity of a significant re-examination of, and inter-relation between, the two models, and of a study of the "conventions used to support subject access in bibliographic records and the principles underlying the construction of thesauri, subject headings lists and classification schemes". Two major objectives of that exercise are: (1) to ensure that the attributes with a role in the building and use of access points and subject authority records are adequately covered, and (2) to ensure that the models provide a clear representation of the relationships reflected through subject access points in bibliographic records, of those reflected in the syndetic structure of thesauri, subject heading lists and classification schemes, and of the syntactic structure of indexing strings.

3. Poly-hierarchies as Collections of Entities

Experience has proved that poly-hierarchies, as flexible types of knowledge organizers, are more appropriate to information retrieval requirements, particularly in the case of digital resources. If nowadays library catalogues give the user the opportunity to discover useful information by serendipity, enabled by the 'browse' function of online catalogues, semantic networks are likely to guide the user "by purpose" to the desired information without him necessarily being an expert in the field of his interest. These networks function as template structures able to embed entities and relationships if adequately configured. In a comprehensive article about generating poly-hierarchical classifications by means of a system of criteria that allows the distinction of classified objects by their meaningful properties, the authors (Babikov et al.) argue that "the generating poly-hierarchy is a self-consistent, compact, portable, and re-usable information structure serving as a template classification. This can be further associated with one or more particular sets of objects, included in more general classifications, or used as a prototype for more comprehensive classifications." Knowledge is represented in the poly-hierarchical structure by the semantic network underlying it. Semantic networks, according to Iyer (1995, 164), consist of "objects called nodes and arcs or links which connect nodes. Nodes can represent entities, such as physical objects, abstract concepts, acts or events, or descriptors (adjectives or other entities which describe an entity)". The links between nodes represent relationships, such as "is-a", "has-a", "resembles" or "causes". A poly-hierarchy is a collection of individual entities of different types, grouped in classes according to explicitly formulated criteria and linked among themselves by multiple types of relationships, each of the entities developing or producing its own hierarchy.
Fugmann (1993, 20) defines poly-hierarchies as structures characterized by the existence of more than one superordinate concept in the hierarchy for a specific concept. He also argues that poly-hierarchies are often the consequence of poly-categoriality. Poly-dimensionality of hierarchies is the progressive subdivision of general concepts using a series of characteristics of subdivision, applied simultaneously and in parallel. Kwasnik (1999) explores the link between classification and knowledge while analyzing the structural requirements of hierarchies. She points out that not every knowledge domain can be represented by a hierarchy, and mentions multiple hierarchies as one of the problematic issues in this respect. Cross-links in a poly-hierarchical structure are regarded as a possible solution.

Soergel (1999) identifies several functions of dictionaries, thesauri and ontologies or classifications. They are reference tools relating concepts to terms, and they provide definitions, conceptual frameworks and semantic road maps to individual fields and the relationships among fields. They support information retrieval by providing knowledge-based support for end-user searching: menu trees, browsing a hierarchy or concept map to identify search concepts, and mapping from the user's query terms to descriptors. They also support hierarchically expanded searching and well-structured displays of search results, and provide a tool for vocabulary control in indexing. Soergel recommends reconsidering the "huge intellectual capital embodied in many classification schemes and thesauri".

4. Classificatory and Thesaurus Structures

Classification schemes and thesauri are bibliographic tools and major sources of controlled vocabulary terms. Along with subject heading lists, they are used in authority control, which provides authoritative consistency of indexing terms. Classification schemes use numerical notations or other codes for subject representation, and they have a hierarchical structure. Few classifications are fully enumerative or fully faceted; most are partly faceted and partly enumerative, as is the case with the Universal Decimal Classification (UDC). The classification tables include the notational codes, their meanings and indications of use in natural language, and often provide cross-references to sections of the scheme where related concepts can be found. Thesauri are organised alphabetically and use terms from natural language, in a normalised form, to represent subjects. They have "by nature" a hierarchical structure, reflected through the reciprocal BT/NT relations between their terms. Associative relations are also found in the structure of thesauri; they are reflected through references between related terms, or from unauthorised to authorised terms and vice versa. As far as the form of terms is concerned, thesaurus terms are usually single words or short phrases that are used postcoordinately in formulating the search statement. Classification codes cover the subjects they represent by precoordinated structures (e.g. 398.332.416(498.4)(086.86), meaning in UDC "Christmas customs in Transylvania – Romania – video recording"). An important difference between classificatory structures and thesaurus structures is that the former are discipline-oriented and allow browsing by scientific discipline or subject area, whereas the latter are subject-oriented and allow specific retrieval. As a consequence, subjects that occur in more than one discipline are difficult, if not impossible, to represent in classification codes.
A term like "acoustics", for instance, is found 82 times in the UDC tables: in theoretical physics, physiology, linguistics, architecture, musical instrument manufacture, sound recording and reproduction, and acoustic detection, to mention only some. Interdisciplinary subjects are still more difficult to represent in classification notations. Iyer (1995) discusses this phenomenon and gives as examples the clustered disciplines occurring in topics like women's studies, area studies, environmental studies, etc. However, in recent years some classification schemes, e.g. DDC and UDC, have added tables for new interdisciplinary scientific fields; examples of such tables in the UDC are in Class 50 and Class 60 – Biotechnology. Whereas thesaurus terms need qualifiers or other context-providing devices in cases of homonymy, with classification notations there is no such problem, since the class mark works as a disambiguating device. The concept of "Phonation" is an example of homonymy and multiple inheritance, from Linguistics – 81`342.2, Physiology – 612.78 and Physics – 534.78. The class marks 81, 612 and 534 of the UDC specify the context in each instance of use of the mentioned classification codes. This is not the case with a thesaurus or subject heading list, which needs context to disambiguate the meaning or domain of application in each instance. In principle, classificatory structures lack unique characteristics of subdivision, as the "sonar" example illustrates (Figure 1). For most subjects there are several hierarchies into which they fit, because such subjects may be the object of study of more than one scientific discipline. There are different hierarchies under which the concept 'sonar' can be found throughout the UDC scheme.
Being in the first place an application of acoustics as a physical phenomenon, it is found in Technical acoustics under 'Acoustic detection' as a method of underwater object location; then under 'Material and equipment of naval forces'; a third time under 'Vehicle engineering' as a means of operation of guidance systems; and lastly under 'Fishing' as a fish detection method.

UDC number     UDC description
681.8          Technical acoustics
681.88         Acoustic detection […] Sound locators. Hydrophones. Sonar
681.883        Sonar methods and equipment. Underwater object location […]
681.883.05     Means of sweeping and scanning
681.883.06     Sonar systems and equipment according to location
681.883.062    Ship-borne systems […]
681.883.068    Airborne systems
681.883.07     Sonar systems according to purpose
681.883.072    Sonar systems for research
681.883.074    Sonar systems for navigation
681.883.078    Special purpose sonar systems
623.9          Material and equipment of naval forces […] Submarine defences
623.98         Various applications of science. Special instruments, equipment (e.g. submarine detectors)
623.983        Detectors of underwater objects. Asdic. Sonar
629            Transport vehicle engineering
629.05         Guidance, control-initiation and navigation systems and instruments, vehicle-borne
629.052        Means of operation of guidance systems
629.052.2      Acoustic systems. Sonar. Echo systems
639.2          Fishing
639.2.081      Fishing methods and equipment
639.2.081.7    Fish detection apparatus. Methods and equipment for locating fish (e.g. echolocation, asdic, sonar)

Figure 1. Poly-hierarchies for 'sonar' in the UDC classificatory structure

The characteristics of subdivision are explicitly stated in some parts of this section of the classification scheme (see Class 681.883), namely "sonar systems and equipment according to location" (681.883.06) and "sonar systems according to purpose" (681.883.07). The notational codes used in representing these characteristics of subdivision are treated as facets; the last part of the mentioned codes, starting with .0, is called a special auxiliary. Special auxiliaries can be regarded as 'free-floating subdivisions' and may be attached to more than one

UDC main number. This way, 639.2.081.7 means "fish detection with the help of sonar" and 639.248.081.7 means "hunting of marine turtles with the help of sonar".
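The attachment of a special auxiliary to different main numbers can be illustrated with a toy lookup. The table fragments and the `compose` function below are illustrative only; the auxiliary's gloss is a paraphrase of the text's reading, not the official UDC caption.

```python
# Toy fragments of UDC tables: two main numbers and one special
# auxiliary that may be attached to either of them.
MAIN_NUMBERS = {
    "639.2": "Fishing",
    "639.248": "Hunting of marine turtles",
}
SPECIAL_AUXILIARIES = {
    ".081.7": "with the help of sonar",
}

def compose(main, auxiliary):
    """Attach a special auxiliary to a UDC main number, returning the
    combined notation and a rough natural-language reading."""
    notation = main + auxiliary
    reading = f"{MAIN_NUMBERS[main]} {SPECIAL_AUXILIARIES[auxiliary]}"
    return notation, reading

notation, reading = compose("639.248", ".081.7")
```

The same auxiliary thus yields "639.2.081.7" under Fishing and "639.248.081.7" under Hunting of marine turtles, which is exactly what makes it "free-floating".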

5. Poly-hierarchical Structures as Knowledge Organizers

A poly-hierarchical thesaurus may be regarded as an entity-relationship model. In poly-hierarchical structures the classification criteria of the included entities are explicitly formulated. Likewise, the various types of relationships among those entities are clearly stated. The structure of the records in the poly-hierarchical thesaurus will necessarily include generic-specific relationships for each of the classification criteria applicable to a particular class of entities. Figure 2 describes a poly-hierarchical thesaurus of pharmacology1. It is meant for demonstration purposes only and does not include all the entities and relationships in the given domain. Each entity may be further subdivided and instantiated. The entity types or classes included in the thesaurus are:

E1 – Medicaments
E2 – Medicament compounds
E3 – Active principles
E4 – Raw materials
E5 – Production technologies
E6 – Diseases of the human body
E7 – Parts of the human body

Among these entities, the following relation types are defined:

R1 – Is Beneficial/Detrimental To
R2 – Interacts With
R3 – Includes
R4 – Is Affected By
R5 – Is Influenced By
R6 – Is Composed Of
R7 – Is Used In
R8 – Is Included In
R9 – Makes Use Of

Entity type: Relationship type – Related entity type

E1 – Medicaments:
    R1 – IsBeneficialTo/DetrimentalTo: E6 – Diseases of the human body
    R2 – InteractsWith: E7 – Parts of the human body
    R3 – Includes: E3 – Active principles
    R6 – IsComposedOf: E2 – Medicament compounds
E2 – Medicament compounds:
    R3 – Includes: E3 – Active principles
    R6 – IsComposedOf: E4 – Raw materials
    R8 – IsIncludedIn: E1 – Medicaments
E3 – Active principles:
    R2 – InteractsWith: E6 – Diseases of the human body
    R8 – IsIncludedIn: E2 – Medicament compounds
E4 – Raw materials:
    R7 – IsUsedIn: E5 – Production technologies
    R8 – IsIncludedIn: E2 – Medicament compounds
E5 – Production technologies:
    R9 – MakesUseOf: E4 – Raw materials
E6 – Diseases of the human body:
    R2 – InteractsWith: E7 – Parts of the human body
    R5 – IsInfluencedBy: E1 – Medicaments
E7 – Parts of the human body:
    R2 – InteractsWith: E1 – Medicaments
    R4 – IsAffectedBy: E6 – Diseases of the human body
    R5 – IsInfluencedBy: E3 – Active principles

Figure 2. Entities and relations in a thesaurus of pharmacology
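The entity-relation structure of Figure 2 lends itself to a straightforward machine representation. The following Python sketch is illustrative only: encoding the relationships as a set of (entity, relation, entity) triples is our assumption, not part of the thesaurus project described in the paper.

```python
# Illustrative sketch: the Figure 2 thesaurus encoded as
# (source entity, relation, target entity) triples.
TRIPLES = {
    ("E1", "R1", "E6"),  # Medicaments IsBeneficialTo/DetrimentalTo Diseases
    ("E1", "R2", "E7"),  # Medicaments InteractsWith Parts of the human body
    ("E1", "R3", "E3"),  # Medicaments Includes Active principles
    ("E1", "R6", "E2"),  # Medicaments IsComposedOf Medicament compounds
    ("E2", "R3", "E3"), ("E2", "R6", "E4"), ("E2", "R8", "E1"),
    ("E3", "R2", "E6"), ("E3", "R8", "E2"),
    ("E4", "R7", "E5"), ("E4", "R8", "E2"),
    ("E5", "R9", "E4"),
    ("E6", "R2", "E7"), ("E6", "R5", "E1"),
    ("E7", "R2", "E1"), ("E7", "R4", "E6"), ("E7", "R5", "E3"),
}

def relations_of(entity):
    """All (relation, target) pairs recorded for an entity type."""
    return sorted((r, t) for (s, r, t) in TRIPLES if s == entity)
```

Because each entity type participates in several explicitly typed relationships, a record for, say, E1 can be assembled mechanically from the triples rather than maintained by hand.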

The first class of entities may be divided further into subclasses that describe various types of medicaments according to criteria such as physical properties, physico-chemical properties, chemical properties, methods of storage and preservation, methods of packaging, methods of administration, dosage, effects (or main action) and side effects. These subclasses inherit the attributes of their superclass; they all stand in a generic (“is-a”) relationship with ‘Medicaments’. In their turn, all the types of medicaments classed according to their main action inherit not only the attributes of their immediate superclass but also those of the one above it. Figure 3 illustrates a hierarchy drawn from a subclass belonging to the superclass ‘Medicaments according to their main action’, namely ‘Agents predominantly affecting the nervous system’. Both the terms and the structure of the hierarchy are taken from Class 615 – Pharmacology of the UDC. The example demonstrates the great potential of traditional classificatory structures to serve as knowledge organizers.

Medicaments according to their main action
    Agents predominantly affecting the nervous system
        General anaesthetics
        Analgesics and antipyretics
            Antipyretic analgesics
            Antipyretics
                Antifebrilants / Febrifuges
            Addictive analgesics / Narcotics
        Antiepileptics / Anticonvulsants
        Psychotropic agents
            Neuroleptics / Psycholeptics
                Neuroplegics
                Tranquillizers / Ataractics
                    Psychorelaxants
            Sedatives / Hypnotics / Saporifics
            Psychoanaleptics

Figure 3. Example of hierarchy from the Universal Decimal Classification Class 615 – Pharmacology
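The attribute inheritance described above simply follows the generic chain upward. A minimal sketch, assuming one plausible nesting of the Figure 3 terms (the child → parent mapping below is our reading of the hierarchy, not an authoritative rendering of UDC 615):

```python
# Hedged sketch: part of the Figure 3 "is-a" chain as a
# child -> parent mapping; nesting assumed for illustration.
BROADER = {
    "Agents predominantly affecting the nervous system":
        "Medicaments according to their main action",
    "Psychotropic agents":
        "Agents predominantly affecting the nervous system",
    "Neuroleptics / Psycholeptics": "Psychotropic agents",
    "Tranquillizers / Ataractics": "Neuroleptics / Psycholeptics",
}

def ancestry(term):
    """Walk the is-a chain upward; a subclass inherits the
    attributes of every superclass on this path."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain
```

A term four levels deep thus inherits attributes not only from its immediate superclass but from every class above it, which is exactly the inheritance behaviour the text attributes to the UDC structure.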

The thesaurus contains terms regardless of the entities they instantiate, so that once a search is started the search terms are recognized by matching the string of characters in the search expression. The result displays the terms that satisfy the criteria formulated by the user, together with the records attached to them, in which the term definition and all its relationships are shown explicitly. The search may then be expanded through broader, narrower or related terms in the thesaurus structure.
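The search-and-expand behaviour just described can be sketched in a few lines. The record layout below is hypothetical (the paper does not specify one); the terms are taken from the Figure 3 example.

```python
# Sketch with an assumed record layout: each thesaurus term carries
# BT (broader), NT (narrower) and RT (related) term lists.
RECORDS = {
    "Antipyretics": {
        "BT": ["Analgesics and antipyretics"],
        "NT": ["Antifebrilants / Febrifuges"],
        "RT": [],
    },
    "Analgesics and antipyretics": {
        "BT": ["Agents predominantly affecting the nervous system"],
        "NT": ["Antipyretic analgesics", "Antipyretics"],
        "RT": [],
    },
}

def search(expression, expand=None):
    """Match thesaurus terms by character string, then optionally
    expand the result set through BT, NT or RT relationships."""
    hits = [t for t in RECORDS if expression.lower() in t.lower()]
    if expand:  # e.g. expand="NT" broadens the result set downward
        hits += [n for t in hits for n in RECORDS.get(t, {}).get(expand, [])]
    return hits
```

A query for "antipyretic" thus first retrieves the matching terms and, with NT expansion, also pulls in their narrower terms from the attached records.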

7. Conclusion
The meaning of the FRBR entities is still being debated, as the topics discussed insistently on the FRBR Review Group's listserv amply demonstrate. The conceptual models of FRBR and FRAR may be merged into a single conceptual model covering both bibliographic and authority records, so that entities, attributes and relationships belonging to both models are treated in a unified way.

Issues of aboutness are a central concern in the activity of the Working Group on Functional Requirements for Subject Authority Records (FRSAR) (see http://www.ifla.org/VII/s29/wgfrsar.htm). The group’s main objective is to build a conceptual model of Group 3 entities within the FRBR framework as they relate to the aboutness of works. Its other goals are to provide a clearly defined, structured frame of reference for relating the data recorded in subject authority records to the needs of the users of those records, and to assist in assessing the potential for international sharing and use of subject authority data both within the library sector and beyond. Poly-hierarchical thesauri may be treated as structures based on an entity-relation model, and can consequently be associated with the way the FRBR model treats the bibliographic universe. Despite the formal differences between traditional knowledge representation structures such as classification systems, thesauri and subject heading lists on the one hand, and poly-hierarchical thesauri and ontologies on the other, the latter should exploit the intellectual capital of the former in order to considerably enhance knowledge representation and to support information retrieval at lower intellectual effort.

Notes 1 The thesaurus in the example is part of a project of applying poly-hierarchical structures in semi-automated indexing at the Central University Library of Bucharest. I am grateful to Olimpiu Naicu for his help with this application.

References:
1. Babikov, P., Gontcharov, O., and Babikova, M. (n.d.). Polyhierarchical Classifications Induced by Criteria Polyhierarchies, and Taxonomy Algebra. Retrieved February 20, 2006 from: http://arxiv.org/ftp/cs/papers/0312/031259.pdf
2. Delsey, T. (2005). Modelling Subject Access: Extending the FRBR and FRANAR models. In Le Boeuf, P. (Ed.), Functional Requirements for Bibliographic Records (FRBR): Hype or Cure-all? (pp. 49-61). Haworth Information Press and Cataloging and Classification Quarterly, 39(3/4).
3. Fugmann, R. (1993). Subject Analysis and Indexing: Theoretical Foundations and Practical Advice. Frankfurt/Main: Indeks Verlag.
4. Iyer, H. (1995). Classificatory Structures: Concepts, Relations and Representation. Frankfurt/Main: Indeks Verlag.
5. Kwasnik, B. (1999). The Role of Classification in Knowledge Representation and Discovery. Library Trends, 48(1), pp. 22-47.
6. Patton, G. (2005). Extending FRBR to Authorities. In Le Boeuf, P. (Ed.), Functional Requirements for Bibliographic Records (FRBR): Hype or Cure-all? (pp. 39-48). Haworth Information Press and Cataloging and Classification Quarterly, 39(3/4).
7. Riesthuis, G. J. A. and Žumer, M. (2004). FRBR and FRANAR: subject access. In McIlwaine, I. C. (Ed.), Knowledge Organization and the Global Information Society: Proceedings of the 8th International ISKO Conference, 13-16 July 2004, London, UK. Würzburg: Ergon Verlag, pp. 153-158.
8. Soergel, D. (1999). The Rise of Ontologies or the Reinvention of Classification. Journal of the American Society for Information Science, 50(12), pp. 1119-1120.

Richard P. Smiraglia
Palmer School of Library and Information Science, Long Island University, NY USA

Empiricism as the Basis for Metadata Categorisation: Expanding the Case for Instantiation with Archival Documents

Abstract: Metadata schemas tend to be rationally ordered instruments for the categorization of data about information objects. Instantiation has been demonstrated to be a universal phenomenon. Empirical analysis, both positivist and qualitative, has contributed to typologies of the properties of instantiation. This yields a naïve knowledge organization schema of instantiation. Bibliographic, museum, and archival analyses are compared to demonstrate the value of empirical derivation of categories. In this instance categories, once derived, are demonstrated to represent properties yielding typologies. The empirical generation of categories for knowledge organization is demonstrated.

1. Introduction
In the sphere of metadata, and in particular in the area of knowledge representation, the emerging concept of instantiation holds promise for the construction of increasingly sophisticated retrieval mechanisms. ‘Instantiation’ is the phenomenon addressed by research into bibliographic ‘works,’ and more recently into the ‘content genealogy’ of artifactual representations. Specifically, an instantiation of a work exists whenever the work is manifest in physical form (in a book, for example). A problem arises when multiple instantiations of a work (several editions, translations, etc.) exist and must be collocated (i.e., caused to appear to be adjacent) in a retrieval system, with sufficient information to assist in the selection of the instantiation of interest to a searcher. Similarly, unique artifacts can be represented by metadata or images (called representations), which can exist in multiple instantiations (a photographic negative, a print, its digital descendant, etc.). The same is true of the representations of archival documents, which might exist as paper photocopies, digital images, and so forth. Metadata are categorical descriptors of information resources, often used as alphabetico-classed segments of thesauro-faceted strings for information retrieval. So far, in bibliographic retrieval systems, collocation of this type has been achieved through alphabetical classifiers called uniform titles. Metadata schemas are generated rationally, and sometimes pragmatically, but generally without an evidentiary source.
In this paper, a sequence of research projects is summarized to demonstrate the value of empirical observation in the creation of metadata for ordering collocated instantiations. Studies of bibliographic works, museum artifacts, and personal papers are used to generate typologies of instantiation.

2. The Role of Empiricism
The use of empirical research methods in KO is rare, and has been confined largely to positivist testing of components of retrieval systems. There has been little testing of the principles or assumptions that form the basis of bibliographic control practice; such testing could generate sufficient evidence to turn principles into theories. This absence of empirical research (or of other appropriate methodological approaches) in Knowledge Organization (KO) was criticized by Hjørland (2003), who suggested that developments in practice were likely more technology-driven than theoretically justifiable. Through historical and epistemological analysis of KO he argued for the formal use of four fundamental methods, which parallel four fundamental epistemic stances (2003, 107): empiricism (observations and inductions), rationalism (principles of pure reason, deductions), historicism (study of context and development; explicating pre-understanding), and pragmatism (analysis of goals, values and consequences in both subject and object). That said, a major contribution to theory has come from research into the content and formation of components of library catalogs. In particular, several projects generated substantial evidence about the structure of catalogs and the origins of their source data; one result has been increased confidence in generalization, owing to increased confidence in external validity. These projects--Taylor’s investigation of name headings (Taylor 1982), Potter’s application of Lotka’s law (Potter 1980), Tillett’s analysis of bibliographic relationships (Tillett 1987), and this author’s subsequent suite of research into the ‘works’ phenomenon--are presented in meta-analytical form in Smiraglia (2002a). Hjørland’s appeal to activity theory (2003, 98) is probably most directly relevant to the present project.
The act of naming objects (documents, artifacts, records, and their content, to be specific) is the act of facilitating their use. Terminology thus used cannot be neutral, because of the influence it brings to the activity of facilitating (or obfuscating) use. Metadata schemas tend to categorize based on rationally deduced categorizations of objects and their component parts. Thus rationalized, the schemas predetermine the potential use of intellectual content by facilitating or limiting its retrieval. What I mean to suggest is that the base point for the construction of metadata schemas, particularly those designed to embrace intellectual content, should be empirical observation of the content itself. More colloquially put, letting the documents, artifacts, and records speak for themselves allows their creators (all of them, from authors to printers) to play a role in the use of the intellectual content. In an analysis of metadata schemas and their evolution, Greenberg (2005, 30) emphasized the importance of empirical research, specifically into the problem of instantiation, with the purpose of informing data modeling techniques for object representation. Data modeling, she suggested, is a way of typing objects--that is, naming them to facilitate their use. When multiple instantiations exist side by side in a retrieval context, explicit, content-driven typologies are required to sort them. Empirical observation of instantiations in a variety of networks can yield inherent typologies, such as those demonstrated below.

3. Empirical Research and Instantiation
Studies to date have demonstrated consistent theoretical parameters for the concept of instantiation, even across bibliographic and artifactual borders. In the bibliographic analysis of works, samples drawn from online catalogs formed the basis of study. For each work, all instantiations extant in the bibliographic networks OCLC and RLIN were identified and sorted according to a taxonomy. The initial taxonomy, created for analysis of instantiations of works from an academic library catalog, included seven categories: simultaneous editions; successive editions; translations; amplifications; extractions; adaptations; and performances (Smiraglia 2001, 42). Subsequent research using a sample of works drawn from the OCLC WorldCat (a union catalog) yielded two new types. One was ‘predecessor,’ used for notes or sketches of the work under study, as well as for instances such as novels or screenplays developed from short stories. The other was ‘accompanying material,’ used when the work was included in more than one medium, such as a textbook accompanied by its text on a computer disk. Vellucci (1997) studied musical works and generated two more, music-specific types of instantiation: ‘musical presentation’ and ‘notational transcription.’ More recently Smiraglia (forthcoming) studied best-selling books of the twentieth century and reported a new category--‘persistent works’--denoting works whose instantiation networks develop after the initial publishing frenzy that usually accompanies best sellers. Interestingly, continued tabulation of instantiation networks revealed that the terms represented not so much categories as properties. The categories were not mutually exclusive, but could appear together in the same instantiation. That is, just as a male may be tall, so a translation may have a successive edition or appear in a commentary, or both.
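The observation that the typology terms behave as non-exclusive properties rather than as mutually exclusive categories can be made concrete. In the sketch below, an instantiation record carries a set of typology terms; the record itself is invented for illustration, while the property names come from the taxonomy above.

```python
# Sketch of the 'properties, not exclusive categories' point:
# an instantiation carries a SET of typology terms, so one
# instantiation may be both a translation and a successive edition.
instantiation = {
    "work": "an example work (hypothetical record)",
    "properties": {"translation", "successive edition"},
}

def has_property(inst, prop):
    """A property check; several properties may hold at once."""
    return prop in inst["properties"]
```

Modeling the terms as a set rather than a single category field is precisely the shift from taxonomy to typology that the text goes on to describe.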
This suggests that what we are dealing with is not so much taxonomy as typology. To extend the concept of instantiation, a set of Etruscan artifacts from the University of Pennsylvania Museum of Archeology and Anthropology was examined. The artifacts were, of course, unique. But for each, many representations (ranging from photographs to models), as well as metadata descriptions, existed both in the museum and in publications. It turned out that the representations were often multiply instantiated (Smiraglia 2005a). The instantiations of artifacts, like those of bibliographic works, yielded a typology as well. The typology has two sets. For metadata: finding aids; field notes; letters; conservation treatment notes; register descriptions or object cards; image order invoices; museum database records; and catalog card records. For artifactual representations: field photos; drawings; working images; 3D models; exhibition color images and digitized images; conservation photos; photo archives, including negatives, prints, and transparencies; and object reproductions. Every representation, metadata or artifactual, was accompanied by an object description, whose components constitute an object entity: object type, material, culture, source, collector, and date acquired. A natural question, then, is: can the concept of instantiation be extended to unprocessed, raw data, as in the case of archival evidence? The extension to museum artifacts had been fruitful, and knowledge of the digitization of archival finding aids, as well as the impending digitization of archival resources, meant that instantiation could likely be observed in archival contexts. The U.S. Merchant Marine Academy at Kings Point, Long Island, New York, houses an extraordinary library with rich archives of former midshipmen. In particular, the Class of 1942 is a rich source of documentation.
The archives consist of fourteen “folders” created by members of the class for the purpose of leaving a historical record of the class. These were young men who, between the ages of 16 and 18, chose the seafaring life as a future career. Once admitted to the academy they had to deal with all of the problems of college life (getting along, passing courses, and paying tuition). And yet these young men were destined for greatness. They entered the academy in 1938. Shortly after the attack on Pearl Harbor on December 7, 1941, they found themselves commissioned as officers in the merchant marine, and they were soon in the thick of war. These archives document their lives. For this study one midshipman’s documents were examined. The collection was rich with letters, envelopes, binders, photographs, ship’s deck-logs, time-sheets, scholarship applications, and so forth. Of particular interest was a canvas-bound ring-binder containing ‘orders’--papers ordering military personnel from place to place. The binder was heavy; a note inside indicates it had been issued specifically because it would sink in ocean water. In the event a ship was boarded, these binders were to be tossed overboard so as to be lost to enemy intelligence.

But the enterprising graduates of this class have managed to compile large instantiation networks of the documents pertaining to their service. In the files we located photocopies, carbon copies, digitized scans of postcards containing photographs, scans of photos, photos alongside digitized scans of them, and documents together with both their carbon copies and digitized scans of the originals. Instantiation is present even in this case: these documents demonstrate the occurrence of instantiation among archival entities. The analysis of instantiation demonstrates not only the universality of multiplicity among informing objects, but also the usefulness of empirical strategies for generating categories and their descriptions.

4. Conclusions
The typologies from the three studies can be placed side by side in tabular form for visual impact (see Table 1). The table lends support to Hjørland’s idea that activity theory can prescribe the categorizing activity of knowledge organization. We denote categories so as to assign information objects spatial loci within the acknowledged schema. We usually generate these schemas rationally, but without reference to the content that is to be so schematized. In the present study we have seen how empirical evaluation (both positivist and qualitative) has yielded useful typologies of instantiation.

Bibliographic Works      | Artifacts--Metadata                  | Artifacts--Representations            | Personal Papers
simultaneous editions    | finding aids                         | field photos                          | photocopies
successive editions      | field notes                          | working images                        | carbon copies
predecessors             | letters                              | exhibition color images               | photos
amplifications           | conservation treatment notes         | digitized exhibition images           | postcard with photo
extractions              | register descriptions; object cards  | conservation photos                   | digitized scan of postcard with photo
accompanying materials   | image order invoices                 | archived photographic negatives       | reprint of photo
musical presentation     | museum database records              | archived photographic prints          | digitized scan of photo
notational transcription | catalog card records                 | archived photographic transparencies  |
persistent works         | finding aids                         |                                       |
--------------------------------------------------------------------------------------------------------------------------
translations             |                                      | object reproductions                  |
adaptations              |                                      | drawings                              |
performances             |                                      | 3D models                             |

Table 1. Comparative Instantiation Typologies

The terms in these typologies, empirically derived, represent the properties of instantiation in three contexts, and yet they demonstrate the epistemological properties of mutation and derivation (see Smiraglia 2002b). Derivation denotes types or properties of instantiation in which intellectual content is unaltered; mutation denotes types or properties of instantiation in which intellectual content has been altered semantically or ideationally. In Table 1, terms listed below the solid line represent mutations, which occur in both the bibliographic and the artifactual typologies. According to research to date, the archival-records and artifactual-metadata typologies identify derivations.

In fact, these typologies represent a form of naïve classification system. As Beghtol suggests (2003, 66) the terms in the typologies discover and fill gaps in knowledge about instantiation, reconstruct empirically derived evidence, facilitate integration of findings, and suggest the complexity of the concept known formerly in knowledge organization and bibliographic control, simply, as ‘the work.’ Other consistent elements across these studies of instantiation include: 1) the concept of ‘canonicity’ as a predictor of instantiation; 2) the influence of time as a predictor of the degree of instantiation; 3) transitive relations demonstrated at points of change in semantic or ideational content as predictors of the type of instantiation; and 4) a continued association of the incidence of instantiation with Lotka’s law (Smiraglia 2005b). These results demonstrate the importance of the phenomenon of instantiation for the design and implementation of information systems for a global learning society. Pan- and inter-institutional digital libraries incorporate representations of documentary, artifactual, and archival information resources. In all three cases, instantiation enriches the resource base, but threatens chaos in retrieval. Empirical derivation of instantiation typologies, as demonstrated here, suggests a realistic approach to metadata solutions.

5. References
Beghtol, Clare. 2003. Classification for information retrieval and classification for knowledge discovery: Relationships between “professional” and “naïve” classifications. Knowledge organization 30: 64-73.
Greenberg, Jane. 2005. Understanding metadata and metadata schemes. Cataloging & classification quarterly 40(3/4): 17-36.
Hjørland, Birger. 2003. Fundamentals of knowledge organization. Knowledge organization 30: 87-111.
International Federation of Library Associations. 1998. Functional requirements for bibliographic records. Munich: K.G. Saur. Available http://www.ifla.org/VII/s13/frbr/frbr.htm or http://www.ifla.org/VII/s13/frbr/frbr.pdf
Potter, William Gray. 1980. When names collide: Conflict in the catalog and AACR. Library resources & technical services 24: 3-16.
Smiraglia, Richard P. 2001. The nature of a work: Implications for the organization of knowledge. Lanham, MD: Scarecrow.
Smiraglia, Richard P. 2002a. Further progress in theory in knowledge organization. Canadian journal of information and library science 26(2/3): 30-49.
Smiraglia, Richard P. 2002b. Works as signs, symbols, and canons: The epistemology of the work. Knowledge organization 28: 192-202.
Smiraglia, Richard P. 2004. Knowledge sharing and content genealogy: Extending the “works” model as a metaphor for non-documentary artifacts with case studies of Etruscan artifacts. In McIlwaine, Ia C., ed. Knowledge organization and the global information society: Proceedings of the Eighth International ISKO Conference, 13-16 July 2004, London, UK. Advances in knowledge organization v. 9. Würzburg: Ergon Verlag, pp. 309-14.
Smiraglia, Richard P. 2005a. Content metadata--An analysis of Etruscan artifacts in a museum of archeology. Cataloging & classification quarterly 40(3/4): 135-51.
Smiraglia, Richard P. 2005b. Instantiation: Toward a theory. In Vaughan, Liwen, ed. Data, information, and knowledge in a networked world; Annual conference of the Canadian Association for Information Science … London, Ontario, June 2-4, 2005. Available http://www.cais-acsi.ca/2005proceedings.htm.

Smiraglia, Richard P. Forthcoming. The ‘works’ phenomenon and best selling books. Cataloging & classification quarterly.
Taylor-Dowell, Arlene. 1982. AACR2 headings: A five-year projection of their impact on catalogs. Littleton, Colo.: Libraries Unlimited.
Tillett, Barbara Ann Barnett. 1987. Bibliographic relationships: Toward a conceptual structure of bibliographic information used in cataloging. Ph.D. dissertation, University of California, Los Angeles.
Vellucci, Sherry L. 1997. Bibliographic relationships in music catalogs. Lanham, MD: Scarecrow.

Carol A. Bean
National Center for Research Resources, Bethesda, Maryland, USA

Hierarchical Relationships Used in Mapping between Knowledge Structures

Abstract: User-designated Broader-Narrower Term pairs were analyzed to better characterize the nature and structure of the relationships between the pair members, previously determined by experts to be hierarchical in nature. Semantic analysis revealed that almost three-quarters (72%) of the term pairs were characterized as is-a (kind-of) relationships and the rest (28%) as part-whole relationships. Four basic patterns of syntactic specification were observed. Implications of the findings for mapping strategies are discussed.

1. Introduction
With increasing awareness of the importance and power of computable representations of knowledge today, there is also concomitant interest in aligning and integrating such resources, especially in an automated way. Current automated approaches to mapping among knowledge structures such as ontologies most commonly rely upon lexical matching between individual source and target terms, often augmented by synonymy. Applying other types of semantic relationships (e.g., hierarchical or associative) to the mapping problem is much less common, and remains largely a labor-intensive and cognitively demanding manual task. However, using equivalence relationships is not always possible or appropriate for some types of knowledge mapping. For example, translational science for biomedicine requires knowledge of animal models to be applied to human disease models. The proper conceptual representation or modeling of animal models of disease is essential to realize their full value for understanding human pathological conditions, and is a tremendous challenge in itself. Another, even greater, challenge lies in understanding how conditions in nonhuman species relate to human conditions. In very few cases are the conditions produced in animal models equivalent to the human condition; rather, it is far more common for animal models to present one or more features that have relevance to some particular aspect of the human disease. In some cases, this relevance to a human condition is relatively straightforward; in others, it is quite complex. We also need to be able to search across multiple animal models in order to identify and examine commonalities and differences among them, especially when the fullest picture can only be obtained by combining elements from more than one. As more and more animal models are generated, it is crucial to develop better mechanisms by which animal models are mapped onto human conditions.
Such methods must be both dynamic and intelligent, using automated ways of linking various types of representations to identify equivalent, comparable, or related concepts, using both lexical and ontological approaches. Representations of knowledge for both human and animal models of disease contain information about normal and pathological anatomy, states, and processes; the transitions between and among them; and temporal and causal information on disease processes. Mapping various aspects of animal models onto the human condition using straightforward equivalence relationships alone would be problematic. First, as stated above, the animal model rarely maps directly or wholly onto a human condition. Second, the relevance of a particular animal model may change over time. Third, simple cross indexing of animal models to human conditions renders opaque the underlying rationale for the mapping, both to humans and to machines. Fourth, differences in anatomical structure, as well as physiology and behavior, often preclude direct correspondence between animal and human models of disease. Fifth, because they likely have been created for different purposes, the conceptual models themselves of animal and human disease can vary in many ways (e.g., perspective, granularity) that make it difficult to map among them by any means. Finally, such approaches are not readily extensible to the addition of new models and species. In short, mapping individual features using equivalence relationships is, by definition, inappropriate for this particular task, because we know a priori that there is little likelihood, or even desirability, for exact or complete matches, either lexically or conceptually. Thus, other types of relationships beyond lexical matching and synonymy must be used for at least some types of mapping tasks. Using the basic organizational structure inherent to most ontologies and other knowledge structures, that is, hierarchy, seems a natural alternative.
The role of hierarchical relationships in conceptual mapping among knowledge structures warrants investigation.
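The hierarchical fallback motivated above can be sketched simply: attempt an exact lexical match first, and fall back to the source term's broader term when none is found. The vocabularies below are invented for illustration; the strategy, not the data, is the point.

```python
# Hedged sketch of hierarchy-assisted mapping: exact lexical match
# first, then the source term's broader term. Vocabulary contents
# are invented for illustration.
TARGET = {"transaminases, elevated", "bladder symptoms"}
BROADER = {"elevated sgot": "transaminases, elevated",
           "bladder pain": "bladder symptoms"}

def map_term(source):
    """Map a source term into TARGET, reporting how the match was made."""
    s = source.lower()
    if s in TARGET:
        return s, "exact"
    if BROADER.get(s) in TARGET:
        return BROADER[s], "broader"   # hierarchical fallback
    return None, "unmatched"
```

Recording the match type (exact vs. broader) keeps the rationale for each mapping visible to both humans and machines, which direct cross-indexing does not.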

2. Approach
A natural test set of human terminology mappings exists in the 1996 Large Scale Vocabulary Test (LSVT) performed in the USA (Humphreys, McCray, & Cheh, 1997). As previously described, over 40,000 terms were mapped from a variety of health-related vocabularies to their nearest matches in the Unified Medical Language System. Exact matches in the target set were found for 58% of the source terms. Another 28% of source terms were mapped to broader terms and 3% to narrower terms as the best match in the target term set. (Another 10% were mapped to “related terms,” with no match found for 1% of the source terms.) In other words, human testers found that for a third of all source terms, the closest matches in the target terminology were hierarchically related, that is, either broader or narrower terms. Two studies have examined the two LSVT data subsets in which the user-designated best target matches were respectively broader or narrower in meaning than the source terms (McCray & Browne, 1998; Bean 2000). In both studies, most (over 60%) of the source and target term pairs were found to differ by the addition of pre- or post-modifiers to the head concept of the term. In contrast to McCray and Browne’s results, Bean’s (2000) examination of 805 BT-to-NT mappings also identified a substantial subset of term pairs (21%) in which there was no such simple addition of words or phrases to a core term. These mappings seemed to require hierarchical reasoning on the part of the users. (See Bean 2000 for more details of the test set of BT-NT pairs.) The current research analyzed these BT-NT term pairs to characterize both the semantic and syntactic relationships between the members of 193 such term pairs, previously determined by experts to be hierarchical in nature.
The results of the analysis are considered in two main ways, first, according to the type of hierarchical relationship characterizing the difference between the members of the BT-NT term pair, and then, according to the structural location of the difference.
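The modifier-addition pattern reported by McCray and Browne and by Bean can be screened for mechanically. The heuristic below is our own illustration, not the studies' method: it checks whether every word of the broader term survives, in order, inside the narrower term, so that the narrower term amounts to the broader term plus pre- or post-modifiers.

```python
# Illustrative heuristic (invented for this sketch): does the
# narrower term simply add pre-/post-modifiers around the words
# of the broader term?
def modifier_addition(bt, nt):
    """True if the broader term's words appear, in order, as a
    subsequence of the narrower term's words."""
    bt_words, nt_words = bt.lower().split(), nt.lower().split()
    it = iter(nt_words)  # membership tests consume the iterator,
    return all(w in it for w in bt_words)  # giving a subsequence check
```

Pairs that fail this screen, such as "brachiocephalic vessels" → "brachiocephalic veins", are the ones that seem to require genuine hierarchical reasoning rather than syntactic specification.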

3. Semantic Analysis of Relationship Types
Is-a (kind-of) relationships dominated, accounting for the hierarchical specification in almost three-fourths (72%) of the BT-NT term pairs. The rest (28%) were primarily part-whole relationships.

Examples of is-a (kind-of) relationships

• brachiocephalic vessels → brachiocephalic veins
Veins are a type of blood vessel, which is any channel for carrying fluid. Brachiocephalic vessels are those vessels located in or serving the arm and head, while brachiocephalic veins are two specific veins that drain blood from the head, neck, and upper extremities.

• elevated transaminases → elevated SGOT
Serum glutamic-oxaloacetic transaminase (SGOT) is one type of transaminase.

• ophthalmopathy fungal → fungal conjunctivitis
Fungal ophthalmopathy is any eye disease caused by or occurring as a complication of a fungal infection. Conjunctivitis is more specifically an inflammation of the conjunctiva.

• bladder symptoms ĺ bladder pain Painis a symptom. Bladder symptoms can include pain.

• loss of subcutaneous tissue ĺ loss of subcutaneous fat Subcutaneous fatis a kind of subcutaneous tissue.

• signs of recent bleeding ĺ evidence of recent epistaxis A sign indicates the evidence of something, so these terms are roughly synonymous. Epistaxis specifically refers to nosebleed.

• substance-induced cardiac disease ĺ cardiomyopathy, alcoholic Alcoholic cardiomyopathyis a form of cardiac disease induced by the substance type alcohol.

• schwannoma ĺ neurilemoma A schwannoma is a neoplasm originating from the myelin sheath of neurons. The two types are neurilemoma and neurofibroma.

• V tach ĺ torsades de depointes V tachis shorthand for ventricular tachycardia and torsades de depointes is a particular type of atypical, rapid ventricular tachycardia.

• brain cancer ĺ astrocytoma Brain cancer refers to neoplastic disease originating in the brain that is fatal if left to the natural course. Astrocytoma is the most common type of primary brain tumor, although it also occurs in other parts of the central nervous system besides the brain and varies in malignancy and prognosis.

Examples of part-whole relationships

• uterine cancer → endometrial cancer. The endometrium is the inner mucous membrane of the uterus. Cancer of the uterine corpus usually, but not exclusively, involves the endometrium.

• mouth discoloration → gingival discoloration. The gingiva is the gum, a part of the oral mucosa, located in and part of the mouth.

• functional class (NYHA) → NYHA class III. The New York Heart Association (NYHA) Functional Classification has four categories classifying the extent of heart failure based on limitations of physical activity; class III is one of the four categories.

• body modification → ear piercing. Ear piercing is a form of body modification, localized to the ear, which is a part of the body.

• mortuary science → embalming. Embalming is one component of mortuary science, which also includes such topics as business, psychology, anatomy, and restorative art.

4. Analysis of Syntactic Specification

In phrases, one part, called the head, is the governing element. The rest of the phrase, governed by the head, is called the modifier. For example, in noun phrases the noun is the head; in verb phrases, the verb; in prepositional phrases, the preposition; and so forth. In syntactic, or structural, term specification, the semantic meanings of the terms are modified by virtue of changes in the syntax or structure of the phrase. In the larger LSVT subsets described above (Bean 2000; McCray & Browne 1998), the BT-NT pairs were made more specific by the addition of a modifier (pre- or post-) or by the conjoining of an additional noun phrase. In the current set, too, the specification had a structural, or syntactic, aspect. Four basic patterns were observed, as shown below.

a. Specification occurring only in the head of the noun phrases constituting the term pairs:

• brachiocephalic vessels → brachiocephalic veins

• elevated transaminases → elevated SGOT

• ophthalmopathy fungal → fungal conjunctivitis

• bladder symptoms → bladder pain

b. Specification occurring only in the modifier of the noun phrases constituting the term pairs:

• loss of subcutaneous tissue → loss of subcutaneous fat

• signs of recent bleeding → evidence of recent epistaxis

• uterine cancer → endometrial cancer

• mouth discoloration → gingival discoloration

• functional class (NYHA) → NYHA class III

c. Specification occurring in both the head and the modifier portions constituting the term pairs:

• substance-induced cardiac disease → cardiomyopathy, alcoholic

• body modification → ear piercing

d. Specification with no clear syntactic relationship within the term pairs:

• schwannoma → neurilemoma

• V tach → torsades de pointes

• brain cancer → astrocytoma

• mortuary science → embalming
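The four patterns can be approximated mechanically. As a hedged illustration (not part of the original study), the sketch below assumes the head of an English noun phrase is its last token and the modifier is everything before it; this crude heuristic handles pre-modified phrases like those in patterns a and b, but cannot reliably detect pattern c and misreads post-modified phrases such as "loss of subcutaneous tissue".

```python
# Naive classifier for the syntactic specification patterns (a-d).
# Assumption (ours, not the paper's): head = last token of the phrase,
# modifier = the remaining tokens. Pattern c and inverted or post-modified
# phrases need real parsing and semantic analysis, as the paper notes.
def classify_pair(broader, narrower):
    bt = broader.lower().split()
    nt = narrower.lower().split()
    if not set(bt) & set(nt):
        return "d"  # no shared words at all: purely semantic relationship
    bt_mod, bt_head = " ".join(bt[:-1]), bt[-1]
    nt_mod, nt_head = " ".join(nt[:-1]), nt[-1]
    if bt_head != nt_head and bt_mod == nt_mod:
        return "a"  # specification only in the head
    if bt_head == nt_head and bt_mod != nt_mod:
        return "b"  # specification only in the modifier
    return "c"      # both parts changed but some material is shared

print(classify_pair("brachiocephalic vessels", "brachiocephalic veins"))  # a
print(classify_pair("uterine cancer", "endometrial cancer"))              # b
print(classify_pair("schwannoma", "neurilemoma"))                         # d
```

Even on this tiny sample the token heuristic works only for the "clean" cases; pairs such as "body modification → ear piercing" (pattern c) share no surface words and would be misfiled under d, which is precisely the paper's point that syntactic analysis alone is insufficient.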

5. Discussion

Results from this and related studies of the LSVT clearly indicate that humans use a variety of different mechanisms to determine close conceptual relationships among terms. Similarly, an automated system for mapping terminologies and aligning ontologies should accommodate a variety of conceptual relationships. In situations where the presumed gold standard, an exact match, is neither available nor appropriate, the next best choice would be a term that is hierarchically related in some way. It is also clear that syntactic analysis is necessary in an automated system, but it is not sufficient alone for these tasks: semantic and lexical elements are also critical.

Several additional observations suggest more subtle patterns that could not be reliably detected in this relatively small sample set. For example, the specification appeared to occur more commonly in the modifier portion of the term than in the head. Where specification occurred in the head of the term, the broader term was frequently at the level of a highly generalized class, such as impairment, problem, discomfort, or change. Both of these observations might reflect a fundamental “privilege” attaching to concepts in preference to attributes or relationships.

Another observation that warrants further investigation is the apparent importance of semantic type in mapping decisions. In all three studies of the hierarchical mappings, decisions were most often made along axes corresponding to anatomical and functional (especially pathological) semantic types. This is not at all surprising considering the goals of the biomedical enterprise and the centrality of anatomical knowledge to it. Such a preference might suggest weighting these semantic types over other candidate semantic types when other factors do not indicate clear directions for mapping.
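The weighting strategy suggested above could be sketched as follows. The relation labels, semantic types, and numeric weights here are invented for illustration; they are not taken from the LSVT studies.

```python
# Hedged sketch: prefer exact matches, then hierarchically related terms,
# and break ties by favouring anatomical/functional semantic types.
# All weights below are illustrative assumptions, not empirical values.
RELATION_WEIGHT = {"exact": 10.0, "broader": 3.0, "narrower": 3.0, "related": 1.0}
TYPE_WEIGHT = {"anatomy": 2.0, "function": 2.0}  # other types default to 1.0

def score(candidate):
    """candidate: dict with 'relation' and 'semantic_type' keys."""
    return (RELATION_WEIGHT.get(candidate["relation"], 0.0)
            * TYPE_WEIGHT.get(candidate["semantic_type"], 1.0))

def best_match(candidates):
    """Pick the highest-scoring candidate mapping for a source term."""
    return max(candidates, key=score)

candidates = [
    {"term": "cardiomyopathy", "relation": "narrower", "semantic_type": "function"},
    {"term": "cardiac disease", "relation": "exact", "semantic_type": "function"},
]
print(best_match(candidates)["term"])  # cardiac disease
```

The exact-match weight is deliberately set high enough that no semantic-type bonus can outrank it, which encodes the paper's ordering: exact match first, hierarchical match second, semantic type only as a tie-breaker.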

6. References

Bean CA. (2000) Mapping down: Semantic and structural relationships in user-designated Broader-Narrower Term pairs. In: Beghtol C, Howarth LC, Williamson NJ (eds.) Dynamism and Stability in Knowledge Organization. Proceedings of the 6th International ISKO Conference, 10-13 July 2000, Toronto, Canada. Würzburg: Ergon Verlag, 301-305.

Humphreys BL, McCray AT, Cheh ML. (1997) Evaluating the coverage of controlled health data terminologies: Report on the results of the NLM/AHCPR Large Scale Vocabulary Test. Journal of the American Medical Informatics Association 4(6):484-500.

McCray AT, Browne AC. (1998) Discovering the modifiers in a terminology data set. Journal of the American Medical Informatics Association, Symposium Supplement, 1998:780-4.

Francisco Javier García Marco
Universidad de Zaragoza (Spain)

Understanding the categories and dynamics of multimedia information: a model for analysing multimedia information

Abstract: A model for analysing multimedia information is proposed from the point of view of the theory of communication. After a brief presentation of the complex map of the sciences that deal with multimedia communication in its different aspects, the current multimedia revolution is historically contextualized as a tendency towards messages that are able to build near-reality experiences (virtual reality). After setting the theoretical point of view, an analysis of multimedia messages is substantiated and a model is presented. The first part of the model deals with the different communication channels and tools: still images, movies, sounds, texts, text with illustrations, audiovisuals and interactive multimedia, with an emphasis on non-textual documents. The second part addresses the global properties of the multimedia message, which are of a textual and metatextual nature. The overlapping of media, channels, genres and messages—and the conscious and technical use of such interactions—is precisely one of the main and outstanding characteristics of multimedia discourse, and requires specific moves in indexing language development. The multimedia environment also has great potential to promote a wider theory of knowledge organization, bringing closer such distant fields as scientific and fictional indexing, or verbal and image indexing. It is stated that such a unified theory requires closer attention to the pragmatic aspects of indexing and the inclusion of new semantic layers. A simple indexing model is proposed to illustrate how to address these challenges.

1. Introduction (1)

The aim of this paper is to propose a model for analysing multimedia information building on the point of view of the theory of communication (2). Such a quest must begin with an acknowledgment of the complexity of the field, as the landscape of the disciplines of communication is impressive. As an illustration of this complexity, Figure 1 provides a sketch of only the twenty-two disciplines related to the study of signs and their use, that is, to semiotics.

Fig. 1. Sciences of the sign: an approach to the complexity of communication science

Another important premise for such a model is a working definition of multimedia communication—which can be described as communication that is achieved through more than one channel or medium—and, above all, precision about its implications. In fact, a multimedia nature is inherent to natural communication, where the agents, both being present, convey information through different visual and auditory cues. As presence is a very limiting characteristic of natural communication, technologies have been invented to allow for communication in the absence of the sender—for example, by using letters. But, for economic and technical reasons and with very rare exceptions, these technologies use a single channel and genre. This made mediated communication abstract and even difficult when the agents did not share a very detailed knowledge of the frame of reference of the messages.

So, from an evolutionary or historical perspective, computer-mediated multimedia communication and the current multimedia revolution can be considered as a tendency towards creating asynchronous messages capable of inciting in the receiver almost-natural experiences and, eventually, of arousing near-reality experiences (virtual reality). Such a natural experience requires multimedia, but also a truly synchronous communicative experience between the sender and the receiver, which only current interactive information technology can provide (Fig. 2).

Fig. 2. An evolutionist perspective of multimedia communication: the path towards an induced “natural” experience

In conclusion, messages—and therefore documents—are considered an effort to create a sensorial and intellectual world in the mind of the receiver. The current stage of this evolution is the integration of the different information channels and resources to build an overwhelming sensorial and intellectual experience. The advances in this direction can best be appreciated in environments such as certain scientific applications or state-of-the-art computer games. Information and document centres have followed the path of the available document technologies, creating specialized centres for each kind of media as it appeared—video, maps, drawings, photos, video libraries and archives—and now follow the path toward integration and convergence. So, an integral approach toward documents is necessary, in which multimedia is not considered something apart, but the very nature of communication. From this perspective, written, visual and sound documents are granted the status of partial and incomplete approaches toward the aim of a global communication experience: one that can be reconstructed by imagination and that, currently, can be induced through the senses with a sensorial fidelity that depends on the technology.

2. Analysing multimedia messages: toward a model

Multimedia, though a very powerful tool for communication, poses major challenges for indexing and, therefore, for its preceding phase, content analysis. Such challenges come from two different sources: on the one hand, multimedia is constituted by the aggregation of different items expressed in different languages that must be understood on their own terms; on the other hand, the relation between these channels of meaning transfer is not univocal, but emergent, dialectic and systemic. As a result, decoding and making explicit the information conveyed in a multimedia message is a very complex process that requires enhanced analysis tools and models. The outline of such a model is presented below. The model is divided into two main parts, according to the above-mentioned challenges: the first is devoted to the component channels that constitute a multimedia message; the second, to the global systemic qualities of the message that govern and integrate the whole meaning and intention of the communication act. The model is structured in five levels of complexity:

1. Component elements
   a. Image
      i. Static elements
         1. Light environment
         2. Composition
         3. Spaces
            a. Natural spaces
            b. Urban spaces
            c. Relational spaces
         4. Objects
         5. Human images
            a. Physiognomy
            b. Clothing and garments
      ii. Dynamic elements
         1. Natural processes
         2. Human activities
            a. Expressive activities
            b. Instrumental activities
         3. Objects behaviour

   b. Sound
      i. Environmental sound
         1. Natural sound
         2. Environmental music
      ii. Human voice
         1. Basic properties
            a. Intensity
            b. Pitch
            c. Timbre
            d. Duration
         2. Processes [paralinguistic]
            a. Emotionality
            b. Relationship status
            c. Interaction modalities
            d. Emphasis
            e. Non-verbal vocal designation
      iii. Music
   c. Verbal
      i. Languages
      ii. Idiolects
      iii. Acts of speech
      iv. Syntax
      v. Lexical level [including lexical-morphological aspects]
         1. Abstractions
         2. Generic concepts
         3. Specific concepts
         4. Exemplars
      vi. Phonetics
   d. Graphics [artificial languages, not images] [see Verbal]
   e. Interactions
      i. Navigation
      ii. Retrieval
      iii. Transformations
         1. Layout
         2. Calculus
2. Systemic aspects
   a. Global qualities
      i. Metatextual level
         1. Intentionality
         2. Relation with reality
            a. Factuality
            b. Credibility
         3. Sociocultural frames
         4. Actants [roles]
         5. Spatiotemporal frame
         6. Channel/medium
         7. Script
      ii. Textual level
         1. Genre
            a. Subgenre
            b. Document type
         2. Subject
            a. Referential

            b. Relational
         3. Rhema [new information]
         4. Argument
         5. Rhetorical and communicational paradigms and methodologies
   b. Interaction processes among the different channels
      i. Basic interaction
         1. Concentration
         2. Complementarity
         3. Addition
         4. Contradiction
      ii. Temporality
         1. Synchrony/Asynchrony
         2. Isolated presentation/Continuity/Recursivity
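To show how the outline above could be made machine-readable, for example to drive an indexing checklist, here is a small fragment encoded as a nested Python dictionary. The labels follow the outline; the encoding itself is our illustrative assumption, not part of the model.

```python
# Fragment of the five-level analysis model as a nested dict;
# lists hold leaf categories, dicts hold intermediate levels.
MODEL = {
    "component elements": {
        "image": {
            "static elements": ["light environment", "composition", "spaces",
                                "objects", "human images"],
            "dynamic elements": ["natural processes", "human activities",
                                 "objects behaviour"],
        },
        "sound": {
            "environmental sound": ["natural sound", "environmental music"],
            "human voice": ["basic properties", "paralinguistic processes"],
        },
    },
    "systemic aspects": {
        "global qualities": ["metatextual level", "textual level"],
        "channel interactions": ["basic interaction", "temporality"],
    },
}

def leaves(node):
    """Yield every leaf category, whatever its depth."""
    if isinstance(node, list):
        yield from node
    else:
        for child in node.values():
            yield from leaves(child)

print("composition" in list(leaves(MODEL)))  # True
```

A recursive traversal like `leaves` mirrors the recursive analysis the next paragraph describes: the same walk can be repeated on any subsystem of the "text".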

It must be stressed that the model can work as a universal one—that is, it can be applied to specific media like photographs, texts, drawings, etc.—as long as only the component elements of the respective media are considered. The global characteristics of a message apply to any kind of document, and also, if the document has distinct parts, to each of these subsystems—paragraphs, chapters, illustrations, videos, interactive commands, etc. In fact, the analysis can be done in a recursive manner depending on the number of levels or layers in which the “text” (3) is organized. In this latter case, it seems useful to remark that the different levels and subsystems are always organized towards the same aim, even when it does not seem so for various reasons—be it that the author wants to leave meanings open, to show confusion or ambivalence, or is unskilled in transmitting the message.

It can also be useful to share the experience that, for the end purpose of indexing a multimedia document, it is seldom necessary—except for very complex materials like, for example, textual or visual poetry—to repeat the analysis recursively at every level of semiosis. On the contrary, it is usually enough to analyse only key elements at the different stages of the pyramid of meaning. As a general rule, lower-level characteristics will be considered only if they depart from the generic prototype they belong to; otherwise they must be considered efficiently summarized at the upper level. In the two following sections, the general landscape of both the component elements and the global properties of messages will be presented, and some selected key characteristics will be discussed.

3. Analysing the channels and their components

The basic channels of communication are well known: still images, movies, sounds—verbal and nonverbal—texts, text with illustrations, audiovisuals, graphic artificial languages and interactive multimedia. Regarding images, the elements to be analysed can be classified into static and dynamic ones. The static elements considered are the light environment, composition, spaces—natural, urban and relational—objects and the structural elements of the human image—physiognomy and clothing. The dynamic elements are the processes going on—natural, human and artificial—and the systemic properties of their interaction. Still images also have dynamic elements, of course, though they are presented as an instant or a potential, not as fully developed movement. In any case, this kind of summarization is inherent to visual documents, since telling stories, picturing, drawing or filming is always about selecting, abstracting and suggesting, as usually a long period of time must be told in a very short one.

For its part, sound is analysed in three groups—environmental sound, human voice, music—and their interactions. Environmental sound can be divided into unintended sounds—“natural” sounds—and background music, which is nearer to the third category. Human voice can be analysed according to its basic qualities—tone, intensity, timbre, etc.—and some basic para- or pre-linguistic processes—those that denote emotionality, relational tags, interaction modalities, emphasis and vocal designation. Verbal behaviour—be it expressed by sounds or written—can be analysed at different levels: the languages selected, the peculiarities of language use—idiolects and acts of speech—syntax, the lexical level—abstractions, generic concepts, specific concepts and exemplars—and even phonetics. Verbal behaviour in general lacks the immediate emotional appeal of images and sounds, but has a very strong relational power and allows deliberate ‘mind computing’ in short-term memory. Verbal behaviour has evolved into artificial and programming languages that are the background of interaction, one of the main qualities of multimedia.

Besides verbal behaviour, a group of languages exists that are also conventional but use graphic symbols instead of words. This world of iconic messages—some of them very elaborate, as occurs in cartography, industrial design, geometry or statistics—keeps growing in size and importance. Another kind of graphic language of great importance is the graphic publishing languages—layout techniques and typefaces—which are an important way of conveying the meaning that is usually transmitted with paralinguistic and other non-verbal resources in the verbal world.

4. The systemic and emergent aspects of multimedia messages

The different analytical components of the multimedia message together form an emerging unit of meaning and communication. This unit—the message—can be characterized by certain distinct global properties that can be classified into two levels: metatextual and textual. On the one hand, metatextual qualities are of a social nature, and express the sociocultural context of the message: the actants, the intentionality, the social and spatiotemporal frames and scripts, and the communication channel that are acting in the whole situation. On the other hand, textual properties are of a semiotic nature, and can be organized according to its classic subdisciplines: pragmatic—genre—, semantic—subject, rhema—and syntactic—argument, rhetorics, etc. Finally, the kind of connection among the different channels of the multimedia message must also be considered, since, using a mathematical metaphor, the different components can be added and subtracted in different ways to modulate the final meaning. The systemic and emerging meaning of multimedia messages has partially been the focus of traditional content and document analysis (Table I).

Relative Pronouns   Explanation                  Document analysis vocabulary   Ranganathan facets
Who                 Agents                       Subjects                       Personality
What                Domain, Objects, Processes   Subjects                       Matter
How                 Activities                   Modalities                     Energy
Where, When         Space, Time                  Circumstances                  Space, Time

Table I. Semantic elements of a message in content and document analysis

But a theory of communication that goes beyond scientific and technical documents—which are usually of a controlled pragmatic and semantic nature—necessarily becomes much more intricate. When analysed, many multimedia messages emerge as communication “icebergs”, where very little is actually told, much is simply intended or tacitly given as known, sense is conveyed through different kinds of meaning—defined, denoted, connoted and designed—and even organized in a hierarchy of levels of signification and abstraction, and where the very pragmatics of the process remain obscure even to their agents. The complexity also affects the agents—since persons can act in such different roles as references, channels, actual and potential targets, etc.—and their intentions. After all, communication is not only a way of transferring true knowledge, but also a technique to influence the receiver. In fact, intentionality is a multilevel reality: there is a predicated aim, possibly a shared one; always a subjective intention; and, finally, a meta-intentionality that is given by the social and cultural system and is not always perceived by the agents. In any case, an integrated theory of knowledge organization cannot be achieved except by recognising and using the systemic properties of any message, which allow the indexer and, in general, the content analyser to treat it as a unit. In this sense, messages must be considered systems, that is, sets of elements in dynamic interaction organized for a goal (Rosnay, 1979).

5. The multi-channel nature of multimedia information and its implications for Knowledge Organization

It has been argued that the overlapping of media, channels, genres and messages—and the conscious and technical use of such interactions—is precisely one of the main and outstanding characteristics of multimedia discourse. This multi-channel nature of multimedia documents has an important theoretical implication for knowledge organization: as a multimedia indexing theory must take into account all the different channels, it can and must, therefore, explain the indexing of each kind of material and thus provide the frame for a global theory of knowledge organization. This would allow overcoming some of the big barriers that prevail nowadays in indexing and classification theory, mainly the ones that keep apart scientific and technical documents from fiction, and graphic materials from textual ones.

A second implication of the multimedia revolution is the need to put order in the world of indexing terms for media, audiences and genres. This has become an imperative task in the new multimedia context, because in such an environment the same information circulates through different channels and in different forms, targeting different audiences, and this situation should be controlled. The very concept of channel or medium has also become very complex, as diverse media are now involved in the different stages of creation, distribution, storage and reproduction of documents. Something similar occurs with genres. Translating this approach to a theoretical point of view, it could be stated that indexing must incorporate more clues about the pragmatics of communication, since, in the new information society, information is no longer guided to target audiences by quasi-unidirectional channels—like teaching material through classes or entertainment through theatres and cinemas—but through a net of interconnected media.
In this context, media, audience and genre are no longer implicit characteristics of a communication process, but facets that must be predicated in a specific way with the help of proper knowledge organization tools. Because of their universality and specificity, it seems a good goal to develop specific languages for media, audiences and genres, as has been done for toponyms, entities or personal names. But this approach should then be completed with relations to other, more general subject thesauri and classifications. In any case, the constitution of completely separated tools should be avoided, because such a situation would provoke the emergence of unintended barriers to semantic navigation among related concepts, as, ultimately, all those different vocabularies refer to integrated domains of knowledge that must be interrelated.

6. The limits of current semantic indexing: some strategies for action

Another question that must be addressed is that traditional indexing rarely takes into account the whole set of characteristics presented above as systemic aspects of the message. On the one hand, pragmatic aspects other than the context of study—disciplines—are not considered; on the other hand, the underlying semantic model that supports content analysis is usually very positivistic and unable to deal with the loose or fuzzy meanings that multimedia messages frequently convey.

Regarding the pragmatics deficit, normal indexing theory seems fixed on the semantics of messages—that is, on subjects such as processes, agents, patients, tools, places, etc.—but, though the global properties of a message are undoubtedly of a semantic nature, they are mainly of a pragmatic one. As Van Dijk (1977) states: “Discourse coherence is not primarily a matter of meaning, but of reference”. In some way, pure semantic indexing was possible when such pragmatic aspects were unambiguously implicit in the knowledge organization tools, because only very specific literature was subjected to indexing, mainly scientific or technical. But the media revolution and its cultural and industrial implications demand a new approach toward indexing. In fact, a more complex society requires more clues about the relations of messages with their users. For example, even in the scientific realm, interdisciplinary research has called into question the implicit relation between subjects and disciplines. Descending to a more practical point of view, an alternative approach (see Table II below) would require indexing not only the subject of the message, but also aspects like

• who creates the message (4),
• who can use it,
• the spatial, temporal, social and scientific-technical contexts of its production and reception,
• the instrumental aspects of the communication—media channels, genres, textual structures—,
• and their intentions and interpretation.

This approach also has a key practical implication for current technological and social developments: it can help to connect users and information objects automatically in different ways, promoting the implementation of the gateways implied in the semantic web approach (5).

On the other hand, semantic indexing is also facing big challenges from inside. With the multimedia revolution, a growing percentage of content is leaving the relatively unambiguous path of scientific and journalistic genres and approaching the semantic jungles of creative literature. This kind of material requires the complex kind of content analysis that is characteristic of the artistic disciplines, which can be summarized in the simplified model proposed by the art historian Panofsky (1955)—also applicable by analogy to the analysis of fiction—: description of common subjects; identification of the specific names and exemplars of these common subjects; and interpretation of the abstract subjects that are being represented through the former.

All these questions are taken into account in Table II, which shows an outline of an extended indexing model that considers the semantic and pragmatic complexities of multimedia indexing, as opposed to the one shown in Table I. Here the scope of the traditional pronouns changes to denote a pragmatic approach instead of the usual semantic one, which is denoted by the interrogative “what”. The subject domain (“what”) is also divided into three realms—common names, exemplars and abstract concepts—a division that is always necessary for dealing with non-verbal materials.

Relative Pronouns     Explanation                          Proposed term
What (1): common      Domain, Objects, Processes,          Concrete subjects
                      Tools, Actants, Space, Times
What (2): exemplars                                        Exemplars’ subjects
What (3): abstract                                         Abstract subjects
Who                   Subjects                             Producers
To whom               Potential and actual users           Audiences
From where            Disciplines                          Perspective
To where              Paradigms, -isms
How                   Modalities                           Media, Genres, Textual structures
Where, When           Circumstances                        Scenery: Spaces, Times
For what              Aims                                 Intentionalities
Why                   Causes                               Interpretations

Table II. An extended model of indexing for representing pragmatic and multi-level semantic characteristics
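As a hypothetical illustration of how the facets of Table II might be carried in practice, the record structure below gives each proposed term its own field. The field names paraphrase the table’s “Proposed term” column and are our assumption; the paper does not prescribe a schema.

```python
from dataclasses import dataclass, field

@dataclass
class MultimediaIndexRecord:
    """Illustrative record carrying Table II's pragmatic and semantic facets."""
    concrete_subjects: list = field(default_factory=list)   # what (1): common
    exemplar_subjects: list = field(default_factory=list)   # what (2): exemplars
    abstract_subjects: list = field(default_factory=list)   # what (3): abstract
    producers: list = field(default_factory=list)           # who
    audiences: list = field(default_factory=list)           # to whom
    perspective: list = field(default_factory=list)         # from where: disciplines
    media_and_genres: list = field(default_factory=list)    # how: media, genres, structures
    scenery: list = field(default_factory=list)             # where, when
    intentionalities: list = field(default_factory=list)    # for what: aims
    interpretations: list = field(default_factory=list)     # why: causes

# Invented example for a hypothetical documentary clip about body modification.
record = MultimediaIndexRecord(
    concrete_subjects=["ear piercing"],
    producers=["documentary film-maker"],
    audiences=["teenagers"],
    media_and_genres=["video", "documentary"],
)
print(record.audiences)  # ['teenagers']
```

Keeping audiences, producers, and media as first-class fields, rather than folding everything into a single subject string, is exactly the pragmatic turn the section argues for.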

Notes

(1) This paper builds on previous work done by the author for a collective book written with his colleagues María Pinto Molina and Carmen Agustín Lacruz (2002). Though he alone is responsible for the ideas expressed here, he is indebted to them for the discussions and invaluable suggestions and feedback.

(2) We are referring here to the theory of “natural” communication as opposed to “artificial” communication, which, apart from the splendid communication model (Shannon & Weaver, 1949), deals mainly with the transfer of signals. “Natural” communication is concerned with effective and efficient signal transfer, but also with the pragmatic, semantic and syntactic aspects of the phenomenon.

(3) The use of the term “text” for referring to multimedia documents must be understood in a very wide sense: text is defined here as the documented form of a discourse; and discourse, as an architecture of symbols—organized in different structures—forming a message to support the communication between persons—or their artificial surrogates.

(4) In fact, this is possible in cataloguing through the use of authority records.

(5) It is also very important to call attention to the modular nature of the multimedia message. This kind of information carrier can be disaggregated and its parts reused and re-oriented. This is more and more a strategy in industry and also in pop multimedia culture. Though this reality has direct implications mainly for multilevel item control and cataloguing, it is also important for indexing and classification, as it requires that certain materials that traditionally have not been indexed—because they were expected to be retrieved in the upper context or were considered raw material, like book illustrations—will have to be processed.

References

Pinto Molina, M., García Marco, F. J., & Agustín Lacruz, M. C. (2002). Indización y resumen de documentos digitales y multimedia: técnicas y procedimientos. Gijón: Trea.

Panofsky, E. (1955). Meaning in the visual arts: papers in and on art history. Garden City, N.Y.: Doubleday.

Rosnay, J. de (1979). The macroscope: a new world scientific system. Translated from the French by Robert Edwards. New York, Hagerstown, San Francisco, London: Harper & Row.

Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.

Van Dijk, T. (1977). Semantic macro-structures and knowledge frames in discourse comprehension. In M. A. Just & P. A. Carpenter (eds.), Cognitive Processes in Comprehension (3-32). Hillsdale: Lawrence Erlbaum.

Rob Hidderley* and Pauline Rafferty**
*Department of Computing, University of Central England, Birmingham
**Department of Information Science, City University, London

Flickr and Democratic Indexing: Disciplining Desire Lines

Abstract: In this paper, we consider three models of subject indexing, and compare and contrast two indexing approaches: the theoretically based democratic indexing project, and Flickr, a working system for describing photographs. We argue that, despite Shirky’s (2005) claim of philosophical paradigm shifting for social tagging, there is a residual doubt amongst information professionals that self-organising systems can work without some element of control and some form of ‘representative authority’ (Wright, 2005).

1 Models of Indexing

1.1 Expert Led Indexing

Traditional expert led approaches to subject indexing rely on the existence of controlled vocabulary systems, classification schemes, taxonomies, or ontologies prior to specific instantiations of indexing activity. What traditional expert led subject indexing systems have in common, whether they derive from a belief in the consensus of adept intellectuals or from a post-modern pragmatic functionalism, is their tendency towards monologic utterance. Traditional expert led subject indexing relies on the management of information through the intervention of intermediaries (librarians, indexers, publishers, volunteers). Monologic, expert led indexing is expensive and time consuming. It can be a facilitator of access to information, by providing routes into large groups of documents. It can also be an inhibitor of access to information, because any constructed, controlled vocabulary will privilege specific worldviews, and may ignore or marginalise others, with the result that certain concepts and terms are neglected (Berman, 1971; Olson, 2002). Expert led knowledge organisation is a very single-minded way to construct maps of knowledge. This matters, because representations of knowledge in knowledge organisation tools are always ideologically determined, and politically consequential. Knowledge organisation tools are artificial constructions, and so historically and culturally contingent; they are thus always already ideological.

1.2 Author Based Indexing

Mathes (2004) argues that an alternative model of indexing is author based indexing. Traditional automated text-based indexing develops indexes by extracting terms from the text. This approach assumes that the author will use terms that are commonly understood and generally accepted. This literary warrant approach can be problematic if indexing and searching happen at a later historical moment than the moment of textual production. Terms that are ideologically acceptable at the moment of production might not be considered acceptable at the moment of reception. Another author based approach to solving the problem of the expense of expert based indexing is seen most commonly in the development of the Internet. The Dublin Core Metadata Initiative (DCMI) has been developed with a view to facilitating authorial indexing. A problem this approach faces is that the author is not necessarily an information manager. To enable authors to create their own metadata, frameworks might be less developed, and definitions of the content of fields and sub-fields simpler, than those developed for traditional knowledge organisation purposes. Information professionals are sometimes critical of DCMI in comparison to the Anglo-American Cataloguing Rules (AACR): DCMI is not as detailed as AACR, yet still too detailed for the amateur indexer. DCMI still refers authors/indexers back to existing traditional subject indexing tools, or allows them to add their own descriptive text for subject access. Moreover, Mathes (2004) argues that Internet search engines have shown that author indexing is sometimes wrong: sometimes it is inaccurate, and sometimes it is purposefully false and fraudulent. Even if the author is straightforward, truthful and honest, this approach to knowledge organisation remains monologic, as is the expert led approach, but it is a monologic approach often implemented by knowledge organisation amateurs.
Underpinning the author based indexing approach is the view that the author’s interpretation of his/her own work is an authoritative view. When a communicative object is created and then ‘liberated’ from its producer, and disseminated within a public space, who is to say that the author continues to have complete and utter control over it? The user is an important element in the production of meaning once the document is no longer in the total control of the author, or of institutional facilitators of dissemination.

1.3 User Based Indexing

The move towards social software has generated interest in shared metadata. The challenge is to involve users in metadata production. Professionally produced metadata is of high quality but expensive to produce; Dublin Core offers a way for the author to create metadata; in both cases, however, users are disconnected from the process. User generated, subject orientated metadata has started to develop as an alternative approach. Clusters of user generated subject tags are sometimes referred to as 'folksonomies', a term generally agreed to have been coined by Thomas Vander Wal (2005). Folksonomies do not have hierarchies, though they do contain automatically generated 'related' tags. They are the set of terms "that a group of users tagged content with, they are not a predetermined set of classification terms or labels" (Mathes, 2004). Mathes cites as the limitations of these systems their ambiguity, the use of multiple words, and the lack of synonym control, whilst their strengths are that they facilitate serendipity and browsing. Merholz (2004) argues that folksonomies can reveal the digital equivalent of 'desire lines': foot-worn paths that appear in the landscape over time. He suggests that favourite tags across the community may emerge over time, and that a controlled vocabulary based on these favourites could then be created. A related metaphor is that of 'information landscapes', a term found in the literature of ethnoclassification. It may be that over time a set of digital 'desire lines' will develop, but it is often the case that when groups of humans get too large, they split into sub-groups and cliques. This might lead to the organic production not of a single dominant controlled vocabulary but of many splintered controlled vocabularies. Over time this might have cultural and political consequences.
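The automatically generated 'related' tags mentioned above are typically derived from co-occurrence: tags frequently applied to the same items are assumed to be related. A minimal sketch of this idea in Python (the item tag sets and the `related_tags` helper are invented for illustration, not taken from any particular system):

```python
from collections import Counter

def related_tags(tagged_items, tag, top_n=3):
    """Count how often other tags co-occur with `tag` across items,
    and return the most frequent co-occurring tags."""
    co = Counter()
    for tags in tagged_items:
        if tag in tags:
            co.update(t for t in tags if t != tag)
    return [t for t, _ in co.most_common(top_n)]

# Hypothetical per-image tag sets
items = [
    {"wedding", "bride", "church"},
    {"wedding", "bride", "dress"},
    {"wedding", "cake"},
    {"hiking", "mountain"},
]
print(related_tags(items, "wedding"))  # 'bride' ranks first: it co-occurs most often
```

No hierarchy is consulted at any point: relatedness emerges purely from aggregated tagging behaviour, which is exactly what distinguishes folksonomies from expert led structures.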

2 Knowledge Organisation and Flickr

Flickr is a photo sharing web site that aims to provide new ways of organising photos:

"Part of the solution is to make the process of organizing photos collaborative. In Flickr, you can give your friends, family, and other contacts permission to organize your photos - not just to add comments, but also notes and tags. … and as all this info accretes around the photos as metadata, you can find them so much easier later on, since all this info is also searchable." (Flickr, 2006a)

Tags are like keywords or labels that you can add to a photo to make it easier to find later. You can tag a photo with tags like "catherine yosemite hiking mountain trail" and then later on if you are looking for pictures of Catherine you can just click on that tag and get all photos that have been tagged that way. (Flickr, 2006b)

Flickr has been described as a folksonomy (Wright, 2005), but in practice it works not as a user-indexed but as an author-indexed database, where 'author' refers to the person who uploads an image to the site and creates tags for it. The construction and use of a tag is left entirely to the 'author'. There are discussions within the internet community of how tags should be used (Ideant, 2005). In practice, tags often correspond to well-understood words, usually in English. However, tags are also used as private codes, and sometimes as codes shared by a sub-group of users to facilitate semi-private communication. Other tags are actually phrases with the spaces removed. Because tags are uncontrolled (except by the 'author' of an image) and unmediated, there is nothing to prevent inappropriate use, nor the generation of tags that are (nearly) identical in meaning or spelling to other tags. There does not appear to be a single list of all the current tags in use; Flickr offers summaries of 'hot tags' and 'all time most popular tags', neither of which provides any kind of comprehensive listing. Flickr thus adopts an author based indexing approach. Does this lead to a satisfactory retrieval mechanism? The notions of 'precision' and 'recall' are worth considering. The use of a tag for searching may deliver any number of results. There is no way of knowing whether all of the relevant images have been retrieved, but one can certainly find examples of retrieved images that are irrelevant, since an author may fail to tag an image with a relevant word, or may tag an image inappropriately. 'Precision' and 'recall' are therefore likely to indicate a poorly performing system, but one must ask whether such measures are appropriate for such a system. Table 1 illustrates some of the difficulties that may be experienced through the use of unmediated tags as a mechanism for retrieval. These difficulties include:
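The two measures can be stated concretely. A small Python sketch, using invented image identifiers and a notional relevance judgement (in a real system the set of truly relevant images is exactly what cannot be known):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"img1", "img2", "img3", "img4"}  # images carrying the tag 'wedding'
relevant = {"img1", "img2", "img5"}           # images actually depicting weddings
p, r = precision_recall(retrieved, relevant)
print(p, r)  # precision 0.5, recall 2/3
```

The squirrel-at-a-wedding example in Table 1 lowers precision (an irrelevant image is retrieved); a wedding photo the author never tagged 'wedding' lowers recall (a relevant image is missed).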

General/specific vocabulary

A key difficulty for any retrieval system is to use a set of terms that are neither so general that they apply to all items, nor so specific that they apply only to a very small number, so that they are able to distinguish items. The uncontrolled use of tags leads to terms that are too broad, retrieving a set too big to browse, or so specific that few items are associated with the term. Given that there is no upper limit to the number of images that a system like Flickr may store, this may be an insurmountable problem for any statistical approach to the control of tag use.

False use

Can there ever be 'false use' in an author-based indexing system? From the perspective of someone searching for an image, there will be images that are not relevant to the tag's use. The difficulty in Flickr is that there is no mechanism through which this may be changed or influenced, and there is therefore a danger that as tag use inevitably becomes more unusual, retrieval based on tags becomes less and less reliable: entropy and chaos. This would inevitably lead to tags being used solely for private indexes, and public searching would demand other approaches (or not be used at all). This may be of no concern to Flickr, but it is an important lesson for systems that are intended to manage large volumes of similar material where public searching is a priority.

Problem | Tag | Issue
Too broad | Wedding | 795280 images (14/2/06)
Too specific | Iijsselmeer | 1 image (21/2/06) http://www.flickr.com/photos/bennixview2/91925115/
False use | Wedding | http://www.flickr.com/photos/robwallace/99658661/ picture of a squirrel, presumably at a wedding!
False use | Naked | Man dressed in kilt! http://www.flickr.com/photos/36219427@N00/103171353/
Code (private language?) | A119 | 305 photos, all from the same user (23/2/06)
Code (private language?) | X1 | http://www.flickr.com/photos/tags/x1/ 161 photos, 27 users (23/2/06)
Multiple words | Technologyshowcaseday | 53 images, one user (21/2/06) http://www.flickr.com/photos/tags/technologyshowcaseday/
Ambiguity | Goes | 1639 images indexed (23/2/06), difficult to discern use or relevance!
Ambiguity | It | 9460 images indexed (23/2/06), wide variety of images!
Synonyms | Photos | 83539 images (23/2/06)
Synonyms | Photo | 110139 images (23/2/06)

Table 1: Examples of difficulties with tag use from Flickr

Codes

Codes are tags without any corresponding English (or other natural language) meaning. One might celebrate codes and conceive of their meaning as being entirely derived from the collection of images associated with the code. However, codes are by definition private (Flickr provides no method of stating the intended meaning of a tag), and this cannot be seen as a transparent or useful basis for public retrieval.

Multiple words

In some ways these are similar to codes. However, they are an attempt by indexers to make their tags more exact, and are really 'phrases' with the spaces removed. This is likely to be partly a response to Flickr's implementation of tags and the limited search language associated with them (simply 'all' or 'any', not even a Boolean language).
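The 'all'/'any' search language mentioned here amounts to set intersection and union over per-tag lists of image identifiers. A sketch with invented data (the `search` helper and the index layout are assumptions for illustration, not Flickr's actual implementation):

```python
from functools import reduce

def search(index, tags, mode="all"):
    """index maps tag -> set of image ids.
    mode='all' requires every tag (intersection);
    mode='any' requires at least one tag (union)."""
    sets = [index.get(t, set()) for t in tags]
    if not sets:
        return set()
    op = set.intersection if mode == "all" else set.union
    return reduce(op, sets)

index = {
    "wedding": {"a", "b", "c"},
    "bride": {"b", "c", "d"},
}
print(sorted(search(index, ["wedding", "bride"], "all")))  # ['b', 'c']
print(sorted(search(index, ["wedding", "bride"], "any")))  # ['a', 'b', 'c', 'd']
```

A Boolean language would extend this with negation and grouping, e.g. wedding AND NOT bride; the restriction to plain 'all'/'any' is one reason indexers resort to space-free phrase tags.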

Ambiguity

Text-based document retrieval systems have 'stop-lists' of terms that are of little value for indexing; such lists include words like 'it', 'a', 'be' and so on. There is no such control within Flickr and, although such words (like any other tags) must be associated with an image explicitly by the author, it is difficult to discern the value or meaning of such tags. Tags like this may be worse than codes, because people know they have a linguistic meaning and become confused when images challenge that expectation, whereas a code is constructed without any linguistic reference.

Synonyms

The uncontrolled use of tags leads to unfortunate indexing practice. Homonyms, synonyms and mis-spellings are all ignored, in that each is treated as a unique tag. This leads to poor retrieval performance, not helped by the absence of a comprehensive list of the tags used across Flickr.

Flickr implements some alternative retrieval mechanisms, which might be more appropriate for the informal collection of user-created photographs that constitutes the Flickr database. These mechanisms focus on browsing, on connecting ('clusters' of tags) and on 'interesting'. These functions on the Flickr site do facilitate the exploration of the database and the notion of serendipity (discovery by chance, exploration). A user's tags may be viewed as an alphabetical list in which the size of the tag reflects the number of times it has been applied by that user. This feature is duplicated in the 'hot tags' section for all users. This may be considered a crude attempt at implementing a 'desire line' or providing a 'map' of the 'information landscape'. There are other functional improvements that could be made to Flickr to mitigate the practical difficulties of tag use, such as the introduction of a Boolean search language and the provision of a 'global tag list'. However, it is unlikely that such functions would resolve the larger problem of improving the retrieval performance of the system (such an aim may not be of interest to Flickr at all!). Is it fair to criticise Flickr for providing such variable retrieval performance, and could the system be improved so that the freedom provided by author-based indexing is retained while retrieval performance is also improved? The authors believe that the democratic approach (Rafferty & Hidderley, 2005, pp. 177-187) provides a method of marshalling a 'free' user-indexed archive to provide useful retrieval functions.

3 Democratic Indexing: An Alternative Approach to Concept Based Retrieval

The democratic indexing project grew out of an interest in the challenges of designing image retrieval systems. It is a response to the issues of connotation, specifically to the question of whether a 'spectrum of connotation', based on the range of possible meanings available in society at a particular moment, might exist. The design of the database allows changes in meaning over time to be captured. Thus, the database addresses synchronic issues about the structure of meaning at any one time, and diachronic issues about changes in the system over time. By focusing on user interpretation, democratic indexing differs from traditional expert led models of indexing. In particular, the democratic approach considers that readers of images play active roles in determining meaning by constructing their own interpretations, and that a collection of terms describing the meanings constructed by readers should be used to create a subject based index. The democratic approach derives its authority from the agreement of its users: its warrant comes from the constructive interpretation of its users. The democratic approach does not cover all image contents; for example, information such as photographer, date of creation and title is not subject to variation. The approach is, however, applied to all forms of interpreted information that might be summarised as 'what does this mean?' or 'what is important here?' The principle of democratic indexing is that individuals will have their own, potentially different, interpretation(s) of an image: the differences may be manifested as different foci on parts of the image and different terms to describe the image. Democratic indexing has incorporated a number of novel features:

- The information recorded for each information item includes descriptive cataloguing and subject indexing based on user perceptions of the item.
- The collection of user-generated indexes will be used to compile a 'public' index through a process called 'reconciliation'.
- The ability of individual users to record their private indexes offers a 'democratic' approach to indexing.
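The mechanics of 'reconciliation' are left open here; one plausible sketch, offered purely as an illustration and not as the project's actual algorithm, is to promote into the public index any term that a minimum proportion of indexers applied to the same item (the 0.5 threshold is an arbitrary assumption):

```python
from collections import Counter

def reconcile(private_indexes, threshold=0.5):
    """private_indexes: one term set per user, all describing the same image.
    Keep terms applied by at least `threshold` of the indexers."""
    counts = Counter(t for terms in private_indexes for t in set(terms))
    n = len(private_indexes)
    return {t for t, c in counts.items() if c / n >= threshold}

# Three hypothetical private indexes for one image
users = [
    {"wedding", "happy"},
    {"wedding", "church"},
    {"wedding", "happy", "dress"},
]
print(sorted(reconcile(users)))  # ['happy', 'wedding']
```

Idiosyncratic terms ('church', 'dress') survive in each user's private index but are filtered from the public one, which is how the public/private split can discipline 'desire lines' without suppressing individual interpretation.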

3.1 Level of Meaning tables

Central to the project has been the construction of 'levels of meaning' indexing templates, initially to capture a range of information relating to images, but subsequently developed to capture information relating to film and fiction (Hidderley and Rafferty, 1997). The image based levels of meaning template is shown in Table 2:

Level & Category | Description | Some examples
1.1 Biographical | Information about the image as a document | Photographer/artist, date & time of creation, colour/B&W, size, title
1.2 Structural contents | Significant objects & their physical relationship within the picture | Object types, position of object, relative size (or importance) within the picture, e.g. car top right
2.1 Overall content | Overall classification of the image | Type of image, 'landscape', 'portrait', ...
2.2 Object content | Classification of each object defined in 1.2 | Precise name & details of each object (if known), e.g. Margaret Thatcher, Ford Orion...
3.1 Interpretation of whole image | Overall mood | Words or phrases to summarise the image, e.g. 'happy', 'shocking'
3.2 Interpretation of objects | Mood of individual objects (when relevant) | e.g. Margaret Thatcher triumphant, defeated

Table 2: Levels of Meaning table

The approach adopted in the levels of meaning indexing template is based on some assumptions that still require verification. Firstly, it is assumed that, at least for the higher levels of meaning (3.1, 3.2), there is no single interpretation of an image. Secondly, that there will be common terms used by viewers to index images. Thirdly, that the natural way to describe images is through words and phrases.

4 Organic Organisation or Teleological Discipline?

What is at stake in the user based indexing paradigm is the status of conventional knowledge organisation tools. Flickr and the Democratic Indexing project are examples of systems offering alternatives to imposed semantic structures. Clay Shirky (2005) distinguishes between domains in which ontologies operate successfully and those which are more difficult to discipline ontologically. He argues that ontologies do not work well when the domain is a large corpus with no formal categories and unstable, unrestricted entities with no clear edges, and when participants are uncoordinated amateur users and naive catalogers, with no authority. This list of factors makes the Web an almost perfect fit for an information space in which ontologies do not work. Shirky's view is that the process of social tagging heralds a philosophical shift in indexing, taking us away from a binary process of categorisation towards a probabilistic approach. Shirky argues that Flickr and del.icio.us provide a way of developing organic categorisation in which alternative organisational systems are built by letting users tag URLs and then aggregating those tags. The use of the word 'aggregating' is suggestive of the limitations of user-driven systems. The term echoes Merholz's suggestion that over time folksonomies will develop informational equivalents of 'desire lines', which will provide de facto controlled vocabularies, and Hidderley and Rafferty's suggestion that democratic indexing projects should operate using a public/private indexing split. The discourse of user-based indexing is one of democracy, organic growth and user emancipation, but there are hints throughout the literature of the need for post-hoc disciplining of some sort.
This suggests that, despite Shirky's claim of a philosophical paradigm shift for social tagging, there is an abiding doubt amongst information professionals that self-organising systems can work without some element of control and some form of "representative authority" (Wright, 2005). Perhaps all that social tagging heralds is a shift towards user warrant.

References

Berman, S. (1971). Prejudices and Antipathies: A Tract on the LC Subject Heads Concerning People. Metuchen, NJ: Scarecrow Press.
Flickr™ (2006a). About Flickr. http://www.flickr.com/about.gne [Retrieved 21 February 2006]
Flickr™ (2006b). Tags, Frequently Asked Questions. http://www.flickr.com/help/tags/#37 [Retrieved 21 February 2006]
Hidderley, R. and Rafferty, P. (1997). Democratic Indexing: An Approach to the Retrieval of Fiction. Information Services and Use 17 (2-3), 101-111.
Ideant (2005). Tag Literacy. http://ideant.typepad.com/ideant/2005/04/tag_literacy.html [Retrieved 23 February 2006]
Mathes, A. (2004). Folksonomies - Cooperative Classification and Communication Through Shared Metadata. http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html [Retrieved 25 October 2005]
Merholz, P. (2004). Metadata for the Masses. http://www.adaptivepath.com/publications/essays/archives/000361.php [Retrieved 25 October 2005]
Olson, H. (2002). The Power to Name: Locating the Limits of Subject Representation in Libraries. Dordrecht, Netherlands: Kluwer Academic.
Rafferty, P. and Hidderley, R. (2005). Indexing Multimedia and Creative Works: The Problems of Meaning and Interpretation. Aldershot, United Kingdom: Ashgate.
Shirky, C. (2005). Ontology is Overrated: Categories, Links, Tags. http://shirky.com/writings/ontology_overrated.html [Retrieved 25 October 2005]
Vander Wal, T. (2005). Explaining and Showing Broad and Narrow Folksonomies. http://www.personalinfocloud.com/2005/02/explaining_and_.html [Retrieved 27 February 2006]
Wright, A. (2005). Folksonomy. http://www.agwright.com/blog/archives/000900.htm [Retrieved 25 October 2005]

Blanca Rodríguez Bravo
Universidad de León, Facultad de Filosofía y Letras, Campus de Vegazana, 24071 León, Spain

The Visibility of Women in Indexing Languages

Abstract: This article analyses how gender matters are handled in indexing languages. The examples chosen were the Library of Congress Subject Headings (LCSH), the UNESCO Thesaurus (UT) and the European Women's Thesaurus (EWT). The study is based on an analysis of the entries Man/Men and Woman/Women, their subdivisions, and the established relationships appearing under these entries. Other headings or descriptors are also listed when they allude to men or women but the gender sense occupies only second or third place in the entry, in the shape of an adjective or a second noun. A lack of symmetry in the treatment of gender is noted, and recommendations are made for equal status for men and women, which should, however, avoid unnecessary enumerations.

1. Introduction

The growing interest in documentation referring to the situation of women, and in Women's Studies in general, may easily be seen in the increase in the number of publications on these topics and in the proliferation of specialized information centres, libraries and databases. These need a working tool-kit that will permit homogeneous indexing of documents from all areas of knowledge containing specific women's terminology, so that all content relating to Women's Studies, feminist theory and the situation of women in general can be located and retrieved in the most effective and exhaustive way possible. However, information professionals question the usefulness of classic encyclopaedic indexing languages for processing this documentation. This is because, as an outcome of the lack of specific terminology for representing information about women, it is difficult to retrieve documentation that genuinely concerns women. In traditional terminological tools, the feminine presence is sparse, because the language has set up the masculine as universal and generic. It is also inappropriate, because their discourse for the representation of knowledge maintains feminine stereotypes, offering an image of women that is anachronistic and shows traces of sexism. Associative languages do not reflect emerging contents of a feminist nature; taking the masculine as the generic, they undervalue the feminine or hide everything referring to women. For this reason, it is common to find very varied documentation lumped together under a single entry, "Women", so general that it makes it difficult to retrieve information about any specific point. Similarly, discriminatory differences can be noted with respect to the heading or descriptor "Men". This analysis is set within the constructivist trends that criticize and question classifications, and classical terminological tools in general, as systems for organizing knowledge.
They state that indexing languages are social creations intended for ordering and controlling concepts, since underlying their supposed universality and neutrality they are constructed in accordance with the cultural discourse dominant in society at different times, and are hence a product of their period. On this point, the work of Budd and Raber, Frohmann, Kwasnik, Olson, and Radford and Radford, among others, should be noted. Further, as soon as such a classification or indexing language has organized concepts within a given order of dependence and established relationships among them, it will indirectly describe and provide one particular view of the world. In this way, it will contribute to supporting this perception in the minds of the people who consult it. In the light of the exclusion of the feminine from the most often used encyclopaedic languages, it is understandable that libraries and information centres specializing in women's matters have attempted to construct indexing languages of their own, such as: the European Women's Thesaurus (EWT); On Equal Terms: A Thesaurus for Non-Sexist Indexing and Cataloging (compiled by Joan K. Marshall); the Tesauro "Mujer" (Instituto de la Mujer); the Thesaurus d'història social de la dona; and the Listado de descriptores en el tema de la mujer (ISIS Internacional). These initiatives are aimed at designing egalitarian languages and/or involve a move from a man-centred representation of knowledge to a feminine representation of reality. Use of these specialized terminological tools contributes to raising the visibility of women and of documents about them. However, their use is restricted to very specific contexts. The great majority of information centres are not able to use these thesauri other than as a complement to universal terminology tools. Hence, it would seem essential to achieve the integration of women into encyclopaedic languages.

2. Methodology

The representation of gender in three indexing languages will be considered: the Library of Congress Subject Headings (LCSH), the UNESCO Thesaurus (UT) and the European Women's Thesaurus (EWT). The study of associative languages focuses principally on the analysis of the entries Man/Men and Woman/Women, for which all the relationships established by the language were collected. For all the terminology tools used, other headings or descriptors alluding to men or women in which the gender significance holds second or third place in the entry, as an adjective or a second noun (such as Female/Male or Femininity/Masculinity), were also enumerated. Among lists of subject headings the LCSH was chosen, since it constitutes the subject headings list most used and most influential worldwide. A list is given of the headings in which the first term is among those noted. In the case of the entries Females, Males, Man, Men, Woman and Women, their scope notes and their relationships with other terms were recorded, so as to provide their context of use, together with their respective subdivisions. With respect to the latter, the subdivisions of the entry Man were left out, as it is a generic entry and there is another, differentiated entry for the masculine. Likewise, references to compound headings derived from Female, Femininity, Male, Men and Women were omitted. Among thesauri the UT was selected, as it is the only encyclopaedic and multilingual thesaurus in existence. From it, the descriptors containing the words Men and Women were collected, with their preferential, generic and associative relationships. Similarly, to avoid prolixity, only the descriptors derived from Women were enumerated. Next, consideration was given to the handling of generics in the EWT, a tool whose principal aim is to represent documents about women. Its gender descriptors are enumerated, with the full listings for "Men" and "Women" being presented.
The treatment of gender in other tools specializing in feminine topics, such as the Spanish Tesauro "Mujer", is not covered, since it is not easy to compare terminologies used by different languages. The English language differs greatly from Spanish. In Spanish, the general problems of sexism are aggravated by two fundamental factors. The first is its frequent use of the masculine as a generic, where English can distinguish the generic from both the masculine and the feminine. The other is that the distinction of grammatical gender that Spanish makes in both nouns and adjectives has led, on grounds of economy, to the utilization of the masculine to represent all humanity. On the other hand, this distinction makes it possible to avoid long lists including the term women (mujeres). It is necessary to refer to the thesaurus compiled by Marshall, On Equal Terms: A Thesaurus for Non-Sexist Indexing and Cataloging, which develops gender contents very extensively, taking particular care to utilize non-sexist language. The starting point for this terminological tool is the vocabulary present in LCSH. Detailed analysis of it is not included, owing to its exhaustive treatment of gender and the limitations of space for this work.

3. Results

3.1. LIBRARY OF CONGRESS SUBJECT HEADINGS

HEADINGS SUBDIVISIONS FEMALE…: Female livestock, Female offenders, Female orgasm. FEMALES - Evolution Here are entered works on female organisms in general. Works on - Physiology the human female are entered under Women. BT Sex NT Female livestock Women FEMININE BEAUTY FEMININITY…: Femininity (Philosophy), Femininity (Psychology), Femininity of God. FEMMES…: Femmes fatales, femmes fatales in art. MALE…: Male contraception, Male contraceptives, Male livestock, Male nude, male orgasm, Male striptease. MALES Here are entered works on male organism in general. Works on the human male are entered under Men BT Sex NT Male livestock Men MAN UF Human beings Humans Mankind BT Primates RT Anthropology Human-animal relationships NT Anthropometry Economic man Ethnology Human biology Men Men in literature Persons Philosophical anthropology Women 416

HEADINGS SUBDIVISIONS MASCULINITY (PSYCHOLOGY) MEN (May Subd Geog) - Diseases Here are entered works on the human male. Works on male - Employment organisms in general are entered under males. - Health and Hygiene UF Human males - Medical examinations Males, Human - Mental health BT Males - Mortality Man - Physiology RT Patriarchy - Prayer-books and devotions NT Abusive men - Psychology Aged men - Sexual behavior Brotherhoods - Socialization Brothers - Societies and clubs Church work with men - Study and teaching English literature-Men authors - United States Gay men Grooming for men Househusband Jewish men Male nude Middle aged men Middle class men Photography of men Sex instruction for men Short men Single men Strong men Uncles White men Young men

MEN…: Men (Christian theology), Men actors, Men as collectors, Men authors, Men authors, English, Men consumers, Men dancers, Men in art, Men in church work, Men in literature, Men in mass media, Men in motion pictures, Men in popular culture, Men nurses, Men weavers. WOMAN…: Woman (Buddhism); Woman (Christian theology); Woman (Mormon theology): Woman (Philosophy); Woman (Theology); Woman-to-woman marriage. WOMEN (May Subd Geog) - Anthropometry Here are entered works on the human female. Works on female - Biography organisms in general are entered under Females. - Books and reading UF Females, Human - Charities Human females - Colonization Woman - Communication BT Anthropology - Conduct of life Females - Congresses Man - Crimes against Sociology 417

WOMEN (continued)
   RT International Women’s Year, 1975; Misogyny
   SA subdivision Women under names of wars and names of Indian groups; also subdivision Relations with women under names of persons; also headings beginning with the word Women.
   NT Abused women; Aged women; Architecture and women; Aunts; Beauty contestants; Church work with women; Fascism and women; Feminism; Femmes fatales; Gifted women; Handicapped women; Heroines; Homeless women; Housewives; International Women’s Decade, 1976-1985; Lesbians; Libraries and women; Married women; Mass media and women; Matriarchy; Middle aged women; Middle class women; Minority women; Mothers; Motion pictures for women; Nieces; Overweight women; Photography of women; Physically handicapped women; Poor women; Pregnant women; Queens; Rural women; Self-defense for women; Sex instruction for women; Single women; Sisters; Television and women; Television programs for women; United States. Navy-Women; Urban women; White women; Widows
   Subdivisions: Cross-cultural studies; Diseases; Education; Education (Higher); Education, Medieval; Employment; Employment re-entry; Evolution; Folklore; Health and hygiene; History; Institutional care; Language; Legal status, laws, etc.; Life skills guides; Literary collections; Medical examinations; Mental health; Mythology; Non-formal education; Nutrition; Pensions; Periodicals; Physiology; Portraits; Prayer-books and devotions; Psychic ability; Psychology; Public opinion; Quotations; Recreation; Religious life; Services for; Sexual behavior; Social and moral questions; Social conditions; Social networks; Socialization; Societies and clubs; Sociological aspects; Study and teaching; Suffrage; Suicidal behavior; Taxation; Time management; Tobacco use; Vocational education; Great Britain

   NT (continued): Wives; Women’s mass media; Working class women; Young women
   Subdivisions (continued): Puerto Rico; United States

WOMEN,…: Women, Ashanti; Women, Australian (Aboriginal) (…) (94 headings)

WOMEN…: Women (in numismatics); Women (International law); Women accountants, (…) (410 headings: many professions)

WOMEN’S…: Women’s Cave (Romania); Women’s clothing industry; Women’s colleges, (…) (36 headings)

…WOMEN: Abused women; Afro-American women; Chinese American women (…); Libraries and women; Minority women; Professional education of women; Public speaking for women; Reformatories for women; Rural women; Self-defense for women; Sex discrimination against women; Sports for women; Sterilization of women; Urban women; White women, etc.

The disproportion between the number of entries for men and for women is significant; it is the outcome of an attempt to give greater visibility to the latter in a language where the use of the masculine as a generic would conceal their presence. The masculine is used as a generic, or even hegemonically, as may be observed in the employment of the term “Man” instead of less discriminatory options such as “Human”, “Person”, “Humans” or “Humanity”. The aim of these near-interminable listings is simply to give visibility to women, who would otherwise remain hidden behind non-specific entries. It is also striking that much of the limited space given over to women reflects stereotypes (beauty, marital status, gestation, sexuality, and so forth), and that some entries and references are decontextualized or obsolete, such as the suggestion that the subdivision Women should be used under names of Indian tribes. The work compiled by Marshall, On Equal Terms, considers the subdivision “Conduct of life” outdated and suggests replacing it with “Personal conduct, lifestyles, etc.”. It also proposes that the subdivision “Societies and clubs” be replaced by “Organizations”. With regard to sexism in language, Marshall recommends replacing “Femininity” and “Masculinity” with “Womanhood” and “Manhood” respectively, seeing these as neutral terms free of traditional generic connotations. The contrast between “Overweight women” and “Strong men” is also worth noting.

3.2 UNESCO THESAURUS

DESCRIPTORS

MARRIED MEN
   MT Family
   UF Husbands
   BT1 Marital status
   BT2 Marriage

MARRIED WOMEN
   MT Family
   UF Wives
   BT1 Marital status
   BT2 Marriage
   RT Homemakers
   RT Mothers

MEN
   MT Population
   UF Males
   BT1 Sex
   BT2 Sex distribution
   NT1 Boys
   RT Fathers
   RT Sex stereotypes
   RT Women

RURAL WOMEN

WOMEN
   MT Population
   SN Use more specific descriptor
   UF Females
   BT1 Sex
   BT2 Sex distribution
   NT1 Girls
   NT1 Homemakers
   RT Men
   RT Mothers
   RT Sex stereotypes
   RT Sexual division of labour

Women…: Women and development; Women artists; Women authors; Women in politics; Women journalists; Women scientists; Women students; Women teachers; Women workers; Women’s education; Women’s employment; Women’s liberation movement; Women’s organizations; Women’s participation; Women’s rights; Women’s status; Women’s studies; Women’s suffrage; Women’s unemployment.
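The relational syntax above (MT for microthesaurus, UF for non-preferred terms, BT1/BT2 for broader terms at one and two levels, NT1 for narrower terms, RT for related terms) defines a small graph of descriptors. Purely as an illustration, the following sketch models a few of the terms from the extract and walks the broader-term chain; the dictionary structure and function are our own, not part of any published thesaurus software.

```python
# Minimal model of part of the UNESCO Thesaurus extract above.
# BT/NT (broader/narrower) links form the hierarchy; RT links are associative.
# Only terms and relations shown in the extract are included; the structure is illustrative.

THESAURUS = {
    "MEN":   {"BT": ["Sex"], "NT": ["Boys"],
              "RT": ["Fathers", "Sex stereotypes", "Women"]},
    "WOMEN": {"BT": ["Sex"], "NT": ["Girls", "Homemakers"],
              "RT": ["Men", "Mothers", "Sex stereotypes", "Sexual division of labour"]},
    "Sex":   {"BT": ["Sex distribution"], "NT": ["MEN", "WOMEN"], "RT": []},
    "Sex distribution": {"BT": [], "NT": ["Sex"], "RT": []},
}

def broader_chain(term):
    """Follow BT links upward, reproducing the BT1/BT2 display of the printed thesaurus."""
    chain = []
    while term in THESAURUS and THESAURUS[term]["BT"]:
        # Thesauri may be polyhierarchical; for this sketch we take the first BT.
        term = THESAURUS[term]["BT"][0]
        chain.append(term)
    return chain

print(broader_chain("WOMEN"))  # ['Sex', 'Sex distribution'] -- i.e. BT1 Sex, BT2 Sex distribution
```

The asymmetry discussed in the text is visible even in this toy graph: WOMEN carries narrower and related terms (Homemakers, Mothers) that anchor it to traditional roles, while MEN does not.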

It is noteworthy how few gender entries there are in the UT, although a significantly larger number of them is devoted to making women visible relative to men. The mentions of women’s circumstances reveal the asymmetrical relationship between males and females, the latter being hidden in the rest of the vocabulary. What is highlighted are traditional feminine roles (“Homemakers”, “Mothers”) or aspects considered to be emerging: education, employment and the like. It is significant that there is a linguistic anachronism in the denomination “Women in politics”, as opposed to “Women’s politics”, which would be preferable.

3.4. EUROPEAN WOMEN’S THESAURUS

DESCRIPTORS

ECRITURE FEMININE
FEMALE BODY
FEMALE CONSCRIPTION
FEMALE LANGUAGE
FEMALE LEARNING
FEMININITY
MASCULINITY

MEN
   SN Use only in combination with other terms and in a general sense; for publications on the position of men in a particular country or in a particular period use with geographical or time descriptors and with other terms, taking care not to duplicate existing compound terms; for a specifically masculine aspect of a subject USE – W.
   RT -W; Sex; Women

MEN…: Men in female-dominated occupations; Men’s movement; Men’s studies; Men’s work

WOMEN
   SN Use only in a general sense; may be combined with geographical and time descriptors; use in combination with other terms taking care not to duplicate existing compound terms.
   RT Men; Sex

WOMEN’S…: Women’s antiquarian bookshop, Women’s archives, Women’s bookshops, Women’s broadcasting companies, Women’s centres, Women’s ceremonies, Women’s companies, Women’s concentration camps, Women’s convention, Women’s culture, Women’s discos, Women’s documentation centres, Women’s festivals, Women’s films, Women’s health and welfare services, Women’s history, Women’s history + Interwar period, Women’s holiday camps, Women’s holiday services, Women’s hotels, Women’s information services, Women’s libraries, Women’s literature, Women’s magazines, Women’s movement, Women’s movement + Developing countries, Women’s network, Women’s organizations, Women’s parties, Women’s peace movement, Women’s printing establishment, Women’s prisons, Women’s publishing houses, Women’s pubs, Women’s question, Women’s resistance, Women’s restaurants, Women’s songs, Women’s struggle, Women’s studies, Women’s suffrage, Women’s synod, Women’s theatre, Women’s trade unions, Women’s training centres, Women’s work, Women’s year

WOMEN…: Women + Developing countries; Women and work centres; Women in male-dominated occupations; Women on social security; Women church.
…WOMEN: Black women, Caravan women, Decade of women, Images of women, Learned women, Market women, Marriage + Traffic in women, Married women, Medicine women, Ordination of women, Romany women, Rural women, Unmarried women, Traffic in women, White women, Working-class women.

Women are omnipresent in this vocabulary, while men are conspicuous by their absence. However, there do not appear to be signs of discrimination, and the language seems in keeping with the times.

This thesaurus was designed for the treatment of documentation about women, hence the limited presence of descriptors referring to the opposite sex. In fact, when needed, they are represented by adding the qualifier (-W) (Not women) after the appropriate descriptor.

4. Discussion

In the first two languages considered, the LCSH and the UT, it may be seen that there is a masculine hegemony that makes it necessary to reserve a special space for women. In traditional encyclopaedic languages, women’s matters are hidden beneath the use of masculine forms as generics, and their presence is thus sparse. With this in mind, it is possible to state that the attention paid to feminine topics is much greater in the LCSH than in the UT. Similarly, a certain asymmetry may be observed in the representation of gender. In both these indexing languages it is possible to note how women are given a status of inferiority and dependence with respect to men and are limited to traditional roles. Stress is laid on every aspect of their marital status, sexuality and processes related to motherhood. The EWT, on the other hand, concentrates exclusively on the representation of women’s circumstances. Hence, its use is restricted to the representation and retrieval of documents about women in centres specializing in this field. It may be concluded that there is a need to integrate women and the feminine into the principal headings lists and thesauri without thereby concealing masculine matters. This objective means that efforts must be made along the following lines: the elimination of sexist stereotypes from indexing languages; a search for balance between the presence of men and of women; and the elimination of masculine forms as false generics. It is urgently necessary to revise encyclopaedic languages in pursuit of an egalitarian representation of the two sexes. The proposal would be for a rigorous revision to be undertaken, leading to the elimination of discriminatory connotations. In this task, the EWT and the vocabulary drawn up by Marshall could serve as examples.
It would seem appropriate that indexing languages, as vehicles of thought, should collaborate in attempts to achieve active reforms of language, since the use of sex and gender with marginalizing connotations turns language into a carrier of discrimination. On these grounds, the compilation of neutral vocabularies is essential. Similarly, as economy is a principle to be followed in indexing languages, interminable lists of headings and descriptors derived from “Women” would not seem to be the best way of ensuring the inclusion of women. One solution that might be debated is the inclusion of notes authorizing the use of the feminine whenever it is appropriate. Likewise, syntagmatic headings or descriptors could be used to distinguish the masculine from the generic when no other option exists. Another alternative might be to use formulae similar to the one proposed in the EWT, which, when a reference to the male is required, adds the qualifier (-W) (not women). It would be feasible to establish two gender qualifiers, (W) and (M) for example, for use with headings and descriptors when it is necessary to specify that a document on topic X refers not to all humanity but only to women or only to men.
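The qualifier scheme just proposed can be made concrete with a brief sketch. Everything below is hypothetical and illustrative only (the function name, the example headings and the exact notation are ours, not part of any published vocabulary): a heading is indexed unqualified when a document concerns all humanity, and carries (W) or (M) only when it is restricted to one sex.

```python
# Illustrative sketch of the proposed gender qualifiers: index a heading
# unqualified for documents about all humanity, and append (W) or (M) when
# the document concerns only women or only men. All names are hypothetical.

def qualify(heading, scope=None):
    """Return the indexing string for a heading; scope is None, 'W' or 'M'."""
    if scope is None:
        return heading                    # generic: applies to all humanity
    if scope in ("W", "M"):
        return f"{heading} ({scope})"     # restricted to one sex
    raise ValueError("scope must be None, 'W' or 'M'")

print(qualify("Employment"))         # Employment
print(qualify("Employment", "W"))    # Employment (W)
print(qualify("Suffrage", "M"))      # Suffrage (M)
```

The point of the design is that the unqualified form stays truly generic, so long derived lists ("Women accountants", "Women teachers", …) become unnecessary and neither sex is treated as the default.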

5. Bibliographic references

Budd, M. & Raber, D. (1996). Discourse Analysis: Method and Application in the Study of Information. Information Processing and Management, 32(2), pp. 217-226.
Burgos Fresno, J. L., Fernández Pérez, M., Maseda García, R. & Villanua Bernues, L. (2002). Tesauro “Mujer” (6ª ed.). Madrid: Instituto de la Mujer, Centro de Documentación. Available from the website:
Capek, M. E. S. (1988). A Women’s Thesaurus: An Index of Language Used to Describe and Locate Information by and about Women. New York: Harper & Row.
Dickstein, R., Mills, V. & Waite, E. (1988). Women in LC’s Terms: A Thesaurus of Library of Congress Subject Headings Relating to Women. Phoenix: Oryx Press.
Drenthe, G. & Van der Sommen, M. (1990). Towards a Uniform Language Retrieval System for Information on the Condition of Women and Women’s Studies in the Netherlands: Report on a Preliminary Investigation. Amsterdam.
EEUU. National Council for Research on Women & Capek, M. H. (ed.) (1989). A Women’s Thesaurus: An Index of Language Used to Describe and Locate Information by and about Women. New York: Perennial Library.
Frohmann, B. (1994). Discourse Analysis as a Research Method in Library and Information Science. Library and Information Science Research, 16, pp. 119-138.
International Information Center and Archives for the Women’s Movement (1988). European Women’s Thesaurus: A List of Controlled Terms for Indexing Information on the Position of Women and Women’s Studies. Amsterdam: IIAV. Available from the website:
ISIS International (1994). Listado de descriptores en el tema de la mujer. Santiago de Chile: ISIS Internacional.
López-Huertas, M. J. & Barité, M. (2002). Knowledge Representation and Organization of Gender Studies on the Internet: Towards Integration. In: López-Huertas, Mª J. (ed.). Challenges in Knowledge Representation and Organization for the 21st Century: Integration of Knowledge across Boundaries. Proceedings of the Seventh International ISKO Conference, Granada, 10-13 July, pp. 393-403.
López-Huertas, M. J. (2004). Terminological Representation of Specialized Areas in Conceptual Structures: The Case of Gender Studies. In: McIlwaine, I. C. (ed.). Knowledge Organization and the Global Information Society. Proceedings of the Eighth International ISKO Conference, London, 13-16 July, pp. 35-39.
Marshall, J. K. (1977).
On Equal Terms: A Thesaurus for Non-Sexist Indexing and Cataloging. New York: Neal-Schuman.
Morán Suárez, Mª A. & Rodríguez Bravo, B. (2001). La imagen de la mujer en la Clasificación Decimal Universal (CDU). In: 5º Congreso ISKO-España: La representación y organización del conocimiento: metodologías, modelos y aplicaciones. Alcalá de Henares: Facultad de Documentación.
Olson, H. A. (1997). The Feminist and the Emperor’s New Clothes: Feminist Deconstruction as a Critical Methodology for Library and Information Studies. Library and Information Science Research, 19, pp. 81-199.
Olson, H. A. (1998). Mapping beyond Dewey’s Boundaries: Constructing Classificatory Space for Marginalized Knowledge Domains. Library Trends, 47(2), pp. 233-254.
Olson, H. A. (2003). Transgressive Deconstructions: Feminist/Postcolonial Methodology for Research in Knowledge Organization. In: Frías, J. A. & Travieso, C. (eds.). Trends in Knowledge Organization Research. Salamanca: Universidad de Salamanca, pp. 731-740.
Radford, M. L. & Radford, G. P. (1997). Power, Knowledge and Fear: Feminism, Foucault, and the Stereotype of the Female Librarian. Library Quarterly, 67(3), pp. 250-266.
Sebastià i Salat, M. (1988). Thesaurus d’història social de la dona. Barcelona: Generalitat de Catalunya, Comissió Interdepartamental de Promoció de la Dona.

Florian Kohlbacher
Vienna University of Economics and Business Administration
Hitotsubashi University, Tokyo

Knowledge Organization(s) in Japan – Empirical Evidence from Japanese and Western Corporations

Abstract: With the recognition of knowledge as an essential resource of organizations as well as a company’s only enduring source of competitive advantage in an increasingly dynamic world, knowledge management (KM) seems to have become a ubiquitous phenomenon both in the academic as well as in the corporate world. This paper presents insights from a current research project on knowledge transfer, creation and sharing in a cross-cultural context. Applying a case study research design, an empirical study using qualitative interviews with managers and other corporate staff was conducted in Japan in 2005 and 2006. I will look at critical recent issues in the context of knowledge organization(s) in Japan and try to identify and analyze differences in Japanese and Western approaches to knowledge management, transfer and creation. This paper also highlights relevant factors of influence on the knowledge transfer process within multinational corporations and points to the crucial issue of knowledge retention which has emerged recently in Japan.

1. Knowledge Management and Knowledge Creation

With the recognition of knowledge as an essential resource of organizations as well as a company’s only enduring source of competitive advantage in an increasingly dynamic world, knowledge management (KM) seems to have become a ubiquitous phenomenon in both the academic and the corporate world. In fact, it has turned into one of the most prominent and widely discussed management concepts of the post-modern era. Publications on knowledge management are legion, and business practitioners do not fail to stress its importance for the competitiveness of their corporations. Even though KM has also been analyzed and discussed as a management fad and within the framework of management fashion models (cf. e.g. Scarbrough et al., 2005; Scarbrough & Swan, 2001; Skyrme, 1998) to explain its diffusion and “strong rhetorical appeal” (Alvesson et al., 2002: 282), no management scholar or practitioner is likely to disagree with Newell and fellow researchers’ (2002: 2) pronouncement that “[m]anaging knowledge and knowledge workers is arguably the single most important challenge being faced by many kinds of organizations across both the private and public sectors in the years to come”. Emerging from Japan, Ikujiro Nonaka’s publications and his theory of knowledge creation (e.g. Nonaka, 1994; Nonaka & Takeuchi, 1995) have drawn attention to Japanese firms as knowledge-creating companies, a feature that supposedly helped them to create the dynamics of innovation and to become world leaders in the automotive and electronics industries, among others, in the 1980s and the beginning of the 1990s.
The difference, it was argued, between Japanese and Western firms lies in the focus on tacit knowledge of the former and explicit knowledge of the latter (Hedlund & Nonaka, 1993; Nonaka & Takeuchi, 1995; Takeuchi & Nonaka, 2000), and this particular ability of Japanese firms to create knowledge has also been recognized and acknowledged by Western scholars (e.g. Davenport & Prusak, 2000; Holden, 2002; Leonard, 1998). In fact, this is closely related to two different paradigms in organizational theory and management practice: the information-processing paradigm, which led to a rather technical concept of knowledge management focusing on information technology (IT) and explicit knowledge, and the knowledge-creation paradigm, which emphasizes intellectual capability, human creativity and tacit knowledge (Ichijo, 2002, 2004). Wiig (2004, p. 338), for instance, defines knowledge management as “[t]he systematic, explicit, and deliberate building, renewal, and application of knowledge to maximize an enterprise’s knowledge-related effectiveness and returns from its knowledge and intellectual capital assets”. In contrast, by organizational knowledge creation Nonaka and Takeuchi (1995, p. 3) mean “the capability of a company as a whole to create new knowledge, disseminate it throughout the organization, and embody it in products, services, and systems”, and they develop a dynamic model of this process.

2. International and Cross-cultural Dimensions

Scholars and practitioners around the globe have identified the capability of multinational corporations (MNCs) to create and efficiently transfer and combine knowledge from different locations worldwide as an increasingly important determinant of competitive advantage, corporate success and survival (cf. e.g. Asakawa & Lehrer, 2003; Bartlett & Ghoshal, 2002; Doz et al., 2001; English & Baker, 2006; Gupta & Govindarajan, 2000; Schulz & Jobe, 2001). Indeed, the process of knowledge transfer between business units is an essential aspect of KM (Bresman et al., 1999), and knowledge transfer capability is one of the most important advantages of MNCs, as it is “[t]hrough the transfer and adaptation of knowledge, subsidiaries of MNCs build and develop their competitiveness over local firms” (Tseng, 2006: 121). According to Schulz (2001), the management of knowledge flows is especially important for MNCs because they operate in geographically and culturally diverse environments. Since strategically important knowledge is geographically dispersed in the business environment of most global firms (Asakawa & Lehrer, 2003), MNCs can derive great competitive advantage by managing knowledge flows between their subunits, with differences between local markets requiring adaptation of products and operations to local conditions (Haghirian & Kohlbacher, 2005; Schulz & Jobe, 2001). Doz et al. (2001: 219) point to the important fact that MNCs will have to shift from merely being global projectors of knowledge to so-called metanational companies, which means “exploiting the potential of learning from the world by unlocking and mobilizing knowledge that is imprisoned in local pockets scattered around the globe”. Nonaka (1990: 82) terms the cross-border synergistic process of joint knowledge creation ‘global knowledge creation’ and sees it as the key process of globalization.
Here again, “[t]acit knowledge, embodied in individual, group and organizational routines, is of critical strategic importance because, unlike explicit knowledge, it is both inimitable and appropriable” (Al-Laham & Amburgey, 2005: 251; Spender, 1996). According to Holden (2002: 81), “[o]ne of the problems in the knowledge management literature is that authors give the impression that knowledge management operates in a kind of unitary vacuum, in which diversity in terms of language, cultural and ethnic background, gender and professional affiliation are compressed into one giant independent variable, which is in any case pushed to the side”. In fact, it is obvious that cultural differences and the cross-cultural context play an important role for and influence global knowledge creation and management (cf. e.g. Holden, 2001, 2002; Holden & Von Kortzfleisch, 2004). Zhu (2004: 74), for instance, questions the popular claim that KM is becoming a universal management concept and correctly notes that such a universal concept would not only be unrealistic but even counterproductive and thus undesirable. However, the question of how cross-cultural differences influence KM has so far received only limited research attention (Edwards & Kidd, 2003; Ford & Chan, 2003; Zhu, 2004), and “the literature is almost silent on knowledge management in its cross-cultural dimensions” (Glisby & Holden, 2003: 29).

3. Aim and Scope of the Paper

This paper presents insights from a current research project on knowledge transfer, creation and sharing in a cross-cultural context. Applying a case study research design, an empirical study using qualitative interviews with managers and other corporate staff was conducted in Japan in 2005 and 2006. The cross-cultural context is given by the work settings in MNCs with either headquarters or subsidiaries in Japan. As a matter of fact, KM experts have frequently pointed to Japanese companies as role models of knowledge organizations for their Western counterparts, and Japanese scholars play a leading role in the development and advancement of empirical research and theory in the fields of knowledge creation and KM (see above). Therefore, the empirical study underlying this paper aimed at identifying and analyzing differences in Japanese and Western approaches to knowledge management and in the process of sharing and transferring knowledge in corporations. This paper highlights relevant factors of influence on knowledge transfer within corporations and points to the critical issue of knowledge retention which has emerged recently in Japan.

4. Research Methodology

In order to analyze the process of knowledge creation and transfer in MNCs, this study adopted an exploratory research strategy. Indeed, qualitative research, rather than traditional quantitative empirical tools, is particularly useful for exploring implicit assumptions and examining new relationships, abstract concepts, and operational definitions (Bettis, 1991; Weick, 1996). One important objective of this study is to conduct an analysis of different patterns and ways of knowledge creation and transfer within MNCs that helps to develop new hypotheses and build theory on how companies can do so efficiently and successfully, and thus to contribute to the theory of knowledge creation in an international context and to develop constructs that facilitate future hypothesis testing. As case studies have an important function in generating hypotheses and building theory (cf. e.g. Eisenhardt, 1989; Hartley, 1994, 2004; Kohlbacher, 2005), I chose a case study research strategy. The research was conducted over a period of more than one year and involved triangulation among a variety of different sources of data, including formal and informal on- and off-site interviews (Kvale, 1996; Rubin & Rubin, 1995) with managers as well as scholars and other experts in the field, analysis of archival materials such as company-internal documents and articles in the business media (Forster, 1994; Hodder, 2000), and an evaluation of existing case studies and other relevant literature (Yin, 2003). In total, qualitative interviews with more than 70 top executives, middle managers and selected employees in more than 20 different MNCs – Japanese, European and US American – were conducted in 2005 and 2006 in Japan. Where necessary and appropriate, supplementary interviews were conducted at headquarters or subsidiaries in Kolín, Czech Republic; Vienna, Austria; and Munich, Germany.
For the research on two companies, I used participant observation (Waddington, 2004; Yin, 2003) in addition to interviews with key persons and worked on the researched project as a part-time employee for several months in Tokyo. Unless permission was denied or recording seemed inappropriate for other reasons, all interviews were recorded and transcribed verbatim. In the course of the qualitative interviews, semi-structured questions in accordance with the theory of organizational knowledge creation within firms were employed. The interview partners could nevertheless answer openly and largely lead the interview. After transcription, the interviews were coded and analyzed according to Mayring’s qualitative content analysis, which is “an approach of empirical, methodological [sic] controlled analysis of texts within their context of communication, following content analytical rules and step by step models, without rash quantification” (Mayring, 2000, [5]; cf. also Kohlbacher, 2005). Finally, as Kohlbacher (2005) has shown, there are synergy effects of using qualitative content analysis in case study research, and he argues strongly in favor of this combination. As for sampling, I opted for purposive sampling and theoretical sampling. The former is essentially strategic and entails an attempt to establish a good correspondence between research questions and sampling, as the researcher samples on the basis of wanting to interview people who are relevant to the research questions (Bryman, 2004). Theoretical sampling entails sampling interviews until one’s categories achieve theoretical saturation and selecting further interviewees on the basis of one’s emerging theoretical focus (cf. Bryman, 2004; Glaser & Strauss, 1967; Strauss & Corbin, 1990). Hence, both the sample companies and the interview partners were chosen on the basis of their potential to contribute insights to the research questions – e.g.
according to extant literature, studies, and the evaluation of experts such as the MAKE (Most Admired Knowledge Enterprise) award (cf. e.g. English & Baker, 2006) – and because they offered a variety of different approaches to knowledge creation and transfer (Eisenhardt, 1989).

5. Findings

This section summarizes the most important intermediate findings from the explorative empirical study. The expert interviews and the in-depth case studies of the companies helped to generate essential hypotheses on KM and knowledge organization(s) in Japan. These hypotheses are presented in the form of research propositions on the following topics:

1) Knowledge organization in Japan
1.1. Japanese knowledge organization seems to be more people-centered, with a focus on tacit knowledge and face-to-face communication, while its Western counterpart concentrates more on IT tools and explicit knowledge.
1.2. As the majority of foreign firms’ employees in Japan are Japanese, these differences are less obvious and significant, and the Japanese way dominates.

2) Transfer of knowledge in MNCs in Japan
2.1. Transferring knowledge between headquarters and subunits is an important issue for both Japanese and Western MNCs.
2.2. In addition to the transfer of existing knowledge, creating new knowledge locally and then disseminating it globally is becoming more and more critical for companies.
2.3. Language and cultural differences seem to be the most decisive factors influencing knowledge sharing in a cross-cultural context.

3) Aging workforce and knowledge organization in Japan
3.1. A potentially huge loss of knowledge looms due to mass retirement in Japanese firms in 2007 (nisennananen-mondai, the year-2007 problem).
3.2. Knowledge retention and the passing on of tacit knowledge from elderly, experienced employees to their successors has become a critical issue in Japan.

6. Discussion

In this section, I will briefly discuss the research propositions in the context of the extant literature.

6.1. Knowledge organization in Japan

The two research propositions here seem to be in line with the extant literature on knowledge organization in Japan. Takeuchi and Nonaka (2000), for instance, summarize the fundamental differences between the Western approach to knowledge (KM) and the Japanese approach to knowledge (knowledge creation) as follows:

– how knowledge is viewed: knowledge is not viewed simply as data or information that can be stored in the computer in Japan; it also involves emotions, values, hunches;
– what companies do with knowledge: companies do not merely “manage” knowledge, but “create” it as well;
– who the key players are: everyone in the organization is involved in creating organizational knowledge, with middle managers serving as key knowledge engineers. (p. 184)

Indeed, while “the focus in the West is not on knowledge per se, but on measuring and managing knowledge”, the Japanese emphasize the cognitive dimension of knowledge and see organizations as living organisms rather than machines for processing information (Takeuchi, 2001: 317, 321). Furthermore, in contrast to the Western predominance of intensive reflection at the individual level, in the Japanese firm the primacy is at the organizational and group level: quality circles, ringi systems, long working hours followed by collegial after-hour talk and drinking are all mechanisms to encourage the sharing of knowledge (Hedlund & Nonaka, 1993).

6.2. Transfer of knowledge in MNCs in Japan

As Hansen and Nohria (2004: 22) correctly note, the ways for MNCs to compete successfully by exploiting scale and scope economies or by taking advantage of imperfections in the world’s goods, labor and capital markets are no longer as profitable as they once were, and as a result, “the new economies of scope are based on the ability of business units, subsidiaries and functional departments within the company to collaborate successfully by sharing knowledge and jointly developing new products and services”. Indeed, the interviews confirmed that both Japanese and Western MNCs in Japan are well aware of the importance of transferring knowledge between headquarters and subunits and vice versa, as well as between subunits. Besides, more and more companies realize that they need a new approach to their global business development, one that includes learning about unique local needs and requirements, adapting to them while coordinating globally for operational excellence, and shifting responsibility to local staff. This need to unlock the potential of globally dispersed knowledge has been called ‘the metanational imperative’ (Doz et al., 2001), and the term ‘front-line management’ has been used to describe a form of management in which “the workplace is recognized and valued as the center of knowledge creation and in which knowledge-creation resources […] and processes […] are concentrated at the front line of the company” (Yasumuro & Westney, 2001: 178). The fact that in basically any company “critically important knowledge resides in the workplace – on the factory floor, within sales and service organizations that deal directly with customers, at the “bench” in the R&D lab”, in short at the “front lines” of the company (Yasumuro & Westney, 2001: 178), underscores the importance of tacit knowledge and the need to involve local staff in the process of creating local knowledge and disseminating it globally.
Moreover, a recent study on the transfer of knowledge from Japanese MNCs to their subsidiaries abroad revealed three factors as especially influential on the knowledge flow: the knowledge receiver’s experience of having lived in a foreign country (negative), a high proficiency in Japanese (positive), and a low perceived cultural difference towards Japan (positive) (Haghirian & Kohlbacher, 2005). While the former two aspects can be seen as rather neglected factors in prior research (see e.g. Marschan-Piekkari et al., 1999 for language), the latter aspect has been treated extensively (e.g. Hennart & Larimo, 1998; Shenkar, 2001; Williams et al., 1998). However, results and conclusions on the actual impact of cultural distance vary greatly (e.g. Brouthers & Brouthers, 2001; Manev & Stevenson, 2001), with the mainstream arguing for a negative influence, even though some also make a strong claim for a positive impact (e.g. Morosini et al., 1998). Cultural distance has been defined as “the sum of factors creating, on the one hand, a need for knowledge, and on the other hand, barriers to knowledge flow and hence also for other flows between the home and the target countries” (Luostarinen, 1980: 131-132, cited in Barkema et al., 1997: 427-428). Johanson and Vahlne (1977: 24) use the term psychic distance and define it as “the sum of factors preventing the flow of information from and to the market”, with examples being “differences in language, education, business practices, culture, and industrial development”. Especially in Japan, with its unique culture, the cultural and psychic distance as well as the fairly high language barrier play an important role in doing business and sharing knowledge.
From the above, it has become clear that internal knowledge transfer is not an easy task for MNCs and that “[they] need to apply different organizational mechanisms in order to facilitate knowledge transfer and depending on the specific characteristics of the knowledge” (Foss & Pedersen, 2002: 65). This might also be one of the reasons why many MNCs face difficulties in implementing proper KM structures and in coordinating their knowledge flows successfully (cf. e.g. Haghirian & Kohlbacher, 2005; Kasper et al., 2005a, 2005b).

6.3. Aging workforce and knowledge organization in Japan In 2005, Japan’s aging population began to shrink for the first time, and so did its labour force (The Economist, 2005). By 2024, more than a third of the population will be over age 65 – one of the developed world's largest proportions of elderly citizens – and as a result, during the next 20 years the financial wealth of Japanese households will stop growing and begin to decline (Farrell & Greenberg, 2005; McKinsey Global Institute, 2004). In fact, Japan is experiencing the fastest demographic changes among the leading industrial nations, and these changes have obvious impacts on labour markets and employment practices (Dirks et al., 2000). As particular features of the Japanese employment system – such as ‘lifetime employment’ and ‘seniority based promotion’ – have also been relevant in the context of knowledge creation and sharing (cf. e.g. Dirks et al., 2000; McCormick, 2004; Pudelko, 2004), the significance of an aging workforce for knowledge management (KM) issues such as knowledge retention is obvious (DeLong, 2002, 2004; Parise et al., 2005; Tempest et al., 2002). The possibly huge loss of knowledge due to mass retirement in firms in Japan in 2007 has been termed ‘nisennananen-mondai’ (year-2007 problem), and as a result, knowledge retention and the passing on of tacit knowledge from elderly, experienced employees to their successors has become a critical issue in Japan.

7. Limitations and Need for Further Research Although carefully researched, documented and analyzed, the findings from my study are subject to some limitations. First of all, the insights gained were derived from an exploratory study adopting a case study research design and are thus based on single – each probably rather unique – cases, even if this is exactly what case study research is all about (Stake, 2000). Indeed, the common limitations on the generalizability of such field research are well documented (cf. e.g. Eisenhardt, 1989; Hartley, 2004; Yin, 2003), though analytic generalization – in contrast to statistical generalization – is possible (Hartley, 2004; Yin, 2003). Moreover, the findings presented in this paper are only first – albeit essential – results from an empirical study which is in its final stage of research and analysis and will still yield further insights and more detailed analyses. Last but not least, due to the exploratory nature of the study, there is a need for further evaluation and testing of the gained insights and generated hypotheses.

8. References Al-Laham, A., & Amburgey, T. L. (2005). Knowledge sourcing in foreign direct investments: An empirical examination of target profiles. Management International Review, 45(3), 247-275. Alvesson, M., Kärreman, D., & Swan, J. (2002). Departures from knowledge and/or management in knowledge management. Management Communication Quarterly, 16(2), 282-291. Asakawa, K., & Lehrer, M. (2003). Managing local knowledge assets globally: The role of regional innovation relays. Journal of World Business, 38(1), 31-42. Barkema, H. G., Shenkar, O., Vermeulen, F., & Bell, J. H. J. (1997). Working abroad, working with others: How firms learn to operate international joint ventures. Academy of Management Journal, 40(2), 426-442. Bartlett, C. A., & Ghoshal, S. (2002). Managing across borders: The transnational solution (2nd ed.). Boston: Harvard Business School Press. Bettis, R. A. (1991). Strategic management and the straightjacket: An editorial essay. Organization Science, 2(3), 315-319. Bresman, H., Birkinshaw, J., & Nobel, R. (1999). Knowledge transfer in international acquisitions. Journal of International Business Studies, 30(3), 439-462. Brouthers, K. D., & Brouthers, L. E. (2001). Explaining the national cultural distance paradox. Journal of International Business Studies, 32(1), 177-189. Bryman, A. (2004). Social research methods (2nd ed.). New York: Oxford University Press. Davenport, T. H., & Prusak, L. (2000). Working knowledge: How organizations manage what they know. Boston: Harvard Business School Press. DeLong, D. W. (2002). Better practices for retaining organizational knowledge: Lessons from the leading edge. Accenture Institute for Strategic Change, Research Report. DeLong, D. W. (2004). Lost knowledge: Confronting the threat of an aging workforce. New York: Oxford University Press. Dirks, D., Hemmert, M., Legewie, J., Meyer-Ohle, H., & Waldenberger, F. (2000). The Japanese employment system in transition. International Business Review, 9(5), 525-553.
Doz, Y., Santos, J., & Williamson, P. (2001). From global to metanational: How companies win in the knowledge economy. Boston: Harvard Business School Press. Edwards, J., & Kidd, J. (2003). Knowledge management sans frontières. Journal of the Operational Research Society, 54(2), 130-139. Eisenhardt, K. M. (1989). Building theories from case study research. Academy of Management Review, 14(4), 532-550. English, M. J., & Baker, W. H., Jr. (2006). Winning the knowledge transfer race: Using your company's knowledge assets to get ahead of the competition. New York: McGraw-Hill. Farrell, D., & Greenberg, E. (2005). The economic impact of an aging Japan. The McKinsey Quarterly - Web exclusive, May 2005.

Ford, D. P., & Chan, Y. E. (2003). Knowledge sharing in a multi-cultural setting: A case study. Knowledge Management Research & Practice, 1(1), 11-27. Forster, N. (1994). The analysis of company documentation. In C. Cassell & G. Symon (Eds.), Qualitative methods in organizational research: A practical guide (pp. 147-166). London, Thousand Oaks, New Delhi: Sage. Foss, N. J., & Pedersen, T. (2002). Transferring knowledge in MNCs: The role of sources of subsidiary knowledge and organizational context. Journal of International Management, 8(1), 49-67. Glaser, B. G., & Strauss, A. (1967). The discovery of grounded theory: Strategies for qualitative research. Chicago: Aldine. Glisby, M., & Holden, N. (2003). Contextual constraints in knowledge management theory: The cultural embeddedness of Nonaka's knowledge-creating company. Knowledge and Process Management, 10(1), 29-36. Gupta, A. K., & Govindarajan, V. (2000). Knowledge flows within multinational corporations. Strategic Management Journal, 21(4), 473-496. Haghirian, P., & Kohlbacher, F. (2005). Interkultureller Wissenstransfer in multinationalen japanischen Unternehmen. In M. Pohl & I. Wieczorek (Eds.), Japan 2005. Politik und Wirtschaft (pp. 213-233). Hamburg: Institut für Asienkunde (IFA). Hansen, M. T., & Nohria, N. (2004). How to build collaborative advantage. MIT Sloan Management Review, 46(1), 22-30. Hartley, J. (1994). Case studies in organizational research. In C. Cassell & G. Symon (Eds.), Qualitative methods in organizational research: A practical guide (pp. 208-229). London, Thousand Oaks, New Delhi: Sage Publications. Hartley, J. (2004). Case study research. In C. Cassell & G. Symon (Eds.), Essential guide to qualitative methods in organizational research (pp. 323-333). London, Thousand Oaks, New Delhi: Sage Publications. Hedlund, G., & Nonaka, I. (1993). Models of knowledge management in the West and Japan. In P. Lorange, B. Chakravarthy, J. Roos & A.
Van de Ven (Eds.), Implementing strategic processes: Change, learning and co-operation (pp. 117-144). Oxford: Basil Blackwell. Hennart, J.-F., & Larimo, J. (1998). The impact of culture on the strategy of multinational enterprises: Does national origin affect ownership decisions? Journal of International Business Studies, 29(3), 515-538. Hodder, I. (2000). The interpretation of documents and material culture. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 703-715). Thousand Oaks: Sage. Holden, N. (2001). Knowledge management: Raising the spectre of the cross-cultural dimension. Knowledge and Process Management, 8(3), 155-163. Holden, N. (2002). Cross-cultural management: A knowledge management perspective. Harlow: Financial Times/Prentice Hall. Holden, N., & Von Kortzfleisch, H. F. O. (2004). Why cross-cultural knowledge transfer is a form of translation in more ways than you think. Knowledge and Process Management, 11(2), 127-136. Ichijo, K. (2002). Knowledge exploitation and knowledge exploration: Two strategies for knowledge creating companies. In C. W. Choo & N. Bontis (Eds.), The strategic management of intellectual capital and organizational knowledge (pp. 477-483). New York: Oxford University Press. Ichijo, K. (2004). From managing to enabling knowledge. In H. Takeuchi & I. Nonaka (Eds.), Hitotsubashi on knowledge management (pp. 125-152). Singapore: John Wiley & Sons (Asia) Pte Ltd.

Johanson, J., & Vahlne, J.-E. (1977). The internationalization process of the firm: A model of knowledge development and increasing foreign market commitments. Journal of International Business Studies, 8(1), 23-32. Kasper, H., Haltmeyer, B., & Kohlbacher, F. (2005a, 22-26 May 2005). Knowledge management - fact or fiction? Empirical evidence of the current status and practices of knowledge management in multinational corporations. Paper presented at the 14th International Conference for the International Association of Management of Technology (IAMOT), Vienna, Austria. Kasper, H., Haltmeyer, B., & Kohlbacher, F. (2005b, 9-11 June 2005). Thriving on knowledge? Empirical evidence of the current status and practices of knowledge management in multinational corporations. Paper presented at the 6th International Conference on Organizational Learning and Knowledge (OLK6), Trento, Italy. Kohlbacher, F. (2005). The use of qualitative content analysis in case study research [89 paragraphs]. Forum Qualitative Sozialforschung / Forum: Qualitative Social Research [On-line Journal], 7(1), Art. 21. Available at: http://www.qualitative-research.net/fqs-texte/1-06/1-06-21-e.htm [Date of Access: January 6, 2006]. Kvale, S. (1996). Interviews: An introduction to qualitative research interviewing. Thousand Oaks: Sage. Leonard, D. (1998). Wellsprings of knowledge: Building and sustaining the sources of innovation. Boston: Harvard Business School Press. Luostarinen, R. (1980). Internationalization of the firm. Helsinki: Helsinki School of Economics. Manev, I. M., & Stevenson, W. B. (2001). Nationality, cultural distance, and expatriate status: Effects on the managerial network in a multinational enterprise. Journal of International Business Studies, 32(2), 285-303. Marschan-Piekkari, R., Welch, D., & Welch, L. (1999). In the shadow: The impact of language on structure, power and communication in the multinational. International Business Review, 8(4), 421-440. Mayring, P. (2000).
Qualitative content analysis [28 paragraphs]. Forum Qualitative Sozialforschung / Forum: Qualitative Social Research [On-line Journal], 1(2). Available at: http://qualitative-research.net/fqs-e/2-00inhalt-e.htm [Date of access: October 5, 2004]. McCormick, K. (2004). Whatever happened to 'the Japanese model'? Asian Business & Management, 3(4), 371-393. McKinsey Global Institute. (2004). The coming demographic deficit: How aging populations will reduce global savings. Morosini, P., Shane, S., & Singh, H. (1998). National cultural distance and cross-border acquisition performance. Journal of International Business Studies, 29(1), 137-158. Newell, S., Robertson, M., Scarbrough, H., & Swan, J. (2002). Managing knowledge work. Basingstoke: Palgrave Macmillan. Nonaka, I. (1990). Managing globalization as a self-renewing process: Experiences of Japanese MNEs. In C. A. Bartlett, Y. Doz & G. Hedlund (Eds.), Managing the global firm (pp. 69-94). London: Routledge. Nonaka, I. (1994). A dynamic theory of organizational knowledge creation. Organization Science, 5(1), 14-34. Nonaka, I., & Takeuchi, H. (1995). The knowledge-creating company: How Japanese companies create the dynamics of innovation. New York, Oxford: Oxford University Press.

Parise, S., Cross, R., & Davenport, T. H. (2005). It's not what but who you know: How organizational network analysis can help address knowledge loss crises. Working Paper, The Network Roundtable at the University of Virginia. Pudelko, M. (2004). HRM in Japan and the West: What are the lessons to be learnt from each other? Asian Business & Management, 3(3), 337-361. Rubin, H. J., & Rubin, I. S. (1995). Qualitative interviewing: The art of hearing data. Thousand Oaks: Sage. Scarbrough, H., Robertson, M., & Swan, J. (2005). Professional media and management fashion: The case of knowledge management. Scandinavian Journal of Management, 21(2), 197-208. Scarbrough, H., & Swan, J. (2001). Explaining the diffusion of knowledge management: The role of fashion. British Journal of Management, 12(1), 3-12. Schulz, M., & Jobe, L. A. (2001). Codification and tacitness as knowledge management strategies: An empirical exploration. Journal of High Technology Management Research, 12(1), 139-165. Shenkar, O. (2001). Cultural distance revisited: Towards a more rigorous conceptualization and measurement of cultural differences. Journal of International Business Studies, 32(3), 519-535. Skyrme, D. J. (1998). Fact or fad? Ten shifts in knowledge management. Knowledge Management Review, 1(3), 6-7. Spender, J. C. (1996). Making knowledge the basis of a dynamic theory of the firm. Strategic Management Journal, 17(Winter Special Issue), 45-62. Stake, R. E. (2000). Case studies. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 435-453). Thousand Oaks: Sage. Strauss, A., & Corbin, J. M. (1990). Basics of qualitative research: Techniques and procedures for developing grounded theory. Newbury Park: Sage. Takeuchi, H. (2001). Towards a universal management concept of knowledge. In I. Nonaka & D. J. Teece (Eds.), Managing industrial knowledge: Creation, transfer and utilization (pp. 315-329). London: Sage. Takeuchi, H., & Nonaka, I. (2000).
Reflection on knowledge management from Japan. In D. Morey, M. Maybury & B. Thuraisingham (Eds.), Knowledge management: Classic and contemporary works (pp. 183-186). Cambridge, Massachusetts: The MIT Press. Tempest, S., Barnat, C., & Coupland, C. (2002). Grey advantage: New strategies for the old. Long Range Planning, 35(5), 475-492. The Economist. (2005). The sun also rises: A survey of Japan, October 8th 2005. Tseng, Y. M. (2006). International strategies and knowledge transfer experiences of MNCs' Taiwanese subsidiaries. The Journal of American Academy of Business, 8(2), 120-125. Waddington, D. (2004). Participant observation. In C. Cassell & G. Symon (Eds.), Essential guide to qualitative methods in organizational research (pp. 154-164). London, Thousand Oaks, New Delhi: Sage Publications. Weick, K. E. (1996). Drop your tools: Allegory for organizational studies. Administrative Science Quarterly, 41(2), 301-313. Wiig, K. M. (2004). People-focused knowledge management: How effective decision making leads to corporate success. Burlington, MA: Elsevier Butterworth-Heinemann. Williams, J. D., Han, S.-L., & Qualls, W. J. (1998). A conceptual model and study of cross-cultural business relationships. Journal of Business Research, 42(2), 135-143. Yasumuro, K., & Westney, D. E. (2001). Knowledge creation and the internationalization of Japanese companies: Front-line management across borders. In I. Nonaka & T. Nishiguchi

(Eds.), Knowledge emergence: Social, technical, and evolutionary dimensions of knowledge creation (pp. 176-193). New York: Oxford University Press. Yin, R. K. (2003). Case study research: Design and methods (3rd ed. Vol. 5). Thousand Oaks: Sage. Zhu, Z. (2004). Knowledge management: Towards a universal concept or cross-cultural contexts? Knowledge Management Research & Practice, 2(2), 67-79.

Ann Doyle
University of British Columbia, Canada

Naming and Reclaiming Indigenous Knowledges in Public Institutions: Intersections of Landscapes and Experience.

Abstract: This paper tells a story of a practitioner’s path to research in classification theory. The path leads from a canoe-run through Pacific forest to a university campus, all located on the traditional lands of the Musqueam people, and finds that the university library has no authorized name in its vast catalogues and databases for the people of this place -- the Musqueam. As the first step of an enquiry into the nature of library classification of Indigenous knowledges, the paper explores the relationship of library classification to the dominant discourses, the effects of this relationship on access to Indigenous knowledges held in libraries, and the consequences for the education of Aboriginal and non-Aboriginal students. The next steps are to examine theoretical foundations that may serve to guide the design, development and evaluation of classification systems for organizing and naming Indigenous knowledges in public collections.

1. Introduction Ulqsən1 is the name of the point of land where the university is located on the traditional lands of the Musqueam people. Walking paths now trace the original Indigenous trails that led from the river to the fresh water site and fishing camps on the inlet, criss-crossing the present day endowment lands. The land remembers the good places for sturgeon, the lookouts, the crab apple gathering places, and places for medicines. As a librarian working at the First Nations library here, I often hear student assistants explaining to visitors, “The name of the Library is Xwi7xwa, pronounced whei’wha, in the Squamish language, it means echo.” This story is told anew as each visitor asks a question about the name and opens the doorway to a new understanding, a different landscape, a landscape with its own ways of knowing, and its own ways of telling. This is the ground of my question: what could an Indigenous library look like here? How would the Indigenous value of the balance of the physical, emotional, spiritual, and intellectual dimensions manifest in library classification, and how could Indigenous ontologies and epistemologies inform knowledge organization -- naming and structures? The principles have already been put in place in the library where I work, guided by the elders and their aspirations for the next generations.2 The Xwi7xwa Library’s knowledge organization and naming systems aim to be congruent with Indigenous worldviews and reflect Indigenous intellectual landscapes in order to support an organizational mandate to make the University's vast resources more accessible to Aboriginal peoples.
The commitment to Indigenous knowledge organization emerges from two interrelated considerations: 1) Standard library knowledge organization and naming systems carry the bias of the dominant culture and thereby marginalize or exclude Indigenous histories, cultures, knowledges, languages, and efforts toward self-determination -- jurisdictional and intellectual. 2) The development of meaningful knowledge organization and subject representation systems for Indigenous knowledges within libraries and their educational institutions is integral to the larger project of Indigenous scholarship, research, and pedagogy at local and global levels. It also contributes to capacity building within local communities, and extends foundations for cross-cultural understandings. From an international perspective, it is part of the larger project of repatriation of Indigenous cultural and intellectual property. Due to the convergence of technologies and spread of bibliographic utilities, the ubiquitous classification systems in global contexts have unprecedented power to erase local and regional knowledge domains. Theoretical and applied research on Indigenous knowledges organization contributes to the larger project of knowledge organization for a global learning society.

2. Background The First Nations House of Learning is an Aboriginal student services unit at the University of British Columbia in Vancouver, Canada. It was formed imagining Raven, the northwest coast trickster and symbol of creativity, transforming the university to reflect First Nations cultures and philosophies and linking the university to First Nations communities (First Nations House of Learning, 2005). A clear vision and many years of negotiations by Aboriginal people secured a separate Aboriginal collection on campus and the building of facilities for Aboriginal student services, including a library building in 1993. The library mandate is to collect, organize and preserve textual and non-textual records relating to the Aboriginal peoples of British Columbia with a focus on Indigenous perspectives and scholarship. In the 1980s, the librarian Gene Joseph, of the Wet’suwet’en Nadleh’den Nations, selected the Brian Deer Indigenous classification scheme for the collections and began to develop Aboriginal subject headings to describe the contents (Joseph, 1993). She understood that any possible futures for an Aboriginal library were written in the organization of the knowledge and the ways in which it is named. The Deer classification, developed in the 1970s, is the only Indigenous general knowledge classification system in Canada. However, at present, it does not, and was not designed to, accommodate the large historic and contemporary interdisciplinary literature on Indigenous topics and a burgeoning Indigenous scholarship. The research challenge is to investigate principles to inform the development of classification tools and practices that give voice to Indigenous knowledges and Indigenous scholarship and are congruent with the demands of Indigenous research methodologies and ethics.

3. A Theoretical Lens: The Sociology of Education In seeking a theoretical lens with which to view the intersections of libraries, education and Aboriginal peoples, the scholarship of the New Sociology of Education represented by Michael Young, Basil Bernstein and Michael Apple offers some insight. Interested in knowledge and power relations, these theorists view curriculum as a form of knowledge organization. It is understood as a symbolic, material and human environment that is socially constructed and socially distributed. They question what knowledge is selected to be legitimized by educational and social systems, examine how it becomes available to certain groups (and not to others) and how some knowledge is incorporated into the processes and content of education, such as inclusion in curriculum. Bernstein states, “The distribution of power and principles of social control are reflected in the ways in which society selects, classifies, distributes, transmits and evaluates the educational knowledge it considers public” (p. 47). Textbooks transmit and distribute educational knowledge and controversies over what is considered to be legitimate knowledge often centre on what is included or excluded from textbooks: “They help set the canons of truthfulness ...” (Apple, 2000, 46). Some classification theorists hold that library subject headings and classifications can be viewed as text, as a discourse that carries traces of histories and political and social contexts. (Bowker and Star, 1999, 55) If viewed in this way, the new sociology of education could also provide tools to understand how dominant classification and subject representation systems entrench what is ‘taken for granted’ as legitimate knowledge, and how socially marginalized groups and their knowledge domains are excluded.

4. Aboriginal Education Since the National Indian Brotherhood published its first national policy paper, Indian Control of Indian Education, in 1972, Aboriginal people have emphasized the primacy of culturally appropriate curriculum to the successful education of Aboriginal students. Bias in curriculum continues to be viewed as a crucial factor contributing to the failure of the education system for Aboriginal children (Hampton, 1995; Battiste, 2000). Post-secondary institutions also teach about First Nations in their hidden curriculum as well as their stated curriculum: “They transmit attitudes, values, and beliefs about what is important, who is credible, the “right” way to do things, and place of Aboriginal peoples in Canada.” The design of these educational processes occurs at both conscious and unconscious levels (Hampton, 2000, 215). The National Indian Brotherhood’s early analysis of the public school curriculum concluded that Aboriginal children will “continue to be strangers in Canadian classrooms until the curriculum recognizes Indian customs and values, Indian languages …” and their ongoing contributions to Canadian society (p. 26). In 1974, the Manitoba Indian Brotherhood’s The Shocking Truth about Indians in Textbooks presented a thorough content analysis of the representation of Aboriginal peoples in textbooks. The study found texts to be derogatory, incomplete, and distorted as regards Aboriginal people and identified ten types of bias.3 These ten types of bias, present in Canadian school textbooks thirty years ago, could also be identified in the standard Anglo-American classification and subject representation systems currently used by libraries. The effects on the education and self-image of Aboriginal people, cross-cultural understandings, and the homogenization of society’s knowledge systems are similar. For example, in the university library catalogue there is no subject entry for Musqueam, the Nation on whose unceded land the university is built.
A search for works on Indigenous classification retrieves the pejorative term ‘primitive classification’, and a search for elders in this locale retrieves the heading ‘Salish aged’, a term which skews the meaning and ignores the ubiquity of the term ‘elder’ in Indigenous contexts.

5. Library Classification and Homogenization: Erasures and Loss Library and Information Science (LIS) scholarship has documented cultural bias in subject access through classification and subject headings since the 1930s (Berman, 1971, 1981; Yeh, 1971; Olson, 2002; Hermalata, 1995; Foskett, 1982). Indigenous knowledges have been marginalized through historicization, omission, lack of specificity, lack of relevance and lack of recognition of sovereign nations. This is documented in Canada (Lee, 2001; Lawson, 2004; Blake, 2003) and internationally, in the United States (Olson, 2002; Carter, 2002; Exner, 2005), Australia (Moorcroft, 1997) and New Zealand (Simpson, 2005; Smith, 1999). Classification systems reinforce the established intellectual and literary canon by placing subjects in traditional places, and by reinforcing the expectations of users to find them there (Olson, 2002, 29). Notions about quality and authority underlie canon development and what is chosen as part of the canon (Searing, 1986). These same notions underlie the criteria that libraries use in selecting materials, such as favourable reviews or indexing by standard sources. The problem is that the reviewers often lack a depth of knowledge of Indigenous topics and scholarship (Taylor and Patterson, 2004), and at a systemic level, the standard sources by definition choose more of the same. The information industry not only acts as a gatekeeper to knowledge, it also controls the interpretation of knowledge through the naming of concepts and the application of subject headings (Moorcroft, 1993). These practices shape the current library collections that in turn shape research patterns and determine options that are available to future researchers. In this way they also construct memory (Traister, 1999, 213), and skew the telling and retelling of Aboriginal histories (Moorcroft, 1997, 108-112; Shilling and Hausia, 1999, 18).
Collections and subject representation “affect the way library patrons view themselves and their relation to their academic community, as well as, to the larger culture” (Manoff, 1992, 3-4). Moreover, understandings of identity are related to self-image and psychological well-being (Joseph and Lawson, 2003). Librarians are urged to acknowledge the importance of tribal governments through their acquisitions, collections, reference publications and classification schemes (Carter, 2002, 14) and to recognize that First Nations and Aboriginal people “are not just racial groups, they are also self-governing, sovereign political entities empowered to exercise governmental functions” (p. 23). Hope Olson points out that the Dewey Decimal Classification (DDC), the most widely used classification system in the world, is in use in over 135 countries and translated into over 30 languages. Similarly, the Library of Congress Subject Headings (LCSH) is used in libraries around the world: it is gradually becoming an “international subject language” (2002, 13). Convergence of networks and bibliographic utilities facilitates the copying of catalogue records among libraries and the sharing of data over networks and through consortia. Libraries in 82 countries use OCLC and copy millions of its records worldwide (Kyung-Sun, 2003; Olson, 2002). The standardization of knowledge organization and subject representation systems enables unprecedented sharing of knowledge and also unprecedented power to erase local and regional knowledge domains. At risk are the voices that represent the diversity of human experience, such as the uniqueness of Indigenous cultures, languages, stories and the ways of expressing them. The result could be the loss of representation of and access to alternative ways of understanding, conduct and being in the world (Smith, 2005).

6. Challenges for Library Classification LIS classification theory recognizes that its traditional foundations of logical division and postpositivistic paradigms do not adequately express perspectival knowledge (Ranganathan, 1967) or its ‘border areas’ (Broadfield, 1946), and that its challenge is to seriously imagine theoretical alternatives. Feminist theorists interested in the relationships between power and knowledge, and in multivocality, produce an interdisciplinary literature that develops theoretical strategies for bridging limits (Rose, 1994, cited in Olson, 2002). This literature envisions ‘boundary objects’ to link disparate knowledge domains (Bowker and Star, 1999) and ‘eccentric techniques’ to create spaces for multiple voices (Olson, 2002). Digital library researchers also seek methods of traversing boundaries, both disciplinary and technical, for information retrieval of web resources and electronic collections (Dean, 2003; Manoff, 2000). Internationally, there are Indigenous thesaurus projects in Australia and New Zealand. The Maori Subject Headings grew out of research on the information needs of Maori people and aim to provide Maori people with access to the Maori body of knowledge held in public institutions. In Australia, the Aboriginal Thesaurus aims to improve access to Aboriginal and Torres Strait Islander materials. The Rasmussen Library at the University of Alaska Fairbanks, in recognition of local Indigenous language revitalization efforts, has undertaken the reclassification of all Hyperborean languages (Alaskan and other Arctic Native languages) due to multiple inaccuracies and omissions within the Library of Congress classification. “We do not want to be perceived, as libraries often are, as a component of a white, European imperialist institution but rather as supportive partners in this process of cultural reassertion” (Lincoln, 2003: 266).
In Canada, the Royal Commission on Aboriginal Peoples (RCAP) calls for educational improvement through establishing an Aboriginal documentation centre and clearinghouse to provide access to Indigenous histories, knowledges and research (1996, Vol. 3, 24). The repatriation of Indigenous knowledges is viewed as integral to the larger repatriation of cultural and intellectual property taken historically. Library and Archives Canada notes in a recent consultation report that Aboriginal resources and services are affected by “issues of racism and ignorance raised by present cataloguing standards and terminology” (2003, p. 23). Although there are approximately ten specialized Indigenous classifications in use in North America (Hills, 1997), LIS theoretical work deriving from Indigenous epistemologies and values, and comprehending the contemporary self-determination projects of First Nations, has yet to be imagined. Libraries, archives, museums, cultural centres, and digital collections could all benefit from conceptual and theoretical research on the knowledge organization of Indigenous topics. From Indigenous perspectives, “research like schooling, once the tool of colonization and oppression, is very gradually coming to be seen as a potential means to reclaim languages, histories and knowledge, to find solutions to the negative impacts of colonialism and to give voice to alternative ways of knowing and being” (Smith, 2005, 91).

5. Next Steps

This paper is the first part of an enquiry into the nature of library classification of Indigenous knowledges. It is contextual: it explores the relationship of library classification to hegemonic discourses, the effects of this relationship on access to Indigenous knowledge held in libraries, and the consequences for the education of Aboriginal and non-Aboriginal students and for the self-determination efforts of First Nations and Aboriginal communities. Designers of classification tools need to gain an understanding of our own intellectual genealogies and proclivities in order to be aware of what is, and what is not, useful for the purposes of the project. The next steps are to examine theoretical foundations to guide the design, development and evaluation of classification systems for organizing Indigenous knowledges in public collections. Because some forms of Indigenous knowledge are considered to be the cultural and/or intellectual property of the Nations, the research focuses on public collections. A further phase of the research will seek grounds of compatibility between Indigenous classifications and existing classification systems. There is a gap in the North American literature on theoretical foundations for organizing and describing Indigenous knowledges; however, there is a growing Maori literature (Simpson, 2005) describing Maori classification projects in New Zealand. The research will build on the existing theoretical literature, guided by the scholarship on Indigenous knowledges, Indigenous research methodologies and ethics. Indigenous knowledges typically recognize the primacy of relationship and interconnectedness (Hampton, 1995), are place-based (Kawagley, 1993), rooted in genealogy, informed by Indigenous language, and attuned to the wisdom of revelation (Cajete, 1994). Indigenous research methodology (Smith, 1999; Castellano, 2004) requires a commitment to produce work relevant to Aboriginal community needs.
The ethics of the “4 R’s Protocol”: respect, relevance, reciprocity, and responsibility (Kirkness & Barnhardt, 1991) guide such work. This type of qualitative research project is a blend of pragmatic and interpretive methods and could follow a plan to: (1) collect existing Indigenous library classifications and subject headings; (2) conduct interviews with the creators and users of those classifications and subject headings to determine design principles and usability; (3) undertake a collaborative project with an Aboriginal community that intends to describe Aboriginal collections from an Aboriginal perspective; (4) reflect on the principles that informed the collaborative research; and (5) present a case study of the use of the classifications and subject headings as a proof of concept. The purpose of the research is to explore theoretical tools to aid in the development of classifications of Indigenous collections. It intends to improve access to information that is germane to Indigenous interests and to facilitate Indigenous research and knowledge production. Improved access may serve to foster the success and participation of Indigenous students within educational institutions. The research is congruent with the Royal
Commission on Aboriginal Peoples (RCAP) policy goal of the affirmation of Aboriginal knowledges (Castellano, 2000). From an international perspective, it is part of the larger project of repatriation of Indigenous cultural and intellectual property held in public institutions. Finally, it aims to make space for Indigenous research and scholarship within the academy to benefit Aboriginal students and thereby also contribute to a more relevant and vibrant academic community.

Notes

1. Larry Grant, Musqueam elder. Musqueam Language class, Musqueam Elders’ Centre, Musqueam Nation, British Columbia, Term 1, 2000. Ulqsən means nose or point in the hən’q’əmin’əm’ language, one of three dialects of Halkomelem which, like many Indigenous languages in Canada, is endangered.
2. For definitional purposes, this paper uses the terminology of the Royal Commission on Aboriginal Peoples (1996): Aboriginal people refers to the indigenous inhabitants of Canada when referring to Inuit, First Nations and Metis without regard to separate origins and identities. The term Native is used as a synonym when it appears in cited materials. The term First Nations replaces Indian except when the latter is used in a source document. Aboriginal peoples refers to organic political and cultural entities arising historically from the original peoples of North America. Indigenous and Indigenous peoples refer to organic political and cultural entities arising as the original peoples of the world. Canada. Royal Commission on Aboriginal Peoples. Report of the Royal Commission on Aboriginal Peoples (Minister of Supply and Services, 1996): xv.
3. Manitoba Indian Brotherhood, The Shocking Truth About Indians in Textbooks (Winnipeg, Manitoba: Manitoba Indian Brotherhood, 1974). The ten types of bias include: bias by omission, defamation, disparagement, cumulative implication, lack of validity, inertia, obliteration, disembodiment, lack of concreteness, and lack of comprehensiveness.

References

Apple, M. W. (2000). Official knowledge: Democratic education in a conservative age (2nd ed.). New York: Routledge.
Battiste, M. (2000). Reclaiming Indigenous voice and vision. Vancouver: UBC Press.
Berman, S. (1981). The joy of cataloging: Essays, letters, reviews and other explosions. Phoenix, AZ: Oryx Press.
Bernstein, B. B. (1971). On the classification and framing of educational knowledge. In M. F. D. Young (Ed.), Knowledge and control: New directions from the sociology of education.
Blake, D., Martin, L., Pelletier, D., & Library and Archives Canada. (2004). Library and Archives Canada report and recommendations of the consultation on Aboriginal resources and services. Ottawa: Aboriginal Resources and Services, Library and Archives Canada.
Bowker, G. C., & Star, S. L. (1999). Sorting things out: Classification and its consequences. Cambridge, Mass.: MIT Press.
Broadfield, A. (1946). The philosophy of classification. London: Grafton & Co.
Buschman, J., & Carbone, M. J. (1991). A critical inquiry into librarianship: Applications of the “new sociology of education”. The Library Quarterly, 61, 15-40.
Cajete, G. (1994). Look to the mountain: An ecology of Indigenous education (1st ed.). Durango, Colo.: Kivakí Press.
Canada. Royal Commission on Aboriginal Peoples, Erasmus, G., & Dussault, R. (1996). Report of the Royal Commission on Aboriginal Peoples. Ottawa: The Commission.
Carter, N. C. (2002). American Indians and law libraries: Acknowledging the third sovereign. Law Library Journal, 94(1), 7-26.
Castellano, M. B. (2004). Ethics of Aboriginal research. Journal of Aboriginal Health, January, 98-114.
Castellano, M. B., Davis, L., & Lahache, L. (2000). Aboriginal education: Fulfilling the promise. Vancouver: UBC Press.
Dickason, O. P. (2002). Canada's First Nations: A history of founding peoples from earliest times (3rd ed.). Don Mills, Ont.: Oxford University Press.
Exner, F. K. (2005, February). The impact of naming practices among North American Indians on name authority. (Doctoral dissertation, University of Pretoria).
First Nations House of Learning. The University of British Columbia, First Nations House of Learning. [http://www.longhouse.ubc.ca]. Accessed 9 September 2005.
Foskett, A. C. (1996). The subject approach to information (5th ed.). London: Library Association Publishing.
Hampton, E. (2000). First Nations controlled university education in Canada. In M. B. Castellano, L. Davis & L. Lahache (Eds.), Aboriginal education: Fulfilling the promise. Vancouver: UBC Press.
Hampton, E. (1995). Towards a redefinition of Indian education. In M. A. Battiste & J. Barman (Eds.), First Nations education in Canada: The circle unfolds (pp. 355). Vancouver: UBC Press.
Iyer, H. (1995). Classificatory structures: Concepts, relations and representation. Frankfurt/Main: Indeks Verlag.
Joseph, G. (1993). Xwi7xwa library information (Brochure ed.). Vancouver, British Columbia: First Nations House of Learning.
Joseph, G., & Lawson, K. (2003). First Nations and British Columbia public libraries. Feliciter, 49(5), 245-247.
Kawagley, A. O. (1993). A Yupiaq world view: Implications for cultural, educational, and technological adaptation in a contemporary world (1st ed.). Vancouver, B.C.: Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Kim, K. (2003). Recent work in cataloging and classification, 2000-2002. Library Resources & Technical Services, 47(3), 96-108.
Kirkness, V. J., & Barnhardt, R. (1991). First Nations and higher education: The four R's--respect, relevance, reciprocity, responsibility. Journal of American Indian Education, 30(3), 1-15.
Lawson, K. L. (2004). Precious fragments: First Nations materials in libraries, archives and museums. (Master's thesis, University of British Columbia).
Lee, D. A. (2001). Aboriginal students in Canada: A case study of their academic information needs and library use. Journal of Library Administration, 33(3/4), 259-292.
Lincoln, T. (2003). Cultural reassertion of Alaska Native languages and cultures: Libraries' responses. Cataloging & Classification Quarterly, 35(3/4), 265-290.
Manitoba Indian Brotherhood, & Kirkness, V. J. (1974). The shocking truth about Indians in textbooks: Textbook evaluations. Winnipeg: Manitoba Indian Brotherhood.
Manoff, M. (2000). Hybridity, mutability, multiplicity: Theorizing electronic library collections. Library Trends, 48(4), 857-876.
Manoff, M. (1992). Academic libraries and the culture wars: The politics of collection development. Collection Management, 16(4), 1-17.
Moorcroft, H. (1997). Libraries as sites for contested knowledges: Collection development in the area of Aboriginal studies. Collection Building, 16(3), 108-112.
Moorcroft, H. (1996). Reflections on constructing an Aboriginal and Torres Strait Islander thesaurus [reprinted from Cataloguing Australia, 1994]. Alternative library literature, 1994/1995 (pp. 257-260). McFarland & Co.
Moorcroft, H. (1993). The construction of silence. Australian Library Journal, 42, 27-32.
National Indian Brotherhood. (1972). Indian control of Indian education. Ottawa: National Indian Brotherhood.
Olson, H. A. (2004). The ubiquitous hierarchy: An army to overcome the threat of a mob. Library Trends, 53(2), 604-616.
Olson, H. A. (2002). The power to name: Locating the limits of subject representation in libraries. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Patterson, L. (2000). History and status of Native Americans in librarianship. Library Trends, 49(1), 182-193.
Patterson, L. (1995). Information needs and services of Native Americans. Rural Libraries, 15(2), 37-44.
Ranganathan, S. R., & Gopinath, M. A. (1967). Prolegomena to library classification (3rd ed.). Bombay, New York: Asia Publishing House.
Rose, G. (1993). Feminism and geography: The limits of geographical knowledge. Minneapolis: University of Minnesota Press.
Searing, S. E. (1986). Feminist library services: The women's studies librarian-at-large, University of Wisconsin system. Women's collections (pp. 149-162). Haworth Press.
Shilling, K., & Hausia, B. (2001). Cultural survival -- for the record. In R. Sullivan (Ed.), International Indigenous librarians' forum: Proceedings of the first international Indigenous librarians' forum, held at Waipapa, University of Auckland, 1-5 November 1999 (pp. 112). Te Rōpū Whakahau.
Simpson, S. (2005, February). Te ara tika: Nga ingoa kaupapa Maori: Purongo tuatoru = Guiding words: Maori subject headings project: Phase 3 research report. [http://www.trw.org.nz/publications/Te_Ara_Tika_Guiding_Words.pdf]. Accessed 10 January 2006.
Smith, L. T. (2005). On tricky ground: Researching the Native in the age of uncertainty. In N. K. Denzin & Y. S. Lincoln (Eds.), The SAGE handbook of qualitative research (3rd ed., pp. 85-107). Thousand Oaks: Sage Publications.
Smith, L. T. (1999). Decolonizing methodologies: Research and Indigenous peoples. London; New York: Zed Books; Dunedin, N.Z.: University of Otago Press.
Taylor, R. H., & Patterson, L. (2004). Native American resources: A model for collection development. Selecting materials for library collections (pp. 41-54). Haworth Information Press.
Traister, D. (1999). You must remember this ...; or, libraries as the locus of cultural memories. In D. Ben-Amos & L. Weissberg (Eds.), Cultural memory and the construction of identity (pp. 333). Detroit: Wayne State University Press.