Knowl. Org. 41(2014)No.1

KO KNOWLEDGE ORGANIZATION

Official Bi-Monthly Journal of the International Society for Knowledge Organization ISSN 0943 – 7444 International Journal devoted to Concept Theory, Classification, Indexing and Knowledge Representation

Contents

Articles

Ole Olesen-Bagneux. The Memory Library: How the Library in Hellenistic Alexandria Worked ...... 3

Xiaoyue Ma and Jean-Pierre Cahier. An Exploratory Study on Semantic Arrangement of VDL-Based Iconic Knowledge Tags ...... 14

Heather Lea Moulaison, Felicity Dykas, and John M. Budd. Foucault, the Author, and Intellectual Debt: Capturing the Author-Function Through Attributes, Relationships, and Events in Knowledge Organization Systems ...... 30

Papers from the ISKO-UK Biennial Conference, “Knowledge Organization: Pushing the Boundaries,” United Kingdom, 8-9 July, 2013, London

Fausto Giunchiglia, Biswanath Dutta, and Vincenzo Maltese. From Knowledge Organization to Knowledge Representation ...... 44

Elena Konkova, Ayşe Göker, Richard Butterworth, and Andrew MacFarlane. Social Tagging: Exploring the Image, the Tags, and the Game ...... 57

Andreas Oskar Kempf, Dominique Ritze, Kai Eckert, and Benjamin Zapilko. New Ways of Mapping Knowledge Organization Systems: Using a Semi-Automatic Matching Procedure for Building up Vocabulary Crosswalks ...... 66

Tanja Svarre and Marianne Lykke. Experiences with Automated Categorization in E-Government Information Retrieval ...... 76

Brief Communication

Ingetraut Dahlberg. What is Knowledge Organization? ...... 85

Emilia Currás. The Nature of Information and Its Influence in Human Cultures ...... 92

Books recently published ...... 96

Index to Volume 40 ...... 97


KNOWLEDGE ORGANIZATION

This journal is the organ of the INTERNATIONAL SOCIETY FOR KNOWLEDGE ORGANIZATION (General Secretariat: Vivien PETRAS, Humboldt-Universität zu Berlin, Institut für Bibliotheks- und Informationswissenschaft, Unter den Linden 6, 10099 Berlin, Germany).

Editors

Richard P. SMIRAGLIA (Editor-in-Chief), School of Information Studies, University of Wisconsin, Milwaukee, Northwest Quad Building B, 2025 E Newport St., Milwaukee, WI 53211 USA.
Hanne ALBRECHTSEN (Book Review Editor), Institute of Knowledge Sharing, Frisersvej 1, 2, DK-2920 Charlottenlund, Denmark.
Nancy WILLIAMSON (Classification News Editor), Faculty of Information Studies, University of Toronto, 140 St. George Street, Toronto, Ontario M5S 3G6, Canada.
Melodie Joy FOX (Editorial Assistant), School of Information Studies, University of Wisconsin, Milwaukee, Northwest Quad Building B, 2025 E Newport St., Milwaukee, WI 53211 USA.
Daniel Martínez Ávila (Editorial Assistant), Department of Library and Information Science, University Carlos III of Madrid, C/Madrid 126, 28903 Getafe – Madrid, Spain.

Editors Emerita

Hope A. OLSON, School of Information Studies, University of Wisconsin-Milwaukee, Northwest Quad Building B, 2025 E Newport St., Milwaukee, WI 53211 USA.
Clare BEGHTOL, Faculty of Information Studies, University of Toronto, 140 St. George Street, Toronto, Ontario M5S 3G6, Canada.
Ingetraut DAHLBERG, Am Hirtenberg 13, 64732 Bad König, Germany.

Editorial Board

Jonathan FURNER, Graduate School of Education & Information Studies, University of California, Los Angeles, 300 Young Dr. N, Mailbox 951520, Los Angeles, CA 90095-1520, USA.
Jesús GASCÓN GARCÍA, Facultat de Biblioteconomia i Documentació, Universitat de Barcelona, C. Melcior de Palau, 140, 08014 Barcelona, Spain.
Claudio GNOLI, University of Pavia, Mathematics Department Library, via Ferrata 1, I-27100 Pavia, Italy.
Rebecca GREEN, Assistant Editor, Dewey Decimal Classification, Dewey Editorial Office, Library of Congress, Decimal Classification Division, 101 Independence Ave., S.E., Washington, DC 20540-4330, USA.
José Augusto Chaves GUIMARÃES, Departamento de Ciência da Informação, Universidade Estadual Paulista–UNESP, Av. Hygino Muzzi Filho 737, 17525-900 Marília SP, Brazil.
Birger HJØRLAND, Royal School of Library and Information Science, Copenhagen, Denmark.
Barbara H. KWASNIK, School of Information Studies, Syracuse University, Syracuse, NY 13244 USA.
María J. LÓPEZ-HUERTAS, Universidad de Granada, Facultad de Biblioteconomía y Documentación, Campus Universitario de Cartuja, Biblioteca del Colegio Máximo de Cartuja, 18071 Granada, Spain.
Kathryn LA BARRE, The Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, MC-493, Champaign, IL 61820-6211 USA.
Marianne LYKKE, e-Learning Lab, Center for User-driven Innovation, Learning and Design, Department of Communication, Aalborg University, Kroghstraede 1, room 2.023, 9220 Aalborg OE, Denmark.
Ia MCILWAINE (Literature Editor), Research Fellow, School of Library, Archive & Information Studies, University College London, Gower Street, London WC1E 6BT, U.K.
Jens-Erik MAI, Royal School of Library and Information Science, Copenhagen, Denmark.
Widad MUSTAFA el HADI, Université Charles de Gaulle Lille 3, URF IDIST, Domaine du Pont de Bois, Villeneuve d’Ascq 59653, France.
H. Peter OHLY, Prinzenstr. 179, D-53175 Bonn, Germany.
K. S. RAGHAVAN, KAnOE (Centre for Knowledge Analytics & Ontological Engineering), PES Institute of Technology, 100 Feet Ring Road, BSK 3rd Stage, Bangalore 560085, India.
M. P. SATIJA, Guru Nanak Dev University, School of Library and Information Science, Amritsar-143 005, India.
Aida SLAVIC, UDC Consortium, PO Box 90407, 2509 LK The Hague, The Netherlands.
Dagobert SOERGEL, Department of Library and Information Studies, Graduate School of Education, University at Buffalo, 534 Baldy Hall, Buffalo, NY 14260-1020.
Renato R. SOUZA, Applied Mathematics School, Getulio Vargas Foundation, Praia de Botafogo, 190, 3o andar, Rio de Janeiro, RJ, 22250-900, Brazil.
Joseph T. TENNIS, The Information School of the University of Washington, Box 352840, Mary Gates Hall Ste 370, Seattle WA 98195-2840 USA.
Maja ŽUMER, Faculty of Arts, University of Ljubljana, Askerceva 2, Ljubljana 1000, Slovenia.

The Memory Library: How the Library in Hellenistic Alexandria Worked

Ole Olesen-Bagneux

University of Copenhagen, 6 Birketinget, DK-2300, Copenhagen, Denmark

Ole Olesen-Bagneux holds an MLISc and began his Ph.D. in 2011 at the Royal School of Library and Information Science, now part of the University of Copenhagen. In the fall of 2012 he studied at Anthropologie et Histoire des Mondes Antiques in Paris, affiliated with both L’École des Hautes Études en Sciences Sociales and the Sorbonne. Here, under the fruitful guidance of Professor Christian Jacob, he spent some very long days reading in the Bibliothèque Gernet-Glotz. He also followed Jacob’s courses on his theory of places of knowledge (lieux de savoir).

Olesen-Bagneux, Ole. The Memory Library: How the Library in Hellenistic Alexandria Worked. Knowledge Organization. 41(1), 3-13. 37 references.

Abstract: For millennia the famous library in Hellenistic Alexandria has been praised as an epicenter of enlightenment and wisdom. And yet, a question still seems unanswered: how was its literature classified and retrieved? It is a subject that has been given surprisingly little attention by the field of library and information science, and indeed by scholarship in general. Furthermore, a certain way of thinking has influenced the few answers that have so far been attempted. It is as if the scholars of our era have tried to identify the modern, physical library in the Hellenistic library in Alexandria. But such an approach is biased in a basic way: it simply does not consider the impact of the cultural and intellectual context of the library. This article differs fundamentally, because I reject the notion that the library was like those of today. Accordingly, an entirely new way of understanding how the library actually worked, in terms of classification and retrieval processes, is presented. The key element is to understand the library both as a physical structure and as a structure in the memory of the Alexandrian scholars. In this article, these structures are put together so as to propose a new interpretation of the library.

Received 21 August 2013; Revised 25 October 2013; Accepted 29 October 2013

Keywords: library, memory, literature, Pinakes, Aristophanes, mechanics, scholars, Alexandria

1.0 Introduction

Very little is known about the ancient library of Alexandria. Sources indicate that it could have contained between 40,000 and 700,000 scrolls (Staikos 2004). Nor can it be ascertained exactly when it was established, but it must have been shortly after 300 BCE. It is reasonable to accept that it must have looked something like its later rival, the Attalid library in Pergamum, erected around 200 BCE. We have rather firm knowledge about the architecture of the latter (Hoepfner 2002). The library in Alexandria was part of a religious institution, the Mouseion, and the scholars were in fact extremely skilled slaves who were imprisoned within the Mouseion. Attempted escape could be penalized by death (Canfora 1990), and part of the poetry written by these locked-up scholars was performed during religious ceremonies (Meillier 1979).

It seems quite obvious that the ancient library of Hellenistic Alexandria was not like a modern library, not at all. Nevertheless, the library has been misinterpreted, quite substantially, by modern scholarship as though it had been similar to modern libraries. This can be seen in the descriptions of how the library worked, of how classification and retrieval were conducted within it. In his book Libraries in the Ancient World Lionel Casson (2001, 41) provides such a description. He writes about the Pinakes by Callimachus, calling it: “A key to the vast collection: from his Pinakes users could determine the existence of any particular work; from his shelf-list they could determine its location. He had created a vital reference tool.”

Casson claims that the Pinakes was not the catalog of the library, but merely a sort of bibliography, a point of view that contradicts the generally accepted one. Casson believes that a specific list was not integrated in the Pinakes,
and that this list was the catalog (Casson 2001, 153). This is pure assumption; no actual evidence of this can be found. It is as if Casson reproduces a modern distinction between catalog and bibliography in antiquity, as if he seeks to confirm the link between libraries in modern times and libraries in antiquity. It makes little sense to accept Casson’s view, since it builds on the assumption that the Alexandrian scholars would maintain complex, unnecessary and time-consuming workflows in retrieving the literature in the library, for no actual reason. Konstantinos Staikos (2004, 186) expresses the generally accepted view of the Pinakes: “What Callimachus set out to do was to compile a comprehensive ‘bibliographical’ list of authors and their works that would also serve as a library catalogue. The result was the Pinakes.” Although I agree with Staikos, I believe that he describes only a part of how the library worked. And even though he differs with Casson, he thinks like him: he uses quotation marks for the word bibliographical, knowing that he pushes a point further than what Callimachus himself would have understood.

Scholarship on the Alexandrian library is heavily biased by the unfruitful desire to retrieve elements similar to those present in our own era. It is as if we want to know what constituted the bibliography of the Alexandrian library instead of trying to grasp what the Pinakes actually was in its own respect. Both Casson and Staikos think of the library of Hellenistic Alexandria as a modern, physical library, totally uninfluenced by the intellectual principles of the Greek past it was dedicated to protect. Staikos (2000, 67) goes so far as to claim that: “Quite possibly the ‘’ underlying the Pinakes was entirely Callimachus’s own idea and owed nothing to the cataloguing methods employed by the Peripatetics at the Lyceum in Athens or the methods devised by the Babylonians for use in their great collections of archives.”

I think that the specific assumption might be correct, that the Peripatetics and the Babylonians did not influence Callimachus. Nevertheless, with this assumption at hand Staikos (2000) simply denies that the entire intellectual heritage played any role whatsoever in the way the Alexandrians organized their library. And that, I think, is not correct. Like Staikos, Phillips (2010) believes that the Alexandrian library was more in contact with our present reality than the Greek era that had just ended. Phillips (2010) even goes as far as to conclude that the library of Alexandria simply was the first modern library in the world, since it had all the characteristics of a modern library!

I disagree fundamentally with the view represented by Staikos (2000), Casson (2001) and Phillips (2010). They are blinded by the many centuries of human civilization that divide the present from the 3rd and 2nd century BCE Alexandrian reality. And so, unfortunately, they all basically conclude that the Alexandrian library worked like a modern, physical library.

Instead of searching for elements similar to modern ones, my analysis turns the perspective around. I will argue that the way the library in Hellenistic Alexandria worked was in fact the result of a close and functional connection with the Greek past it also contained. I believe that the key to understanding the library lies in the story about Aristophanes of Byzantium (Jacob 2010). In this story, it is claimed that Aristophanes knew the structure and content of the library by heart. Accordingly, I agree with Christian Jacob (2010, 11) on the nature of the memory of Aristophanes, i.e. as a mental construct that somehow matches the library. But I think it is a demonstration of how the library worked, not only for Aristophanes, but for the scholars in general. Therefore, Jacob’s view is followed in this article, but his considerations are widened and supported with evidence.

The main body of the article has three parts. The first part is called The Dead Library. In antiquity, physical text was considered to be related to death (Svenbro 1988). Accordingly, The Dead Library deals with the physical structure, the actual library, of organized texts. It examines how the physical scrolls were classified and retrieved. But the reader must keep in mind that this was not how the library worked, only an aspect of how it worked! The second part is called The Living Library. Human beings were called living libraries in antiquity if they could remember impressive amounts of literature (Too 2010). Therefore, The Living Library analyzes the scholar in antiquity, how he or she was able to store, search, remember and quote enormous amounts of literature from memory. Finally, the third part is called The Memory Library. This third part melds the dead and the living library into one constellation, and claims that this was how the library actually worked. In this part, I argue that the library, be it in memory or the actual physical library, could be sung. “The memory library” is a new term, and yet the Greek word Μουσεῖον (Mouseion) could be translated as exactly this: “Memory library.”

2.0 The Dead Library

As mentioned briefly above, death and written text were considered to be closely connected in Greek antiquity. Actually, the written testimonies of a person (in modern times we would call this the collected works of an author) were viewed as the true tomb of the person leaving them behind. These written testimonies simply outmatched the sepulchral monument representing a person who died (Platthy 1968, 96). More recent studies have shown that the link between death and written culture evolved in antiquity (Svenbro 1988, 13) and culminated in a refined literary wave coined as the Alexandrian avant-garde (Bing 2008, 144-145). The remarkable esthetics of that wave regarded a library as an enormous graveyard, containing the true sepulchral monuments of the writers now dead. Thus, the physical library in Alexandria in Hellenistic times is called “the dead library.” In the following, I will describe how the dead library was organized.

2.1 Zenodotus

Zenodotus of Ephesus (330-260 BC) was most likely the first director of the library in Alexandria. It is believed that he refined the organization of the library extensively, since he was able to conduct a complete and critical version of Homer. And in order to do so, the different versions of Homer had to be strictly organized. So, he probably divided the holdings of the library into at least two categories, or, at the very least, he created a principle of division that was later to be followed. These two categories were critically edited texts and different versions of the same text that were yet to be compared in order to establish the critical edition. It has been argued that Zenodotus divided texts into classes that followed a classification scheme (e.g. Casson 2001, 37-40), but this argument is not supported with evidence, besides the accepted assumption that Zenodotus must have created a list of inventory that mentioned each scroll contained in the library.

Along with this division came a more frequent use of the Sillybos, the little note that was attached to each scroll, with information that in modern times would be called metadata. The Sillybos would hold the title of the first text or more likely the incipit (the first words of the text). It would also hold the stichometric sum, that is, the total number of lines, stichos, in the Homeric verses. Originally, the stichometric sum was used to control production of text; it originated from classic Athens and was not invented in Alexandria. People knew that a certain song was a certain number of lines long, and thus the total sum of lines indicated whether the scribe had conducted honest labor (Witty 1958, 134). The Alexandrian scholars invented a new way of using the stichometric numbers, as we shall see below. The Sillybos would also hold the name of the critical editor, for example the “Zenodotus version.” This indicates frequent use: all texts were to be critically edited at some point.

Quite certainly, the library was arranged alphabetically from the start, since Zenodotus left proof that he was familiar with alphabetization (Casson 2001, 37-40). But this was only alphabetization by the first letter. This way of alphabetizing has been subject to speculation (e.g. Blum 1991, 227) because it is uncertain whether it was the first step towards complete alphabetization, or whether it contained a potential in its own respect, different from complete alphabetization.

2.2 Callimachus

Callimachus of Cyrene (305-240 BC) was probably not the director of the library, but he had substantial influence on its organization. In this respect, he is famous for composing the Πίνακες τῶν ἐν πάσῃ παιδείᾳ διαλαμψάντων (Pfeiffer 1949), which translates thus: Tables of those who distinguished themselves in all branches of learning and their writings. It is usually just referred to as Pinakes, its first word in Greek, meaning table or board. It should be mentioned, though, that Callimachus composed several Pinakes (Witty 1973). The Pinakes consisted of 120 scrolls and contained information about writers and their works. It has been characterized as so many different genres, among them literary encyclopedia (Lerner 2001, 29), register of literary matter (Cancik et al., 1996-), catalog (Staikos 2004, 186), biobibliographical catalogue raisonné (Witty 1958, 132), bibliography (Jacob 2007, 1127) and biobibliography (Blum 1991, 1), that it would probably be most suitable to define it as a genre of its own, pinakography, as mentioned but refuted by Blum (1991, 9). Although Blum (1991) is right when he describes the Pinakes as a biobibliography, Callimachus would not have had a clue about the meaning of such a word, at least not as a literary genre. It blurs the analysis of what might have been Callimachus’s intention with his work when it is categorized as something that did not exist in his era. The aforementioned attempts categorize the Pinakes as genres that were not yet invented, and basically just label it with a retronym that does not answer what it was in its own respect. Nevertheless, the problem of genre clearly demonstrates that it is difficult to ascertain what the Pinakes was. It has not reached us; it is lost, but we have testimonies of its existence and content (Witty 1958, 133-36) that can enable a discussion with scientific authority.

The Pinakes was divided into classes and three are known with certainty: law, rhetoric and miscellaneous. Another seven seem likely (Witty 1958, 136; Pfeiffer 1949, 349), creating a total of ten classes. Most likely even more than ten classes existed, and different assumptions have been made in trying to imagine the totality of the classes of the Pinakes (e.g. Parsons 1952, 204-19). Each class would be divided into subclasses, though they were divided in different ways: chronologically, topographically and biographically (Pfeiffer 1968, 129). The number of classes and their subdivisions is not that important to my point. What is important is the fact that each class matched a certain area of the library, which is a generally accepted assumption (e.g. Staikos 2004, 186). That a work was placed within a
certain class in Pinakes meant that it was located in the area or room of this class in the library. Each entry in the Pinakes simply matched a physical location.

The list of inventory by Zenodotus was perhaps used as a catalog in the library (Casson 2001, 37-40). Even though Blum (1991) has been criticized substantially for his research (e.g. Barnes 2000, 77), he makes several very qualified points, one of them being the different nature of the list of inventory by Zenodotus and the Pinakes by Callimachus. A list of inventory only mentions a scroll in such a way that it is retrievable. That was not the case with the Pinakes. Consider the title: Tables of those who distinguished themselves in all branches of learning and their writings. Blum (1991, 226) points to the fact that scrolls containing more than one author or several works, or even both, were not described with satisfying precision in the list of inventory. It did not inform about the writers or works contained in the library, only the scrolls. But the Pinakes, on the other hand, mentioned all those who distinguished themselves in all branches of learning and their writings. It was without doubt the Pinakes that became the library “catalog,” since it was the tool that mentioned all writers (or those who had been written down by others) and what they had written (or what others had written down). Each entry in the Pinakes would start with a short biography of the writer, and then mention his works. Each work was mentioned by its title or incipit, the stichometric sum, and the number of books (scrolls) it consisted of. This information was also indicated on the Sillybos, as mentioned above. This permits the first description of the library mechanics. From the class in the Pinakes one knew what area of the library to go to, to find a given author, and from the information in that same author’s entry in Pinakes, one could even locate the exact scroll.

Most likely, the library mechanics had a step between the area and the work of the author. This step was the place of the specific author. Very little can be said with precision about this, but many sources indicate such a step. In Pergamum, for example, the library of the Attalid kings had sculptures representing authors (Callmer 1944, 150-151) that were perhaps located close to that author’s scrolls (Hoepfner 2002, 49). The word pinakes is itself another indication, since it probably originally meant boards or tables hung on the shelves or walls of the library, to indicate the same information as the Pinakes by Callimachus. It is also possible to grasp the place of the author due to impressive research by Gaëlle Coqueugniot (2007). She concludes that the word kibôtos most likely was the common description of the entities that contained scrolls (Coqueugniot 2007, 304), even if these entities were different in size and shape (box, bag, coffin or shelves). Accordingly, Coqueugniot discusses the many possibilities of translation of the word kibôtos into French, and the same thing can of course be done in English. “Container” has been chosen here.

But if authors in a given class were given a specific place, wouldn’t it become impossible to keep that place as the collection grew? This is where alphabetization by only the first letter comes into play. More writers could simply be added at the end of the list in Pinakes (under each letter, that is) and simultaneously be given their own container in the room to which they belonged. This again makes it probable that the containers or places of authors were recognizable visually, by tables or sculptures, because crowded rooms by nature leave little space for orientation. This way of ensuring solid structure through flexibility was a sort of upside-down Dewey that permitted writers to be located in the same spot almost to eternity (though only in the logic of slow text production, i.e. pre-Gutenberg).

2.3 Aristophanes

Aristophanes from Byzantium (ca. 260-185 BC) is normally (e.g. Staikos 2004, 181-182) not considered a contributor to the innovation of library mechanics in Hellenistic Alexandria. He updated the Pinakes into a new version, which is not regarded as significant. But in fact, two important things happened during his time as director of the library.

The first thing is very simple, and yet its implication is substantial. The stichometric note as mentioned above was only described as indicating a total sum. Evidently, keeping track of, say, 12,739 lines only in the mind was a tough job while at the same time copying a text. Therefore, the scribes noted the stichometric numbers continuously, like small signs next to the column of text. The system was like this: A = 100 lines; Β = 200 lines; Γ = 300 lines; Δ = 400 lines and so on. In Athens, the stichometric sum was proof of honest labor, but in Aristophanes’ time as director in Alexandria, the stichometric numbers along the text began to be used as references, just as in modern times we use references to chapters and pages (Irigoin 2001, 24-26). The stichometric number helped indicate which part of the text was requested.

The second thing is not traceable in the mechanics of the dead library. It will become clear in the next part, “The Living Library,” that it played a central role in the mechanics of the living library, and for the memory library as a whole. And since it dealt with the written language, and was carried out by Aristophanes, it is mentioned here. Aristophanes reformed the Greek language. He introduced a more stringent and frequent use of diacritical signs (they already appear in writings from classic times). These signs, above and around the letters of the Greek language, helped demonstrate how syllables are pronounced (Irigoin 2001, 42). It is very likely that Aristophanes reformed the language in such a way because Alexandria was a cultural melting pot, attracting scholars from as far as India. These foreigners needed help to adapt to the Greek language, which had not been the case in classical Hellas, where the intelligentsia had Greek as their first language (Canfora 1992, 20-21). It is not a mistake for the reader to compare Aristophanes’ reform with the difference between written UK and US English. But the diacritical signs were not employed at each syllable where they should have been according to pronunciation. This has been quite a mystery to modern scholars. It is evident that they symbolized a system, and that they were much more frequent than in the classical era. But what was the principle of their employment? Gregory Nagy (2000, 9) has resolved this problem by turning the modern philological editing of manuscripts from the time of Aristophanes into a philological study itself. What he saw was that modern editions of these manuscripts blurred an understanding of the diacritical signs in relation to the original meter, in this case the metric cola, a meter that most likely was introduced by Aristophanes himself. Originally, the diacritical signs expressed the rhythm of the metric cola. Put simply, a line played out a melody:

Table 1: The colometric melody

Line 85

The colometric melody (col. X (VIII)):
αμφιτρυωνιάδασ•
• ειπεντε τισαθανατων
Above, the diacritical signs have been used to point out the rhythm of the entire colon. Each colon is expressed as a unit, almost as if it was one word.

Modern layout (col. 12 (8)):
Ἀμφιτρυωνιάδας,
εἶπέν τε• ῾῾τις ἀθανάτων
Above, the diacritical signs have been used to explain the pronunciation of each syllable. Each colon is expressed staccato, the readability is heightened, but the melody is lost.

Nagy calls this melody the colometric melody. It was probably a part of the library mechanics of the living library, as we shall see below.

2.4 The mechanics of the dead library

To sum up, the mechanics of the dead library evolved into a six-step procedure around 200 BC. From the Pinakes, one was led to a specific room via the class of literature. In the room (or area) a sculpture or tablet made the containers visually recognizable; this led the scholar to the author. In the container, work and scroll could be identified by information on the sillybos that matched the information in the entry in Pinakes. Furthermore, a specific part in the scroll could be located via the stichometric numbers. This is demonstrated in Figure 1.

Figure 1: The mechanics of the dead library

3.0 The living library

Opposed to the dead library was “the living library.” The term actually occurs in literature from antiquity, and as a phenomenon it was current. The living library was a scholar, capable of remembering a large amount of literature, a feature that can most likely be interpreted as a heritage from the rhapsodes of archaic Hellas. But it had a significant difference: not only did the scholar remember the literature, he also remembered its location, both in memory and in a physical library. The literature contained in the memory of the scholar mirrored the physical library, as though the physical library were imagined each time a work was sought. Testimonies of living libraries actually indicate that they began to occur just about the time when the mechanics of the dead library were in place. The aforementioned Aristophanes from Byzantium was a living library (Jacob 2010, 11), and he will be analyzed as such in the following. But the mechanics of the living library are approached in reverse chronology, simply because that explains it in the clearest way. And so, we begin with Athenaeus.

3.1 Athenaeus

Not until the late Roman period does the literature that has reached us reveal the mechanics of the living library. Athenaeus of Naucratis (2nd century CE) was a Greek-speaking scholar living in Rome. He composed the work Δειπνοσοφισταί (Deipnosophistae) (Weber-Nielsen 1990). The title translates in two ways: The Dinner-table Philosophers and Experts of the Dinner-table. Its Greek title is kept here to illustrate both meanings, because they are both important in this context. The Deipnosophistae is in many ways the key to literature in antiquity, since large quantities of literature have reached us only through this work by Athenaeus. Instead of writing on his own, Athenaeus composed a story that enveloped enormous amounts of already existing literature. In order to do so, he needed a course of events, and he chose an almost never-ending banquet as the setting. As delicate servings were carried in and out, the scholars were stimulated in various ways. When they were starved and impatient, accusations rose around the table. When new and surprising plates were served, the scholars joyfully exclaimed their happiness. The scholars described each event with long quotations from literature, and they quoted that literature from their memories.

Before proceeding further with the analysis of the Deipnosophistae, it should be mentioned that a certain tradition of interpretation will not be followed, nor accepted, in this article. This tradition basically interprets the Deipnosophistae as a messy work, symbolizing cultural decay (e.g. Too 2010, 114). It is correct that the overall story lacks

ample cabbage. The cabbage, like everything else, opens a universe of literature, and so, comic poets, philosophers and experts in plants are cited in an elegant continuous composition that describes … cabbage! Each scholar perused the web of literature in a non-linear pattern, zapping between authors, browsing each author’s work, in the sense “the subject is described here, in this way, and here again, in that way,” and so on and so on. The sum of all those patterns constitutes the conversation in the Deipnosophistae.

The scholars, the living libraries, were able to quote exact phrases and the occurrence of words. When they went into zetesis mode and searched their web of literature, both the sound of words and their visual representation were in play. While Jacob grasps the refined complexity of the web of Athenaeus, the general assumption that both sound and visual representation of words or phrases played a vital role for the mechanics of the living library is generally accepted (e.g. Carruthers 2008, 101). Included in the sounds is of course the colometric melody, but as Jacob clearly demonstrates, by the time Athenaeus composed his Deipnosophistae the skills of the living library had evolved substantially.

3.2 Aristophanes, once again

At this point, we are able to go back in time, and once again look at the merits of Aristophanes from Byzantium (ca. 260-185 BCE). As already mentioned, Aristophanes was a living library (Jacob 2010, 11). This is documented in Vitruvius’s treatise on architecture, De Architectura (Jacob 2010). Vitruvius tells the story of a poetry contest held at the Ptolemaic court in Alexandria, when Aristophanes was a young man.
The contest was a recitatio compositional unity (Weber-Nielsen 1990, 8-9) and that and thus, in the literary it is to be understood as a these rather unimportant, small details can indeed serve as public performance with its roots in the tradition of the the foundation of many a pedantic-analytical critique, pin- rhapsodes, and the private reading aloud of poetry pointing obvious mistakes as though that were the sole amongst friends that was to become common in Rome. purpose of the humanities. Instead, let’s look at this enig- Contrary to the rhapsodes the person performing in recita- matic treasury that Athenaeus was so kind to leave us, let’s tio read aloud from manuscript, and contrary to what was see what he was up to, had in mind. to become the habit in Rome, it was still done in public. Christian Jacob analyses the Deipnosophistae in The Web of Aristophanes―so Vitruvius tells ―was appointed leader Athenaeus (2013) in an original way. He regards the memo- of the library because he was able to expose the contest- rized literature as a sort of common web that the scholars ants in the competition as cheaters. They were not poets; energetically and constantly peruse during the eternal din- all but one had copied text from various authors in the li- ner. What motivates them is zetesis, the urge to explore brary, simply claiming that it was their own poetry. Aristo- something in depth. It is not entirely impossible to de- phanes recognized the poetry and was able to tell who had scribe how this urge unfolded, how the web of Athenaeus originally composed it. To prove his point, relying only on worked. The scholars seem to browse important writers on his memory, he had an endless amount of scrolls taken different subjects, lists of words, of places or quotations, out of the library. 
He knew where they were stored and and they correct each other when they cite them wrong, was able to find the exact lines that had been copied, and again demonstrating that this web was universal in some compare the texts of supposed poets with the originals. sort. If they cannot agree, the written text appears as the This story has many different layers. It discusses plagia- concluding authority. Jacob begins his book with the ex- rism, but the topic has to be perceived in the light of the Knowl. Org. 41(2014)No.1 9 O. Olesen-Bagneux. The Memory Library: How the Library in Hellenistic Alexandria Worked slow death of oral transmission where borrowing words the Lexeis, the first reference tool on the basis of words. from the past was no crime. It describes the cultural ri- But one should really be careful about claiming the begin- valry between Alexandria and Pergamum, where Vitruvius nings of the understanding of words as phenomenon in favors the library architecture of Pergamum most likely antiquity (Small 1997). due to that city’s strong bond with Rome. It is also a sym- Browsing words (or the occurrence of words) or co- bol of the literary wave of the Alexandrian avant-garde, lometric melodies are marked with horizontal arrows, in since Aristophanes refutes the old-fashioned poetry by the Figure 2 below. In a rather primitive way this illustrates the fake poets that pleases the audience but lacks esthetic re- process of zetesis, that the living library is exploring a men- finement. On top of that, Vitruvius most likely enhanced tal constellation of literature. The mechanics of the living the capabilities of Aristophanes’ memory to add a little library can be illustrated by this figure. drama to the story. So, all in all, Vitruvius is a source that A final remark: How common was the living library? has to be dealt with respectfully, but not naively. 
Consider- There is no point in trying to give a precise answer; too lit- ing Aristophanes as a living library, one has to have all this tle evidence has reached us. Nevertheless, an evolution in mind. can be glimpsed. Aristophanes was appointed director of Indeed, Aristophanes was a living library. How does he the library due to his capabilities. In this context it does expose its mechanics? The story indicates that he was ca- not matter whether this actually happened or not: the pable of recognizing poetry, literature in general, in its ex- story itself testifies that Aristophanes as a living library act phrasing. This seems very similar to the fact that he must have been a rare sight around 230 BCE Alexandria, used the diacritical signs to make colometric melodies, as or at least that he mastered the role of the living library described above. These melodies must have been part of a like no other. On the other hand, in second century CE learning-by-heart memorization that he to some extent Rome, the living libraries gathered in literary discussion could recognize when they (together with other meter) around the dinner-table in Athenaeus Deipnosophistae. The were pronounced or sung by others. The story also tells story is fiction, but the setting seems like a common us, that he was capable of retrieving the scrolls in the li- event, only stretched in time to the extreme. At one point brary containing the melodies―from memory! (V-203e), a person even comments on the Alexandrian li- brary, saying that he does not bother to describe its archi- 3.3 The mechanics of the living library tecture and content since it is in the memory of everyone. Quite possibly, the living library was to begin with a rare Athenaeus and Aristophanes permit a general description and exclusive phenomenon that over the centuries became of the mechanics of the living library in the Hellenistic more and more common, as literacy increased. era. The essential element is the colometric melody. 
Its ex- istence can be ascertained as a part of the mechanics of 4.0 The Memory Library the living library via Vitruvius, as mentioned above. It is likely, though, that other structures such as entire phrases As the story of Aristophanes demonstrates, the mechanics or even longer quotations from texts were also included in of the living library somehow blended with the mechanics the mechanics of the living library. A basic cognitive as- of the dead library. Aristophanes could browse his mem- sumption is that the longer the quotation, the easier it was ory for quotations by authors and he could afterwards lo- for the living library to recognize the author. Also, words cate them in them library. Jacob (2010) claims that this might be considered. Certainly, the living libraries in was exactly the case. Such a skill is also testified by Pliny Athenaeus’s Rome were capable of perusing their mental the Elder in his Naturalis Histioria although the linking be- web for specific words. It might already have been the tween living and dead libraries is often not grasped (e.g. case in Hellenistic Alexandria, since Aristophanes wrote Yates 1965, 41).

Figure 2: The mechanics of the living library

I have presented the dead and living library as I think they must have worked. I will now proceed to argue that they were in fact combined, not by extraordinary coincidence or skill, but as one logical system, that I will call the “Memory Library,” since it relied on human memory and since the Μουσεῖον (Mouseion), the name of the institution containing the library in Hellenistic Alexandria, can be translated as such. I believe that the memory library was a structure that existed both in the memory of the scholars and as a physical library. Instead of pushing the semi-modern library’s reality back in time, claiming that it to some extent existed in the Alexandrian library, as do Phillips (2010), Casson (2001) and Staikos (2000), I will now do the contrary. I believe that the memory library that I am about to describe below was a logical continuation of Greek scholarship in antiquity. I will present what I think is the most important argument in my favor, namely the argument of the human voice (two other essential arguments are mnemonics and literary theory). The argument of the human voice is simply this: The scholars could sing the entire library. Below, I qualify how.

4.1 Singing the literature in the library

Nagy (2000) is not the only one concerned with literature in antiquity as sound. In his books Preface to Plato (1963) and The Muse Learns to Write (1986) Eric A. Havelock analyzes the transformation from orality to literacy in the Greek society in antiquity. Until Plato, a certain type of language dominated the Greek society, a language that, although found in literature, was essentially oral (Havelock 1986, 92-93):

    Greek literature from its beginnings was composed in verse, not prose, and in Athens this continued roughly to the death of Euripides .... The content of the versified language―which, as versified, is storage language, regardless of the individual styles and purposes of individual writers―is uniformly mythic, meaning traditional .… Surviving orality also explains why Greek literature to Euripides is composed as a performance, and in the language of performance. The audience controls the artist insofar as he still has to compose in such a way that they can not only memorize what they have heard but also echo it in daily speech. The language of Greek classic theatre not only entertained its society, it supported it.

Havelock’s main point is that all Greek literature until Plato was composed in verse so that it could be easily memorized, simply because orality was the means to pass on knowledge to the next generation. Generally speaking, adding rhyme, repetitions and beat helped illiterates store huge amounts of knowledge in memory. In fact, such systems were used by illiterate societies all around the world (Skafte Jensen 2011). The Greek version of this system originated from the Homeric formulae and meter, as discovered by Milman Parry in the beginning of the 20th century (Parry and Parry 1971).

Now, when Nagy (2000) points to the fact that the colometric melodies employed by Aristophanes made it clear how to express verse, was it only to help foreigners coming to Alexandria? I think that the colometric melody has to be considered as a logical entry to a universe of beats, of easily retrievable literature by the very way it sounds. Consider the fact that almost all of the literature, excluding small parts of the late philosophy, in its actual phrasing contained a system that permitted it to be retrieved by its sound. Why on earth would the scholars of Alexandria, being the first in history to create a library to pass on knowledge from one generation to another (Bing 2008, 40), abandon the benefits of such a perfect system? Why not profit from it instead? The mechanics of the theatre in Athens could without difficulty be integrated in the mechanics of the library in Alexandria. One can even consider if the scholars were capable of avoiding it: The system could not be withdrawn from the literature it had created; it was the literature.

4.2 Singing the structure of the library

As the literature in the library could be sung, so could the structure of the library, theoretically. The catalog of ships in the second book of the Iliad is far from being the only catalog or list that singers had memorized and recited by voice. Indeed the Greek word Καταλέγω, the etymological root of catalog, means both recite and list. What is important to understand is that these two meanings do not oppose each other: lists were cataloged as they were sung; they were stored only in the memories of the singers. Surely, this practice changed in Athens around 400 BCE when lists began to be written down on scrolls, but the original potential did not disappear overnight. Memorizing lengthy lists was still both a praised rhetorical skill and a necessity for the illiterate. The Greeks did not lose awareness of the fact that lists had been passed on to them orally from generation to generation over a period of at least 400 years (Havelock 1986, 84). Quite the contrary: in Preface to Plato Havelock argues (1963, 43) that Plato excludes poetry from his Republic exactly because all branches of thinking were still influenced, and in Plato’s point of view blurred, by the esthetics of orally transmittable poetry.

This raises a question: If orally-based learning skills, including the ability to recite catalogs, had such a huge intellectual impact even in the fall of Plato’s life in Athens, could it be that a young Aristophanes in Alexandria some
100 years later was still singing the catalog most useful to him? If Aristophanes did sing the Pinakes, this would be the last piece of the puzzle. It would explain not only why he as a living library could identify authors and works by small bits of literature read out loud, but that he could also find the scrolls containing the literature in the library. Why? Because the Pinakes mirrored the physical library. Singing the Pinakes meant singing the library, as structure.

No evidence of this is given, I must admit. Besides Vitruvius’ story of the memory of Aristophanes (Jacob 2010), Pliny the Elder’s testimony of living libraries (Yates 1965), Athenaeus’s statement that all scholars had the content and structure of the library present in their memory (Jacob 2013) and finally all the arguments presented above, we are left to speculation. We cannot with certainty know whether the scholars had the structure of the library in their memory, even though all sources indicate it. I would like to point out that this constitutes an argument in itself: No source at all indicates the opposite of my view. In fact, opposing this idea is merely a result of thinking like Phillips (2010), Casson (2001) and Staikos (2000) that the library in Hellenistic Alexandria was organized like a modern, physical library per se. There is no evidence that the Pinakes was used as a modern, analog reference tool. It is simply assumed.

4.3 The modernity of silent reading

One final argument in support of the idea that the library was sung is the fact that silent reading was rare in the Alexandrian library. Actually, it might not even have occurred. Not until the tenth century did silent reading become the standard way of reading in the western world (Manguel 1996, 43). Until then, reading out loud or at least mumbling the words was the norm. One has to imagine the Alexandrian scholars as reading out loud the literature in the library, every time they read. Therefore, it seems fair to say that both structure and content of the library were sung, and that this was the order of the day. When the scholars recited the Pinakes or the literature out loud, this expressed the structure of the library. Done over and over again, this must at some point have made the scholars reach a level where they most likely could sing the library without consulting the scrolls, relying entirely on their memory. The process was made easier due to the fact that the literature was for the most part inherited oral literature that was designed to be remembered, and that the Pinakes had its roots in the same tradition. In this way, I believe, the scholars singingly memorized the library.

4.4 The mechanics of the memory library

As I have just argued, I believe that the Hellenistic library of Alexandria could be sung, both its literature and its structure. Therefore, its physical structure, the dead library, must have been integrated with its counterpart, the structure in the memory of the scholars, the living library. Basically, the scholars, being living libraries, made use of themselves and the dead library as though they were one structure. They could sing both the structure of the library and the literature it contained from their memory, but they could rely on the physical library in the process of memorization and indeed in cases of uncertainty and oblivion. The dead and the living library put together formed what I have chosen to call “the memory library.” In order to illustrate this, I have simply added together the mechanics of the dead and the living library:

Figure 3: The mechanics of the memory library

The mechanics of the memory library show how classification and indexing, as well as information-seeking, worked in the Hellenistic library of Alexandria.

To explain the mechanics of the memory library in a simple way, the reader must imagine being a living library. Imagine being Aristophanes. He knows the structure of the dead library by heart, since he knows the Pinakes by heart: they are identical. And that’s it, really. The living libraries were able to browse the classes of literature in the Pinakes from memory, could go to specific rooms, authors, works, scrolls, parts and even lines in the work (perhaps even words) without moving. They browsed this structure in their minds. But if they wanted to, the living libraries could verify their content in the dead library, since their mechanics were compatible with each other’s. The living library contained the dead library within it. And the dead library enabled the possibility of the living library.

The memory library had many advantages. It outmatched by far the mechanics of the dead library, because it could be browsed a lot faster than the dead library. Just imagine browsing 120 scrolls for an author, and then running to the area where the author was located, finding the right scroll and then, finally, the right part. That is easier to do in thought than in reality, right? On the other hand, think of the unreliability of human memory. It is easy to forget an author’s literary class, exact phrasing and so on. Well, in this case the memory library was more reliable than the living library: It could not (in theory) lose its memory.

5.0 Conclusion

In the introduction, a specific understanding of how the Alexandrian library worked was rejected. Phillips (2010), Casson (2001) and Staikos (2000) seem to analyze the Alexandrian library by retrieving in it elements from the present reality of libraries. In this article, I have done the exact opposite. Instead of interpreting the library as similar to modern ones, I have claimed that the library was in fact a logical continuation of the Greek intellectual heritage. My point of departure was to follow Jacob (2010). He reflects on the memory of Aristophanes and how it must somehow mirror the organization of the library. In the present article I have followed his considerations, but widened the scope to all the scholars attached to the library. Accordingly, the article frames and outlines the living scholars as perusing their memory, a memory that mirrors the library’s organization.

To defend the idea that these two libraries were in fact one integrated structure, I presented the argument of the voice. Havelock (1963, 1986) observed that Greek literature until Plato was unchallenged as orally transmittable. As a consequence, all of this literature could be stored and retrieved in memory by song. Included in this process were catalogs like the later Pinakes. I have argued that if the Pinakes was actually sung by the scholars, the entire library could be sung, both its content and as a structure. This assumption is supported by the fact that reading in antiquity meant reading out loud (Manguel 1996). Therefore, the argument of the voice qualifies that the dead and the living library constituted one integrated structure. I have framed this structure as the memory library.

The memory library made classification and retrieval faster and more precise than a library merely contained within a building or the human mind. As this article has demonstrated, the mechanics of the memory library reached their level of refinement before 200 BCE, by the time Aristophanes became the director of the library. At this point, the memory library had evolved into a seven-step (maybe eight-step) procedure: from the entire universe of knowledge, to the literary class, author, work, scroll, part, line and perhaps even right down to the specific word. I have argued that this structure could be perused in the mind of the scholar, and could always be verified, because the structure in the mind was also the structure of the physical library.

References

Barnes, Robert. 2000. Cloistered bookworms in the chicken-coop of the muses: The ancient library of Alexandria. In MacLeod, Roy M., ed., The Library of Alexandria: centre of learning in the Ancient World. London: I.B. Tauris Publishers, pp. 61-77.
Bing, Peter. 2008. The well-read muse: present and past in Callimachus and the Hellenistic poets. Ann Arbor, Michigan: Michigan Classical Press.
Blum, Rudolf. 1991. Kallimachos; The Alexandrian library and the origins of bibliography. Madison, Wisconsin: The University of Wisconsin Press.
Callmer, Christian. 1944. Antike bibliotheken. Rom: Svenska Institutet I Rom.
Cancik, Hubert, Schneider, Helmuth and Landfester, Manfred. 1996-. Der neue Pauly. Stuttgart: Verlag J.B. Metzler.
Canfora, Luciano. 1992. La Bibliothèque d’Alexandrie et l’histoire des textes. Liège: Cedopal.
Canfora, Luciano. 1990. The vanished library. Berkeley: University of California Press.
Carruthers, Mary. 2008. The book of memory. Cambridge: Cambridge University Press.
Casson, Lionel. 2001. Libraries in the ancient world. New Haven: Yale Nota Bene.
Coqueugniot, Gaëlle. 2007. Coffre, casier et armoire: la Kibôtos et le mobilier des archives et des bibliothèques grecques. Revue archéologique 2: 293-304.
Havelock, Eric A. 1963. Preface to Plato. Cambridge: Belknap Press, Harvard University Press.
Havelock, Eric A. 1986. The muse learns to write. New Haven: Yale University Press.
Hoepfner, Wolfram. 2002. Die bibliothek eumenes’ II. in pergamon. In Hoepfner, Wolfram, ed., Antike bibliotheken. : Verlag Philip von Zabern, pp. 41-52.
Irigoin, Jean. 2001. Le livre grec des origines à la Renaissance. Paris: Bibliothèque nationale de France.
Jacob, Christian. 2007. Alexandrie, IIIe siècle avant J.-C. In Jacob, Christian, ed. Lieux de savoir : espaces et communautés. Paris: Albin Michel, pp. 1120-45.
Jacob, Christian. 2010. Le bibliothécaire, le roi et les poètes. Athens dialogues e-journal 2: 1-17.
Jacob, Christian. 2013. The web of Athenaeus. Cambridge, Massachusetts: Harvard University Press.
Lerner, Fred. 2001. The story of libraries: from the invention of writing to the computer age. New York: Continuum.
Manguel, Alberto. 1996. A history of reading. New York: Penguin Group.
Meillier, Claude. 1979. Callimaque et son temps. Lille: Université de Lille.
Nagy, Gregory. 2000. Reading Greek poetry aloud: Evidence from the Bacchylides Papyri. Quaderni urbinati di cultura classica 1: 7-28.
Parry, Milman and Parry, Adam. 1971. The making of Homeric verse: The collected papers of Milman Parry. Oxford: Oxford University Press.
Parsons, Edward A. 1952. The Alexandrian Library: Glory of the Hellenic World. London: Cleaver-Hume Press Ltd.
Pfeiffer, Rudolf. 1949. Callimachus: Volumen I fragmenta. Oxford: Oxford University Press.
Pfeiffer, Rudolf. 1968. History of classical scholarship: from the beginnings to the end of the Hellenistic Age. Oxford: Oxford University Press.
Phillips, Heather. 2010. The great Library of Alexandria? Library philosophy and practice. Available http://unllib.unl.edu/LPP/phillips.htm
Platthy, Jenö. 1968. Sources on the earliest Greek libraries, with the testimonia. Amsterdam: A.M. Hakkert.
Skafte Jensen, Minna. 2011. Writing Homer: a study based on results from modern fieldwork. Copenhagen: Det Kongelige Videnskabernes Selskab.
Small, Jocelyn P. 1997. Wax tablets of the mind: cognitive studies of memory and literacy in classical antiquity. London: Routledge.
Staikos, Konstantinos. 2000. The great libraries: from Antiquity to the Renaissance. New Castle, Delaware: Oak Knoll Press.
Staikos, Konstantinos. 2004. The history of the library in western civilization: from Minos to Cleopatra. New Castle, Delaware: Oak Knoll Press.
Svenbro, Jesper. 1988. Phrasikleia : anthropologie de la lecture en grèce ancienne. Paris: Éditions la découverte.
Too, Yun L. 2010. The idea of the library in the Ancient World. Oxford: Oxford University Press.
Weber-Nielsen, Carsten. 1990. Mad & vin i oldtiden: Uddrag af Athenaios’ de lærde middagsgæster. Copenhagen: Museum Tusculanum.
Witty, Francis J. 1958. The pinakes of Callimachus. Library quarterly 1: 132-8.
Witty, Francis J. 1973. The other pinakes and reference works of Callimachus. Library quarterly 3: 237-44.
Yates, Frances A. 1965. The art of memory. London: Routledge.

14 Knowl. Org. 41(2014)No.1 X. Ma and J.-P. Cahier. An Exploratory Study on Semantic Arrangement of VDL-Based Iconic Knowledge Tags

An Exploratory Study on Semantic Arrangement of VDL-Based Iconic Knowledge Tags†

Xiaoyue Ma* and Jean-Pierre Cahier**

*School of Economics and Management, Xidian University, 266 Xinglong Section of Xifeng Road, Xi’an, Shaanxi, 710126
**ICD/Tech-CICO Lab, Université de Technologie de Troyes, BP 2060, 10010 Troyes, France

Xiaoyue Ma has been a lecturer in the department of information management of Xidian University since September 2013. She received her Ph.D. degree in network, knowledge and organization from the University of Technology of Troyes (France). During her studies, she developed an icon system for knowledge tagging. Her research interests currently focus on visual knowledge management and on knowledge organization and sharing, fields in which she has published about fifteen academic papers in international journals and conferences.

Originally an engineer from L’Ecole Centrale de Lyon, France, J.-P. Cahier has been a researcher in knowledge engineering and cooperative work in the Tech-CICO Lab (Troyes, France) since 2005. During his Ph.D. work, he participated in early research on the social semantic web, by which the community can build a dynamic and collective meaning. He built the “Agorae” software tool for building “hypertopic” knowledge maps. Today he focuses on new visual and semiotic approaches to collective knowledge management.

Ma, Xiaoyue and Cahier, Jean-Pierre. An Exploratory Study on Semantic Arrangement of VDL-Based Iconic Knowledge Tags. Knowledge Organization. 41(1), 14-29. 49 references.

Abstract: VDL (Visual Distinctive Language)-based iconic knowledge tags are graphically structured icons for knowledge representation. VDL was developed and assessed to enhance the connection of iconic tags and the connection of tagged knowledge. The purpose of this paper is to present further investigation of an arrangement method for these special tags, as well as the characteristics of better tag presentation in knowledge organization systems (KOS). An online experiment was conducted to compare tagging results for four types of iconic tag presentation: two types of iconic tags (VDL-based iconic tags and iconic tags without explicit structure), each under two arrangement methods (random arrangement and semantic arrangement). Tagging quality and tagging speed were measured to identify how users locate, and locate again, appropriate iconic tags for knowledge tagging. A supplementary test on tag structure identification was also carried out for each tag presentation. Semantic arrangement of VDL-based icons helped users to tag given articles with more appropriate tags in less time. Users identified tag structure better in this type of tag presentation. This in-depth work on VDL-based iconic tags is among the first to investigate how to visually structure knowledge tags, a problem neglected by previous studies on icon knowledge representation.

Received 26 May 2013; Revised 11 September 2013; Accepted 23 October 2013

Keywords: tags, tagging, group tagging, VDL-based icons, semantic arrangement

† This work is partly financed by the Fundamental Research Funds for the Central Universities in the project “Evaluation analysis of knowledge organization system,” no. K5051306024.

1.0 Introduction

Knowledge Organization Systems (KOS’s) (Hodge 2000) is a general term referring to the tools that present the organized interpretation of knowledge structures. It is intended to encompass all types of schemes for organizing knowledge. KOS’s include classification schemes that organize materials at a general level (such as books on a shelf), subject headings that provide more detailed access, and authority files that control variant versions of key information (such as geographic names and personal names). They also include less-traditional schemes, such as semantic networks and ontologies. A structured KOS serves as a bridge between the user's information need and the material in the collection. With it, a user should be able to identify boundary objects of interest (Bowker and Star 1999) without prior knowledge of their existence. Whether through browsing or direct searching, whether through themes on a web page or a site search engine, the structured KOS guides the user through a discovery process.

Knowledge tags (henceforth “tags”) are employed to organize, share, and search information in KOS’s. These short textual labels can be regarded as keywords that imply the categorization of knowledge. For example, when an item of knowledge is marked by the tag “bus,” it is considered to be sorted into the category “bus,” while upper categories such as “transport,” or sub-categories such as “minibus,” might also be available. Tags of KOS’s and their structure work as dynamic knowledge organization access (Kipp and Campbell 2010). Users are able to annotate shared knowledge in a KOS with predefined and recommended tags. New tags proposed by experts and users could in turn be added to a certain category for potential searching use.

A tag cloud selects and presents a limited number of tags in a KOS to make a simple presentation of knowledge. It is a visual interaction between users and knowledge resources by tagging. Besides the visual features of tag clouds such as size, color or font weight (Bielenberg and Zacher 2005; Shaw 2008; Bateman et al. 2008; Rivadeneira et al. 2007), many previous studies also tried to find out which type of tag arrangement would improve the interaction quality of textual tag clouds (Kerr 2006; Chen et al. 2009; Knautz et al. 2010). Among the several arrangement approaches compared, the most accepted view on this issue was to semantically structure tag clouds (Schrammel et al. 2009). A whole tag cloud could be regarded as the combination of several clusters, with the tags in each cluster representing topic-related terms.

In our research we are no longer interested in the tag arrangement of textual tag clouds. However, we need to make use of these empirical results for our new form of tags: VDL-based iconic tags (VDL stands for Visual Distinctive Language) (Figure 1). In former work (Ma and Cahier 2012), VDL-based iconic tags were created and
validated to improve the limits of textual tags, like incomprehension of tag meaning and neglect of controlled vocabulary structure of tags (Kipp and Joo 2010). They also contributed to improving visual thesauri (Shiri and Revie 2005), which bears more on social knowledge tags. However, tags for the former test were all randomly arranged; in other words, semantically related tags were not strictly clustered. To complete the conceptual proposition of VDL-based iconic tags, we continue to investigate how to arrange them for knowledge tagging in KOS’s. Meanwhile, observations on semantically structured textual tag clouds will also be checked for their applicability to VDL clouds. The results will be meaningful to the theoretical foundation of iconic tag clouds that can be implemented in KOS’s and other tag-concerned systems. They could also be meaningful to large-scale icon systems where icons are the main knowledge entities instead of a functional part.

Figure 1. Examples of VDL-based iconic knowledge tags in the field of sustainability (upper for topics and lower for attributes)

In the next section, we will review the state of the art of semantic tag relations and semantically structured textual tag clouds. A presentation of previous work on VDL-based iconic tags in section two will specify the context of research and clarify the motivation of this deeper study. Then section three will explain what semantic arrangement of VDL-based iconic tag clouds means and our hypothesis. A tagging test will be presented to confirm our hypothesis and discuss the characteristics of a better tag presentation in KOS.

2.0 Background

Before discussing the tag arrangement of VDL-based iconic tags, we need first to look back on the studies about tag arrangement of textual tags: what is defined as the semantic relations among tags and why semantically structured tag clouds have more advantages. In the latter part of the background, more details will also be presented about VDL-based iconic tags and the empirical demonstration of former experiments. All of the information is expected to give complete motivation on the research of semantic tag arrangement of VDL-based iconic tags.

2.1 Semantic relations within tags in KOS

The representation of tag structure (a group of tags in KOS) is as important as that of each single tag. On one hand, an explicit tag structure facilitates finding, and later finding again, an appropriate tag in a large group of tags. While searching tags for specific knowledge tagging purposes, relations among them allow users to find several alternatives referring to closely related topics. This leads to deeper comparison and selection among them in order to make better tag choices. On the other hand, tag structure offers a possible link between documents tagged by these tags. When tags reach a semantic consensus, knowledge tagged by them may be intuitively considered associated by common topics or attributes. This connection of documents is useful especially when dispersed documents are represented without clear categorization. Clear tag structure enhances the implicit network of tagged knowledge in KOS, which provides easier organization and seeking.

In spite of the existing vocabulary problem (Sen et al. 2006; Downey et al. 2008; Macgregor and McCulloch 2006; Ames and Naaman 2007), there has been accumulating evidence suggesting that emergent structures do exist in social tagging systems (Golder and Huberman 2006; Cattuto et al. 2007). Most importantly, these emergent structures do seem to have the potential to help users to explore information by providing meaningful organization and indexing of information resources. Despite the diverse backgrounds and information goals of multiple users, co-occurring tags exhibited hierarchical structures that mirrored shared structures that were “anarchically negotiated” by the users.

To explore the hierarchical relations between tags, an intuitive way is to cluster the tags into hierarchical clusters. Wu et al. (2006b) used a factorized model, namely Latent Semantic Analysis, to group tags into non-hierarchical topics for better recommendation. Brooks and Montanez (2006) argued that performing Hierarchical Agglomerative Clustering (HAC) on tags can improve the collaborative tagging system. Later, HAC was also used for improving personalized recommendation (Shepitsen et al. 2008). Heymann and Garcia-Molina (2006) clustered tags into a tree by a similarity-based greedy tree-growing method. They evaluated the obtained trees empirically, and reported that the method is simple yet powerful for organizing tags with hierarchies. Based on Heymann and Garcia-Molina’s work, Schwarzkopf et al. (2007) proposed an approach to user modelling with the hierarchy of tags. Begelman et al. (2006) used top-down hierarchical clustering, instead of bottom-up HAC, to organize tags, and argued that tag hierarchies improve user experiences in their system. Most of the hierarchical clustering algorithms rely on the symmetric similarity among tags, while the discovered relations are hard to evaluate quantitatively, because one cannot distinguish similar from not-similar with a clear boundary.

People have also worked on bridging social tagging systems and ontologies in the semantic way (Fu et al. 2010). Mika (2005) proposed an extended scheme of social tagging that includes actors, concepts and objects, and used tag co-occurrences to construct ontology from social tags. Wu et al. (2006a) used hierarchical clustering to build ontology from tags that also use similar-to relationships. Later, ontology schemes that fit social tagging systems were proposed, such as (Van Damme et al. 2007) and (Echarte et al. 2007), which mainly focused on the relation among tags, objects and users, rather than among tags themselves. Passant (2007) mapped tags to domain ontologies manually to improve information retrieval in social media. To construct a tag ontology automatically, Angeletou et al. (2007) used ontologies built by domain experts to find relations between tags, but observed very low coverage. Specia and Motta (2007) proposed an integrated framework for organizing tags by existing ontologies, but no experiment was performed. Kim et al. (2008) summarized the state-of-the-art methods to model tags with semantic annotations.

The idea of a social-semantic web (Bénel et al. 2009) has emerged over recent years adopting the notion of collaborative knowledge management (Ma and Cahier 2011). Contrary to the semantic web (Berners-Lee, 2000), the social-semantic web is not interested in formal semantics but in semantics depending on human subject and semiotic substrate. The knowledge model Hypertopic (www.hypertopic.org; see Zhou et al. 2006) was developed in the frame of the social-semantic web. It proposes to describe an item through its topics, attributes and resources. For each item, pertinent topics are listed to mention which type of subject is involved. These topics are supposed to be associated with certain viewpoints considered alongside potential users. In other words, the implied viewpoints represent the information goals of various people. Attributes and their corresponding values also provide complementary and objective information that cannot be modified according to different users’ viewpoints. They are organized in pairs with the name and its values as a facet (Mas and Marleau 2009). As for resources, they characterize other vivid demonstrations of items, such as illustrating photos, URLs of websites or supporting document links.

On one hand, Hypertopic proposes a knowledge categorization method, especially emphasizing the concept of viewpoint, which is significant in collaborative knowledge classification (Ma and Cahier 2011). As illustrated in Figure 2, one item may be associated with more than one topic depending on subjects' viewpoints. Meanwhile the relations between two items can also be changed depending on them. For example, museum 1 and 2 are two items referring to the same topic category “educational place” from the viewpoint “function and value.” However they will be categorized into two different topic categories when talking about the viewpoint “style of appearance”: museum 1 in Baroque while museum 2 in Gothic. Possible sub-topics such as “Baroque in 15th century,” or “Baroque in 16th century” are supposed to continue specifying the period in which the style emerged. This type of categorization emphasizing the concept of viewpoint provides more flexible organization of items (knowledge) in a KOS. Categories of items are not solid but dynamic, relying on users’ opinions. It also allows collaborative participation of categorization from various users to search and retrieve an item under the viewpoints they prefer, or even create a totally new viewpoint without changing current knowledge structures.

Figure 2. Knowledge organization based on Hypertopic model: topics (viewpoints), attributes and resources (Ma and Cahier 2012)

On the other hand, Hypertopic provides a meaningful structure to manage tags that stem from topics (viewpoints) and attributes. Both topics and attribute values recommend textual tags to specify knowledge categorization. Topic can be regarded as the “special” attribute considering “topic” as the attribute name. These tags allow Hypertopic-based knowledge tagging in which users are able to tag knowledge through its topics and attributes. For each topic (viewpoint) or attribute value, more than one possible textual expression may exist, as synonyms or in different languages. We consider them as a unit for a topic tag or an attribute tag, by which the tag structure is clearer and easier to manage. All the topic tags are cataloged under the tree structure considering the common viewpoint as the “parent” node. Each topic tag may be followed by sub-topic tags. The user is allowed to add new categories of topic tags by creating a parent node named “my viewpoint.” This convenience encourages collaborative knowledge management (Ma and Cahier 2011) and collective tagging. However, if textual tags generated by Hypertopic are presented together without implying topic category or attribute name, the structure will be less explicit especially when users are not familiar with tag meaning. This problem is increasingly evident when tag numbers grow. In addition, sometimes one topic tag may be related with several topic categories. For example, renewable energy can be sorted in topic “energy” and topic “economy.” In this case one textual form expression cannot reflect all possible relevant categories. A more explicit representation is required.

2.2 Semantically structured tag clouds

Although semantic relations do exist within tags, tag arrangement based on semantic clustering was not widely accepted at the beginning. Previous studies considered different types of arrangement to improve interaction with tag clouds. Halvey and Keane (2007) investigated the effects of different tag cloud and list arrangements, comparing the performance for searching specific items. The setup included random and alphabetically ordered lists and tag clouds. Semantic ordering was not part of tested setups. They found that respondents were able to more easily and quickly find tags in alphabetical orders (both in lists and clouds). Rivadeneira et al. (2007) compared the recognition of single tags in alphabetical, sequential-frequency (most important tag at the left-upper side), spatially packed (arranged with Feinberg’s algorithm) and list-frequency layouts (most important tag at the beginning of a vertical list of tags). Results did not show any significant disparity in recognition of tags. However, respondents could better recognize the overall categories presented when confronted with the vertical list of tags ordered by frequency. Hearst and Rosner (2008) discuss the organization of tag clouds. One important disadvantage of tag cloud layouts they mention is that items with similar meaning may lie far apart, and so meaningful associations may be missed.

The following studies started to focus on semantic relations within tags and tried to represent them in textual tag clouds. Hassan-Montero and Herrero-Solana (2006) claimed that alphabetical arrangements neither facilitate visual scanning nor allow semantic relations between tags to be inferred. They discovered that users have difficulty comparing tags with small size and deriving semantic relations. Items placed near to each other might be wrongly interpreted as related. They proposed an algorithm using tag similarity to group and arrange tag clouds. To that end, they developed a k-means algorithm to group semantically similar tags into different clusters and calculated tag similarity by means of relative co-occurrence between tags. Similar work can be found in (Provost 2008). Likewise, Fujimura et al. (2008) use the cosine similarity of tag feature vectors (terms and their weight generated from a set of tagged documents) to measure tag similarity. Based on this similarity they calculate a tag layout, where distance between tags represents semantic relatedness. Another very similar approach is proposed by (Berlocher 2008).

An empirical evaluation of semantically structured tag clouds (Schrammel et al. 2009) has demonstrated that topical layouts (semantically-structured tag clouds) can improve search performance for specific search tasks compared to random arrangements, but they still perform worse than alphabetic layouts. The semantic arrangement must be good enough; otherwise users will not be able to distinguish it from random layouts. Semantic layouts therefore should only be used when the quality of the arrangement can be assured. Test participants also commented that it was difficult to identify clusters and relations beyond single lines.

2.3 Modelling well-structured iconic tags using Visual Distinctive Language

VDL-based iconic tags are well-structured icons working for better representation of tag structures and single tags. Because the semiotic representation of icons has been largely studied already, we are more interested in visualization of tag structures than in choosing symbols for each iconic tag. However the conclusion on imaged information (Paivio 1971) is accepted as well. To visualize the tag structure in KOS’s, one has first to confirm the way of organizing tags, and then iconize them as well as their structure. Tags in KOS’s can be regarded as the keywords to specify possible knowledge categorization, which means structuring tags is, in fact, recommending a method to organize information and knowledge.

The idea is to benefit from the categorization of textual tags made by Hypertopic (from topics and attribute values) and iconize it for better visualization of separate tags and their structures (see Figure 3) (Ma and Cahier

2012). Here we think of the simplest case, one-tag-one-icon tags, where one recommended tag corresponds to each topic and attribute value. What’s more, one iconic symbol represents the current textual tag although various symbols can explain the same tag meaning. However, this approach will be extended equally for the many-to-many case where no constraint of textual tags and icons is applied. For example, the tag “nature” could also coexist with “mode of life,” “environment,” and other synonyms (or close expressions) in different languages. This tag “nature” will be represented by iconic symbols of trees, flowers and other possible signs. Whether in one-to-one or many-to-many cases, the tag structure is always consistent with the knowledge organization according to Hypertopic. As long as tags in a given KOS obey this structure, we can iconize them in the same way.

Figure 3. Iconized topics and attributes, two elements of Hypertopic, to form well-structured iconic tags (Ma and Cahier 2012)

The symbolic characters of icons convey explicitly the represented objects, while graphical characters help visualize relations within them. In particular, a special group of icons called “pre-icons” function to signify the categories of tags in a KOS: the same viewpoint, the same branch of topic or the same attribute name. Pre-icons act as the common base of iconic tags. Tags in each category will be specified by combining symbols with this corresponding iconic base. Nevertheless a pre-icon for attribute name is useless in some cases. For example, when iconizing the attribute values of “language,” it is clear enough to represent them independently with national flags.

All the pre-icons in this model can be explained as a “graphical organizer” named Visual Distinctive Language (VDL) (see Figure 4), which aims to visually characterize the categorization made by Hypertopic protocol. Here the “language” is a wide notion (instead of a spoken word) that allows communicating with each other in a relatively effortless way (Nakamura and Zeng-Treitler, 2012). We call it Visual Distinctive Language because it provides visual consensus (pre-icons) on information structure (distinguishing one category from another). Users who accept it could communicate under this visual convention, for instance for knowledge sharing, one of the communication means in KOS.

Among six visual variables illustrated by Bertin’s graphical semiotic theory (Bertin 1983), three are less suited to the purpose of tag structure representation: size, orientation and value. It is difficult to distinguish two iconic tags in different sizes, different orientations or different values depending on the conditions of a computer screen. For aesthetic reasons, icons are preferably designed in unified size for software applications. Limited choices of orientation and value also make it less possible to design large scale tag presentations.

By contrast, three visual variables (shape, colour and texture) are chosen to create the pre-icons of VDL. For topic tags, all tags under a common viewpoint are first designed with a uniform shape (pre-icon), and then those sorted into different topic categories are given another visual variable, colour, to form updated pre-icons. Since topic tags are catalogued in a tree structure, new visual variables would still have to be added to create pre-icons for lower branches. However, on one hand the number of visual variables is limited; on the other hand excessive visual variables reduce the readability of iconic tag structures. To provide clearer and simpler VDL, iconic topic tags from the second level will always keep the same pre-icon without being distinguished by a new visual variable.

The graphical rule is similarly applied to attribute tags. Attribute name is directly iconized into coloured shapes (pre-icons) and then attribute value is detailed by joining a symbol onto it (except special cases as mentioned in the preceding paragraph, such as “language”). The chosen colours in Figure 6 show an example of the idea of visualizing tag categorization by graphical variables, yet without a strict colour choice test. However, colour and shape are supposed to interact in VDL; neither is the dominant variable. Considering colour-blind cases, the version in black and white is created as well. Variable colour will be replaced by texture (see Figure 6) while preserving all the other rules from the coloured version. The final version of both iconic topic tags and iconic attribute tags bears no visual difference unless specifically marked for their originality. However, pre-icons allow indicating those from the same category of viewpoint, topic or attribute name.

To evaluate how VDL-based iconic tags improve tag presentation in KOS’s, we conducted the first “tagging on paper” experiment in 2011 (Ma and Cahier 2012). Considering that tagging effectiveness is a complex subject associated with numerous user-related cognitive factors, this experiment focuses on whether VDL-based iconic tags help finding more usable tags to annotate knowledge in KOS by visual representation of tags and tag structures. Figure 5 shows three tested tags: textual tags, iconic tags without explicit structure and VDL-based iconic tags.

Across several tests in the experiment, early results demonstrated that VDL-based iconic tags have more advantages compared with iconic tags without structure and textual tags. Participants reported that they easily located, and later located again, a tag from a tag presentation essentially through graphical tag structure and partly through iconic symbols. The knowledge resources tagged by VDL-based iconic tags were also supposed to be strongly connected in a KOS. The former test allowed us to confirm the first hypothesis on VDL-based iconic tags: visual codes of VDL improve knowledge tagging in a KOS. However there was no discussion of tag arrangement methods (all the tags in the experiment were arranged randomly).

Consequently, in this paper, we propose to produce a more in-depth study of VDL-based iconic tags. More precisely, we propose to:

– verify whether a conclusion on semantically structured textual tag clouds can be applied to VDL-based iconic tag presentations; and,
– develop a supplementary experiment to get a more complete view on how to construct better VDL-based iconic tag presentations, which will be meaningful for creating iconic tag clouds in KOS’s.

3.0 Semantic arrangement method for VDL-based iconic tags

Several arrangement methods are available for textual tag clouds, such as alphabetic arrangement, random arrangement, folksonomy-based arrangement, or semantic (linguistic-based) arrangement. For iconic tags, however, only random arrangement and semantic arrangement are considered according to tag format. Since former studies demonstrated that semantically-clustered textual tag clouds yielded better tag presentation and interface, we are considering similarly semantically clustering VDL-based iconic tags. First we define what semantic arrangement refers to for VDL-based iconic tags.

Tag presentation in KOS’s is dynamic wherein users choose recommended tags for their own tagging and searching goals and in turn update useful tags for later use. Thus the tag arrangement should be convenient both for locating an existing tag and for adding new tags. VDL-based iconic tags improve the limits of textual tags in knowledge tagging. The symbolic characters of icons convey explicitly the represented objects and the graphical characters enhance connection among tags and documents tagged. In particular, a special group of icons called “pre-icons” function to signify the categories of tags: the same viewpoint, the same branch of topic or the same attribute name (tag structure proposed by Hypertopic). Pre-icons in VDL act as the common base of iconic tags. The tags in each category will be specified by combining symbols with this corresponding iconic base. Here we still think of the simplest case, the one-to-one tag-icon case as mentioned in section 2.3. However, this approach will also be extended for the knowledge tags to which no vocabulary (symbol) constraint of textual tags (icons) is applied. In that many-to-many case, more than one textual tag (icon) will be proposed for knowledge in each category.

The semantic relations within VDL-based tags are integrated from both graphical relations and semiotic relations of icons taking advantage of Visual Distinctive Language. Thus the semantic arrangement means iconic tags with the same pre-icons will be clustered. To arrange the tags in one category (one viewpoint, one branch of topic or one attribute name) requires only putting the tags with the same graphical characters together (same colour, same shape). Particular tags from different topic branches of

the same viewpoint are displayed closer together (see Figure 6). It is hypothesized that this type of tag presentation will present clearer boundaries of tag clusters than those randomly arranged. Users might find and add tags easily even if they do not completely understand icon representations. The semiotic interpretation of tag meaning will be of less concern since not only icon symbols but also pre-icons confirm the categorization of tags.

Figure 6. Semantic arrangement of VDL-based iconic tags (taking an iconic tag cloud for example)

Because semantically structured textual tags have been studied before, we investigate only semantically structured VDL-based iconic tags and semantically structured iconic tags without explicit structure, comparing them with those randomly arranged. Here iconic tags without explicit structure serve as a control group to see whether tag format or tag arrangement is more important for tag presentation. It is assumed that semantically structured VDL-based iconic tags will facilitate locating and relocating tags for knowledge tagging.

4.0 Experiment

A computerized experiment was conducted to investigate the tag arrangement for VDL-based iconic tags. There were four types of iconic tag presentations in this experiment (four groups A, B, C, D shown in Figure 7). Comparison took place in three sessions: group A and group B (to see whether semantic arrangement improves tag presentation for no visual structure iconic tags compared to random tags); group C and group D (to see whether semantic arrangement improves tag presentation for VDL-based iconic tags compared to random tags); group B and group C (to see whether semantic arrangement or VDL-based icons are more critical to improving tag presentation). In particular, the comparison between group A and group C has been made in a former study (random arrangement of two types of iconic tag clouds). Each group of participants was asked to tag 24 given documents (like a simulated KOS) by using the tags from tag presentations. We assume that users in the group with VDL-based iconic tags and semantically arranged tags will find more appropriate tags (greater accuracy) in less time (speedier) compared to other patterns. In addition, we also traced participants’ behaviors: the time spent to tag an item and its changing tendency, the frequency of asking for the instruction and the proportion between tags considered and those finally chosen.

4.1 Participants

Forty-eight French speaking students, 26 male and 22 female, with computer science as their master major in the University of Technology of Troyes participated in this experiment. They were divided into four groups corresponding to four types of tested tag systems: group A for iconic tags without explicit structure and randomly arranged (12 persons); group B for iconic tags without explicit structure and semantically arranged (12 persons); group C for VDL-based iconic tags and randomly arranged (12 persons); group D for VDL-based iconic tags and semantically arranged (12 persons).

4.2 Material

The material for this online experiment included four types of tag presentations (see Figure 7), and 24 knowledge articles (see Figure 8). Tags in each presentation are knowledge tags referring to seven topical categories (from two viewpoints) and three attribute names on the topic of sustainable development. Tag presentation type one (type three) differs from type two (type four) on the tag arrangement while type one (type two) and type three (type four) differ on the tag format. We chose the same icon symbols for all four presentations to avoid the impact on semiotic interpretation (icon choosing). What we wanted to test was the influence produced by visual structure and arrangement of iconic tags. The twenty-four web articles were the same as those used in the first experiment. They were short texts with a large range of interest in the field of sustainability and each is represented by title, image and description.

4.3 Procedure

This experiment was composed of three parts: pre-questionnaire, tagging test and post-questionnaire. There

Figure 7. Four groups and their corresponding tag presentations in the online tagging test. Iconic tags without explicit structure: Group A, Type 1 (randomly arranged) and Group B, Type 2 (semantically arranged, by categories). VDL-based iconic tags: Group C, Type 3 (randomly arranged) and Group D, Type 4 (semantically arranged, by categories).
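The four presentations in Figure 7 form a two-by-two design (tag format by tag arrangement), and the procedure states that the platform issued group codes in login order (A1, B1, C1, D1, A2, ...). The following is a minimal sketch of that bookkeeping, not the platform's actual code: the names `GROUPS` and `group_code` and the 0-based login index are illustrative assumptions, since the paper does not describe how the assignment was implemented.

```python
from itertools import product

# The two factors of the design (labels taken from the group descriptions).
TAG_FORMATS = ("iconic tags without explicit structure", "VDL-based iconic tags")
ARRANGEMENTS = ("randomly arranged", "semantically arranged")

# Groups A-D enumerate the factor combinations in the order used in the text:
# A = no structure/random, B = no structure/semantic,
# C = VDL/random, D = VDL/semantic.
GROUPS = {
    letter: {"format": fmt, "arrangement": arr}
    for letter, (fmt, arr) in zip("ABCD", product(TAG_FORMATS, ARRANGEMENTS))
}


def group_code(login_index: int) -> str:
    """Return the group code for the participant at 0-based login position.

    Hypothetical helper: codes cycle through A, B, C, D and the number
    increases by one after each full cycle.
    """
    letters = "ABCD"
    letter = letters[login_index % len(letters)]
    number = login_index // len(letters) + 1
    return f"{letter}{number}"


# Forty-eight participants, as reported in the Participants section.
codes = [group_code(i) for i in range(48)]
print(codes[:8])  # ['A1', 'B1', 'C1', 'D1', 'A2', 'B2', 'C2', 'D2']
```

Cycling through the letters in login order yields four groups of equal size (12 each for 48 participants), which matches the group sizes reported above.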

was no unified time constraint over the whole process, but participants were not permitted to suspend it once started. All of the participants logged into the system with their e-mail addresses and assigned passwords. The system automatically produced a group code for each of them in order (A1, B1, C1, D1, A2, B2, C2, D2, ...). The letter of this code corresponded to the type of tag presentation they used. In order to understand the level of prior knowledge in the field of sustainable development, each participant first completed a pre-questionnaire of 10 questions: five concerned academic knowledge in the field while the others were about personal understanding and awareness of sustainable development.

Once participants finished the pre-questionnaire, they started tagging texts using the given tags. A “Help” button was displayed in the upper right corner to give instructions if necessary. A double left click on an icon submitted it to a tag-selection zone (choosing an iconic tag), while a double left click on the icon in the selection zone returned it to its former location in the tag presentation; a simple right click on an icon made its corresponding text visible. Participants could confirm tagging choices for an article and continue on to the next one by clicking the button “next item.” Once a tagged text was confirmed, it could not be modified. Similarly, an untagged item could not be shifted up to the next one. When participants clicked the “finish the tagging” button on the final article, they arrived at the post-questionnaire, which tested tag structure identification using the four types of iconic tag presentations. They used the same click operations as before to submit and cancel an icon. However, they could no longer get help from the textual meaning of iconic tags. The post-questionnaire was used to explore which type of iconic tag presentation better explicates semantic tag clusters.

Figure 8. Tagging test platform and articles to tag (example of item 1 for group B)

Additionally, several new variables were also tested in this experiment. First, tagging duration was one of these variables. We were interested not only in average tagging duration for one item, but also in any changing tendency from the first article to the last one. Second, the proportion between chosen tags (tags selected) and final tags (tags confirmed for one item) was also meaningful. Here selected tags were those placed in the tag selection zone, while confirmed tags were those finally appearing in the tag selection zone when clicking “next item.” This proportion could also be seen as the probability of confidence. The higher the average proportion was, the more confident participants were in their choice of tags. This percentage also implied the understanding level and learning result of iconic tags and their structure. Finally, asking for instruction revealed whether users had difficulty with operations in the test. This statistical record was considered as part of the prior knowledge.

4.4 Results

4.4.1 Prior knowledge test

Each question in the pre-questionnaire had one correct answer from three options (a, b or c). A participant who managed to find that answer won one point while a participant who could not find it did not earn any points. After the test, there was a list of points earned (10 in total) by each person. Participants whose point total was above or within the range from 6 to 2 were not considered in the final analysis. That is to say, they were excluded as deviating from the average level of prior domain knowledge, which could influence the outcome of the experiment. Individual difference was also implied by the frequency of clicks on the “Help” button. Participants who asked more frequently for “Help” could show a worse understanding of the test. Levene’s homoscedasticity test (see Notes) revealed no significant heterogeneity between the variances on the score in the pre-questionnaire (P=0.572) and instruction reading (P=0.812). The mean scores on the pre-questionnaire for the four groups were 8.5 for group A, 8 for group B, 8.4 for group C and 9 for group D. An ANOVA conducted on the subjects’ performances in the pre-questionnaire revealed no significant difference (F<1). As far as the instruction reading was concerned, the mean times were 2 for group A, 1.7 for group B, 1.7 for group C and 2.2 for group D. The performances of the subjects also revealed no significant difference (F<1). The two results suggested that there was no significant individual difference on the prior knowledge test which could influence the later tagging test.

4.4.2 Tagging process

Here we must first explain the method of evaluating the tagging process that was applied in the former experiment. Two factors were considered in the evaluation: tagging quality (more appropriate tags found) and tagging speed (less time spent to tag). The method for analyzing the quality of tagging remained the same as in the previous experiment (Ma and Cahier 2012), using an expert matrix and the Rx criterion (see note 2), which will be explained below. Eighty-seven tags each had a unique tag number from 1 to 87. Five experts on sustainability were invited to tag the texts with these 87 tags. For each text, they were required to rank all of the chosen tags with a number from 0 to 5 to represent the degree of correlation. Five indicated that the tag was certainly relevant to the item while 0 meant not relevant. The average of the five experts comprised a matrix, called the expert matrix, showing the correlations between tags and items (see Table 1).

Similarly, the tagging result of all the participants filled 48 participant matrixes. The unique difference from the expert matrix was that the participant matrixes were filled

merely with either 1 or 0: 1 refers to the tags used while 0 refers to the tags not marked in boxes.

Table 1. Expert matrix (on the left) and participant matrix (on the right) to evaluate tagging quality (Ma and Cahier 2012)

To analyze the tagging result of participant x, the formula below was applied.

$$R_x = \sum_{i=1}^{87} \sum_{j=1}^{24} TPx_{ij} \cdot TE_{ij} \tag{1}$$

$TE_{ij}$: number in row i, column j of the expert matrix
$TPx_{ij}$: number in row i, column j of the participant matrix (participant x)

Rx is a variable implying the tagging quality, which refers to the degree to which appropriate tags have been chosen. It reveals high quality of tag cloud interactions such as locating and relocating useful tags considered relevant by experts. All the Rx values in one group were considered as a one-dimensional table to perform an ANOVA among groups.

Tagging speed was originally reflected by the duration of tagging, from the selection of the first tag for the first text to the ending of the final tag for the final text. The final statistical results compared were Rx/tagging duration of each participant, representing tagging quality per unit time. Levene’s homoscedasticity test indicated significant heterogeneity between the variances on the tagging process: Rx/tagging time, P<0.05. Consequently, these performances were analyzed using a nonparametric Kruskal-Wallis test. This latter test implied a significant effect of the semantically structured VDL-based icons on subjects’ tagging performances, N=40, P<0.05. A more thorough analysis using a Mann-Whitney test indicated a significant difference between group D (M=342.1) and group C (M=238.2), Mann-Whitney U=32, P=0.04. Similarly, the performances of group D were significantly better than those of group B (M=215.2), Mann-Whitney U=5, P<0.05. As demonstrated before, group A (M=154.4) was significantly poorer than group C, Mann-Whitney U=15, P<0.05. In contrast, the performances obtained for groups A and B did not differ significantly for the tagging process, Mann-Whitney U=32, P=0.173.

4.4.3 Time changing tendency

Apart from average tagging time, dynamic change tendency is also useful for analysing user behaviour. It can be seen from Figure 9 that users in the four groups revealed close changing tendencies. Tagging duration decreased from item 1 to item 24 in all the groups without significant difference emerging on the rate of change.

4.4.4 Post-questionnaire

The critical prediction of structural identification of tags was to compare the categories proposed by participants with predefined VDL categories (seven categories of topics and three categories of attribute names, the same as before). Participants who were in complete correspondence with one of these categories earned 2 points. Those whose category was partially correspondent were scored 1

point. No points were awarded to participants who mixed more than one proposed category. From the name of the suggested category we also knew whether they only identified the visual structure of the tags by the graphical regularity of VDL or whether they understood the meaning of the tag and confirmed it by the graphical regularity. It was also assumed that the group working with VDL-based iconic tags presented by category could offer more categories corresponding to the categorization of tags, but there might be the risk that the two provisions of VDL-based iconic tags demonstrated the same capacity.

Figure 9. Average tagging duration for one article (four groups)

After Levene’s test indicated that the variances were not homogeneous (P<0.05), a Kruskal-Wallis test revealed a significant difference among the four groups, H=40, P<0.05. More precisely, group D (M=12.4) performed significantly better than group B (M=1.6), Mann-Whitney U=8, P=0.001, and group C (M=3.2), Mann-Whitney U=12, P=0.004. As was observed in the former experiment, Rx of group C was significantly higher than that of group A (M=0.6), Mann-Whitney U=26.5, P=0.037. In contrast, group B did not obviously improve compared to group A, Mann-Whitney U=44, P=0.465.

4.4.5 Selection proportion

Levene’s test implied significant differences between variances in the four groups (P=0.025). The subsequent Kruskal-Wallis test revealed no significant difference on selection proportion among the four groups (P=0.149).

4.5 Discussion

The results are partially in accordance with our predictions. Semantically structured VDL-based iconic tag presentation showed better effectiveness in the tagging process (considering tagging quality and tagging speed) than the other three types.

4.5.1 Group C vs Group D (to see whether semantic arrangement improves interaction of tag clouds for VDL-based iconic tags compared to random arrangement).

As demonstrated with textual tag clouds, semantically structured tag clusters led to a quicker and more accurate localizing of specific tags. Similarly, semantically structured VDL-based iconic tags also revealed better guidance in tag selection. Comparing tag presentation types 3 and 4, semantically structured tags showed the layouts of tag clusters more clearly using visual signals, such as different colours or different shapes. Instead of spending time to identify VDL as in group C, testers in group D got rapid graphical information about tag structure. Users’ comments provided some evidence. The participants in group D said that as soon as they saw the tag presentations, they found clear icon categories represented in several graphical bases in common. In contrast, those in group C, although identifying the visual structure of tags, took much more time than the semantically structured group to catch this implicit information. The significantly better performance on structure identification in the post-questionnaire also validated this.

The advantage of semantically-arranged, VDL-based icons was also demonstrated in tagging topic-related articles. Users are likely to tag them with the same tags or at least with the tags in one category. For example, if they tagged a text on environment with a green tag, this tag or other green tags were supposed to be used again for another environmentally-concerned text. In the case of randomly arranged VDL-based iconic tag presentations, users knew that there were still other choices of green tags in the display. However, finding these green tags again took time, and users risked omitting some that had not been used before. By contrast, semantically arranged VDL-based iconic tags might avert this problem, because all of the green tags were always listed together. Once one tag in a category was found, all other tags in that category appeared one by one. This not only saves time localizing a tag, but also increases the tagging quality because all the alternatives are listed together, with the same structure information implied by visual code, influencing users’ selection accuracy and confidence.

Similar to the explication in the previous experiment, users became accustomed to selecting the tags from each visual category. Finding and choosing a tag from 88 options turns out to be a choice from seven small groups. In semantically arranged VDL-based iconic tag presentations, this method was better applied. Most of the testers in group D stated that they started the tagging process by consulting all the tag categories in every visual base, and then they preferred to go to each visual category to select the useful tags. In group C, although they said they also tried to choose tags from each visual category, it was not easy to find all icons in one category since they were scattered in the presentation. They always forgot which tag in this category had been browsed. When they decided to look back for a second time at a certain tag, they could not easily pick it out.

4.5.2 Group A vs Group B (to see whether semantic arrangement improves interaction of tag clouds for no visual structure iconic tags compared to random one).

However, semantically-structured, iconic tags without explicit structure did not reveal significantly better performance on the tagging process compared with randomly arranged groups, nor did they in the post-questionnaire. Testers in groups A and B earned almost the same score in the identification of tag structure. The semantically structured arrangement did not bring a supplementary effect. As declared for semantically structured tag clouds (Schrammel et al. 2009), the semantic arrangement must be good enough, otherwise users will not be able to distinguish it from random layouts. Therefore it should only be used when the quality of the arrangement can be assured. Iconic tags without explicit structure did offer graphical interpretation of tags, yet they did not provide visual information on tag structure, that is, on the semantic relations within them. Consequently, users used semantically arranged icons exactly as they did randomly arranged icons, which was previously shown to be poorer than randomly arranged VDL-based icons (Ma and Cahier 2012) in the tagging process.

4.5.3 Group B vs Group C (to see whether semantic arrangement or VDL-based icons is more critical to improving interaction of tag clouds).

Seen from the assessment results, semantically-arranged tags improved the tagging process on the condition that the semantic structure was solid and clear enough for all users, as was demonstrated in groups C and D. If not, they will act just like randomly arranged tags, as in groups A and B. How to define a solid and clear semantic structure, or semantic layout, among a group of tags is a crucial topic to discuss. On the one hand, if tags are in text or in icons without explicit structure, they have to be in such high accordance with daily comprehension that users easily recognize the tag cluster, using less ambiguous words. On the other hand, if tags can be sorted into several layers, they have to add complementary information for specifying their structure, such as VDL and pre-icons. Meanwhile, this information saves the users’ time identifying semantic layers because of a more precise and intuitive tag structure. What’s more, testers in group C did better than those in group B, which also leads to an interesting argument. It is assumed that in tag presentations tag format (representation of a single tag and its structure) is more essential than tag arrangement. Comparing group B with group C, one changes tag arrangement to semantically structured based on group A, while the other alters tag format by adding pre-icons to the original icons in group A. However, the statistical results implied significant improvement between A and C (Ma and Cahier 2012) but not between A and B. In the absence of visual structure tags, even though tags are semantically arranged, they will not ameliorate the tagging process. As a result, reforming tag presentation requires first making better representations of tag and tag structure, and then implementing the arrangement. All of these observations on tag format and tag arrangement are meaningful for creating visual tag clouds in a KOS.

4.5.4 Other events

Average tagging time of groups C and D was longer than that of groups A and B. It may be assumed that testers with VDL-based icons could find more appropriate iconic tags and spent more time selecting them, considering pure tagging duration. In particular, testers using iconic tags without explicit structure merely selected limited icons because it was difficult to find more interesting tags among a huge number. Even though group D took a little longer tagging time, it still showed a significantly better tagging quality per unit time, which signified that the pure tagging quality of group D was much higher than that of the other groups, including group C. Tagging duration decreased from article 1 to article 24 in all four groups, which signified progressive user learning on tags and growing skill in the tagging activity. It is assumed that the participants could gradually learn the sense of tags and their structure, and this could reduce tagging time. Meanwhile, the calculation of change tendency enabled us to argue that no matter which type of iconic tag was used, users showed similar changing regularity.

In particular, there was no significant difference on the proportion between final tags and selected tags (as defined above). This proportion did not make any significant difference among the four groups, which could be partly illustrated by the argument in the former experiment (Ma and Cahier 2012) that both types of iconic tags had equivalent capacity in tag interpreting and memorizing. From the present experiment, we can enhance the argument by another explanation: the two types of arrangement, random and categorical, did not influence tag interpreting and memorizing. Users are supposed to have a close degree of confidence due to comprehension of tag representations. In other words, neither VDL nor semantically arranged structure will improve the comprehension and memorizing of tags except for the symbols of iconic tags.

5.0 Conclusion

The research findings in this paper have validated that semantic arrangement of VDL-based icon tags provides better tag presentations for knowledge tagging. This advantage was mainly produced by visual representation of semantic tag structures by pre-icons. The observation is relatively consistent with that of semantically clustered textual tags. It is seen once again that the semantic arrangement must be good enough, otherwise users will not be able to distinguish it from random layouts. What’s more, results demonstrated that a tag format such as VDL-based is more critical compared to tag arrangement for knowledge tag presentations. This provides a possible interface for a KOS, such as tag clouds, to make a visual bridge between tags and knowledge. Well-structured tag clouds need to be built up by VDL-based iconic tags and arranged by semantic clusters based on empirical observations. Meanwhile, the explicit structure of tags will also help users in better understanding and identifying the organization of knowledge.

Notes

1. Levene’s homoscedasticity test is an inferential statistic used to assess the equality of variances for a variable calculated for two or more groups.
2. Variable predefined to analyze tagging effectiveness among four groups. Details can be seen in the previous paper (Ma and Cahier 2012).

References

Ames, Morgan and Naaman, Mor. 2007. Why we tag: The motivations for annotation in mobile and online media. In CHI '07: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. New York: ACM Press, pp. 971-80.
Angeletou, Sofia, Sabou, Marta, Specia, Lucia and Motta, Enrico. 2007. Bridging the gap between folksonomies and the semantic web: An experience report. In The 4th European Semantic Web Conference 2007 (ESWC 2007), 3-7 Jun 2007, Innsbruck, Austria. Available http://oro.open.ac.uk/23608/1/semnet2007.pdf
Bateman, Ivan, Lee, Kyung-il and Kim, Kono. 2008. TopicRank: Bringing insight to users. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. New York: ACM Press, pp. 703-4.
Begelman, Grigory, Keller, Philipp and Smadja, Frank. 2006. Automated tag clustering: Improving search and exploration in the tag space. Paper presented at Collaborative Web Tagging Workshop of 15th International World Wide Web, 23-26 May, Edinburgh, Scotland. Available http://www.ra.ethz.ch/cdstore/www2006/www.rawsugar.com/www2006/20.pdf
Bénel, Aurélien, Zhou, Chao and Cahier, Jean-Pierre. 2009. Beyond web 2.0 ... and beyond the semantic web. In Design of Cooperative Systems, Chapter 1, Springer.
Berlocher, Ivan, Lee, Kyung-Il, and Kim, Kono. 2008. TopicRank: bringing insight to users. In Proceedings of SIGIR 2008, ACM Press, pp. 703-04.
Berners-Lee, Tim. 2000. Semantic web on XML. In Proceedings of XML 2000, Washington, DC.
Bertin, Jacques. 1983. Semiology of graphics: diagrams, networks, maps. Trans. W.J. Berg. University of Wisconsin Press.
Bielenberg, Kai and Zacher, Marc. 2005. Groups in social software: Utilizing tagging to integrate individual contexts for social navigation. Master's thesis. Bremen: University of Bremen.
Bowker, Geoffrey C. and Star, Susan L. 1999. Sorting things out: classification and its consequences. Cambridge, MA: MIT Press.
Brooks, Christopher H. and Montanez, Nancy. 2006. Improved annotation of the blogosphere via autotagging and hierarchical clustering. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23-26, 2006). WWW '06. New York: ACM Press, pp. 625-32.
Cattuto, Ciro, Loreto, Vittorio and Pietronero, Luciano. 2007. Semiotic dynamics and collaborative tagging. Proceedings of the National Academy of Sciences of the United States of America 104: 1461-4.
Chen, Ya-Xi, Santamaría, Rodrigo, Butz, Andreas and Therón, Roberto. 2009. Tagclusters: Semantic aggregation of collaborative tags beyond tagclouds. Smart graphics lecture notes in computer science 5531: 56-67.
Downey, Douglas, Dumais, Susan, Liebling, Dan and Horvitz, Eric. 2008. Understanding the relationship between searchers’ queries and information goals. In Shanahan, James G., Amer-Yahia, Sihem, Manolescu, Ioana, Zhang, Yi, Evans, David A., Kolcz, Aleksander, Choi, Key-Sun and Chowdhury, Abdur, eds., Proceedings of the 17th ACM Conference on Information and Knowledge Management - CIKM 2008. New York: ACM Press, pp. 449-58.
Echarte, Francisco, Astrain, José Javier, Córdoba, Alberto and Villadangos, Jesús. 2007. Ontology of folksonomy: A new modeling method. In Handschuh, S., Collier, N., Groza, T., Dieng R., Sintek M., and de Waard A., eds., Proceedings of Semantic Authoring, Annotation and Knowledge Markup Workshop (SAAKM2007), 28-31 October, Whistler, British Columbia, Canada: CEUR. Available http://ceur-ws.org/Vol-289/p08.pdf.
Fu, Wai-Tat, Kannampallil, Thomas, Kang, Ruogu and He, Jibo. 2010. Semantic imitation in social tagging. Journal of ACM transaction on computer-human interaction 17: 1-37.
Fujimura, Ko, Fujimura, Shigero, Matsubayashi, Tatsushi, Yamada, Takeshi and Okuda, Hidenori. 2008. Topigraphy: Visualization for large scale tag clouds. In WWW '08: Proceedings of the 17th international conference on World Wide Web. New York: ACM Press, pp. 1087-8.
Golder, Scott A. and Huberman, Bernardo A. 2006. Usage patterns of collaborative tagging systems. Journal of information science 32: 198-208.

Halvey, Martin J. and Keane, Mark T. 2007. An assessment of tag representation techniques. In Williamson, Carrie and Zurko, Mary Ellen, eds., WWW 2007: proceedings of the 16th International Conference on World Wide Web, Banff, (Alberta, Canada), May 8-12, 2007. New York: ACM Press, pp. 1313-4.
Hassan-Montero, Yusef and Herrero-Solana, Víctor. 2006. Improving tagclouds as visual information retrieval interfaces. In I International Conference on Multidisciplinary Information Sciences and Technologies, InSciT2006. Mérida, Spain. October 25-28, 2006. Available http://nosolousabilidad.com/hassan/improving_tagclouds.pdf
Hearst, Marti A. and Rosner, Daniela. 2008. Tag clouds: Data analysis tool or social signaller. In HICSS '08 Proceedings of the International Conference in Waikoloa, Big Island, Hawaii, USA, 2008. Washington DC: IEEE Computer Society, pp. 160-9.
Heymann, Paul and Garcia-Molina, Hector. 2006. Collaborative creation of communal hierarchical taxonomies in social tagging systems. InfoLab Technical Report 10. Available http://ilpubs.stanford.edu:8090/775/1/2006-10.pdf.
Hodge, Gail. 2000. Systems of knowledge organization for digital libraries: Beyond traditional authority files. Washington DC: The Council on Library and Information Resources. Available http://www.clir.org/pubs/reports/pub91/pub91.pdf
Kerr, Bernard. 2006. TagOrbitals: A tag index visualization. In SIGGRAPH 2006 sketches. New York: ACM, p. 158.
Kim, Hak Lae, Scerri, Simon, Breslin, John G., Decker, Stefan and Kim, Hong Gee. 2008. The state of the art in tag ontologies: A semantic model for tagging and folksonomies. In Greenberg, Jane and Klas, Wolfgang, eds., Metadata for semantic and social applications proceedings of the international conference on Dublin Core and Metadata Applications, Berlin, 22-26 September 2008, DC 2008: Berlin, Germany. Göttingen: Universitätsverlag Göttingen, pp. 128-37.
Kipp, Margaret E.I. and Campbell, D. Grant. 2010. Searching with tags: Do tags help users find things? Knowledge organization 37: 239-55.
Kipp, Margaret E.I. and Joo, Soohyung. 2010. Application of structural equation modelling in exploring tag patterns: A pilot study. In Proceedings Annual Meeting of the American Society for Information Science and Technology, Pittsburgh, Pennsylvania, USA. Available http://www.asis.org/asist2010/proceedings/proceedings/ASIST_AM10/submissions/325_Final_Submission.pdf
Knautz, Kathrin, Soubusta, Simone and Stock, Wolfgang G. 2010. Tag clusters as information retrieval interfaces. In Sprague, Ralph H., ed., Proceedings of the 43rd Annual Hawaii International Conference on System Sciences: 5-8 January, 2010, Koloa, Kauai, Hawaii. Los Alamitos, Calif.: IEEE Computer Society Press, pp. 1-10.
Ma, Xiaoyue and Cahier, Jean-Pierre. 2011. Iconic categorization with knowledge-based icon systems can improve collaborative KM. In Proceedings of CTS2011 in Philadelphia, USA, 2011, IEEE Conference Publications, pp. 216-23.
Ma, Xiaoyue and Cahier, Jean-Pierre. 2012. Visual distinctive language: Using a hypertopic-based iconic tagging system for knowledge sharing. In Drira, Khalil and Reddy, Sumitra Mitra, eds., Proceedings: 21st IEEE International WETICE Conference: WETICE 2012: 25-27 June 2012, Toulouse, France. Los Alamitos, Calif.: IEEE Computer Society, pp. 456-61.
Macgregor, George and McCulloch, Emma. 2006. Collaborative tagging as a knowledge organization and resource discovery tool. Library review 55: 291-300.
Mas, Sabine and Marleau, Yves. 2009. Proposition of a faceted classification model to support corporate information organization and digital records management. In 42nd Hawaii International Conference on System Sciences (HICSS). 5-9 January, 2009, Waikoloa, Big Island, Hawaii. DOI= http://doi.ieeecomputersociety.org/10.1109/HICSS.2009.874.
Mika, Peter. 2005. Ontologies are us: A unified model of social networks and semantics. In Gil, Yolanda, ed., The Semantic Web - ISWC 2005: 4th International Semantic Web Conference, ISWC 2005, Galway, Ireland, November 6-10, 2005: proceedings. Berlin: Springer, pp. 522-36.
Nakamura, Carlos and Zeng-Treitler, Qing. 2012. A taxonomy of representation strategies in iconic communication. International journal of human-computer studies 70: 535-51.
Passant, Alexandre. 2007. Using ontologies to strengthen folksonomies and enrich information retrieval in weblogs. In Weblogs and Social Media proceedings of the international conference in Boulder, Colorado, USA, 2007. New York: ACM Press, pp. 128-37.
Paivio, Allan. 1971. Imagery and verbal processes. New York: Holt, Rinehart, and Winston.
Provost, James. 2008. Improved document summarization and tag clouds via singular value decomposition. Master's thesis. Kingston, Canada: Queen’s University.
Rivadeneira, A.W., Gruen, Daniel M., Muller, Michael J. and Millen, David R. 2007. Getting our head in the clouds: Toward evaluation studies of tagclouds. In CHI 2007 proceedings of the international conference in San Jose, California, USA, 2007. New York: ACM Press, pp. 995-8.
Schrammel, Johann, Leitner, Michael and Tscheligi, Manfred. 2009. Semantically structured tag clouds: An empirical evaluation of clustered presentation approaches. In CHI 2009 proceedings of the international conference in Boston, USA, 2009. New York: ACM Press, pp. 2037-40.
Schwarzkopf, Eric, Heckmann, Dominik, Dengler, Dietmar and Kröner, Alexander. 2007. Mining the structure of tag spaces for user modeling. In Data mining for user modeling ICUM'07 proceedings of the international conference in Corfu, Greece, 2007, pp. 30-1.
Sen, Shilad, Lam, Shyong K., Rashid, Al Mamunur, Cosley, Dan, Frankowski, Dan, Osterhouse, Jeremy, Harper, Maxwell F. and Riedl, John. 2006. Tagging, communities, vocabulary, evaluation. In Conference on Computer Supported Cooperative Work: 20th anniversary: November 4-8, 2006, the Fairmont Banff Springs Hotel, Banff, Alberta, Canada: conference proceedings. New York: ACM Press, pp. 181-90.
Shaw, Blake. 2008. Utilizing folksonomy: Similarity metadata from the del.icio.us system. Available http://www.metablake.com/webfolk/web-project.pdf
Shepitsen, Andriy, Gemmell, Jonathan, Mobasher, Bamshad and Burke, Robin. 2008. Personalized recommendation in social tagging systems using hierarchical clustering. In Recommender systems proceedings of the international conference in Lausanne, Switzerland, 2008. New York: ACM Press, pp. 259-66.
Shiri, Ali Asghar and Revie, Crawford. 2005. Usability and user perceptions of a thesaurus-enhanced search interface. Journal of documentation 61: 640-56.
Specia, Lucia and Motta, Enrico. 2007. Integrating folksonomies with the semantic web. In Franconi, E., Kifer, M., May, W., eds., ESWC '07 Proceedings of the 4th European conference on The Semantic Web: Research and Applications, pp. 624-39.
Wu, Harris, Zubair, Mohammad and Maly, Kurt. 2006a. Harvesting social knowledge from folksonomies. In Hypertext and hypermedia proceedings of the international conference in Odense, Denmark, 2006. New York: ACM Press, pp. 111-4.
Wu, Xian, Zhang, Lei and Yu, Yong. 2006b. Exploring social annotations for the semantic web. In 15th World Wide Web proceedings of the international conference in Edinburgh, Scotland, 2006. New York: ACM Press, pp. 417-26.
Van Damme, Céline, Hepp, Martin, and Siorpaes, Katharina. 2007. Folksontology: An integrated approach for turning folksonomies into ontologies. In Proceedings of the ESCW Workshop Bridging the gap between semantic web and web 2.0, pp. 57-70.
Zhou, Chao, Lejeune, Christophe and Bénel, Aurélien. 2006. Towards a standard protocol for community driven organizations of knowledge. In Proceedings of the 2006 conference on Leading the Web in Concurrent Engineering: Next Generation Concurrent Engineering. Amsterdam: IOS Press, pp. 338-49.
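
As an illustration of Equation (1) in Section 4.4.2, the following Python sketch computes the tagging-quality score Rx from an expert matrix and a binary participant matrix, and then divides it by tagging duration to obtain quality per unit time. It is a minimal sketch rather than the authors' implementation: the function names, the toy 3 x 2 matrices (the study used 87 tags and 24 items), and the duration value are assumptions made purely for illustration.

```python
from typing import Sequence


def tagging_quality(expert: Sequence[Sequence[float]],
                    participant: Sequence[Sequence[int]]) -> float:
    """Return R_x: the sum over tags i and items j of TPx_ij * TE_ij.

    expert      -- averaged expert relevance scores (0 to 5), one row per tag
    participant -- 1 if the participant used the tag for the item, else 0
    """
    if len(expert) != len(participant):
        raise ValueError("matrices must have the same number of rows")
    total = 0.0
    for expert_row, participant_row in zip(expert, participant):
        if len(expert_row) != len(participant_row):
            raise ValueError("rows must have the same number of columns")
        for te, tp in zip(expert_row, participant_row):
            total += tp * te
    return total


def quality_per_unit_time(r_x: float, duration: float) -> float:
    """Return R_x divided by tagging duration (hypothetical helper)."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return r_x / duration


# Toy data (assumed): 3 tags (rows) by 2 items (columns).
expert = [[5.0, 0.0],
          [2.4, 1.0],
          [0.0, 4.0]]
participant = [[1, 0],
               [0, 1],
               [1, 1]]

r_x = tagging_quality(expert, participant)   # 5.0 + 1.0 + 0.0 + 4.0
print(r_x)                                   # 10.0
print(quality_per_unit_time(r_x, 4.0))       # 2.5
```

Because the participant matrix holds only 0 or 1, the double sum simply adds up the expert relevance of every tag the participant actually chose, so selecting tags the experts rated as irrelevant contributes nothing to the score.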

30 Knowl. Org. 41(2014)No.1 H. L. Moulaison, F. Dykas, J. M. Budd. Foucault, the Author, and Intellectual Debt

Foucault, the Author, and Intellectual Debt: Capturing the Author-Function Through Attributes, Relationships, and Events in Knowledge Organization Systems†

Heather Lea Moulaison*, Felicity Dykas**, and John M. Budd***

*School of Information Science and Learning Technologies, University of Missouri, 303 Townsend Hall, Columbia, MO 65211 USA **52 Ellis Library, University of Missouri, Columbia, MO 65201 USA ***School of Information Science and Learning Technologies, University of Missouri, 303 Townsend Hall, Columbia, MO 65211 USA

Heather Lea Moulaison is Assistant Professor at the iSchool at the University of Missouri. Her research focuses primarily on the intersection of the organization of information and technology and includes the study of issues pertaining to metadata, standards, and digital preservation. An ardent Francophile, Dr. Moulaison is also interested in international aspects of access to information.

Felicity Dykas is Head of the Digital Services Department at the University of Missouri Library. Previous positions have included Head of the Catalog Department and electronic resources librarian. She works with the university institutional repository and digital library and has additional interests in metadata standards, organizational systems for online resources, and the preservation of print and digital material.

John M. Budd is a Professor in the iSchool at the University of Missouri. He is the author of ten books and more than 100 journal articles. He can also be found with his cat, Bitsy.

Moulaison, Heather Lea, Dykas, Felicity, and Budd, John M. Foucault, the Author, and Intellectual Debt: Capturing the Author-Function Through Attributes, Relationships, and Events in Knowledge Organization Systems. Knowledge Organization. 41(1), 30-43. 24 references.

Abstract: Based on Foucault’s exploration of the author-function, the current study investigates knowledge organization systems’ (KOS’s) treatment of persons who are also authors and the ability to record attributes, relationships and events related to those persons. FRBR and FRAD do well to extend the information in library authority records beyond the personal name as a character string to include attributes of the person, yet aspects of the person as an author and author-function can be enhanced. This paper begins with a discussion of the author-function as identified by Foucault and the complexities of identity that arise. Next, it reviews the Library and Information Science (LIS) literature on authorship and name authorities, then briefly discusses the current library content standard (Resource Description and Access (RDA)) and the current library encoding standard (MAchine Readable Cataloging (MARC)). It then examines four projects making use of person data to enhance the author-function: Europeana, AustLit, The American Civil War: Letters and Diaries, and DBpedia. We conclude that additional attributes, relationships, and events are pivotal to moving toward more Foucault-friendly KOS’s in libraries. Concerns with this more robust model of recording information include the ethics of recording attributes of persons and problems of end-user searching in current systems.

Received 9 August 2013; Revised 22 November 2013; Accepted 25 November 2013

Keywords: Foucault, author, author-function, relationships, attributes, relations, events, persons, authority-records

† This article is based on: Moulaison, Dykas and Budd (2013).

1.0 Introduction

The question of how author data should be compiled and made available in controlled vocabulary systems and in knowledge organization systems (KOS’s) is the subject of current interest in the knowledge organization (KO) community, with significant interest around the IFLA Study Group on the Functional Requirements for Bibliographic Records’s Functional Requirements for Bibliographic Records: Final Report (FRBR 1998). FRBR designates three groups of entities in the bibliographic universe, with Group 2 representing “those responsible for the intellectual or artistic content, the physical production and dissemination, or the custodianship of the entities in the first group” (p. 14). Group 1 represents “the different aspects of user interests in the products of intellectual or artistic endeavour” (p. 13), and Group 3, “an additional set of entities that serve as the subjects of works” (p. 17). FRBR also demonstrates relationships between entities within and between groups. The sibling document, Functional Requirements for Authority Data (FRAD) (Patton 2009), builds on FRBR and designates fourteen attributes that can be recorded in authority records for persons, a Group 2 entity. These attributes are: 1) Dates associated with the person; 2) Title of the person; 3) Gender; 4) Place of birth; 5) Place of death; 6) Country; 7) Place of residence; 8) Affiliation; 9) Address; 10) Language of person; 11) Field of activity; 12) Profession/occupation; 13) Biography/history; and 14) Other information. Persons identified by the access points and described by the attributes are, according to FRBR, associated with Group 1 entities: works, expressions, manifestations, and/or items. In the bibliographic universe, people create (i.e. have relationships with) works, have attributes, and are represented by a character string that includes their name, yet they are never specifically identified as authors.

It is seldom considered exactly what an author is or what constitutes an author as the subject who is responsible for a work. The question, “who is the author” may be asked, but the corollary (yet distinct) question, “what is an author” is seldom a matter of inquiry. Michel Foucault’s influential early-period work, “What Is an Author?” (1977b) explores the notion of authorship and has informed studies of KOS’s. The current paper extends the Foucauldian inquiry into authorship in KOS’s, continuing Budd and Moulaison’s (2012) work and Moulaison, Dykas, and Budd’s (2013) work on the topic. It also addresses issues first raised by Smiraglia, Lee, and Olson (2011) when they asked, “What role does the name of an author represent in the interplay between publishing, bibliography, and cataloging?” (p. 137). We will examine the relationship between the information recorded and retained for authors in KOS’s and the information required to support a comprehensive understanding of the author-function. Foucault’s analysis of the complexities of the author-function and authorship is examined first. Next, we look to the literature in KO and library and information science (LIS) to explore concepts related to authorship and authority records. We then discuss and compare current systems as they stand, and end with recommendations for rendering library-based KOS’s more amenable to representing authors, and subsequently allowing for the establishment of the author-function through the addition of information about events.

2.0 Foucault: What is an author?

Foucault responded to Roland Barthes’s essay, “The Death of the Author” in his 1969 essay, “What Is an Author?” (published in translation in 1977). Barthes (1977) preceded Foucault by saying that the author can no longer be considered a meaningful construct “for the good reason that writing is the destruction of every voice, of every origin. Writing is that neutral, that composite, that oblique space where our subject slips away, the negative where every identity is lost, starting with the identity of the very body which writes” (p. 142). He further says that “the modern writer (scriptor) is born simultaneously with his text; he is in no way supplied with a being which precedes or transcends his writing” (p. 140). Barthes’s goal in the essay was effectively to replace “the Author” (as the primary creative signifier) with writing (or the process of creation) rather than what he saw as an arbitrary creator (see Wilson 1999, p. 340). Barthes’s effort to replace the author with writing, and thus to privilege writing as both act and product, caught Foucault’s attention and led him to attempt a correction of Barthes’s thinking.

In his essay Foucault (1977b) asks: “What, in short, is the strange unit designated by the term, work? What is necessary to its composition, if a work is not something written by a person called an ‘author’?” (p. 118). In asking these questions, Foucault transcends Barthes and introduces a different “unit” of analysis that has its own criteria and effects. Foucault (1977b) actually anticipated many of the challenges that would eventually arise in the field of KO as he diminished the “noun” that has been taken to signify an author and replaced that inadequate speech act with “name” as classification (p. 123). In other words, the name attributed to a work, while eminently important both to reading and to categorization, has traditionally been removed from the human being attached to works. What is much more important is a completely revised conception of “authority.” The authority no longer exists solely within the realm of a person who has been connected to a work. Greater attention must be paid to the discourse that is enabled by the work. The author is transformed into the “author,” or, more appropriately, the site of the author-function. The author-function does not signal, as some commentators contend, the disappearance of the author. As Foucault (1977b) wrote, “We can conclude that, unlike a proper name, which moves from the interior of a discourse to the real person outside who produced it, the name of the author remains at the contours of texts, separating one from the other, defining their form, and characterizing their mode of existence” (p. 123). Foucault’s intention, as Wilson (1999) proposes, is not only to “problematize” author and authorship, but to place them both at the center of enquiry, to examine precisely where they fit into the creation of the work (and, by extension, of knowledge).

2.1 Complexities of identity

Foucault’s author-function extends beyond the attributes of a person, a human being who lives in a certain place at a certain time and who has other identifiable attributes that can be recorded as authority data in an authority record. The author-function maintains a kind of authority, but one that is present in works instead of “personalities.” The author-function is more object than subject: an object representative of creation. To comprehend Foucault’s conception most fully, it is best to turn to another of his (1977a) essays, where he says, “The imaginary is not formed in opposition to reality as its denial or compensation; it grows among signs, from book to book, in the interstice of repetitions and commentaries; it is born and takes shape in the interval between books. It is a phenomenon of the library” (p. 91). The author-function, then, is likewise interstitial; it is woven from the starting point of the author throughout the discursive thread thus begun and continued in a labyrinthine path.

A particular example of Foucault’s expansion of the author-function can be illustrated by using Sigmund Freud. Freud, of course, was an author of definable and attributable works. The discourse surrounding Freud, though, extends beyond the person or the proper name. Freud gave birth (intentionally or not) to Freudianism, or the discursive practice that draws in some ways from his works. He also gave birth to psychoanalysis, a school of psychiatric and psychological practice. Psychoanalysts might or might not be Freudians, but they all either draw from or react against Freud and his works. Particular individuals are also connected to Freud; Otto Rank, an Austrian contemporary of Freud and member of his psychoanalytic movement, would be one such person. There are also contemporaries that have complex connections to Freud, such as the Swiss psychologist and psychoanalyst Carl Jung. Jung and Freud are together responsible for works on dreams, but Jung departed from Freud’s orthodoxy. Freud has further given rise to those who have, through time, reacted against his works and expressions, including the United States feminist and author of The Feminine Mystique, Betty Friedan. As Foucault (1977b) remarks, authors who can be seen as embodying author-functions, such as Freud, are “‘initiators of discursive practices,’ [who] not only made possible a certain number of analogies that could be adopted by future texts, but, as importantly, they also made possible a certain number of differences” (p. 132). Friedan represents one such difference as a detractor of Freudianism.

Perhaps a more effective way to demonstrate the author-function (building on the example of Freud) is by means of graphic illustration. Figure 1 points out that Freud, by means of the totality of his works, rendered subsequent works and ways of thinking possible. That is, without Freud’s works, the works of other psychologists might not have been created, or at least might not have been created and expressed in the forms they took. Would Carl Jung have developed his conceptualizations in precisely the way he did had Freud not written the works he did? Would there have been a practice of psychoanalysis if Freud had not articulated principles? The figure illustrates notable psychologists who owe a debt to Freud’s work, as well as ideas that stem from the influence of Freud. In short, “author-function” is much more than something akin to a citation process; it is recognition of intellectual debt that can be traced back to the works of the progenitor of concepts. The author-function is a demonstration and acceptance that some things are possible because of who and what has preceded them. Mechanisms to make explicit the intellectual debt of an author and, indeed, the intellectual debt that an author inspires, are increasingly of interest in KOS’s where relationships are key. Current systems in use in libraries are not, as will be shown below, capable of robustly demonstrating the author-function despite the importance of the discursive function to scholarship.

3.0 Review of the literature

In this brief review of the literature, we focus on the related concepts of authorship and authority records as a potential means for supplying information about authors. The principle of authorship has guided the field of librarianship in its work to organize information, and the implementation of name authorities has permitted the practical retrieval of surrogates in KOS’s. One way to provide further information about authors that would help clarify aspects of the author-function is through the addition of information about possible influences on the authors, be they human (positive or negative), geographic, situation-based events, or other.

Figure 1. Aspects of the Author-Function of Freud
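
The intellectual debts described for Freud can also be written down as typed links between persons. The following is only a minimal sketch under stated assumptions: the relation labels (`draws_from`, `departs_from`, `reacts_against`) and the function name are hypothetical and are not drawn from FRBR, FRAD, or any of the systems discussed here; the persons and their stances toward Freud are those named in the text.

```python
from collections import defaultdict

# Typed "intellectual debt" links taken from the Freud example in the text.
# The relation names are illustrative labels, not terms from any standard.
DEBTS = [
    ("Otto Rank", "draws_from", "Sigmund Freud"),
    ("Carl Jung", "draws_from", "Sigmund Freud"),
    ("Carl Jung", "departs_from", "Sigmund Freud"),
    ("Betty Friedan", "reacts_against", "Sigmund Freud"),
]


def debts_owed_to(progenitor, links=DEBTS):
    """Group everyone linked to `progenitor` by the kind of debt they owe."""
    grouped = defaultdict(list)
    for debtor, relation, source in links:
        if source == progenitor:
            grouped[relation].append(debtor)
    return dict(grouped)


freud = debts_owed_to("Sigmund Freud")
for relation, people in freud.items():
    print(relation, "->", ", ".join(people))
```

Recording the kind of debt, rather than a bare link, is what would let a system distinguish followers from detractors; a plain citation list could not make that distinction.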

3.1 Authorship

The basis of the modern notion of authorship arose in the West as a result of the printing press. “It seems reasonable to conclude ... that the advent of print and its development in the late fifteenth and early sixteenth centuries played no small part in the rise of authorial self-consciousness among vernacular writers in Paris. It may ultimately have effected a change in the concept of literature itself” (Brown 1991, 142). The principle of authorship is pivotal to the design and use of KOS’s (Smiraglia, Lee, and Olson 2011). In speaking of the creation and diffusion of knowledge, authors “facilitate discourse” (Smiraglia and Lee 2012, 36) and accordingly, are essential components of surrogate records describing works. In the modern tradition, the author is “in the narrower sense ... the person who writes a book; in a wider sense it may be applied to him who is the cause of the book’s existence” (Cutter 1904, 14). Authors, therefore, exercise an essential function in the creation of a work, and in the Western tradition, are credited in the bibliography (Smiraglia, Lee, and Olson 2011).

The concept of authorship may be evolving at present (see Smiraglia and Lee 2012), especially given the collaborative environment that the web represents. It is also possible to imagine limited situations where users are seeking specific information and where, in those instances, the author of the content retrieved may not matter (Svenonius 2000). Given the evolution of circumstances for the creation of works and the information needs of a broader variety of users, the concept of authorship is one that continues to be addressed in KO and LIS.

3.2 Name authorities

Information about people who are either authors (Group 2 entities) or subjects (Group 3 entities) is retained in the KOS in a complementary database, the authority file. Authority files contain records about individuals playing a role in the bibliographic universe and are consulted by information professionals in the creation of surrogate records. Name authority work “provides a preferred form of name with cross-references to different forms and related names” (Burke and Shorten 2013, 365), with the assumption that the name itself might change over time. To facilitate changes in names, non-text-based (presumably numerical) identifiers have been proposed as a complement to the traditional name-based but perpetually-updating headings entered into surrogates (Niu 2013). Barrionuevo Almuzara, Alvite Díez, and Rodríguez Bravo (2012) point out that the “collaborative area is the most appropriate place for the development of projects on authority control” (p. 97). VIAF, the Virtual International Authority File, is an example of a collaborative project (Barrionuevo Almuzara, Alvite Díez and Rodríguez Bravo 2012) that provides unique identifiers (Niu 2013). VIAF also supplies URIs for name authority records (VIAF 2012), potentially allowing VIAF records to become part of the linked data web, a web of machine-readable relationships (Bizer, Heath and Berners-Lee 2009).

Increasing the ease with which authority records are updated, disseminated, and used is crucial, but if the information housed in the authority record cannot be used efficiently in the search process, it will not benefit the end-user in the long run. Yee (2005) warns of the issues that arise in doing a keyword search for Samuel Clemens and Tom Sawyer in the online library catalog if the authority record for Mark Twain is not also searched as part of the query. In the library context, the contents of the records serve to help in the creation of the bibliographic record and for searching the name in the system, based on the authorship principle.

4.0 Analysis of current initiatives

Current projects and initiatives implement and expand the ideas of authorship presented in the FRBR and FRAD models. The standards and projects discussed below are geared toward providing identifying and contextual information for FRBR Group 2 entities and relationships between entities. Standards used by the library community and related projects are analyzed for their ability to make explicit elements of the author-function in KOS’s.

4.1 Selected standards

The library community has been using a cataloging content standard (RDA (Resource Description and Access); until 2013, the Anglo-American Cataloguing Rules, second edition (AACR2)) along with an encoding standard, MARC (MAchine Readable Cataloging), to encode library data for a generation. RDA represents an expansion on that tradition through its backward compatibility with AACR2 records and through its basis on the FRBR model; MARC has been adapted within the limits of the standard to accommodate new needs presented as well. Below, we discuss the content standard and the encoding standard in turn.

RDA (2010), as based on FRBR and FRAD, clarifies and delineates relationships between bibliographic entities and defines attributes for Group 2 entities. RDA “moves beyond what is required for an access point and toward a record for the person” (Oliver 2010, p. 60). In doing so, it makes a substantial move toward providing information that supports the author-function. In libraries, authority records with the new RDA attributes are available in the Library of Congress Name Authority File; these records also are included in VIAF.

Attributes of persons that can be recorded in RDA records include both traditional and new content. The name of the person (including the “see from” character string, or the variant access point, which is optional in RDA), the fuller form of the name, dates associated with the person, title of the person, and other designations associated with the person are traditional attributes that have historically been recorded in library metadata. New fields considered important include profession or occupation, field of activity of the person, associated groups, and identifiers for the person. RDA core elements are the preferred name of the person, an identifier for the person, and, when known, dates of birth and death. Selected titles (those associated with royalty, nobility, ecclesiastical rank or office, or a religious vocation) and designations for saints or spirits also are core. Other titles, designations, and dates, fuller form of name, and profession or occupation are core only when needed to differentiate persons’ names (American Library Association, 2010). All of the enhanced elements are new attributes. Enhanced elements include language of the person, gender, address of the person, country associated with the person, place of residence, place of birth and place of death. See Figure 2 for examples of both traditional and new attributes in a personal name authority record.

Making explicit references to relationships between entities and even between and among attributes represents a major advance in RDA as a cataloging code. The relationships now cover a broader range of associations and there is greater specificity and consistency in delineating the nature of the relationships. Yet, the identified relationships are geared toward the bibliographic relationships traditionally provided in catalog/bibliographic records and they primarily appear in bibliographic records. Written expressions that have been adapted as performances are a primary example of a relationship that is effectively handled in RDA. Despite the focus on bibliographic relationships and relationships between Group 1 and Group 2 entities, relationships between Group 2 entities in RDA are beginning to be included in authority records, as exhibited by the following example authority record (http://lccn.loc.gov/no2011033681), which shows employers of the person directly in the authority record as “see also” references (authorized access points for related entities). See Table 1 below for an example of the references.

RDA, Appendix I identifies terms for relationships between a resource and persons, families, and corporate bodies associated with the resource, and Appendix J identifies terms for relationships between works, expressions, manifestations, and items. Some derivative relationships provide linkages among entities in bibliographic families. As mentioned, written expressions that have been adapted as performances are especially well-represented.

Figure 2. Labeled view of an RDA authority record for Michel Foucault (http://lccn.loc.gov/n79065356)
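
To make the move from “access point” to “record for the person” concrete, the fourteen FRAD person attributes can be pictured as optional slots alongside the name string. This is a sketch only: the class and field names are hypothetical, and the sample values are commonly known facts about Foucault supplied for illustration; they are not transcribed from the authority record shown in Figure 2.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class PersonAuthority:
    """Name string plus one slot per FRAD person attribute (names are hypothetical)."""
    preferred_name: str                          # the access point (character string)
    dates: Optional[str] = None                  # 1) dates associated with the person
    title: Optional[str] = None                  # 2) title of the person
    gender: Optional[str] = None                 # 3) gender
    place_of_birth: Optional[str] = None         # 4) place of birth
    place_of_death: Optional[str] = None         # 5) place of death
    country: Optional[str] = None                # 6) country
    place_of_residence: Optional[str] = None     # 7) place of residence
    affiliation: List[str] = field(default_factory=list)               # 8)
    address: Optional[str] = None                # 9) address
    language: List[str] = field(default_factory=list)                  # 10)
    field_of_activity: List[str] = field(default_factory=list)         # 11)
    profession_or_occupation: List[str] = field(default_factory=list)  # 12)
    biography_history: Optional[str] = None      # 13) biography/history
    other_information: Optional[str] = None      # 14) other information


def recorded_attributes(record):
    """Return the names of the attribute slots that actually hold a value."""
    return [f.name for f in fields(record)
            if f.name != "preferred_name" and getattr(record, f.name)]


# Illustrative values only; not copied from the Library of Congress record.
foucault = PersonAuthority(
    preferred_name="Foucault, Michel, 1926-1984",
    dates="1926-1984",
    place_of_birth="Poitiers, France",
    place_of_death="Paris, France",
    language=["French"],
    profession_or_occupation=["Philosopher"],
)
print(recorded_attributes(foucault))
```

Every slot except the name is optional in this sketch, which mirrors the point that most attributes are recorded only when known or needed.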

Personal name heading: Woodward, Hugh M. (Hugh McCurdy), 1881-1940

See also: Employer: United States. Works Progress Administration

Employer: Brigham Young University

Employer: Dixie Normal College

Employer: St. George Stake Academy

Table 1. Personal name authorized access point and employer authorized access points for related entities for Hugh M. Woodward
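
The references in Table 1 can be sketched as typed links from one authority record to related entities. The sketch below is an assumption-laden illustration: the class, field, and function names are hypothetical and do not correspond to MARC field tags or RDA element names; only the headings and the “Employer” designator come from Table 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AuthorityRecord:
    """An authorized heading plus typed "see also" references (hypothetical model)."""
    heading: str
    # Each reference is (relationship designator, heading of the related entity).
    see_also: List[Tuple[str, str]] = field(default_factory=list)


woodward = AuthorityRecord(
    heading="Woodward, Hugh M. (Hugh McCurdy), 1881-1940",
    see_also=[
        ("Employer", "United States. Works Progress Administration"),
        ("Employer", "Brigham Young University"),
        ("Employer", "Dixie Normal College"),
        ("Employer", "St. George Stake Academy"),
    ],
)


def related(record, designator):
    """Headings linked from `record` under the given relationship designator."""
    return [target for kind, target in record.see_also if kind == designator]


def persons_linked_to(records, designator, target):
    """Reverse lookup: headings of records that point at `target` with `designator`."""
    return [r.heading for r in records if (designator, target) in r.see_also]


print(related(woodward, "Employer"))
print(persons_linked_to([woodward], "Employer", "Brigham Young University"))
```

The reverse lookup is the part that depends on the designator being stored as data rather than as free text: with a typed link, a system could collocate everyone employed by the same body.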

The encoding standard MARC (MAchine Readable gov/no2012144973) with a “see also” reference for the Cataloging) allows for the encoding of content and data, person’s husband. In this example record, the person being and it also serves as a content standard in its own right for described is Clara Snyder. A “see” reference (a variant ac- some of the fields and fixed fields it proposes. Content cess point) is created from her maiden name, and a “see added in these fields goes beyond content required by the also” reference (an authorized access point for a related en- cataloging codes in use, and help the system with storage tity) is created from the authorized form of the access and permit additional retrieval and collocations of items. point for her spouse, Roy Snyder. It is interesting that the MARC field tags map precisely to the FRAD attributes for relationship to her husband is designated by an eye- personal names. Fields exist supporting all fourteen readable character string, and not by machine-readable FRAD-identified person attributes, including dates, titles, data. These relationships are important to indicate, but are other attributes, places, field of activity, group associations, not yet fully machine-actionable. As additional relation- occupation, language, and biographical data. Many of these ships are added to enrich the network of connections be- same fields are used for both the core elements and en- tween and among persons for whom personal name re- hanced elements in RDA. Fields previously used primarily cords are created and as the semantics are enhanced so that for separate bibliographic identities (pseudonyms) in machines understand the relationships in a meaningful way, MARC are now being used to support the relationships the potential for discovery is greatly enhanced. In supply- mentioned in the RDA sub-section above. Figure 3 is an ing this additional information, even if it is not fully ma- excepted example of a MARC record (http://lccn.loc. 
chine-actionable, RDA records encoded in MARC include 36 Knowl. Org. 41(2014)No.1 H. L. Moulaison, F. Dykas, J. M. Budd. Foucault, the Author, and Intellectual Debt supplementary information to encourage users to be able includes author attributes and relationships. Like Euro- to contextualize, find, identify, and justify personal names, peana, these relationships include such things as Influence- according to the FRAD user tasks. In doing so, these re- Agent and Influence-Work. Along with these, AustLit also cords also enhance users’ understanding of the author- includes FRAD attributes, including dates, other attributes, function and the robustness of the attributes of the person affiliation, occupation, gender, language, and biography. in conjunction with the richness of that person’s relation- Figure 4 shows related links for the author Patrick White. ships. The American Civil War: Letters and Diaries (http:// alexanderstreet.com/products/american-civil-war-letters- 4.2 Selected projects and-diaries), available via Alexander Street Press, is a sub- scription database allowing access to diaries, letters, and In this section, we examine four projects that record at- memoirs of individuals impacted by the American Civil tributes and other information about persons as authors, War (http://solomon.cwld.alexanderstreet.com/cwld.help. and consider how these attributes have the potential to html). A series of metadata fields (see Table 2) are filled support the author-function. Europeana, AustLit, Ameri- out for each letter’s author, permitting a powerful target can Civil War Diaries and Letters, and DBpedia maintain search. The advanced search feature permits users to data in a way that will be of interest to KOS users. 
When search specific attributes of authors, including their age we examine each in turn, we see that these projects are in- when writing, race, religion, military rank, as well as the novative in their use of authority data to drive organiza- schools they attended (see Figure 5). Drop-down menus tion, search, and retrieval. Although the KOS environ- permit users to search with the controlled vocabularies ment in which each functions is fundamentally different values appropriate to each field. from the KOS environment used in libraries described DBpedia (http://wiki.dbpedia.org/About), the linked above, the approaches to indicating attributes and rela- data version of Wikipedia (http://dbpedia.org/About) tionships is nonetheless instructive. maintains all of the information that FRAD indicates be Europeana (http://www.europeana.eu/) retains infor- recorded as attributes as well as a variety of additional at- mation similar in scope to FRAD for persons, with a few tributes that KOS’s have not traditionally retained. These notable differences. Similarities include the ability to record attributes are not purely scholarly, although it seems plau- attributes such as dates, occupation, gender, and biography. sible that the bust size, astrological sign, or tattoos of an One difference is that the Europeana data model (Euro- author might somehow impact her authorship. These at- peana 2012) is linked-data-friendly, and information en- tributes along with additional information about influ- coded using this model can be accessed as linked data. An- ences, sexual orientation, ideologies, and relationships other difference is the inherent potential for the presence might help not only understand the author in context, but in the Europeana data model of information about rela- lay the groundwork for thinking about the author- tionships and events: hasMet; isRelatedTo; wasPresentAt. function. 
See Table 2 for a more complete listing of at- These person- and event-based potential influences permit tributes, relationships, and events in Wikipedia. an additional contextualizing of the author-function based on the additional information supplied and semantically 5.0 Discussion linked. See Table 2 for more details. AustLit, the Australian Literature Resources (http://aust Access to the resource has traditionally been the focus of lit.edu.au/), implemented the FRBR model to describe lit- KOS’s. Cutter’s Rules for a Dictionary Catalog (1904) de- erary and creative works. Data included in authority records scribes principles for access, or “objects,” focusing on the

Figure 3. Snippet of a MARC authority record for Clara Snyder with spouse as an authorized access points for related entities Knowl. Org. 41(2014)No.1 37 H. L. Moulaison, F. Dykas, J. M. Budd. Foucault, the Author, and Intellectual Debt

Figure 4. AustLit metadata about Patrick White, plus categories (http://www.austlit.edu.au/austlit/page/A27473)

Figure 5. Advanced search options, The American Civil War: Letters and Diaries classification

38 Knowl. Org. 41(2014)No.1 H. L. Moulaison, F. Dykas, J. M. Budd. Foucault, the Author, and Intellectual Debt

materials. Cutter outlines tasks pertaining to the finding function (permitting users to find a surrogate for a work if the author, title, or subject is known), the collocation function (bringing together works by author, subject or some other feature), and the selection function (permitting users to understand if the book will be useful based on information about the item). These objects are the basis for current catalog systems, and underlie FRBR’s approach to user tasks (see Tillett 2003). Is there any doubt that, in a system dedicated to bibliographic records, the book would be the central focus?

In the traditional KOS’s used in libraries, information about attributes of authors not included in access points, such as gender, affiliations, profession, and field of study, has been and remains hidden from patrons. Limited information through the Library of Congress Subject Headings has been available to patrons, but access to detailed information about authors has not traditionally been part of the user experience, making the newly envisioned FRAD user tasks of contextualize, find, identify, and justify revolutionary in their scope. This is not to imply that authors are completely without importance in traditional KOS’s, especially those used in libraries. Personal name main entries and primary access points are, according to cataloging rules, based on the author; secondary access points can also be based on authorship. Some of the importance of the author in the traditional KOS is lost in the fact that he is reduced to a name: a character string that can be collocated with identical character strings as a way of meeting the objects of the system. Systems with authority records created using AACR2 only have information about the author as it pertains to the choice of the character string that forms the heading.

With the focus on access to information about the book and its features, access to information about an intellectual debt owed to and by the author historically has been overlooked. Based on the analysis of the four systems described above, we put forth that personal attributes, relationships, and events are the best approximation of the author-function that can be envisioned at present in library systems.

5.1 The author-function and KOS’s

The four systems examined above permit an inclusion of the author-function as described by Foucault to varying degrees. Table 2 summarizes the attributes, relationships, and events that can be included in each system. In each of the four systems, attributes of the author are the most available option, with DBpedia offering the largest number of options. Relationships between the author and other individuals are likewise available in the systems, but are not as numerous as the attributes overall. In terms of events where authors participated, finding inspiration or creating relationships, minimal options exist in each system.

Revisiting the earlier example of Freud and the intellectual debt owed to him, Freud’s author-function can be reasonably embodied in the systems provided if adequate information is supplied. DBpedia is a good example of a system supporting the author-function. Information about Freud that appears on his English-language Wikipedia page (http://en.wikipedia.org/wiki/Sigmund_Freud) includes date and place of birth, date and place of death, nationality, fields, institutions/alma mater, academic advisors, work known for, persons who influenced Freud, persons whom Freud influenced, awards, spouse, dates married, and his signature. Relationships, including hyperlinked names of persons, and names of events appear throughout the article on Freud. Links to two of the three individuals mentioned in the introduction to this article are included in the “Influenced” section (i.e. to Rank and Jung). A link to Friedan appears in the section on Freud’s influence on feminism. Elements of the intellectual debt and the discourse surrounding Freud, although not explicitly indicated, are evident in the Wikipedia entry for Freud; it remains the task of the user to understand and internalize them for the purpose of searching in this or related systems.

5.2 The expanded role of attributes and relationships

In libraries, FRBR and FRAD expand on the notion of author-as-character-string, adding information about the author as a person to the authority record. The fourteen attributes identified in FRAD provide enriched authority records for use in KOS’s and take an author from being a character string to becoming a more three-dimensional individual with the characteristics (attributes) of a person. Increased information about the author that can be leveraged to carry out searches in future KOS’s is a great benefit to users and is indisputably an improvement over the previous name-only methods. Information about attributes and about relationships goes a long way toward making personal name records reflect the person-ness of the authors they represent. They are less able, however, to indicate how those attributes and relationships were engendered if they were the result of an event in the author’s life.

5.3 The author-function and events

Based on our understanding of Foucault and the author-function, FRBR and FRAD do not go far enough in permitting users to understand an author in light of her author-function and to collocate (works, authors, movements, etc.) based on that author-function. In short, they do not extend the semiotics far enough, and do not take full advantage of the author-function as an essential signifier. The bibliographic universe, or at least the bibliographic representation, is a sign system, in which the author-function plays a special and important representational role. In keeping with the intentions of FRBR and FRAD, the author-function is not defined by the spontaneous attribution of a text to its creator, but through a series of precise and complex procedures (as do FRBR and FRAD); it does not refer, purely and simply, to an actual individual insofar as it simultaneously gives rise to a variety of egos and to a series of subjective positions that individuals of any class may come to occupy (Foucault 1977b, 130-31).

Extending farther still, beyond the author-function, there is content pertaining to authors (and even to people) that can and should be included in authority records or be accessible through the authority file via rich relationships. This additional content, going beyond documentation of a choice of entry terms for a personal name heading as well as going beyond the fourteen additional attributes designated by FRAD, would allow library KOS’s to be searched in a more robust manner.

Scenarios that involve the selection of works based on criteria of authorship are easy to imagine. Researchers could examine books on a topic that were authored by 20-year-olds versus 70-year-olds. Information about age at the time of publication would need to be included in the authority records in library KOS’s for this to happen in FRBR-compliant systems. Researchers could also want to read all of the works written by members of a particular group, such as the Bloomsbury Group from England in the 1920s or by authors who frequented a certain Parisian salon as the Enlightenment took shape.

Events, additionally, can be defining aspects of an author’s life, bringing about changes in relationships and statuses that may in turn affect the author-function. An example of an event could be a wedding. By virtue of the marriage, the participants change their statuses from single to married. They also enter into a new relationship with another person and with that person’s family. Attendees at events also have the potential to be marked by them: they may meet future marriage partners at a wedding; they may also meet people in passing who do not, ultimately, affect their attributes or relationships. The interactions at events have the potential to influence persons, providing fodder for a fictionalized account of the events in the form of a work, or by overhearing conversations that influence thinking on, for example, a work in progress. Of the selected projects described above, only Europeana is considering implementing information about events to be recorded in authority records. Europeana will do this through the wasPresentAt element.

Linked data projects have been exploring the importance of events already with some success. For example, in NNDB Mapper (http://mapper.nndb.com/), Barbara Walters’s participation in gala events can be traced, and moments when she overlapped with other celebrities can be assessed, with appropriate visualizations supporting the interactions (http://mapper.nndb.com/start/?id=23371). See Figure 6 for a visualization of Barbara Walters’s participation in events, along with professional work and personal affiliations.

Figure 6. Barbara Walters’s participation in events, professional work, and personal relationships in NNDB Mapper (http://mapper.nndb.com/start/?id=23371)

Events can be a defining factor in the life of any person, including an author. One way to record information about an author that would support an understanding of the author-function would be to record information about events in which she participated. This information would be recorded as well as attributes she possesses and relationships she has had, even if these attributes and relationships were attained as a result of participation in events. Being able to create a bibliographic network of events permits users to search more and better content about the context of authors.

Although the intention is certainly laudable, the visualizations permitted by the Library of Congress Linked Data Service are currently less robust. Consider the visualization for Freud (see Figure 7). The only node on the graph is for Freud himself; none of his attributes or relationships are represented in the Visualization tab. Users would not be expected to use this feature; it is on the Library of Congress website and is not an integral part of the KOS the Library of Congress offers online. It is, however, sparse in comparison to the kinds of information that surely could be represented here, as in Barbara Walters’s visualization (Figure 6).

Figure 7. Visualization of Freud’s authority record, Library of Congress Linked Data Service (http://id.loc.gov/authorities/names/n79043849.html)

5.4 Concerns

A number of concerns arise when recording attributes supporting the author-function in KOS’s. The first and most important concern is the ethical provision of this information. A second concern is the feasibility of including this information in KOS’s in a way that ultimately supports retrieval.

5.4.1 The ethics of recording person attributes

In the RDA content standard, attributes of persons that can be recorded in library authority records can pose ethical dilemmas due to the private nature of the information. Information about historical figures that includes birth and death dates, address, gender, and profession helps users contextualize the person. Indeed, contextualize is one of the user tasks identified in FRAD (Patton 2009), and since information professionals are also clearly identified as users, this kind of additional information about persons will only help them as information intermediaries tasked with the creation of metadata to provide access. Additionally, such content is expected to be known of individuals of whom archives are held in public institutions or anyone whose opus is the object of formal study.

The proper balance between the ethical obligation to observe a living individual’s privacy and the professional obligation to ensure the best access via the most comprehensive sets of metadata attributes is less clear. Attributes should therefore only be drawn from publicly available information. In carrying out their work, information professionals strive to provide unbiased access to content, yet the classification tools they use are fraught with biases. Library catalogs, it has been suggested, can be considered texts, the biases of which can be studied (Drabinski 2013). It is unreasonable to expect that library systems will be neutral, and library metadata may invite polemics. Libraries have the obligation to respect the wishes of persons in regards to their recorded attributes within the parameters of their policies. Libraries also have the obligation to record information that will be useful for retrieval. When challenges arise, libraries should consider retaining the challenged attributes, but keeping them in a dark archive that is not accessible by anyone other than staff persons of the specific library institution. When such information has already been shared outside of the walls of the institution, the library community should do its best to respect the wishes of the person by not displaying content that the person considers a violation of his or her privacy.

5.4.2 End-user searching

No matter the sophistication of an authority file’s records and contents, search will be hindered until KOS’s permit the kind of targeted retrieval that The American Civil War: Letters and Diaries permits with its advanced search’s series of drop-down menus (see Figure 4). We suggest that the first step to ensuring robust access to works via sufficient information about their authors is to begin to include the kinds of attribute and relationship data that can appear in DBpedia records and event data that can appear in Europeana records for individuals in the authority records in library KOS’s. The necessary second step is to permit retrieval based on that data. A third, more challenging step is to show metadata for persons to users in much the same way that printed subject heading lists were made available to searchers in the days of the card catalog. There is no concrete reason for not supplying information on persons that may help with author searches other than that, traditionally, such access was not reasonable or feasible to provide.

6.0 Conclusion

Works are created by persons (or corporate bodies) in the FRBR model; persons create, yet, in doing so, the person becomes an author who is associated with a discourse and a context extending beyond his or her person-ness. The author-function as described by Foucault goes beyond the contextualization of entities in the bibliographic universe to include aspects of the person as an author including the intellectual debt created and extended.

In the past, the KOS author was not a person; he was a character string in a database. This weakness is being overcome in FRBR/FRAD, which include fourteen attributes of persons in records for authors (Patton 2009). DBpedia (http://mappings.dbpedia.org/server/ontology/classes/Person) permits many more kinds of attributes than FRAD’s fourteen to be recorded in a person’s record, thereby potentially giving a fuller perspective on the person as well as potentially allowing for retrieval of works based on attributes of authors. All four of the projects examined in this paper, Europeana, AustLit, American Civil War: Letters and Diaries, and DBpedia, permit both attributes and relationships to be recorded in the authority record. These projects serve as examples of what the FRBR model could permit library-based KOS’s to do if relationship information were recorded in the authority records.

Europeana is the only KOS encouraging the inclusion of machine-readable information about events in authority records for individuals. It is this final aspect that has the potential to make Europeana more Foucault-friendly than the other projects and the standards that were examined. This paper therefore makes a case for the inclusion not only of attributes in authority records, but also for the inclusion of information on relationships and events in those same records. To best make use of this additional data, it strongly encourages KOS’s to implement retrieval systems that are robust enough to permit users to search for works within the context of the author, going beyond a simple search on a character string that is the author’s name heading in the body of the bibliographic record and showing that information to users as a way of helping them to contextualize the author-function.

References

Barrionuevo Almuzara, Leticia, Alvite Díez, Mª Luisa, and Rodríguez Bravo, Blanca. 2012. A study of authority control in Spanish university repositories. Knowledge organization 39: 95-103.
American Library Association. 2010. RDA toolkit. Chicago, Ill: American Library Association.
Barthes, Roland. 1977. Image, music, text, trans. by Stephen Heath. London: Fontana/Collins.
Bizer, Christian, Heath, Tom, and Berners-Lee, Tim. 2009. Linked data: the story so far. International journal on semantic web and information systems 5: 1-22.
Brown, Cynthia J. 1991. Text, image, and authorial self-consciousness in late medieval Paris. In Hindman, Sandra, ed., Printing and the written word: The social history of books, circa 1450-1520, Ithaca, NY: Cornell University Press, pp. 103-42.
Budd, John M., and Moulaison, Heather Lea. 2012. Foucault and the bibliographic universe: What really is an author? Poster presented at the ASIST 2012 Annual Meeting, October 27-31, 2012, Baltimore, MD.
Burke, Susan K. and Shorten, Jay. 2013. Name authority work in public libraries. Cataloging & classification quarterly 51 no. 4: 365-88.
Cutter, Charles A. 1904. Rules for a dictionary catalog. 4th ed., rewritten. Washington: Government Printing Office.
Drabinski, Emily. 2013. Queering the catalog: Queer theory and the politics of correction. The library quarterly: Information, community, policy 83: 94-111.
Europeana. 2012. Europeana data model mapping guidelines v1.0.1. Available http://pro.europeana.eu/documents/900548/ea68f42d-32f6-4900-91e9-ef18006d652e.
Foucault, Michel. 1977a. Fantasia of the library. In Language, counter-memory, practice, trans. by D. F. Bouchard and S. Simon. Ithaca, NY: Cornell University Press, pp. 87-109.
Foucault, Michel. 1977b. What is an author? In Language, counter-memory, practice, trans. by D. F. Bouchard and S. Simon. Ithaca, NY: Cornell University Press, pp. 113-38.
IFLA Study Group on the Functional Requirements for Bibliographic Records. 1998. Functional requirements for bibliographic records: Final report. Available http://www.ifla.org/VII/s13/frbr/.
Moulaison, Heather Lea, Dykas, Felicity and Budd, John M. 2013. The author and the person: a Foucauldian reflection on the author in knowledge organization systems. In Proceedings from North American Symposium on Knowledge Organization, Vol. 4. University of Wisconsin-Milwaukee. Available http://www.iskocus.org/NASKO2013proceedings/Moulaison_Dykas_Budd_TheAuthorAndThePerson.pdf
Niu, Jinfang. 2013. Evolving landscape in name authority control. Cataloging & classification quarterly 51 no. 4: 404-19.
Oliver, Chris. 2010. Introducing RDA: A guide to the basics. Chicago: American Library Association.
Patton, Glenn E. Ed. 2009. Functional requirements for authority data: A conceptual model. IFLA Working Group on Functional Requirements and Numbering of Authority Records FRANAR. Germany: K. G. Saur.
Smiraglia, Richard P., and Lee, Hur-Li. 2012. Rethinking the authorship principle. Library trends 611: 35-48.
Smiraglia, Richard P., Lee, Hur-Li and Olson, Hope A. 2011. Epistemic presumptions of authorship. In iConference 2011, & ACM Digital Library. Proceedings of the 2011 iConference. New York, NY: ACM.
Svenonius, Elaine. 2000. The intellectual foundation of information organization. Cambridge, Massachusetts: MIT Press.
Tillett, Barbara. 2003. What is FRBR?: A conceptual model for the bibliographic universe. Library of Congress Cataloging Distribution Service. Available http://www.loc.gov/cds/downloads/FRBR.PDF.
VIAF. 2012. VIAF Data source. Available http://viaf.org/viaf/data/
Wilson, Adrian. 1999. Foucault on the “question of the author:” A critical exegesis. Modern language review 99: 339-63.
Yee, Martha M. 2005. FRBRization: A method for turning online public finding lists into online public catalogs. Information technology & libraries 24: 77-95.

From Knowledge Organization to Knowledge Representation†

Fausto Giunchiglia*, Biswanath Dutta**, Vincenzo Maltese*

*Department of Information Engineering and Computer Science, University of Trento, via Sommarive 5 38123 POVO, Trento, Italy; **Documentation Research and Training Centre, Indian Statistical Institute, 8th Mile Mysore Road, Bangalore 560 059, Karnataka, India

Fausto Giunchiglia is a professor of computer science at the University of Trento (Italy). His recent areas of interest are the use of semantics for managing knowledge diversity in the large and social computations, i.e. how to study and exploit the impact of ICT on organizations, people and society, towards the construction of a better society. He has published around 50 journal papers and more than 200 publications overall, has given more than 30 invited talks, and has chaired around 10 international events. He has actively participated in many EU-funded projects and acted as coordinator for KnowledgeWeb, OpenKnowledge, Insemtives, LivingKnowledge and SmartSociety.

Biswanath Dutta is an assistant professor at the DRTC Indian Statistical Institute (Bangalore, India) and a courtesy professor of the University of Trento (Italy). In 2010 he received his Ph.D. degree in library and information science from the University of Pune (India). He was a post-doctoral fellow at the University of Trento from 2009 to 2012 and has been a research assistant at Dalhousie University (Halifax, Canada). He actively worked in the LivingKnowledge EU-funded research project. His present research interests are in the areas of ontology modeling, knowledge organization and representation, the study of linguistic phenomena in knowledge organization, digital library and semantic web.

Vincenzo Maltese received his Ph.D. in ICT in 2012 at the University of Trento (Italy), where he is currently a post-doctoral researcher. His main area of expertise is data and knowledge representation. He has published around 30 scientific papers. He participated in several EU-funded projects including Interconcept (mapping large-scale KOS), LiveMemories (digital memories of collective lives), Semantic Geo-Catalogue (extending geo-catalogues with semantic capabilities), LivingKnowledge (dealing with diversity in knowledge) and SmartSociety (hybrid and diversity-aware collective adaptive systems), where he is currently the project manager. He is co-author of the open source tool S-Match and GeoWordNet.

Giunchiglia, Fausto, Dutta, Biswanath, and Maltese, Vincenzo. From Knowledge Organization to Knowledge Representation. Knowledge Organization. 41(1), 44-56. 34 references.

Abstract: So far, within the library and information science (LIS) community, knowledge organization (KO) has developed its own very successful solutions to document search, allowing for the classification, indexing and search of millions of books. However, current KO solutions are limited in expressivity as they only support queries by document properties, e.g., by title, author and subject. In parallel, within the artificial intelligence and semantic web communities, knowledge representation (KR) has developed very powerful and expressive techniques, which via the use of ontologies support queries by any entity property (e.g., the properties of the entities described in a document). However, KR has not scaled yet to the level of KO, mainly because of the lack of a precise and scalable entity specification methodology. In this paper we present DERA, a new methodology inspired by the faceted approach, as introduced in KO, that retains all the advantages of KR and compensates for the limitations of KO. DERA guarantees at the same time quality, extensibility, scalability and effectiveness in search.

Received 31 July 2013; Revised 26 August 2013; Accepted 26 August 2013

Keywords: ontologies, terms, entities, search, knowledge organization, knowledge representation

† This work has received funding from the EU CUbRIK Project under the GA no. 287704. We are also grateful to Silvano Groff and Claudio Gnoli for the fruitful discussions.

1.0 Introduction

So far, within the LIS community, knowledge organization has dealt with and developed its own very successful solutions in terms of methodologies, systems and tools for the classification, indexing and search of documents in libraries and digital archives. Documents are indexed and searched by their properties, such as title, author and subject (the latter codifying what a document is about). Controlled vocabularies are employed in order to standardize the subject terminology, thus ensuring high precision in search. Recall is increased by expanding terms in queries with synonyms and more specific terms taken from the controlled vocabulary. Historically, this approach has scaled as it allows for the classification, indexing and search of millions of books, though at very high costs of training and maintenance (Library of Congress 2007). Several methodologies have been developed for the construction and maintenance, often centralized, of controlled vocabularies. Among them, the faceted approach (Ranganathan 1967) is known to have great benefits in terms of quality and scalability of the developed resources (Broughton 2006). These techniques are very effective for searches exploiting document properties. A typical example of a supported query is the following: “Give me documents with author ‘Nash, David’ and subject ‘wood sculpture.’” However, KO is limited in expressivity as it fails in situations when users do not know such properties directly, but they know rather, for instance, the properties of the author or of any other entity the document is about, and want to search accordingly. For example, users may formulate the search need above as follows: “Give me documents about wood sculptures written by an artist born in Wales.” The need for this kind of more expressive query is proved by the fact that database and KR communities have spent decades in developing highly expressive query languages, e.g. SQL within database management systems (Ramakrishnan and Gehrke 2000) and SPARQL to query RDF (Prud’hommeaux and Seaborne 2006). Their usefulness is proved by plenty of studies. Questions like the ones suggested by us, i.e. queries requiring the same level of expressiveness, are in everyday use and prove effective in countless desktop and Web applications.

Addressing the query above in KO would require breaking it down into smaller search tasks and would rely on scattered resources such as catalogues and authority lists to get all the relevant information which is necessary to reformulate the query in terms of document properties only. This is actually one of the reasons search by end users is hard. In particular, for the query above it is necessary to identify the name of that artist born in Wales who wrote about wood sculptures. Supporting this requires appropriate sources of knowledge, the formalization of subjects, and a more expressive representation and query language.

In this respect, document search in KR is more expressive than in KO, as the former has developed very powerful and expressive techniques which, via the use of ontologies, support queries by any entity property. In fact, KR is concerned with the development of ontologies describing the relevant entities of a domain in terms of their basic properties, which enables an effective communication and information exchange, as well as automated reasoning (Berners-Lee et al. 2001, Bouquet et al. 2004). Examples of entities include persons, places, organizations, and events. Taken from a KR perspective, documents are just one particular type of entity with its own properties and document search is a special case of reasoning. However, from a pragmatic point of view, KR has so far failed, as it currently lacks appropriate entity specification methodologies which allow as much scaling as in KO.

In this paper we present DERA, a new faceted KR approach for the development of ontologies able to describe and reason about relevant entities of a domain. For instance, in the music domain, entities may include songs, singers and producers. DERA is faceted, as the methodology engaged for the construction and maintenance of domain ontologies is inspired by the principles and canons of the faceted approach as originated in KO. This makes DERA capable of dealing with large-scale, dynamic, ever-growing knowledge. DERA accounts for entity classes (E), relations (R) and attributes (A) of the relevant entities in the domain (D) and models them as semantic facets, i.e. facets where the semantics of the terms and the relations between them are made explicit (thus making each facet a formal ontology). The use of the fundamental categories E/R/A allows for a straightforward formalization of facets into Description Logics (DL) (Baader et al. 2002). This allows the automation of complex tasks such as highly expressive document search exploiting entity properties, via the usage of standard reasoning tools.

The remainder of the paper is organized as follows. Section 2 provides a motivation for our work showing the usefulness of moving from a purely KO to a KR approach to document search. Section 3 shows how descriptive ontologies (ontologies built for the purpose of describing and reasoning about real world entities) enable highly expressive document search by exploiting entity properties. Section 4 explains how descriptive ontologies can be naturally formalized into DL ontologies, thus enabling complex forms of automated reasoning. Section 5 presents DERA as an innovative approach that inherits the benefits of both KO (in terms of methodologies for the development of scalable ontologies) and KR (in terms of expressiveness and effectiveness of search). Section 6 explains the steps followed in the DERA methodology for the construction of scalable descriptive ontologies. Section 7 describes related work. Finally, Section 8 concludes the paper by summarizing the work done and outlining the next steps.

2.0 Motivation

With the purpose of providing effective mechanisms to make information available in a timely manner, several methodologies, systems and tools have been developed in KO for the classification, indexing and search of documents. In particular, documents are typically classified by subject and indexed by document properties such as title, author as well as subject. Indexing by title and author is straightforward, as they are directly taken from the document. Indexing by subject is far more complicated, as it requires an analysis of the document content and the application of precise principles and rules to construct corresponding subject strings as combinations of terms taken from a controlled vocabulary. In libraries, search is performed manually by using a card catalogue or electronically by issuing queries through online public access catalogue (OPAC) systems that provide access to classifications and indexes. OPAC systems allow the identification of those entries matching a user query as input, and return a corresponding set of relevant documents as output. Supported queries include conditions about single document properties. Typical examples of queries supported in KO are:

Give me documents with title “Il lago di Garda;”
Give me documents with subject “Cromford Mill;”
Give me documents with subject “Michelangelo;”
Give me documents with author “Nash, David” and subject “wood sculpture;” and,
Give me documents with author “Clinton, Bill” and title contains “autobiography.”

In order to ensure a higher recall, OPAC systems sometimes support semantic search (Giunchiglia et al. 2009a), namely a search where terms in the subject are disambiguated and expanded with synonyms and more specific terms taken from the controlled vocabulary. For instance, the term “sculpture” could be expanded by adding the more specific term “statue”, although in practice only a few OPAC systems really offer such functionality (Casson et al. 2009).

However, searching for documents by their properties is not always good enough. In fact, it requires users to know such properties in advance. Conversely, users might know, for instance, some of the properties of the author or of any other entity the document is about, and want to search accordingly. In this respect, document search in KR is more effective than in KO, as the former supports queries by any entity property. Typical examples of queries which are supported by KR and cannot be supported by KO are:

Give me documents written by Italians about any lake with depth greater than 100m;
Give me documents about a factory in England established by Richard Arkwright during the industrial revolution;
Give me documents about any artist born in Italy between 1450 and 1550;
Give me documents about wood sculptures written by an artist born in Wales; and,
Give me autobiographies written by any president of the United States.

Even if the queries in the second list above correspond, one by one, to the queries given in the first list, KO would fail in the above situation. In fact, though it is true that it is already possible to answer the queries in the second list in KO by looking into authority lists, catalogues and similar resources, this is not yet systematic, as it would still require breaking them down into smaller search tasks and would rely on scattered resources to get all the relevant information which is necessary to reformulate the queries above in terms of document properties only. This is one of the reasons that search is hard for end users. For instance, answering the third query above would require identifying the names of those Italian artists born within the given time interval.

In addition, a significant obstacle to this in KO is constituted by the fact that entries in the indexes codifying subjects are given as informal natural language strings. For instance, in the subject strings “Buonarroti, Michelangelo” and “sculpture—Renaissance” it is not explicitly specified that Michelangelo stands for the Italian artist, that sculpture is a term denoting a form of art, and that Renaissance denotes a historical period. The disambiguation of the terms occurring in the subjects is in fact possible if and only if for all of them there is a unique entry as preferred term in the controlled vocabulary, which is typically enforced for common nouns, but not always (given their potentially huge number) for proper nouns. When this is done, for instance in thesauri, very often it is actually only in terms of underspecified hierarchical relations, for instance by placing “Buonarroti Michelangelo” as narrower term under “Italian artist.” This is still a limited and informal specification as it does not enable complex reasoning tasks based on rich entity descriptions. In fact, it only says that documents about “Buonarroti Michelangelo” are documents about “Italian artists.” Moreover, specifying only the name may cause trouble in search (e.g. a drop in precision in the case of homonymy or in recall where an equivalent name is provided by the user). It is therefore necessary to make the meaning of subjects, in all their parts, explicit and unambiguous. Among other things, the lack of formality in the subjects makes their construction, maintenance and exploitation for search extremely difficult and costly. In fact, experts are needed during construction to select the appropriate terms from a controlled vocabulary and arrange them in the right citation order, during maintenance for instance to update terms that become obsolete, as well as during search to assist unskilled users who are not familiar with the domain terminology and the way terms need to be combined following the syntax and rules of the indexing language (Library of Congress 2007). Moreover, subjects and vocabularies alone do not say anything explicitly about Michelangelo in terms of his properties, e.g. his date and place of birth or his works, again in a way that is directly exploitable by reasoning tools. For instance, answering the third query

3.0 Classification ontologies and descriptive ontologies

Ontologies constitute high level descriptions of a domain, which can be used by intelligent applications to draw implicit consequences from explicitly represented knowledge (Baader et al. 2002). This is achieved through some form of automated reasoning. It has been observed that KO and KR, having different purposes, employ different kinds of ontologies (Giunchiglia et al. 2006; Giunchiglia et al. 2009b). In fact, Giunchiglia et al. (2006) introduced the key distinction between classification ontologies and descriptive ontologies.

KO employs knowledge organization systems (KOS). They commonly correspond to what in KR are called classification ontologies, i.e. ontologies mainly used to describe, classify and search for documents.
In these ontologies, as above would require specifying in the subject, through ap- the main focus is on documents, terms occurring at the la- propriate unique identifiers pointing to an external knowl- bels of nodes denote sets of documents, hierarchical rela- edge resource, that Buonarroti Michelangelo refers to the tions between terms denote superset/subset relations, and artist born in Italy in 1475. the individuals (the extension of the terms) are the docu- As exemplified in Figure 1, search by entity properties ments themselves. An example of such ontologies is given (typical of KR) actually includes search by document in Figure 2. For instance, the term “horses” denotes docu- properties (typical of KO). However, while KO mainly re- ments about horses (animals), while the fact that it is placed lies on controlled vocabularies and indexes, KR employs under “transportation means” indicates that documents supplemental knowledge resources (i.e. ontologies) pro- about horses are also documents about transportation viding an explicit description of the attributes of entities means (at least in the context in which the classification is such as people (e.g. their date of birth), facilities and or- used). This is called classification semantics (Giunchiglia et ganizations (e.g. their date of establishment), events (e.g. al. 2009b). The only simple form of reasoning carried out when they happened) as well as relations between them for document search in KO is based on the transitivity of (e.g. the fact that a certain person was born in a certain the hierarchical relations. In fact, this is what is needed to country). KR provides a more expressive representation enable semantic search (Giunchiglia et al. 2009a). For in- and query language, able to codify and automatically query stance, documents about horses can be returned when such knowledge. LIS seems to recognize the need for such searching for documents about facilities, because: resources. 
We can mention for instance RDA (2010), FRBR (1998), and FRAD (Patton 2009) as well as the re- – horses BT transportation means; and, cent OCLC work aiming to align BIBFRAME and – transportation means BT facilities. Schema.org models (Godby 2013). However, KR already offers techniques for the representation and automatic KR employs descriptive ontologies, i.e. ontologies built for exploitation of such resources. the purpose of describing and reasoning about real world entities. In these ontologies, terms denote sets of real world entities, hierarchical is-a relations provide the back- bone structure to these ontologies and indicate a subset re- lation, while the individuals include any real world entity. For instance, the relation “horse is-a animal” indicates that horses are a subset of all animals. This is called real world semantics (Giunchiglia et al. 2009b). Descriptive ontologies provide knowledge about entities in terms of classes, at- tributes and relations. For instance, they may specify that animals are affected by certain kinds of diseases and that certain cures are needed to defeat them. An example of Figure 1. From search by document properties to search by complex reasoning is searching for cures to a certain dis- any entity property ease affecting a given animal. In essence, the purpose of 48 Knowl. Org. 41(2014)No.1 F. Giunchiglia, B. Dutta, V. Maltese. From Knowledge Organization to Knowledge Representation

Figure 2. Example of classification ontologies

Figure 3. Example of descriptive ontologies in different domains

KR is much broader than KO. In fact, taken from a KR tity class, relation or attribute. Relevant entities in the ge- perspective, documents are just one particular type of en- ography domain are locations and more specific entities, tity with their own properties (with title, author and subject such as rivers and lakes; relevant entities in the person being very important ones) and document search is a spe- domain are people; documents are modeled as those enti- cial case of reasoning. ties which are the target of the creative work domain, with An example of descriptive ontologies covering the ge- title, author and subject being their properties. In particu- ography, creative work and document domains is given in lar, while title and subject are attributes, author is repre- Figure 3. In the picture, each node denotes a different en- sented as a relation between a document and a person. Knowl. Org. 41(2014)No.1 49 F. Giunchiglia, B. Dutta, V. Maltese. From Knowledge Organization to Knowledge Representation
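To make the idea of "documents as entities" more concrete, the following minimal sketch answers the third KR query of Section 2 ("documents about any artist born in Italy between 1450 and 1550") by filtering on properties of the subject entity rather than on its name. It is only an illustration under stated assumptions: the class names `Person` and `Document`, the function `documents_about`, the records `Doc#1` and `Doc#2` and the second artist are hypothetical; only Michelangelo's place and year of birth come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Person:
    """A real-world entity described by its own properties."""
    name: str
    entity_class: str   # e.g. "artist"
    birth_place: str
    birth_year: int


@dataclass(frozen=True)
class Document:
    """A document is one more entity type: title and subject are
    attributes, author is a relation to a Person."""
    title: str
    author: Optional[Person]
    subjects: Tuple[Union[Person, str], ...]


def documents_about(docs, entity_class, birth_place, first_year, last_year):
    """Return documents having at least one subject entity that matches."""
    def matches(subject):
        return (
            isinstance(subject, Person)
            and subject.entity_class == entity_class
            and subject.birth_place == birth_place
            and first_year <= subject.birth_year <= last_year
        )
    return [doc for doc in docs if any(matches(s) for s in doc.subjects)]


# Michelangelo's birth data is taken from the text; everything else is made up.
michelangelo = Person("Buonarroti, Michelangelo", "artist", "Italy", 1475)
placeholder = Person("Placeholder, Artist", "artist", "Wales", 1945)  # hypothetical

catalogue = [
    Document("Doc#1", None, (michelangelo, "sculpture")),  # hypothetical record
    Document("Doc#2", None, (placeholder,)),               # hypothetical record
]

hits = documents_about(catalogue, "artist", "Italy", 1450, 1550)
print([doc.title for doc in hits])
```

The point of the sketch is that the query never mentions the name "Michelangelo": the match is found through the entity's class, place of birth and year of birth, which is exactly the information a subject string alone does not carry.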

Descriptive ontologies are populated with entities and the value of their properties in corresponding domains. For instance, in Figure 4, the geography domain includes the entities “Garda Lake” (as instance of lake) and “Italy” (as instance of country), the creative work domain includes the entity “Book#1” (as instance of book, which in turn is more specific than document) having corresponding title, author and subjects. Notice how the subject string “Garda Lake—history—guide” is represented as three different values of the subject attribute.

In KR, document search is a standard reasoning task over descriptive ontologies. For instance, answering the query: “Give me documents written by Italians about any lake with depth greater than 100m” over the descriptive ontologies in Figure 3 and corresponding entities in Figure 4 amounts to identifying all those entities which: a) are instances of the entity class “document;” and b) with “subject” set to entities that are instances of the entity class “lake” having “depth” greater than “100m;” and c) with “author” set to entities having “nationality” equal to “Italy.” This would return “Book#1,” because: a) it is an instance of the entity class “book” which is more specific than “document;” b) it has “Garda Lake” as subject which is an instance of “lake” and has a “depth” of 346m which is greater than “100m;” and c) its author is “Solitro Giuseppe” who has “nationality” set to “Italy.”

Figure 4. Entities and their properties populating the descriptive ontologies given in Figure 3.

4.0 From descriptive ontologies to description logics

Descriptive ontologies have a straightforward formalization into DL ontologies. With the formalization (Table 1), DL concepts denote either sets of entities or sets of attribute values. DL roles denote either relations or attributes. In other words, a DL interpretation $I = \langle \Delta, \cdot^{I} \rangle$ consists of the domain of interpretation $\Delta = F \cup G$, where $F$ is a set of individuals denoting real world entities and $G$ is a set of attribute values, and of an interpretation function $\cdot^{I}$ where:

$$E_i^{I} \subseteq F, \qquad R_j^{I} \subseteq F \times F, \qquad A_k^{I} \subseteq F \times G, \qquad v_r^{I} \subseteq G \tag{1}$$

that is, each entity class $E_i$ corresponds to a DL concept whose interpretation is a subset of the entities in $F$; each relation $R_j$ corresponds to a DL role whose interpretation is a binary relation between entities in $F$; each attribute $A_k$ corresponds to a DL role whose interpretation is a binary relation between entities in $F$ and attribute values in $G$, restricted by the interpretation of the concepts denoting corresponding attribute values $v_r$ (connected through value-of relations); is-a relations correspond to subsumption (⊑) between concepts or between roles; part-of relations and associative relations correspond to DL roles. And where:

$$e_p^{I} \in F, \qquad r_q^{I} \in F \times F, \qquad a_s^{I} \in F \times G \tag{2}$$
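As a rough illustration of the membership conditions in (1) and (2), the sketch below builds a tiny finite interpretation by hand and checks that every extension stays inside $F$, $F \times F$ or $F \times G$. The individuals are those named in the paper (Garda Lake, Italy, Book#1, Solitro Giuseppe, a depth of 346 m); the dictionary encoding and the names `entity_classes`, `relations`, `attributes` and `satisfies_typing` are assumptions made for illustration and do not correspond to any DL reasoner API.

```python
# Domain of interpretation: entities (F) and attribute values (G).
F = {"Garda Lake", "Italy", "Book#1", "Solitro Giuseppe"}
G = {"346 m", "deep", "shallow"}

# Finite extensions of a few entity classes, relations and attributes.
entity_classes = {
    "lake": {"Garda Lake"},
    "country": {"Italy"},
    "book": {"Book#1"},
}
relations = {
    "part-of": {("Garda Lake", "Italy")},
    "author": {("Book#1", "Solitro Giuseppe")},
}
attributes = {
    "depth": {("Garda Lake", "346 m")},
}


def satisfies_typing(entity_classes, relations, attributes, F, G):
    """Check the set-membership conditions of (1) and (2) on finite extensions."""
    # Every entity class is interpreted as a subset of F.
    classes_ok = all(ext <= F for ext in entity_classes.values())
    # Every relation instance is a pair of entities, i.e. an element of F x F.
    relations_ok = all(
        x in F and y in F for ext in relations.values() for x, y in ext
    )
    # Every attribute instance pairs an entity with a value, i.e. lies in F x G.
    attributes_ok = all(
        x in F and v in G for ext in attributes.values() for x, v in ext
    )
    return classes_ok and relations_ok and attributes_ok


print(satisfies_typing(entity_classes, relations, attributes, F, G))

# An attribute whose value is an entity rather than a value in G violates the typing.
broken = {"depth": {("Garda Lake", "Italy")}}
print(satisfies_typing(entity_classes, relations, broken, F, G))
```

The check only covers the typing of extensions; it says nothing about subsumption or role restrictions, which a real DL reasoner would also enforce.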

In (2), instances $e_p$ of entity classes (connected through instance-of relations) correspond to entities in $F$; instances $r_q$ of relations are elements of the Cartesian product $F \times F$; and instances $a_s$ of attributes are elements of the Cartesian product $F \times G$.

Knowledge in (1) corresponds to what in DL is called the intensional knowledge (TBox), i.e. a set of general statements about what is known in terms of concepts, denoting sets of individuals, and concept properties; such statements constitute the basic terminology and theory of the domain (e.g. persons have a date of birth). Knowledge in (2) corresponds to what in DL is called the extensional knowledge (ABox), i.e. a set of assertions about specific individuals and the actual value of their properties (e.g. the date of birth of Michelangelo Buonarroti is 6th March 1475).

Descriptive ontology element                        DL formalization

TBox
  E1, …, Ep            entity classes               concepts
  R1, …, Rq            relations between classes    roles
  A1, …, As            attributes                   roles
  value-of             hierarchical relation        role restrictions
  is-a                 hierarchical relation        subsumption (⊑)
  part-of              hierarchical relation        roles
  any other relation   associative relations        roles

ABox
  e1, …, en            entities (instances)         individuals in F (entities)
  v1, …, vr            attribute values             individuals in G (values)
  r1, …, rm            relations between entities   role assertions
  a1, …, at            attributes of entities       role assertions
  instance-of          hierarchical relation        concept assertions

Table 1. Formalization of a descriptive ontology into DL

For instance, the descriptive ontology given in Figure 3 for the geography domain and corresponding entities in Figure 4 can be formalized into the TBox and ABox below:

TBox
  location ⊑ ∀direction.location ⊓ ∀depth.{deep,shallow}
  body-of-water ⊑ location
  populated-place ⊑ location
  lake ⊑ body-of-water
  river ⊑ body-of-water
  city ⊑ populated-place
  country ⊑ populated-place
  north ⊑ direction
  south ⊑ direction

ABox
  lake(Garda-lake)
  city(Trento)
  country(Italy)
  depth(Garda-lake, deep)
  part-of(Garda-lake, Trento)
  part-of(Trento, Italy)

5.0 The DERA approach

DERA provides a concrete answer to the need for a suitable approach and methodology for the development of descriptive ontologies which allow scaling to the production of ever growing knowledge, and their exploitation for a highly expressive document search. This in turn allows us to build, on demand, on the basis of the query, the necessary DL theory as described in Section 4.

DERA is a new faceted KR approach for the development of descriptive ontologies and their exploitation for automated reasoning. DERA is faceted as it takes inspiration from category-based systems and in particular from the faceted approach introduced by Ranganathan (1967) and later simplified by Bhattacharyya (1975), thus aiming at the same quality and scalability benefits. However, it clearly differs from them as the original approach aims at the development of classification ontologies.

DERA is entity-centric rather than document-centric. We take an entity to be any object important enough to be denoted with a name. Entities include concrete real world entities such as locations, persons, organizations and events, as well as documents, any creative work and piece of art. One immediate consequence of adopting a KR approach is that DERA is a system of semantic categories, namely categories supporting the specification of the terminology of a domain for the representation (rather than the organization) of the relevant entities (rather than only documents) by their basic properties (thus, not only the subject).

We adopt and extend the notion of domain as originally given in LIS. In DERA, a domain is any area of knowledge or field of study that we are interested in or that we are communicating about that deals with specific kinds of entities. Domains include conventional fields of study (e.g. physics, mathematics), applications of pure disciplines (e.g. engineering, agriculture), any aggregate of such fields (e.g. physical sciences, social sciences), or can even capture knowledge about our everyday lives (e.g. music, movie, sport, recipes, tourism). Domains provide a bird’s eye view of the whole field of knowledge, offer a comprehensive context within which classification and search can be supported (Mills 2004), and words disambiguated (Ciaramita and Altun 2006). Domains have two fundamental properties (Giunchiglia et al. 2012a). They are the main means by which diversity is captured, in terms of language, knowledge and personal experience. For instance, according to local customs the food domain may or may not include bugs. In addition, domains allow scaling as they account for the evolution of knowledge. For instance, in evolving the transportation domain we may extend ground transportation means with electric cars.

Within each domain, entities are described in terms of basic properties and in particular of their entity classes, relations and attributes which therefore become the fundamental categories of our categorization system. Under each fundamental category, terms are arranged into facets, each of them covering a different aspect of the domain. More precisely, we define a facet to be a hierarchy of homogeneous terms describing an aspect of the domain, where each term in the hierarchy denotes a different atomic concept (Giunchiglia et al. 2009b). Facets are further subdivided into sub-facets. Facets (and their subdivisions) are mutually disjoint.

A DERA domain is a triple $D = \langle E, R, A \rangle$ where:

– E (for Entity) is a set of facets grouping terms denoting entity classes, whose instances (the entities) have either perceptual or conceptual existence. Terms in these hierarchies are explicitly connected by is-a or part-of relations.
– R (for Relation) is a set of facets grouping terms denoting relations between entities. Terms in these hierarchies are connected by is-a relations.
– A (for Attribute) is a set of facets grouping terms denoting qualitative/quantitative or descriptive attributes of the entities. We differentiate between attribute names and attribute values such that each attribute name is associated with corresponding values. Attribute names are connected by is-a relations, while attribute values are connected to corresponding attribute names by value-of relations.

The mapping of E/R/A above to DL follows Table 1. The is-a, part-of and value-of relations form the backbone of facets, are assumed to be transitive and asymmetric, and hence are said to be hierarchical. Other relations, whenever defined, not having such properties, are said to be associative and connect terms in different facets. All together facets constitute the TBox of a descriptive ontology. For instance, within the geography domain relevant entities are locations (the main E facet) that may include inter-alia land formations (e.g. continents, islands), bodies of water (e.g. seas, streams), geological formations (e.g. mountains, valleys), administrative divisions (e.g. wards and provinces) and populated places (e.g. cities, villages). Each of them generates a different sub-facet of entity classes. Spatial relations between them may include near, adjacent, in front. They generate sub-facets of relations. Entities may be described in terms of their length (e.g. of a river, with values long and short) or depth (e.g. of a lake, with values deep and shallow). They generate sub-facets of attributes. See the example in Figure 5.

When facets are populated with specific entities of a domain, instance-of relations connect entities to their respective classes in E. Entities are described in terms of attributes (A) and relations (R), each of them being in turn a pair $\langle n, v \rangle$ where n is the attribute or relation name and v is its value consistent with what is defined in A for the attributes and R for the relations, respectively. Entities and their properties which populate the facets constitute the ABox of a descriptive ontology. For instance, the “Garda Lake” (an entity) can be described as an instance of “lake” (entity class in the body of water sub-facet), located in “Italy” (part-of relation) with “depth” (attribute name) of 346 m (quantitative value) which can be considered “deep” (qualitative value).

6.0 Descriptive ontologies in DERA

The methodology engaged in DERA follows a minimal set of guiding principles, described in (Giunchiglia et al. 2012b), which are inspired by the canons and principles described by Ranganathan (1967), and guides through the whole process of constructing and maintaining facets, each of them covering a different aspect of the domain. However, in contrast to the original approach, DERA aims at the development of facets as descriptive ontologies (rather than classification ontologies). The main steps in the methodology are as follows:

6.1 Step 1—Identification of the atomic concepts

Relevant terms of the domain in natural language (e.g. in English or Italian) are collected, examined and disambiguated into atomic concepts. Terms are collected primarily by interviewing domain experts and by reading available literature about that particular domain including inter-alia indexes, abstracts, glossaries, reference works. Analysis of query logs, when available, can be extremely valuable to determine users' interests. Collected terms are then examined and disambiguated into atomic concepts. Terms with the same meaning (synonyms) are grouped together and are given a natural language description that makes explicit the intended meaning. This corresponds to what in the faceted approach is called the verbal plane and what in (Giunchiglia et al. 2006, Giunchiglia et al. 2012a) is called
the natural language level. Each group of terms denotes a different atomic concept and is subsequently classified alternatively as an entity class (E), relation (R) or attribute (A). This corresponds to what in the faceted approach is called the idea plane and what in (Giunchiglia et al. 2006, Giunchiglia et al. 2012a) is called the formal language level. For instance, we can recognize that in the geography domain the terms “stream” and “watercourse” are synonyms whose meaning can be described as “a natural body of running water flowing on or under the earth” (natural language) and that the group denotes an entity class (one atomic concept at formal language level), that is: “(E) watercourse, stream: a natural body of running water flowing on or under the earth.” This is different from the original faceted approach, not only in terms of categories, but also because in Ranganathan’s approach synonyms and definitions are not explicitly given. Vocabulary control is instead considered by Bhattacharyya (1982).

ENTITY
Location
  Landform
    (is-a) Natural elevation
      (is-a) Continental elevation
        (is-a) Mountain
        (is-a) Hill
      (is-a) Oceanic elevation
        (is-a) Seamount
        (is-a) Submarine hill
    (is-a) Natural depression
      (is-a) Continental depression
        (is-a) Valley
        (is-a) Trough
      (is-a) Oceanic depression
        (is-a) Oceanic valley
        (is-a) Oceanic trough
  Body of water
    (is-a) Flowing body of water
      (is-a) Stream, Watercourse
        (is-a) River
        (is-a) Brook
    (is-a) Still body of water
      (is-a) Lake
      (is-a) Pond

RELATION
Direction
  (is-a) East
  (is-a) North
  (is-a) South
  (is-a) West
Relative level
  (is-a) Above
  (is-a) Below
Containment
  (is-a) part-of

ATTRIBUTE
Name
Latitude
Longitude
Altitude
Area
Population
Depth
  (value-of) deep
  (value-of) shallow
Length
  (value-of) long
  (value-of) short

Figure 5. Exemplification of the geography domain in DERA.

6.2 Step 2—Analysis

The atomic concepts are analyzed per genus et differentia, namely in order to identify their commonalities and their differences. The main goal is to identify as many distinguishing properties (called characteristics) as possible of the real world objects represented by the concepts. This allows being as fine grained as wanted in differentiating among the concepts. For instance, we can recognize that in geography for the concept “river” we can identify the following characteristics:

a body of water
a flowing body of water
no fixed boundary
confined within a bed and stream banks
larger than a brook.

This is similar to the faceted approach.

6.3 Step 3—Synthesis

Collected terms are arranged into facets such that at each level of the hierarchy, each of them representing a different level of abstraction, concepts are grouped by a common characteristic. Concepts sharing the same characteristic form an array of homogeneous concepts. Concepts in each array can be further organized into sub-groups (or sub-facets), thus generating a new level in the hierarchy. Child concepts are connected to their parent concept through an explicit is-a (genus-species) or part-of (whole-part) relation. For instance, we can recognize that under the “body of water” facet “stream is-a flowing body of water” and that, due to their commonalities, we could declare “river is-a stream” and “brook is-a stream” by placing them under the same array. Thus, we may progressively obtain the following facet:

Body of water
  (is-a) Flowing body of water
    (is-a) Stream
      (is-a) Brook
      (is-a) River
  (is-a) Still body of water
    (is-a) Pond
    (is-a) Lake.

This is different from the original faceted approach, where genus-species and whole-part relations are left implicit. In fact, as it aims at the creation of classification ontologies, terms are arranged in facets by means of generic hierarchical relations. Among other things, explicit relations make maintenance more rigorous. For example, they facilitate the distinction between transitive and non-transitive relations (Maltese and Farazi 2011).

6.4 Step 4—Standardization

Each atomic concept can be potentially denoted with any of the terms in the group of synonyms. When the group contains more than one term, a standard (or preferred) term should be selected among the synonyms. This is usually done by identifying the term which is most commonly used in the domain and which minimizes the ambiguity. This is similar to WordNet™ (http://wordnet.princeton.edu/) where terms are ranked within the synset and the first one is the preferred one. For instance, in WordNet the term “stream” is preferred to “watercourse”: “(E) stream, watercourse: a natural body of running water flowing on or under the earth.” This is different from the original Ranganathan approach, where only one term is kept in the classification scheme while the others are discarded and external resources are needed to identify synonyms and to get definitions whenever needed. Synonyms and definitions are instead typically provided in more recent faceted schemes.

6.5 Step 5—Ordering

Concepts in each array are ordered. There are several criteria devised by Ranganathan. These include chronological order, spatial order, increasing and decreasing quantity, increasing complexity, canonical order (the order traditionally followed in LIS), literary warrant and alphabetical order. For instance, in the geography domain one may follow the canonical order. This is similar to the faceted approach. Ordering is not considered essential in KR, but it turns out to be very useful for maintenance purposes, for instance to check the level of coverage of a facet or to facilitate the identification of a suitable position for a new concept.

6.6 Step 6—Formalization

The fundamental categories E/R/A are such that this allows for an obvious formalization of corresponding facets into DL ontologies. This step is implicitly performed in LIS. In fact, the formalization includes what in the faceted approach is called the notational plane, i.e. the level where an unambiguous notation is used to synthetically attach meaning and provide order to terms. However, the way in which this is done in DERA makes automation of non-trivial tasks, such as highly expressive document search by entity properties, possible. In fact, document search can be framed in DL as an instance retrieval problem (Baader et al. 2002).

7.0 Related work

In LIS several methodologies have been developed for the construction and maintenance of classification ontologies. In particular, in category-based subject indexing systems relevant terms of a domain are organized into a classification scheme of a few fundamental categories. As the ultimate purpose is the construction of document subjects, such systems are grounded on syntactic categories, namely categories playing a role in the syntax of the subject indexing language, i.e. the language used to construct the subject strings stored in subject indexes. Hierarchies under each fundamental category encode different aspects or facets of the domain knowledge. Approaches differ in the kind and number of categories. Kaiser (1911) proposed Concrete, Process and Country; Vickery (1960) adopted thirteen categories. Ranganathan (1967) postulated Personality, Matter, Energy, Space and Time. Bhattacharyya (1975) simplified the categories proposed by Ranganathan by proposing only Discipline, Entity, Property and Action. In these approaches, facets of general applicability are called common isolates or modifiers (e.g. Language and document Form). However, Ranganathan was the first who proposed and formalized a theory of facet analysis which is widely recognized as a fundamental methodology that guides the creation of high quality classification schemes, in terms of robustness, extensibility, reusability, compactness and flexibility (Broughton 2006). Ranganathan’s approach allows scaling as with domains it is possible to add new knowledge at any time as needed.

On the contrary, KR currently lacks methodologies for the development of descriptive ontologies which allow scaling as much as in KO. In KR, existing approaches to ontology construction and maintenance focus on ontology evaluation (Guarino and Welty 2002), supporting tools (Corcho et al. 2004), general design criteria (Gruber 2003), or on the ontology building process itself (Fernandez-Lopez 1999). In particular, OntoClean (Guarino and Welty 2002) provides meta-properties that impose a set of constraints on the taxonomic structure of ontologies that turn out to be very useful during the building process, in evaluating and improving those (Welty et al. 2004). Welty and Jenkins (1999) proposed an ontology specifically for the description of documents and their subjects, but they neither address any methodological issue nor provide any explicit implementation. Since developing ontologies from scratch is an extremely time-consuming and error-prone task, many approaches have attempted to reuse existing sources (Stuckenschmidt et al. 2004). They range from lexical (e.g. WordNet) to domain-specific resources (such as UMLS and AGROVOC). All these approaches underline the usefulness of domain-specific knowledge (Laursen et al. 2008).

8.0 Conclusions

We have shown that, despite the very successful solutions developed, existing KO approaches to document indexing and search, by employing classification ontologies, are limited in expressivity as they only support queries by document properties. In this respect KR is very powerful and potentially boundless as, by employing descriptive ontologies, it supports queries by any entity property. This motivates a move from a purely KO to a KR approach to document search. From a pragmatic point of view, however, KR has so far failed as it lacks appropriate methodologies which allow scaling as much as in KO.

In this paper we presented the new DERA faceted KR approach and a corresponding methodology, inspired by the faceted approach, for the development of high quality and scalable descriptive ontologies. It allows modeling relevant entities of the domain (including documents) and their properties and enables automated reasoning. In particular, it supports a highly expressive search of documents exploiting entity properties. By bridging between KO and KR, we compensate for the limitations and leverage the respective strengths of these two approaches. In fact, we inherit quality and scalability properties of the faceted approach from KO as well as the expressiveness and effectiveness of search from KR. Because of the methodology followed, DERA domains are flexible, reusable, and allow scaling and coping with the diversity of the world and the evolution of knowledge. Automated reasoning is made possible because the fundamental categories E/R/A are such that this allows for a straightforward formalization of corresponding facets into standard DL ontologies.

As future work, we plan to experiment with DERA in vertical domains and to develop a collaborative platform for the construction and maintenance of domains. Up to this point, the methodology has already proved effective in experiments conducted in the geography domain, for instance for the encoding of the relevant knowledge (Giunchiglia et al. 2012b) and the search of maps in semantic geo-catalogues (Farazi et al. 2012). In particular, in (Giunchiglia et al. 2012b) we describe the development of a faceted descriptive ontology using DERA for the geography domain, that we called Space, which includes more than 1000 concepts and around 7 million spatial entities mainly taken from GeoNames and the Getty Thesaurus of Geographic Names (TGN); in (Farazi et al. 2012) we describe how the usage of a faceted descriptive ontology in combination with standard AI tools results in a significant improvement in search. Furthermore, in recent years our efforts have been directed to the development of a new system that we called Universal Knowledge Core, and a collaborative platform for the employment of experts for the construction and maintenance of such ontologies. It is our plan to evaluate the costs of these activities even if our guess is that it will be comparable to the costs required for standard KOS. In fact, we believe that the cost of producing a descriptive ontology is not significantly higher than the cost of building a standard KOS with the advantage that the produced ontology would have a broader applicability than the latter.

References

Baader, Franz, Calvanese, Diego, McGuinness, Deborah, Nardi, Daniele and Patel-Schneider, Peter. 2002. The description logic handbook: Theory, implementation and applications. Cambridge University Press.
Bhattacharyya, G. 1975. POPSI: its fundamentals and procedure based on a general theory of subject indexing languages. Library science with a slant to documentation 16 no. 1: 1-34.
Bhattacharyya, G. 1982. Classaurus: its fundamentals, design and use. In Dahlberg, Ingetraut and Perreault, Jean M., eds., Universal classification: subject analysis and ordering systems: proceedings, 4th International Study Conference on Classification Research, 6th Annual Conference of Gesellschaft für Klassifikation e. V. , 28 June-2 July 1982. : Indeks Verlag, pp. 139-48.
Berners-Lee, Tim, Hendler, James and Lassila, Ora. 2001. The semantic web. Scientific American 284 no. 5: 28-27.
Bouquet, Paolo, Giunchiglia, Fausto, van Harmelen, Frank, Serafini, Luciano and Stuckenschmidt, Heiner. 2004.

Contextualizing ontologies. Journal of web semantics 1: Giunchiglia, Fausto, Dutta, Biswanath and Maltese, Vin- 325-43. cenzo. 2009b. Faceted lightweight ontologies. In Bor- Broughton, Vanda. 2006. The need for a faceted classifica- gida, Alexander T., Chaudhri, Vinay K., Giorgini, Paolo tion as the basis of all methods of information re- and Yu, Eric S., eds., Conceptual modeling, foundations and trieval. Aslib proceedings 58: 49-72. applications: essays in honor of John Mylopoulos. Berlin: Casson, Emanuela, Fabbrizzi, Andrea and Slavic Aida. Springer, pp. 36-51. 2009. Subject search in Italian OPACs: An opportunity Giunchiglia, Fausto, Maltese, Vincenzo and Dutta, Bi- in waiting? In Landry, Patrice, Bultrini, Leda, O'Neill, swanath. 2012a. Domains and context: First steps to- Edward T. and Roe, Sandra K, eds., Subject access: prepar- wards managing diversity in knowledge. Journal of web ing for the future. Berlin: De Gruyter, pp. 37-50. semantics 12-13: 53-63. Ciaramita, Massimiliano and Altun, Yasemin. 2006. Broad- Giunchiglia, Fausto, Dutta, Biswanath, Maltese, Vincenzo coverage sense disambiguation and information extrac- and Farazi, Feroz. 2012b. A facet-based methodology tion with a supersense sequence tagger. In EMNLP '06 for the construction of a large-scale geospatial ontol- Proceedings of the 2006 Conference on Empirical Methods in ogy. Journal on data semantics 1: 57-73. Natural Language Processing. Stroudsburg, PA: Associa- Godby, Carol Jean. 2013. The relationship between BIB- tion for Computational Linguistics, pp. 594-602. FRAME and the schema.org ‘bib extensions’ model: A work- Corcho, Óscar, Gómez-Pérez, Asunción, Gonzalez- ing paper. Dublin, Ohio: OCLC Research. Cabero, Rafael and Suarez-Figueroa, Carmen. 2004. Gruber, Thomas R. 2003. Towards principles for the de- ODEval: A tool for evaluating RDF(S), DAML+OIL, sign of ontologies used for knowledge sharing. In and OWL concept taxonomies. In Bramer, M.A. and Guarino, N. 
and Poll, R., eds., Formal ontology in conceptual Devedzic, Vladan, eds., Artificial intelligence applications analysis and knowledge representation, Padova, Italy. and innovations : IFIP 18th World Computer Congress : TC12 Guarino, Nicola and Welty, Christopher. 2002. Evaluating First International Conference on Artificial Intelligence Applica- ontological decisions with OntoClean. Communications tions and Innovations (AIAI-2004), 22-27 August 2004, of the ACM 45: 61-5. Toulouse, France, pp. 369-82. Kaiser, James. 1911. Systematic indexing. London: Isaac Pit- Farazi, Feroz, Maltese, Vincenzo, Dutta, Biwanath, Iva- man & Sons. nyukovich, Alexander and Rizzi, Veronica. 2012. A se- Lauser, Boris, Johannsen, Gudrun, Caracciolo, Caterina, mantic geo-catalogue for a local administration. Artifi- Keizer, Johannes, van Hage, Willem Robert and Mayr, cial intelligence review 40 no. 2: 1-20. Philipp. 2008. Comparing human and automatic the- Fernández-López, Mariano. 1999. Overview of method- saurus mapping approaches in the agricultural domain. ologies for building ontologies. In Benjamins, V.R., In Greenberg, Jane and Klas, Wolfgang, eds., Metadata Chandrasekaran, B., Gómez-Pérez, A., Guarino, N. and for semantic and social applications proceedings of the interna- Uschold, M., eds., Proceedings of the IJCAI-99 workshop on tional conference on Dublin Core and Metadata Applications, ontologies and problem-solving methods: lessons learned and fu- Berlin, 22-26 September 2008, DC 2008: Berlin, Germany. ture trends. Amsterdam: Aachen: University of Amster- Gottingen:̈ Universitatsverlag Gottingen,̈ pp. 43-53. dam; CEUR, pp. 4,1-13. Library of Congress. 2007. Library of congress subject head- Functional Requirements for Bibliographic Records: Final ings: Pre- vs. post-coordination and related issues. Washington: Report (München: Saur, 1998), as amended and cor- Library of Congress. 
Available http://www.loc.gov/ rected through February 2009, accessed September 30, catdir/cpso/pre_vs_post.pdf. 2011. http://www.ifla.org/files/cataloguing/frbr/frbr_ Maltese, Vincenzo and Farazi, Feroz. 2011. Towards the in- 2008.pdf. tegration of knowledge organization systems with the Giunchiglia, Fausto, Marchese, Maurizio and Zaihrayeu, linked data cloud. In Slavic, Aida and Civallero, Ed- Ilya. 2006. Encoding classifications into lightweight on- gardo, eds., Classification and ontology: Formal approaches and tologies. Journal of data semantics 8: 57-81. access to knowledge: Proceedings of the International UDC Giunchiglia, Fausto, Kharkevich, Uladzimir and Zai- Seminar, 19-20 September 2011, The Hague, The Netherlands. hrayeu, Ilya. 2009a. Concept search. In Aroyo, Lora, Würzburg: Ergon Verlag, pp. 74-90. Traverso, Paolo, Ciravegna, Fabio, Cimiano, Philipp, Mills, Jack. 2004. Faceted classification and logical division Heath, Tom, Hyvonen, Eero, Mizoguchi, Riichiro, in information retrieval. Library trends 52: 541-70. Sabou, Marta and Simperl, Elena, eds., The semantic web: Patton, Glenn, ed. 2009. Functional Requirements for Au- research and applications: 6th European Semantic Web Confer- thority Data: A Conceptual Model München: K. G. ence, ESWC 2009, Heraklion, Crete, Greece, May 31-June 4, Saur. Available: http://www.ifla.org/publications/ 2009: proceedings. Berlin: Springer, pp. 429-44. functional-requirements-for-authority-data. 56 Knowl. Org. 41(2014)No.1 F. Giunchiglia, B. Dutta, V. Maltese. From Knowledge Organization to Knowledge Representation

Prud’hommeaux, Eric and Seaborne, Andy. 2006. SPARQL Using C-OWL for the alignment and merging of medical on- query language for RDF. W3C working draft. Available tologies. Available http://eprints.biblio.unitn.it/523/1/ http://www.w3.org/TR/2006/WD-rdf-sparql-query- 010.pdf. 20061004/. Vickery, Brian Campbell. 1960. Faceted classification: A guide to Ramakrishnan, Raghu and Gehrke, Johannes. 2000. Data- the construction and use of special schemes. London: ASLIB. base management systems. McGraw-Hill. Welty, Christopher and Jenkins, Jessica. 1999. Formal on- Ranganathan, S. R. 1967. Prolegomena to library classification. tology for subjects. Journal on data and knowledge engineer- London: Asia Pub. House. ing 32: 155-81. Resource description and access. 2010. Chicago: American Li- Welty, Christopher, Mahindru, Ruchi and Chu-Carroll, Jen- brary Association; Ottawa: Canadian Library Associa- nifer. 2004. In Proceedings: Nineteenth National Conference on tion; London: Chartered Institute of Library and In- Artificial Intelligence (AAAI-04): Sixteenth innovative Appli- formation Professionals (CILIP). In RDA Toolkit: cations of Artificial Intelligence Conference (IAAI-04). Ameri- http://www.rdatoolkit.org. can Association for Artificial Intelligence, pp. 311-6. Stuckenschmidt, Heiner, Van Harmelen, Frank, Serafini, Luciano, Bouquet, Paolo and Giunchiglia, Fausto. 2004.

Knowl. Org. 41(2014)No.1 57 E. Konkova, A. Göker, R. Butterworth, A. MacFarlane. Social Tagging: Exploring the Image, the Tags, and the Game

Social Tagging: Exploring the Image, the Tags, and the Game

Elena Konkova*, Ayşe Göker**, Richard Butterworth* and Andrew MacFarlane*

*Centre for Interactive Systems Research, City University London, Northampton Square, London EC1V 0HB, England; **School of Computing Science and Digital Media, Robert Gordon University, St Andrew St., Aberdeen, AB25 1HG, Scotland

Elena Konkova is a former research assistant at City University London. She has an MSc degree in information systems and technology with a dissertation project on social participation in image collection management. Elena worked on a number of research projects in areas including image retrieval, social web analysis, and user and context modelling.

Ayşe Göker is a professor based at Robert Gordon University in Aberdeen. She has over 20 years of research experience in areas including context-learning algorithms, web user logs, personalization and mobile information systems. Her work involves a strong user-centred approach to algorithm and search-system design, development and evaluation. Prof. Göker also holds a lifelong Enterprise Fellowship from the Royal Society of Edinburgh and Scottish Enterprise for previous commercialisation work.

Richard Butterworth has a background in both information science and human-computer interaction, and worked as a research fellow in the Centre for Interactive Systems Research at City University London until recently. He gained a Ph.D. from Loughborough University studying formal models of interactive systems in 1997, and worked on several HCI academic projects subsequently. He also investigated digital library systems within both academic and commercial contexts. He has developed digital library systems for The Bridgeman Art Library and The English Folk Dance and Song Society. He now works as an independent consultant and developer.

Andrew MacFarlane is a Reader in information retrieval at City University London, and currently co-directs the Centre of Interactive Systems Research with Stephen Robertson of Microsoft Research Cambridge. He got his Ph.D. in Information Science from the same institution. His research interests currently focus on a number of areas including image retrieval, disabilities and information retrieval (dyslexia in particular), AI techniques for information retrieval and filtering, and open source software development.

Konkova, Elena, Göker, Ayşe, Butterworth, Richard, and MacFarlane, Andrew. Social Tagging: Exploring the Image, the Tags, and the Game. Knowledge Organization. 41(1), 57-65. 21 references.

Abstract: Large image collections on the Web need to be organized for effective retrieval. Metadata has a key role in image retrieval, but it relies on professionally assigned tags, which is not a viable option at this scale. Current content-based image retrieval systems have not demonstrated sufficient utility on large-scale image sources on the web, and are usually used as a supplement to existing text-based image retrieval systems. We present two social tagging alternatives in the form of photo-sharing networks and image labeling games. Here we analyze these applications to evaluate their usefulness from the semantic point of view, investigating the management of social tagging for indexing. The findings of the study show that social tagging can generate a sizeable number of tags that can be classified as interpretive for an image, and that tagging behaviour has a manageable and adjustable nature depending on tagging guidelines.

Received 31 July 2013; Revised 3 September 2013; Accepted 3 September 2013

Keywords: tags, image, tagging, semantic, Flickr, games

1.0 Introduction

A large quantity of social media data (text, audio, video, images, etc.) is uploaded to the web constantly, due to the popularity of digital photo cameras, mobile phones (with cameras) and social networks. Images used to be managed and categorised by librarians and archivists, amongst others. However, professional keyword assignment is too time-consuming to be used effectively on large image collections available on the web. Although a number of content-based image retrieval systems have been launched, they have not demonstrated sufficient utility on large-scale collections like the web. These systems are usually used as a supplement to existing context-based (or metadata-based) image retrieval systems using text, with additional functionality (e.g. search of similar images, search of specific colour scheme, etc).

The main aim of this work is to investigate whether an alternative, social tagging, can efficiently provide images with semantic descriptions, and how the social tagging behaviour can be managed. The work focuses on the following research questions:

– What are the facets of image tags in a popular photo-sharing social network?
– How do these tag facets change in a gaming environment? and,
– Can imposing restrictions on a game along with the provision of guidelines improve the semantic description of images?

To address these questions, a multi-faceted methodology was used. We tackle these research questions by analysing existing tagging behaviour on a popular photo sharing website (Flickr) and investigate the use of games with a purpose (GWAP), widely used to support image indexing. This work aims to provide a clearer picture of tagging-generation environments and their outcomes.

The paper is organized as follows. Related work and research context is presented in section 2. Section 3 describes our methodology based on a modified image attributes classification system. In section 4 we discuss the main results of applying the classification system and an experiment using Games With A Purpose (GWAP). This is followed by a discussion in section 5. Lastly, section 6 presents our conclusions.

2.0 Related work

2.1 Image retrieval systems

According to Ferecatu et al. (2008), the value of interpretive (defined below) and semantically rich keywords for image retrieval is undeniable. However, these keywords cannot be derived automatically from image content, as there is a need for an association between content low-level features (defined below) and the high-level semantic concepts behind them. This kind of reasoning can only be done by a human either through professional description of images or through image tagging in various social applications.

Image retrieval systems can be broadly categorized into two main categories: context-based and content-based (Westman 2009). Context-based (also known as metadata, piggy-back, text-based or concept-based) image retrieval systems use text to describe the image, whereas content-based image retrieval (CBIR) systems employ visual features such as colour, shape, texture and object position for image description. Context-based image retrieval systems have been used since the late 1970s, and are still the predominant method used for image search. They are known to be more efficient and accurate, and are based on assigning metadata to images. Content-based image retrieval (CBIR) is an alternative to a context-based approach, as it does not involve text to describe images. It focuses on low-level features (colour, texture, and shape) in an image. However, CBIR systems are unable to retrieve high-level features such as subject and meaning, which are of primary importance in image search. The discrepancy between low-level visual features and high-level semantic concepts is often referred to as the problem of the semantic gap (Eakins and Graham 1999; Sawant et al. 2010).

2.2 The problem: the known semantic gap

Semantics, with respect to images, is an association between low-level features, such as shapes, colours, textures, and high-level concepts that could be presented by words (Sawant et al. 2010). Smeulders et al. (2000) define the semantic gap as the “lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation.” In other words, it is the difference between the way a human perceives the image and the actual image content. Hare et al. (2006) differentiate between “the gap between the descriptors and object labels” and “the gap between the labelled objects and the full semantics.” We can therefore characterize the semantic gap in two ways. The first gap (the one that lies between feature-vectors of the image and generic objects) is covered by CBIR algorithmic work, whereas the second gap (the one which is between object labelling and high-level reasoning) still needs human intellect as an essential component.

2.3 A solution: the social approach

A method by which web users could add their own searchable keywords to bookmarks, photos, videos, etc. for future retrieval is known as social tagging (collaboratively creating and managing tags), and these descriptive keywords are known as tags (Motive 2005). Several studies (Rorissa 2010; Rafferty and Hidderley 2007) focus solely on tagging behaviour within social networks like Flickr, while other means of contribution, such as crowdsourcing and social games with a purpose, have been less investigated.

2.3.1 Social tagging in photo-sharing networks

An online object can have multiple tags, and objects with the same tags can be grouped together, with the tags themselves being used to create a folksonomy (Gordon-Murnane 2006). The term folksonomy was coined in 2005 by information architect Thomas Vander Wal by combining the words “taxonomy” and “folk” (Dye 2006). Folksonomies can be of two types: the first is a broad folksonomy, which is created by assigning various tags to the same content by different users; the second type is called a narrow folksonomy, where users tag their own content for future retrieval and sharing (Dye 2006). Probably the best known example of a photo-sharing environment is Flickr. Tagging, comments and rating used in this and other systems have a huge impact on image description. Flickr predominantly addresses ‘findability’ within personal content (Dye 2006). Although Flickr is more about narrow folksonomy, where creation of metadata is the business of the person who posts the image, it also has social groups collecting tag specific photos. This is called “tagography.”

2.3.2 Games with a purpose (GWAP)

Sawant et al. (2010) state that along with photo-sharing services, collaborative gaming has significantly influenced the area of image retrieval and interpretation. While tagging in photo-sharing websites is known to be subjective and contains a lot of unidentified and misspelled words, guidelines could be designed to create social games for given tagging behaviour. GWAP or “games with a purpose” are computer games that are designed to use humans' cognitive abilities as a side effect of the playing process. They are used to get people involved in performing tasks that cannot be performed automatically. However, people usually play not because they want to solve “an instant computational problem” (von Ahn and Dabbish 2008), but because they want to be entertained with a fast-paced and enjoyable game. The computation is just a side-effect of a game. Players are motivated to score as many points as possible within some time limit. A variety of online games are available, including image description games such as TagATune, PopVideo and the ESP game.

3.0 Methodology

3.1 Overall Approach

Our work addresses image tagging habits and how we can specify and analyse the means of reaching a semantic description of an image through social tagging applications such as a photo-sharing network and gaming environments.

3.1.1 Tags in photo-sharing networks

Flickr is an online photo-sharing web site which was launched in 2004. It serves as online storage with a sharing facility. The application allows its users to annotate uploaded images with titles, descriptions or tags. Users can also set privacy settings both for visibility and for tagging and commenting activities. Flickr has already been used in previous research (e.g. Van Zwol et al. 2010). The application shows real-world use, storage and classification of images in contrast to laboratory-constructed experiments, and its images are not limited to particular subject domains.

For the evaluation of photo-sharing tags, 130 top tags and 500 random tags from the Flickr collection were selected. Most of the existing research in the area is based on randomly retrieved tags and queries. We inspected the most popular tags contained in the Flickr collection in the form of a tag cloud in order to show the overall trend. We then examined five hundred tags from the Flickr-based CoPhIR collection (Bolettieri 2009) which were randomly selected and analyzed in order to understand the nature of average tagging behaviour in a photo-sharing environment.

3.1.2 Tags in image-labelling games

The aim of the next theme in our research was to analyse the influence of collaboration and predefined tagging guidelines for conceptual tagging improvement. We use two games to do this. The first game (Image-Labelling Game 1) was designed based on a Google Image Labeller mechanism. It was used to analyze the default image-tagging behaviour during a game. The second game (Image-Labelling Game with Guidelines 2) was a modification of the first game through changing the rules by assigning each image a list of taboo/forbidden words to prevent players from describing an image with visual entities like colours (red, blue) and explicitly-presented objects (girl, house). This encouraged a more semantic-oriented approach in image description in comparison with the first game, and motivated players to tag images more conceptually (happiness, joy). These ‘taboo’ words were defined by the first author. As with the majority of existing image-labelling games, both our games are collaborative in nature (Goh et al. 2010b).
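The taboo-word rule of the second game can be illustrated with a minimal sketch. The paper does not describe how the rule was enforced in software, so the function name `accept_tag`, the case-insensitive comparison, and the example taboo list (built from the example words above) are all assumptions made for illustration only.

```python
def accept_tag(tag, taboo_words):
    """Return True if a player's tag is allowed for the current image.

    Assumption: tags are compared case-insensitively after trimming
    surrounding whitespace; the paper does not specify this.
    """
    return tag.strip().lower() not in taboo_words


# Hypothetical taboo list for one image, using the paper's example words.
taboo = {"red", "blue", "girl", "house"}

for candidate in ["Red", "girl", "happiness", "joy"]:
    status = "accepted" if accept_tag(candidate, taboo) else "rejected (taboo)"
    print(f"{candidate}: {status}")
```

Under this sketch, visual words such as "Red" and "girl" are rejected while conceptual tags such as "happiness" and "joy" pass, which is the shift in tagging behaviour the guidelines aim for.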

For each game two players were randomly chosen from all potential players. In each round, both players were given the same image as an input. Within a time limit players had to produce and match on as many descriptive keywords (tags) as possible based on the given image. For each match the players obtained 50 points and were notified of the outcome. The final score was a sum of match points, therefore players were rewarded for agreement on the number of tag matches with other players. They did not have to produce the tags at the same time. There was no “correct” tag. The main aim was for a player's partner to guess the same tag, helping to avoid biased image description. Although participants were co-located in the same lab, they did not know who their partners were, and direct communication among participants was prohibited.

This approach was used to cover the main characteristics of social image-tagging behaviour and to analyze the usefulness and success factors of social input for semantic image tagging. We used this approach to investigate the output in different social-based image environments and to provide an indication of how human knowledge can be used to bridge the semantic gap between image objects and high-level reasoning, which cannot be achieved automatically (see above).

3.2 Classification for tag analysis

In order to analyse tags, it is necessary to understand image attributes, that is, features that can include visual, as well as spatial, semantic or emotional characteristics (Jorgensen 1996). There are many frameworks for classification of image attributes. Some of them are oriented towards indexing (Jaimes and Chang 2000), some towards searching (Chung and Yoon 2011), and some combine both, concentrating on image descriptions which can be both search terms and indexing terms (Jorgensen 1996; Westman 2009).

For tag analysis we have chosen the following classification method. The coding of tags was done in two steps. First of all, tags were assigned to the following levels of image attributes: 1) metadata features, 2) primitive features, 3) visible general objects, and 4) semantic features (see table 1). Then, as Level 4 (semantic tags) is of primary interest for this work, tags which fell into this category were analysed according to further facets: who, what, where and when. The coding system was initially tested on a sample set of descriptive words.

Employed classification | Facet | Jorgensen class | Jorgensen group
Non-visual features (metadata) | - | Art historical information | Interpretive
Primitive syntactic features, which include colours, shapes, textures, orientation and arrangement | - | Color; Visual elements; Descriptions; Location | Perceptual
Visible objects/people in the image, as well as generic spatial features, which could be recognized by global colour analysis | - | Literal objects; People | Perceptual
Semantic (conceptual) features involving interpretation of the meaning and purpose of the visual features | Who? | People qualities | Interpretive
(as above) | What? | Content/Story; Abstract concepts | Interpretive
(as above) | Where? | Content/Story | Interpretive
(as above) | When? | Content/Story | Interpretive

Table 1. Comparison of our employed classification with Jorgensen’s framework

The chosen classification scheme is derived from the literature and corresponds to existing frameworks (Jaimes and Chang 2000; Jorgensen 1996). Table 1 compares the employed classification with Jorgensen’s framework. It contains levels of non-visual, visual and conceptual information. The main difference is that this classification consists of four levels, splitting visible objects from the interpretations of visible objects (people vs. family or friends). It is similar to Jorgensen’s (1996) division of image attributes into perceptual and interpretive groups. The term “perceptual” refers to things in the image, e.g. person, ship, beach, whereas the term “interpretive” refers to a subjective view of what is happening in the image, e.g. person laughing, having a good time, being sad, etc. This differentiation will help to evaluate the significance of interpretive attributes for image description in contrast to perceptually visible objects that could be indexed by automatic indexing algorithms. The derived image attributes’ levels are listed below:

Level 1 – non-visual metadata features: contain information about the author of the image, creation/upload date, photo camera characteristics, etc.
Level 2 – primitive syntactic features: are the basis for CBIR systems and include colours (yellow, green, hue, saturation, brightness), shapes (round, triangle) and textures (a texture of a tissue, bricks, orange peel).
Level 3 – visible objects/people on the image: are usually generic in nature (ball, chair, child).
Level 4 – semantic (conceptual) features: involve interpretation of the meaning and the purpose of the visual features (see below).
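The four-level coding scheme and its mapping onto Jorgensen's two groups can be written down as a small data structure. The following is a minimal sketch, not the authors' tooling: the names `Level`, `JORGENSEN_GROUP` and `group_shares` are hypothetical, and the grouping (Levels 1 and 4 as interpretive, Levels 2 and 3 as perceptual) is read off Table 1.

```python
from enum import Enum


class Level(Enum):
    METADATA = 1   # non-visual metadata features
    PRIMITIVE = 2  # primitive syntactic features (colour, shape, texture)
    OBJECT = 3     # visible objects/people
    SEMANTIC = 4   # semantic (conceptual) features


# Level 4 tags are coded further by facet.
FACETS = ("who", "what", "where", "when")

# Grouping read off Table 1: Levels 1 and 4 are interpretive,
# Levels 2 and 3 are perceptual.
JORGENSEN_GROUP = {
    Level.METADATA: "interpretive",
    Level.PRIMITIVE: "perceptual",
    Level.OBJECT: "perceptual",
    Level.SEMANTIC: "interpretive",
}


def group_shares(level_shares):
    """Sum per-level percentages into Jorgensen's two groups."""
    totals = {"interpretive": 0.0, "perceptual": 0.0}
    for level, share in level_shares.items():
        totals[JORGENSEN_GROUP[level]] += share
    return totals


# Percentages reported for the top Flickr tags in Section 4.1.
top_tags = {
    Level.SEMANTIC: 63.4,
    Level.OBJECT: 20.1,
    Level.PRIMITIVE: 10.5,
    Level.METADATA: 6.0,
}
shares = group_shares(top_tags)
print({group: round(value, 1) for group, value in shares.items()})
```

Applied to the level percentages reported for the top Flickr tags in Section 4.1, this grouping gives the 69.4% interpretive and 30.6% perceptual split stated there.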

As the primary interest of this work is the influence of social tagging on bridging the semantic gap, Level 4 tags are analysed in more detail. Based on a combination of Enser et al. (2007) and Sawant et al.’s (2010) definitions of semantic levels, Level 4 tags are divided into four groups:

– Who: Who is portrayed on the image? The facet in- cludes specific naming of people (John, Michael Jack- son), general naming of professions (lawyer, business- man) and the naming of people’s groups (family, cou- ple, crew). – What: What does the picture portray? The facet deals with visual semantic interpretation (gift, education, football, etc), aesthetical and emotive features (cute, sexy, happy, etc). – When: When is the picture taken? This facet identifies person-specific (birthday), community-specific (New Year, Second World War), global events (swimming, skiing, cooking, etc.), time with no direct visual pres- ence presented as natural values (night, autumn, etc.), artificial values (year, week, era), and specific values (1st January, 2011, 8:15 am, etc). – Where: Where is the picture taken? This facet is associ- ated with “geographically-grounded places” (London, Brazil, etc) and “non-grounded” entities (restaurant, Figure 1. Top Flickr tags' distribution.

museum, etc.) (Enser et al., 2007; Sawant, 2010). depends on human interpretative abilities and preliminary 4.0 Results knowledge about the photograph subject and the history of creation. The location facet (52%) dominated among Here we describe the results of analysis of the photo- semantic tags. This could be explained by the fact that the sharing network (Flickr) tags and the two games that fol- majority of images on Flickr are people’s vacation and lowed. travelling photographs and are tagged with visited geo- graphical places. The “who” facet remains the least popular 4.1 Tagging behaviour in a photo-sharing network (6%) among tags, while “what” and “when” facets share the 2nd (25%) and the 3rd (17%) places respectively. The following analysis is based on information publicly Along with the most popular tags, it was useful to ana- available on Flickr. The information about popular tags lyse average tagging behaviour. Five hundred distinct tags was retrieved on 07/05/2011 from the Flickr tag cloud. were selected from the CoPhIR database for further analy- After data analysis, the plural forms of nouns and spelling sis. The Flickr collection has many less ‘meaningful’ tags variations were eliminated leaving 134 tags for further that could only be understood by people knowing the em- classification. We aimed to analyze the current state of ployed abbreviation or term. This was the reason behind tagging behaviour on Flickr and the nature of tags based creating a “Meaningless” category in the following analysis. on the chosen classification scheme. Figure 1 shows the Although there is functionality in Flickr to enter a phrase distribution of the top Flickr tags. 
Level 4 (semantic) tags remained the most popular (63.4%), and this percentage was considerably higher than those of subsequent categories. The next most popular category was Level 3 (visible objects) tags with 20.1%. Level 2 (primitive features) comprised 10.5% and Level 1 (metadata) comprised 6% of tags. Using Jorgensen's (1996) classification, the number of interpretive tags (69.4%) was considerably higher than that of perceptual tags (30.6%). These figures mean that tagging in a photo-sharing environment heavily

tag in a form of separate words, many users prefer to type in phrases as one word, e.g., "summervacation." For the purpose of tag content analysis, phrases were treated separately. However, it should be noted that in practice the majority of the genuine tags of this type will not support image retrieval with the usual queries. Another peculiarity of Flickr tags is the presence of a tag category naming Flickr group names. Although these tags are difficult to interpret, those that were found were inserted into the metadata class (Level 1). In order to preserve as much information as possible, the non-English words (Spanish, French, etc.) were translated with the help of Google Translator and the meaning of a number of words was checked with Wikipedia.

Tags were analyzed, and plural forms and spelling variations were reduced, leaving 468 tags for further analysis. Figure 2 shows the random Flickr tag distributions. About 11% of the remaining tags were coded as meaningless, including examples such as numbers (6, 17, 812), not generally-accepted abbreviations (co, kma, haas, etc.), website names (httpwwwflickrcomphotosliyin), and symbols (½ï¿½, ä¸æµ). The majority (65.4%) of the rest of the tags fell into the Level 4 group. Using the Jorgensen categorization, perceptual tags (Level 2 and Level 3) were 19% of the total. The distribution of levels is similar to the popular Flickr tag distribution, with the only difference being that Level 1 (metadata) tags are more popular in a random sample set compared with a top sample set. These Level 1 tags comprise about 5% of all tags and mostly include camera and lens information (fuji film pro 400h, 45mm), as well as names of the groups, creators, and genres (anime, self-portrait, etc.). Level 2 (primitive features) tags were the least popular (2.1%) and were predominantly composed of colour names (amber, catchy colours, grey) and image orientation (landscape, portrait).

Figure 2. Random Flickr tags distribution

The next step included coding and analysis of Level 4 tags. The tags were analyzed without inspecting the image they were assigned to, and therefore a number of ambiguous and polysemous words were assigned to several semantic facets. Most of the tags (41%) represented the location facet, which is similar to the top Flickr tag distribution. In contrast to the most popular tags, random tags more often belonged to the "who" facet (19%). The reason behind this difference is the diversity of names. Although photos are quite often tagged with people's names, none of these is widely used in the top tag set.

Interestingly, in contrast to a traditional filing system of image storage, where people tend to organize their collection chronologically, Flickr users are more location-oriented. However, the second refinement in both systems is event information, which in the classification system employed was assigned to the 'what' facet and partially, if it was a community seasonal event like Christmas or Halloween, to the 'when' facet. It could be argued that online systems like Flickr or Facebook provide easier access to tagging functionality for users. However, PC applications like Picasa also offer their users the functionality to identify people.

To conclude, it should be said that most of the user-assigned tags are by nature interpretive. In social networks and photo-sharing websites this is more evident, as the main purpose of these online communities is story-telling by means of pictures, hence the dominance of the interpretive category. This explains why images were described with information like place name and history, event and event participants.

4.2 Experimental gaming environment for image tagging

For each game similar sets of 20 images were selected from the CoPhIR image database, each of which contains information on the same concept. In each game 10 postgraduates with no or partial preliminary knowledge of the topic participated. Out of ten participants seven were female and three were male, all in the 20-39 years age range. Over half of the participants had an IT background. Other professional areas represented were law, finance, journalism and social science. Participants' tagging experience came mostly from the tagging of friends on Facebook photos; however, two participants were regular Flickr/Picasa users and one had no tagging experience at all. None of the participants had played online games on a regular basis. The majority had no or only a vague idea about games with a purpose. Each game was conducted for 20 minutes, collecting 590 and 342 tags for Game (1) and Game (2) respectively. After the first analysis stage, all duplications and spelling variations for each image were excluded, leaving 354 and 250 tags for further analysis.
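The clean-up step described above (excluding duplications and spelling variations for each image) can be illustrated with a minimal sketch. It is not the authors' procedure, which the text does not specify: the function names, the `variants` lookup table and the sample tags are all hypothetical, and spelling variants are assumed to be resolved through a hand-built table rather than detected automatically.

```python
from collections import defaultdict


def normalise(tag, variants):
    """Lower-case a tag, collapse whitespace, and map known variants
    onto one preferred spelling (the table is an assumption)."""
    key = " ".join(tag.lower().split())
    return variants.get(key, key)


def dedupe_per_image(raw_tags, variants):
    """raw_tags: iterable of (image_id, tag) pairs.
    Returns {image_id: [unique normalised tags in first-seen order]}."""
    seen = defaultdict(dict)  # inner dict keeps insertion order
    for image_id, tag in raw_tags:
        seen[image_id].setdefault(normalise(tag, variants), None)
    return {image_id: list(tags) for image_id, tags in seen.items()}


# Hypothetical sample: six raw tags over two images.
raw = [
    ("img1", "Halloween"), ("img1", "holloween"),
    ("img1", " Pumpkin "), ("img1", "pumpkin"),
    ("img2", "Halloween"), ("img2", "football"),
]
variants = {"holloween": "halloween"}  # hand-built, illustrative only

cleaned = dedupe_per_image(raw, variants)
print(cleaned)
print(sum(len(tags) for tags in cleaned.values()), "tags remain")
```

Duplicates are removed per image rather than across the whole collection, so the same tag assigned to two different images is counted twice, which is consistent with the per-image wording used in the text.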

Figure 3. Game 1 tags' distribution
Figure 4. Game 1 matched tags
Figure 5. Game 2 tag distribution
Figure 6. Game 2 matched tags

The facet analysis and distribution of these tags are shown in Figures 3-6.

4.2.1 Image-labelling game (1)

The main outcome of this collaborative game was that most of the tags were interpretive (63.6%); however, the percentage of perceptual descriptions (Level 2 and Level 3) was also quite high (36.4%). The majority of the interpretive tags included semantic interpretation of visual objects/scenes (football, kitchen, tombs, etc.), aesthetic and emotive features (sadness, peace, cute, etc.), and activities (cooking, sleeping, etc.). The absence of metadata (Level 1) tags is explained by the lack of knowledge about the images' background information. Matched tags made up 14% of the game's outcome. Perceptual and interpretive matched tags were equally represented. The majority of perceptual tags were general objects, scenes and people (man, umbrella, sky, etc.). The distribution of semantic (Level 4) matched tags was similar to the distribution of all semantic Game 1 tags, with a prevalence of the concept semantic facet (i.e. the 'what'), followed by person, location and time facets.

According to a number of studies, image-labelling games are recognised as a good source of image tags (von Ahn and Dabbish 2008). This study indicates that the game's outcome within an unrestricted game scenario has provided evidence for a balanced image description with general and interpretive words. However, due to the development of CBIR systems and the enhancement of object description algorithms, the need for object naming could be less important than image semantic interpretation, which cannot be achieved through computer-based algorithms. Thus, in order to benefit from human input, there is a need for image tagging guidelines which prompt for more semantic, interpretive tagging.

4.2.2 Image-labelling game with guidelines (2)

The second collaborative game imposed restrictions on players, forbidding the use of words representing visual entities, e.g. colour and explicitly-presented objects. The major outcome of this experiment was that the large majority of the tags (90.8%) were semantic interpretive words with a prevalence of "what" tags, with many fewer tags representing "who," "when" and "where." For example, one image was described with the following words: fans, victory, sport, cheering, team, happiness, support, fun, friendship, passion, game, exciting, pleasure, football. Matched tags made up 10.4% of the game's outcome, which is slightly less than in the first game (14%). The taboo word list reduced the number of matched perceptual words, which made up only 15.4%, whilst eliminating primitive feature (Level 2) tags (colours, shapes, etc.). The majority of Level 4 tags are "what" concept words, with "who" and "where" concepts used with much less frequency. The absence of "when" facet tags in the matched group could be mostly explained by spelling variations/errors/typos such as "Halloween/Holloween/Hallowen," etc.
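The effect of spelling variation on tag matching can be sketched as follows. This is an illustration only, assuming that a tag counts as matched when both players enter it; the similarity threshold of 0.85, the function names and the sample tags are assumptions and not part of the study. The sketch uses `difflib.SequenceMatcher` from the Python standard library to compare an exact-match rule with a tolerant one.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold):
    """True if two tags are at least `threshold` similar (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def matched_tags(player_a, player_b, threshold=1.0):
    """Tags of player A that match some tag of player B.
    threshold=1.0 means exact matching; lower values tolerate typos."""
    return [a for a in player_a
            if any(similar(a, b, threshold) for b in player_b)]


# Hypothetical tags entered by two players for the same image.
player_a = ["Halloween", "pumpkin"]
player_b = ["Holloween", "pumpkin"]

exact = matched_tags(player_a, player_b)                    # exact rule
tolerant = matched_tags(player_a, player_b, threshold=0.85)  # assumed cut-off

print(exact)     # ['pumpkin']
print(tolerant)  # ['Halloween', 'pumpkin']
```

Under the exact rule the "when" tag is lost because of a single misspelled letter, whereas the tolerant rule recovers it. The price is a risk of false matches between genuinely different short words, so any threshold would need to be tuned against real tag data.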

5.0 Discussion

Although previous research, e.g. Rorissa (2010), showed that more perceptual attributes (colour, shapes, objects) were used for image descriptions of Flickr images, the results of this work show that tags can also be interpretive. Flickr users tend to assign specific names and geographical locations, as well as generally describe images by naming the general events and concepts presented. However, the number of tags for perceptual visual features tends to be lower than for conceptual features. These findings correspond with previous research on search image attributes, e.g. Chung and Yoon (2011), which found that the semantic (conceptual) category of image attributes is the most popular among users' queries.

On this evidence games with a purpose (GWAP) are a useful application for image tagging, and could be used for various purposes depending on the game's rules and winning conditions. Within unconditional gaming environments, players tend to use a balance of perceptual and interpretive image attributes. However, the limitation on words that could be used for tagging may stimulate players' interpretive descriptions. This helps to employ human abilities beneficially, without duplicating data that can be extracted by CBIR or automatic indexing systems. According to the results of this study, the variety of social tagging applications could satisfactorily generate semantic descriptions of images. Although photo-sharing networks support more balance in terms of semantic facets tagging, games with a purpose can be used to augment the tagging process. However, the design of the game needs to be very clearly thought out (Goh and Lee 2011), and there is some evidence that tagging images normally may outperform either collaborative or competitive games. Games may also generate different types of noise (Wang et al. 2012) than standard tagging (e.g. bias of the participants). Goh et al. (2010a) provide some evidence which conflicts with the earlier study, i.e. that competitive games produced the best result. Designers therefore need to be clear about how to engage players and reward them for providing high quality tags in order to obtain the best possible outcome.

6.0 Conclusion and future work

The aim of this exploratory analysis was to examine the value of social tagging for image description by investigating facets of tags in two different social-based tagging applications: a photo-sharing social network and an image-labelling game-based experiment. The tags were coded and evaluated according to a classification of image attributes based on a combination of established image attribute frameworks.

The results of the research showed that social tagging is predominantly an interpretive activity. However, the number of perceptual tags depends on the context of image use. Photo-sharing communities mostly use images for story-telling and/or as an event diary; therefore, there is more semantic information associated with images, with a prominent amount of people and location recognition, and event and activities tags. The gaming application has been shown to be slightly more perceptually oriented, as visual features (colours, shapes, and distinct objects) are easier to spot and to match. However, specific guidelines can influence the game's outcome in order to obtain a given result (or more particular types of tags). This shows that social tagging is a manageable process, but this does to some extent depend on the taggers' understanding of the image use and on the nature of the tagging environment. It is also seen from the study that games are more oriented towards describing 'what' in an image, while photo-sharing social networks present a more balanced picture of semantic facets (what/where/when/who). It would be useful to analyse whether person, place and time recognition is needed and achievable through GWAP.

Whilst our framework has been useful for the research carried out here, work on how we can use the various levels in conjunction with CBIR to improve image retrieval is worthwhile. Given the results presented here, it would be worth concentrating on interpretive tags initially in order to see what leverage can be gained from that part of the classification.

References

Bolettieri, Paolo, Esuli, Andrea, Falchi, Fabrizio, Lucchese, Claudio, Perego, Raffaele, Piccioli, Tommaso and Rabitti, Fausto. 2009. CoPhIR: A test collection for content-based image retrieval. CoRR abs/0905.4627. Available http://arxiv.org/pdf/0905.4627v2.pdf

Chung, Eunkyung and Yoon, Jungwon. 2011. Image needs in context of image use: An exploratory study. Journal of information science 37: 163-77.

Dye, Jessica. 2006. Folksonomy: A game of high-tech (and high-stakes) tag: Should a robot dictate the terms of your search? In an age when whole lives are lived online-via blogs, picture albums, dating, shopping lists-digital content users are not only creating their content, they're building their own infrastructure for making it easier to find.(navigating webs). Econtent magazine April. Available http://www.econtentmag.com/Articles/Editorial/Feature/Folksonomy-A-Game-of-High-tech-%28and-High-stakes%29-Tag-15298.htm.

Eakins, John and Graham, Margaret. 1999. Content-based image retrieval. Available www.jisc.ac.uk/uploaded_documents/jtap-039.doc

Enser, Peter G.B., Sandom, Christine J., Hare, Jonathan S. and Lewis, Paul H. 2007. Facing the reality of semantic image retrieval. Journal of documentation 63: 465-81.

Ferecatu, Marin, Boujemaa, Nozha and Crucianu, Michel. 2008. Semantic interactive image retrieval combining visual and conceptual content description. Multimedia systems 13: 309-22.

Goh, Dion H. and Lee, Chei S. 2011. Perceptions, quality and motivational needs in image tagging human computational games. Journal of information science 37: 515-31.

Goh, Dion H., Ang, Rebecca P., Chua, Alton and Lee, Chei S. 2010a. Evaluating game genres for tagging images. In Blandford, Ann and Gulliksen, Jan, eds., Proceedings of the 6th Nordic Conference on Human-Computer Interaction: Extending Boundaries (NordiCHI '10). New York: ACM Press, pp. 659-62.

Goh, Dion H., Ang, Rebecca P., Lee, Chei S. and Chua, Alton. 2010b. Fight or unite: Investigating game genres for image tagging. Journal of the American Society for Information Science and Technology 62: 1311-24.

Gordon-Murnane, Laura. 2006. Social bookmarking, folksonomies, and Web 2.0 tools. Searcher: the magazine for database professionals 14: 26-38.

Hare, Jonathan S., Lewis, Paul H., Enser, Peter G. B. and Sandom, Christine J. 2006. Mind the gap: Another look at the problem of the semantic gap in image retrieval. In Chang, Edward Y., Hanjalic, Alan and Sebe, Nicu, eds., Multimedia content analysis, management, and retrieval 2006: 17-19 January 2006, San Jose, California, USA. Bellingham, Wash.: SPIE.

Jaimes, Alejandro and Chang, Shih F. 2000. A conceptual framework for indexing visual information at multiple levels. In Beretta, Giordano B. and Schettini, Raimondo, eds., Proceedings of IS&T/SPIE Internet Imaging 3964, pp. 2-15.

Jorgensen, Corinne. 1996. Indexing images: Testing an image description template. In ASIS 1996 Annual Conference Proceedings, Baltimore, MD, October 19-24, 1996, pp. 209-13.

Rafferty, Pauline and Hidderley, Rob. 2007. Flickr and democratic indexing: Dialogic approaches to indexing. Aslib proceedings: New information perspectives 59: 397-410.

Rorissa, Abebe. 2010. A comparative study of Flickr tags and index terms in a general image collection. Journal of the American Society for Information Science and Technology 61: 2230-42.

Sawant, Neela, Lee, Jia and Wang, James. 2010. Automatic image semantic interpretation using social action and tagging data. Multimedia tools and applications 51: 213-46.

Smeulders, Arnold W.M., Worring, Marcel, Santini, Simone, Gupta, Amarnath and Jain, Ramesh. 2000. Content-based image retrieval at the end of the early years. IEEE Transactions on pattern analysis and machine intelligence 22: 1349-80.

Van Zwol, Roelef, Sigurbjornsson, Börkur, Adapla, Ramu, Pueyo, Lluis G., Katiyah, Abhinav, Kurapata, Kaushal, Muralidharan, Mridul, Muthu, Sudar, Murdock, Vanessa, Ng, Polly, Ramani, Anand, Sahai, Anuj, Sathish, Sriram T., Vasudev, Hari and Vuyyuru, Upendra. 2010. Faceted exploration of image search results. In Freire, Juliana and Chakrabarti, Soumen, eds., Proceedings of International Conference of the World Wide Web (WWW'10). New York: ACM Press, pp. 961-70.

von Ahn, Luis and Dabbish, Laura. 2008. Designing games with a purpose. Communications of the ACM 51: 58-67.

Wang, Meng, Ni, Bingbing, Hua, Xian-Sheng and Chua, Tat-Seng. 2012. Assistive tagging: A survey of multimedia tagging with human-computer joint exploration. ACM computing surveys 44: 25:1-24.

Westman, Stina. 2009. Image users' needs and searching behaviour. In Göker, Ayşe and Davies, John, eds., Information retrieval: Searching in the 21st century. Chichester: Wiley.

66 Knowl. Org. 41(2014)No.1 A. O. Kempf, D. Ritze, K. Eckert, B. Zapilko. New Ways of Mapping Knowledge Organization Systems

New Ways of Mapping Knowledge Organization Systems: Using a Semi-Automatic Matching Procedure for Building up Vocabulary Crosswalks

Andreas Oskar Kempf*, Dominique Ritze**, Kai Eckert*** and Benjamin Zapilko*

*GESIS, Leibniz-Institute for the Social Sciences, Unter Sachsenhausen 6-8, 50667 Cologne, Germany; **Mannheim University Library, Schloss Schneckenhof West, 68131 Mannheim, Germany; ***Mannheim University, Schloss, 68131 Mannheim, Germany

Andreas Oskar Kempf holds a Ph.D. in sociology from Goethe University, Frankfurt am Main, Germany, and is a postdoctoral researcher at GESIS, Leibniz-Institute for the Social Sciences. He received a master's degree in library and information science from Humboldt-University, Berlin, in 2011, with a thesis on automatic indexing of domain-specific information. Apart from automatic indexing, his research interests include the interoperability of knowledge organization systems and their integration in semantic web applications.

Dominique Ritze is a Ph.D. student and project employee at the Mannheim University Library. She received her master’s degree in computer science from the University of Mannheim, Germany, in 2011. Her research focuses on ontology matching as well as semantic technologies and linked data.

Kai Eckert is a research associate at the University of Mannheim, Germany, where he leads the infrastructure development of the EU-funded project DM2E (Digitized Manuscripts to Europeana). He is a computer and information scientist with master's degrees from the University of Mannheim (computer science, business informatics) and the Humboldt-University of Berlin (MA LIS). He was a member of the W3C Provenance Incubator Group and the W3C Library Linked Data Incubator Group. Currently, he participates in the W3C Provenance Working Group and co-chairs the DCMI Metadata Provenance Task Group.

Benjamin Zapilko studied computational visualistics at the University Koblenz-Landau and has been a research associate at GESIS, Leibniz-Institute for the Social Sciences, since 2007. In the department Knowledge Technologies for the Social Sciences he conducts applied research in the fields of semantic web and linked open data. In his Ph.D. thesis, he investigates methods and problems for a useful application of linked open data in scientific research. His research interests also include the modeling and standardization of data as well as the matching and integration of heterogeneous data sources.

Kempf, Andreas Oskar, Ritze, Dominique, Eckert, Kai, and Zapilko, Benjamin. New Ways of Mapping Knowledge Organization Systems: Using a Semi-Automatic Matching Procedure for Building up Vocabulary Crosswalks. Knowledge Organization. 41(1), 66-75. 25 references.

Abstract: Crosswalks between different vocabularies are an indispensable prerequisite for integrated, high-quality search scenarios in distributed data environments where more than one controlled vocabulary is in use. Offered through the web and linked with each other, they act as a central link so that users can move back and forth between different online data sources. In the past, crosswalks between different thesauri have usually been developed manually. In the long run the intellectual updating of such crosswalks is expensive. An obvious solution would be to apply automatic matching procedures, such as the so-called ontology matching tools. On the basis of computer-generated correspondences between the Thesaurus for the Social Sciences (TSS) and the Thesaurus for Economics (STW), our contribution explores the trade-off between IT-assisted tools and procedures on the one hand and external quality evaluation by domain experts on the other hand. This paper presents techniques for semi-automatic development and maintenance of vocabulary crosswalks. The performance of multiple matching tools was first evaluated against a reference set of correct mappings, then the tools were used to generate new mappings. It was concluded that the ontology matching tools can be used effectively to speed up the work of domain experts. By optimizing the workflow, the method promises to facilitate sustained updating of high-quality vocabulary crosswalks.

Received 31 July 2013; Revised 13 September 2013; Accepted 13 September 2013

Keywords: mapping, KOS, semi-automatic matching, vocabulary crosswalks

1.0 Introduction

For good reason, crosswalks between two or more vocabularies, also known as terminology mappings, play an important role in today's information landscape. First and foremost, they are an essential means to achieve interoperability among different knowledge organization systems, thus overcoming problems of semantic heterogeneity. Even where a certain number of discrepancies between different terminologies prove insurmountable, crosswalks implemented in a distributed search scenario can enable an integrated search across varied information collections indexed using different subject metadata systems. In addition, alignments between different controlled vocabularies serve as a useful tool for vocabulary expansion. This is especially helpful in overcoming differences in the terminologies used in different subject disciplines. Beyond that, semantic mappings between different vocabularies can be useful for query expansion and reformulation. Automatic translation of a query into the appropriate search terms of all the different vocabularies in use enables a searcher to apply only the terminology with which he or she is familiar, while moving between different resources and databases in a collection.

Cross-concordances between controlled vocabularies usually involve three basic mapping types: equivalence, hierarchical and associative. Equivalence can be exact (between synonyms) or inexact (between quasi-synonyms). Hierarchical mappings, either broader or narrower, apply in one or the other direction between broader and narrower terms in the respective vocabularies. Associative mappings link related terms. A "null relation" describes the case where no appropriate mapping can be established for a given term. Cross-concordances are established bilaterally, i.e. cross-concordances are created from vocabulary A to vocabulary B as well as from vocabulary B to vocabulary A, and these bilateral relations are not necessarily symmetrical. Additionally, one term of vocabulary A can be mapped to a combination of terms of vocabulary B or independently to several terms of vocabulary B; both cases are known as one-to-n (1:n) term mappings.

The intellectual mapping of vocabularies done by domain experts includes a number of working steps. The first is an overall analysis of structure and topical overlap, to determine whether an alignment is possible and reasonable at all. According to Mayr and Petras (2008, 5):

"Essential for a successful mapping is an understanding of the meaning and semantics of the terms and the internal relations of the concerned vocabularies. This includes syntactic checks of word stems but also semantic knowledge to look up synonyms and other related terms."

Then, the mapping process starts. For each concept, any scope note and all its internal relationships need to be taken into account. In order to achieve overall consistency it is occasionally necessary to revise mappings already created. Finally, mappings between different vocabularies usually include retrieval tests for document recall and precision to evaluate whether the translation of search terms of one vocabulary into those of another vocabulary indeed facilitates the search across different databases and terminologies. For example, queries are translated into search terms of a controlled vocabulary A and used for keyword search in a bibliographic database which uses another controlled vocabulary B. Retrieval results can be compared by repeating this search using cross-concordances between both vocabularies which translate the original controlled vocabulary search terms into the controlled vocabulary terms of the target database.

The need for expertise and for constant consideration of the whole semantic environment of each term makes vocabulary mapping expensive and extremely time-consuming. Against this backdrop, this article seeks to examine to what extent semi-automatic matching procedures can be used to prepare vocabulary crosswalks. The results of the 2012 Ontology Alignment Evaluation Initiative (OAEI) provided basic background as to the ontology-matching approaches available. Comparing technical and intellectual evaluation results of OAEI's most recent "Library Track," we suggest a semi-automatic method to make the intellectual evaluation of automatically-generated vocabulary crosswalks more efficient.

2.0 Related Work

Building up correspondences between vocabularies has been a crucial topic for years in library and information science. For this reason several terminology mapping projects have already addressed the issue of manual versus automatic generation of crosswalks between heterogeneous vocabularies.

A first major terminology mapping initiative was the project Multilingual ACcess to Subjects (MACS) carried out by the National Libraries of France, Germany, Switzerland and the United Kingdom. By establishing equivalences between the three indexing languages, RAMEAU for French, Library of Congress Subject Headings (LCSH) for English, and SWD1 for German, multilingual subject access to library catalogues was made possible (Landry 2009). This led to the establishment of a link management database to create and manage links in a decentralized environment. The development of a search interface and the future and permanent management of the MACS approach are still under planning and analysis. Terminology mappings have also been created at the OCLC Online Computer Library Center, Inc. (Godby et al. 2004; Vizine-Goetz et al. 2004), where various vocabularies like the Dewey Decimal Classification (DDC), the Library of Congress Classification (LCC), the Medical Subject Headings (MeSH), and LCSH have been taken into account. Further initiatives include the High Level Thesaurus Project (HILT) (Macgregor et al. 2007) and CRISSCROSS (Panzer 2008), as well as several mapping projects from the Food and Agriculture Organization of the United Nations (FAO) (Lauser et al. 2008; Liang and Sini 2006). A cross-concordance between the Thesaurus for Economics (STW) and the Thesaurus for the Social Sciences (TSS) was manually created by domain experts in 2006 (Mayr and Petras 2008). All these projects have in common that they did not exploit automatic approaches systematically, due to a lack of generally available and applicable matching systems.

One impediment to the development of matching systems arises from the different formats that are used to represent knowledge organization systems (KOS). With the advent of the semantic web (Berners-Lee et al. 2001), Resource Description Framework (RDF) (Klyne and Carroll 2004), Web Ontology Language (OWL) (McGuinness and van Harmelen 2004), and Simple Knowledge Organization System (SKOS) (Miles and Bechhofer 2009), a technical basis exists that facilitates access to KOS data. Ontology matching, also called ontology alignment, is a related field where correspondences are established between ontologies that are usually represented in OWL. Ontology in this context stands for a special kind of KOS substantially differing from thesauri and classification systems. While thesauri and classifications usually apply a limited number of relationships between concepts or between terms, ontologies potentially apply an unlimited number of predicative term relations (Gietz 2001). Despite these differences between types of KOS's, however, matching approaches are to some extent transferable.

Recently, automatic matching systems have been discussed as a prior step before manual evaluation. Such approaches typically enable user interaction before (To et al. 2009), during, or after the matching process (Duan et al. 2010; Ehrig et al. 2005). The most similar to the evaluation scenario presented in this article are those that enable validation of the detected correspondences after the matching process. While Paulheim et al. (2007) enable a rating of correspondences by the user, the matching process presented by Cruz et al. (2012) and Noy and Musen (2003) is performed iteratively. User feedback on correspondences is brought directly into the subsequent matching tasks. By splitting up the validation process these tools aim to reduce the manual evaluation effort. The main difference is the use case: while these approaches may be used to improve matching results in a variety of settings, we specifically address the task of creating a set of high quality mappings between vocabularies, where automation is used to reduce the manual effort required.

Many matching techniques have already been developed (Kalfoglou and Schorlemmer 2003; Aguirre et al. 2012). Some of them take the names of entities into account while others compute similarities based on the ontology hierarchy. All of them have advantages as well as disadvantages and their individual field of application. Without extensive knowledge about the systems, it is difficult to decide which system should be used for a specific matching task. That is the reason why ontology matching evaluation is needed.

3.0 OAEI library track 2012

One already established evaluation initiative is the Ontology Alignment Evaluation Initiative (OAEI, http://oaei.ontologymatching.org), which started in 2004. Spanning various tracks from a wide range of different scientific disciplines, this campaign has as its main goal to improve ontology matching in general, by comparing and evaluating the different matching systems and algorithms. Taking part either in a specific track or in all tracks, these matching systems and algorithms are evaluated according to special criteria, for example the time spent to build up a set of mappings. Between 2007 and 2009 the OAEI included a so-called library track, directed towards KOS's specifically applied in libraries (Isaac et al. 2009). In 2012 the OAEI again offered a library track focused on the automatic matching of different domain-specific thesauri, co-organized by authors of this paper. To make evaluation of the results possible, however, the organizers needed a reference set of mappings.

3.1 Data set

A key enabler for the OAEI library track was the availability of two considerably overlapping domain-specific thesauri, in this case the Thesaurus for the Social Sciences (TSS) and the Thesaurus for Economics (STW). Both thesauri are commonly used for indexing by domain-specific libraries and institutions providing information infrastructure, and so can be regarded as a real world data set.

The Thesaurus for the Social Sciences (TSS) serves as a key indexing language for documents and research information in German language social sciences. Translated into English and French, it contains overall about 12,000 keywords, made up of 8,000 standardized subject headings and 4,000 non-descriptors. The thesaurus as a whole covers topics and sub-disciplines of the social sciences. Additionally some general, non-scientific terms and some terms from associated and related disciplines are included, in order to support accurate and precise indexing of documents from a wide inter- and multi-disciplinary background. The thesaurus is owned and maintained by GESIS, Leibniz-Institute for the Social Sciences (http://www.gesis.org/en/home/). Its SKOS version is published under a CC-by-NC-ND licence.

The Thesaurus for Economics (STW) provides a German and English indexing vocabulary for economics containing more than 6,000 standardized subject headings, and 19,000 entry terms. Besides subject headings used in the field of economics it includes juridical, sociological, political and geographical subject headings. The entries are richly interconnected by 16,000 hierarchical and 10,000 associative relations. An additional hierarchy of main categories provides a high level overview. The vocabulary, used for indexing purposes in libraries and economic research institutions, is maintained and further developed on a regular basis by ZBW (German National Library of Economics, http://zbw.eu/index-e.html), Leibniz Centre for Economics. It is published under a CC-by-SA-NC license.

During an earlier major terminology mapping initiative conducted by GESIS, Leibniz-Institute for the Social Sciences in 2006, a bilateral reference alignment had been created manually by domain experts (Mayr and Petras 2008) between TSS and STW. It contains about 3,000 exact equivalences, 1,500 narrower and approximately 150 broader term relations in each direction. Since its initial creation in 2006, this reference alignment had not been updated. In recent years, however, the source thesauri have evolved and the changes were not reflected in the reference alignment. For the evaluation exercise, accordingly, an updated alignment would have been useful but in its absence only the established equivalence relations were used for validating the correspondences detected. This need, however, motivated subsequent investigation of whether the results could be used to update the existing alignment.

In view of the large number of concepts, semantic relations and synonyms, the overriding aim of the evaluation was to show whether and to what extent the alignment of the two thesauri could be generated automatically. The question was whether current state-of-the-art matching systems developed for ontologies would be able to deal effectively with thesauri, the so-called "lightweight ontologies" (Uschold and Gruninger 2004) that are widely used in practice.

For the automatic creation of cross-correspondences both thesauri needed to be available in a machine-readable format. Since OWL is used by almost all ontology matching systems, both thesauri had to be converted from their existing SKOS formats into OWL. (General differences between ontologies and thesauri and a detailed description of difficulties including the transformation from SKOS into OWL can be found in Aguirre et al. (2012).)

4.0 Automatic creation of correspondences

For the automatic creation of correspondences all matching systems participating in the OAEI 2012 were applied: AROMA, ASE, AUTOMSv2, CODI, GO2A, GOMMA, Hertuda, HotMatch, LogMapLt, LogMap, MaasMatch, MapSSS, MEDLEY, OMR, Optima, ServOMapL, ServOMap, TOAST, WeSeE, Wmatch and YAM++ (Aguirre et al. 2012). They match the ontologies and generate the resulting alignment by a fully automatic process. Our existing reference alignment made it possible to measure the quality of the alignments created. The results were evaluated by means of precision, recall and F-measure, where precision measures the correctness of the returned correspondences (i.e. the rate of all correct returned correspondences in regard to all returned correspondences), recall the completeness of the correspondences (i.e. the correct returned results in regard to all correct correspondences that should have been returned); F-measure is the harmonic mean of both, i.e. F = 2 × precision × recall / (precision + recall).

An overview of the results can be found in Table 1 (matchers are sorted in descending order of their F-measure values). Altogether, 13 of the 21 submitted matching systems were able to create an alignment. Three matching systems (MaasMatch, MEDLEY, Wmatch) did not finish within the timeframe of one week while five exited with an error.

This evaluation is based on the original reference alignment. It can safely be assumed that if the reference alignment had been up-to-date, many more correct correspondences would have been identified by each of the matchers. GOMMA performs best in terms of F-measure, closely followed by ServOMapL and LogMap. However, the precision and recall measures vary a lot across the top three systems. The choice of matcher for a given application would depend on whether high precision or high recall is preferred. If the focus is on recall, the alignment created by GOMMA is probably the best choice, with a re-

Matcher Precision Recall F-Measure Time (s) Size GOMMA 0.537 0.906 0.674 804 4712 ServOMapL 0.654 0.687 0.670 45 2938 ServOMap 0.717 0.619 0.665 44 2413 LogMap 0.688 0.644 0.665 95 2620 YAM++ 0.595 0.750 0.664 496 3522 LogMapLt 0.577 0.776 0.662 21 3756 Hertuda 0.465 0.925 0.619 14363 5559 WeSeE 0.612 0.607 0.609 144070 2774 HotMatch 0.645 0.575 0.608 14494 2494 CODI 0.434 0.481 0.456 39869 3100 MapSSS 0.520 0.184 0.272 2171 989 AROMA 0.107 0.652 0.184 1096 17001 Optima 0.321 0.072 0.117 37457 624

Table 1. Results of the OAEI Library Track 2012 call of about 90%. Other systems generate alignments with the lexical value of the term was the same but the higher precision, e.g. ServOMap with over 70% precision, scope note in one thesaurus indicated an exclusion not but most give lower recall values (except for Hertuda). valid in the other; Concerning the run-time, LogMapLt as well as Serv- OMap were quite fast with a run-time below 50 seconds. terms in different domains looked similar, but their These systems are even faster than a simple Java-program meanings were different; comparing the preferred labels of all terms. Thus, they are very effective in matching large ontologies while achieving the presence of a synonym matching a preferred term very good results. Other matchers take several hours or in the other thesaurus caused an incorrect equivalence even days and do not produce better alignments in terms to be generated. of F-measure. To sum up, the overall intellectual evaluation results of the 5.0 Intellectual evaluation of automatically created newly established vocabulary mappings vary greatly be- correspondences tween the different matching tools. The number of suc- cessfully established equivalence mappings ranged (ap- The use of a partial reference alignment to identify a good proximately) between 40 and 270, i.e. between 6% and matcher is interesting, but does not solve the problem of roughly 54% of the total correct number. updating and extending the reference alignment in an effi- Despite these promising results, it was judged that the cient way. Manually evaluating new correspondences took alignments obtained were not precise enough for immedi- up to several minutes for each mapping established. There- ate use, since in a live situation every single cross- fore, a good strategy is needed to maximize the number of concordance has to be totally correct. 
Nevertheless, given new correct correspondences while minimizing the tedium the large number of matching systems and their fast, au- of evaluating the matcher results. Unsurprisingly, the tomated execution, they can be used to support domain matching tools were easily able to detect matches based on experts in the creation of cross-concordances. Integrated the term alone, even in cases of small variations in the in a semi-automatic workflow they can serve as a recom- character string. For example, useful matches were often mender system, showing a domain expert the most prob- found between geographical and ethnographical terms. But able cross-concordances and hence saving a huge amount the tools were less effective when taking the term’s context of time. into account. Incorrect matches were often generated However, the question is how to benefit the most from when: the cross-concordances prepared automatically? Within an alignment, confidence values assigned to the correspon- the lexical value of the term was the same but broader dences by the matching tools indicate how trustworthy a and narrower terms showed the underlying concept to correspondence is. Unfortunately, the confidence values be different; are not comparable between different matchers; in particu- Knowl. Org. 41(2014)No.1 71 A. O. Kempf, D. Ritze, K. Eckert, B. Zapilko. New Ways of Mapping Knowledge Organization Systems lar they do not indicate how far an alignment is correct. vestigated whether a reorganization of the results pre- They can only be used to order correspondences within sented for manual evaluation had an impact on the time one alignment. Traditional measures like precision, recall spent by domain experts. We tested this assumption on and F-measure do not take this ordering into account. the results of the OAEI library track 2012. 
Thus, an alignment can have a high F-measure value but if This experiment addressed the order and the number the correct correspondences are listed at the end, this of detected correspondences the domain expert had to alignment is not the best choice. In this case, an alignment consider. Any duplicate correspondences (i.e. correspon- with a low F-measure value but properly assigned confi- dences generated by more than one matcher) were re- dence values is to be preferred. Thus, the domain expert moved. After de-duplication, the correspondences were gets a high amount of correct cross-concordances while grouped according to the number of matchers detecting verifying as few as possible. them. This resulted in a group containing correspon- dences that were found by all thirteen matching systems, a 6.0 Improving results with user interaction group with correspondences found by twelve matchers and so on. The last group contained correspondences Until now, the OAEI tracks have only evaluated fully au- found by only one matcher. tomated matching systems. Similar to the library track, the In the experiment, the groups were presented to the results are often good, but for various applications not domain expert for evaluation in descending order, i.e. the good enough. In these cases, it is necessary to involve do- expert began with the group of correspondences found main experts, either before, during or after the matching by all matching systems. From the total numbers of corre- process. Before the matching process: the expert can indi- spondences and of those which turned out to be correct, cate correct and incorrect correspondences. Based on this we can observe the rate of finding correct correspon- additional source of information, the system can try to dences and compare that with the rate when no reorder- learn the perfect matching strategy. During the matching ing of the results was done. In other words, calculation process: the matching system can ask the expert e.g. 
to ver- shows how many correct correspondences would be ify or complete correspondences. Using the answer, the found after evaluating the same number of correspon- system can try again to adapt its strategy. After the match- dences as before. ing process: once the alignment has been created, the ex- In Table 2, the results of the manual evaluation are pert can verify the correspondences in order to improve summarized. For our experiment only the de-duplicated their quality. In this case the matching system cannot bene- correspondences were considered. fit from the results as they are usually not fed back into the system. Since the current state-of-the-art matching systems All correspondences De-duplicated mostly deal with fully automated matching services, we (including duplicates) correspondences only verified the alignments after they had been created. If Total 55466 22592 the expert is interactively involved in the whole matching number process, the manual effort could be further reduced. Then, of which 21541 2484 (11%) of course, other measures are needed to compare the sys- are correct tem, e.g. the number of required interactions (Paulheim et Table 2. Number of correspondences: total; de-duplicated and al. 2013). correct

7.0 Optimizing the evaluation process In Figure 1, we illustrate the percentage of correct corre- spondences (y-axis) found by a certain number of match- In the following experiment, we investigated whether the ing systems (x-axis). For example, x=9 means that these effort of a domain expert during manual evaluation can correspondences were identified by 9 matching systems, be reduced and optimized. For our manual evaluation, we no matter which particular 9 systems found them. Above studied each alignment in isolation and checked every sin- the graph, the total number of detected correspondences gle correspondence. It goes without saying that this proc- for x systems is indicated (71). Altogether, 71 correspon- ess would be quicker, if each correspondence that occurs dences were found by all matching systems, from which in several alignments can be presented for checking only ~99% proved correct. Of the correspondences found by once. Another idea is to exploit the large number of 12 matching systems (209), about 93% were found to be alignments generated by the matching systems. The un- correct. The graph clearly shows a correlation between the derlying assumption of this approach is that the more number of matchers to identify a given correspondence, matching systems have found a certain correspondence, and the likelihood of its being correct. the more likely it seems to be correct. Additionally, we in- 72 Knowl. Org. 41(2014)No.1 A. O. Kempf, D. Ritze, K. Eckert, B. Zapilko. New Ways of Mapping Knowledge Organization Systems

Figure 1. Percentage of correct correspondences found by x matching systems
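
The de-duplication and grouping step described for this experiment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `group_by_votes` and `review_order` are hypothetical, and a correspondence is assumed to be a simple (source concept, target concept) pair with relation type and confidence value ignored.

```python
from collections import Counter, defaultdict


def group_by_votes(alignments):
    """Group de-duplicated correspondences by how many matchers found them.

    `alignments` maps a matcher name to an iterable of
    (source_concept, target_concept) pairs (an assumed, simplified format).
    Returns a dict {number_of_matchers: set_of_pairs}, highest agreement first.
    """
    votes = Counter()
    for pairs in alignments.values():
        # set() ensures a matcher votes at most once per correspondence
        votes.update(set(pairs))

    groups = defaultdict(set)
    for pair, n_matchers in votes.items():
        groups[n_matchers].add(pair)
    return dict(sorted(groups.items(), reverse=True))


def review_order(alignments):
    """Flatten the groups into the order shown to the domain expert."""
    ordered = []
    for pairs in group_by_votes(alignments).values():
        ordered.extend(sorted(pairs))  # sorted only to make the order stable
    return ordered
```

Each correspondence appears exactly once in the list returned by `review_order`, and those found by the most matchers come first, which is the presentation order used in the experiment.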

Number of      Number of all    Percentage of    Number of
corresponding  correspondences  correct          correct
matchers                        correspondences  correspondences
1              16662             0.27007562        50
2                840             5.71428571        48
3                538            10.4089219         56
4                574            15.6794425         90
5                528            20.4545455        108
6                555            31.8918919        177
7                523            37.0936902        194
8                486            48.8659794        238
9                448            61.3839286        275
10               506            80.8300395        409
11               652            89.1104294        581
12               209            92.8229665        194
13                71            98.5915493         70

Table 3. Results of the “majority vote”
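
The per-group counts in Table 3 can be accumulated from the group with the most agreeing matchers downward, to see how many correct correspondences an expert would have confirmed after each group. The sketch below does this and, for comparison, estimates the yield of an unordered review from the overall correctness rate of 11% given in Table 2. The names `TABLE_3`, `OVERALL_RATE` and `cumulative_review` are illustrative, and rounding the estimate to the nearest integer is an assumption.

```python
# (number of matchers, all correspondences, correct correspondences),
# copied from Table 3 as printed
TABLE_3 = [
    (1, 16662, 50), (2, 840, 48), (3, 538, 56), (4, 574, 90),
    (5, 528, 108), (6, 555, 177), (7, 523, 194), (8, 486, 238),
    (9, 448, 275), (10, 506, 409), (11, 652, 581), (12, 209, 194),
    (13, 71, 70),
]

OVERALL_RATE = 0.11  # overall correctness rate reported in Table 2


def cumulative_review(rows, overall_rate=OVERALL_RATE):
    """Accumulate counts, starting with the group found by the most matchers."""
    stages = []
    evaluated = correct = 0
    for matchers, total, n_correct in sorted(rows, reverse=True):
        evaluated += total
        correct += n_correct
        stages.append({
            "matchers": matchers,
            "evaluated": evaluated,        # correspondences checked so far
            "correct_ordered": correct,    # correct ones found so far
            # assumed estimate for an unordered review of the same size
            "correct_unordered_est": round(evaluated * overall_rate),
        })
    return stages


if __name__ == "__main__":
    for stage in cumulative_review(TABLE_3):
        print(stage)
```

For instance, once the groups found by nine or more matchers have been reviewed, 1886 correspondences have been checked and 1529 of them are correct, whereas the same effort on an unordered list would be expected to yield about 207.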

Table 3 shows the number of all correspondences and the numbers of all correct correspondences, grouped by the number of matchers that found these correspondences. For example, 506 correspondences were found by ten matching systems, and 409 of them (80% approximately) were correct.

These numbers confirm our assumption that the more matching systems have found a certain correspondence, the more likely it is to be correct. This “majority vote” method has already emerged as a promising technique, e.g. for combining different ontology matching systems (Eckert et al. 2009).

Regarding the time spent by users during manual evaluation, our results confirm that at least a certain number of correct correspondences can be found relatively quickly by optimizing the sequence of entries in the list of matches (see Table 4). To show the extent of the efficiency gain, the first five columns of Table 4 reverse the sequence of Table 3, beginning with those correspondences that were found by as many matchers as possible. This reveals how many correct correspondences can be found at each stage, if the list is reorganized. Percentages of correct correspondences are also shown for each group of matchers. Finally, in the last two columns of Table 4 we compare these numbers to the numbers when the evaluation is not optimized. The number of corresponding matchers (column 1) was not taken into account. The overall correctness rate of 11 % (see Table 2) was used to estimate the number of correct correspondences shown in column 6. This shows the number of correct correspondences that would have been found after checking the same number of candidates as were checked at the corresponding stage of the optimized process.

Optimized scenario                                                           Normal evaluation
Number of      Number of all    Percentage of  Number of    Percentage of    Number of    Percentage of
corresponding  correspondences  all corr.      all correct  correct corr.    all correct  correct corr.
matchers       (corr.)          (22592=100%)   corr.        (2484=100%)      corr.        (2484=100%)
                                                                             (estimated)
13                71              0.31%          70           2.82%             8           0.32%
12               280              1.24%         264          10.63%            31           1.25%
11               932              4.13%         845          34.02%           103           4.15%
10              1438              6.37%        1254          50.48%           158           6.36%
9               1886              8.34%        1529          61.55%           207           8.33%
8               2372             10.50%        1767          71.14%           261          10.51%
7               2895             12.81%        1961          78.95%           318          12.80%
6               3450             15.27%        2138          86.1%            380          15.30%
5               3978             17.61%        2246          90.42%           438          17.63%
4               4552             20.15%        2336          94.04%           501          20.17%
3               5090             22.53%        2392          96.30%           560          22.54%
2               5930             26.25%        2440          98.23%           652          26.25%
1              22592            100%           2490         100%             2485         100%

Table 4. Comparison of different evaluation strategies

In summary, a critical mass of correct correspondences can be detected faster by reordering the results for manual evaluation. For example, after having evaluated 1886 correspondences a total of 1529 correct correspondences were found in the optimized scenario (i.e. 61.5 % of all correct correspondences), while only 207 correct correspondences would have been found without optimization (only 8.33 % of all correct correspondences). Nevertheless, if it is necessary to find all correct correspondences, all the results of all matchers must eventually be evaluated.

8.0 Conclusion and outlook

As is already well-known, the intellectual process of developing cross-vocabulary mappings typically requires specialist resources and can be very time-consuming. This is especially true of large-scale thesauri that cover many sub-disciplines. Our study has shown that the use of ontology matching tools can greatly speed up the process, especially if the work is organized in the most time-efficient order. This enables automatic creation of an alignment between different thesauri that are available in machine-readable format.

The most recent OAEI library track has shown significant differences between the performances of various ontology matching tools on offer. Some are rather promising. None of them, however, could alone prepare a high-quality vocabulary crosswalk. As a first conclusion, it was judged that the matching tools could be used in recommender systems. Second, the matches generated by a variety of different tools were combined and presented in the most time-efficient order, so as to speed up the intellectual evaluation of the matches. This proved highly effective.

The immediate outcome has been the development of a semi-automatic matching technique for preparing vocabulary crosswalks. Beyond that, however, more research could usefully be done into the provision of automated support for intellectually verified matching procedures. Knowledge organization systems such as thesauri are built with elaborate semantic content and structures. The challenge of achieving interoperability between them is an intellectual task that cannot easily be emulated by automatic means. That is why further research could usefully study the interplay between process-supporting technical solutions and intellectual demands.

Note

1. Schlagwortnormdatei or Subject Headings Authority File of the German National Library, has subsequently been replaced by the GND (Gemeinsame Normdatei or Universal Authority File).

References

Aguirre, José Luis, Eckert, Kai, Euzenat, Jérôme, Ferrara, Alfio, van Hage, Willem Robert, Hollink, Laura, Meilicke, Christian, Nikolov, Andriy, Ritze, Dominique, Scharffe, François, Shvaiko, Pavel, Šváb-Zamazal, Ondřej, Trojahn, Cássia, Jiménez-Ruiz, Ernesto, Cuenca Grau, Bernardo and Zapilko, Benjamin. 2012. Results of the ontology alignment evaluation initiative 2012. In Proceedings of the 7th International Workshop on Ontology Matching, OM 2012 - Collocated with the 11th International Semantic Web Conference, ISWC 2012, pp. 73-115.

Berners-Lee, Tim, Hendler, James and Lassila, Ora. 2001. The semantic web. Scientific American 284: 34-43.

Cruz, Isabel F., Stroe, Cosmin and Palmonari, Matteo. 2012. Interactive user feedback in ontology matching using signature vectors. In Proceedings of the 28th International Conference on Ocean, Offshore and Arctic Engineering-2009: presented at [the] 28th International Conference on Ocean, Offshore and Arctic Engineering: May 31-June 5, 2009, Honolulu, Hawaii, USA, New York: American Society of Mechanical Engineers, pp. 1321-4.

Duan, Songyun, Fokoue, Achille and Srinivas, Kavitha. 2010. One size does not fit all. Customizing ontology alignment using user feedback. In Simperl, Elena, Cimiano, Philipp, Polleres, Axel, Corcho, Oscar and Presutti, Valentina, eds., The semantic web: Research and applications: 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27-31, 2012. Proceedings. Berlin: Springer, pp. 177-92.

Ehrig, Marc, Staab, Steffen and Sure, York. 2005. Bootstrapping ontology alignment methods with APFEL. In Gil, Yolanda, Motta, Enrico, Benjamins, V. Richard and Musen, Mark, eds., The Semantic Web - ISWC 2005: 4th International Semantic Web Conference, ISWC 2005, Galway, Ireland, November 6 - 10, 2005, Proceedings. Berlin: Springer, pp. 186-200.

Eckert, Kai, Meilicke, Christian and Stuckenschmidt, Heiner. 2009. Improving ontology matching using meta-level learning. In Aroyo, Lora, Traverso, Paolo, Ciravegna, Fabio, Cimiano, Philipp, Heath, Tom, Hyvönen, Eero, Mizoguchi, Riichiro, Sabou, Marta and Simperl, Elena, eds., The semantic web: research and applications: 6th European Semantic Web Conference, ESWC 2009, Heraklion, Crete, Greece, May 31-June 4, 2009: proceedings. Berlin: Springer, pp. 158-72.

Gietz, Peter. 2001. Expertise über quality controlled subject gateways und fachwissenschaftliche portale in Europa. Tübingen.

Godby, Carol Jean, Young, Jeffrey A. and Childress, Eric. 2004. A repository of metadata crosswalks. DLib magazine 10 no. 12. Available http://www.dlib.org//dlib/december04/godby/12godby.html.

Isaac, Antoine, Wang, Shenghui, Zinn, Claus, Mattherzing, Henk, van der Meij, Lourens and Schlobach, Stefan. 2009. Evaluating thesaurus alignments for semantic interoperability in the library domain. IEEE intelligent systems 24 no. 2: 76-86.

Kalfoglou, Yannis and Schorlemmer, Marco. 2003. Ontology mapping: the state of the art. The knowledge engineering review 18: 1-31.

Klyne, Graham and Carroll, Jeremy J. 2004. Resource description framework (RDF): Concepts and abstract syntax – W3C recommendation. Available http://www.w3.org/TR/rdf-concepts/.

Landry, Patrice. 2009. Multilingualism and subject heading languages: how the MACS project is providing multilingual subject access in Europe. Catalogue & index: Periodical of CILIP cataloguing & indexing group 157: 9-11.

Lauser, Boris, Johannsen, Gudrun, Caracciolo, Caterina, van Hage, Willem Robert, Keizer, Johannes and Mayr, Philipp. 2008. Comparing human and automatic thesaurus mapping approaches in the agricultural domain. In Greenberg, Jane and Klas, Wolfgang, eds., Metadata for semantic and social applications: Proceedings of the International Conference on Dublin Core and Metadata Applications: Berlin, 22-26 September 2008: DC 2008: Berlin, Germany. Göttingen: Universitätsverlag Göttingen, pp. 43-53.

Liang, A. C. and Sini, M. 2006. Mapping AGROVOC and the Chinese Agricultural Thesaurus: Definitions, tools, procedures. New review of hypermedia and multimedia 12: 51-62.

Macgregor, George, Joseph, Anu and Nicholson, Dennis. 2007. A SKOS core approach to implementing an M2M terminology mapping server. In Proceedings of the International Conference on Semantic Web and Digital Libraries (ISCD-2007), 21-23 February 2007, Bangalore, India. Bangalore: Documentation Research & Training Centre, Indian Statistical Institute, pp. 109-20. Available http://strathprints.strath.ac.uk/2970/1/strathprints002970.pdf

McGuinness, Deborah L. and van Harmelen, Frank. 2004. OWL web ontology language – W3C recommendation. Available http://www.w3.org/TR/owl-features/.

Mayr, Philipp and Petras, Vivien. 2008. Building a terminology network for search: The KoMoHe Project. In Greenberg, Jane and Klas, Wolfgang, eds., Metadata for semantic and social applications: Proceedings of the International Conference on Dublin Core and Metadata Applications: Berlin, 22-26 September 2008: DC 2008: Berlin, Germany. Göttingen: Universitätsverlag Göttingen, pp. 177-82.

Miles, Alistair and Bechhofer, Sean. 2009. SKOS simple knowledge organization system reference - W3C recommendation. Available http://www.w3.org/TR/skos-reference/.

Noy, Natalya F. and Musen, Mark A. 2003. The PROMPT suite: Interactive tools for ontology merging and mapping. International journal of human-computer studies 59: 983-1024.

Panzer, Michael. 2008. Semantische Integration heterogener und unterschiedlichsprachiger Wissensorganisationssysteme: CrissCross und jenseits. In Mitgutsch, Konstantin, Netscher, Sebastian and Ohly, H Peter, eds., Kompatibilität, medien und ethik in der wissensorganisation : Proceedings der 10. Tagung der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation Wien 3. - 5. Juli 2006. Würzburg: Ergon, pp. 61-9.

Paulheim, Heiko, Rebstock, Michael and Fengel, Janina. 2007. Context-sensitive referencing for ontology mapping disambiguation. In Bouquet, Paolo, Euzenat, Jérôme, Ghidini, Chiara, McGuinness, Deborah L., Serafini, Luciano, Shvaiko, Pavel and Wache, Holger, eds., C&O:RR-2007 Contexts and Ontologies: Representation and Reasoning: Proceedings of the International Workshop on Contexts and Ontologies: Representation and Reasoning. CEUR, pp. 47-56. Available http://ceur-ws.org/Vol-298/paper5.pdf.

Paulheim, Heiko, Hertling, Sven and Ritze, Dominique. 2013. Towards evaluating interactive ontology matching tools. In Cimiano, Philipp, Corcho, Oscar, Presutti, Valentina, Hollink, Laura, Rudolph, Sebastian, eds., The semantic web: Semantics and big data: 10th International Conference, ESWC 2012, Montpellier, France, May 26-30, 2013: Proceedings. Heidelberg: Springer, pp. 31-45.

To, Hoai-Viet, Ichise, Ryutaro and Le, Hoai-Bac. 2009. An adaptive machine learning framework with user interaction for ontology matching. In Proceedings of the International Joint Conferences on Artificial Intelligence, Workshop on Information Integration on the Web, pp. 35-40.

Uschold, Michael and Gruninger, Michael. 2004. Ontologies and semantics for seamless connectivity. ACM SIGMOD record 33 no. 4: 58-64.

Vizine-Goetz, Diane, Hickey, Carol, Houghton, Andrew and Thompsen, Roger. 2004. Vocabulary mapping for terminology services. Journal of digital information 4 no. 4. Available http://journals.tdl.org/jodi/index.php/jodi/article/view/114/113.

76 Knowl. Org. 41(2014)No.1 T. Svarre and M. Lykke. Experiences with Automated Categorization in E-Government Information Retrieval

Experiences with Automated Categorization in E-Government Information Retrieval

Tanja Svarre* and Marianne Lykke**

*/** Dept. of Communication and Psychology, Aalborg University, Nyhavnsgade 14, DK-9000 Aalborg, Denmark, * , **

Marianne Lykke is Professor and Knowledge Group leader for the e-Learning Lab (eLL), Department of Communication and Psychology, Aalborg University. She is Professor II at the Oslo and Akershus University College of Applied Sciences (HIOA), and is visiting professor at Åbo Academy University, Åbo. Her research concerns technologies for knowledge sharing and learning in organizations, specifically information architec- ture and interaction design, and use practice. She is member of several editorial boards, and has published in international as well as national journals, anthologies and proceedings. She has acted as consultant to many en- terprises and government organizations.

Tanja Svarre is Assistant Professor at the e-Learning Lab, Department of Communication and Psychology, Aalborg University. Her research is centered on professional information practice and design of ICT-based services with a specific focus on information architecture. She defended her thesis at Aalborg University in 2012. Here she investigated and compared indexing methods in e-government from a user based perspective.

Svarre, Tanja and Lykke, Marianne. Experiences with Automated Categorization in E-Government In- formation Retrieval. Knowledge Organization. 41(1), 76-84. 35 references.

Abstract: High-precision search results are essential for supporting e-government employees’ information tasks. Prior studies have shown that existing features of e-government retrieval systems need improvement in terms of search facilities (e.g., Goh et al. 2008), navigation (e.g., de Jong and Lentz 2006) and metadata (e.g., Kopackova, Michalek and Cejna 2010). This paper investigates how automated categorization can enhance in- formation organization and retrieval, and presents the results of a realistic evaluation that compared automated categorization with free text indexing of the government intranet used by Danish tax authorities. The evalua- tion demonstrates a potential for automated categorization in a government context. In terms of quantitative measures free text indexing performed at the same level or better than searching by categorization. However, the qualitative analysis revealed that categorized over- views were useful if the participant did not possess much knowledge of the task at hand. When task knowledge was present, categoriza- tion was used to support the assumptions of a correct search. Participants avoided automated categorization if high-precision documents were among the top results or if few documents were retrieved. The findings emphasise the importance of simultaneous search options for e-government IR systems, and reveal that automated categorization is valuable in improving search facilities in e-government.

Received 31 July 2013; Revised 20 September 2013; Accepted 24 September 2013

Keywords: search, categorization, categories, queries, documents, E-government, information retrieval

1.0 Introduction 2006). Therefore, not being able to find needed informa- tion can have severe human and financial costs (Kraemer E-government facilitates governments utilising ICT to and Dedrick 1997). Different tools add to reduced infor- communicate with and allow access to information for mation overload in organizations, e.g. value added infor- stakeholders (e.g., Fang 2002; Jaeger 2003; Grant and mation (Edmunds and Morris 2000). Metadata assignment Chau 2005). Documentary support is essential for opera- supports interoperability between systems, high precision tions undertaken in public administrations (Kraemer and search, and knowledge sharing (Schwartz, Divitini and Dedrick 1997; Klischewski 2006; Sabucedo and Rifón Brasethvik 2000; Moen 2001; Choo 2006; Tambouris, Knowl. Org. 41(2014)No.1 77 T. Svarre and M. Lykke. Experiences with Automated Categorization in E-Government Information Retrieval

Manouselis and Costopoulou 2007). Metadata can be assigned manually by humans or automatically based on a machine-generated analysis of documents. In Danish e-government, the predominant approach is manual assignment (The Danish Government, Local Government Denmark (LGDK) and Danish Regions 2007).

In the field of US federal records management, Sprehe et al. (2002) found that different situational factors affected the quality of federal employees' record-keeping, causing a divergence in the quality of record management across governments. Factors like availability of resources and guidance, the motivation of employees, and efficiency of access to records appeared to affect the quality of records management in the study. In a recent study of metadata assignment in a Finnish government, the researchers found that employees preferred not to assign metadata when they had the option. Additionally, the employees tended to accept default values whenever they were available (Kettunen and Henttonen 2010). The results suggest that e-government indexing can benefit from an automatic solution to indexing in a number of ways. The literature has already demonstrated that the assignment of metadata is one among a number of prerequisites for retrieval and sharing of knowledge in organizations (e.g., Choo 2006). If automated assignment can improve subject metadata, then we can assume that retrieval and knowledge sharing are also influenced in a positive sense.

2.0 Categorization

Categorization places documents in categories, usually in a web-based environment, with the purpose of supporting searches (Qi and Davison 2009). Specifically, categorization enables post-limitation of search results on the basis of document characteristics, e.g. subject, document type, authors, etc. Categorization may be based on either manually added metadata or automated procedures. Automated procedures include clustering, knowledge engineering and machine learning. Clustering is an unsupervised procedure. Here digitalised documents are represented as document vectors. Calculations of the similarity between vectors subsequently form the basis of clustering documents with corresponding characteristics (Carpineto et al. 2009). Knowledge engineering and machine learning are typically based on a coupling between documents and a controlled vocabulary. Knowledge engineering is a rule-based approach. The rules ensure automated placement of documents in one or more correct categories. The development of rules is done manually. Machine learning on the other hand is based on supervised training. A set of training documents representing each category in the controlled vocabulary is selected and subsequently used for categorization of the full collection of documents (Sebastiani 2002).

Automated categorization has been thoroughly evaluated in individual studies and in comparative reviews. However, the evaluations have to a large extent been system-driven and included no users or had a very limited inclusion of users. Early examples include Apté, Damerau and Weiss (1994), Chen (1995) and Dumais et al. (1998). Turmo et al. (2006), Chung et al. (2010), and Qu et al. (2012) are more recent examples. Zamir and Etzioni's (1999) evaluation of their cluster-based interface Grouper is one example of a user-based evaluation. They found that users explored several clusters to locate relevant documents and that the Grouper users found more documents compared to the baseline system (HuskySearch). Another example is Kules and Shneiderman's (2004) study. They made a comparative study of ranked and categorized outputs in U.S. government webpages. The participants found the overview easy to use and helpful in noticing areas not covered by search results. The authors also note a learning effect from the categorization. Despite the controlled character of the test, the authors conclude that categorization is useful in supporting understanding of large sets of search results.

Lastly, Käki (2005a; 2005b; Käki and Aula 2005) built the evaluation of a web categorization interface (extracted indexing) on users. Different evaluations have been reported from the study. Käki and Aula (2005) made a comparative study of an interface comprising the algorithm and categorized search interface with the World Wide Web as the test base. The study found that the categorized interface had a better average performance in precision (62% against 49%) and recall (33% against 19%). A longitudinal study elaborated on the initial results (Käki 2005b). It was found that categories were used to select 26% of the accessed result pages. The participants indicated that categories were useful when "the original query was vague, broad, general, or contained words that have multiple meanings" (Käki 2005b, 138). Also, categories helped to increase the focus of a less precise query and were found useful when result rankings were deficient. The results of the study are interesting because they demonstrate that categorization is not necessarily useful in all information searching situations. From the analysis, we get an indication of situations in which categories may be useful. However, a more systematic investigation would be relevant.

Many studies have examined various forms of automated categorization, but few with the participation of users. In the present paper, we investigate automated categorization based on a controlled vocabulary applied with a combination of machine learning and knowledge engineering. We evaluate the automated categorization approach on a corporate and e-government intranet by including professional users. The evaluation is carried out as a comparison study between automated categorization
and automatic free text indexing. On this basis, the research question guiding our further work runs as follows: What characterizes the potential role of categorization in professional e-government information retrieval?

3.0 Methodology

For answering the research question, we carried out a search test in a realistic setting in a real life government intranet at the Danish Tax Corporation, SKAT. The test took place in June 2010 in two office locations of SKAT. The organization intranet contains a heterogeneous collection of documents, e.g. legal directions, citizen and business directions and brochures, legal documents, forms, news, minutes, job postings, reports from finished internal projects, HR information and other internal information from the organization and departments. At the time of the test, the intranet contained 681,640 documents. The search test compares free text indexing (extracted indexing, system A, baseline) and categorization (assigned indexing, system B) in an experimental manner to be able to observe and capture differences in searching behaviour between the two systems.

A prototype of the organization's future intranet functioned as the test system of the search test. The test system contained a random sample of the running intranet. The sample comprised 188,600 documents, that is, 28% of the full document collection. The prototype was based on content management technology. Autonomy's (www.autonomy.com) search software, IDOL, provided the search functionalities of the search interface. Though more fields were available, the participants only used the search field's query box, search operator and document type during testing. Search results were relevance ranked. For each hit, the document title, a snippet highlighting the search words and the surrounding words, the document type and the date of publication appeared.

System B represented searching by categorization. In IDOL, categorization is based on machine learning. The taxonomy used for the categorization has 169 terms divided into two levels. The selection of one or more categories took place after a search had been processed and a result existed. On the basis of the retrieved documents, the search result was limited to subjects present in the search results. The categorization window displayed the terms from the taxonomy actually containing documents in the current result set. In the test situation, when the participants used system A, the categorization field was covered.

The development of the test database and the training of the document categorization were still taking place during the test work. Consequently, the test work was challenged in various ways. The categorization procedure was semiautomatic, as some of the documents were placed in the categorization on the basis of manually added subject metadata (documents published after January 1, 2008). The remainder of the documents was indexed automatically. Also, there was a lack of most recent documents. The test database was generated in August 2009 and was not updated in the intervening period of time up to the search test in June 2010. Lastly, the test database had some functional inexpediencies, e.g. not being able to link to the full text of all documents and at times slow responses. The test procedure was designed with these inexpediencies in mind to reduce the influence on the test outcome.

Thirty-two employees participated in the test. The participants were recruited by e-mail. In our selection of participants, we emphasised frequent intranet use and information seeking. Forty-two of the voluntary employees met the requirements. Of these, 10 were used as pilot testers. We employed three simulated and one genuine work task in the test (cf. Borlund 2003). The simulated tasks covered the sale of an apartment (sim1), taxation of e-commerce (sim2), and tax-based issues related to freelance work (sim3). The test procedure consisted of: 1) an introduction to the session; 2) the search part in which the participants carried out searches in the two systems; and, 3) a post-search interview. In the first part, the participants were introduced to the session, system characteristics, etc. Due to time constraints the participants did not try out the prototype ahead of the test. In all test sessions, the succession of tasks and systems was rotated. When searching in system B, the participants were obliged to use categorization for limiting their search results. The relevance of retrieved documents was assessed on the basis of the title and snippet. The relevance of search results was noted when the result lists appeared. After the search part, a short post-search interview was conducted. The test sessions ranged between 30 minutes and two hours. The test setting (recruitment e-mail, search tasks and the general test session) was pilot tested ahead of data collection.

Different data were collected throughout the search test. The participants' interaction with the test system was logged using the software Morae (http://www.techsmith.com/morae.asp). Interviews, both oral and in questionnaire form, were carried out during the course of the search test. Documents' relevance was assessed during the test. Relevance was assessed on a four-point scale: 0 represented not relevant; 1 denoted a document pointing to the subject, but only by a sentence or the like; 2 denoted a document pointing to the topic, but only by parts and not the full document; and 3 represented a thorough discussion of the question at hand. The scale reflects Sormunen's (2002) four-point scale. A search log registered search time and words applied. From the screen video recorded during the searches, we manually extracted the number of hits retrieved, selection of subject categories, use of information filters and search types. All were registered in SPSS for analysis (http://www-01.ibm.com/software/analytics/spss/) along with the relevance assessments. Subsequently, statistical analyses were carried out, consisting of univariate and bivariate statistics, frequencies, means and correlations. In the analysis, query success designated a query retrieving at least one document with a relevance measure of 2 or above. Session success was the label for a session that contained at least one successful query. We used the log data to compare system A and system B by the number of concepts applied in queries and the degree of search success in sessions and queries. Also the search log provided detailed data on the extent and types of reformulations carried out. However, qualitative data was needed to understand and explain the patterns identified in the search log, as several iterations and changes of search moves can, but do not necessarily, equal a bad session. For that purpose we used a Dictaphone to record the search test and the post interview. The recordings were subsequently transcribed. We used atlas.ti for analysing the interview transcripts.

4.0 Results

The search test provides data on the searching behaviour in the two test systems, system A and system B. In total, the 32 participants undertook 128 sessions consisting of 564 queries, with 64 sessions in each of the two systems. Table 1 summarises the general findings. The average number of concepts is slightly higher in system B (1.90) compared to system A (1.67). This corresponds to the concept (search key) averages of 1.8 and 1.5 found in Lykke et al. (2012). Further, in a study comparing categorized searches with non-categorized searches, Käki (2005b, 136) found an average of 2.10 search words for the former and 2.04 for the latter. Though Käki investigates search words and we report concepts, the respective results agree that on average more search words are applied in categorized queries than in non-categorized queries.

Variables                                              System A         System B
                                                       Sessions N=64    Sessions N=64
                                                       Queries N=229    Queries N=335
Number of concepts in queries (averages)               1.67             1.90
Number of sessions with reformulations (percentages)   65.6             82.8
Number of reformulations in sessions (averages)        2.58             4.23
Query success (percentages)                            30.6             21.5
Session success (percentages)                          89.1             84.4

Table 1. General Findings of Variables in Search Test

Reformulations took place in both systems. However, in system A the share of sessions with reformulations was 65.6%, while 82.8% of the sessions in system B required reformulations. In addition, the average number of reformulations was notably higher in system B (4.23) compared to system A (2.58). This means that an average session in system A contains 3.58 queries, while the corresponding number for system B is 5.23. The averages are slightly above the findings of similar studies of web search engines and web portals. Lykke et al. (2012) found averages of 2.5 and 3.2 queries per session. Koshman et al.'s (2006, 1879) average was marginally higher at 3.37. To sum up, the present study, and in particular system B, has an increased number of queries in sessions compared to similar studies. We ascribe the increased number of queries in sessions to the participants' lack of experience with the test system. The lack of experience may also explain the higher success rate at query and session level in system A.

4.1 Reformulations

The type of reformulation adds to our understanding of the search actions carried out by the participants. We analysed reformulations to discover if the category, the search words, the document type or the search operator were changed, if several parameters were changed or if no reformulation occurred (see Table 2). In system A, the overall preferred reformulation is a change of search words. This is followed by a change of the document type and simultaneous change of two or more parameters. Compared to system B, the use of the document type filter is far more common in system A, likely because this is the only possible way of reducing search results in system A without changing the search words or the search operator. Thus, the participants actually used the available options for modification of their search results. Furthermore, the regular use of the document type filter emphasises the importance and relevance of the filter. In system B, the preferred reformulation was a change of categories; this was closely followed by a combination of two or more parameters. Next, a change of query words followed. Document type and search operators were rarely used as query modifiers. It is evident that categories are important, which is to be expected, as they were mandatory in system B. In addition, categories were combined with other parameters to a large extent. Most commonly, a change of category was combined with a change of search words. This reflects the design of the system, where only categories with content were shown to the searchers. Thus, when search words were changed, a change of available categories was likely to occur, as the categories reflected the list of retrieved documents. This also explains the importance of a change of query words as a reformulation.

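As an illustration only, the success and reformulation measures reported in Table 1 can be sketched in Python. The study itself used SPSS, so this is not the authors' procedure; the `Query` class, its field name, the `summarise` helper and the toy data are all hypothetical. The sketch assumes only what the methodology states: a query succeeds when at least one document is assessed at relevance 2 or above, a session succeeds when it contains at least one successful query, and every query after the first in a session counts as a reformulation.

```python
from dataclasses import dataclass
from typing import Dict, List

# Relevance was judged on a four-point scale (0-3); a score of 2 or above
# counts as a success in the analysis described in the methodology.
SUCCESS_THRESHOLD = 2


@dataclass
class Query:
    """One logged query; the field name is hypothetical, not from the study."""
    relevance_scores: List[int]  # one 0-3 assessment per judged document


def query_success(query: Query) -> bool:
    """True if at least one retrieved document scored 2 or above."""
    return any(score >= SUCCESS_THRESHOLD for score in query.relevance_scores)


def session_success(session: List[Query]) -> bool:
    """True if the session contains at least one successful query."""
    return any(query_success(query) for query in session)


def summarise(sessions: List[List[Query]]) -> Dict[str, float]:
    """Compute Table 1 style measures for a list of non-empty sessions."""
    queries = [query for session in sessions for query in session]
    n_sessions = len(sessions)
    # Every query after the first in a session is counted as a reformulation.
    reformulations = [len(session) - 1 for session in sessions]
    return {
        "query_success_pct": 100 * sum(map(query_success, queries)) / len(queries),
        "session_success_pct": 100 * sum(map(session_success, sessions)) / n_sessions,
        "sessions_with_reformulations_pct":
            100 * sum(r > 0 for r in reformulations) / n_sessions,
        "mean_reformulations": sum(reformulations) / n_sessions,
        "mean_queries_per_session": len(queries) / n_sessions,
    }


# Invented toy data, not taken from the search log.
toy_sessions = [
    [Query([0, 1]), Query([2, 0])],        # succeeds on the second query
    [Query([0])],                          # no reformulation, no success
    [Query([1]), Query([0]), Query([3])],  # succeeds on the third query
]
toy_summary = summarise(toy_sessions)
print(toy_summary)
```

Under this counting, the mean number of queries per session is always the mean number of reformulations plus one. That matches the reported totals: 229 queries in 64 system A sessions give 229/64 ≈ 3.58 queries per session (2.58 reformulations), and 335 queries in 64 system B sessions give 335/64 ≈ 5.23 (4.23 reformulations).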
                             Total
                             System A       System B
No reformulations            69 (30.1)      62 (18.5)
Category                     -              114 (34.0)
Query words                  97 (42.4)      47 (14.0)
Document type                28 (12.2)      8 (2.4)
Search operators             8 (3.5)        5 (1.5)
>1 types simultaneously      27 (11.8)      99 (29.6)
Total                        229 (100)      335 (100)

Table 2. Types of Reformulations for All Queries (Percentages)

4.2 Combined system B sessions and queries

During the course of the search test, participants occasionally ended up assessing documents before choosing a category in system B queries. This behaviour had different causes. One was the speed of the system. Thus, in the time waiting for the system to categorize search results, some participants began to review the documents found on the basis of the initial query. On other occasions, the participants saw the document they were looking for in the results list before even deciding on a category by which to reduce search results, and they ended up assessing the initial search results without filtering them by category. We denote these searches as combined system B queries, because users had the intention of using system B but then switched to system A. Likewise, 'combined system B sessions' refers to the sessions that should have been carried out in system B, but participants assessed the relevance of documents found in system A and in system B. The following quotation serves as an illustration of combined system B searches:

    But the first time I searched, I got an e-commerce handbook. I would have preferred that to going down there ["down there" refers to the categorization window on the right hand side of the screen] (P10).

In several cases, when a highly relevant document had been discovered before the choice of a category in system B, the participants could not locate the document in the categories, which occasionally led to frustration:

    It is just as bad, because it says "arrears" and "employers", and it is neither of them. So let's see about "employers"… because it says "employers and A-taxes" And it is withhold by the A-taxes, just like our employers withhold our taxes. I simply can't find it. I know it is in there. But on the basis of this, I can't get in there because when I know where it is at, I would go directly for it instead. (P05).

A third type of behaviour also triggered combined system B queries. When the initial query resulted in very few search results, it did not seem natural to the participants to further reduce already limited search results. Some participants undertook the categorization despite the few results, while others omitted the categorization and assessed the results retrieved on the basis of the remaining search possibilities.

    It says just that ... the costs to the European border should be included in the customs value. The other one regarding transportation, I can see that it is explained with great precision. But in this case, I did not search for "customs" down here [in the categories]. I got it by searching for freight and customs value and "pages with all words." And then I got the customs guidance, which is also the one referring to the customs codes treating the rules about the amount of carriage to add. So this [document] is a three then. But I didn't get it by searching for "business imports" or "shipping" or "exports" [referring to categories] (P32).

The quotation illustrates, in a combined system B query with just two retrieval results, how the participant ends up assessing the documents retrieved without categorization. This supports the assumption put forward by Kules and Shneiderman (2004, 2) that search results must have a certain size to make categorization useful.

                             Number of sessions    Number of successful
                             in system B           sessions in system B
System B                     26 (40.6)             22 (40.7)
Combined system B sessions   38 (59.4)             32 (59.3)
Total                        64 (100.0)            54 (100.0)

Table 3. Sessions Carried Out in System B or in a combination of System B and System A: frequency and success (Percentages)

The combined system B queries and sessions were coded as system B searches inasmuch as the participants had access to the taxonomy and could be influenced by it. However, in respect of the methodology, an overview of the extent of the queries must be provided. To do this, additional codes were added to enable separation from the regular system B queries. Table 3 lists the share of combined system B sessions. The table shows that about 60% of system B sessions contained one or more queries that omitted categories. It is evident from the table that approximately 60% of the successful sessions in system B had at least one query that did not include the choice of a category. The sessions that to some degree pass over the categorization are therefore substantial.

Table 4 enlarges on combined system B sessions. The table shows the system delivering successful results for queries contained in sessions. In that way, the table addresses the sessions based on a combination of the two test systems. Although a combined system B session included queries conducted in system A and system B, the two systems did not necessarily both provide useful search results. The share of successful sessions is fairly even between the two systems: thirteen sessions were solved by omitting categories, and 15 sessions succeeded by including the categories in their queries. Only four sessions found relevant documents by means of both systems. It also means that the participants may have omitted the categorization in some queries of a session, but it may still be that relevant documents are found by means of categorization.

                       Frequency    Per cent
Task not solved        6            15.8
System A               13           34.2
System B               15           39.5
Both systems applied   4            10.5
Total                  38           100.0

Table 4. System of successful queries in combined System B Sessions (Legend: The table lists the systems that have provided documents with a relevance score of 2 or 3 in combined system B sessions. That explains why N=38.)

Table 5 expands on Table 4 and presents the share of successes at query level. Table 5 presents all queries carried out in system B, both distinct system B queries and combined system B queries. Although the participants in a number of cases found the categorization irrelevant, it was still used in approximately two thirds of the queries (see outer right hand column). In addition, when calculated in terms of the share of successful queries, queries including categories had a better performance (24.2% of queries were successful) than queries omitting categorization (16.7% of queries were successful). To sum up, in combined system B searches, more than half of system B sessions included system A queries to some extent. However, at the query level for all system B queries, queries including a category had a larger chance of succeeding compared to queries that basically corresponded to system A queries.

                             Success      Failure       Total
Queries with categories      52 (24.2)    163 (75.8)    215 (100.0)
Queries without categories   20 (16.7)    100 (83.3)    120 (100.0)
Total                        72           263           335

Table 5. System B Queries: Frequency of Category Use and Query Success (Percentages) (Legend: The table contains all queries processed in system B, both regular system B queries and combined system B queries (N=335).)

In the post search interviews, participants were asked to assess system B. In the responses, we found answers to when the categorization was useful and when it was not. The answers are analysed in this section in order to elaborate further on the results gained from the search log presented above. There was an overall agreement among the participants that the categorization was useful when they had a large set of results. P21 discussed a query with 14 results:

    It did not help me so much there because the query didn't have that many results. It was possible to cope with the documents there, whether the categorization had been there or not. Only 14 documents were retrieved. You could cope with that. It is [more] helpful when you get large results, a thousand documents or so (P21).

When the categorization was useful in terms of retrieval, set sizes varied. Some mentioned 40 documents, others like P21 mentioned far more. Categorization was also found useful in generating new perspectives on the composition of a query and for understanding the facets of the search task. That supports the decision of coding combined system B queries and sessions as system B queries and sessions in the overall coding of the search log. One example was given by P02, who would have liked to have access to the categorization in a system A session:

    At the end I would have liked to be able to go over there [into the categorization], because no matter what I did, I could not find anything. And then I need somewhere else to search where I have the option of seeing other sub-topics in order to perhaps access it that way (P02).

P09 supports this statement when discussing a system B session:

    It worked well there, because suddenly I found a principal topic that I could click on. And that gave me that … Hey! Yes! That has to do with company taxation. So it also helped me thinking what this is at all (P09).

These findings confirm Käki's (2005b) finding that categories are useful when "the original query was vague, broad, general, or contained words that have multiple meanings" (Käki 2005b, 138). Still, the present participants discussed whether categorization was more useful to people with some or no insight into the topic of the tasks. P06 knew what to look for in one of the tasks:

    I knew that if I was to look for something about the taxation then I would also know something about independent businesses. And then I could go in there faster. So I knew that I should choose "personal incomes" over "capital income" [examples of categories]. I know the tax rules. So it is easier to choose between the categories when the answer is known in advance (P06).

P20 on the other hand did not find much help from categorization:

    But I don't know if I would ever start going through all this [the categories]. I think it takes more time because I don't know what is behind. If I was a specialist in SKAT and knew all about company tax settlements or the like, then [the categorization] might be perfect for me because then I would know that I can go in there exactly, click that, and get the documents out. But I don't know if it would [omit] some documents that I need, if it limits the results too much (P20).

P24 sums up the usefulness for users with a lot of knowledge of the task topic and users with less knowledge:

    If I know what I am looking for, or at least think I know where to go [in the categories], then it is really good. But when I don't know, it might also be good because you get to try out different keywords [taxonomy terms]. But if you have the wrong keyword, you will definitely not find it that way (P24).

The difference of opinion may be due to lack of insight into system functionalities and taxonomy. Thus, a considerable number of the participants mentioned lack of experience with the test system as an important reason for difficulties experienced in locating relevant documents. The difficulties can be seen in Table 2. Here 34% of all system B reformulations consist of changing the category, meaning that participants clicked around between categories without changing the remainder of the search options. In other cases, the trouble experienced by the participants was caused by apparently curious categorizations offered by system B. One example was the presence of the taxonomy term "tonnage taxes" in a query regarding property gain taxes (P13). We have already mentioned the varying sizes of the documents in the collection and the importance of giving employees directions regarding document type. The findings suggest that in collections with large documents, the documents should be indexed in smaller units to obtain more precise search results. On the other hand, when using categorization in search results that are already very limited, as was the case in many system B searches, the results may be skewed. This may be due to lack of experience with the categorization in system B, too narrow queries or odd suggestions for categories. These reasons may explain the increased number of queries in system B sessions. P14 summarises the discussion by saying:

    Once you begin to get an idea [of] what the categories are, what they stand for … then you fumble until you find out what it is. Are there more roads leading to Rome, or which is the fastest, or …? Well, it is an adaptation with some things. What is the wisest thing to do (P14).

5.0 Limitations

We recognize that the search test has limitations. The test was methodologically challenged by the preliminary state of the test database. A running intranet might have generated different performance measures and searching behaviour among the participants. Also, we investigated the information searching of a large institution with highly specialized employees. We may not be able to apply the findings in smaller governments with generalist employees. However, the search test represents a user based and realistic evaluation of automated categorization, which adds to the limited body of knowledge within specialized e-government retrieval and indexing.

6.0 Conclusions

With the present paper we wanted to investigate the comparative performance of free text indexing (system A) and automated categorization (system B). The purpose of the study was to identify and characterize the potential role of categorization in professional e-government information retrieval. We found that free text indexing outperforms categorization when compared in terms of quantitative measures such as the number of reformulations, session success, and query success. Different causes were found for the increased effort to retrieve relevant documents in system B. Examples are trouble finding suitable categories due to lack of knowledge of the taxonomy. The taxonomy challenges were also identified in the analysis of types of reformulations in system B, where many reformulations consisted of a change of category alone. In relation to retrieval system design the results stress the importance of an appropriate and meaningful level of detail in controlled vocabularies. From the interviews we found qualitative explanations of the potential of categorization despite the differences in performance between the two systems. We found that categorization was useful: 1) if the query retrieved large sets of results; 2) in suggesting new search words for a query; and, 3) for understanding the facets of a search. On the other hand, categorization was not useful if: 1) a highly relevant result came out among the first results; or, 2) the set of results turned out to be very small. Overall, it is concluded that there is a basis for implementing categorization in information systems supporting professional e-government users. Categorization is also a valuable component of successful retrieval in this domain, supporting everyday information needs. Therefore we recommend applying categorization in e-government in combination with other search features to back different types of information needs among employees.

References

Apté, Chidanand, Damerau, Fred and Weiss, Sholom M. 1994. Automated learning of decision rules for text categorization. ACM transactions on information systems 12 no. 3: 233-51.
Borlund, Pia. 2003. The IIR evaluation model: A framework for evaluation of interactive information retrieval systems. Information research 8 no. 3. Available http://www.informationr.net/ir/8-3/paper152.html.
Carpineto, Claudio, Osiński, Stanislaw, Romano, Giovanni and Weiss, Dawid. 2009. A survey of web clustering engines. ACM computing surveys 41 no. 3: 17:1-17:38.
Chen, Hsinchun. 1995. Machine learning for information retrieval: Neural networks, symbolic learning, and genetic algorithms. Journal of the American Society for Information Science 46: 194-216.
Choo, Chun Wei. 2006. The knowing organization: How organizations use information to construct meaning, create knowledge, and make decisions, 2nd ed. New York: Oxford University Press.
Chung, EunKyung, Miksa, Shawne and Hastings, Samantha K. 2010. A framework of automatic subject term assignment for text categorization: An indexing conception-based approach. Journal of the American Society for Information Science and Technology 61: 688-99.
Dumais, Susan, Platt, John, Heckerman, David and Sahami, Mehran. 1998. Inductive learning algorithms and representations for text categorization. In Makki, K. and Bouganim, L., eds., CIKM '98 Proceedings of the seventh international conference on Information and knowledge management. New York: ACM, pp. 145-55.
Edmunds, Angela and Morris, Ann. 2000. The problem of information overload in business organisations: A review of the literature. International journal of information management 20: 17-28.
Fang, Zhiyuan. 2002. E-government in digital era: Concept, practice, and development. International Journal of the computer, the Internet and management 10 no. 2: 1-22.
Goh, Dion Hoe-Lian, Chua, Alton Yeow-Kuan, Luyt, Brendan and Lee, Chei Sian. 2008. Knowledge access, creation and transfer in e-government portals. Online information review 32: 348-69.
Grant, Gerald and Chau, Derek. 2005. Developing a generic framework for e-government. Journal of global information management 13 no. 1: 1-30.
Jaeger, Paul T. 2003. The endless wire: E-government as global phenomenon. Government information quarterly 20: 323-31.
de Jong, Menno and Lentz, Leo. 2006. Municipalities on the web: User-friendliness of government information on the Internet. In Wimmer, Maria A., Andersen, Kim Viborg, Grönlund, Åke and Scholl, Hans J., eds., Electronic government: 5th International Conference, EGOV 2006, Kraków, September 4-8, 2006. Proceedings. Berlin: Springer, pp. 174-85.
Kettunen, Kimmo and Henttonen, Pekka. 2010. Missing in action? Content of records management metadata in real life. Library & information science research 32: 43-52.
Klischewski, Ralf. 2006. Ontologies for e-document management in public administration. Business process management journal 12: 34-47.
Kopackova, Hana, Michalek, Karel and Cejna, Karel. 2010. Accessibility and findability of local e-government websites in the Czech Republic. Universal access in the information society 9: 51-61.
Koshman, Sherry, Spink, Amanda and Jansen, Bernard J. 2006. Web searching on the Vivisimo search engine. Journal of the American Society for Information Science and Technology 57: 1875-87.
Kraemer, Kenneth L. and Dedrick, Jason. 1997. Computing and public organizations. Journal of public administration research and theory 7: 89-112.
Kules, Bill and Shneiderman, Ben. 2004. Categorized graphical overviews for web search results: An exploratory study using U.S. government agencies as a meaningful and stable structure. In Proceedings of the Third

Annual Workshop on HCI Research in MIS. Washington, Sebastiani, Fabrizio. 2002. Machine learning in automated D.C.: AIS SIGHCI, pp. 20-3. text categorization. ACM computing surveys 34 no. 1: 1-47. Käki, Mika. 2005a. Enhancing web search result access with Sormunen, Eero. 2002. Liberal relevance criteria of TREC automatic categorization. Ph.D. Tampere: Department of – counting on negligible documents? In Hancock- Computer Sciences, University of Tampere. Beaulieu, Micheline, ed., SIGIR 2002 : proceedings of the Käki, Mika. 2005b. Findex: Search result categories help Twenty-Fifth Annual International ACM SIGIR Conference users when document ranking fails. In Proceedings of the on Research and Development in Information Retrieval, August SIGCHI conference on Human factors in computing systems. 11-15, 2002, Tampere, Finland. New York: Association Portland, Oregon: ACM, 131-40. for Computing Machinery, pp. 324-30. Käki, Mika and Aula, Anne. 2005. Findex: Improving Sprehe, J.Timothy, McClure, Charles R. and Zellner, search result use through automatic filtering categories. Philip. 2002. The role of situational factors in manag- Interacting with computers 17: 187-206. ing U.S. federal recordkeeping. Government information Lykke, Marianne, Price, Susan and Delcambre, Lois. 2012. quarterly 19: 289-305. How doctors search: A study of query behaviour and Tambouris, Efthimios, Manouselis, Nikos and Costopou- the impact on search results. Information processing & lou, Constantina. 2007. Metadata for digital collections management 48: 1151-70. of e-government resources. The electronic library 25: 176- Moen, William. E. 2001. The metadata approach to ac- 92. cessing government information. Government information The Danish Government, Local Government Denmark quarterly 18: 155-65. (LGDK) & Danish Regions. 2007. The Danish e- Qi, Xiaoguang and Davison, Brian D. 2009. Web page government strategy 2007-2010: Towards better digital service, classification: Features and algorithms. 
ACM computing increased efficiency and stronger collaboration. Available http:// surveys 41 no. 2: 1-31. www.modernisering.dk/fileadmin/user_upload/docu Qu, Bo, Cong, Gao, Li, Cuiping, Sun, Aixin and Chen, ments/Projekter/digitaliseringsstrategi/Danish_E- Hong. 2012. An evaluation of classification models for government_strategy_2007-2010.pdf. question topic categorization. Journal of the American So- Turmo, Jordi, Ageno, Alicia and Català, Neus. 2006. Adap- ciety for Information Science and Technology 63: 889-903. tive information extraction. ACM computing surveys 38 Sabucedo, Luis Álvarez and Rifón, Luis Anido. 2006. Se- no. 2: 4 mantic service oriented architectures for eGovernment Zamir, Oren and Etzioni, Oren. 1999. Grouper: A dy- platforms. American Association for Artificial Intelligence. namic clustering interface to web search results. Com- Available http://www.aaai.org/Papers/Symposia/ puter networks: The international journal of computer and tele- Spring/2006/SS-06-06/SS06-06-018.pdf. communications networking 31: 1361-74. Schwartz, David G., Divitini, Monica and Brasethvik, Terje. 2000. Internet-based organizational memory and knowl- edge management. Hershey, USA: Idea Group.


Brief Communication: What is Knowledge Organization?†

Ingetraut Dahlberg

Am Hirtenberg 13, 64732 Bad König, Germany

Ingetraut Dahlberg started work on thesauri and classification in the early sixties. She developed her concept theory in 1972 together with her work on the establishment of a universal classification system of knowledge fields, the Information Coding Classification, published in 1982. In 1974, she founded the journal International Classification, now known as Knowledge Organization, and was its editor for 23 years. She also founded the German Society for Classification in 1977 and chaired it until 1986. In 1989, the International Society for Knowledge Organization was founded, and she served as its president until 1996. In 1980, she founded the INDEKS Verlag, which was taken over by Ergon Verlag in 1997.

Dahlberg, Ingetraut. Brief Communication: What is Knowledge Organization? Knowledge Organization. 41(1), 85-91. 27 references.

Abstract: As an introduction, the circumstances leading to the foundation of the International Society for Knowledge Organization (ISKO) are outlined and the prerequisites for the formal and conceptual description of the scope of knowledge organization (KO) are laid out, followed by an explanation of the scheme used in the bibliography of KO. An overview is provided of the tasks and activities of this discipline. In conclusion, an urgent appeal is made to ISKO and to all active scientific societies to establish KO as an autonomous scientific discipline within the science of science, and the most urgently required tasks are indicated.

Received and Accepted 28 June 2013

Keywords: knowledge, knowledge organization, concepts, classification, ISKO

† This paper was requested by Peter Ohly for presentation at the German ISKO General Assembly on July 5, 2012. It has since been revised for inclusion in the next proceedings volume among the papers of the Potsdam ISKO Conference, March 19-20, 2013. The English translation was finalized in cooperation with Prof. Dr. Herbert Eisele, France.

1.0 How it all came about

On February 12, 1977, a group from the registered Society for Documentation (including Martin Scheele and Robert Fugmann) founded the Society for Classification in Frankfurt in order to promote the required research on the philosophical and system-relevant foundations of the methodological domain of librarians and documentalists. The protocol of the founding assembly mentions only one mathematician, the author of a book on automatic classification (Bock 1974). Twelve years later, half of the by then 200 members turned out to be mathematicians or statistically oriented people, who took over. This led those less interested in statistics to depart and to constitute a new body exclusively dedicated to concept-oriented research, going international.¹

So on July 22, 1989, ISKO, the International Society for Knowledge Organization, was set up. Its name resulted from a compromise, since there is no appropriate English equivalent for “Wissensordnung,” which mattered to us. However, the title of a book, The Organization of Knowledge and the System of the Sciences (Bliss 1929), led us to hope that the German alternative term “Wissensorganisation” would permit the innovative English “Knowledge Organization,” which to our great surprise found universal acceptance. In the meantime this brilliant term has become so hackneyed that now, almost 25 years later, the question arises what is actually to be understood by it. In order to do justice to the title of this paper it may be proper to return to the roots, viz. to the customary notion of classification, which also covers a variety of meanings. Indeed, this polyseme refers 1) to “classis facere” (arranging in classes), and 2) to assigning to a class, i.e. the attribution of classes to real objects (referents), which is what is generally understood by classifying. Moreover, the term also comprises 3) the result of 1), i.e. the classification system, and 4) the result of 2), i.e. the classified object. In addition, “classification” also denotes 5) a didactic discipline (subject of study).

In German, it is possible to associate “knowledge” (meaning of course “generally accepted knowledge”) with “organization” since “organization” includes objects, whereas in some other languages “organization” refers primarily to corporate bodies. This notwithstanding, the compound finally met with general acceptance.

“Generally accepted knowledge” carries the seal of science, resulting from verifiable dicta or else from intersubjective agreement in the form of generally accepted definitions, as opposed to subjective knowledge acquired by experience or learning. In the latter meaning, knowledge serves as a kind of spiritual warrant, which means that reminiscence depends on remembered data. This explains why people differ in opinion on identical phenomena, for each relies on different angles of vision and items of recollection. Generally speaking, the smaller the shared basis of experience AND education, the more difficult the understanding. Our knowledge condenses itself in concepts by their informative content. Concepts are therefore knowledge units and form the elements of systems of knowledge (Wissensordnungen) (cf. infra).

2.0 First prerequisite: concepts as elements of systems of knowledge

True understanding of concepts has hitherto been jeopardized by ignorance of their very nature, viz. that they form the constituents of any knowledge organization, which also leads to the formation of classes. The linguistic aspect hinders most colleagues from perceiving the indispensable analytical aspect of concept formation and concept apprehension. Therefore, a handy concept theory is needed. My endeavours to set out such a theory in a number of publications (e.g. Dahlberg 1974a, 1979, 1987, 2009) and to make it plausible have, to my great regret, so far been in vain. I nevertheless venture again to show below how to define knowledge units.

Take any object, concrete or abstract, and figure out its essential characteristics by formulating “is”-statements. The synthesis of all characteristics thus determined under a name or a code depicts the object’s content in an abbreviated form and serves to designate the respective object. The definition of a concept is therefore the résumé of its content-determinant characteristics. I have often pictured this as a triangle: at the top the respective referent, in the left corner the characteristics, in the right corner its name or designation. The truth-proof of this method obviously depends on how far it conforms to general acceptation, including its coincidence with extant definitions in dictionaries and encyclopedias.

The most substantial or essential characteristic indicates the hierarchical relationship of an object, e.g. a wardrobe is a piece of furniture; a swan is a large water-bird; a computer is a data-processing machine. It thus brings out in the first place the respective hyperonyms, i.e. the higher-class concepts (piece of furniture, water-bird, data-processor). There are also characteristics which specify a given case etc., so as to discover the respective hyponym (lower-class concept), which can also be represented otherwise, leading down the whole range of the conceptual hierarchy to the individualizing characteristics of space and time. When Kant speaks of analytical or synthetic judgements, he refers to the relatively implicit characteristics of a hierarchy as against the specifying characteristics of a sub-concept. The determination of necessary characteristics, i.e. knowledge elements, which aggregate to a knowledge unit, constitutes a concept-forming event, with the possible result that concepts of similar or analogous characteristics can form an inter-relationship between concepts.

However, this kind of relationship leads to concepts relying on purely formal aspects (similar/dissimilar; inclusive/exclusive etc.), which are helpful for some purposes, but for the construction of a conceptual classification scheme four different content-determinant types of relationships of concepts are needed:

– the abstraction relationship of genus-species
– the partitive relationship of whole/part-of
– the complementary or opposition relationship
– the function-related relationship, generating a sort of syntax

Only the third relationship does not provide hierarchies, and the fourth only sometimes, as opposed to the first two.

The function-related, grammatical or syntax relationship shows up e.g. in the subdivision of a subject field when proceeding by an element location plan, as indicated in the next section; in this case, each subject field includes a logical subject and a logical predicate with possible complements. The hierarchy proceeds from the partitive relationship since the substructures of a subject field are its components. The complementary or opposition relationship applies to the opposition of objects and/or their qualities.

It may be noted that the four relationships produce definitions whenever these appeal to genus-species relationships, whole-part relationships, opposition relationships or function relationships. Dictionaries are mainly concerned with genus-species definitions, sometimes with whole-part definitions, and rather seldom with function-related definitions, which concern referents with their eventual incidents. Hence the handling of concepts, particularly with regard to their characteristics, is essential to any systematic work in knowledge ordering, for the characteristics link the concepts within a subject field and also with the concepts of other subject fields by systemic elements.²

The various hierarchy-forming relationships which appear in such systems show that classification systems based on these principles are self-explanatory, like a definition system. If the work has been properly done, such systems are very useful for science as well as for every searcher keen on exploring the relations and whereabouts of the items searched.

The recognition of this first prerequisite for an analytical understanding of concepts will considerably ease the task of organizing knowledge.

3.0 Second prerequisite: structural elements of knowledge organization

Every builder knows that a large building calls for solid foundations and beams. The development of classification (cf. Shamurin 1967) started in ancient Egypt at the very point where we now are in cyberspace, viz. the simple word designating an object. This was replaced later, in the Middle Ages, by domain designations leading to the so-called Septem Artes, and finally the main classes of a universal system became disciplines, as is still the case with the six main universal classification schemes. However, the Indian mathematician and librarian S. R. Ranganathan introduced in his Colon Classification scheme of 1933 a structural element which he called facet. It was taken up after World War II by a variety of exemplary systems in England, where it became quite common, to the point of structuring a thesaurus (Aitchison et al. 1969). In Germany, Martin Scheele used it for his extended biological documentation. Nevertheless, nobody has ever ventured to build a universal ordering scheme that dispenses with disciplines as main classes for its support, not to mention improbable thesauri gone alphabetical.

The scheme I developed, the Information Coding Classification (ICC) (Dahlberg 1982a), which so far refers exclusively to knowledge fields, relies on general object areas of being, underscored by integrating layers of the real world. These allow, besides genuine disciplines, for eventual subdivisions that do not yet qualify for recognition as scientific disciplines. In addition, the ICC relies on the Aristotelian categories, which distinguish object areas in their subdivisions similarly to facets, viz. by a structural element position plan (Elementstellenplan) called “Systematifier” (Dahlberg 1996).³ Such a scheme reserves for each subject field subsequent subdivisions, for which the scientific criterion has been retained, whereby knowledge fields are characterized by having their own object as well as their proper methods and, if they are well established as fairly developed scientific fields/disciplines, in most cases also their theoretical foundations, applications and widespread usage. The ICC subject fields were ordered according to criteria common to many reference works and syllabuses by the following facets:

The digital scale – Systematifier of knowledge fields

1. General and theoretical prerequisites
2. Objects and their components
3. Methods and techniques
4. to 6. Special characterisations
7. Influence of other domains on this field
8. Application of this field’s methods to other fields
9. Ambit of respective knowledge field and info on it

Positions 1-3, which by their object and methods represent a sort of syntax, constitute a knowledge field; positions 4-6 show its peculiarities; and positions 7-9 refer to the field’s environment.

It may seem at first sight that this kind of representation narrows the concepts and classes of a knowledge field. However, experience with the building of the 6,500 knowledge fields of the ICC down to the 6th digital level shows that no problem of the sort has yet arisen with the classification of themes (for book titles or articles in periodicals). Positions 1, 8 and 9 permit extensive combinations with other knowledge fields, which shows the scheme’s perfect inter-connectivity.

4.0 The scope of knowledge organization

I considered it essential to expound on the above prerequisites for knowledge systems prior to answering the title question, for they show the way in which my apparent programme has developed. In fact, most of the required data and tasks had been presented in my 1973 dissertation (Dahlberg 1974b). A first offspring⁴ in 1974 was the English-language periodical International Classification, renamed Knowledge Organization in 1993, which regularly includes an extensive section of bibliographical data from the most recent literature on classification. It was, and still is, presented according to the Systematifier or digital scale of 1974, with minor extensions by my succeeding editors.⁵ This class structure of the classification literature scheme has been used for ordering not only the bibliographical data of the periodical but also its systematic annual indexes up to 1996 and the three volumes published so far as the International Classification & Indexing Bibliography (ICIB) (Dahlberg 1982b). This has been maintained even after the renaming of the periodical. Therefore, the scope of our knowledge organization may be visualized through the following systematic structuring:⁶

Layout of the Classification Scheme for KO Literature

0 Form Divisions
Bibliographies in Classification and Indexing/Knowledge Organisation, Literature Reviews, Universal Classification Systems, Periodicals and Serials, Proceedings, Textbooks, Other monographs, Standards.

1 Theoretical Foundations & General Problems
Order & Knowledge Organization (KO), Conceptology & KO, Mathematics in KO, Systems Theory in KO, Psychology, Sociology & KO, Problems & Research in KO, History of KO

2 Classification Systems & Thesauri, Structure & Construction
General Questions, Structure & Elements of KO Systems, Construction of Classification Systems & Thesauri, Relationships, Numerical Taxonomy, Notation, Codes, Maintenance, Updating & Storage of KO Systems & Thesauri, Compatibility/Interoperability and Concordances between Indexing Languages, Evaluation of KO Systems & Thesauri

3 Methodology of Classing & Indexing
Theory of Classing & Indexing, Subject Analysis, Classing & Indexing Techniques, Computer assisted (automatic) Classing & Indexing, Manual & Automatic Order Techniques, Coding, Reclassification, Index Generation & Programs, Evaluation of Classing & Indexing

4 On Universal Classification Systems & Thesauri
General Questions, On the Universal Decimal Classification, On the Dewey Decimal Classif., On the Library of Congress Classif., On the Bliss Classif., On the Colon Classif., On the Library Bibliographical Classif., On other Universal Classif. Systems & Thesauri

5 On Special Objects Classifications
(the order follows the nine-layer structure of the ICC and its subdivisions)

6 On Special Subjects Classifications & Thesauri
(the order follows the nine-layer structure of the ICC and its subdivisions)

7 Knowledge Representation by Language & Terminology
General Problems of Natural Language in Relation to KO, Semantics, Automatic Language Processing, Grammar Problems, Online Retrieval Systems & Technologies, Lexicon, Dictionary Problems, Problems of Terminology, Subject-oriented Terminology Work, Problems of Multilingual & Cross-Language Systems and Translation of Schemes.

8 Applied Classing & Indexing
General Problems, Guidelines, Rules, Consistency, Classing and Indexing of Data, Titles, Primary and Secondary Literature. Non-Book Materials, Back-of-the-Book, Subject-field Glossaries, Indexing, and Indexing in certain languages

9 Knowledge Organisation Environment⁷
Professional & Organisational Problems, Persons & Organisations in KO, Organisation of Classification & Indexing on a National & International Level, Education & Training in KO, Policy & Legal Questions, Economics in KO, User Studies, Standardization in KO Work.

Owing to its great applicability, the scope of KO is extremely large if one considers that, e.g., the six universal classification schemes cited cover, so to speak, the whole conceptual knowledge of mankind; however, what matters here is the professional acumen with which concepts are collected, processed and ordered. This also applies to the taxonomies in all subject fields as well as to all expert thesauri built in all disciplines in the most important countries. Considering the Linnaean taxonomies, which for more than two centuries have widely sustained biological research, one cannot help adjusting taxa to modern findings; however, this does not mean that one should renounce the fundamental ordering scheme.

5.0 What would be the answer to the question in the title?

It could be summed up in the following way. Knowledge organization presupposes, on the one hand, cognizance of the concepts/knowledge units under review as well as of the related system-theoretical issues connected with structuring concepts and classes of concepts, so that as a result professionally acceptable ordering schemes may be obtained for the scientific world. On the other hand, applications of KO work rely on the elements of KO for all possible tasks in various branches of the art, dealing with all sorts of objects and subjects, including the contents of all kinds of documents, films, videos, etc., and also items from museums collected by name, title or code for further investigation. In this respect it must be clear that Knowledge Management (KM) lies outside the scope of KO, although KM may well use the results of a subject-conform KO.
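To make the notions of knowledge unit and concept relationship used above more tangible, the following minimal Python sketch models a concept as a designation plus its “is”-statements and records the four content-determinant relationship types named in section 2.0. It is an illustration only, not part of the ICC or of any published tool: the class names `Concept` and `Relation`, the method names, and the characteristics given for “water-bird” and “bird” are assumptions introduced here; only the swan example is taken from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Relation(Enum):
    """The four content-determinant relationship types (names are illustrative)."""
    GENUS_SPECIES = "abstraction (genus-species)"
    WHOLE_PART = "partitive (whole/part-of)"
    OPPOSITION = "complementary or opposition"
    FUNCTION = "function-related (syntax)"


@dataclass
class Concept:
    designation: str            # name or code (right corner of the triangle)
    characteristics: List[str]  # "is"-statements (left corner of the triangle)
    links: List[Tuple[Relation, "Concept"]] = field(default_factory=list)

    def relate(self, relation: Relation, other: "Concept") -> None:
        """Record a relationship from this concept to another one."""
        self.links.append((relation, other))

    def definition(self) -> str:
        """Summarize the content-determinant characteristics under the name."""
        return f"{self.designation}: " + "; ".join(self.characteristics)

    def hypernyms(self) -> List[str]:
        """Follow genus-species links upward and list the broader concepts."""
        chain: List[str] = []
        current = self
        while True:
            broader = [c for r, c in current.links if r is Relation.GENUS_SPECIES]
            # Stop at the top of the hierarchy or if a cycle is detected.
            if not broader or broader[0].designation in chain:
                return chain
            current = broader[0]
            chain.append(current.designation)


# Hypothetical example data; only "a swan is a large water-bird" comes from the text.
bird = Concept("bird", ["is an animal"])
water_bird = Concept("water-bird", ["is a bird", "lives on or near water"])
swan = Concept("swan", ["is a water-bird", "is large"])
swan.relate(Relation.GENUS_SPECIES, water_bird)
water_bird.relate(Relation.GENUS_SPECIES, bird)

print(swan.definition())  # swan: is a water-bird; is large
print(swan.hypernyms())   # ['water-bird', 'bird']
```

In this sketch the genus-species link is stored on the narrower concept and points to the broader one, so walking the links yields the hyperonym chain; the other three relationship types are stored the same way but are not traversed hierarchically here.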

As regards the development of KO as such, it may be observed that the roots evoked in the first section above, viz. traditional classification, still hover over the literature on KO. However, owing to informatics and data processing, where the content aspect of data is more and more acknowledged and many a wheel is invented anew, a new terminology has developed as a by-product, proposing unfortunate designations such as “ontology” for KO systems and “metadata” for concepts and concept classes.

6.0 KO as a discipline in its own right

The editors of Knowledge Organization, with the joint aid of the UDC (Universal Decimal Classification) and DDC (Dewey Decimal Classification) editors Ia C. McIlwaine and Joan S. Mitchell as guest editors, produced as no. 2/3 of 2008 an issue which also deals with the question: “What is Knowledge Organization?” Apart from the articles by Birger Hjørland on the question “What is Knowledge Organization,” Joseph T. Tennis on “Epistemology, Theory, and Methodology in Knowledge Organization. Toward a Classification, Metatheory, and Research Framework,” María J. López-Huertas on “Some Current Research Questions in the Field of Knowledge Organization,” Claudio Gnoli on “Ten Long-Term Research Questions in Knowledge Organization,” Rebecca Green on “Relationships in Knowledge Organization” and Marcia Lei Zeng on “Knowledge Organization Systems (KOS),” the issue also contained my interview on a series of questions with which I dealt in December 2007 (Dahlberg 2008). Question number 8 concerned the issue “What needs to happen in the field for it to gain widespread acceptance as a scientific discipline?” To this I confessed (probably to the great dismay of the two librarian colleagues) that I thought it necessary to take KO out of librarianship and documentation and to accommodate it within the science of science,⁸ for other domains such as zoology, botany and microbiology have long been confronted with taxonomic issues (classification of objects), as have the more recent classifications of commodities produced in the course of the last century, patents and official statistics,⁹ besides the results of the many terminological diploma studies carried out in some countries with their systematic representation of the terms of given knowledge fields (cf. Budin 1996) etc.

This would permit KO to interconnect such conceptually and methodologically relevant disciplines, while itself approaching scientific standards, thus justifying its claim to be regarded as a scientific discipline in its own right. Concomitantly, its findings and methods could generally be accommodated in other fields (cf. Dahlberg 1994 and 2006). Already in 1974 the ICC reserved the first position for the science of science under the ontical rubric 8 – Knowledge & information – so as to put KO at position 814.

It seems to me that ISKO should long since have engaged in a series of scientifically relevant tasks, such as looking after its own terminology by assessing and collecting relevant terms in the many contributions in its publications, in order to gain an overview that permits one to see where boundaries should be drawn and what is off-limits, and to focus on the very issues of KO, as I suggested a while ago (in Dahlberg 2009 and 2010a). In fact, there lies ahead an exemplary exploitation of sources for an institute of KO open to all knowledge fields. It may be that ISKO would be overtaxed by such a huge challenge. This is why I believe that the time has come to establish an academy for KO, or at least an institute in every major country, so that scientists of the various disciplines, terminologists and experts in KO could work together and achieve, by means of the above-mentioned prerequisites, systematic concept exploration. Such work in such an institute would be fruitful not only for KO but also for science as a whole, in view of the many open issues confronting whoever is engaged in the field of KO.

7.0 Overcoming the present situation in the field of universal classification

At present, the editors of the various universal classification schemes are entangled in updating their structurally completely outdated systems, inherited partly from the 19th century, or, as is the case with the Library of Congress Classification, locked up in pre-combined concepts and obliged to continually add book after book to its initial 30-volume edition, instead of drawing a line and building a modern scheme according to the hitherto valid theories and principles developed and presented in Dahlberg (2010b and 2012). “Interoperability” (cf. Boteram et al. 2011) of all extant systems should not be a problem in the age of automatic processing⁹ and would be a worthwhile task for the envisaged institutes for KO at universities or other scientific bodies. Any user trying to find by verbal access a solution to a concept or matter will be better off if he can rely on a properly built classification system which allows him to understand the whereabouts of his query, instead of being confused and angry over multiple “hits” with no bearing.

ISKO, as an international society engaged by its statutes in the tasks discussed here, has reached a point at which it has to decide, in matters of function, whether or not to move towards formally setting up its activity under an official “discipline.” Furthermore, it has to decide whether or not this would mean envisaging practical cooperation with all facilities working in classification, taxonomy and KO,¹⁰ as well as collaborating with the more formally working mathematicians and statisticians and/or the protagonists of the “conceptual knowledge processing” of Professor Rudolf Wille’s school at Darmstadt, etc., all of which I laboured on in my published “desiderata.” Indeed, all the considerations developed above condense in the 10 desiderata which I presented during the German ISKO Conference 2009 in Bonn (Dahlberg 2011 and 2013). They should not fall into oblivion, provided that ISKO’s members have some real zeal for the cause of KO and for an adequate striving for order in knowledge. Already 51 years ago R. Fugmann called for order as the first and foremost requisite in documentation (Fugmann 1962). Order is also a matter of love, at least love for clarification’s sake, the actual pursuit of KO, not to forget love of beauty in any order and, last but not least, love for truth, the gist of all science. I heartily wish that this will eventually germinate.

Notes

1. Nota bene: The Society for Classification also prospered thereafter, while continuing with its group of librarians. Perhaps this rift came from a former animosity between librarians and documentalists?
2. Cf. the valuable contribution from the Philosophisches Institut Düsseldorf on systems (Diemer 1968).
3. A panel of the main ICC rubrics appears in many a publication of mine, e.g. Dahlberg (1994 and 2006).
4. Precedents were findings (since 1959) in the documentation of atomic energy (Gmelin-Institut, Prof. E. Pietsch); 7 years of “Documentation of Documentation” under the Gesellschaft für Dokumentation, including setting up a first thesaurus on this domain (1963), as well as a system of descriptors (1967); collaborating with the Fédération Internationale de Documentation (FID), I proposed in 1968, for a committee on the innovation of UDC, an extensive classification of types of documents and their facets. Later on (in 1977 and 1989) the societies mentioned in the first section were set up (naturally together with a number of colleagues, Robert Fugmann as permanent Vice-Chairperson between 1977 and 1997), followed by the organization of annual conferences from 1977-1989 as well as organizing committees and other conferences, the establishment (for ISKO) since 1989 of local chapters in a number of countries etc. In 1977, at a seminar week in Bangalore, came the first public presentation of the ICC in India. Also from 1977 until 1987, I was entrusted with heading FID’s Classification Research Committee, which also implied the organization of various conferences, particularly the important meeting in Augsburg in 1982 (cf. Perreault 1983). In 1982–84 the ICIB volumes were published under a BMFT project, and preliminary work on a systematic and alphabetical lexicon of knowledge fields (DFG project Logstructure) began in 1976-1979, but only in 2011 did the drag and drop of some 3500 definitions of the first three hierarchical levels under this project finally take place, in the form of an Excel folder. This was in fact preliminary work for the much needed updating and completion of the whole of the 6,500 subject fields, which was possible in cooperation with Prof. Walter Koch, Graz.
5. First by G. Riesthuis (1997-2006), thereafter by Ia McIlwaine (2007-2012), after 2013 by Hur-Li Lee, as pdf files or after 1997 as a cumulative data bank.
6. A casual overview of this class system may be taken from the publications mentioned (Dahlberg 1994 and 2007). Recently it has been published under http://www.isko.org/scheme.php.
7. Under 94 we find today “cataloguing.” In devising the scheme, I had left this class empty. In my text (above) I omitted this class, as it does not belong in the scheme of KO. Cataloguing is an activity in the field of the information sciences. My dear librarian successors filled it as they wished. But “Subject cataloguing” belongs under the main rubric 3, whereas “Cataloguing of documents” is a purely formal rubrication, not contents-related or concept-oriented.
8. In our universities, science of science is, if at all, linked to theory of science, which, however, is still put under philosophy. Therefore, a complete misunderstanding will always prevail in this matter. Another point of dissent is my placing logic in its own right before mathematics. This must be so because without logic, nothing will do.
9. Remarkable in this respect is the contribution of D. Soergel, “Conceptual Foundations for Semantic Mapping & Retrieval” (Soergel 2011).
10. An enormous list of major classification schemes appears under the Wikipedia “Classification” entry.

References

Aitchison, Jean, Gomersall, Alan and Ireland, Ralph. 1969. Thesaurofacet: A thesaurus & faceted classification for engineering & related subjects. Whetstone, Leicester: English Electric Company Limited.
Bliss, Henry Evelyn. 1929. The organization of knowledge and the system of the sciences. New York: Henry Holt and Company.
Bock, Hans Hermann. 1974. Automatische Klassifikation. Theoretische und praktische Methoden zur Gruppierung und Strukturierung von Daten (Cluster-Analyse). Göttingen: Vandenhoeck & Ruprecht.
Budin, Gerhard. 1996. Wissensorganisation und Terminologie. Die Komplexität und Dynamik wissenschaftlicher Informations- und Kommunikationsprozesse. Tübingen: Gunter Narr Verlag.
Dahlberg, Ingetraut. 1974a. Zur Theorie des Begriffs. International classification 1: 12-9.

Dahlberg, Ingetraut. 1974b. Grundlagen universaler Wissensordnung. Probleme und Möglichkeiten eines universalen Klassifikationssystems des Wissens. Pullach bei München: Verlag Dokumentation.

Dahlberg, Ingetraut. 1979. On the theory of the concept. In Neelameghan, A, ed., Ordering systems for global information networks: Proceedings of the Third International Study Conference on Classification Research held at Bombay, India, during 6-11 January 1975. Bangalore: Sarada Ranganathan Endowment for Library Science, pp. 54-63.

Dahlberg, Ingetraut. 1982a. ICC - Information coding classification - principles, structure & application possibilities. International classification 9: 87-93.

Dahlberg, Ingetraut. 1982b. International classification and indexing bibliography. Frankfurt: Indeks Verlag.

Dahlberg, Ingetraut. 1987. Die gegenstandsbezogene, analytische Begriffstheorie und ihre Definitionsarten. In Ganter, Bernhard, ed., Beiträge zur Begriffsanalyse. Mannheim: BI-Wiss.-Verlag, pp. 9-22.

Dahlberg, Ingetraut. 1994. Wissensorganization—eine neue Wissenschaft? In Wille, Rudolf, ed., Begriffliche wissensverarbeitung in der Wirtschaft. Mannheim: BI-Wiss.-Verlag, pp. 225-38.

Dahlberg, I. 1996. Compatibility and integration of order systems 1960-1995: an annotated bibliography. Compatibility and Integration of order Systems (Research Seminar Proceedings of the TIP/ISKO Meeting, Warsaw, 13-15 September, 1995). Warsaw: Wydawnictwo SBP.

Dahlberg, Ingetraut. 2006. Knowledge organization—a new science? Knowledge organization 33: 11-9.

Dahlberg, Ingetraut. 2008. Interview with Ingetraut Dahlberg December 2007. Knowledge organization 35: 82-5.

Dahlberg, Ingetraut. 2009. Concepts & terms: ISKO’s major challenge. Knowledge organization 36: 169-77.

Dahlberg, Ingetraut. 2010a. Begriffsarbeit in der Wissensorganisation. In Sieglerschmidt, Jörn and Ohly, H. Peter, eds., Wissensspeicher in digitalen Räumen: Nachhaltigkeit, Verfügbarkeit, semantische Interoperabilität; Proceedings der 11. Tagung der Deutschen Sektion der Internationalen Gesellschaft für Wissensorganisation, Konstanz, 20. bis 22 Februar 2008. Würzburg: Ergon Verlag, pp. 112-23.

Dahlberg, Ingetraut. 2010b. Information coding classification. Geschichtliches, Prinzipien, Inhaltliches. Information: Wissenschaft & Praxis 61: 449-54.

Dahlberg, Ingetraut. 2011. How to improve ISKO’s standing. Ten desiderata for knowledge organization. Knowledge organization 38: 68-74.

Dahlberg, Ingetraut. 2012. A systematic new lexicon of all knowledge fields based on the information coding classification. Knowledge organization 39: 142-50.

Dahlberg, Ingetraut. 2013. Desiderata für die Wissensorganisation. In Ohly, H. Peter, ed., Wissen, Wissenschaft, Organisation. Proc. 12. Tagung d. Dt. ISKO-Sektion, Bonn, 19-21 Okt. 2009. Würzburg: Ergon Verlag, pp. 106-13.

Diemer, Alwin. 1968. System und Klassifikation in Wissenschaft und Dokumentation; Vorträge und Diskussionen im April 1967 in Düsseldorf. Meisenheim am Glan: Verlag A. Hain.

Fugmann, R. 1962. Ordnung - oberstes Gebot in der Dokumentation. Nachrichten für Dokumentation 13 n3: 120-32.

Boteram, Felix, Gödert, Winfried and Hubrich, Jessica. 2011. Concepts in context. Proceedings of the Cologne Conference on Interoperability and Semantics in Knowledge Organization, July 19-20, 2010. Würzburg: Ergon Verlag.

Perreault, J.M. and Berman, S. 1983. A dialogue on the subject catalogue: “a representative of the new left in American subject cataloguing.” Occasional papers. No. 161.

Scheele, Martin. 1967. Wissenschaftliche Dokumentation. Grundzüge, Probleme, Notwendigkeiten. Schlitz/Hessen: Verlag Dr. Martin Scheele.

Shamurin, E.I. 1967. Geschichte der bibliothekarisch-bibliographischen Klassifikation. Bd. 1 u. 2. München-Pullach: Verlag Dokumentation.

Soergel, Dagobert. 2011. Conceptual foundations for semantic mapping and semantic search. In Boteram, Felix, Gödert, Winfried and Hubrich, Jessica. 2011. Concepts in context. Proceedings of the Cologne Conference on Interoperability and Semantics in Knowledge Organization, July 19-20, 2010. Würzburg: Ergon Verlag.


Brief Communication: The Nature of Information and Its Influence in Human Cultures †

Emilia Currás

Calle O'Donnell 6, 28009 Madrid, Spain

Emilia Currás, Ph.D. in Chemistry, Professor of Information Science, introduced LIS studies to Spain and some Ibero-American countries; she is founder of SEDIC (Spanish Society for Information Science); ISKO-Spain Honorary President; Member of several Spanish LIS societies, institutes and associations. Author of several books and papers on knowledge organization.

Currás, Emilia. Brief Communication: The Nature of Information and Its Influence in Human Cultures. Knowledge Organization. 41(1), 92-96. 17 references.

Abstract: This paper begins by discussing the nature of information, where it comes from and how it is used, highlighting its role as the prime element in human development. It also considers the ways in which we understand information, whether as a process or as a phenomenon, and therefore emphasises its attributes as a means of accessing science, wisdom and the truth. Information as such does not have an identity of its own; it has to be upheld … by an object, be it material or electronic. The paper also refers to its features, such as instability, inconsistency, perpetuity, etc. A new theory of knowledge is being formulated that takes information as its paradigm; it is named “Informationisms” and is an original idea of the author. The influence of information is studied as a prime element in human culture. Information is the first element in the development of the individual and, indeed, in the development of humankind as a whole. The paper gives a general overview of the future influence of the different types of cultures, both humanistic and technological. Information is so important that it could become a factor in the annihilation of the human race from the planet Earth, but human beings have decided to transform their way of life and the direction in which their civilization should be conducted.

Received and Accepted 23 September 2013

Keywords: information, process, culture, humanity, time, energy

† It is my honour to thank very sincerely Prof. Dr. José Mª Nafría, director and aim of BIT-rum.

1.0 Introduction

Amongst the various possible subjects that occurred to me within the fields of experimental science and the humanities, I decided to choose a question of great importance today: that of information and its influence on the historical evolution of the culture of peoples. Through this subject, I shall be able to explain, in general terms, my thoughts and opinions, which, I would like to believe, contain philosophical and historical traces (Crawford et al. 2006).

Inspired by contemporary philosophers, Dr. Martínez-Fornés (2001) is of the opinion that, as Alonso Quijano died unmarried and without issue, we are all children of Sancho Panza: positivistic, realistic, with our feet on the ground, somewhat pessimistic, without a trace of spirituality. I, if you will permit me, prefer to think that I have descended from Don Quijote’s niece, and that I have inherited some genes from an idealist, a dreamer and, to some extent, an optimist. With this mixture of Don Quijote and Sancho, I shall now proceed with this paper on the subject “The nature of information and its influence in human cultures.”

In a telephone conversation, I mentioned this subject to Don Javier Lasso de la Vega, that pioneer of all pioneers in documentation sciences, and he replied: “That's not a title. It’s a complete definition.” I put down the telephone and began to think the matter over and I realized that, once again, he was right. It is a definition and, as such, it requires an explanation. Its postulate has to be based on certain principles from which the conclusion we propose can be deduced.

2.0 Nature of information

Nowadays one is bound to talk of information. We are immersed in its world. A world that is evolving, in which it is necessary to reflect upon so many attributes, characteristics, beneficial agents and contaminating factors. Today, reference is made to a theory of knowledge based on informationisms. I first used that term in an article published in the Revista de la Universidad Complutense in 1981 (Currás 1981). We are undergoing periods of great evolution.

Likewise, it is commonly stated that we have left the atomic energy age and are now entering the era of information. In previous articles, I too have accepted these statements. Today, when meditating upon this affirmation, I feel that the atomic energy era has not ended, nor is it likely to for a long time. The information era, however, has existed ever since the world has existed, or at least since life began on the planet Earth. Information has always existed and will continue to do so for centuries to come because it is the “fourth vital element.”

Human beings, be they homo sapiens, homo sociologicus or homo informaticus, tend to relate their existence and thoughts to the parameters of the present moment. It is my belief that we are confusing “information” with “information technology,” which has entirely different connotations. A more appropriate expression would be the “information technology age.” We are neither in the atomic age nor in the information technology age; we are in both and perhaps in some other. Our world is not a simple place. It is complex, with a complexity that in information to be “a physical act followed by a psychic act.” The physical act refers to a message, impact or external stimulus, whereas the psychic act involves the mental activity of perceiving and assimilating this message.

We may continue by talking of information as each piece that combines to form a whole and enables us to carry out research tasks. This definition, proposed by Klintoe (1985) in an unpublished lecture, provides a component of practical and, at the same time, transcendental utility. For it is through research that we arrive at science, and through science we obtain the truth. Subjective truth is relative, human. Objective truth is absolute and is beyond our reach.

We can likewise consider information as the first element in the search for wisdom, in a human chain process that arises from documentation. This documentation produces information, which converts into documentation and once again into information. It is a continuous and theoretically infinite process. In practice it will finish with the end of human life on Earth.

3.0 Other considerations

Information has also been defined as the process by which we receive the events of the external world, giving us the opportunity to form judgements and make decisions: economic, political, moral, scientific, and so on. We could even say that information is the consequence of documentation. It does not exist in itself; it needs documentation, a set of documents that are suitably prepared so that data, quanta of useful information, can be extracted and transmitted to whoever requires them.

The definitions of information can be divided into two main groups that refer to the nature of information when considered as a phenomenon, produced around us, independently of our ego and which we grasp either consciously or unconsciously. We can also speak of a phenomenon, produced by the background of our noosphere which surrounds us and forms the development of our daily activities. At the same time information is a process: prepared by us, from some documents, for its subsequent use. It can likewise be expressed as a process: as a consequence of documentation, that consciously the activities of the human intellect, affecting the development of Mankind, either scientifically, technically or artistically. This dual aspect of phenomenon and process provides information with a holistic character, from which it can be concluded that information is “everything”: the essence and presence of any human activity, be it conscious or unconscious. For there to be information, the following requirements have to be fulfilled: it has to be transmitted, and it has to be perceived. In other words, there has to be “communication.” Hence, we find information and communication linked in such a way that they can be confused, and many writers refer indistinctly to one or the other.

One should, however, recall that information in itself has always existed and that the new techniques, the so-called “information technologies,” are really techniques for processing, storing, reproducing and transmitting. Going back to my former line of reasoning, it would perhaps be more correct if I stated that we are in the “communication era,” in accordance with Bradford Morse (1984). Carl Keren (1984) evaluates the development of a country by how much it uses information (considered as a process). Professor Kaula (1984) from India makes reference to the development of communication channels.

4.0 The impact of information on the development of the human being

When I speak of the human being, I like to think of him in the centre of an evolutionary process, originating in the macrocosms and reaching the microcosms. Man is the element that transforms the macrocosms and the microcosms so as to use both, by assimilation, to get the noocosms.

Whatever the truth, the universe appears to be ruled by logical and exact laws that are inexorably obeyed, without admitting any degree of imprecision. Although we might consider it to be haphazard and unpredictable, this is no more than the result of our ignorance. We are still far from knowing true reality.

For our line of reasoning, we have to think of the universe as an energy potential. An energy that manifests itself in its most varied aspects, from matter or mass (in our human dimension it is a form of concentrated energy) to laser rays. This energy is not stable, but is continually evolving and changing. It transforms and goes from one form to another. For example, it is suggested that matter breaks down into energy and this into something even more subtle, information, as proposed by Prigogine (Prigogine and Stengers 2004). The process is also inverted but is not reversible. For it to change direction a mutation has to be produced. The quantity of information will be that assigned by the direction of the energy evolution process of the universe. For the time being, we have to think that we are still in the process that goes from:

matter → energy → information,

although this last is now so abundant that many philosophers and scholars are thinking in terms of a proximate mutation of the universe. There have even been forecasts of “the end of the world” if the mutation cannot be realized. This would come about if a situation of an incapacity to perform a mutation is reached, for it is clear that in each process of change of

matter → to information

and of

information → to matter,

there is a decrease in matter and an increase in information. When everything is information, there will be no possibility of a mutation and the universe will end. But when?

5.0 Humanity on the planet Earth

Coming down to the planet Earth and the humanity that inhabits it, we can apply our previous reasoning. Humanity behaves like a closed system, as it does not have other humanities with which to relate. We, of course, understand that we are referring to it in a social dimension because, in general terms, it is subject to all types of influence which, at present, we are incapable of identifying or quantifying. Because it is a closed system and now has a considerable level of information, it will either have to transmute or die. Humanity has chosen the optimistic solution, that of transmuting. We have only to look around us to realize that this is what is happening. Once again information plays the role of the vital element (Currás 1988) for the development of humanity and the Universe.

6.0 An idea of culture

Moving on now to the influence of information on the culture of peoples, I feel it is necessary to define what I mean by culture. We are in an age of tremendous confusion. Everything evolves so quickly that there is scarcely time to assimilate each innovation. People are also suddenly faced with new concepts and new words, without understanding their meaning, nor do they understand the different and changing senses of concepts, within this age of transmutation. Among these concepts is that of culture, which here we will take to be the set of acquired learning, and we pluralize its content so as to include the different disciplines of the total “human knowledge,” from humanities, history, to applied sciences, chemistry, computing. In other words, within the concept of culture are united theoretical ideas and practical applications, manifestations of reason and of spirit. This leads us to consider that there can exist a humanistic culture as well as a technological culture. Taking this into account, civilization will be the degree of culture of each people, at each given historical moment, within a process of elaboration of information.

Throughout history we can distinguish periods that are characterized by certain civilizations: the nomadic and hunting, agricultural, cattle raising, bronze and iron civilizations and so on to the oil civilization. In each, through the influence of information, intelligence increasingly predominates over the use of muscular strength. This has reached such a point that many writers speak about these questions. If these theories are confirmed, we could think that the cycle is closing, beginning with the origin of humanity as a homogeneous whole and ending by returning to that equality. Will this mean the end of the human race? In any respect, we are far from achieving that homogeneity.


7.0 Evolution of the technological culture

Certain people developed their intelligence and achieved a more comfortable and easier life. With regard to their technological cultural evolution, they began by attempting to substitute brute muscular force by certain tools: knives, axes, needles, the wheel, the pulley, and the plough. The first industrial revolution occurred over long periods of time, so long that the word “revolution” is inappropriate. Discoveries and inventions took place, but always from the basis of muscular strength, assisted and replaced by tools. Tools that man knew and used with confidence, though the job he carried out was nevertheless laborious.

With time, new tools appeared. These tools were more complicated, resulting from the discovery of the steam engine, and later electricity, and enabled man to substitute muscular energy by mechanical energy. There was a second industrial revolution in a much shorter period of time. Information had a positive effect, not only because of the greater amount produced in the long years before, but also because of the increased amount resulting from the technological advance in itself. This influenced working conditions, which changed considerably, and consequently living conditions improved.

Not a great deal of time has elapsed since then. There has been in practice, through a continuous process, a third industrial revolution, which is characterized by the use of “new technologies,” based on semiconductors, computers, laser rays, servomechanisms. Mechanical force has begun to be substituted by intelligence force. For the same reason we mentioned before, information has begun to increase, and this time almost alarmingly, to the point that, in order to assimilate and transmit this information, Humanity has found itself forced to initiate a transmutation in its associated life.

Work is becoming easier. What before was considered God's punishment is now becoming a desired and scarce commodity. Information has brought about these changes in technological culture. The greater the information, the faster the change. I feel it convenient to explain that in these cases I am referring to information as the sum of a series of products and phenomena resulting from mental activities that can be grasped and assimilated in order to realize a new technical development.

8.0 Evolution of the humanistic culture

Let us now look back on what has happened with respect to humanistic culture, reflected in a communication between peoples, in an alternating flow. It is manifested by the word, spoken, and later written, and now computers that work directly with the spoken word, thereby eliminating the intermediate step of writing. Our mind is constructed in such a way that it is easier for us to understand how humanity has evolved through language, because this is our means of communication with the external. The primitive oral tradition was slow. Information was transmitted from generation to generation, over long periods of time and in small geographical areas. The oldest person was the one who knew most and consequently was the most respected. When writing was invented the evolutionary process accelerated, although it still belonged to a privileged few, who preserved the cultural traditions. The elders were still venerated and admired for their knowledge, because the oral tradition was still in force. As a person of letters would have expressed it, these were “stories at the fireside.”

It is very interesting to see, from reading Hipólito Escolar (1998), how those periods in which the book industry has flourished, in some way or another, have coincided with stages of rapid evolution of humanity. In times of war, catastrophes, plagues and other calamities, times in which books took refuge in convents and palaces and were not easily accessible, this evolution was almost paralysed. Because of the scarcity of information the brain develops more slowly and there are fewer inventions and discoveries.

What solution is there to this dramatic panorama? We have to hope that information itself will make us realize how mistaken we are in our lives. Information is unique in that it can lead us to the human being's maximum aim, to knowledge through truth. By its intervention, our power to discriminate will increase and we shall come to understand the path we should follow to achieve well-being and harmony between peoples, with whom we are unavoidably obliged to live on this Earth. My own personal opinion is that the end of the world, of humanity, will come about when we have achieved that knowledge we strive for, through information.

9.0 Conclusions

This paper has emphasized the great importance of information in the life of the individual and, more importantly still, in the development of the culture of peoples.

References

Crawford, John C., Leahy, Jim, Holden, Jan and Graham, Sophie. 2006. The culture of evaluation in library and information. Oxford: Chandos Pub.

Currás, Emilia. 1981. ¿Estaremos en la época del informacionismo? Revista de la Universidad Complutense 2: 186-8.

Currás, Emilia. 1985. Some scientific and philosophical principles of information science. Nachrichten für Dokumentation 36: 151-4.

Currás, Emilia. 1988. La información en sus nuevos aspectos. Ciencias de la documentación. Madrid: Paraninfo.

Escolar, Hipólito. 1998. Historia del libro español. Madrid: Gredos.

Gleick, James, Rabasseda-Gacón, Juan and Lozoya, Teófilo de. 2012. La información: Historia y realidad. Barcelona: Crítica.

Goñi Camejo, Ivis. 2000. Algunas reflexiones sobre el concepto de información y sus implicaciones para el desarrollo de las ciencias de la información. ACIMED 8: 201-7. Available http://bvs.sld.cu/revistas/aci/vol8_3_00/aci05300.htm.

Kaula, P.N. 1984. The process of change in information activities. International forum on information and documentation 9 no. 4: 3-7.

Keren, Carl. 1984. On information science. Journal of the American Society for Information Science 35: 137.

Klintoe, K. 1985. Uso de la información y documentación tecnológica en las áreas de generación, transporte y distribución de energía eléctrica y de normalización. Madrid: Asociación de Investigación Industrial Eléctrica.

Martínez-Fornés, Santiago. 2001. La obsesión por adelgazar: Bulimia y anorexia. Madrid: Espasa-Calpe.

Morse, Bradford. 1984. The full meaning of communication. International information communication and education 3: 2.

Prigogine, I. and Stengers, Isabelle. 2004. La nueva alianza: Metamorfosis de la ciencia. Madrid: Alianza Editorial.

Books Recently Published

Charu C Aggarwal, Chandan K Reddy (eds.). Data clustering: algorithms and applications. Boca Raton (Florida): Chapman & Hall/CRC, 2013.

Berthold Lausen, Dirk Van den Poel, Alfred Ultsch (eds.). Algorithms from and for nature and life: Classification and data analysis. Cham (Switzerland), Heidelberg (Germany), New York, Dordrecht (Netherlands), London: Springer-Verlag, 2013.

A. Neelameghan, G. J. Narayana. Concept and expression of time: Cultural variations and impact on knowledge organization. New Delhi: Ess Ess Publications, 2013.

Jung-Ran Park, Lynne C. Howarth. New directions in information organization. Bingley, UK: Emerald Group Publishing, 2013.

Mirna Willer, Gordon Dunsire. Bibliographic information organization in the semantic web. Oxford, UK: Chandos Publishing, 2013.

Pre-release

Negley, Glenn. The Organization Of Knowledge: An Introduction To Philosophical Analysis. Whitefish, MT: Literary Licensing LLC, 2013.


Index to Volume 40 (2013) No. 1, pp. 1-80; No. 2, pp. 81-153, No. 3, pp. 153-213; No. 4, pp. 213-282; No. 5, pp. 283-362; No. 6, pp. 363-429.

ALPHABETICAL INDEX

1. Articles

Adler, Melissa, and Joseph T. Tennis. Toward a Taxonomy of Harm in Knowledge Organization Systems ...... 266

Almeida, Carlos Cândido de, Mariângela Spotti Lopes Fujita and Daniela Marjorie dos Reis. Peircean Semiotics and Subject Indexing: Contributions of Speculative Grammar and Pure Logic ...... 225

Burnett, Kathleen, and Laurie J. Bonnici. Rhizomes in the iField: What Does it Mean to be an iSchool? ...... 408

Campos, Maria Luiza de Almeida, Maria Luiza Machado Campos, Alberto M. R. Dávila, Hagar Espanha Gomes, Linair Maria Campos, and Laura de Lira e Oliveira. Information Sciences Methodological Aspects Applied to Ontology Reuse Tools: A Study Based on Genomic Annotations in the Domain of Trypanosomatides ...... 50

Channon, Martin G. The Unification of Concept Representations: An Impetus for Scientific Epistemology ...... 83

Chen, Kuan-nien. Dynamic Subject Numbers Replace Traditional Classification Numbers ...... 160

De Luca, Ernesto William. Extending the Linked Data Cloud with Multilingual Lexical Linked Data ...... 320

Desale, Sanjay K., and Rajendra M. Kumbhar. Research on Automatic Classification of Documents in Library Environment: A Literature Review ...... 295

Fedeli, Gian Carlo. Metaphors of Order and Disorder: From the Tree to the Labyrinth and Beyond ...... 375

Fóris, Ágota. Network Theory and Terminology ...... 422

Fox, Melodie J. and Austin Reece. The Impossible Decision: Social Tagging and Derrida’s Deconstructed Hospitality ...... 260

Galeffi, Agnese. The Spatial Value of Information ...... 182

Hansson, Joacim. The Materiality of Knowledge Organization: Epistemology, Metaphors and Society ...... 384

Herre, Heinrich. Formal Ontology and the Foundation of Knowledge Organization ...... 332

Hjørland, Birger. Theories of Knowledge Organization—Theories of Knowledge ...... 169

Hjørland, Birger. User-based and Cognitive Approaches to Knowledge Organization: A Theoretical Analysis of the Research Literature ...... 11

Kleineberg, Michael. The Blind Men and the Elephant: Towards an Organization of Epistemic Contexts ...... 340

Lamirel, Jean-Charles. Multi-View Data Analysis and Concept Extraction Methods for Text ...... 305

López-Huertas, María. Reflexions on Multidimensional Knowledge: Its Influence on the Foundation of Knowledge Organization ...... 400

Mai, Jens-Erik. Ethics, Values and Morality in Contemporary Library Classifications ...... 242

Marchese, Christine and Richard P. Smiraglia. Boundary Objects: CWA, an HR Firm, and Emergent Vocabulary ...... 254

Marcondes, Carlos Henrique. Knowledge Organization and Representation in Digital Environments: Relations Between Ontology and Knowledge Organization ...... 115

Marras, Cristina. Structuring Multidisciplinary Knowledge: Aquatic and Terrestrial Metaphors ...... 392

Martínez-Ávila, Daniel, and Rosa San Segundo. Reader-Interest Classification: Concept and Terminology Historical Overview ...... 102

Mazzocchi, Fulvio. Images of Thought and Their Relation to Classification: The Tree and the Net ...... 366

Moneda Corrochano, Mercedes de la, María J. López-Huertas and Evaristo Jiménez Contreras. Spanish Research in Knowledge Organization (2002-2010) ...... 28

Oikarinen, Teija, and Terttu Kortelainen. Challenges of Diversity, Consistency, and Globality in Indexing of Local Archeological Artifacts ...... 123

Ridi, Riccardo. Ethical Values for Knowledge Organization ...... 187

Rosati, Luca, Antonella Schena, and Rita Massacesi. Childhood and Adolescence Between Past and Present: Using Knowledge Organization to Bridge the Different Channels of a Cultural Institution: The Case of the Istituto degli Innocenti, Fiorenze ...... 197

Scaturro, Irene. Faceted Taxonomies for the Performing Arts Domain: The Case of the European Collected Library of Artistic Performance ...... 205


Sienkiewicz, Urszula, and Izabela Kijeńska-Dąbrowska. Knowledge Creation and Commercialization Activities in Polish Public HEUs in the Area of Technical and Engineering Sciences ...... 136

Smiraglia, Richard P. Is FRBR A Domain? Domain Analysis Applied to the Literature of The FRBR Family of Conceptual Models ...... 273

Tennis, Joseph T. Ethos and Ideology of Knowledge Organization: Toward Precepts for an Engaged Knowledge Organization ...... 42

Thellefsen, Martin, Torkild Thellefsen and Brent Sørenson. A Pragmatic Semeiotic Perspective on the Concept of Information Need and Its Relevance for Knowledge Organization ...... 213

Tredinnick, Luke. Each One of us was Several: Networks, Rhizomes and Web Organisms ...... 414

Wu, Yejun. Indexing Historical, Political Cartoons for Retrieval ...... 283

2. Book Reviews

Kumbhar, Rajendra. Library Classification Trends in the 21st Century. Witney, UK: Chandos Publishing (Oxford) Ltd.: 2012. ISBN: 1843346605, 9781843346609 ...... 62

Dupré, John. The Disorder of Things: Metaphysical Foundations of the Disunity of Science. Massachusetts; London: Harvard University Press, 1993, 308p. ISBN 0-674-21261-4 (Hb); and Human Nature and the Limits of Science. Oxford; New York: Oxford University Press, 2001, 201p. ISBN 0-19-926550-X (Pb) ...... 149

3. Reports, Communications, Features, etc.

Mazzocchi, Fulvio, and Gian Carlo Fedeli. Introduction to the Special Issue: ‘Paradigms of Knowledge and Its Organization: The Tree, the Net and Beyond’ ...... 363

Smiraglia, Richard P. ISKO 12’s Bookshelf—Evolving Intension: An Editorial ...... 3

Smiraglia, Richard P. Keywords, Indexing, Text Analysis: An Editorial ...... 155

Szostak, Rick. Speaking Truth to Power in Classification: Response to Fox’s Review of My Work; KO 39:4, 300 ...... 76

Williamson, Nancy J. Paradigms and Conceptual Systems in Knowledge Organization, the Eleventh International ISKO Conference, Rome, 2010 ...... 64

Xiao, Guohua. A Knowledge Classification Model Based on the Relationship Between Science and Human Needs ...... 77


Publisher

ERGON-Verlag GmbH, Keesburgstr. 11, D-97074 Würzburg
Phone: +49 (0)931 280084; FAX +49 (0)931 282872
E-mail: [email protected]; http://www.ergon-verlag.de

Editor-in-chief (Editorial office)

Dr. Richard P. SMIRAGLIA (Editor-in-Chief), School of Information Studies, University of Wisconsin, Milwaukee, Northwest Quad Building B, 2025 E Newport St., Milwaukee, WI 53211 USA.
E-mail: [email protected]

Instructions for Authors

Manuscripts should be submitted electronically (in Word format) in English only via email to the editor-in-chief and should be accompanied by an indicative abstract of 150 to 200 words. Manuscripts of articles should fall within the range 6,000-10,000 words. Longer manuscripts will be considered on consultation with the editor-in-chief.

A separate title page should include the article title and the author’s name, postal address, and E-mail address, if available. Only the title of the article should appear on the first page of the text.

To protect anonymity, the author’s name should not appear on the manuscript, and all references in the body of the text and in footnotes that might identify the author to the reviewer should be removed and cited on a separate page.

Criteria for acceptance will be appropriateness to the field of the journal (see Scope and Aims), taking into account the merit of the contents and presentation. The manuscript should be concise and should conform as much as possible to professional standards of English usage and grammar. Manuscripts are received with the understanding that they have not been previously published, are not being submitted for publication elsewhere, and that if the work received official sponsorship, it has been duly released for publication. Submissions are refereed, and authors will usually be notified within 6 to 10 weeks.

The text should be structured by numbered subheadings. It should contain an introduction, giving an overview and stating the purpose, a main body, describing in sufficient detail the materials or methods used and the results or systems developed, and a conclusion or summary.

Footnotes are accepted only in rare cases and should be limited in number; all narration should be included in the text of the article. Paragraphs should include a topic sentence and some developed narrative; a typical paragraph has several sentences. Italics may not be used for emphasis. Em-dashes should not be used as substitutes for commas.

Italics should not be used for emphasis. Em-dashes should be used as substitutes for commas. Paragraphs should include a topic sentence and some developed narrative. A typical paragraph has several sentences.

Reference citations within the text should have the following form: (Author year). For example, (Jones 1990). Specific page numbers are required for quoted material, e.g. (Jones 1990, 100). A citation with two authors would read (Jones and Smith, 1990); three or more authors would be: (Jones et al., 1990). When the author is mentioned in the text, only the date and optional page number should appear in parentheses, e.g. According to Jones (1990), …

References should be listed alphabetically by author at the end of the article. Author names should be given as found in the sources (not abbreviated). Journal titles should not be abbreviated. Multiple citations to works by the same author should be listed chronologically and should each include the author’s name. Articles appearing in the same year should have the following format: “Jones 2005a, Jones 2005b, etc.” Issue numbers are given only when a journal volume is not through-paginated.

Examples:

Dahlberg, Ingetraut. 1978. A referent-oriented, analytical concept theory for INTERCONCEPT. International classification 5: 142-51.
Howarth, Lynne C. 2003. Designing a common namespace for searching metadata-enabled knowledge repositories: an international perspective. Cataloging & classification quarterly 37n1/2: 173-85.
Pogorelec, Andrej and Šauperl, Alenka. 2006. The alternative model of classification of belles-lettres in libraries. Knowledge organization 33: 204-14.
Schallier, Wouter. 2004. On the razor’s edge: between local and overall needs in knowledge organization. In McIlwaine, Ia C. ed., Knowledge organization and the global information society: Proceedings of the Eighth International ISKO Conference 13-16 July 2004 London, UK. Advances in knowledge organization 9. Würzburg: Ergon Verlag, pp. 269-74.
Smiraglia, Richard P. 2001. The nature of ‘a work’: implications for the organization of knowledge. Lanham, Md.: Scarecrow.
Smiraglia, Richard P. 2005. Instantiation: Toward a theory. In Vaughan, Liwen, ed. Data, information, and knowledge in a networked world; Annual conference of the Canadian Association for Information Science … London, Ontario, June 2-4 2005. Available http://www.cais-acsi.ca/2005proceedings.htm.

Illustrations should be kept to a necessary minimum and should be embedded within the document. Photographs (including color and half-tone) should be scanned with a minimum resolution of 600 dpi and saved as .jpg files. Tables and figures should be embedded within the document. Tables should contain a number and title at the bottom, and all columns and rows should have headings. All illustrations should be cited in the text as Figure 1, Figure 2, etc. or Table 1, Table 2, etc.

The entire manuscript should be double-spaced, including notes and references.

Upon acceptance of a manuscript for publication, authors must provide a wallet-size photo and a one-paragraph biographical sketch (fewer than 100 words). The photograph should be scanned with a minimum resolution of 600 dpi and saved as a .jpg file.

Advertising

Responsible for advertising: ERGON-Verlag GmbH, Keesburgstr. 11, 97074 Würzburg (Germany).

© 2014 by ERGON-Verlag GmbH. All Rights reserved.

KO is published bi-monthly by ERGON-Verlag GmbH.
– The price for the print version is € 259,00/ann. including airmail delivery.
– The price for the print version plus access to the online version (PDF) is € 289,00/ann. including airmail delivery.


Scope

The more scientific data is generated in the impetuous present times, the more ordering energy needs to be expended to control these data in a retrievable fashion. With the abundance of knowledge now available the questions of new solutions to the ordering problem and thus of improved classification systems, methods and procedures have acquired unforeseen significance. For many years now they have been the focus of interest of information scientists the world over.

Until recently, the special literature relevant to classification was published in piecemeal fashion, scattered over the numerous technical journals serving the experts of the various fields such as:

philosophy and science of science
science policy and science organization
mathematics, statistics and computer science
library and information science
archivistics and museology
journalism and communication science
industrial products and commodity science
terminology, lexicography and linguistics

Beginning in 1974, KNOWLEDGE ORGANIZATION (formerly INTERNATIONAL CLASSIFICATION) has been serving as a common platform for the discussion of both theoretical background questions and practical application problems in many areas of concern. In each issue experts from many countries comment on questions of an adequate structuring and construction of ordering systems and on the problems of their use in opening the information contents of new literature, of data collections and survey, of tabular works and of other objects of scientific interest. Their contributions have been concerned with

(1) clarifying the theoretical foundations (general ordering theory/science, theoretical bases of classification, data analysis and reduction)
(2) describing practical operations connected with indexing/classification, as well as applications of classification systems and thesauri, manual and machine indexing
(3) tracing the history of classification knowledge and methodology
(4) discussing questions of education and training in classification
(5) concerning themselves with the problems of terminology in general and with respect to special fields.

Aims

Thus, KNOWLEDGE ORGANIZATION is a forum for all those interested in the organization of knowledge on a universal or a domain-specific scale, using concept-analytical or concept-synthetical approaches, as well as quantitative and qualitative methodologies. KNOWLEDGE ORGANIZATION also addresses the intellectual and automatic compilation and use of classification systems and thesauri in all fields of knowledge, with special attention being given to the problems of terminology.

KNOWLEDGE ORGANIZATION publishes original articles, reports on conferences and similar communications, as well as book reviews, letters to the editor, and an extensive annotated bibliography of recent classification and indexing literature.

KNOWLEDGE ORGANIZATION should therefore be available at every university and research library of every country, at every information center, at colleges and schools of library and information science, in the hands of everybody interested in the fields mentioned above and thus also at every office for updating information on any topic related to the problems of order in our information-flooded times.

KNOWLEDGE ORGANIZATION was founded in 1973 by an international group of scholars with a consulting board of editors representing the world’s regions, the special classification fields, and the subject areas involved. From 1974-1980 it was published by K.G. Saur Verlag, München. Back issues of 1978-1992 are available from ERGON-Verlag, too.

As of 1989, KNOWLEDGE ORGANIZATION has become the official organ of the INTERNATIONAL SOCIETY FOR KNOWLEDGE ORGANIZATION (ISKO) and is included for every ISKO-member, personal or institutional in the membership fee (US $ 55/US $ 110).

Rates: From 2013 on for 6 issues/ann. (including indexes) € 229,00 (forwarding costs included) for the print version resp. € 258,00 for the print version plus access to the online version (PDF). Membership rates see above.

ERGON-Verlag GmbH, Keesburgstr. 11, D-97074 Würzburg; Phone: +49 (0)931 280084; FAX +49 (0)931 282872; E-mail: ser- [email protected]; http://www.ergon-verlag.de

Founded under the title International Classification in 1974 by Dr. Ingetraut Dahlberg, the founding president of ISKO. Dr. Dahlberg served as the journal’s editor from 1974 to 1997, and as its publisher (Indeks Verlag of Frankfurt) from 1981 to 1997.

The contents of the journal are indexed and abstracted in Social Sciences Citation Index, Web of Science, Information Science Abstracts, INSPEC, Library and Information Science Abstracts (LISA), Library, Information Science & Technology Abstracts (EBSCO), Library Literature and Information Science (Wilson), PASCAL, Referativnyi Zhurnal Informatika, and Sociological Abstracts.