Communications
Hong Zhang, Linda C. Smith, Michael Twidale, and Fang Huang Gao

Seeing the Wood for the Trees: Enhancing Metadata Subject Elements with Weights

Subject indexing has been conducted in a dichotomous way in terms of what the information object is primarily about/of or not, corresponding to the presence or absence of a particular subject term, respectively. With more subject terms brought into information systems via social tagging, manual cataloging, or automated indexing, many more partially relevant results can be retrieved. Using examples from digital image collections and online library catalog systems, we explore the problem and advocate for adding a weighting mechanism to subject indexing and tagging to make web search and navigation more effective and efficient. We argue that the weighting of subject terms is more important than ever in today’s world of growing collections, more federated searching, and expansion of social tagging. Such a weighting mechanism needs to be considered and applied not only by indexers, catalogers, and taggers, but also needs to be incorporated into system functionality and metadata schemas.

Hong Zhang ([email protected]) is PhD Candidate, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign; Linda C. Smith ([email protected]) is Professor, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign; Michael Twidale ([email protected]) is Professor, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign; and Fang Huang Gao ([email protected]) is Supervisory Librarian, Government Printing Office.

Subjects as important access points have largely been indexed in a dichotomous way: what the object is primarily about/of or not. This approach to indexing is implicitly assumed in various guidelines for subject indexing. For example, the Dublin Core Element Set recommends the use of multiple elements to represent the subject in “keywords, key phrases, or classification codes.”1 Similarly, the Library of Congress practice, suggested in the Subject Headings Manual, is to assign “one or more subject headings that best summarize the overall contents of the work and provide access to its most important topics.”2 A topic is only “important enough” to be given a subject heading if it comprises at least 20 percent of a work, except for headings of named entities, which do not need to be 20 percent of the work when they are “critical to the subject of the work as a whole.”3 Although catalogers are aware of it when they assign terms, this weight information is left out of the current library metadata schemas and practice.

A similar practice applies in non-textual object subject indexing. Because of the difficulty of selecting words to represent visual/aural symbolism, subject indexing for art and cultural objects is usually guided by Panofsky’s three levels of meaning (pre-iconographical, iconographical, and post-iconographical), further refined by Layne in “ofness” and “aboutness” in each level. Specifically, what can be indexed includes the “ofness” (what the picture depicts) as well as some “aboutness” (what is expressed in the picture) in both pre-iconographical and iconographical levels.4 In practice, VRA Core 4.0 for example defines subject subelements as:

    Terms or phrases that describe, identify, or interpret the Work or Image and what it depicts or expresses. These may include generic terms that describe the work and the elements that it comprises, terms that identify particular people, geographic places, narrative and iconographic themes, or terms that refer to broader concepts or interpretations.5

Here again, no weighting or differentiating mechanism is included in describing the multiple elements. What is addressed is the “what” problem: What is the work of or about? Metadata schemas for images and art works such as VRA Core and CDWA focus on specificity and exhaustivity of indexing, that is, the precision and quantity of terms applied to a subject element. However, these schemas do not address the question of how much the work is of or about the item or concept represented by a particular keyword.

Recently, social tagging functions have been adopted in digital library and catalog systems to help support better searching and browsing. This introduces more subject terms into the system. Yet again, there is typically no mechanism to differentiate between the tags used for any given item, except for only a few sites that make use of frequency information in the search interfaces.

As collections grow and more federated searching is carried out, the absence of weights for subject terms can cause problems in search and navigation. The following examples illustrate the problems, and the rest of the paper further reviews and discusses the precedent research and practice on weighting, and further outlines the issues that are critical in applying a weighting mechanism.
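
As a rough illustration of the gap between presence/absence indexing and weighted indexing, the following minimal sketch contrasts the two kinds of search. Everything in it is hypothetical: the `Record` class, the sample titles, the 0.0 to 1.0 weight scale, and the 0.5 threshold are assumptions made for the example and are not taken from any of the schemas or systems discussed in this paper.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    # Hypothetical encoding: subject term -> weight between 0.0 and 1.0.
    subjects: dict


def boolean_search(records, term):
    """Presence/absence matching: return every record that carries the term."""
    return [r for r in records if term in r.subjects]


def weighted_search(records, term, min_weight=0.0):
    """Return records carrying the term, ordered by its weight (highest first)."""
    hits = [
        r for r in records
        if term in r.subjects and r.subjects[term] >= min_weight
    ]
    return sorted(hits, key=lambda r: r.subjects[term], reverse=True)


# Invented sample records; the weights are illustrative guesses only.
RECORDS = [
    Record("Dog in a garden", {"dog": 0.9, "tree": 0.1}),
    Record("Oak at dawn", {"tree": 1.0}),
    Record("Riverside walk", {"river": 0.6, "tree": 0.4, "people": 0.3}),
]

if __name__ == "__main__":
    # Dichotomous search cannot separate primary from peripheral matches.
    print([r.title for r in boolean_search(RECORDS, "tree")])
    # Weighted search ranks by how much each item is "of" the term.
    print([r.title for r in weighted_search(RECORDS, "tree")])
    # An arbitrary threshold keeps only items where the term is prominent.
    print([r.title for r in weighted_search(RECORDS, "tree", min_weight=0.5)])
```

The minor-subject terms stay in each record, so compound queries that depend on them remain possible; only the ordering and optional filtering change.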

■■ Examples of Problems

Exhaustive Indexing: Digital Library Collections

A search query of “tree” can return thousands of images in several digital library collections. The results include images with a tree or trees as primary components mixed with images where a tree or trees, although definitely present, are minor components of the image. Figure 1 illustrates the point. These examples come from three different collections and either include the subject element of “tree” or are tagged with “tree” by users. There is no mechanism that catalogers or users have available to indicate that “tree” in these images is a minor component.

Note that we are not calling this out as an error in the professionally developed subject terms, nor indeed in the end user generated tags. Although particular images may have an incorrectly applied keyword, we want to talk about the vast majority where the keyword quite correctly refers to a component of the image. Furthermore, such keywords referring to minor components of the image are extremely useful for other queries. This kind of exhaustive indexing of images enables the effective satisfaction of search needs, such as looking for pictures of “buildings, people, and trees” or “trees beside a river.” With large image collections, such compound needs become more important to satisfy by combinations of searching and browsing. To enable them, metadata about minor subjects is essential.

However, without weights to differentiate subject keywords, users will get overwhelmed with partially relevant results. For example, a user looking for images of trees (i.e., “tree” as the primary subject) would have to look through large sets of results such as a photograph of a dog with a tiny tree out of focus in the background. For some items that include rich metadata, such as title or description, when people look at a particular item’s record, with the title and sometimes the description, we may very well determine that the picture is primarily of, say, a dog instead of trees. That is, the subject elements have to be interpreted based on the context of other elements in the record to convey the “primary” and “peripheral” subjects among the listed subject terms. However, in a search and navigation system where subject elements are usually treated as context-free, search efficiency will be largely impaired because of the “noise” items and inability to refine the scope, especially when the volume of items grows.

Lack of weighting also limits other potential uses of keywords or tags. For example, all the tags of all the items in a collection can be used to create a tag cloud as a low cost way to contribute to a visualization of what a collection is “about” overall.6 Unfortunately, a laboriously developed set of exhaustive tags, although valuable for supporting searching and browsing within a large image collection, could give a very distorted overview of what the whole collection is about. Extending our example, the tag “tree” may occur so frequently and be so prominent in the tag cloud that a user infers that this is mostly a botanical collection.

Selective Indexing: LCSH in Library Catalogs

Although more extreme in the case of images in conveying the “ofness,” the same problem with multiple subjects also applies to text in terms of “aboutness.” The following example comes from an online library catalog in a faceted navigation web interface using Library of Congress Subject Headings in subject cataloging.7

The query “psychoanalysis and religion” returned 158 results, with 126 in “psychoanalysis and religion” under the Topic facet. According to the Subject Headings Manual, the first subject is always the primary one, while the second and others could be either a primary or nonprimary subject.8 This means that among these 126 books, there is no easy way to tell which books are “primarily” about “psychoanalysis and religion” unless the user goes through all of them. With the provided metadata, we do know that all books that have “psychoanalysis and religion” as the first subject heading are primarily about this topic, but a book that has this same heading as its second subject heading may or may not be primarily about this topic. There is no way to indicate which it is in the metadata, nor in the search interface.

As this example shows, the Library of Congress manual involves an attempt to acknowledge and make a distinction between primary and nonprimary subjects. However in practice the attempt is insufficient to be really useful since apart from the first entry, it is ambiguous whether subsequent entries are additional primary subjects or nonprimary subjects. Consequently, the search system and, further on, the users are not able to take full advantage of the care of a cataloger in deciding whether an additional subject is primary or not.

Other Systems

The negative effect of current subject indexing without weighting on search outcomes has been identified by some researchers on particular information retrieval systems. In a study examining “the contribution of metadata to effective searching,”9 Hawking and Zobel found that the available subject metadata are “of little value in ranking answers” to search queries.10 Their explanation is that “it is difficult to indicate via metadata tagging the relative importance of a page to a particular topic,”11 in addition to the problems in data quality and system implementation. The same problem
of multiple tags without weights is described:

    In the kinds of queries we have studied, there is typically one page (or at most a small number) that is particularly valuable. There are many other pages which could be said to be relevant to the query—and thus merit a metadata match—but they are not nearly so useful for a typical searcher. Under the assumption that metadata is needed for search, all of these pages should have the relevant metadata tag, but this makes the particular page harder to find.12

76 INFORMATION TECHNOLOGY AND LIBRARIES | June 2011

A. Subject: women; books; dresses; flowers; trees; . . . In: Victoria & Albert Museum (accessed Aug. 30, 2010), http://collections.vam.ac.uk/item/014962/oil-painting-the-day-dream

B. Tags: Japanese; moon; nights; walking; tree; . . . In: Brooklyn Museum (accessed Aug. 30, 2010), http://www.brooklynmuseum.org/opencollections/objects/121725/Aoi_Slope_Outside_Toranomon_Gate_No._113_from_One_Hundred_Famous_Views_of_Edo

C. Tags: Japanese; birds; silk; waterfall; tree; . . . In: Steve: The Museum Social Tagging Project (accessed Aug. 30, 2010), http://tagger.steve.museum/steve/object/15?offset=2

Figure 1. Example Images with “tree” as a Subject Item

A similar problem is reported in a recent study by Lykke and Eslau. In comparing searching by controlled subject metadata, searching based on automatic indexing, and searching based on automatic indexing expanded with a corporate thesaurus in an enterprise electronic document management system, the authors found that the metadata searches produced the lowest precision among the three strategies. The problem of indiscriminate metadata indexing is “remarkable” to the authors compared with the automatic indexing systems, because

    human indexers should be better at weighting the significance of subjects, and be more able to distinguish between important and peripheral compared with computers that base significance on term frequency.13

Indeed, while various weighting algorithms have been used in automatic indexing systems to approximate the distinguishing function, there is simply no such mechanism built in human subject
metadata indexing even though human indexers are able to do the job much better than computers.

■■ Weighting: Yesterday, Today, and Future

Precedent Weighting Practices

Written more than thirty years ago, the final report of the Subject Access Project describes how the project researchers applied weights to the newly added subject terms extracted from tables of contents and back-of-the-book indexes. The criterion used in that project was that terms and phrases with a “ten-page range or larger” were treated as “major” ones.14

A similar mechanism was adopted in the ERIC database beginning in the 1960s, with indexes distinguishing “major” and “minor” descriptors as the result of indexing. While some search systems allowed differentiation of major and minor descriptors in formulating searches, others simply included the distinction (with an asterisk) when displaying a record. Unfortunately, this distinguishing mechanism is no longer included in the later ERIC indexing data.

A system using weighted indexing and searching and still running today is the MEDLINE/PubMed interface. A qualifier [majr] can be used with a Medical Subject Headings (MeSH) term in a query to “search a MeSH heading which is a major topic of an article (e.g., thromboembolism[majr]).”15 In the search result page, each major MeSH topic term is denoted by an asterisk at the end.

Weighting Concept and the Purpose of Indexing

The weighting concept is connected with the fundamental purpose of indexing. The idea of weighting in subject indexing has been discussed in the research area of subject analysis for some time. Weighting gives indexing an increased granularity and can be a device to counteract the effect of indexing specificity and exhaustivity on precision and recall, as pointed out by Foskett:

    Whereas specificity is a device to increase relevance at the cost of recall, exhaustivity works in the opposite direction, by increasing recall, but at the expense of relevance. A device which we may use to counteract this effect to some extent is weighting. In this, we try to show the significance of any particular specification by giving it a weight on a pre-established scale. For example, if we had a book on pets which dealt largely with dogs, we might give PETS a weight of 10/10, and DOGS, a weight of 8/10 or less.16

Anderson also includes weighting as a part of indexing in the Guidelines for Indexes and Related Information Retrieval Devices (NISO TR02-1997):

    One function of an index is to discriminate between major and minor treatments of particular topics or manifestations of particular features.17

He also notes that a weighting scheme is “especially useful in high-exhaustivity indexing”18 when both peripheral and primary topics are indicated. Similarly, Fidel lists “weights” as one of the issues that should be addressed in an indexing policy.19

Metadata indexing without weighting is related to the simplified dichotomous assumption in subject indexing—primarily about/of and not primarily about/of, which further leads to the dichotomous retrieval result—retrieved and not retrieved. Weighting as a mechanism to break this dichotomy is noted by Anderson in NISO TR02-1997.20 In addition, researchers have noticed the limitations of this dichotomous indexing. In an opinion piece, Markey emphasizes the urgency to “replace Boolean-based catalogs with post-Boolean probabilistic retrieval methods,”21 especially given the challenges library systems are faced with today. It is the time to change the Boolean, i.e., dichotomous, practice of subject indexing and cataloging, no matter whether it is produced by professional librarians, by user tagging, or by an automatic mechanism. Indeed, as declared by Svenonius, “While the purpose of an index is to point, the pointing cannot be done indiscriminately.”22

Needed Refinements in Subject Indexing

The fact that weighted indexing has become more prominently needed over the past decade may be related to the shift in the continuum from subject indexing as representation/surrogate to subject indexing as access points, which is consistent with the shift from a small number of subject terms to more subject terms. This might explain why the weighting practice is applied in the above mentioned MEDLINE/PubMed system. With web-based systems, social tagging technology, federated searching, and the growing number of collections producing more subject terms, to distinguish between them has become a prominent problem.

In reviewing information users and use from the 1920s to the present, Miksa points out the trend to “more granular access to informational objects” “by viewing documents as having many diverse subjects rather than one or two ‘main’ subjects,” no matter what the social and technical environment has been.23 In recognizing this theme in the future development of information organization and retrieval systems, we argue that the subject indexing mechanism
should provide sufficient granularity to allow more granular access to information, as demonstrated in the examples in the previous section.

■■ Potential Challenges

While arguing for the potential value of weights associated with subject terms, it is also important to acknowledge potential challenges posed by this approach.

Human Judgment

Treating assigned terms equally might seem to avoid the additional human judgment and the subjectivity of the weight levels because different catalogers may give different weight to a subject heading. We argue that assigning subject headings is itself unavoidably subjective. We are already using professional indexers and subject catalogers to create value-added metadata in the form of subject terms. Assigning weights would be a further enhancement.

On the other hand, adding a weighting mechanism into metadata schemas is independent of the issue of human indexing. No matter who will do the subject indexing or tagging, either professional librarians or users or possibly computers, there is a need for weight information in the metadata records.

The Weighting Scale

In terms of the specific mechanism of representing the weight rating, we can benefit from research on weighting of index terms and on the relevance of search results. For example, the three categories of relevant, partially relevant, and nonrelevant in information retrieval are similar to the major, minor, and nonpresent subject indexing method in the examples above. Borlund notes several retrieval studies proposing more than three categories or using continuous scales instead of category rating.24 Subject indexing involves a similar judgment of relevance when deciding whether to include a subject term. More sophisticated scales certainly enable more useful ranking of results, but the cost of obtaining such information may rise.

After the mechanism of incorporating weights into subject indexing/cataloging is developed, guidelines should be provided for indexing practice to produce consistent and good quality.

Weights in Both Indexing and Retrieval System

Adding weights to subject indexing/cataloging needs to be considered and applied in three parts: (1) extending metadata schemas by encoding weights in subject elements; (2) subject indexing/cataloging with weight information; and (3) retrieval systems that exploit the weighting information in subject metadata elements. The mechanism will not work effectively in the absence of any one of them.

■■ Conclusion

This paper advocates for adding a weighting mechanism to subject indexing and tagging, to enable search algorithms to be more discriminating and browsing better oriented, and thus to make it possible to provide more granular access to information. Such a weighting mechanism needs to be considered and applied not only by indexers, catalogers, and taggers, but also needs to be incorporated into system functionality.

As social tagging is brought into today’s digital library collections and online library catalogs, as collections grow and are aggregated, and the opportunity arises for adding more metadata from a variety of different sources, including end user tagging and machine generated metadata, such weighting becomes more important than ever if we are to make productive use of metadata richness and still see the wood for the trees.

■■ References

1. “Dublin Core Metadata Element Set, Version 1.1,” http://dublincore.org/documents/dces/ (accessed Nov. 20, 2010).
2. Library of Congress, Subject Headings Manual (Washington, D.C.: Library of Congress, 2008).
3. Ibid.
4. Elaine Svenonius, “Access to Nonbook Materials: The Limits of Subject Indexing for Visual and Aural Languages,” Journal of the American Society for Information Science 45, no. 8 (1994): 600–606.
5. “VRA Core 4.0 Element Description,” http://www.loc.gov/standards/vracore/VRA_Core4_Element_Description.pdf (accessed Mar. 31, 2011).
6. Richard J. Urban, Michael B. Twidale, and Piotr Adamczyk, “Designing and Developing a Collections Dashboard,” in Museums and the Web 2010: Proceedings, ed. J. Trant and D. Bearman (Toronto: Archives & Museum Informatics, 2010), http://www.archimuse.com/mw2010/papers/urban/urban.html (accessed Apr. 5, 2011).
7. “VuFind at the University of Illinois,” http://vufind.carli.illinois.edu (accessed Nov. 20, 2010).
8. Library of Congress, Subject Headings Manual.
9. David Hawking and Justin Zobel, “Does Topic Metadata Help with Web Search?” Journal of the American Society for Information Science & Technology 58, no. 5 (2007): 613–28.
10. Ibid.
11. Ibid.
12. Ibid., 625.
13. Marianne Lykke and Anna G. Eslau, “Using Thesauri in Enterprise Settings: Indexing or Query Expansion?” in The Janus Faced Scholar. A Festschrift in Honour of Peter Ingwersen, ed. Birger Larsen et al. (Copenhagen: Royal School of Library & Information Science, 2010): 87–97.
14. Subject Access Project, Books Are for Use: Final Report of the Subject Access Project to the Council on Library Resources (Syracuse, N.Y.: Syracuse Univ., 1978).
15. “PubMed,” http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/020_760.html (accessed Nov. 20, 2010).
16. A. C. Foskett, The Subject Approach to Information, 5th ed. (London: Library Association Publishing, 1996): 24.
17. James D. Anderson, Guidelines for Indexes and Related Information Retrieval Devices, NISO-TR02-1997, http://www.niso.org/publications/tr/tr02.pdf (accessed Nov. 20, 2010): 25.
18. Ibid.
19. Raya Fidel, “User-Centered Indexing,” Journal of the American Society for Information Science 45, no. 8 (1994): 572–75.
20. Anderson, Guidelines for Indexes and Related Information Retrieval Devices, 20.
21. Karen Markey, “The Online Library Catalog: Paradise Lost and Paradise Regained?” D-Lib Magazine 13, no. 1/2 (2007).
22. Svenonius, “Access to Nonbook Materials,” 601.
23. Francis Miksa, “Information Organization and the Mysterious Information User,” Libraries & the Cultural Record 44, no. 3 (2009): 343–70.
24. Pia Borlund, “The Concept of Relevance in IR,” Journal of the American Society for Information Science & Technology 54, no. 10 (2003): 913–25.
