
Martin van der Walt University of Stellenbosch, South Africa

The Structure of Classification Schemes Used in Internet Search Engines

Abstract: The purpose of this paper is to determine some of the structural features of the classification schemes used in the directories (guides, channels) of search engines to organise information sources on the Internet. Ten search engines were examined at the main class level and the full hierarchies of a sample of three specific subjects were analysed in four of these engines, namely Excite, Infoseek, Lycos and Yahoo! It was found that there are major differences between the main classes of the search engines and those found in standard library schemes like Dewey, UDC and LCC. There are large gaps in subject coverage at main class level in the search engines and the general tendency is to use a topic-based approach in the formation of classes, rather than a discipline-based approach. The subdivision of the main classes is according to hierarchical tree structures, but a number of anomalies in this regard were identified. Another deviation from library classification theory is that various principles of division are employed to form classes at the same hierarchical level. In an analysis of citation orders many examples were found that conform to the principles followed in library classifications, but a number of inconsistencies in this regard were also noted.

1. Introduction

Search engines like Alta Vista and HOTBOT are basically computer programmes that retrieve information by means of keyword searches. Realising the limitations and frustrations of alphabetical keyword searching, a number of search engines are providing their users with an alternative way of searching, namely browsing guides, also known as directories or channels. These directories contain lists of selected and often reviewed information sources, arranged in broad subject categories, e.g. Business, Education and Sport, which are further subdivided to varying levels of specificity, thus forming a kind of classified virtual library. For the purposes of this investigation ten search engines with directories were selected: AOL.com, Excite, Infoseek, LookSmart, Lycos, Magellan, Nerd World, Snap, WebCrawler and Yahoo! With the exception of Magellan and Nerd World all of them are mentioned on the Net Search page of Netscape Navigator. Data from all the search engines were obtained during April 1998. Changes effected at a later date are not reflected in this paper. According to Marcella and Newton «the whole object of classification ... is to create and preserve a subject order of maximum helpfulness to information seekers» (1994, 3). The burning question is whether the major search engines with directories achieve this object with the schemes they have devised. Classification schemes are very useful tools for the organisation of information sources, but to function efficiently they should be based on sound principles and display certain structural features. These principles and features have been described by many writers on classification theory and are demonstrated to a greater or lesser extent in standard library classification schemes such as Dewey, UDC and the Library of Congress Classification.
The purpose of this paper is to investigate some of the main structural features of the search engine schemes in order to determine whether they conform to the principles of library classification. The assumption is made that the established principles and structural features of library

Advances in Knowledge Organization, Vol.6 (1998)

classifications, as expounded in standard recent textbooks on classification, such as those by Foskett (1996), Marcella and Newton (1994) and Rowley (1992), are valid and that the application of these principles should therefore lead to «a subject order of maximum helpfulness to information seekers». Within the constraints of time and pages set by the organisers of the conference it is not possible to deal with all aspects of the classification systems. The paper will focus on the following three aspects: subject headings and conceptual categories in the main classes, hierarchical structures, and citation order in compound subjects. Regarding citation order, only combinations of subjects with bibliographical form and place concepts will be covered. Other aspects that were investigated, but are not covered in this paper, include specificity, collocation of related classes, definition of classes, facets, citation orders for facets such as aspects, persons and time, phase relations and alphabetical indexing of terms in the class headings. Specificity and collocation were dealt with in a previous paper by the present researcher (Van der Walt, 1997).

2. Methodology

The literature about search engines and classification on the Internet contains very little information about the structure of the classification schemes in browsing directories. Three contributions of some significance are by Dodd (1996), Callery (1996) and Vizine-Goetz (1996), the last two both dealing with Yahoo! Even the online help pages of the search engines themselves shed very little light on how the classification schemes are constructed. It was therefore decided to analyse the structures displayed in a sample of the subject headings in order to establish the underlying principles. The investigation initially focused on the main classes as found on the home pages of all ten search engines. A total of 162 main class headings were analysed to isolate the concepts involved, and compared to the main classes of library classifications. Subsequently four of the engines were selected, namely Excite, Infoseek, Lycos and Yahoo! (the most prominently mentioned engines on Netscape's Net Search page), for an in-depth analysis of the full hierarchies of a sample of three specific subjects in different subject areas. The three subjects are: ballet, karate and university libraries. In each case all the subjects on every hierarchical level from the main class to the most specific subject were analysed in terms of their hierarchical and syntactic relationships. (In examples of subjects in this paper a hyphen ( - ) is used to indicate hierarchical relations in a string of terms and a colon ( : ) for syntactical relations.) This approach provided the researcher with a total of 786 subject headings that were scrutinised, in addition to the 162 main class headings. The distribution of these headings is given in Table 1.

| Subjects in sample | Excite | Infoseek | Lycos | Yahoo! | Total |
|---|---|---|---|---|---|
| Ballet | 38 | 40 | 35 | 108 | 221 |
| Karate | 104 | 41 | 42 | 191 | 378 |
| University libraries | 27 | 60 | 16 | 84 | 187 |
| Total | 169 | 141 | 93 | 383 | 786 |

Table 1. Distribution of class headings in subject hierarchies
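The marginal totals in Table 1 can be recomputed from the individual cells. The following is a minimal sketch of that arithmetic; the cell values are taken from Table 1, while the names `TABLE_1`, `row_totals` and `column_totals` are illustrative and not part of the original study.

```python
# Heading counts from Table 1: subject -> engine -> number of class headings.
TABLE_1 = {
    "Ballet": {"Excite": 38, "Infoseek": 40, "Lycos": 35, "Yahoo!": 108},
    "Karate": {"Excite": 104, "Infoseek": 41, "Lycos": 42, "Yahoo!": 191},
    "University libraries": {"Excite": 27, "Infoseek": 60, "Lycos": 16, "Yahoo!": 84},
}


def row_totals(table):
    """Total number of headings per subject (summed across engines)."""
    return {subject: sum(counts.values()) for subject, counts in table.items()}


def column_totals(table):
    """Total number of headings per engine (summed across subjects)."""
    totals = {}
    for counts in table.values():
        for engine, n in counts.items():
            totals[engine] = totals.get(engine, 0) + n
    return totals


def grand_total(table):
    """Total number of headings in the whole sample."""
    return sum(row_totals(table).values())
```

Calling `grand_total(TABLE_1)` gives the 786 headings reported in the text, and the row and column sums agree with the Total row and column of the table.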

3. Concepts in the Main Classes of Directories

Table 2 contains a list of 17 concepts occurring in the main class headings of five or more (i.e. 50%+) of the ten search engines investigated, with an indication of which terms were found in which engines. These terms can be regarded as representing the most popular subject areas about which Internet users require information, according to the editors of the directories.

| Terms in class headings | Number of engines (out of 10) |
|---|---|
| Arts/Fine arts | 5 |
| Autos/Automotive | 6 |
| Business | 9 |
| Computers/Computing | 10 |
| Education | 9 |
| Entertainment | 9 |
| Family | 5 |
| Finance/Money | 6 |
| Geographic areas | 6 |
| Health | 8 |
| Internet | 6 |
| Lifestyle/Living/Good life | 7 |
| News/Media | 9 |
| People | 6 |
| Shopping/Marketplace | 9 |
| Sport(s) | 10 |
| Travel | 8 |

Table 2. Popular subjects in main class headings of Internet search engines (engines: AOL.com, Excite, Infoseek, LookSmart, Lycos, Magellan, Nerd World, Snap, WebCrawler, Yahoo!)

In addition to the terms in Table 2 the following 16 terms were found in the main classes of at least two of the search engine directories (the number of directories indicated in brackets): Careers (4), Home (4), (For) Kids (4), Recreation (4), Reference (4), Science (4), Chat (3), Games (3), Government (3), Investing/Investment (3), Real estate (3), Society (3), Fitness (2), Hobbies (2), Humanities (2) and Social sciences (2). A further 18 subject terms, not enumerated here, each occur in only one of the directories at main class level. In some cases the actual headings consist of combinations of terms such as Business and Investing (Excite), Kids & Family (Infoseek) and Recreation and Sports (Yahoo!). Such headings were analysed into their constituent concepts. Different grammatical forms of the same concept, synonyms and near synonyms were grouped together. One of the headings in the table, namely Geographic areas, does not actually appear in any of the main classes, but was formulated by the researcher to cover a number of different headings, namely World, International, Regional and Local, all relating to this conceptual category. It is interesting to note that some of the terms in Table 2, namely Health, Computers, Finance, Travel and Business, and one of the lesser used terms, namely Fitness (which is closely associated with Health), also occur on a list of the most popular subjects about which the American public search for information, according to the 1996 End User Information Needs Study, cosponsored by the Library Journal and the infotech company UMI (Oder, 1997, S4). This indicates that the search engines seem to be on the right track with their choice of at least some of their main classes.
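The analysis of combined headings into constituent concepts, and the grouping of variant forms, can be sketched roughly as follows. This is only an illustration under stated assumptions: the study does not describe an automated procedure, the `SYNONYMS` mapping is a hypothetical list built from variants shown in Table 2 and the text, and splitting on "and" or "&" is a naive rule that would need manual review in practice.

```python
import re
from collections import Counter

# Variant forms mapped to one concept label. The variants come from Table 2
# and the text; the mapping itself is an illustrative assumption.
SYNONYMS = {
    "automotive": "autos",
    "computing": "computers",
    "money": "finance",
    "sport": "sports",
    "investment": "investing",
}


def concepts_in_heading(heading):
    """Split a combined heading into normalised constituent concepts."""
    parts = re.split(r"\s+(?:and|&)\s+", heading.strip(), flags=re.IGNORECASE)
    concepts = []
    for part in parts:
        term = part.strip().lower()
        concepts.append(SYNONYMS.get(term, term))
    return concepts


def directories_per_concept(main_classes):
    """Count in how many directories each concept occurs at main class level.

    `main_classes` maps an engine name to its list of main class headings.
    Each concept is counted at most once per engine.
    """
    counts = Counter()
    for headings in main_classes.values():
        found = set()
        for heading in headings:
            found.update(concepts_in_heading(heading))
        counts.update(found)
    return counts
```

For example, `concepts_in_heading("Kids & Family")` yields the two concepts `kids` and `family`, so a combined heading contributes to the count of each of its constituents.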

4. Search Engine Main Classes Compared to Library Classifications

A comparison of the concepts identified in the previous section with the main class headings of Dewey, UDC and LCC shows that there are major differences between the main classes of the search engine directories and those of the traditional library classifications, which are also used quite successfully and extensively for the organisation of Internet resources. A list of web guides using standard library classification schemes and lists of subject headings is provided by McKiernan (1997, 1998). One of the obvious differences is that there are large gaps in the subject coverage of the main class headings in the search engines, while the library schemes provide much better coverage of the total area of knowledge. In an analysis of the main class headings of Dewey, UDC and LCC, the present researcher identified more than 30 subject terms, even though some of these terms refer only to a part of a main class. Only four of the concepts extracted from the search engine main classes, and listed above, coincide with main classes in the library classifications, namely Arts, Science, Social sciences (DDC, UDC and LCC) and Education (LCC). A further three are mentioned in main class headings of the library schemes, but only apply to a subsection of a class: Entertainment, Sport (UDC class 7) and Recreation (LCC class G). Many of the other terms found in the library classifications are used as subject headings in the search engines, but on a lower hierarchical level. In Excite and Lycos, for instance, one has to look in the Entertainment channel to find Literature. A number of search engines place all types of libraries under Education (e.g. Excite and Infoseek), while others put them under Reference (e.g. Yahoo!). Disciplines like Education and Geography are found as subdivisions of the main class Research in AOL.com. In the case of questionable hierarchical subordinations like these examples it can take quite a lot of guesswork and browsing by trial and error to locate the subject one is looking for. A second major difference between library schemes and search engine schemes concerns the principles of division used to form the main classes.
It is well known that the library schemes follow the basic principle of classification by discipline. At least 26 of the 30 plus terms referred to above can be described as disciplines, e.g. Agriculture, Economics, Education, Mathematics and Psychology, or as groups of related disciplines, e.g. the Arts, Natural Sciences, Social Sciences and Technology. An analysis of the list of terms used in the main classes of the search engines reveals that they represent a number of conceptual categories used as principles of division at that level, namely disciplines or groups of disciplines, e.g. Arts, Education, Science, Humanities and Social Sciences; broad to relatively specific subjects usually found as subdivisions within disciplines in library classifications, e.g. Autos, Computers, Government, Health, Internet, Investing and Shopping; bibliographic form, e.g. News, Reference, Media and Chat; geographic concepts, e.g. World, Regional, Local; and target audience (a phase relation), e.g. For Kids. It was observed that this mixing of various principles of division is continued at lower hierarchical levels. This practice is contrary to the accepted principles of classification. It means that the classes at a specific level are not mutually exclusive, causing uncertainty for the browser when he has to select a category in his quest for information. Where does one for instance look for news items on computers: under Computers or News? Although some terms denoting disciplines are used as main class headings in the search engines, the general tendency is to prefer terms for objects of study such as Autos and Computers or processes such as Investing and Shopping, rather than the names of fields of study. Sometimes the discipline is even used as a subdivision under the object of study, e.g. Library and Information Science is subordinated to Libraries in Excite, Infoseek and Yahoo!
In both Excite and Infoseek general provision for academic disciplines is made under Fields of study, subordinated to their Education classes. Marcella and Newton refer to this as a «topic-based approach» (1994, p.33). The same type of approach is used in the Reader Interest Classification schemes used by some public libraries. Apart from using more popular, everyday terms, this approach also has the advantage of not scattering related materials in the way a discipline-based scheme typically does. If the main target audience of a search engine is the ordinary, non-academic person looking for information to satisfy his day-to-day needs and interests, the topical approach might indeed be valid, but then it should be followed consistently and not mixed with other principles of division at the same hierarchical level. On the other hand the interests of serious academic and professional users will probably be better served by means of a classification based on disciplines, such as the library schemes. If a search engine wants to cater for the browsing approach of these users as well as for the non-academic searcher, it should perhaps consider offering both approaches, topic-based and discipline-based, as alternatives. One of the search engines investigated, Nerd World, divides all its main classes into two broad groups, Leisure categories and Knowledge categories, that display this dual approach to some extent. However, while the Leisure categories contain mostly topical subject headings, the Knowledge categories include both topical and discipline headings, even covering the same subject area in the case of Nature and Sciences.
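The mixing of principles of division discussed in this section can be made concrete with a small check. The sketch below assigns each main class term to the conceptual category given for it in the text and reports which categories occur together at one level. The function names and the labels `discipline`, `topic`, `form`, `place` and `audience` are illustrative assumptions, not terminology fixed by the study.

```python
from collections import defaultdict

# Conceptual category of each main class term, following the examples in the
# text. The short category labels are illustrative.
PRINCIPLE_OF = {
    "Arts": "discipline", "Education": "discipline", "Science": "discipline",
    "Humanities": "discipline", "Social Sciences": "discipline",
    "Autos": "topic", "Computers": "topic", "Government": "topic",
    "Health": "topic", "Internet": "topic", "Investing": "topic",
    "Shopping": "topic",
    "News": "form", "Reference": "form", "Media": "form", "Chat": "form",
    "World": "place", "Regional": "place", "Local": "place",
    "For Kids": "audience",
}


def principles_used(headings):
    """Group the headings of one hierarchical level by principle of division."""
    groups = defaultdict(list)
    for heading in headings:
        groups[PRINCIPLE_OF.get(heading, "unknown")].append(heading)
    return dict(groups)


def is_single_principle(headings):
    """True if all headings at this level follow one principle of division."""
    return len(principles_used(headings)) <= 1
```

A level containing both Computers and News fails the check, which is the situation behind the question of where to look for news items on computers.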

5. Hierarchies

Hierarchical subdivision, progressing from the broadest to the most specific class headings, is one of the most basic structures of any classification scheme. In the help pages of the search engines it is stated explicitly (or sometimes implied in descriptions of how to browse) that subject hierarchies are provided in their directories in order to facilitate the browsing approach. In the «Learn about Yahoo!» pages it is stated, for instance, that «You can work your way through a directory like Yahoo! from the very general to the highly specific.» (http://howto.yahoo.com/chapters/8/3.html). It is clear from the hierarchies of ballet, karate and university libraries set out in Tables 3-5 below that this is indeed the case.

| Level | Excite | Infoseek | Lycos | Yahoo! |
|---|---|---|---|---|
| 1 | Entertainment | Entertainment | Entertainment | Arts and Humanities |
| 2 | Culture | Performing arts | Fine arts | Performing arts |
| 3 | Performing arts | Dance | Performance | Dance |
| 4 | Dance | Ballet | Ballet | Ballet |
| 5 | Ballet | | | Ballets |
| 6 | | | | Swan Lake |

Table 3. Hierarchies of ballet

| Level | Excite | Infoseek | Lycos | Yahoo! |
|---|---|---|---|---|
| 1 | Sports | Sports | Sports | Recreation and Sports |
| 2 | Other sports | Martial arts | Other sports | Sports |
| 3 | Martial arts | Karate | Martial arts | Martial arts |
| 4 | Karate | | Martial arts guides | Karate |
| 5 | | | | Shotokan |

Table 4. Hierarchies of karate

| Level | Excite | Infoseek | Lycos | Yahoo! |
|---|---|---|---|---|
| 1 | Careers and Education | Kids & Family | Education | Reference |
| 2 | Reference | Education | Reference | Libraries |
| 3 | Libraries | Libraries | Libraries ² | Academic libraries |
| 4 | School (libraries) ¹ | Academic libraries | | |

Table 5. Hierarchies of university libraries

¹ School apparently means university in this heading in Excite!
² Lycos does not list any sites about university libraries under libraries. Apparently the only way to find the sites in question is by means of a keyword search.

All four search engines assist the browser by displaying the successive steps followed through a hierarchy near the top of each page of the directory. The searcher can therefore see the context of the specific subject and it is easy to move back to a higher level. Excite and Lycos display their main classes on every results page, Excite at the bottom and Lycos in a black menu bar on the left hand side, as further navigational help. Only Yahoo! gives a display of the full hierarchy under each main class, referring to it as a «Sub Category Listing». The relationship between terms on successive levels of all the hierarchies investigated can be described as generic in the great majority of cases, e.g. Dance-Ballet, Martial arts-Karate, Libraries-Academic libraries. Examples of the hierarchical whole-part relationship, e.g. Science-Astronomy, U.S.-California (Yahoo!) and North America: U.S. (Infoseek), and the instance relationship, e.g. Ballets-Swan Lake and Karate-Shotokan (Yahoo!), were also found. In some cases the relationship between a heading and its superordinate heading is associative rather than hierarchical, e.g. Dance-Dancers, Ballet-Choreographers, Karate-Dojos and Sport-Athletes (Yahoo!). Some evidence of polyhierarchical structures was also found, e.g. Games: Billiards and Sports: Billiards@, Performing arts: Dance and Recreation: Dance@ (Yahoo!). The @ sign distinguishes headings which are hypertext links referring the searcher to the actual headings where web sites are listed. Arranging class headings according to hierarchical relationships, as illustrated above, provides a helpful order for the browser. However, a closer examination of the hierarchies in the search engines reveals some links in the chains of subdivisions that do not seem very logical or helpful, e.g. Entertainment-Culture (Excite) and Fine arts-Performance (Lycos).
One wonders how many information seekers will think of looking under Kids & Family (Infoseek) to find information on university libraries. It is possible in this case to start the search with the Education channel which leads the user to the same hierarchy, but even Education, which is also used in the hierarchies of Excite and Lycos, is not necessarily the most logical category to consult for libraries as a group. In the subdivisions under Education one also finds for instance headings for medical libraries (Excite, Infoseek, Lycos), law libraries (Excite, Infoseek), public libraries (Excite, Infoseek) and online libraries (Excite, Lycos), which many browsers probably would not expect to find under Education. General reference web sites like encyclopaedias and dictionaries and reference sites about business, health, law, politics and government and postal information are to be found under Careers & Education: Reference in Excite, where one would logically expect reference works about careers and education! Another feature of search engine classifications that is perhaps not conducive to efficient browsing is that at many hierarchical levels one finds headings formed by syntactical subdivision, e.g. Martial arts: Equipment (Excite), Libraries: U.S. (Infoseek), Libraries: Library funding (Lycos), Dance: Organizations, Performing arts: Magazines, and Arts: Censorship (Yahoo!), in the same alphabetical array as the hierarchically and associatively related headings. In library classifications subjects like these typically precede the subdivisions for specific types of martial arts, libraries, dance, etc.
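The hierarchical chains and the @ reference links described in this section can be modelled with two simple structures. The sketch below is an illustration only: the chains are the Yahoo! columns of Tables 3 and 4, the reference pairs are the truncated headings quoted in the text (intermediate levels are not given there and are not filled in), and the names `YAHOO_PATHS`, `REFERENCES`, `breadcrumb` and `resolve` are hypothetical.

```python
# Hierarchical chains from Tables 3 and 4 (Yahoo! column), broadest class first.
YAHOO_PATHS = {
    "Swan Lake": ["Arts and Humanities", "Performing arts", "Dance",
                  "Ballet", "Ballets", "Swan Lake"],
    "Shotokan": ["Recreation and Sports", "Sports", "Martial arts",
                 "Karate", "Shotokan"],
}

# Headings marked with "@" mapped to the heading where the sites are listed.
# Only the two-term strings quoted in the text are used here.
REFERENCES = {
    ("Sports", "Billiards"): ("Games", "Billiards"),
    ("Recreation", "Dance"): ("Performing arts", "Dance"),
}


def breadcrumb(path):
    """Render a chain using the hyphen notation for hierarchical relations."""
    return " - ".join(path)


def resolve(heading):
    """Follow @ references until a heading that lists sites is reached."""
    seen = set()
    while heading in REFERENCES:
        if heading in seen:
            raise ValueError("circular reference at %r" % (heading,))
        seen.add(heading)
        heading = REFERENCES[heading]
    return heading
```

Keeping the references in a separate mapping, rather than duplicating the subtree under both parents, mirrors the way the @ links point to one place where the sites are actually listed.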

6. Citation Order

Citation order refers to the order of concepts in a string of terms representing a compound subject. A number of guiding principles for establishing citation order in a consistent manner have been formulated in the literature of classification and indexing (e.g. Foskett, 1996, 151-155). In this section a number of examples illustrating the citation orders employed by the search engines for combinations of subject, bibliographical form and geographical area are analysed. For the sake of clarity the subject strings given here are not the full strings of headings including all levels of subdivision, but only the specific terms actually representing the compound subject. Terms that represent hierarchically superordinate concepts are omitted so that the focus is on the citation order of the relevant terms.

6.1 Subject and Form

Bibliographic form concepts such as periodicals, encyclopaedias, dictionaries and directories are often used in classification and alphabetical indexing. The citation order SUBJECT: FORM is a generally accepted principle of both classification and indexing, although there are exceptions to the rule in practice, e.g. subject bibliographies at class 016 in Dewey, where the citation order is Bibliographies: Subject. In all four search engines the citation order SUBJECT: FORM seems to be the general pattern as illustrated by the following examples: Dance: Magazines (Excite); Education: Bibliographies (Infoseek); Sports magazines (Lycos); Arts: News and media, Karate: Magazines and Chemistry: Journals (Yahoo!). The only search engine that seems to be aware of the fact that the browser might use the reverse of the chosen citation order is Yahoo! This engine provides references in the form of hypertext links, indicated by the @ sign, e.g. Bibliographies: Recreation@, that take the searcher directly to the place in the directory where the list of relevant web sites will be found. A number of examples were found where the citation order is FORM: SUBJECT, e.g. Books: Dance and Software: Dance (both in the Business category) and Television: Sports (Yahoo!). Excite classifies business directories with general reference works like encyclopaedias and dictionaries; subject dictionaries and encyclopaedias are also found with the general reference works and not with the subject. The following two examples are from Infoseek: Magazines: Art and Newspapers: Business. A few form concepts were found that the compilers of library classifications should take note of if they want their schemes to make provision for Internet resources, namely Usenet, Chat and Mailing lists.

6.2 Subject and Place

According to Ranganathan's Personality: Matter: Energy: Space: Time facet formula geographical concepts should always come at the end of a chain of subdivisions, to be followed only by time (and bibliographic form). This principle is generally followed in the standard library classifications. Yahoo! follows a directly opposite principle in that geographic concepts are always given priority above subjects. This is stated explicitly in the «Learn about Yahoo!» pages: «As long as the site is in any way regionally specific, Yahoo! makes the regional distinction.» (http://howto.yahoo.com/chapters/8/8.html). The browser should, for instance, look for sites on karate in Sweden under Sweden: Karate and for universities in France under France: Universities. However, the searcher browsing according to subject will also find the relevant geographically limited sites, because hypertext links are provided from geographical subdivisions of the subject, e.g. Universities: France@. Apparently these references are not provided consistently because no links were found under Ballet referring to : Ballet and similar strings where Ballet is in a secondary position to geographic names. It also seems that in practice the principle of priority of geographic concepts is not always applied by the classifiers. Web sites of the Atlanta Karate Club, the Canadian Shotokan Karate Association, etc. are classified under Karate-Shotokan (in the Recreation category) and not under the respective geographical headings in the Regional category. Excite, Infoseek and Lycos all adhere to the principle of place after subject, as illustrated by examples such as Libraries: U.S. and Travel: France (Infoseek). Excite classifies sites about ballet in America, Japan, Canada, etc. under Ballet and sites about karate in the USA and Illinois under Karate. However, some inconsistencies were found, e.g. France: Sports and France: Politics in Infoseek's Travel channel.
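The two citation orders contrasted in sections 6.1 and 6.2, and the reverse-order reference links, can be sketched as follows. This is a minimal illustration under stated assumptions: `LIBRARY_ORDER` follows the subject, place, form sequence described for library classifications, `YAHOO_ORDER` is an assumed reading of the Yahoo! policy of giving place priority over subject, and the function names are hypothetical.

```python
# Citation orders expressed as sequences of facet names. LIBRARY_ORDER follows
# the convention described in the text; YAHOO_ORDER is an assumed reading of
# the policy of giving geographic concepts priority above subjects.
LIBRARY_ORDER = ("subject", "place", "form")
YAHOO_ORDER = ("place", "subject", "form")


def build_heading(facets, order):
    """Join the facets that are present, following the given citation order."""
    return ": ".join(facets[name] for name in order if name in facets)


def reverse_reference(facets, order):
    """Heading under the reversed order, marked with '@' as a reference link."""
    present = [facets[name] for name in order if name in facets]
    return ": ".join(reversed(present)) + "@"
```

With the facets `{"subject": "Universities", "place": "France"}` and `YAHOO_ORDER`, `build_heading` gives the string France: Universities and `reverse_reference` gives Universities: France@. Generating the reference from the same facet data as the main heading is one way to avoid the inconsistent provision of links noted above.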

7. Conclusion

To summarise the findings of this investigation one can pose the question: What can search engine classifications learn from library classifications and vice versa? The editors of search engine directories should redesign their main classes in order to make provision for some glaring gaps in their coverage of the total field of knowledge. Additional main classes should also enable them to eliminate some of the illogical chains of hierarchical subdivision. They should decide whether their primary principle of division is to be topics or disciplines and whether they want to provide for both approaches as alternatives. They should also apply one principle of division at a time at all hierarchical levels in order to form mutually exclusive facets and other groupings. Syntactical subdivisions should be separated from true hierarchical subdivisions. Citation orders should be explicitly formulated and explained in the help pages and care should be taken to ensure the consistent application of such orders. Browsing under the reverse of the chosen citation order can be provided for by means of hypertext reference links as in Yahoo! If library classification schemes want to compete with the search engine classifications as tools for organising Internet resources they should consider ways of bringing popular subject areas, which are at present buried somewhere in their hierarchies, into more prominent positions, such as main class headings. They should also make provision for the type of form concepts associated with Internet resources, e.g. Chat and Usenet. To return to the purpose of the investigation as stated in the introduction of this paper, it is clear that the classification schemes in search engine directories follow established principles of library classification in several respects, but also have a number of deficiencies that can be rectified by applying some of these principles more consistently and using others that have been ignored up to now.

References

Dodd, D.G. (1996). Grass-roots cataloging and classification: food for thought from World Wide Web subject-oriented hierarchical lists. Lib. Res. and Tech. Serv., 40/3, 275-286.
Foskett, A.C. (1996). The subject approach to information. 5th ed. London: Library Association. 456 p.
Marcella, R., Newton, R. (c1994). A new manual of classification. Aldershot: Gower. 287 p.
McKiernan, G. (1997). Beyond bookmarks: schemes for organizing the Web, 16 December. Available: http://www.public.iastate.edu/~CYBERSTACKS/CTW.htm
McKiernan, G. (1998). Beyond bookmarks: a review of frameworks, features and functionalities of schemes for organizing the Web. Internet Ref. Serv. Q., 3/1, 69-82.
Oder, N. (1997). What does the public want to know?: Ask libraries. Libr. J., 15 Nov., S4-S6.
Rowley, J.E. (1992). Organizing knowledge: an introduction to information retrieval. 2nd ed. Aldershot: Gower. 510 p.
Van der Walt, M.S. (1997). The role of classification in information retrieval on the Internet: some aspects of browsing lists in search engines. Knowledge organization for information retrieval: Proceedings of the Sixth International Study Conference on Classification Research, 16-18 June 1997, London, 32-35. The Hague: FID.
Vizine-Goetz, D. (1996). Using library classification schemes for Internet resources. Proceedings of the OCLC Internet Cataloging Project Colloquium, 19 January 1996, San Antonio (Texas). Available: http://www.oclc.org/oclc/man/colloq/v-g.htm
