The Structure of Classification Schemes Used in Internet Search Engines

Martin van der Walt
University of Stellenbosch, South Africa

Advances in Knowledge Organization, Vol. 6 (1998)

Abstract: The purpose of this paper is to determine some of the structural features of the classification schemes used in the directories (guides, channels) of search engines to organise information sources on the Internet. Ten search engines were examined at the main class level, and the full hierarchies of a sample of three specific subjects were analysed in four of these engines, namely Excite, Infoseek, Lycos and Yahoo! It was found that there are major differences between the main classes of the search engines and those found in standard library schemes like Dewey, UDC and LCC. There are large gaps in subject coverage at main class level in the search engines, and the general tendency is to use a topic-based approach in the formation of classes, rather than a discipline-based approach. The subdivision of the main classes is according to hierarchical tree structures, but a number of anomalies in this regard were identified. Another deviation from library classification theory is that various principles of division are employed to form classes at the same hierarchical level. In an analysis of citation orders many examples were found that conform to the principles followed in library classifications, but a number of inconsistencies in this regard were also noted.

1. Introduction

Search engines like Alta Vista and HOTBOT are basically computer programmes that retrieve information by means of keyword searches. Realising the limitations and frustrations of alphabetical keyword searching, a number of search engines are providing their users with an alternative way of searching, namely browsing guides, also known as directories or channels. These directories contain lists of selected and often reviewed information sources, arranged in broad subject categories, e.g. Business, Education and Sport, which are further subdivided to varying levels of specificity, thus forming a kind of classified virtual library.

For the purposes of this investigation ten search engines with directories were selected: AOL.com, Excite, Infoseek, LookSmart, Lycos, Magellan, Nerd World, Snap, WebCrawler and Yahoo! With the exception of Magellan and Nerd World, all of them are mentioned on the Net Search page of Netscape Navigator. Data from all the search engines were obtained during April 1998. Changes effected at a later date are not reflected in this paper.

According to Marcella and Newton «the whole object of classification ... is to create and preserve a subject order of maximum helpfulness to information seekers» (1994, 3). The burning question is whether the major search engines with directories achieve this object with the schemes they have devised. Classification schemes are very useful tools for the organisation of information sources, but to function efficiently they should be based on sound principles and display certain structural features. These principles and features have been described by many writers on classification theory and are demonstrated to a greater or lesser extent in standard library classification schemes such as Dewey, UDC and the Library of Congress Classification. The purpose of this paper is to investigate some of the main structural features of the search engine schemes in order to determine whether they conform to the principles of library classification.

The assumption is made that the established principles and structural features of library classifications, as expounded in standard recent textbooks on classification, such as those by Foskett (1996), Marcella and Newton (1994) and Rowley (1992), are valid and that the application of these principles should therefore lead to «a subject order of maximum helpfulness to information seekers».

Within the constraints of time and pages set by the organisers of the conference it is not possible to deal with all aspects of the classification systems. The paper will focus on the following three aspects: subject headings and conceptual categories in the main classes, hierarchical structures, and citation order in compound subjects. Regarding citation order, only combinations of subjects with bibliographical form and place concepts will be covered. Other aspects that were investigated, but are not covered in this paper, include specificity, collocation of related classes, definition of classes, facets, citation orders for facets such as aspects, persons and time, phase relations and alphabetical indexing of terms in the class headings. Specificity and collocation were dealt with in a previous paper by the present researcher (Van der Walt, 1997).

2. Methodology

The literature about search engines and classification on the Internet contains very little information about the structure of the classification schemes in browsing directories. Three contributions of some significance are by Dodd (1996), Callery (1996) and Vizine-Goetz (1996), the last two both dealing with Yahoo! Even the online help pages of the search engines themselves shed very little light on how the classification schemes are constructed. It was therefore decided to analyse the structures displayed in a sample of the subject headings in order to establish the underlying principles.
The investigation initially focused on the main classes as found on the home pages of all ten search engines. A total of 162 main class headings were analysed to isolate the concepts involved, and compared to the main classes of library classifications. Subsequently four of the engines were selected, namely Excite, Infoseek, Lycos and Yahoo! (the most prominently mentioned engines on Netscape's Net Search page), for an in-depth analysis of the full hierarchies of a sample of three specific subjects in different subject areas. The three subjects are: ballet, karate and university libraries. In each case all the subjects on every hierarchical level from the main class to the most specific subject were analysed in terms of their hierarchical and syntactic relationships. (In examples of subjects in this paper a hyphen ( - ) is used to indicate hierarchical relations in a string of terms and a colon ( : ) for syntactical relations.) This approach provided the researcher with a total of 786 subject headings that were scrutinised, in addition to the 162 main class headings. The distribution of these headings is given in Table 1.

Subjects in sample     Excite  Infoseek  Lycos  Yahoo!  Total
Ballet                     38        40     35     108    221
Karate                    104        41     42     191    378
University libraries       27        60     16      84    187
Total                     169       141     93     383    786

Table 1. Distribution of class headings in subject hierarchies

3. Concepts in the Main Classes of Search Engine Directories

Table 2 contains a list of 17 concepts occurring in the main class headings of five or more (i.e. 50%+) of the ten search engines investigated, with an indication of which terms were found in which engines. These terms can be regarded as representing the most popular subject areas about which Internet users require information, according to the editors of the directories.
Terms in class headings        Engines with the term (• = one engine)
Arts/Fine arts                 • • • • •
Autos/Automotive               • • • • • •
Business                       • • • • • • • • •
Computers/Computing            • • • • • • • • • •
Education                      • • • • • • • • •
Entertainment                  • • • • • • • • •
Family                         • • • • •
Finance/Money                  • • • • • •
Geographic areas               • • • • • •
Health                         • • • • • • • •
Internet                       • • • • • •
Lifestyle/Living/Good life     • • • • • • •
News/Media                     • • • • • • • • •
People                         • • • • • •
Shopping/Marketplace           • • • • • • • • •
Sport(s)                       • • • • • • • • • •
Travel                         • • • • • • • •

Table 2. Popular subjects in main class headings of Internet search engines (Engines: AOL.com, Excite, Infoseek, LS, Lycos, Mag, NW, Snap, WC, Yah. Abbreviations: LS=LookSmart, Mag=Magellan, NW=Nerd World, WC=WebCrawler, Yah=Yahoo!)

In addition to the terms in Table 2, the following 16 terms were found in the main classes of at least two of the search engine directories (the number of directories is indicated in brackets): Careers (4), Home (4), (For) Kids (4), Recreation (4), Reference (4), Science (4), Chat (3), Games (3), Government (3), Investing/Investment (3), Real estate (3), Society (3), Fitness (2), Hobbies (2), Humanities (2) and Social sciences (2). A further 18 subject terms, not enumerated here, each occur in only one of the directories at main class level.

In some cases the actual headings consist of combinations of terms, such as Business and Investing (Excite), Kids & Family (Infoseek) and Recreation and Sports (Yahoo!). Such headings were analysed into their constituent concepts. Different grammatical forms of the same concept, synonyms and near synonyms were grouped together. One of the headings in the table, namely Geographic areas, does not actually appear in any of the main classes, but was formulated by the researcher to cover a number of different headings, namely World, International, Regional and Local, all relating to this conceptual category.
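The analysis of combined headings into constituent concepts, and the counting of directories per concept, can be illustrated with a minimal Python sketch. It is not the procedure actually used in the investigation, which was carried out by manual inspection. The function names, the splitting rule (on "and" and "&") and the small normalisation map are assumptions made for illustration; only the three sample headings are taken from the text above.

```python
import re
from collections import defaultdict

# Hypothetical normalisation map: variant term -> grouped concept label.
# The paper groups grammatical forms and (near) synonyms by hand.
CANONICAL = {
    "sport": "Sport(s)",
    "sports": "Sport(s)",
    "investing": "Investing/Investment",
    "investment": "Investing/Investment",
}


def split_heading(heading):
    """Split a combined heading such as 'Kids & Family' into its terms."""
    parts = re.split(r"\s*(?:&|\band\b)\s*", heading)
    return [p.strip() for p in parts if p.strip()]


def normalise(term):
    """Map a term to its grouped concept label (assumed mapping)."""
    return CANONICAL.get(term.lower(), term.capitalize())


def tally(headings_by_engine):
    """Return a mapping: concept -> set of engines whose headings use it."""
    engines_by_concept = defaultdict(set)
    for engine, headings in headings_by_engine.items():
        for heading in headings:
            for term in split_heading(heading):
                engines_by_concept[normalise(term)].add(engine)
    return engines_by_concept


def popular(engines_by_concept, min_engines=5):
    """Concepts found in at least min_engines directories (5 = 50% of ten)."""
    return sorted(
        concept
        for concept, engines in engines_by_concept.items()
        if len(engines) >= min_engines
    )


# Only the three combined headings quoted in the text are used here.
sample = {
    "Excite": ["Business and Investing"],
    "Infoseek": ["Kids & Family"],
    "Yahoo!": ["Recreation and Sports"],
}
counts = tally(sample)
for concept in sorted(counts):
    print(concept, sorted(counts[concept]))
```

With the full set of 162 main class headings as input, `popular(counts)` would select the concepts that meet the five-engine threshold used for Table 2. A heading such as Geographic areas, which was formulated by the researcher to cover World, International, Regional and Local, would need additional entries in the normalisation map.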
It is interesting to note that some of the terms in Table 2, namely Health, Computers, Finance, Travel and Business, and one of the lesser used terms, namely Fitness (which is closely associated with Health), also occur on a list of the most popular subjects about which the American public search for information, according to the 1996 End User Information Needs Study, cosponsored by the Library Journal and the infotech company UMI (Oder, 1997, S4). This indicates that the search engines seem to be
