Evaluation of Search Engines Concerning Environmental Terms

EnviroInfo 2001: Sustainability in the Information Society
Copyright 2001 Metropolis Verlag, Marburg, ISBN: 3-89518-370-9

Kristina Voigt¹ and Gerhard Welzl

1. Evaluation of Search Engines

Ecometrical and chemometrical methods are essential in environmental sciences in general and particularly important in environmental information management. The main topics in the discipline of environmental informatics are data analysis and Geographical Information Systems (GIS) [Rautenstrauch 2000]. Scientific information is increasingly buried under the proliferation of commercial sites on the Internet. The aim is to select the right search engine(s) or other user-supporting Internet resources, e.g. metadatabases, in order to find scientific environmental information. Different types of search engines are available, e.g. manually-indexed search engines, robot-indexed search engines, meta search engines and context-specific search engines.

The evaluation of search engines is currently an active topic in information science [Harter 1997]. Most evaluation methods use the well-established measures of recall and precision. Our initial step was the selection of search engines for environmental and chemical questions. We then evaluate the effectiveness of these search engines with respect to environmental and chemical evaluation criteria. The evaluation terms are "environmental informatics", "uranium" and "ecological farming". These three terms are evaluated by the following criteria:

• Number of hits
• Relevance of hits (only the first 10 hits are considered)
• Scientific hits or newspaper articles (only the first 10 hits are considered)

¹ Dr. Kristina Voigt, Dr. Gerhard Welzl, GSF - Forschungszentrum für Umwelt und Gesundheit, Institut für Biomathematik und Biometrie, Ingolstädter Landstr. 1, D-85764 Neuherberg, e-mail: [email protected], [email protected], Internet: http://www.gsf.de/institute/ibb/voigt

Table 1: 21 search engines (7 common, 7 meta, 7 environmental), 9 evaluation criteria

Search Engine        Abbr.  Einnum  Einrel  Einsci  Uranum  Urarel  Urasci  Farnum  Farrel  Farsci
AltaVista            ALT    2       1       1       2       2       2       2       2       1
Google               GOO    2       2       2       2       1       1       2       2       1
FastSearch           FAS    2       1       2       2       2       1       1       2       0
NorthernLight        NOR    1       2       2       2       2       1       1       2       1
HotBot               HOT    1       2       2       1       2       1       1       2       1
Excite               EXC    2       2       2       1       2       1       2       2       1
Yahoo                YAH    1       2       2       1       2       2       0       2       2
Ixquick              IXQ    2       1       1       1       1       1       1       2       2
MetaCrawler          MET    0       1       1       1       2       2       1       2       1
ProFusion            PRO    1       0       2       1       1       0       1       2       0
MetaGer              MGE    1       2       2       1       2       0       1       2       2
MetaIQ               MIQ    1       1       1       1       2       1       1       1       0
Search.com           SEA    0       2       1       0       2       2       0       2       1
Multimeta            MUL    1       0       0       1       2       1       1       2       1
GEIN                 GEI    1       1       2       1       1       2       1       2       2
US-EPA               EPA    0       0       0       0       2       2       0       1       1
US-FDA               FDA    1       0       2       0       2       2       1       2       2
Umweltbundesamt      UBA    0       0       0       0       0       0       0       2       1
European Env. Agen.  EEA    0       1       1       0       0       0       0       1       1
NIOSH                NIO    0       0       0       1       2       2       0       0       0
Min. Env. DK         EDK    0       0       0       0       0       0       0       1       1

The following 9 criteria are taken into account:

• Einnum: Environmental informatics, number of hits
• Einrel: Environmental informatics, relevance of hits
• Einsci: Environmental informatics, scientific hits
• Uranum: Uranium, number of hits
• Urarel: Uranium, relevance of hits
• Urasci: Uranium, scientific hits
• Farnum: Ecological farming, number of hits
• Farrel: Ecological farming, relevance of hits
• Farsci: Ecological farming, scientific hits

For pragmatic reasons we chose a very simple scoring approach using only 0 (bad), 1 (middle) and 2 (good). Table 1 lists the data-matrix of 21 objects (search engines) and 9 attributes (the evaluation criteria given above) together with their scores. Every search term is scored in three categories: number of hits retrieved by the search engine, relevance of the first 10 hits, and scientific character of the first 10 hits. The data-matrix is analyzed using the following methods.

2. Different Multi-Criteria Analysis Methods

Environmental and chemical questions involve large amounts of data as well as complex processes and structures that have to be analyzed. A variety of multivariate statistical methods (e.g. Principal Component Analysis, Multidimensional Scaling, POSAC - Partially Ordered Scalogram Analysis with Coordinates) and methods from the field of discrete mathematics (e.g. Hasse Diagram Technique) are applied to investigate the environmental and chemical data on the Internet. These methods have already been used to investigate environmental and chemical resources [Voigt 1998], [Voigt 2000].

The basis of the Hasse diagram technique is the assumption that a ranking can be performed without the use of an ordering index. Hasse diagrams are extremely useful if several criteria are given to decide which objects are priority objects. Hasse diagrams visualize the order relations within partially ordered sets (posets). Two objects x, y of a poset are ordered if all scores of x are less than or equal to those of y. Hasse diagrams are oriented graphs (acyclic digraphs). A digraph consists of a set E of objects, drawn as small circles in Hasse diagrams. In our applications the circles near the top of the Hasse diagram indicate the objects that are "better" according to the criteria used to rank them: these objects have no predecessors (they are not "covered" by other objects) but do have successors, and they are called maximal objects. Objects found in the lowest part of the diagram, which are connected by lines in the upward direction only, are called minimal objects. The main theoretical articles on the development of the partial order ranking methods are those by Halfon and Brüggemann [Halfon 1998] and [Brüggemann 2000].

In multivariate statistical analysis the emphasis lies on scaling or ordering variables or objects. Several strategies are known for this purpose.
Here we apply the Partially Ordered Scalogram Analysis with Coordinates (POSAC) method, which is a module of the program package Systat 10 [SPSS Science 2001], found there under statistics, data reduction. The POSAC method reduces the data-matrix by plotting it in a two-dimensional space. Part of the information is lost by this reduction; the program reports the share that is preserved as a percentage. In POSAC, order relations (comparability as well as incomparability of the structuples) are considered the essential empirical-substantive aspect of the data to be preserved in the data analysis [Borg 1995].

3. Results of the data-analyses

3.1 Application of Hasse diagram technique

The Hasse diagram technique is applied in the first step of the data-analysis. The resulting diagram is given in Figure 1. It shows 5 levels with 10 objects in the largest level. There are 10 maximal objects and 4 minimal objects. Most of the common search engines, namely AltaVista, Google, FastSearch, NorthernLight, Excite and Yahoo, are maximal objects. The meta search engines Ixquick and MetaGer are also maximal objects. The environmental-specific search engines GEIN (German Environmental Information Network) and US-FDA (Food and Drug Administration) are likewise found in the maximal position of the Hasse diagram. The diagram shows that some maximal objects, e.g. FAS (FastSearch) and GOO (Google), have only a few successors, whereas other maximal objects, e.g. EXC (Excite) and ALT (AltaVista), have several successors. The minimal position is occupied by the meta search engines ProFusion and MetaIQ as well as the context-specific search engines NIOSH (National Institute for Occupational Safety and Health) and Ministry of the Environment Denmark.
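The maximal and minimal objects reported above can be checked directly against the scores of Table 1. The following is a minimal sketch in Python, not the software used for the analysis; the names SCORES, dominated_by, maximal_objects, minimal_objects and successors are hypothetical helpers introduced here for illustration. It assumes the component-wise order described in Section 2 (x is below y if every score of x is less than or equal to the corresponding score of y).

```python
# Scores from Table 1, column order:
# Einnum, Einrel, Einsci, Uranum, Urarel, Urasci, Farnum, Farrel, Farsci
SCORES = {
    "ALT": (2, 1, 1, 2, 2, 2, 2, 2, 1),
    "GOO": (2, 2, 2, 2, 1, 1, 2, 2, 1),
    "FAS": (2, 1, 2, 2, 2, 1, 1, 2, 0),
    "NOR": (1, 2, 2, 2, 2, 1, 1, 2, 1),
    "HOT": (1, 2, 2, 1, 2, 1, 1, 2, 1),
    "EXC": (2, 2, 2, 1, 2, 1, 2, 2, 1),
    "YAH": (1, 2, 2, 1, 2, 2, 0, 2, 2),
    "IXQ": (2, 1, 1, 1, 1, 1, 1, 2, 2),
    "MET": (0, 1, 1, 1, 2, 2, 1, 2, 1),
    "PRO": (1, 0, 2, 1, 1, 0, 1, 2, 0),
    "MGE": (1, 2, 2, 1, 2, 0, 1, 2, 2),
    "MIQ": (1, 1, 1, 1, 2, 1, 1, 1, 0),
    "SEA": (0, 2, 1, 0, 2, 2, 0, 2, 1),
    "MUL": (1, 0, 0, 1, 2, 1, 1, 2, 1),
    "GEI": (1, 1, 2, 1, 1, 2, 1, 2, 2),
    "EPA": (0, 0, 0, 0, 2, 2, 0, 1, 1),
    "FDA": (1, 0, 2, 0, 2, 2, 1, 2, 2),
    "UBA": (0, 0, 0, 0, 0, 0, 0, 2, 1),
    "EEA": (0, 1, 1, 0, 0, 0, 0, 1, 1),
    "NIO": (0, 0, 0, 1, 2, 2, 0, 0, 0),
    "EDK": (0, 0, 0, 0, 0, 0, 0, 1, 1),
}


def dominated_by(x, y):
    """True if profile x lies strictly below profile y in the component-wise order."""
    return x != y and all(a <= b for a, b in zip(x, y))


def maximal_objects(scores):
    """Objects that no other object dominates."""
    return sorted(
        name for name, x in scores.items()
        if not any(dominated_by(x, y) for y in scores.values())
    )


def minimal_objects(scores):
    """Objects that dominate no other object."""
    return sorted(
        name for name, x in scores.items()
        if not any(dominated_by(y, x) for y in scores.values())
    )


def successors(name, scores):
    """All objects lying strictly below the given object."""
    return sorted(
        other for other, y in scores.items()
        if dominated_by(y, scores[name])
    )


if __name__ == "__main__":
    print("maximal:", maximal_objects(SCORES))
    print("minimal:", minimal_objects(SCORES))
    print("successors of FAS:", successors("FAS", SCORES))
```

Under these assumptions the sketch yields the 10 maximal objects (ALT, EXC, FAS, FDA, GEI, GOO, IXQ, MGE, NOR, YAH) and the 4 minimal objects (EDK, MIQ, NIO, PRO) named in the text. It does not draw the diagram or compute its levels.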
Figure 1: Hasse diagram for 21 search engines evaluated by 9 environmental criteria (left) and 2 latent order variables (right)

3.2 Application of the POSAC method and combination with HDT

The same 21 x 9 data-matrix is now analyzed with the POSAC method described above. POSAC reduces the data-matrix to a two-dimensional space. The two dimensions are called latent order variables (LOV). In this example 75.5 % of the profile pairs are correctly represented.

In the next step of the evaluation procedure we calculate a Hasse diagram for the two latent order variables found by the POSAC method. The result of this analysis is given in Figure 2. This Hasse diagram represents only 76 % of the original diagram given in Figure 1 correctly. The left side of the diagram is characterized by high values of LOV1. These are essentially the standard search engines like Google, NorthernLight, FastSearch and AltaVista. The exception is Yahoo, which is a manually-indexed search engine; this fact can explain its special position. The objects on the right-hand side are characterized by high values of LOV2. These are mainly environmental-specific search engines, e.g. GEIN, EPA, UBA and FDA. Some meta search engines are also found in this area of the diagram.

Figure 2: Hasse diagram of 21 objects, 2 variables (latent order variables)

To better understand the influence of the attributes on the whole analysis, we examine how the two latent order variables (LOV) given in the POSAC plot relate to the original attributes. This is achieved by applying an analysis of variance (ANOVA). ANOVA demonstrates that LOV1 is essentially described by Uranum (Uranium, number of hits), whereas LOV2 is dominated by Farsci (Ecological farming, scientific hits).
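The idea of "correctly represented profile pairs" can be illustrated with a small sketch in Python. It is only an illustration under stated assumptions: the exact criterion computed by the POSAC module in Systat may differ, the function names relation and share_preserved are hypothetical, and the four objects with their profiles and two-dimensional coordinates are invented toy values, not results from this study. A pair counts as preserved if its order relation (below, above, equal or incomparable) is the same in the full profiles and in the two coordinates.

```python
from itertools import combinations


def relation(x, y):
    """Order relation of two score tuples under the component-wise order.

    Returns '<', '>', '=' or '||' (incomparable).
    """
    below = all(a <= b for a, b in zip(x, y))
    above = all(a >= b for a, b in zip(x, y))
    if below and above:
        return "="
    if below:
        return "<"
    if above:
        return ">"
    return "||"


def share_preserved(profiles, coords):
    """Fraction of object pairs whose relation is identical in both representations."""
    pairs = list(combinations(sorted(profiles), 2))
    kept = sum(
        relation(profiles[a], profiles[b]) == relation(coords[a], coords[b])
        for a, b in pairs
    )
    return kept / len(pairs)


# Toy data (hypothetical): three criteria per object and an invented 2-D embedding.
toy_profiles = {"A": (2, 2, 1), "B": (1, 2, 0), "C": (0, 1, 2), "D": (0, 0, 0)}
toy_coords = {"A": (3, 3), "B": (2, 1), "C": (1, 2), "D": (0, 0)}

if __name__ == "__main__":
    # A and C are incomparable in the profiles but ordered in the embedding,
    # so one of the six pairs is misrepresented.
    print(share_preserved(toy_profiles, toy_coords))
```

In the toy data five of the six pairs keep their relation, so the two-dimensional picture loses part of the order information, which is the kind of loss the percentages above quantify.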