EnviroInfo 2001: Sustainability in the Information Society Copyright 2001 Metropolis Verlag, Marburg, ISBN: 3-89518-370-9

Evaluation of Search Engines Concerning Environmental Terms

Kristina Voigt1 and Gerhard Welzl

1. Evaluation of Search Engines

Ecometrical and chemometrical methods are essential in the environmental sciences in general and particularly important in environmental information management. The main topics in the discipline of environmental informatics are data analysis and Geographical Information Systems (GIS) [Rautenstrauch 2000]. Scientific information is increasingly buried under the proliferation of commercial sites on the Internet. The aim is therefore to select the right search engine(s) or other user-supporting Internet resources, e.g. metadatabases, in order to find scientific environmental information. Different types of search engines are available, e.g. manually indexed search engines, robot-indexed search engines, meta search engines and context-specific search engines. The evaluation of search engines is currently a very active topic in information science [Harter 1997]. Most evaluation methods use the well-established measures of recall and precision. Our initial step was the selection of search engines for environmental and chemical questions. The effectiveness of these search engines was then evaluated with respect to environmental and chemical evaluation criteria. The evaluation terms are "environmental informatics", "uranium" and "ecological farming". These three terms are evaluated by the following criteria:

• Number of hits
• Relevance of hits (only the first 10 hits are considered)
• Scientific hits or newspaper articles (only the first 10 hits are considered)
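The relevance criterion corresponds to a precision measured on the first 10 hits. A minimal sketch, assuming each of the first 10 hits has been judged by hand as relevant or not; the function name and the judgments are hypothetical and not taken from the study:

```python
def precision_at_k(judgments, k=10):
    """Fraction of the first k hits that were judged relevant (True)."""
    top = judgments[:k]
    if not top:
        return 0.0
    return sum(1 for judged_relevant in top if judged_relevant) / len(top)


# Hypothetical manual judgments for the first 10 hits of one query.
hits = [True, True, False, True, True, True, False, True, True, True]
print(precision_at_k(hits))
```

Recall is not sketched here because it requires the total number of relevant documents, which is generally unknown for Web search.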

1 Dr. Kristina Voigt, Dr. Gerhard Welzl, GSF - Forschungszentrum für Umwelt und Gesundheit, Institut für Biomathematik und Biometrie, Ingolstädter Landstr. 1, D-85764 Neuherberg, e-mail: [email protected], [email protected], Internet: http://www.gsf.de/institute/ibb/voigt

Table 1: 21 search engines (7 common, 7 meta, 7 environmental), 9 evaluation criteria

Search Engine        Abbr.  Einnum  Einrel  Einsci  Uranum  Urarel  Urasci  Farnum  Farrel  Farsci
AltaVista            ALT    2       1       1       2       2       2       2       2       1
Google               GOO    2       2       2       2       1       1       2       2       1
Fast Search          FAS    2       1       2       2       2       1       1       2       0
NorthernLight        NOR    1       2       2       2       2       1       1       2       1
HotBot               HOT    1       2       2       1       2       1       1       2       1
Excite               EXC    2       2       2       1       2       1       2       2       1
Yahoo                YAH    1       2       2       1       2       2       0       2       2
Ixquick              IXQ    2       1       1       1       1       1       1       2       2
MetaCrawler          MET    0       1       1       1       2       2       1       2       1
ProFusion            PRO    1       0       2       1       1       0       1       2       0
MetaGer              MGE    1       2       2       1       2       0       1       2       2
MetaIQ               MIQ    1       1       1       1       2       1       1       1       0
Search.com           SEA    0       2       1       0       2       2       0       2       1
Multimeta            MUL    1       0       0       1       2       1       1       2       1
GEIN                 GEI    1       1       2       1       1       2       1       2       2
US-EPA               EPA    0       0       0       0       2       2       0       1       1
US-FDA               FDA    1       0       2       0       2       2       1       2       2
Umweltbundesamt      UBA    0       0       0       0       0       0       0       2       1
European Env. Agen.  EEA    0       1       1       0       0       0       0       1       1
NIOSH                NIO    0       0       0       1       2       2       0       0       0
Min. Env. DK         EDK    0       0       0       0       0       0       0       1       1

The following 9 criteria are taken into account:

• Einnum: Environmental informatics, number of hits
• Einrel: Environmental informatics, relevance of hits
• Einsci: Environmental informatics, scientific hits
• Uranum: Uranium, number of hits
• Urarel: Uranium, relevance of hits
• Urasci: Uranium, scientific hits
• Farnum: Ecological farming, number of hits
• Farrel: Ecological farming, relevance of hits
• Farsci: Ecological farming, scientific hits

For pragmatic reasons we chose a very simple scoring approach using only 0 (bad), 1 (medium), 2 (good). Table 1 lists the data-matrix of 21 objects (search engines) and 9 attributes (the evaluation criteria given above) together with their scores. Every term is assessed in three categories: number of hits retrieved by the search engine, relevance of hits (first 10), and scientific nature of hits (first 10). The data-matrix is analyzed using the following methods.

2. Different Multi-Criteria Analysis Methods

In the fields of environmental and chemical questions a large amount of data as well as complex processes and structures have to be analyzed. A variety of multivariate statistical methods (e.g. Principal Component Analysis, Multidimensional Scaling, POSAC - Partially Ordered Scalogram Analysis with Coordinates) and methods from the field of discrete mathematics (e.g. Hasse Diagram Technique) are applied to investigate environmental and chemical data on the Internet. These methods have already been used to investigate environmental and chemical resources [Voigt 1998], [Voigt 2000].

The basis of the Hasse diagram technique is the assumption that a ranking can be performed while avoiding the use of an ordering index. Hasse diagrams are extremely useful if several criteria are given to decide which objects are priority objects. Hasse diagrams visualize the order relations within partially ordered sets (posets). Two objects x, y of a poset are ordered (x ≤ y) if all scores of x are less than or equal to those of y; if neither x ≤ y nor y ≤ x holds, the two objects are incomparable. Hasse diagrams are oriented graphs (acyclic digraphs). A digraph consists of a set E of objects, drawn as small circles in Hasse diagrams, connected by lines that represent the order relations. In our applications the circles near the top of the diagram indicate the "better" objects according to the criteria used to rank them: these objects have no predecessors (they are not "covered" by other objects) but do have successors, and are called maximal objects. Objects found in the lowest part of the diagram, which are connected by lines only in the upward direction, are called minimal objects. The main theoretical articles on the development of the partial order ranking methods are those by Halfon and Brüggemann [Halfon 1998] and [Brüggemann 2000].

In multivariate statistical analysis the emphasis lies on scaling or ordering variables or objects, and several strategies are known for this. Here we apply the Partially Ordered Scalogram Analysis with Coordinates (POSAC) method, which is a module of the program package Systat 10 [SPSS Science 2001] under the feature "statistics, data reduction". The POSAC method reduces the data-matrix by plotting it in a two-dimensional space. Part of the information (reported as a percentage by the program) is lost by this method. In POSAC, order relations (comparability as well as incomparability of the structuples) are considered the essential empirical-substantive aspect of the data to be preserved in the data analysis [Borg 1995].
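The order relation and the lines of a Hasse diagram can be illustrated on a few rows of Table 1. This is a minimal sketch, not the software used by the authors; the helper names `leq`, `below` and `cover_edges` are hypothetical. A line is drawn from a lower to an upper object only when no third object lies strictly between them.

```python
from itertools import permutations

# Four rows of Table 1 (scores for the 9 criteria, in table order).
profiles = {
    "ALT": (2, 1, 1, 2, 2, 2, 2, 2, 1),
    "MET": (0, 1, 1, 1, 2, 2, 1, 2, 1),
    "EPA": (0, 0, 0, 0, 2, 2, 0, 1, 1),
    "NIO": (0, 0, 0, 1, 2, 2, 0, 0, 0),
}


def leq(x, y):
    """x <= y in the product order: every score of x is <= that of y."""
    return all(a <= b for a, b in zip(x, y))


def below(a, b):
    """Object a lies strictly below object b."""
    return profiles[a] != profiles[b] and leq(profiles[a], profiles[b])


def cover_edges():
    """Pairs (lower, upper) with no third object strictly in between."""
    edges = []
    for a, b in permutations(profiles, 2):
        if below(a, b) and not any(below(a, c) and below(c, b) for c in profiles):
            edges.append((a, b))
    return sorted(edges)


print(cover_edges())
```

In this subset EPA and NIO are incomparable: EPA scores higher on Farrel and Farsci, NIO scores higher on Uranum. Both lie below MET, which in turn lies below ALT.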

3. Results of the data-analyses

3.1 Application of Hasse diagram technique

The Hasse diagram technique is applied in the first step of the data-analysis. The resulting diagram is given in Figure 1. It shows 5 levels, with 10 objects in the largest level. There are 10 maximal objects and 4 minimal objects. Most of the common search engines, namely AltaVista, Google, FastSearch, NorthernLight, Excite and Yahoo, are maximal objects. The meta search engines Ixquick and MetaGer are also maximal objects. The environmental-specific search engines GEIN (German Environmental Information Network) and US-FDA (Food and Drug Administration) are likewise found in the maximal position of the Hasse diagram. The diagram shows that some maximal objects, e.g. FAS (FastSearch) and GOO (Google), have only a few successors, whereas other maximal objects, e.g. EXC (Excite) and ALT (AltaVista), have several successors. In the minimal position are the meta search engines ProFusion and MetaIQ as well as the context-specific search engines NIOSH (National Institute for Occupational Safety and Health) and Ministry of the Environment Denmark.
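The maximal and minimal objects can be recomputed directly from the scores printed in Table 1. The following is a minimal sketch under two assumptions: the unnamed EXC row belongs to Excite, and the number of levels equals the number of objects in the longest chain. The names `TABLE`, `below` and `chain_length` are hypothetical.

```python
from functools import lru_cache

# Table 1: Einnum, Einrel, Einsci, Uranum, Urarel, Urasci, Farnum, Farrel, Farsci
TABLE = {
    "ALT": (2, 1, 1, 2, 2, 2, 2, 2, 1), "GOO": (2, 2, 2, 2, 1, 1, 2, 2, 1),
    "FAS": (2, 1, 2, 2, 2, 1, 1, 2, 0), "NOR": (1, 2, 2, 2, 2, 1, 1, 2, 1),
    "HOT": (1, 2, 2, 1, 2, 1, 1, 2, 1), "EXC": (2, 2, 2, 1, 2, 1, 2, 2, 1),
    "YAH": (1, 2, 2, 1, 2, 2, 0, 2, 2), "IXQ": (2, 1, 1, 1, 1, 1, 1, 2, 2),
    "MET": (0, 1, 1, 1, 2, 2, 1, 2, 1), "PRO": (1, 0, 2, 1, 1, 0, 1, 2, 0),
    "MGE": (1, 2, 2, 1, 2, 0, 1, 2, 2), "MIQ": (1, 1, 1, 1, 2, 1, 1, 1, 0),
    "SEA": (0, 2, 1, 0, 2, 2, 0, 2, 1), "MUL": (1, 0, 0, 1, 2, 1, 1, 2, 1),
    "GEI": (1, 1, 2, 1, 1, 2, 1, 2, 2), "EPA": (0, 0, 0, 0, 2, 2, 0, 1, 1),
    "FDA": (1, 0, 2, 0, 2, 2, 1, 2, 2), "UBA": (0, 0, 0, 0, 0, 0, 0, 2, 1),
    "EEA": (0, 1, 1, 0, 0, 0, 0, 1, 1), "NIO": (0, 0, 0, 1, 2, 2, 0, 0, 0),
    "EDK": (0, 0, 0, 0, 0, 0, 0, 1, 1),
}


def below(a, b):
    """Object a lies strictly below b: all scores <= and the profiles differ."""
    x, y = TABLE[a], TABLE[b]
    return x != y and all(s <= t for s, t in zip(x, y))


# Maximal objects have nothing above them, minimal objects nothing below them.
maximal = sorted(a for a in TABLE if not any(below(a, b) for b in TABLE))
minimal = sorted(a for a in TABLE if not any(below(b, a) for b in TABLE))


@lru_cache(maxsize=None)
def chain_length(a):
    """Number of objects in the longest chain running downward from a."""
    lower = [b for b in TABLE if below(b, a)]
    return 1 + max((chain_length(b) for b in lower), default=0)


levels = max(chain_length(a) for a in TABLE)

print(len(maximal), maximal)
print(len(minimal), minimal)
print(levels)
```

With the scores as printed, the sketch yields 10 maximal objects, 4 minimal objects and a longest chain of 5 objects (for example NOR, HOT, MUL, UBA, EDK), which agrees with the counts reported in the text.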

Figure 1: Hasse diagram for 21 search engines evaluated by 9 environmental criteria (left) and 2 latent order variables (right)

3.2 Application of the POSAC method and combination with HDT

The same 21 x 9 data-matrix is now analyzed applying the POSAC method described above. POSAC reduces the data-matrix to a two-dimensional space. The two dimensions are called latent order variables (LOV). In this example 75.5 % of profile pairs are correctly represented. In the next step of the evaluation procedure we calculate a Hasse diagram for the two latent order variables found by the POSAC method. The result of this analysis is given in Figure 2. This Hasse diagram represents only 76 % of the original diagram given in Figure 1 correctly. The left side of the diagram is characterized by high values for LOV1. These are essentially the standard search engines like Google, NorthernLight, FastSearch, and AltaVista. The exception is Yahoo, which is a manually indexed search engine; this fact can explain its special position. The objects found on the right-hand side are characterized by high values for LOV2. These are mainly environmental-specific search engines, e.g. GEIN, EPA, UBA, FDA. Some meta search engines are also found in this area of the diagram.
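The share of correctly represented profile pairs can be illustrated as follows. This is a minimal sketch, assuming that a pair counts as correctly represented when its order relation (below, above, equal or incomparable) is the same for the original profiles and for the two-dimensional coordinates. The criterion implemented in Systat may differ in detail, and the coordinates below are invented for illustration; they are not POSAC output.

```python
from itertools import combinations


def relation(x, y):
    """Return '<', '>', '=' or '||' (incomparable) under the product order."""
    le = all(a <= b for a, b in zip(x, y))
    ge = all(a >= b for a, b in zip(x, y))
    if le and ge:
        return "="
    if le:
        return "<"
    if ge:
        return ">"
    return "||"


def share_preserved(profiles, coords):
    """Fraction of object pairs whose order relation survives the reduction."""
    pairs = list(combinations(profiles, 2))
    kept = sum(
        1 for a, b in pairs
        if relation(profiles[a], profiles[b]) == relation(coords[a], coords[b])
    )
    return kept / len(pairs)


# Four rows of Table 1 (9 criteria each).
profiles = {
    "ALT": (2, 1, 1, 2, 2, 2, 2, 2, 1),
    "MET": (0, 1, 1, 1, 2, 2, 1, 2, 1),
    "EPA": (0, 0, 0, 0, 2, 2, 0, 1, 1),
    "NIO": (0, 0, 0, 1, 2, 2, 0, 0, 0),
}

# Hypothetical (LOV1, LOV2) coordinates, invented for this illustration only.
coords = {"ALT": (4, 4), "MET": (3, 3), "EPA": (1, 2), "NIO": (2, 2)}

print(share_preserved(profiles, coords))
```

In this toy case five of the six pairs keep their relation. EPA and NIO are incomparable in the 9 criteria but become ordered in the invented coordinates, which is the kind of information loss the percentage quantifies.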

Figure 2: Hasse diagram for 21 objects and 2 variables (latent order variables)

In order to better understand the influence of the attributes on the whole analysis, we perform a correlation analysis between the attributes and the two latent order variables (LOV) given in the POSAC plot. This is achieved by applying an analysis of variance (ANOVA). ANOVA demonstrates that LOV1 is essentially described by Uranum (Uranium, number of hits), whereas LOV2 is dominated by Farsci (Ecological farming, scientific hits). The so-called polar item for LOV1 is Uranum, the polar item for LOV2 is Farsci. Apparently, general environmental criteria are represented by Uranum, specific environmental criteria by Farsci. This explains why LOV1 is supported by standard search engines, whereas LOV2 is mainly supported by environmental-specific and some meta search engines.

4. Recommendations and Outlook

Generally it can be stated that the common search engines performed better than the meta search engines and the environmental-specific search engines. Only a few non-relevant hits were detected, e.g. the Uranium records company. Most hits in the general search engines were relevant. Most of the general search engines found the announcement for this conference, the UI 2001. MetaGer was the only meta search engine which listed the UI 2001 site among the first 10 hits. None of the environmental-specific search engines listed the UI site among the first 10 results.

The data analysis of Web facilities such as Web search engines with environmental terms is essential for drawing conclusions from the information resources available on the Internet. A special focus is placed on the multi-criteria analysis of the data matrices. The results of this analysis indicate which search engine to use for answering specific environmental and chemical questions. With respect to the attributes, it was found that the evaluation criteria "Uranum: Uranium, number of hits" and "Farsci: Ecological farming, scientific hits" play an important role in the analysis. From the methodological point of view, the authors consider the combination of the Hasse diagram technique and explorative statistical methods to be a very promising approach to future tasks in chemometrics and ecometrics.

Bibliography

Borg I., Shye S. (1995): Facet Theory, Form and Content, Sage Publications, Thousand Oaks, pp. 111-112.

Brüggemann R., Halfon E. (2000): Introduction to the General Principles of the Partial Order Ranking Theory, in: Sorensen P.B. et al., Order Theoretical Tools in Environmental Sciences, Proceedings of the Second Workshop October 21st, 1999 held in Roskilde, Denmark, NERI – Technical Report No. 318, pp. 7-44, National Environmental Research Institute, Roskilde, Denmark.

Halfon E., Brüggemann R. (1998): On Ranking Chemicals for Environmental Hazard. Comparison and Methodologies, in: Institute of Freshwater Ecology and Inland Fisheries, Proceedings of the Workshop on Order Theoretical Tools in Environmental Sciences held on November 16th, 1998 in Berlin, Berichte des IGB 1998, Heft 6, Sonderheft I, pp. 11-48, Institut für Gewässerökologie und Binnenfischerei, Berlin.

Harter S.P., Hert C.A. (1997): Evaluation of Information Retrieval Systems: Approaches, Issues, and Methods. Annual Review of Information Science and Technology, 32, 3-79.

Lawrence S., Giles C.L. (1999): Accessibility of Information on the Web. Nature, 400, 107-109.

Rautenstrauch C. (2000): Ein Schnappschuss der internationalen Umweltinformatik-Szene, in: Cremers A.B., Greve K., Umweltinformatik '00, Computer Science for Environmental Protection, Environmental Information for Planning, Politics and the Public, pp. 476-479, Metropolis-Verlag, Marburg.

SPSS Science (2001): Systat 10. http://www.spssscience.com/SYSTAT/index.html.

Voigt K. (1998): Environmental Information Databases, in: Schleyer P. v. R., Allinger N.L., Clark T., Gasteiger J., Kollman P.A., Schaefer III H.F., Schreiner P.R. (Eds.), The Encyclopedia of Computational Chemistry, John Wiley & Sons, Chichester, pp. 941-952.

Voigt K., Welzl G., Benz J. (2000b): Environmental and Chemical Sources on the Internet: Availability and Evaluation Approach, in: 21st Annual National Online Meeting, New York, 16.-18.05.2000, Proceedings 2000 (Williams M.E., Information Today Inc., Medford, NJ), pp. 447-460.
